Ingo Molnar wrote:
I'm pleased to announce the first release of paravirtualized KVM (Linux
under Linux), which includes support for the hardware cr3-cache feature
of Intel VMX CPUs. (This feature speeds up context switches and TLB
flushes.)

The patch is against 2.6.20-rc3 + KVM trunk and can be found at:

http://redhat.com/~mingo/kvm-paravirt-patches/

Some aspects of the code are still a bit ad-hoc and incomplete, but the
code is stable enough in my testing and I'd like to get some feedback.
First, some numbers.

2-task context-switch performance (in microseconds; lower is better):

   native:                  1.11
   ----------------------------------
   Qemu:                   61.18
   KVM upstream:           53.01
   KVM trunk:               6.36
   KVM trunk+paravirt/cr3:  1.60
I.e., 2-task context-switch performance is roughly 4x faster than KVM
trunk, and is now quite close to native speed!
Very impressive! The gain probably comes not only from avoiding the
vmentry/vmexit, but also from avoiding the flushing of the global-page
TLB entries.
"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
native: 0.25
----------------------------------
Qemu: 7.8
KVM upstream: 2.8
KVM trunk: 0.55
KVM paravirt/cr3: 0.36
almost twice as fast.
"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
native: 0.9
----------------------------------
Qemu: 35.2
KVM upstream: 9.4
KVM trunk: 2.8
KVM paravirt/cr3: 2.2
still a 30% improvement - which isnt too bad considering that 200 tasks
are context-switching in this workload and the cr3 cache in current CPUs
is only 4 entries.
This is a little too good to be true. Were both runs with the same
KVM_NUM_MMU_PAGES?
I'm also concerned that at this point in time the cr3 optimizations will
only show an improvement in microbenchmarks. In real-life workloads a
context switch is usually preceded by I/O, and with the current sorry
state of kvm I/O, the context-switch time would be dominated by the I/O
time.
The patchset does the following:

- It provides an ad-hoc paravirtualization hypercall API between a Linux
  guest and a Linux host. (This will be replaced with a proper hypercall
  API later on.)

- Using the hypercall API, it utilizes the "cr3 target cache" feature of
  Intel VMX CPUs and extends KVM to make use of that cache. This feature
  makes it possible to avoid expensive VM exits into hypervisor context.
  (The guest needs to be 'aware', and the cache has to be shared between
  the guest and the hypervisor, so fully emulated OSes won't benefit
  from this feature.)
- A few simpler paravirtualization changes are done for Linux guests:
  I/O port delays no longer cause a VM exit (a minimal sketch of this
  change follows the list), the i8259A IRQ controller code got
  simplified (this will be replaced with a proper, hypercall-based and
  host-maintained IRQ controller implementation), and TLB flushes are
  more efficient, because no cr3 reads happen that would otherwise cause
  a VM exit. These changes have a visible effect already: they reduce
  qemu's CPU usage when a guest idles in HLT by about 25% (from ~20%
  CPU usage to 14% CPU usage if an -rt guest has HZ=1000).
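For illustration, here is a minimal sketch of the io-delay idea (the
identifiers are illustrative, not necessarily the patch's): on native
hardware the kernel delays port I/O by writing to port 0x80, but under
KVM that write itself traps to the hypervisor, so a paravirtualized
guest can simply make the delay a no-op:

static void native_io_delay(void)
{
	/* each outb to port 0x80 causes a VM exit under VMX */
	asm volatile("outb %%al, $0x80" : : : "memory");
}

static void kvm_io_delay(void)
{
	/* virtual hardware needs no bus-settling delay: do nothing */
}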
Paravirtualization is triggered via the kvm_paravirt=1 boot option (for
now, this too is ad-hoc). If that option is passed, the KVM guest will
probe for paravirtualization availability on the hypervisor side and
will use it if found, roughly as sketched below. (If the guest does not
find KVM-paravirt support on the hypervisor side, it will continue as a
fully emulated guest.)
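A hypothetical sketch of that probe sequence (kvm_register_para_state()
and the exact handshake are invented for illustration, not the patch's
actual code): parse the boot option, register the shared per-VCPU state
page with the host, and check the version handshake before enabling
paravirt operation:

static int kvm_paravirt;

static int __init kvm_paravirt_setup(char *s)
{
	kvm_paravirt = simple_strtoul(s, NULL, 0);
	return 1;
}
__setup("kvm_paravirt=", kvm_paravirt_setup);

static int __init kvm_guest_probe(void)
{
	struct kvm_vcpu_para_state *state = &get_cpu_var(para_state);

	if (!kvm_paravirt)
		goto out;

	state->guest_version = KVM_PARA_API_VERSION;
	/* hypothetical registration hypercall; fails on non-KVM hosts: */
	if (kvm_register_para_state(__pa(state)) < 0 ||
	    state->host_version < KVM_PARA_API_VERSION)
		kvm_paravirt = 0;	/* fall back to full emulation */
out:
	put_cpu_var(para_state);
	return 0;
}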
Issues: I only tested this on 32-bit VMX. (64-bit should work without
too many changes; the paravirt.c bits can be carried over to 64-bit
almost as-is. But I didn't want to spread the code too wide.)
Comments, suggestions are welcome!
Ingo
Some comments below on the code.
+
+/*
+ * Special, register-to-cr3 instruction based hypercall API
+ * variant to the KVM host. This utilizes the cr3 filter capability
+ * of the hardware - if this works out then no VM exit happens,
+ * if a VM exit happens then KVM will get the virtual address too.
+ */
+static void kvm_write_cr3(unsigned long guest_cr3)
+{
+	struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
+	struct kvm_cr3_cache *cache = &para_state->cr3_cache;
+	int idx;
+
+	/*
+	 * Check the cache (maintained by the host) for a matching
+	 * guest_cr3 => host_cr3 mapping. Use it if found:
+	 */
+	for (idx = 0; idx < cache->max_idx; idx++) {
+		if (cache->entry[idx].guest_cr3 == guest_cr3) {
+			/*
+			 * Cache-hit: we load the cached host-CR3 value.
+			 * This never causes any VM exit. (if it does then the
+			 * hypervisor could do nothing with this instruction
+			 * and the guest OS would be aborted)
+			 */
+			asm volatile("movl %0, %%cr3"
+				: : "r" (cache->entry[idx].host_cr3));
+			goto out;
+		}
+	}
+
+	/*
+	 * Cache-miss. Load the guest-cr3 value into cr3, which will
+	 * cause a VM exit to the hypervisor, which then loads the
+	 * host cr3 value and updates the cr3_cache.
+	 */
+	asm volatile("movl %0, %%cr3" : : "r" (guest_cr3));
+out:
+	put_cpu_var(para_state);
+}
Well, you did say it was ad-hoc. For reference, this is how I see the
hypercall API (a rough sketch in code follows the list):

- A virtual pci device exports a page through the pci rom interface.
  The page contains the hypercall code appropriate for the current cpu.
  This allows migration to work across different cpu vendors.

- In case the pci rom is discovered too late in the boot process, the
  address (gpa) can also be exported via a kvm-specific msr.

- Guest/host communication is by guest physical address, as the
  virtual->physical translation is much cheaper on the guest (__pa() vs
  a page table walk).
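Roughly like this (the MSR index and all function names here are
invented for illustration): the host fills a page with hypercall code
appropriate to the current CPU (e.g. vmcall vs. vmmcall), the guest
locates it via the ROM or the MSR, and calls through it with gpa
arguments:

#define MSR_KVM_HYPERCALL_GPA	0x4b000000	/* hypothetical MSR index */

static void *hypercall_page;

static void __init kvm_map_hypercall_page(void)
{
	u64 gpa;

	rdmsrl(MSR_KVM_HYPERCALL_GPA, gpa);
	/* assumes the host placed the page in identity-mapped lowmem: */
	hypercall_page = __va((unsigned long)gpa);
}

static long kvm_hypercall(unsigned long nr, unsigned long gpa_arg)
{
	long ret;

	/* indirect call into the host-provided, CPU-appropriate stub */
	asm volatile("call *%3"
		     : "=a" (ret)
		     : "0" (nr), "b" (gpa_arg), "r" (hypercall_page)
		     : "memory", "cc");
	return ret;
}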
+
+/*
+ * Simplified i8259A controller handling:
+ */
+static void mask_and_ack_kvm(unsigned int irq)
+{
+	unsigned int irqmask = 1 << irq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&i8259A_lock, flags);
+	cached_irq_mask |= irqmask;
+
+	if (irq & 8) {
+		outb(cached_slave_mask, PIC_SLAVE_IMR);
+		/* 'Specific EOI' to slave: */
+		outb(0x60+(irq&7), PIC_SLAVE_CMD);
+		/* 'Specific EOI' to master-IRQ2: */
+		outb(0x60+PIC_CASCADE_IR, PIC_MASTER_CMD);
+	} else {
+		outb(cached_master_mask, PIC_MASTER_IMR);
+		/* 'Specific EOI' to master: */
+		outb(0x60+irq, PIC_MASTER_CMD);
+	}
+	spin_unlock_irqrestore(&i8259A_lock, flags);
+}
Any reason this can't be applied to mainline? There's probably no
downside to native, and it would benefit all virtualization solutions
equally.
Index: linux/drivers/kvm/kvm.h
===================================================================
--- linux.orig/drivers/kvm/kvm.h
+++ linux/drivers/kvm/kvm.h
@@ -165,7 +165,7 @@ struct kvm_mmu {
 	int root_level;
 	int shadow_root_level;
-	u64 *pae_root;
+	u64 *pae_root[KVM_CR3_CACHE_SIZE];
Hmm - wouldn't it be simpler to have pae_root always point at the
current root? Something like the sketch below:
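(A minimal illustration of the idea; the names beyond kvm.h's are made
up. Keep one root table per cache slot, plus an alias that always names
the current one, so code that only cares about the active root can keep
using pae_root unchanged:)

struct kvm_mmu_roots {
	/* one 4-entry pae root table per cr3-cache slot */
	u64 *pae_roots[KVM_CR3_CACHE_SIZE];
	/* always equal to pae_roots[cr3_cache_idx] */
	u64 *pae_root;
};

static inline void mmu_switch_root(struct kvm_mmu_roots *r, int idx)
{
	/* on a cache-slot switch, only the alias moves */
	r->pae_root = r->pae_roots[idx];
}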
Index: linux/drivers/kvm/mmu.c
===================================================================
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -779,7 +779,7 @@ static int nonpaging_map(struct kvm_vcpu
 static void mmu_free_roots(struct kvm_vcpu *vcpu)
 {
-	int i;
+	int i, j;
 	struct kvm_mmu_page *page;

 #ifdef CONFIG_X86_64
@@ -793,21 +793,40 @@ static void mmu_free_roots(struct kvm_vc
 		return;
 	}
 #endif
-	for (i = 0; i < 4; ++i) {
-		hpa_t root = vcpu->mmu.pae_root[i];
+	/*
+	 * Skip to the next cr3 filter entry and free it (if it's occupied):
+	 */
+	vcpu->cr3_cache_idx++;
+	if (unlikely(vcpu->cr3_cache_idx >= vcpu->cr3_cache_limit))
+		vcpu->cr3_cache_idx = 0;

-		ASSERT(VALID_PAGE(root));
-		root &= PT64_BASE_ADDR_MASK;
-		page = page_header(root);
-		--page->root_count;
-		vcpu->mmu.pae_root[i] = INVALID_PAGE;
+	j = vcpu->cr3_cache_idx;
+	/*
+	 * Clear the guest-visible entry:
+	 */
+	if (vcpu->para_state) {
+		vcpu->para_state->cr3_cache.entry[j].guest_cr3 = 0;
+		vcpu->para_state->cr3_cache.entry[j].host_cr3 = 0;
+	}
+	ASSERT(vcpu->mmu.pae_root[j]);
+	if (VALID_PAGE(vcpu->mmu.pae_root[j][0])) {
+		vcpu->guest_cr3_gpa[j] = INVALID_PAGE;
+		for (i = 0; i < 4; ++i) {
+			hpa_t root = vcpu->mmu.pae_root[j][i];
+
+			ASSERT(VALID_PAGE(root));
+			root &= PT64_BASE_ADDR_MASK;
+			page = page_header(root);
+			--page->root_count;
+			vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
+		}
 	}
 	vcpu->mmu.root_hpa = INVALID_PAGE;
 }
You keep the page directories pinned here. This can be a problem if a
guest frees a page directory, and then starts using it as a regular
page. kvm sometimes chooses not to emulate a write to a guest page
table, but instead to zap it, which is impossible when the page is
freed. You need to either unpin the page when that happens, or add a
hypercall to let kvm know when a page directory is freed.
There is also a minor problem that changes to the pgd aren't caught by
kvm. It doesn't hurt much as this is PV and we can relax the guest/host
contract a little.
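A sketch of the second option (the hypercall number and hook are made
up; this reuses the kvm_hypercall() stub sketched earlier): the guest
notifies the host from its pgd-freeing path, so the host can unpin the
shadow root and is again free to zap the page if the guest recycles it
as regular memory:

#define KVM_HC_RELEASE_PGD	1	/* hypothetical hypercall number */

/* called from the guest's pgd_free() path, before the page is reused */
static void kvm_release_pgd(pgd_t *pgd)
{
	/* tell the host: this gpa no longer holds a page directory */
	kvm_hypercall(KVM_HC_RELEASE_PGD, __pa(pgd));
}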
 static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
 {
 	struct page *page;
-	int i;
+	int i, j;

 	ASSERT(vcpu);

@@ -1227,17 +1261,22 @@ static int alloc_mmu_pages(struct kvm_vc
 		++vcpu->kvm->n_free_mmu_pages;
 	}

-	/*
-	 * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
-	 * Therefore we need to allocate shadow page tables in the first
-	 * 4GB of memory, which happens to fit the DMA32 zone.
-	 */
-	page = alloc_page(GFP_KERNEL | __GFP_DMA32);
-	if (!page)
-		goto error_1;
-	vcpu->mmu.pae_root = page_address(page);
-	for (i = 0; i < 4; ++i)
-		vcpu->mmu.pae_root[i] = INVALID_PAGE;
+	for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
+		/*
+		 * When emulating 32-bit mode, cr3 is only 32 bits even on
+		 * x86_64. Therefore we need to allocate shadow page tables
+		 * in the first 4GB of memory, which happens to fit the DMA32
+		 * zone:
+		 */
+		page = alloc_page(GFP_KERNEL | __GFP_DMA32);
+		if (!page)
+			goto error_1;
+
+		ASSERT(!vcpu->mmu.pae_root[j]);
+		vcpu->mmu.pae_root[j] = page_address(page);
+		for (i = 0; i < 4; ++i)
+			vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
+	}
Since a pae root uses just 32 bytes, you can store all cache entries in
a single page. Not that it matters much.
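E.g. something like this (illustrative only; error handling trimmed).
The 32-byte slots are naturally aligned within the page, which also
satisfies the hardware's PDPT alignment requirement:

	page = alloc_page(GFP_KERNEL | __GFP_DMA32);
	if (!page)
		goto error_1;
	for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
		/* each pae root is 4 u64 entries == 32 bytes */
		vcpu->mmu.pae_root[j] = (u64 *)page_address(page) + j * 4;
		for (i = 0; i < 4; ++i)
			vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
	}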
Index: linux/include/linux/kvm.h
===================================================================
--- linux.orig/include/linux/kvm.h
+++ linux/include/linux/kvm.h
@@ -238,4 +238,44 @@ struct kvm_dirty_log {
 #define KVM_DUMP_VCPU _IOW(KVMIO, 250, int /* vcpu_slot */)
+
+#define KVM_CR3_CACHE_SIZE 4
+
+struct kvm_cr3_cache_entry {
+	u64 guest_cr3;
+	u64 host_cr3;
+};
+
+struct kvm_cr3_cache {
+	struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
+	u32 max_idx;
+};
+
+/*
+ * Per-VCPU descriptor area shared between guest and host. Writable to
+ * both guest and host. Registered with the host by the guest when
+ * a guest acknowledges paravirtual mode.
+ */
+struct kvm_vcpu_para_state {
+	/*
+	 * API version information for compatibility. If there's any support
+	 * mismatch (too old host trying to execute too new guest) then
+	 * the host will deny entry into paravirtual mode. Any other
+	 * combination (new host + old guest and new host + new guest)
+	 * is supposed to work - new host versions will support all old
+	 * guest API versions.
+	 */
+	u32 guest_version;
+	u32 host_version;
+	u32 size;
+	u32 __pad_00;
+
+	struct kvm_cr3_cache cr3_cache;
+
+} __attribute__ ((aligned(PAGE_SIZE)));
+
+#define KVM_PARA_API_VERSION 1
+
+#define KVM_API_MAGIC 0x87654321
+
<linux/kvm.h> is the vmm userspace interface. The guest/host interface
should probably go somewhere else.
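For illustration, the split could look like this (the file name is a
suggestion, not something the patch introduces): keep the ioctl ABI in
<linux/kvm.h> and move the shared guest/host structures into their own
header:

/* include/linux/kvm_para.h - shared guest/host ABI, out of the ioctl header */
#ifndef __LINUX_KVM_PARA_H
#define __LINUX_KVM_PARA_H

#include <linux/types.h>

#define KVM_CR3_CACHE_SIZE 4

struct kvm_cr3_cache_entry {
	__u64 guest_cr3;
	__u64 host_cr3;
};

struct kvm_cr3_cache {
	struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
	__u32 max_idx;
};

/* kvm_vcpu_para_state and the version/magic constants move here as well */

#endif /* __LINUX_KVM_PARA_H */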
--
error compiling committee.c: too many arguments to function