Re: [PATCH 6/13] KVM: memory slot management

Arnd Bergmann wrote:
On Thursday 26 October 2006 19:27, Avi Kivity wrote:
kvm defines memory in "slots", more or less corresponding to the DIMM
slots.

this allows us to:
 - avoid the VGA hole at 640K
 - add a pci framebuffer at runtime
 - hotplug memory

Signed-off-by: Yaniv Kamay <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>

To bring up the discussion about guest memory allocation again,
I'd like to make a case for defining guest real memory as host user
memory, not as a special in-kernel address space.

You're probably aware of many of these points, but I'd like
to list all that I can think of, in case you haven't thought
of them:

- no need to preallocate memory that the guest doesn't actually use.

Well, a fully virtualized guest will likely use all the memory it gets. Linux certainly will.

- guest memory can be paged to disk.
- you can mmap files into multiple guest for fast communication
- you can use mmap host files as backing store for guest blockdevices,
  including ext2 with the -o xip mount option to avoid double paging

What do you mean exactly? To respond to a block device read by mmap()ing the backing file into the pages the guest requested?

(e.g. turn a guest bio read into a host mmap)

If we allow the pages to be writable, the guest could write into the virtual block device just by modifying a read page (which might have been discarded and no longer be related to the block device).
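
Roughly, I assume something like this on the host side (the helper and names below are made up; mapping the image PROT_READ is what keeps guest writes from going straight to the block device):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /*
     * Hypothetical sketch only: satisfy a guest block read by mapping
     * the backing image read-only at the right guest-physical offset.
     * "guest_ram" is assumed to be the monitor's userspace view of
     * guest memory; gpa_off is the guest physical offset of the read.
     */
    static void *map_backing_read(void *guest_ram, off_t gpa_off,
                                  const char *image, off_t file_off,
                                  size_t len)
    {
        int fd = open(image, O_RDONLY);
        void *p;

        if (fd < 0)
            return MAP_FAILED;

        /* PROT_READ only: guest writes must fault instead of silently
         * modifying (a possibly stale private copy of) the image */
        p = mmap((char *)guest_ram + gpa_off, len, PROT_READ,
                 MAP_PRIVATE | MAP_FIXED, fd, file_off);
        close(fd);
        return p;
    }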

- you can mmap packet sockets or similar to provide virtual networking
  devices.
- you can use hugetlbfs inside of guests

This one is especially important (it also allows large pages to be used for kernel mapping)
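
(For the host side of this, backing guest memory with hugetlbfs is then just an ordinary mmap; a rough sketch, assuming a hugetlbfs mount at /mnt/huge and a made-up file name:)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical sketch: back guest "physical" memory with a file
     * on a hugetlbfs mount so the host uses large pages for it.
     * size must be a multiple of the huge page size. */
    static void *alloc_guest_ram_huge(size_t size)
    {
        int fd = open("/mnt/huge/guest-ram", O_CREAT | O_RDWR, 0600);
        void *ram;

        if (fd < 0)
            return MAP_FAILED;
        if (ftruncate(fd, size) < 0) {
            close(fd);
            return MAP_FAILED;
        }
        ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, 0);
        close(fd);
        return ram;
    }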

- you can mmap simple devices (e.g. frame buffer) directly into
  trusted guests without HW emulation.
- you can use gdb to debug the running guest address space

We already do that now.

- no need for ioctl to access or allocate guest memory
- you can mmap a kernel image (MAP_PRIVATE) into multiple guests
  and share instruction cache lines.
- the kernel code doesn't need special accessors, but can use
  asm/uaccess.h.
- may be able to avoid a bunch of TLB flushes with nested page tables.

This one is problematic.  More below.

On the downside, I can see these points:

- As you mentioned, guest size on 32-bit hosts is limited to around 1G.
- you probably have to rewrite your virtual MMU from scratch

It's not that bad.  More below.

- for optimal performance, pageable guests need something like the s390
  pagex/pfault mechanism in the guest kernel.
- if you want a guest not to be paged out, you need privileges to do mlock.

I don't see that as a deficiency.

- you can't use swap space in the guest if you want to avoid the
  double paging problem (host needs to read a page from disk for the guest
  to swap it out), or you'd have to implement a mechanism like Martin
  Schwidefsky's page hints (cmm2) for s390.

The crux of the problem is communicating host kernel vm decisions to the guest tlb. There are three cases to consider:

1. The current mmu implementation, based on emulating invlpg, mov cr3 (the tlb flush instruction), and page faults.

Here, it is quite easy to use user pages for the guest. On a guest page fault, get_user_pages() the backing page and install the shadow pte. On a tlb flush or invlpg, transfer the dirty bit and put_page(). This automatically deals with multiple guest mappings for a page.
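
Very roughly (the kvm_* helpers and types below are made up for illustration and error handling is omitted; only get_user_pages(), as it looks today, put_page() and set_page_dirty_lock() are real interfaces):

    #include <linux/errno.h>
    #include <linux/mm.h>

    /* Hypothetical sketch of case 1: pin the user page on a guest
     * page fault, release it again when the guest flushes its tlb. */
    static int shadow_fault(struct kvm_vcpu *vcpu, unsigned long hva,
                            gpa_t gpa, int write)
    {
        struct page *page;

        if (get_user_pages(current, current->mm, hva, 1, write, 0,
                           &page, NULL) != 1)
            return -EFAULT;

        /* made-up helper: point the shadow pte at page_to_pfn(page) */
        kvm_shadow_set_pte(vcpu, gpa, page, write);
        return 0;
    }

    static void shadow_invlpg(struct kvm_vcpu *vcpu, gpa_t gpa)
    {
        int dirty;
        /* made-up helper: tear down the shadow pte, report its
         * dirty bit */
        struct page *page = kvm_shadow_clear_pte(vcpu, gpa, &dirty);

        if (!page)
            return;
        if (dirty)
            set_page_dirty_lock(page);
        put_page(page);    /* the page is pageable again */
    }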

A potential problem is that the guest will lock many pages into memory, without any control by the kernel. In practice, because of frequent guest tlb flushes, the pages will be released very often.

2. The next mmu implementation, which caches guest translations.

The potential problem above now becomes acute. The guest will have kernel mappings for every page, and after a short while they'll all be faulted in and locked. This defeats the swap integration, which is IMO a very strong point.

We can work around that by periodically forcing out translations (some kind of clock algorithm) at some rate so the host vm can have a go at them. That can turn out to be expensive as we'll need to interrupt all running vcpus to flush (real) tlb entries.
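
Something like this clock-ish sweep, say (all structures and the force_vcpus_to_flush() call are made up; the point is that the put_page() at the end is what gives the host vm its chance):

    #include <linux/list.h>
    #include <linux/mm.h>
    #include <linux/slab.h>

    /* Hypothetical sketch: periodically sweep the cached translations,
     * evict the ones not accessed since the last sweep, and release
     * the pages so the host vm can reclaim them. */
    struct shadow_entry {
        struct list_head link;
        struct page *page;
        int accessed;    /* copied back from the shadow pte */
        int dirty;
    };

    static void shadow_clock_sweep(struct kvm *kvm)
    {
        struct shadow_entry *e, *tmp;

        list_for_each_entry_safe(e, tmp, &kvm->shadow_lru, link) {
            if (e->accessed) {
                e->accessed = 0;    /* give it another round */
                continue;
            }
            list_del(&e->link);
            if (e->dirty)
                set_page_dirty_lock(e->page);
            put_page(e->page);
            kfree(e);
        }
        /* the expensive bit: every vcpu must flush the real tlb
         * entries derived from the evicted translations */
        force_vcpus_to_flush(kvm);    /* made up */
    }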

3. Nested page table translations, effectively doing the guest mmu in hardware.

I only have details of AMD's implementation (I don't actually know that Intel will have an implementation, but I can't imagine they won't compete).

AMD allows you to define a guest physical -> host physical mapping using the regular page table format. Guest accesses are interpreted as user-mode accesses. There is no other translation involved, so an access to guest physical address 0 will be translated as a user access to virtual address 0.
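
So filling a leaf entry of the nested table looks just like ordinary pte manipulation; a toy sketch (npt_walk() is a made-up table walker and the kvm/gpa_t types are used loosely; the pte bits are the standard x86 ones):

    #include <asm/pgtable.h>

    /* Hypothetical sketch: map one guest physical page to a host
     * physical page in the nested table.  Guest accesses are treated
     * as user-mode accesses, hence _PAGE_USER. */
    static void npt_map_page(struct kvm *kvm, gpa_t gpa,
                             unsigned long host_pfn, int writable)
    {
        pte_t *ptep = npt_walk(kvm, gpa);    /* made-up walker */
        pgprot_t prot = __pgprot(_PAGE_PRESENT | _PAGE_USER |
                                 (writable ? _PAGE_RW : 0));

        set_pte(ptep, pfn_pte(host_pfn, prot));
    }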

This poses a few problems for interpreting a portion of userspace as guest physical memory:

a. we need the guest physical memory to start at virtual offset 0 (can probably be achieved by dynamic linker tricks)
b. we need to hide the userspace portion of the monitor from the guest physical address space
c. we need to extend host tlb invalidations to invalidate tlbs on guests

Items a and b can probably be taken care of by dedicating a few pgd entries to the guest, and hooking host pgd entry modifications to transfer the modified entries to a guest pgd (where they can start at offset 0 and not include monitor pages).
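
I.e. something along these lines whenever a relevant host pgd entry changes (all of it hypothetical; a real version would go through the proper arch accessors and hook the actual pgd writers):

    #include <linux/mm.h>
    #include <asm/pgtable.h>

    #define GUEST_PGD_SLOTS 4    /* made-up number of dedicated slots */

    /* Hypothetical sketch of items a and b: copy the host pgd slots
     * that cover guest physical memory into the guest's own pgd,
     * shifted so guest physical address 0 lands in slot 0; the slots
     * covering the monitor's userspace are simply left empty. */
    static void sync_guest_pgd(struct mm_struct *mm, pgd_t *guest_pgd,
                               unsigned long guest_base_va)
    {
        pgd_t *host = pgd_offset(mm, guest_base_va);
        int i;

        for (i = 0; i < GUEST_PGD_SLOTS; i++)
            guest_pgd[i] = host[i];    /* guest physical 0 -> slot 0 */
        /* remaining entries stay clear: the monitor stays invisible */
    }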

We will need to see the Intel implementation when it comes out to see how that fits in.

Another option altogether is to allow a task to have two mm_structs, one for userspace and one for the guest. This involves less messing with core page table manipulations, but will require adding ways to manipulate this extra address space.
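
The interface for that might end up looking something like the following declarations (entirely invented, just to show the shape of the extra manipulation calls that would be needed):

    #include <linux/mm.h>

    /* Hypothetical sketch: a second address space hanging off the
     * task, with explicit calls to populate it, since mmap()/munmap()
     * only operate on current->mm.  None of these functions exist. */
    struct kvm_guest_mm {
        struct mm_struct *mm;    /* guest physical address space */
    };

    /* map len bytes of monitor memory at guest physical address gpa */
    int kvm_guest_mm_map(struct kvm_guest_mm *gmm, unsigned long gpa,
                         void __user *monitor_addr, unsigned long len);

    /* tear a range back down, e.g. for memory unplug */
    int kvm_guest_mm_unmap(struct kvm_guest_mm *gmm, unsigned long gpa,
                           unsigned long len);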

So, in summary:
- definitely looks like a good direction
- will require seeing the Intel nested page table specs
- will require intrusive changes to the Linux vm

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

