On Fri, 2006-09-01 at 09:37 -0700, Dave Hansen wrote:
> Can you give me the sequence of events that occur so that we need to
> set, then check PG_discarded? I'm not getting it.
>
> 1. there is good data in a page
> ...
> 50. ... and PG_discarded gets set
> ...
> 99. We check PG_discarded and ...
Ok, here we go:
0) there is good data in a page
1) the host scans for pages to reclaim and selects a page of a
particular guest
2) the host checks the page state and decides to either swap the page or
discard it
3) nothing happens for a long time
4) the guest comes around and tries to access the long gone page
5) the host gets a fault because the page is gone from the hosts page
table for the guest system
6) the host delivers a discard fault to the guest
7) the architecture dependent fault handler gets a page reference for
the discarded page (tricky for s390)
8) page_discard is called which locks the page and does a
TestSetPageDiscarded. If the bit has not been set yet the page is
removed from the page cache. There can still be page references around.
Concurrent to 5-8 another cpu could be just be removing the page from
page cache as well. Without the check for the discarded bit the page
would get removed twice. This does nasty things to reference counting,
mapping->nrpages, ...
> > NUMA node is not granular enough, mem_section is probably doable. I do
> > not understand the part about the bit in the mem_map[] area, a bit in
> > the page->flags is exactly that, isn't it?
>
> No, I'm being tricky. There are struct pages for all memory, including
> kernel memory. mem_map[] is in kernel memory. So, the memory for the
> mem_map[] has struct pages, which themselves are in the mem_map[].
>
> void lock_page_for_state_change(struct page *page)
> {
> struct pages_backing_page = virt_to_page(page);
> lock_page(pages_backing_page)
> }
>
> We did this for a bit with sparsemem, I think. That is, until Andy came
> up with something even more clever.
Ouch, I understand what you are trying to tell me. The struct page
entries that cover the mem_map array itself has free bits we could try
to cannibalize.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]