Test mail with my signature; the mail content is based on the second quilt patch (Linux 2.6.16.29). Only two key files are re-sent:
1) Documentation/vm_pps.txt
2) mm/vmscan.c

Index: test.signature/Documentation/vm_pps.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ test.signature/Documentation/vm_pps.txt	2007-01-06 07:00:18.146480584 +0800
@@ -0,0 +1,214 @@
+                    Pure Private Page System (pps)
+                 Copyright by Yunfeng Zhang on GFDL 1.2
+                 [email protected]
+                 December 24-26, 2006
+
+// Purpose <([{
+This file documents an idea which was first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch described here enhances the performance of the Linux swap subsystem. You
+can find an overview of the idea in section <How to Reclaim Pages more
+Efficiently> and how I patch it into Linux 2.6.16.29 in section <Pure Private
+Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability; when you
+look down from a manager's view, you free yourself from disordered code and
+spot some problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers:
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA; the architecture-independent layer).
+3) PTE and zone/memory inode layer (architecture-dependent).
+4) You might expect Page to be placed on the 3rd layer, but here it is placed
+   on the 2nd layer since it is the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it is natural
+that the swap subsystem should be deployed and implemented on the 2nd layer.
+
+Undoubtedly, this approach has some virtues:
+1) The SwapDaemon can collect statistics on how a process accesses its pages
+   and unmap PTEs accordingly. SMP especially benefits from this, because we
+   can use flush_tlb_range to unmap PTEs in batches instead of sending a TLB
+   IPI interrupt per page as the current Linux legacy swap subsystem does (see
+   the sketch after this list).
+2) Page-fault can issue better readahead requests, since history data shows
+   that all related pages have a conglomerating affinity. In contrast, Linux
+   page-fault readahead fetches the pages adjacent to the SwapSpace position
+   of the current faulting page.
+3) It conforms to the POSIX madvise API family.
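+
+To illustrate virtue 1, here is a minimal sketch (not literal patch code;
+clear_one_pte is a hypothetical helper) of unmapping a range with a single
+flush_tlb_range call instead of one IPI per page:
+
+    static void unmap_range_batched(struct vm_area_struct *vma,
+                                    unsigned long start, unsigned long end)
+    {
+            unsigned long addr;
+
+            for (addr = start; addr < end; addr += PAGE_SIZE)
+                    clear_one_pte(vma->vm_mm, addr);  /* no per-page IPI here */
+
+            /* one TLB shootdown for the whole range */
+            flush_tlb_range(vma, start, end);
+    }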
+
+Unfortunately, the Linux 2.6.16.29 swap subsystem is based on the 3rd layer --
+a system built on zone::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps <([{
+As I mentioned in the previous section, applying my idea perfectly would
+require uprooting the page-centered swap subsystem and migrating it onto VMAs,
+but a huge gap has defeated me -- active_list and inactive_list. In fact, you
+can find lru_add_active code almost anywhere ... It's IMPOSSIBLE for me to
+complete that alone. It's also the difference between my design and Linux: in
+my OS, a page is totally in the charge of its new owner, while in Linux the
+page management system still traces it by the PG_active flag.
+
+So I conceived another solution:) That is, set up an independent page-recycle
+system rooted on the Linux legacy page system -- pps: intercept all private
+pages belonging to a PrivateVMA into pps, then use pps to recycle them. By the
+way, the whole job consists of two parts; here is the first --
+PrivateVMA-oriented (PPS), the other is SharedVMA-oriented (should be called
+SPS) and is scheduled for the future. Of course, once both are done, they will
+empty the Linux legacy page system.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages in the SwapDaemon (mm/vmscan.c:shrink_private_vma); the whole process is
+divided into six stages -- <Stage Definition>. Other sections show the
+remaining aspects of pps:
+1) <Data Definition> is the basic data definition.
+2) <Concurrent Racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> -- how private pages enter/leave pps.
+4) <VMA Lifecycle of pps> -- which VMAs belong to pps.
+
+PPS uses the init_mm.mmlist list to enumerate all swappable UserSpace
+(shrink_private_vma).
+
+A new kernel thread -- kppsd -- is introduced in mm/vmscan.c; its task is to
+execute the stages of pps periodically. Note that an appropriate timeout in
+ticks is necessary, so applications get a chance to re-map their PrivatePages
+back from UnmappedPTE to PTE, that is, to show their conglomerating affinity.
+The scan_control::pps_cmd field is used to control the behavior of kppsd;
+= 1 accelerates the scanning process and reclaims pages, and is used in
+balance_pgdat.
+
+PPS statistic data is appended to the /proc/meminfo entry; its prototype is in
+include/linux/mm.h.
+
+I'm also glad to highlight another new idea of mine -- dftlb, which is
+described in section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced by me to enhance TLB flushing efficiency. In
+brief, when we want to unmap a page from the page table of a process, why send
+a TLB IPI to the other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert flushing tasks into the timer interrupt routine to
+implement a free-of-charge TLB flush.
+
+The trick is implemented in
+1) TLB flushing tasks are added in fill_in_tlb_tasks of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   flushing tasks.
+3) all data are defined in include/linux/mm.h.
+
+The restrictions of dftlb. The following conditions must be met:
+1) an atomic cmpxchg instruction.
+2) the CPU sets the access bit atomically when it first touches a pte.
+3) On some architectures the vma parameter of flush_tlb_range may matter; if
+   so, don't use dftlb, since the vma of a TLB flushing task may already be
+   gone by the time a CPU executes the task in its timer interrupt.
+If these cannot be met, combine stage 1 with stage 2, and send the IPI
+immediately in fill_in_tlb_tasks.
+
+dftlb increases mm_struct::mm_users to prevent the mm from being freed while
+another CPU is working on it.
+// }])>
+
+// Stage Definition <([{
+The whole process of private page page-out is divided into six stages, as
+shown in shrink_pvma_scan_ptes of mm/vmscan.c; the code groups similar pages
+into a series (a simplified classification sketch follows the list).
+1) PTE to untouched PTE (access bit is cleared); append flushing tasks to
+   dftlb.
+2) Convert untouched PTE to UnmappedPTE.
+3) Link a SwapEntry to every UnmappedPTE.
+4) Flush the PrivatePage of an UnmappedPTE to its on-disk SwapPage.
+5) Reclaim the page and shift the UnmappedPTE to SwappedPTE.
+6) SwappedPTE stage.
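+
+As a reading aid, the stage of a single pte/page pair can be classified
+roughly as below. This is a simplified sketch, not the patch code (it omits
+the reserved-page case); see get_series_stage in mm/vmscan.c for the real
+version:
+
+    static int classify_stage(pte_t pte, struct page *page)
+    {
+            if (pte_present(pte))
+                    return pte_young(pte) ? 1 : 2;  /* stage 1 or 2 */
+            if (pte_unmapped(pte)) {                /* UnmappedPTE */
+                    if (!PageSwapCache(page))
+                            return 3;               /* needs a swap slot */
+                    if (PageWriteback(page) || PageDirty(page))
+                            return 4;               /* needs writeback */
+                    return 5;                       /* clean, reclaimable */
+            }
+            return 6;                               /* SwappedPTE */
+    }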
+// }])>
+
+// Data Definition <([{
+A new VMA flag (VM_PURE_PRIVATE) is appended to the VMA in include/linux/mm.h.
+
+A new PTE type (UnmappedPTE) is appended to the PTE system in
+include/asm-i386/pgtable.h. Its prototype is
+struct UnmappedPTE {
+    int present : 1; // must be 0.
+    ...
+    int pageNum : 20;
+};
+The new PTE type has one key feature: it keeps a link to its PrivatePage while
+preventing the page from being visited by the CPU, so you can use it as a
+middleware in <Stage Definition>.
+// }])>
+
+// Concurrent Racers of Shrinking pps <([{
+shrink_private_vma of mm/vmscan.c uses init_mm.mmlist to scan all swappable
+mm_struct instances. During the scanning and reclaiming process it read-locks
+every mm_struct object, which brings some potential concurrent racers:
+1) mm/swapfile.c pps_swapoff (swapoff API).
+2) mm/memory.c do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page
+   (page-fault).
+
+The VMAs of pps can coexist with madvise, mlock, mprotect, mmap and munmap;
+that is why a new VMA created by mmap.c:split_vma can re-enter pps.
+// }])>
+
+// Private Page Lifecycle of pps <([{
+All pages belonging to pps are called pure private pages; their PTE type is
+PTE or UnmappedPTE.
+
+IN (NOTE: when a pure private page enters pps, it is also trimmed from the
+Linux legacy page system by commenting out the lru_cache_add_active clause)
+1) fs/exec.c install_arg_pages (argument pages).
+2) mm/memory.c do_anonymous_page, do_wp_page, do_swap_page (page fault).
+3) mm/swap_state.c read_swap_cache_async (swap pages).
+
+OUT
+1) mm/vmscan.c shrink_pvma_scan_ptes (stage 6, reclaim a private page).
+2) mm/memory.c zap_pte_range (free a page).
+3) kernel/fork.c dup_mmap (if someone uses fork, migrate all pps pages back to
+   let the Linux legacy page system manage them).
+
+While a pure private page is in pps, it can be visited simultaneously by
+page-fault and the SwapDaemon.
+// }])>
+
+// VMA Lifecycle of pps <([{
+When a PrivateVMA enters pps, a new flag -- VM_PURE_PRIVATE -- is OR-ed into
+it in memory.c:enter_pps; there you can also find which VMAs are fit for pps.
+The flag is used in shrink_private_vma of mm/vmscan.c. Other fields are left
+untouched.
+
+IN.
+1) fs/exec.c setup_arg_pages (StackVMA).
+2) mm/mmap.c do_mmap_pgoff, do_brk (DataVMA).
+3) mm/mmap.c split_vma, copy_vma (in some cases, we need to copy a VMA from an
+   existing VMA).
+
+OUT.
+1) kernel/fork.c dup_mmap (if someone uses fork, return the VMA back to the
+   Linux legacy system).
+2) mm/mmap.c remove_vma, vma_adjust (destroy VMA).
+3) mm/mmap.c do_mmap_pgoff (delete the VMA when an error occurs).
+// }])>
+
+// Postscript <([{
+Note, some circumstances aren't tested due to hardware restrictions, e.g. SMP
+dftlb.
+
+Here are some possible improvements to pps:
+1) In fact, I recommend a one-to-one private model -- PrivateVMA, (PTE,
+   UnmappedPTE) and PrivatePage (SwapPage) -- which is described in my OS and
+   in the Linux kernel mailing list hyperlink above. So using the Linux legacy
+   SwapCache in my pps is a compromise.
+2) SwapSpace should provide more flexible interfaces; shrink_pvma_scan_ptes
+   needs to allocate swap entries in batches, more exactly, to allocate a
+   batch of fake continual swap entries, see mm/pps_swapin_readahead.
+
+If the Linux kernel group can't schedule a rewrite of their memory code, pps
+may be the best solution available so far.
+// }])>
+// vim: foldmarker=<([{,}])> foldmethod=marker et

Index: test.signature/mm/vmscan.c
===================================================================
--- test.signature.orig/mm/vmscan.c	2007-01-06 07:00:11.799445480 +0800
+++ test.signature/mm/vmscan.c	2007-01-06 07:00:23.326693072 +0800
@@ -79,6 +79,9 @@
 	 * In this context, it doesn't matter that we scan the
 	 * whole list at once. */
 	int swap_cluster_max;
+
+	/* pps control command, 0: do stage 1-4, kppsd only; 1: full stages.
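+	 * balance_pgdat copies its scan_control into wakeup_sc with
+	 * pps_cmd = 1 and wakes kppsd, so a memory-pressure pass also runs
+	 * stage 5 and reclaims pages; the periodic kppsd pass keeps
+	 * pps_cmd = 0 and skips stage 5 (see shrink_pvma_scan_ptes below).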
*/ + int pps_cmd; }; /* @@ -1514,6 +1517,428 @@ return ret; } +// pps fields. +static wait_queue_head_t kppsd_wait; +static struct scan_control wakeup_sc; +struct pps_info pps_info = { + .total = ATOMIC_INIT(0), + .pte_count = ATOMIC_INIT(0), // stage 1 and 2. + .unmapped_count = ATOMIC_INIT(0), // stage 3 and 4. + .swapped_count = ATOMIC_INIT(0) // stage 6. +}; +// pps end. + +struct series_t { + pte_t orig_ptes[MAX_SERIES_LENGTH]; + pte_t* ptes[MAX_SERIES_LENGTH]; + struct page* pages[MAX_SERIES_LENGTH]; + int series_length; + int series_stage; +} series; + +static int get_series_stage(pte_t* pte, int index) +{ + series.orig_ptes[index] = *pte; + series.ptes[index] = pte; + if (pte_present(series.orig_ptes[index])) { + struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index])); + series.pages[index] = page; + if (page == ZERO_PAGE(addr)) // reserved page is exclusive from us. + return 7; + if (pte_young(series.orig_ptes[index])) { + return 1; + } else + return 2; + } else if (pte_unmapped(series.orig_ptes[index])) { + struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index])); + series.pages[index] = page; + if (!PageSwapCache(page)) + return 3; + else { + if (PageWriteback(page) || PageDirty(page)) + return 4; + else + return 5; + } + } else // pte_swapped -- SwappedPTE + return 6; +} + +static void find_series(pte_t** start, unsigned long* addr, unsigned long end) +{ + int i; + int series_stage = get_series_stage((*start)++, 0); + *addr += PAGE_SIZE; + + for (i = 1; i < MAX_SERIES_LENGTH && *addr < end; i++, (*start)++, *addr += PAGE_SIZE) { + if (series_stage != get_series_stage(*start, i)) + break; + } + series.series_stage = series_stage; + series.series_length = i; +} + +struct delay_tlb_task delay_tlb_tasks[32] = { [0 ... 31] = {0} }; + +void timer_flush_tlb_tasks(void* data) +{ + int i; +#ifdef CONFIG_X86 + int flag = 0; +#endif + for (i = 0; i < 32; i++) { + if (delay_tlb_tasks[i].mm != NULL && + cpu_isset(smp_processor_id(), delay_tlb_tasks[i].mm->cpu_vm_mask) && + cpu_isset(smp_processor_id(), delay_tlb_tasks[i].cpu_mask)) { +#ifdef CONFIG_X86 + flag = 1; +#elif + // smp::local_flush_tlb_range(delay_tlb_tasks[i]); +#endif + cpu_clear(smp_processor_id(), delay_tlb_tasks[i].cpu_mask); + } + } +#ifdef CONFIG_X86 + if (flag) + local_flush_tlb(); +#endif +} + +static struct delay_tlb_task* delay_task = NULL; +static int vma_index = 0; + +static struct delay_tlb_task* search_free_tlb_tasks_slot(void) +{ + struct delay_tlb_task* ret = NULL; + int i; +again: + for (i = 0; i < 32; i++) { + if (delay_tlb_tasks[i].mm != NULL) { + if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) { + mmput(delay_tlb_tasks[i].mm); + delay_tlb_tasks[i].mm = NULL; + ret = &delay_tlb_tasks[i]; + } + } else + ret = &delay_tlb_tasks[i]; + } + if (!ret) { // Force flush TLBs. + on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1); + goto again; + } + return ret; +} + +static void init_delay_task(struct mm_struct* mm) +{ + cpus_clear(delay_task->cpu_mask); + vma_index = 0; + delay_task->mm = mm; +} + +/* + * We will be working on the mm, so let's force to flush it if necessary. + */ +static void start_tlb_tasks(struct mm_struct* mm) +{ + int i, flag = 0; +again: + for (i = 0; i < 32; i++) { + if (delay_tlb_tasks[i].mm == mm) { + if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) { + mmput(delay_tlb_tasks[i].mm); + delay_tlb_tasks[i].mm = NULL; + } else + flag = 1; + } + } + if (flag) { // Force flush TLBs. 
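+		/*
+		 * A delayed-flush task for this mm is still pending on some
+		 * CPU: run the flush handler everywhere, then rescan so the
+		 * stale slot can be released before new work is queued.
+		 */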
+ on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1); + goto again; + } + BUG_ON(delay_task != NULL); + delay_task = search_free_tlb_tasks_slot(); + init_delay_task(mm); +} + +static void end_tlb_tasks(void) +{ + atomic_inc(&delay_task->mm->mm_users); + delay_task->cpu_mask = delay_task->mm->cpu_vm_mask; + delay_task = NULL; +#ifndef CONFIG_SMP + timer_flush_tlb_tasks(NULL); +#endif +} + +static void fill_in_tlb_tasks(struct vm_area_struct* vma, unsigned long addr, + unsigned long end) +{ + struct mm_struct* mm; + // First, try to combine the task with the previous. + if (vma_index != 0 && delay_task->vma[vma_index - 1] == vma && + delay_task->end[vma_index - 1] == addr) { + delay_task->end[vma_index - 1] = end; + return; + } +fill_it: + if (vma_index != 32) { + delay_task->vma[vma_index] = vma; + delay_task->start[vma_index] = addr; + delay_task->end[vma_index] = end; + vma_index++; + return; + } + mm = delay_task->mm; + end_tlb_tasks(); + + delay_task = search_free_tlb_tasks_slot(); + init_delay_task(mm); + goto fill_it; +} + +static void shrink_pvma_scan_ptes(struct scan_control* sc, struct mm_struct* + mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned long addr, + unsigned long end) +{ + int i, statistic; + spinlock_t* ptl = pte_lockptr(mm, pmd); + pte_t* pte = pte_offset_map(pmd, addr); + int anon_rss = 0; + struct pagevec freed_pvec; + int may_enter_fs = (sc->gfp_mask & (__GFP_FS | __GFP_IO)); + struct address_space* mapping = &swapper_space; + + pagevec_init(&freed_pvec, 1); + do { + memset(&series, 0, sizeof(struct series_t)); + find_series(&pte, &addr, end); + if (sc->pps_cmd == 0 && series.series_stage == 5) + continue; + switch (series.series_stage) { + case 1: // PTE -- untouched PTE. + for (i = 0; i < series.series_length; i++) { + struct page* page = series.pages[i]; + lock_page(page); + spin_lock(ptl); + if (unlikely(pte_same(*series.ptes[i], series.orig_ptes[i]))) { + if (pte_dirty(*series.ptes[i])) + set_page_dirty(page); + set_pte_at(mm, addr + i * PAGE_SIZE, series.ptes[i], + pte_mkold(pte_mkclean(*series.ptes[i]))); + } + spin_unlock(ptl); + unlock_page(page); + } + fill_in_tlb_tasks(vma, addr, addr + (PAGE_SIZE * series.series_length)); + break; + case 2: // untouched PTE -- UnmappedPTE. + /* + * Note in stage 1, we've flushed TLB in fill_in_tlb_tasks, so + * if it's still clear here, we can shift it to Unmapped type. + * + * If some architecture doesn't support atomic cmpxchg + * instruction or can't atomically set the access bit after + * they touch a pte at first, combine stage 1 with stage 2, and + * send IPI immediately in fill_in_tlb_tasks. + */ + spin_lock(ptl); + statistic = 0; + for (i = 0; i < series.series_length; i++) { + if (unlikely(pte_same(*series.ptes[i], series.orig_ptes[i]))) { + pte_t pte_unmapped = series.orig_ptes[i]; + pte_unmapped.pte_low &= ~_PAGE_PRESENT; + pte_unmapped.pte_low |= _PAGE_UNMAPPED; + if (cmpxchg(&series.ptes[i]->pte_low, + series.orig_ptes[i].pte_low, + pte_unmapped.pte_low) != + series.orig_ptes[i].pte_low) + continue; + page_remove_rmap(series.pages[i]); + anon_rss--; + statistic++; + } + } + atomic_add(statistic, &pps_info.unmapped_count); + atomic_sub(statistic, &pps_info.pte_count); + spin_unlock(ptl); + break; + case 3: // Attach SwapPage to PrivatePage. + /* + * A better arithmetic should be applied to Linux SwapDevice to + * allocate fake continual SwapPages which are close to each + * other, the offset between two close SwapPages is less than 8. 
+ */ + if (sc->may_swap) { + for (i = 0; i < series.series_length; i++) { + lock_page(series.pages[i]); + if (!PageSwapCache(series.pages[i])) { + if (!add_to_swap(series.pages[i], GFP_ATOMIC)) { + unlock_page(series.pages[i]); + break; + } + } + unlock_page(series.pages[i]); + } + } + break; + case 4: // SwapPage isn't consistent with PrivatePage. + /* + * A mini version pageout(). + * + * Current swap space can't commit multiple pages together:( + */ + if (sc->may_writepage && may_enter_fs) { + for (i = 0; i < series.series_length; i++) { + struct page* page = series.pages[i]; + int res; + + if (!may_write_to_queue(mapping->backing_dev_info)) + break; + lock_page(page); + if (!PageDirty(page) || PageWriteback(page)) { + unlock_page(page); + continue; + } + clear_page_dirty_for_io(page); + struct writeback_control wbc = { + .sync_mode = WB_SYNC_NONE, + .nr_to_write = SWAP_CLUSTER_MAX, + .nonblocking = 1, + .for_reclaim = 1, + }; + page_cache_get(page); + SetPageReclaim(page); + res = swap_writepage(page, &wbc); + if (res < 0) { + handle_write_error(mapping, page, res); + ClearPageReclaim(page); + page_cache_release(page); + break; + } + if (!PageWriteback(page)) + ClearPageReclaim(page); + page_cache_release(page); + } + } + break; + case 5: // UnmappedPTE -- SwappedPTE, reclaim PrivatePage. + statistic = 0; + for (i = 0; i < series.series_length; i++) { + struct page* page = series.pages[i]; + lock_page(page); + spin_lock(ptl); + if (unlikely(!pte_same(*series.ptes[i], series.orig_ptes[i]))) { + spin_unlock(ptl); + unlock_page(page); + continue; + } + statistic++; + swp_entry_t entry = { .val = page_private(page) }; + swap_duplicate(entry); + pte_t pte_swp = swp_entry_to_pte(entry); + set_pte_at(mm, addr + i * PAGE_SIZE, series.ptes[i], pte_swp); + spin_unlock(ptl); + if (PageSwapCache(page) && !PageWriteback(page)) + delete_from_swap_cache(page); + unlock_page(page); + + if (!pagevec_add(&freed_pvec, page)) + __pagevec_release_nonlru(&freed_pvec); + sc->nr_reclaimed++; + } + atomic_add(statistic, &pps_info.swapped_count); + atomic_sub(statistic, &pps_info.unmapped_count); + atomic_sub(statistic, &pps_info.total); + break; + case 6: + // NULL operation! 
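+			// SwappedPTE: the page content already lives in its
+			// swap slot; nothing to do here until a page fault
+			// brings it back in (do_swap_page).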
+ break; + } + } while (addr < end); + add_mm_counter(mm, anon_rss, anon_rss); + if (pagevec_count(&freed_pvec)) + __pagevec_release_nonlru(&freed_pvec); +} + +static void shrink_pvma_pmd_range(struct scan_control* sc, struct mm_struct* + mm, struct vm_area_struct* vma, pud_t* pud, unsigned long addr, + unsigned long end) +{ + unsigned long next; + pmd_t* pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (pmd_none_or_clear_bad(pmd)) + continue; + shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next); + } while (pmd++, addr = next, addr != end); +} + +static void shrink_pvma_pud_range(struct scan_control* sc, struct mm_struct* + mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned long addr, + unsigned long end) +{ + unsigned long next; + pud_t* pud = pud_offset(pgd, addr); + do { + next = pud_addr_end(addr, end); + if (pud_none_or_clear_bad(pud)) + continue; + shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next); + } while (pud++, addr = next, addr != end); +} + +static void shrink_pvma_pgd_range(struct scan_control* sc, struct mm_struct* + mm, struct vm_area_struct* vma) +{ + unsigned long next; + unsigned long addr = vma->vm_start; + unsigned long end = vma->vm_end; + pgd_t* pgd = pgd_offset(mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) + continue; + shrink_pvma_pud_range(sc, mm, vma, pgd, addr, next); + } while (pgd++, addr = next, addr != end); +} + +static void shrink_private_vma(struct scan_control* sc) +{ + struct vm_area_struct* vma; + struct list_head *pos; + struct mm_struct *prev, *mm; + + prev = mm = &init_mm; + pos = &init_mm.mmlist; + atomic_inc(&prev->mm_users); + spin_lock(&mmlist_lock); + while ((pos = pos->next) != &init_mm.mmlist) { + mm = list_entry(pos, struct mm_struct, mmlist); + if (!atomic_add_unless(&mm->mm_users, 1, 0)) + continue; + spin_unlock(&mmlist_lock); + mmput(prev); + prev = mm; + start_tlb_tasks(mm); + if (down_read_trylock(&mm->mmap_sem)) { + for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) { + if (!(vma->vm_flags & VM_PURE_PRIVATE)) + continue; + if (vma->vm_flags & VM_LOCKED) + continue; + shrink_pvma_pgd_range(sc, mm, vma); + } + up_read(&mm->mmap_sem); + } + end_tlb_tasks(); + spin_lock(&mmlist_lock); + } + spin_unlock(&mmlist_lock); + mmput(prev); +} + /* * For kswapd, balance_pgdat() will work across all this node's zones until * they are all at pages_high. @@ -1557,6 +1982,10 @@ sc.may_swap = 1; sc.nr_mapped = read_page_state(nr_mapped); + wakeup_sc = sc; + wakeup_sc.pps_cmd = 1; + wake_up_interruptible(&kppsd_wait); + inc_page_state(pageoutrun); for (i = 0; i < pgdat->nr_zones; i++) { @@ -1693,6 +2122,33 @@ return total_reclaimed; } +static int kppsd(void* p) +{ + struct task_struct *tsk = current; + int timeout; + DEFINE_WAIT(wait); + daemonize("kppsd"); + tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE; + struct scan_control default_sc; + default_sc.gfp_mask = GFP_KERNEL; + default_sc.may_writepage = 1; + default_sc.may_swap = 1; + default_sc.pps_cmd = 0; + + while (1) { + try_to_freeze(); + prepare_to_wait(&kppsd_wait, &wait, TASK_INTERRUPTIBLE); + timeout = schedule_timeout(2000); + finish_wait(&kppsd_wait, &wait); + + if (timeout) + shrink_private_vma(&wakeup_sc); + else + shrink_private_vma(&default_sc); + } + return 0; +} + /* * The background pageout daemon, started as a kernel thread * from the init process. 
@@ -1837,6 +2293,15 @@ } #endif /* CONFIG_HOTPLUG_CPU */ +static int __init kppsd_init(void) +{ + init_waitqueue_head(&kppsd_wait); + kernel_thread(kppsd, NULL, CLONE_KERNEL); + return 0; +} + +module_init(kppsd_init) + static int __init kswapd_init(void) { pg_data_t *pgdat;