This is a zone-based approach to fragmentation reduction. It is posted
in light of the discussions of the list-based approach (sometimes dubbed
sub-zones), where the prevailing opinion was that zones were
the answer. The patches are based on linux-2.6.16-rc1-mm1 and have been
successfully tested on x86 and ppc64. The patches are as follows:
Patches 1-4: These patches add the zone and set up the callers.
Patch 5: This is only for testing. It stops the OOM killer hitting everything
in sight while stress-testing high-order allocations. To keep the results of
the high-order allocation stress test comparable, this patch is applied
to both the stock -mm kernel and the kernel using the zone-based approach
to anti-fragmentation.
The usage scenario I set up to test the patches is:
1. Test machine: 4-way x86 machine with 1.5GiB physical RAM
2. Boot with kernelcore=512MB. This gives the kernel 512MB to work with and
the rest is placed in ZONE_EASYRCLM (see patch 3 for more comments about
the value of kernelcore).
3. Benchmark kbuild, aim9 and high order allocations
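As a rough illustration of the sizing in step 2, the sketch below works out how much memory is left for ZONE_EASYRCLM. It is not code from the patches: easyrclm_mib() is a hypothetical helper, MB is treated as MiB, and memory holes and per-node splitting are ignored.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from the patches: estimate how much
 * memory ends up in ZONE_EASYRCLM for a given kernelcore= value.
 * Both arguments are in MiB. Memory holes and per-node splitting are
 * ignored, so treat the result as a rough upper bound.
 */
static unsigned long easyrclm_mib(unsigned long total_mib,
                                  unsigned long kernelcore_mib)
{
        assert(total_mib > 0);

        /* Assumption: a kernelcore covering all of RAM leaves the zone empty. */
        if (kernelcore_mib >= total_mib)
                return 0;

        return total_mib - kernelcore_mib;
}
```

For the test machine, easyrclm_mib(1536, 512) gives 1024, i.e. roughly 1GiB of the 1.5GiB would be placed in ZONE_EASYRCLM under these assumptions.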
An alternative scenario has been tested that produces similar figures:
1. Test machine: 4-way x86 machine with 1.5GiB physical RAM
2. Boot with mem=512MB
3. Hot-add the remaining memory
4. Benchmark kbuild, aim9 and high order allocations
The alternative scenario requires two more patches related to hot-adding on
the x86. I can post them if people want to take a look or experiment with
hot-add instead of using kernelcore=.
With zone-based anti-fragmentation, the usage of zones changes slightly on
the x86. The HIGHMEM zone is effectively split into two, with allocations
destined for this area split between HIGHMEM and EASYRCLM. GFP_HIGHUSER pages
such as PTEs are passed to HIGHMEM and the remainder (mostly user pages)
are passed to EASYRCLM. Note that if kernelcore is less than the maximum size
of ZONE_NORMAL, GFP_HIGHMEM allocations will use ZONE_NORMAL, not the reachable
portion of ZONE_EASYRCLM.
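The zone choice described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the code in the patches: the toy_* names and flag values are hypothetical, and the fallback order (walk down from the preferred zone to the next lower zone that has memory) is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy zone indices, ordered lowest to highest as described in this mail. */
enum toy_zone { TOY_DMA, TOY_NORMAL, TOY_HIGHMEM, TOY_EASYRCLM, TOY_NR_ZONES };

/* Hypothetical flags for the model only; not the kernel's GFP bit values. */
#define TOY_GFP_HIGHMEM   0x1u  /* allocation may live in highmem */
#define TOY_GFP_EASYRCLM  0x2u  /* page is easily reclaimed, e.g. user pages */

/* Highest zone an allocation with these flags would like to use. */
static enum toy_zone toy_preferred_zone(unsigned int flags)
{
        if (flags & TOY_GFP_EASYRCLM)
                return TOY_EASYRCLM;
        if (flags & TOY_GFP_HIGHMEM)
                return TOY_HIGHMEM;
        return TOY_NORMAL;
}

/*
 * Walk down from the preferred zone until one with memory is found.
 * present_mib[] holds the size of each zone; returns -1 if all are empty.
 */
static int toy_pick_zone(unsigned int flags,
                         const unsigned long present_mib[TOY_NR_ZONES])
{
        int zone;

        assert(present_mib != NULL);

        for (zone = toy_preferred_zone(flags); zone >= TOY_DMA; zone--)
                if (present_mib[zone] > 0)
                        return zone;
        return -1;
}
```

With an empty HIGHMEM zone (the case where kernelcore is smaller than the maximum size of ZONE_NORMAL), a TOY_GFP_HIGHMEM request in this model lands in TOY_NORMAL, while a request that also carries TOY_GFP_EASYRCLM stays in TOY_EASYRCLM.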
I have also booted a kernel with neither mem= nor kernelcore= to make sure
there are no performance regressions in the normal case. On ppc64, a 2GiB
system was booted with kernelcore=896MB and dbench run as a regression test.
It was confirmed that ZONE_EASYRCLM was created and was being used.
Benchmark comparison between the -mm+NoOOM tree and the tree with the new zone
KBuild
                              2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Time taken to extract kernel:                   14                        14
Time taken to build kernel:                    741                       738
(Performance is about the same, which is what you would expect. To see a
regression, you would have to boot with kernelcore=TooSmallANumber)
Aim9
               2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
 1 creat-clo               12273.11                  12235.72     -37.39  -0.30% File Creations and Closes/second
 2 page_test              131762.75                 132946.18    1183.43   0.90% System Allocations & Pages/second
 3 brk_test               586206.90                 603298.90   17092.00   2.92% System Memory Allocations/second
 4 jmp_test              4375520.75                4376557.81    1037.06   0.02% Non-local gotos/second
 5 signal_test             79436.76                  81086.49    1649.73   2.08% Signal Traps/second
 6 exec_test                  62.90                     62.81      -0.09  -0.14% Program Loads/second
 7 fork_test                1211.92                   1212.52       0.60   0.05% Task Creations/second
 8 link_test                4332.30                   4346.60      14.30   0.33% Link/Unlink Pairs/second
(Again, performance is about the same. The differences are about the same
as what you would see between runs)
High order allocations under load
                        2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Order                                     10                        10
Allocation type                      HighMem                   HighMem
Attempted allocations                    275                       275
Success allocs                            60                       106
Failed allocs                            215                       169
DMA zone allocs                            1                         1
Normal zone allocs                         5                         8
HighMem zone allocs                       54                         0
EasyRclm zone allocs                       0                        97
% Success                                 21                        38
HighAlloc Under Load Test Results Pass 2
                        2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Order                                     10                        10
Allocation type                      HighMem                   HighMem
Attempted allocations                    275                       275
Success allocs                           101                       154
Failed allocs                            174                       121
DMA zone allocs                            1                         1
Normal zone allocs                         5                         8
HighMem zone allocs                       95                         0
EasyRclm zone allocs                       0                       145
% Success                                 36                        56
HighAlloc Test Results while Rested
                        2.6.16-rc1-mm1-clean  2.6.16-rc1-mm1-zbuddy-v3
Order                                     10                        10
Allocation type                      HighMem                   HighMem
Attempted allocations                    275                       275
Success allocs                           141                       212
Failed allocs                            134                        63
DMA zone allocs                            1                         1
Normal zone allocs                        16                         8
HighMem zone allocs                      124                         0
EasyRclm zone allocs                       0                       203
% Success                                 51                        77
Under load, the use of ZONE_EASYRCLM yields an additional 46 successful
HugeTLB-sized allocations (106 versus 60 in the first pass), which is a big
improvement. To compare, the list-based approach gave an additional 19. At
rest, an additional 71 pages were available (212 versus 141), although this
varies depending on the location of per-cpu pages (a patch is available that
drains them). To compare, at rest, the list-based approach was able to
allocate an additional 192 huge pages. It is important to note that the value
of kernelcore at boot time can have a big impact on these stress tests. Again,
to compare, list-based anti-fragmentation had no tunables.
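For anyone checking the figures, the percentages and the "additional pages" numbers follow directly from the success counts in the tables. The sketch below is not part of the test suite: success_pct() and extra_pages() are hypothetical helpers, and truncating the percentage to a whole number is an assumption that happens to match every row reported above.

```c
#include <assert.h>

/*
 * Percentage of attempted allocations that succeeded, truncated to a
 * whole number (assumption: this matches how the tables were produced).
 */
static unsigned int success_pct(unsigned int success, unsigned int attempted)
{
        assert(attempted > 0);
        return success * 100u / attempted;
}

/* Additional successful allocations with the new zone versus the clean tree. */
static int extra_pages(unsigned int clean, unsigned int zbuddy)
{
        return (int)zbuddy - (int)clean;
}
```

With 275 attempts, success_pct(106, 275) gives 38 and success_pct(60, 275) gives 21 for the first pass under load, and extra_pages(60, 106) and extra_pages(141, 212) give the 46 and 71 additional pages.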
In terms of performance, the kernel with the additional zone performs as
well as the standard kernel, with variances between runs typically around
+/- 2% on each test in aim9. If the zone is not sized at all, there is no
measurable performance difference with the patches applied. The zone-based
approach is a lot less invasive of the core paths than the list-based
approach was. The final diffstat is:
 arch/i386/kernel/setup.c |   28 +++++++++++++++++++++++++++-
 arch/powerpc/mm/numa.c   |   37 ++++++++++++++++++++++++++++++++++---
 fs/compat.c              |    2 +-
 fs/exec.c                |    2 +-
 fs/inode.c               |    2 +-
 include/asm-i386/page.h  |    3 ++-
 include/linux/gfp.h      |    3 +++
 include/linux/highmem.h  |    2 +-
 include/linux/mmzone.h   |   14 ++++++++------
 mm/memory.c              |    4 ++--
 mm/page_alloc.c          |   27 +++++++++++++++++++--------
 mm/shmem.c               |    4 ++++
 mm/swap_state.c          |    2 +-
 13 files changed, 104 insertions(+), 26 deletions(-)
Unlike the list-based (or sub-zones if you prefer) approach, the zone-based
approach does not help high-order kernel allocations, but it can help
huge pages. Huge pages are currently allocated from ZONE_HIGHMEM as they
are not "easily reclaimable". However, if the HugeTLB page is the same size
as a sparsemem section (the smallest unit that can be hot-removed),
we could use ZONE_EASYRCLM. If huge pages are the same size as a sparsemem
section, they cause no fragmentation within that section. On ppc64 this is
typically the case, but not so on x86. One possibility is to have an
architecture-specific option that determines if ZONE_EASYRCLM is used or not.
Comments?
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab