On Thu, 2 Nov 2006, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
>
>> Our tests show that about 60-70% of physical memory can be allocated on
>> a desktop after a few days uptime. In benchmarks and stress tests, we are
>> finding that 80% of memory is available as contiguous blocks at the end of
>> the test. To compare, a standard kernel was getting < 1% of memory as large
>> pages on a desktop and about 8-12% of memory as large pages at the end of
>> stress tests.
>
> If you don't have a fixed limit on the unreclaimable memory you could
> still get into a situation where all memory is fragmented and unreclaimable,
> right?

Right, it's just considerably harder, so there will be adverse workloads
that break it (heavy IO on very large numbers of files under high load
with reiserfs is one). I don't have a list of real workloads that break
anti-frag yet, so I want to get anti-frag out there and see whether it
helps the people who really care about hugepages or not.
I've included a script below that tries to get as many hugepages as
possible via the proc interface. What I usually do is run it after a
series of stress tests, or sometimes on a desktop after a few days, to see
how it gets on in comparison to the standard allocator. A test I ran there
got 73% of memory as huge pages on a system with 19 days uptime. However,
the machine wasn't heavily stressed during that time and I had configured
min_free_kbytes to be 10% as suggested in the CONFIG help.
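
If you want to reproduce that tuning, something along these lines sets
min_free_kbytes to roughly 10% of physical memory (I'm assuming that is
what the 10% refers to; adjust the divisor to taste):

#!/bin/bash
# Set min_free_kbytes to about 10% of physical memory.
# MemTotal in /proc/meminfo is reported in kB.
TOTAL_KB=`awk '/^MemTotal:/ { print $2 }' /proc/meminfo`
echo $(($TOTAL_KB / 10)) > /proc/sys/vm/min_free_kbytes
cat /proc/sys/vm/min_free_kbytes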
Generally anti-frag gets you way more hugepages, but not necessarily the
whole system's worth. To get all free memory as huge pages, I'd need to be
moving memory around and that would be very invasive. It gets better
results with the linear-reclaim or lumpy-reclaim patches applied.
For people to get 100% expected results, they will still need to size the
hugepage pool at boot-time or set aside a zone of reclaimable pages at
boot time. This patch is aimed at relaxing the restriction on sizing the
pool up while the system is in use. For example, take a batch-scheduled
machine running HPC jobs: I want it to be able to get more or less
hugepages between jobs without requiring reboots. I'd like to hear from
people who try resizing the pool what sort of success they have and what
sort of workloads broke the strategy for them.
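
To make the boot-time and runtime sizing concrete, they look roughly like
this (the page counts are only illustrative):

# Boot-time: reserve the pool with the hugepages= kernel parameter,
# e.g. append "hugepages=512" to the boot command line.
#
# Runtime: resize the pool through proc once the system is up
echo 512 > /proc/sys/vm/nr_hugepages
grep HugePages /proc/meminfo
#
# Either way, applications get at the pages through a hugetlbfs mount
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge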
> It might be much harder to hit, but we have so many users that at least
> a few will eventually.

This is true. There are additional steps that could be taken to make it
even harder to break down, but I'd like to get more data on what sort of
workloads break this strategy before I complicate things further.
>> Performance tests are within 0.1% for kbuild on a number of test machines.
>> aim9 is usually within 1%.
>
> 1% is a lot.

Well, yes, but two things. First, aim9 is a microbenchmark, and small
differences in aim9 seem to make very little difference to other
benchmarks like kbuild. On some arches, aim9 results vary widely between
subsequent runs, making it very sensitive. I used aim9 initially because
if it showed *large* regressions, something was usually up.

Second, I didn't say it was always a 1% regression, just that it is
generally within 1%. Here is the latest aim9 comparison on the x86_64:
  test           2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based     diff   %diff  units
1 creat-clo 150666.67 157083.33 6416.66 4.26% File Creations and Closes/second
2 page_test 186915.00 189065.16 2150.16 1.15% System Allocations & Pages/second
3 brk_test 1863739.38 1972521.25 108781.87 5.84% System Memory Allocations/second
4 jmp_test 16388101.98 16381716.67 -6385.31 -0.04% Non-local gotos/second
5 signal_test 464500.00 501649.73 37149.73 8.00% Signal Traps/second
6 exec_test 165.17 162.59 -2.58 -1.56% Program Loads/second
7 fork_test 4283.57 4365.21 81.64 1.91% Task Creations/second
8 link_test 50129.19 47658.31 -2470.88 -4.93% Link/Unlink Pairs/second
It's actually showing some performance improvements there according to
aim9.

Here are the aim9 results on a ppc64 LPAR:
  test           2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based     diff   %diff  units
1 creat-clo 134460.92 134816.67 355.75 0.26% File Creations and Closes/second
2 page_test 307473.33 304900.85 -2572.48 -0.84% System Allocations & Pages/second
3 brk_test 1547025.50 1565439.09 18413.59 1.19% System Memory Allocations/second
4 jmp_test 10353816.67 10211531.41 -142285.26 -1.37% Non-local gotos/second
5 signal_test 257007.17 257066.67 59.50 0.02% Signal Traps/second
6 exec_test 108.61 108.76 0.15 0.14% Program Loads/second
7 fork_test 3276.12 3289.45 13.33 0.41% Task Creations/second
8 link_test 47225.33 48289.50 1064.17 2.25% Link/Unlink Pairs/second
And here is the comparison on a NUMA-Q:
  test           2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based     diff   %diff  units
1 creat-clo 46660.00 48609.03 1949.03 4.18% File Creations and Closes/second
2 page_test 47555.81 47588.68 32.87 0.07% System Allocations & Pages/second
3 brk_test 247910.77 254179.15 6268.38 2.53% System Memory Allocations/second
4 jmp_test 2276287.29 2275924.69 -362.60 -0.02% Non-local gotos/second
5 signal_test 65561.48 64778.41 -783.07 -1.19% Signal Traps/second
6 exec_test 21.32 21.31 -0.01 -0.05% Program Loads/second
7 fork_test 880.79 906.36 25.57 2.90% Task Creations/second
8 link_test 19058.50 18726.81 -331.69 -1.74% Link/Unlink Pairs/second
These results tend to vary by a few percent between subsequent runs, so I
consider them very noisy, and I haven't done the legwork yet to get an
average over multiple runs. To give an idea of how mad the results can
be, this is an older set of results on an x86_64. Look at the brk_test
figures, for example: between 2.6.19-rc2-mm2-clean and
2.6.19-rc2-mm2-list-based there is apparently a difference of over 11%,
but it's unlikely to be reflected in "real" benchmarks.
  test           2.6.19-rc2-mm2-clean  2.6.19-rc2-mm2-list-based     diff   %diff  units
1 creat-clo 142759.54 170083.33 27323.79 19.14% File Creations and Closes/second
2 page_test 187305.90 179716.71 -7589.19 -4.05% System Allocations & Pages/second
3 brk_test 2139943.34 2377053.82 237110.48 11.08% System Memory Allocations/second
4 jmp_test 16387850.00 16380453.26 -7396.74 -0.05% Non-local gotos/second
5 signal_test 536933.33 495550.74 -41382.59 -7.71% Signal Traps/second
6 exec_test 166.17 162.39 -3.78 -2.27% Program Loads/second
7 fork_test 4201.23 4261.91 60.68 1.44% Task Creations/second
8 link_test 48980.64 58369.22 9388.58 19.17% Link/Unlink Pairs/second
Hence, I'd like to get a better idea of what sort of performance effect
other people see on the benchmarks they care about.
Here is the script I use to grab hugepages:
#!/bin/bash
# This benchmark checks how many hugepages can be allocated in the hugepage
# pool
P=hugepages_get-bench
SLEEP_INTERVAL=3
FAIL_AFTER_NO_CHANGE_ATTEMPTS=20

# Exit codes
EXIT_SUCCESS=0
EXIT_TERMINATE=1
# Args
while [ "$1" != "" ]; do
case "$1" in
-s) export SLEEP_INTERVAL=$2; shift 2;;
-f) export FAIL_AFTER_NO_CHANGE_ATTEMPTS=$2; shift 2;;
esac
done
# Check proc entry exists
if [ ! -e /proc/sys/vm/nr_hugepages ]; then
  echo Attempting load of hugetlbfs module
  modprobe hugetlbfs
  if [ ! -e /proc/sys/vm/nr_hugepages ]; then
    echo ERROR: /proc/sys/vm/nr_hugepages does not exist
    exit $EXIT_TERMINATE
  fi
fi
echo Allocating hugepages test
echo -------------------------
# Disable the OOM killer for the current test process
echo Disabling OOM Killer for current test process
echo -17 > /proc/self/oom_adj
# Record existing hugepage count
STARTING_COUNT=`cat /proc/sys/vm/nr_hugepages`
echo Starting page count: $STARTING_COUNT
# Ensure we have permission to write
echo $STARTING_COUNT > /proc/sys/vm/nr_hugepages || {
  echo ERROR: Do not have permission to adjust nr_hugepages count
  exit $EXIT_TERMINATE
}
# Start test
CURRENT_COUNT=$STARTING_COUNT
LAST_COUNT=$STARTING_COUNT
NOCHANGE_COUNT=0
ATTEMPT=0
# Keep asking for more hugepages until the pool size stops changing for
# FAIL_AFTER_NO_CHANGE_ATTEMPTS consecutive attempts
while [ $NOCHANGE_COUNT -ne $FAIL_AFTER_NO_CHANGE_ATTEMPTS ]; do
  ATTEMPT=$((ATTEMPT+1))

  # Request 100 more hugepages than are currently in the pool
  PAGES_COUNT=$(($CURRENT_COUNT+100))
  echo $PAGES_COUNT > /proc/sys/vm/nr_hugepages
  CURRENT_COUNT=`cat /proc/sys/vm/nr_hugepages`

  PROGRESS=
  if [ "$CURRENT_COUNT" = "$LAST_COUNT" ]; then
    NOCHANGE_COUNT=$(($NOCHANGE_COUNT+1))
  else
    NOCHANGE_COUNT=0
    PROGRESS="Progress made with $(($CURRENT_COUNT-$LAST_COUNT)) pages"
  fi

  echo Attempt $ATTEMPT: $CURRENT_COUNT pages $PROGRESS
  LAST_COUNT=$CURRENT_COUNT
  sleep $SLEEP_INTERVAL
done
echo Final page count: $CURRENT_COUNT
echo $STARTING_COUNT > /proc/sys/vm/nr_hugepages
exit $EXIT_SUCCESS
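
For reference, the script takes two optional switches (call it whatever
you saved it as), e.g.:

# poll every 10 seconds, give up after 30 attempts with no progress
./hugepages_get-bench.sh -s 10 -f 30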
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab