Re: [PATCH 0/11] Avoiding fragmentation with subzone groupings v26

On Thu, 2 Nov 2006, Andi Kleen wrote:

> Mel Gorman <[email protected]> writes:

> > Our tests show that about 60-70% of physical memory can be allocated on
> > a desktop after a few days uptime. In benchmarks and stress tests, we are
> > finding that 80% of memory is available as contiguous blocks at the end of
> > the test. To compare, a standard kernel was getting < 1% of memory as large
> > pages on a desktop and about 8-12% of memory as large pages at the end of
> > stress tests.

> If you don't have a fixed limit on the unreclaimable memory you could
> still get into a situation where all memory is fragmented and unreclaimable,
> right?


Right, it's just considerably harder, so there will still be adverse workloads that break it (heavy IO on very large numbers of files under high load with reiserfs is one). I don't have a list of real workloads that break anti-frag yet, so I want to get anti-frag out there and see whether it helps the people who really care about hugepages or not.

I've included a script below that tries to get as many hugepages as possible via the proc interface. What I usually do is run it after a series of stress tests, or sometimes on a desktop after a few days, to see how it gets on in comparison to the standard allocator. A test I ran there got 73% of memory as huge pages on a system with 19 days uptime. However, the machine wasn't heavily stressed during that time and I had configured min_free_kbytes to be 10% as suggested in the CONFIG help.
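For reference, the min_free_kbytes tweak can be scripted along these lines. This is only a sketch using the 10% figure from the CONFIG help; /proc/meminfo reports MemTotal in kB, so no unit conversion is needed:

#!/bin/bash
# Sketch: set min_free_kbytes to roughly 10% of total RAM, as the
# CONFIG help suggests. Adjust the fraction to taste.
TOTAL_KB=`awk '/^MemTotal:/ { print $2 }' /proc/meminfo`
echo $((TOTAL_KB / 10)) > /proc/sys/vm/min_free_kbytes
echo min_free_kbytes is now `cat /proc/sys/vm/min_free_kbytes`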

Generally anti-frag gets you way more hugepages, but not necessarily the whole system's worth. To get all free memory as huge pages, I'd need to be moving memory around and that would be very invasive. It gets better results with the linear-reclaim or lumpy-reclaim patches applied.

For people to get 100% of the results they expect, they will still need to size the hugepage pool at boot-time or set aside a zone of reclaimable pages at boot time. This patch set is aimed at relaxing the restriction on resizing the pool while the system is in use. For example, take a batch-scheduled machine running HPC jobs: I want it to be able to get more or fewer hugepages between jobs without requiring reboots. I'd like to hear from people who try resizing the pool what sort of success they have and what sort of workloads break the strategy for them.
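As a rough illustration of the kind of thing I mean (the job name and pool size below are made up), between batch jobs the scheduler could do something like:

#!/bin/bash
# Hypothetical example: grow the hugepage pool before a hugepage-hungry
# job and shrink it again afterwards, all without a reboot.
echo 512 > /proc/sys/vm/nr_hugepages	# ask for 512 huge pages
grep HugePages_Total /proc/meminfo	# see how many were actually allocated
run-hpc-job				# stand-in for the real batch job
echo 0 > /proc/sys/vm/nr_hugepages	# give the memory back afterwards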

> It might be much harder to hit, but we have so many users that at least
> a few will eventually.


This is true. There are additional steps that could be taken that would make it even harder to break down, but I'd like to get more data on what sorts of workloads break this strategy before I complicate things further.

> > Performance tests are within 0.1% for kbuild on a number of test machines. aim9
> > is usually within 1%

> 1% is a lot.


Well, yes, but two things. First, aim9 is a microbenchmark; small differences in aim9 seem to make very little difference to other benchmarks like kbuild. On some arches, aim9 results vary widely between subsequent runs, making it very sensitive. I used aim9 initially because if it showed *large* regressions, something was usually up.

Second, I didn't say it was always a 1% regression, just that it is generally within 1%. Here is the latest aim9 comparison on the x86_64

                 2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based
 1 creat-clo                150666.67                  157083.33    6416.66  4.26% File Creations and Closes/second
 2 page_test                186915.00                  189065.16    2150.16  1.15% System Allocations & Pages/second
 3 brk_test                1863739.38                 1972521.25  108781.87  5.84% System Memory Allocations/second
 4 jmp_test               16388101.98                16381716.67   -6385.31 -0.04% Non-local gotos/second
 5 signal_test              464500.00                  501649.73   37149.73  8.00% Signal Traps/second
 6 exec_test                   165.17                     162.59      -2.58 -1.56% Program Loads/second
 7 fork_test                  4283.57                    4365.21      81.64  1.91% Task Creations/second
 8 link_test                 50129.19                   47658.31   -2470.88 -4.93% Link/Unlink Pairs/second

It's actually showing some performance improvements there, according to aim9.

Here are the aim9 results on a ppc64 LPAR

                 2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based
 1 creat-clo                134460.92                  134816.67     355.75  0.26% File Creations and Closes/second
 2 page_test                307473.33                  304900.85   -2572.48 -0.84% System Allocations & Pages/second
 3 brk_test                1547025.50                 1565439.09   18413.59  1.19% System Memory Allocations/second
 4 jmp_test               10353816.67                10211531.41 -142285.26 -1.37% Non-local gotos/second
 5 signal_test              257007.17                  257066.67      59.50  0.02% Signal Traps/second
 6 exec_test                   108.61                     108.76       0.15  0.14% Program Loads/second
 7 fork_test                  3276.12                    3289.45      13.33  0.41% Task Creations/second
 8 link_test                 47225.33                   48289.50    1064.17  2.25% Link/Unlink Pairs/second

And here is the comparison on a numaq

                 2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based
 1 creat-clo                 46660.00                   48609.03    1949.03  4.18% File Creations and Closes/second
 2 page_test                 47555.81                   47588.68      32.87  0.07% System Allocations & Pages/second
 3 brk_test                 247910.77                  254179.15    6268.38  2.53% System Memory Allocations/second
 4 jmp_test                2276287.29                 2275924.69    -362.60 -0.02% Non-local gotos/second
 5 signal_test               65561.48                   64778.41    -783.07 -1.19% Signal Traps/second
 6 exec_test                    21.32                      21.31      -0.01 -0.05% Program Loads/second
 7 fork_test                   880.79                     906.36      25.57  2.90% Task Creations/second
 8 link_test                 19058.50                   18726.81    -331.69 -1.74% Link/Unlink Pairs/second

These results tend to vary by a few percent between runs, so I consider them very noisy, and I haven't done the legwork yet to get an average over multiple runs (a sketch of how that might look follows the table below). To give an idea of how mad the results can be, here is an older set of results on an x86_64. Look at the brk_test line: between 2.6.19-rc2-mm2-clean and 2.6.19-rc2-mm2-list-based there is apparently an 11% difference, but it's unlikely to be reflected in "real" benchmarks.

                 2.6.19-rc2-mm2-clean  2.6.19-rc2-mm2-list-based
 1 creat-clo                142759.54                  170083.33   27323.79 19.14% File Creations and Closes/second
 2 page_test                187305.90                  179716.71   -7589.19 -4.05% System Allocations & Pages/second
 3 brk_test                2139943.34                 2377053.82  237110.48 11.08% System Memory Allocations/second
 4 jmp_test               16387850.00                16380453.26   -7396.74 -0.05% Non-local gotos/second
 5 signal_test              536933.33                  495550.74  -41382.59 -7.71% Signal Traps/second
 6 exec_test                   166.17                     162.39      -3.78 -2.27% Program Loads/second
 7 fork_test                  4201.23                    4261.91      60.68  1.44% Task Creations/second
 8 link_test                 48980.64                   58369.22    9388.58 19.17% Link/Unlink Pairs/second
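When I do get around to averaging, it will probably be something as simple as this (only a sketch, assuming each run's comparison output is saved as run1.txt, run2.txt and so on in the same column layout as the tables above):

#!/bin/bash
# Sketch: average the clean-kernel page_test figure over several saved runs.
# On each page_test line the third field is the clean-kernel result.
cat run*.txt | grep page_test | \
	awk '{ sum += $3; n++ } END { printf "page_test mean: %.2f over %d runs\n", sum / n, n }'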

Hence, I'd like to get a better idea of what sort of performance effect other people see on the benchmarks they care about.

Here is the script I use to grab hugepages:

#!/bin/bash
# This benchmark checks how many hugepages can be allocated in the hugepage
# pool

P=hugepages_get-bench
SLEEP_INTERVAL=3
FAIL_AFTER_NO_CHANGE_ATTEMPTS=20

# Exit codes
EXIT_SUCCESS=0
EXIT_TERMINATE=1

# Args
while [ "$1" != "" ]; do
	case "$1" in
		-s)		export SLEEP_INTERVAL=$2; shift 2;;
		-f)		export FAIL_AFTER_NO_CHANGE_ATTEMPTS=$2; shift 2;;
	esac
done

# Check proc entry exists
if [ ! -e /proc/sys/vm/nr_hugepages ]; then
	echo Attempting load of hugetlbfs module
	modprobe hugetlbfs
	if [ ! -e /proc/sys/vm/nr_hugepages ]; then
		echo ERROR: /proc/sys/vm/nr_hugepages does not exist
		exit $EXIT_TERMINATE
	fi
fi

echo Allocating hugepages test
echo -------------------------

# Disable the OOM killer
echo Disabling OOM Killer for current test process
echo -17 > /proc/self/oom_adj

# Record existing hugepage count
STARTING_COUNT=`cat /proc/sys/vm/nr_hugepages`
echo Starting page count: $STARTING_COUNT

# Ensure we have permission to write
echo $STARTING_COUNT > /proc/sys/vm/nr_hugepages || {
	echo ERROR: Do not have permission to adjust nr_hugepages count
	exit $EXIT_TERMINATE
}

# Start test
CURRENT_COUNT=$STARTING_COUNT
LAST_COUNT=$STARTING_COUNT
NOCHANGE_COUNT=0
ATTEMPT=0

while [ $NOCHANGE_COUNT -ne $FAIL_AFTER_NO_CHANGE_ATTEMPTS ]; do
	ATTEMPT=$((ATTEMPT+1))
	PAGES_COUNT=$(($CURRENT_COUNT+100))
	echo $PAGES_COUNT > /proc/sys/vm/nr_hugepages

	CURRENT_COUNT=`cat /proc/sys/vm/nr_hugepages`
	PROGRESS=
	if [ "$CURRENT_COUNT" = "$LAST_COUNT" ]; then
		NOCHANGE_COUNT=$(($NOCHANGE_COUNT+1))
	else
		NOCHANGE_COUNT=0
		PROGRESS="Progress made with $(($CURRENT_COUNT-$LAST_COUNT)) pages"
	fi

	echo Attempt $ATTEMPT: $CURRENT_COUNT pages $PROGRESS
	LAST_COUNT=$CURRENT_COUNT
	sleep $SLEEP_INTERVAL
done

echo Final page count: $CURRENT_COUNT
echo $STARTING_COUNT > /proc/sys/vm/nr_hugepages
exit $EXIT_SUCCESS
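For anyone who wants to give it a go: saved as hugepages_get-bench (the name the P variable suggests), running it as

	./hugepages_get-bench -s 5 -f 30

sleeps 5 seconds between attempts and gives up after 30 consecutive attempts with no progress. The defaults are 3 seconds and 20 attempts.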


--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
