On Thu, 2 Nov 2006, Andi Kleen wrote:
> Mel Gorman <[email protected]> writes:
>
>> Our tests show that about 60-70% of physical memory can be allocated on
>> a desktop after a few days uptime. In benchmarks and stress tests, we are
>> finding that 80% of memory is available as contiguous blocks at the end of
>> the test. To compare, a standard kernel was getting < 1% of memory as large
>> pages on a desktop and about 8-12% of memory as large pages at the end of
>> stress tests.
>
> If you don't have a fixed limit on the unreclaimable memory you could
> still get into a situation where all memory is fragmented and unreclaimable,
> right?

Right, it's just considerably harder, so there will be adverse workloads
that break it (heavy IO on very large numbers of files under high load
with reiserfs is one). I don't have a list of real workloads that break
anti-frag yet, so I want to get anti-frag out there and see whether it
helps the people who really care about hugepages or not.
I've included a script below that tries to get as many hugepages as
possible via the proc interface. What I usually do is run it after a
series of stress tests, or sometimes on a desktop after a few days, to see
how it gets on in comparison to the standard allocator. A test I ran there
got 73% of memory as huge pages on a system with 19 days uptime. However,
the machine wasn't heavily stressed during that time and I had configured
min_free_kbytes to be 10% as suggested in the CONFIG help.
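
If you want to reproduce that tuning, something along these lines sets
min_free_kbytes to roughly 10% of physical memory (I'm assuming that is
what the 10% refers to; adjust the divisor to taste):

#!/bin/bash
# Set min_free_kbytes to about 10% of physical memory.
# MemTotal in /proc/meminfo is reported in kB.
TOTAL_KB=`awk '/^MemTotal:/ { print $2 }' /proc/meminfo`
echo $(($TOTAL_KB / 10)) > /proc/sys/vm/min_free_kbytes
cat /proc/sys/vm/min_free_kbytes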
Generally anti-frag gets you way more hugepages, but not necessarily the
whole system's worth. To get all free memory as huge pages, I'd need to be
moving memory around and that would be very invasive. It gets better
results with the linear-reclaim or lumpy-reclaim patches applied.
For people to get 100% expected results, they will still need to size the
hugepage pool at boot-time or set aside a zone of reclaimable pages at
boot time. This patch is aimed at relaxing the restriction on sizing the
pool up while the system is in use. For example, take a batch-scheduled
machine running HPC jobs: I want it to be able to get more or less
hugepages between jobs without requiring reboots. I'd like to hear from
people who try resizing the pool what sort of success they have and what
sort of workloads broke the strategy for them.
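
To make the boot-time and runtime sizing concrete, they look roughly like
this (the page counts are only illustrative):

# Boot-time: reserve the pool with the hugepages= kernel parameter,
# e.g. append "hugepages=512" to the boot command line.
#
# Runtime: resize the pool through proc once the system is up
echo 512 > /proc/sys/vm/nr_hugepages
grep HugePages /proc/meminfo
#
# Either way, applications get at the pages through a hugetlbfs mount
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge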
> It might be much harder to hit, but we have so many users that at least
> a few will eventually.

This is true. There are additional steps that could be taken to make it
even harder to break down, but I'd like to get more data on what sort of
workloads break this strategy before I complicate things further.
>> Performance tests are within 0.1% for kbuild on a number of test machines.
>> aim9 is usually within 1%.
>
> 1% is a lot.

Well, yes, but two things. First, aim9 is a microbenchmark, and small
differences in aim9 seem to make very little difference to other
benchmarks like kbuild. On some arches, aim9 results vary widely between
subsequent runs, making it very sensitive. I used aim9 initially because
if it showed *large* regressions, something was usually up.

Second, I didn't say it was always a 1% regression, just that it is
generally within 1%. Here is the latest aim9 comparison on the x86_64:
  test           2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based     diff   %diff  units
1 creat-clo 150666.67 157083.33 6416.66 4.26% File Creations and Closes/second
2 page_test 186915.00 189065.16 2150.16 1.15% System Allocations & Pages/second
3 brk_test 1863739.38 1972521.25 108781.87 5.84% System Memory Allocations/second
4 jmp_test 16388101.98 16381716.67 -6385.31 -0.04% Non-local gotos/second
5 signal_test 464500.00 501649.73 37149.73 8.00% Signal Traps/second
6 exec_test 165.17 162.59 -2.58 -1.56% Program Loads/second
7 fork_test 4283.57 4365.21 81.64 1.91% Task Creations/second
8 link_test 50129.19 47658.31 -2470.88 -4.93% Link/Unlink Pairs/second
It's actually showing some performance improvements there according to
aim9.

Here are the aim9 results on a ppc64 LPAR:
  test           2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based     diff   %diff  units
1 creat-clo 134460.92 134816.67 355.75 0.26% File Creations and Closes/second
2 page_test 307473.33 304900.85 -2572.48 -0.84% System Allocations & Pages/second
3 brk_test 1547025.50 1565439.09 18413.59 1.19% System Memory Allocations/second
4 jmp_test 10353816.67 10211531.41 -142285.26 -1.37% Non-local gotos/second
5 signal_test 257007.17 257066.67 59.50 0.02% Signal Traps/second
6 exec_test 108.61 108.76 0.15 0.14% Program Loads/second
7 fork_test 3276.12 3289.45 13.33 0.41% Task Creations/second
8 link_test 47225.33 48289.50 1064.17 2.25% Link/Unlink Pairs/second
And here is the comparison on a NUMA-Q:
  test           2.6.19-rc4-mm1-clean  2.6.19-rc4-mm1-list-based     diff   %diff  units
1 creat-clo 46660.00 48609.03 1949.03 4.18% File Creations and Closes/second
2 page_test 47555.81 47588.68 32.87 0.07% System Allocations & Pages/second
3 brk_test 247910.77 254179.15 6268.38 2.53% System Memory Allocations/second
4 jmp_test 2276287.29 2275924.69 -362.60 -0.02% Non-local gotos/second
5 signal_test 65561.48 64778.41 -783.07 -1.19% Signal Traps/second
6 exec_test 21.32 21.31 -0.01 -0.05% Program Loads/second
7 fork_test 880.79 906.36 25.57 2.90% Task Creations/second
8 link_test 19058.50 18726.81 -331.69 -1.74% Link/Unlink Pairs/second
These results tend to vary by a few percent between subsequent runs, so I
consider them very noisy, and I haven't done the legwork yet to get an
average over multiple runs. To give an idea of how mad the results can
be, this is an older set of results on an x86_64. Look at the brk_test
figures, for example: between 2.6.19-rc2-mm2-clean and
2.6.19-rc2-mm2-list-based there is apparently a difference of over 11%,
but it's unlikely to be reflected in "real" benchmarks.
  test           2.6.19-rc2-mm2-clean  2.6.19-rc2-mm2-list-based     diff   %diff  units
1 creat-clo 142759.54 170083.33 27323.79 19.14% File Creations and Closes/second
2 page_test 187305.90 179716.71 -7589.19 -4.05% System Allocations & Pages/second
3 brk_test 2139943.34 2377053.82 237110.48 11.08% System Memory Allocations/second
4 jmp_test 16387850.00 16380453.26 -7396.74 -0.05% Non-local gotos/second
5 signal_test 536933.33 495550.74 -41382.59 -7.71% Signal Traps/second
6 exec_test 166.17 162.39 -3.78 -2.27% Program Loads/second
7 fork_test 4201.23 4261.91 60.68 1.44% Task Creations/second
8 link_test 48980.64 58369.22 9388.58 19.17% Link/Unlink Pairs/second
Hence, I'd like to get a better idea of what sort of performance effect
other people see on the benchmarks they care about.
Here is the script I use to grab hugepages:
#!/bin/bash
# This benchmark checks how many hugepages can be allocated in the hugepage
# pool
P=hugepages_get-bench
SLEEP_INTERVAL=3
FAIL_AFTER_NO_CHANGE_ATTEMPTS=20

# Exit codes
EXIT_SUCCESS=0
EXIT_TERMINATE=1
# Args
while [ "$1" != "" ]; do
case "$1" in
-s) export SLEEP_INTERVAL=$2; shift 2;;
-f) export FAIL_AFTER_NO_CHANGE_ATTEMPTS=$2; shift 2;;
esac
done
# Check proc entry exists
if [ ! -e /proc/sys/vm/nr_hugepages ]; then
  echo Attempting load of hugetlbfs module
  modprobe hugetlbfs
  if [ ! -e /proc/sys/vm/nr_hugepages ]; then
    echo ERROR: /proc/sys/vm/nr_hugepages does not exist
    exit $EXIT_TERMINATE
  fi
fi
echo Allocating hugepages test
echo -------------------------
# Disable the OOM killer for the current test process
echo Disabling OOM Killer for current test process
echo -17 > /proc/self/oom_adj
# Record existing hugepage count
STARTING_COUNT=`cat /proc/sys/vm/nr_hugepages`
echo Starting page count: $STARTING_COUNT
# Ensure we have permission to write
echo $STARTING_COUNT > /proc/sys/vm/nr_hugepages || {
  echo ERROR: Do not have permission to adjust nr_hugepages count
  exit $EXIT_TERMINATE
}
# Start test
CURRENT_COUNT=$STARTING_COUNT
LAST_COUNT=$STARTING_COUNT
NOCHANGE_COUNT=0
ATTEMPT=0
# Keep asking for more hugepages until the pool size stops changing for
# FAIL_AFTER_NO_CHANGE_ATTEMPTS consecutive attempts
while [ $NOCHANGE_COUNT -ne $FAIL_AFTER_NO_CHANGE_ATTEMPTS ]; do
  ATTEMPT=$((ATTEMPT+1))

  # Request 100 more hugepages than are currently in the pool
  PAGES_COUNT=$(($CURRENT_COUNT+100))
  echo $PAGES_COUNT > /proc/sys/vm/nr_hugepages
  CURRENT_COUNT=`cat /proc/sys/vm/nr_hugepages`

  PROGRESS=
  if [ "$CURRENT_COUNT" = "$LAST_COUNT" ]; then
    NOCHANGE_COUNT=$(($NOCHANGE_COUNT+1))
  else
    NOCHANGE_COUNT=0
    PROGRESS="Progress made with $(($CURRENT_COUNT-$LAST_COUNT)) pages"
  fi

  echo Attempt $ATTEMPT: $CURRENT_COUNT pages $PROGRESS
  LAST_COUNT=$CURRENT_COUNT
  sleep $SLEEP_INTERVAL
done
echo Final page count: $CURRENT_COUNT
echo $STARTING_COUNT > /proc/sys/vm/nr_hugepages
exit $EXIT_SUCCESS
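
For reference, the script takes two optional switches (call it whatever
you saved it as), e.g.:

# poll every 10 seconds, give up after 30 attempts with no progress
./hugepages_get-bench.sh -s 10 -f 30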
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab