Adaptive read-ahead with look-ahead

Here is a more complete version of the adaptive read-ahead logic.

There are two main changes:

1. DELAYED ACTIVATION
activate_page() now simply sets a PG_activate flag; the actual activation
is deferred to shrink_list().
This change effectively eliminates the disturbance problem, because the
timing information carried by the chunks inside the inactive_list is
preserved. But I'm not sure whether the change has any negative side
effects.

2. LOOK AHEAD
The new logic now mimics the ahead-window behavior of the current logic
by setting a PG_readahead flag on selected pages as look-ahead reminders.
The added overhead lies in finding out where to start the next readahead:
typically two or more radix tree lookups per readahead I/O, plus one
lru_lock acquisition during the window grow-up phase. (A condensed sketch
of both mechanisms follows this list.)
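
For illustration, here is a condensed sketch of how the two flags are meant
to interact (not the real code -- the actual hunks are in the patch at the
end of this mail; ratio checks, locking and error handling are omitted):
--------------------------------------------------------------------------------
/* 1. Delayed activation: activate_page() only flags the page ... */
void fastcall activate_page(struct page *page)
{
	SetPageActivate(page);		/* real move to active_list deferred */
}

/* ... and shrink_list() performs the actual activation later: */
	if (TestClearPageActivate(page))
		goto activate_locked;

/* 2. Look-ahead: __do_page_cache_readahead() tags one page per chunk ... */
	if (page_idx == nr_to_read - lookahead_size)
		SetPageReadahead(page);

/* ... which do_generic_mapping_read() later takes as the cue to submit
 * the next readahead before the current chunk is exhausted: */
	if (PageReadahead(page))
		page_cache_readahead_adaptive(mapping, &ra, filp,
				prev_page, page,
				*ppos >> PAGE_CACHE_SHIFT,
				index, last_index);
--------------------------------------------------------------------------------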

Now the two logics behave roughly the same, which makes the benchmarks
more meaningful and comparable. A rerun of the tests shows that:
1. Reported system time is still higher for the new logic.
2. The overall sample counts reported by oprofile are not clearly
distinguishable: sometimes one side counts more, sometimes the other.
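
For reference, the readahead_ratio values in the tests below are switched
at run time through the new sysctl added by the patch (presumably that is
what test_ra.sh does):

	# echo 50 > /proc/sys/vm/readahead_ratio

A value of 9 or less keeps the stock logic; anything larger enables the
adaptive logic, as the readahead_ratio > 9 tests in the patch show.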

TIME OUTPUTS
=====================================================================
# ./test_ra.sh 4k 0 50 100
readahead_ratio = 0
79.88s real  9.09s system  0.47s user  13651+130 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
146.75s real  1.33s system  21.44s user  13459+264 cs  grep -r 'asdfghjkl;' linux-2.6.13
181.19s real  3.80s system  1.58s user  32744+83 cs  diff -r linux-2.6.13 linux-2.6.13ra > /dev/null
181.48s real  9.64s system  0.51s user  25141+181 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
readahead_ratio = 50
79.47s real  9.32s system  0.52s user  13928+107 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
141.93s real  1.46s system  21.78s user  13565+327 cs  grep -r 'asdfghjkl;' linux-2.6.13
178.20s real  4.04s system  1.56s user  32144+83 cs  diff -r linux-2.6.13 linux-2.6.13ra > /dev/null
181.87s real  9.83s system  0.52s user  24836+150 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
readahead_ratio = 100
79.92s real  9.21s system  0.48s user  13668+134 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
144.82s real  1.47s system  21.61s user  13405+264 cs  grep -r 'asdfghjkl;' linux-2.6.13
179.62s real  3.90s system  1.48s user  32336+67 cs  diff -r linux-2.6.13 linux-2.6.13ra > /dev/null
182.00s real  9.91s system  0.50s user  25723+145 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null

# ./test_ra.sh 32k 0 50 100
readahead_ratio = 0
79.80s real  8.07s system  0.09s user  14250+96 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
146.20s real  1.44s system  24.17s user  12954+299 cs  grep -r 'asdfghjkl;' linux-2.6.13
181.49s real  3.83s system  1.57s user  32098+72 cs  diff -r linux-2.6.13 linux-2.6.13ra > /dev/null
182.00s real  8.63s system  0.08s user  25673+181 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
readahead_ratio = 50
79.74s real  8.24s system  0.08s user  13904+127 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
145.22s real  1.44s system  24.87s user  13182+303 cs  grep -r 'asdfghjkl;' linux-2.6.13
180.60s real  4.12s system  1.47s user  32096+53 cs  diff -r linux-2.6.13 linux-2.6.13ra > /dev/null
181.78s real  8.75s system  0.12s user  25843+188 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
readahead_ratio = 100
80.08s real  8.38s system  0.07s user  13806+131 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
147.72s real  1.74s system  23.23s user  12986+299 cs  grep -r 'asdfghjkl;' linux-2.6.13
179.43s real  9.21s system  0.10s user  26010+142 cs  dd if=$FILE of=/dev/null bs=$bs 2> /dev/null
180.20s real  4.10s system  1.41s user  32616+72 cs  diff -r linux-2.6.13 linux-2.6.13ra > /dev/null


OPROFILE RESULTS
=====================================================================

  oprofile.0.4k                                |  oprofile.100.4k
  ---------------------------------------------+------------------------------------------------
  samples  %        symbol name                |  samples  %        symbol name
  35783    14.3839  __copy_to_user_ll          |  34962    13.9808  __copy_to_user_ll
  15432     6.2033  ll_rw_block                |  15608     6.2414  ll_rw_block
  12055     4.8458  shrink_list                |  12115     4.8446  shrink_list
  10373     4.1697  system_call                |  10746     4.2972  system_call
  10037     4.0346  __find_get_block           |  10199     4.0784  __find_get_block
  7661      3.0795  delay_pmtmr                |  7716      3.0855  delay_pmtmr
  6671      2.6816  add_to_page_cache          |  6853      2.7404  add_to_page_cache
  5981      2.4042  radix_tree_delete          |  6155      2.4613  do_generic_mapping_read
  5433      2.1839  isolate_lru_pages          |  5796      2.3177  radix_tree_delete
  5346      2.1490  __brelse                   |  5362      2.1442  __brelse
  5268      2.1176  dnotify_parent             |  5298      2.1186  isolate_lru_pages
  4986      2.0042  unlock_buffer              |  5252      2.1002  dnotify_parent
  4884      1.9632  inotify_dentry_parent_queue|  5085      2.0334  inotify_dentry_parent_queue_event
  4779      1.9210  do_generic_mapping_read    |  4942      1.9762  unlock_buffer
  4509      1.8125  find_get_page              |  4869      1.9470  find_get_page
  4206      1.6907  do_mpage_readpage          |  4237      1.6943  do_mpage_readpage
  4074      1.6376  default_idle               |  4125      1.6495  default_idle
  4007      1.6107  __do_page_cache_readahead  |  3986      1.5939  __do_page_cache_readahead
  3242      1.3032  kmap_atomic                |  3159      1.2632  kmap_atomic
  2847      1.1444  __read_lock_failed         |  2923      1.1689  __read_lock_failed
  2748      1.1046  free_pages_bulk            |  2772      1.1085  free_pages_bulk
  2713      1.0906  unlock_page                |  2709      1.0833  unlock_page
  2582      1.0379  __wake_up_bit              |  2641      1.0561  mpage_end_io_read
  2534      1.0186  mpage_end_io_read          |  2502      1.0005  __wake_up_bit
  2468      0.9921  __pagevec_lru_add          |  2463      0.9849  __pagevec_lru_add
  2199      0.8839  __mod_page_state           |  2223      0.8889  __mod_page_state
  2142      0.8610  __alloc_pages              |  1964      0.7854  __alloc_pages
  1935      0.7778  bad_range                  |  1907      0.7626  bad_range
  1775      0.7135  mark_page_accessed         |  1771      0.7082  mark_page_accessed
  1765      0.7095  __write_lock_failed        |  1748      0.6990  __rmqueue
  1705      0.6854  find_busiest_group         |  1682      0.6726  zone_watermark_ok
  1634      0.6568  zone_watermark_ok          |  1677      0.6706  find_busiest_group
  1618      0.6504  buffered_rmqueue           |  1619      0.6474  buffered_rmqueue
  1617      0.6500  radix_tree_lookup          |  1599      0.6394  __write_lock_failed
  1599      0.6428  __rmqueue                  |  1556      0.6222  mark_offset_pmtmr
  1587      0.6379  mark_offset_pmtmr          |  1524      0.6094  free_hot_cold_page
  1580      0.6351  free_hot_cold_page         |  1503      0.6010  radix_tree_lookup
  1383      0.5559  schedule                   |  1399      0.5594  schedule
  1314      0.5282  page_referenced            |  1363      0.5450  page_referenced


  oprofile.0.32k                               |  oprofile.50.32k
  ---------------------------------------------+------------------------------------------------
  samples  %        symbol name                |  samples  %        symbol name
  36914    16.3667  __copy_to_user_ll          |  36787    16.2290  __copy_to_user_ll
  15604     6.9184  ll_rw_block                |  15507     6.8411  ll_rw_block
  11881     5.2677  shrink_list                |  11975     5.2829  shrink_list
  9893      4.3863  __find_get_block           |  10370     4.5749  __find_get_block
  7865      3.4871  delay_pmtmr                |  8043      3.5483  delay_pmtmr
  6684      2.9635  add_to_page_cache          |  6740      2.9734  add_to_page_cache
  6225      2.7600  isolate_lru_pages          |  6162      2.7184  isolate_lru_pages
  5634      2.4980  find_get_page              |  5852      2.5817  find_get_page
  5603      2.4842  radix_tree_delete          |  5718      2.5226  radix_tree_delete
  5400      2.3942  __brelse                   |  5193      2.2910  __brelse
  4958      2.1982  unlock_buffer              |  4974      2.1943  unlock_buffer
  4268      1.8923  default_idle               |  4159      1.8348  do_mpage_readpage
  4241      1.8803  do_mpage_readpage          |  4103      1.8101  mpage_end_io_read
  4113      1.8236  mpage_end_io_read          |  4084      1.8017  default_idle
  3944      1.7487  __do_page_cache_readahead  |  4008      1.7682  __do_page_cache_readahead
  3141      1.3926  kmap_atomic                |  3515      1.5507  do_generic_mapping_read
  2758      1.2228  unlock_page                |  3031      1.3372  kmap_atomic
  2740      1.2148  __read_lock_failed         |  2916      1.2864  __read_lock_failed
  2692      1.1936  __wake_up_bit              |  2749      1.2128  unlock_page
  2631      1.1665  __rmqueue                  |  2654      1.1708  __rmqueue
  2391      1.0601  free_pages_bulk            |  2541      1.1210  __wake_up_bit
  2378      1.0543  __pagevec_lru_add          |  2520      1.1117  __pagevec_lru_add
  2310      1.0242  do_generic_mapping_read    |  2424      1.0694  free_pages_bulk
  2206      0.9781  __mod_page_state           |  2279      1.0054  __mod_page_state
  2046      0.9071  __alloc_pages              |  1972      0.8700  __alloc_pages
  1730      0.7670  mark_page_accessed         |  1806      0.7967  mark_page_accessed
  1639      0.7267  buffered_rmqueue           |  1726      0.7614  zone_watermark_ok
  1629      0.7223  zone_watermark_ok          |  1695      0.7478  radix_tree_lookup
  1615      0.7160  find_busiest_group         |  1689      0.7451  find_busiest_group
  1570      0.6961  radix_tree_lookup          |  1562      0.6891  mark_offset_pmtmr
  1499      0.6646  mark_offset_pmtmr          |  1514      0.6679  buffered_rmqueue
  1448      0.6420  free_hot_cold_page         |  1508      0.6653  free_hot_cold_page
  1438      0.6376  schedule                   |  1488      0.6564  schedule
  1437      0.6371  bad_range                  |  1434      0.6326  system_call
  1431      0.6345  __write_lock_failed        |  1412      0.6229  bad_range
  1426      0.6322  release_pages              |  1394      0.6150  __write_lock_failed
  1392      0.6172  system_call                |  1366      0.6026  release_pages
  1331      0.5901  page_waitqueue             |  1352      0.5965  page_referenced
  1307      0.5795  page_referenced            |  1279      0.5642  page_waitqueue

 oprofile.0.8k vs oprofile.50.8k with accumulated samples.
--------------------------------------------------------------------------------------------------------------------
  samples  cum. samples  %        cum. %     symbol name    |  samples  cum. samples  %        cum. %     symbol name
  38565    38565         14.4471  14.4471    __copy_to_user_|  36673    36673         15.0370  15.0370    __copy_to_user_
  16341    54906          6.1216  20.5687    ll_rw_block    |  15770    52443          6.4662  21.5032    ll_rw_block
  12407    67313          4.6479  25.2166    shrink_list    |  11999    64442          4.9199  26.4231    shrink_list
  10404    77717          3.8975  29.1141    __find_get_bloc|  11027    75469          4.5214  30.9445    __find_get_bloc
  8157     85874          3.0558  32.1699    delay_pmtmr    |  8003     83472          3.2815  34.2260    delay_pmtmr
  6961     92835          2.6077  34.7776    add_to_page_cac|  6836     90308          2.8030  37.0289    add_to_page_cac
  6546     99381          2.4522  37.2299    system_call    |  6362     96670          2.6086  39.6375    isolate_lru_pag
  6432     105813         2.4095  39.6394    isolate_lru_pag|  5882     102552         2.4118  42.0493    find_get_page
  6098     111911         2.2844  41.9238    find_get_page  |  5808     108360         2.3815  44.4308    radix_tree_dele
  5948     117859         2.2282  44.1520    radix_tree_dele|  5567     113927         2.2826  46.7134    system_call
  5743     123602         2.1514  46.3035    __brelse       |  5311     119238         2.1777  48.8911    __brelse
  5247     128849         1.9656  48.2691    unlock_buffer  |  5082     124320         2.0838  50.9748    unlock_buffer
  4505     133354         1.6877  49.9567    do_mpage_readpa|  4858     129178         1.9919  52.9668    do_generic_mapp
  4501     137855         1.6862  51.6429    mpage_end_io_re|  4252     133430         1.7434  54.7102    do_mpage_readpa
  4341     142196         1.6262  53.2691    __do_page_cache|  4189     137619         1.7176  56.4278    __do_page_cache
  4315     146511         1.6165  54.8856    default_idle   |  4159     141778         1.7053  58.1331    default_idle
  3655     150166         1.3692  56.2548    do_generic_mapp|  4127     145905         1.6922  59.8253    mpage_end_io_re
  3146     153312         1.1785  57.4333    kmap_atomic    |  3152     149057         1.2924  61.1177    kmap_atomic
  2942     156254         1.1021  58.5355    __read_lock_fai|  2947     152004         1.2084  62.3261    __read_lock_fai
  2918     159172         1.0931  59.6286    unlock_page    |  2787     154791         1.1428  63.4688    __rmqueue
  2900     162072         1.0864  60.7150    __rmqueue      |  2782     157573         1.1407  64.6095    unlock_page
  2839     164911         1.0635  61.7785    dnotify_parent |  2778     160351         1.1391  65.7486    dnotify_parent
  2814     167725         1.0542  62.8327    __wake_up_bit  |  2663     163014         1.0919  66.8405    __wake_up_bit
  2759     170484         1.0336  63.8663    inotify_dentry_|  2556     165570         1.0480  67.8886    inotify_dentry_
  2533     173017         0.9489  64.8152    free_pages_bulk|  2455     168025         1.0066  68.8952    free_pages_bulk
  2524     175541         0.9455  65.7607    __pagevec_lru_a|  2414     170439         0.9898  69.8850    __pagevec_lru_a
  2344     177885         0.8781  66.6388    __mod_page_stat|  2146     172585         0.8799  70.7649    __mod_page_stat
  2319     180204         0.8687  67.5076    __d_lookup     |  2060     174645         0.8447  71.6096    __alloc_pages
  2183     182387         0.8178  68.3253    __alloc_pages  |  1839     176484         0.7540  72.3636    mark_page_acces
  1952     184339         0.7313  69.0566    radix_tree_look|  1731     178215         0.7098  73.0734    radix_tree_look
  1918     186257         0.7185  69.7751    mark_page_acces|  1707     179922         0.6999  73.7733    zone_watermark_
  1886     188143         0.7065  70.4816    zone_watermark_|  1680     181602         0.6888  74.4621    find_busiest_gr
  1880     190023         0.7043  71.1859    schedule       |  1598     183200         0.6552  75.1174    __write_lock_fa
  1861     191884         0.6972  71.8831    find_busiest_gr|  1572     184772         0.6446  75.7619    buffered_rmqueu
  1837     193721         0.6882  72.5713    buffered_rmqueu|  1543     186315         0.6327  76.3946    mark_offset_pmt
  1644     195365         0.6159  73.1871    mark_offset_pmt|  1500     187815         0.6150  77.0097    free_hot_cold_p
  1582     196947         0.5926  73.7798    free_hot_cold_p|  1494     189309         0.6126  77.6222    schedule
  1481     198428         0.5548  74.3346    bad_range      |  1482     190791         0.6077  78.2299    radix_tree_inse
  1480     199908         0.5544  74.8890    release_pages  |  1376     192167         0.5642  78.7941    bad_range
......
  2        266707        7.5e-04  99.9131    generic_file_mm|  1        243885        4.1e-04  100.000    vma_link

Here the new logic gets fewer samples!


ABOUT DEBUGGING
=====================================================================
But there is a significant rise in do_generic_mapping_read().
The problem was tracked down to TestClearPageReadahead(), and was solved
by changing it to PageReadahead() and moving the test-and-clear into
page_cache_readahead_adaptive().
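
The resulting check in the do_generic_mapping_read() hot loop is now a
plain flag test, roughly like this (sketch of the relevant lines; the full
hunk is in the patch below):
--------------------------------------------------------------------------------
		} else if (PageReadahead(page)) {
			/* cheap, unlocked test; the atomic test-and-clear
			 * now happens inside the readahead path, once per
			 * readahead call rather than once per cached page */
			page_cache_readahead_adaptive(mapping, &ra,
					filp, prev_page, page,
					*ppos >> PAGE_CACHE_SHIFT,
					index, last_index);
		}
--------------------------------------------------------------------------------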

I used the following commands to check things out; are there other ways
to do this?

# opreport -cl linux-2.6.13ra/vmlinux
# opannotate --assembly linux-2.6.13ra/vmlinux
# make CONFIG_DEBUG_INFO=1 mm/filemap.o
# objdump -S mm/filemap.o > mm/filemap.asm 

First I found that the current logic has one major hot spot:
    vma      samples  cum. samples  %        cum. %
    c013c84f 1479     3969          31.7586  85.2265

While the new one has two:
    c013c84f 1587     5769          24.6047  89.4419
    c013c6ce 1169     3248          18.1240  50.3566

The c013c84f address turns out to be:
  1520  0.5394 :c013c84f:       sets   %al
or:
--------------------------------------------------------------------------------
static inline void put_page(struct page *page)
{       
        if (!PageReserved(page) && put_page_testzero(page))
     ee1:       8b 43 04                mov    0x4(%ebx),%eax
     ee4:       40                      inc    %eax
     ee5:       74 3f                   je     f26 <do_generic_mapping_read+0x316>
static __inline__ int atomic_add_negative(int i, atomic_t *v)
{       
        unsigned char c;

        __asm__ __volatile__(
     ee7:       8b 75 9c                mov    0xffffff9c(%ebp),%esi
     eea:       f0 83 46 04 ff          lock addl $0xffffffff,0x4(%esi)
     eef:       0f 98 c0                sets   %al
     ef2:       84 c0                   test   %al,%al
     ef4:       75 27                   jne    f1d <do_generic_mapping_read+0x30d>
                page_cache_release(prev_page);
--------------------------------------------------------------------------------
And the c013c6ce address is:
               :c013c6ce:       sbb    %eax,%eax
or: 
--------------------------------------------------------------------------------
static inline int test_and_clear_bit(int nr, volatile unsigned long * addr)
{       
        int oldbit;

        __asm__ __volatile__( LOCK_PREFIX
     d69:       f0 0f ba 37 15          lock btrl $0x15,(%edi)
     d6e:       19 c0                   sbb    %eax,%eax
     d70:       85 c0                   test   %eax,%eax
     d72:       0f 85 08 04 00 00       jne    1180 <do_generic_mapping_read+0x570>
                                page_cache_readahead_adaptive(mapping, &ra,
                                                filp, prev_page, NULL,
                                                *ppos >> PAGE_CACHE_SHIFT,
                                                index, last_index);
                                page = find_get_page(mapping, index);
                        } else if (TestClearPageReadahead(page)) {
                                page_cache_readahead_adaptive(mapping, &ra,
                                                filp, prev_page, page,
                                                *ppos >> PAGE_CACHE_SHIFT,
                                                index, last_index);
                        }
                }
--------------------------------------------------------------------------------
Then I knew that TestClearPageReadahead() can be a lot more expensive
than PageReadahead()!
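
The page-flags macros added by the patch make the gap plausible; roughly,
on i386/SMP:
--------------------------------------------------------------------------------
/* a plain, unlocked test of page->flags */
#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)

/* a locked read-modify-write -- the "lock btrl" in the disassembly above --
 * which must own the cache line exclusively even when the bit is clear */
#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
--------------------------------------------------------------------------------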



diff -rup linux-2.6.13/include/linux/mm.h linux-2.6.13ra/include/linux/mm.h
--- linux-2.6.13/include/linux/mm.h	2005-08-29 07:41:01.000000000 +0800
+++ linux-2.6.13ra/include/linux/mm.h	2005-09-14 22:15:09.000000000 +0800
@@ -875,7 +875,7 @@ extern int filemap_populate(struct vm_ar
 int write_one_page(struct page *page, int wait);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD	128	/* kbytes */
+#define VM_MAX_READAHEAD	1024	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
 #define VM_MAX_CACHE_HIT    	256	/* max pages in a row in cache before
 					 * turning readahead off */
@@ -889,6 +889,12 @@ unsigned long  page_cache_readahead(stru
 			  struct file *filp,
 			  unsigned long offset,
 			  unsigned long size);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			unsigned long first_index,
+			unsigned long index, unsigned long last_index);
 void handle_ra_miss(struct address_space *mapping, 
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
diff -rup linux-2.6.13/include/linux/page-flags.h linux-2.6.13ra/include/linux/page-flags.h
--- linux-2.6.13/include/linux/page-flags.h	2005-08-29 07:41:01.000000000 +0800
+++ linux-2.6.13ra/include/linux/page-flags.h	2005-09-15 09:39:49.000000000 +0800
@@ -75,6 +75,8 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
+#define PG_activate		20	/* delayed activate */
+#define PG_readahead		21	/* check readahead when reading this page */
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -305,6 +307,18 @@ extern void __mod_page_state(unsigned lo
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageActivate(page)	test_bit(PG_activate, &(page)->flags)
+#define SetPageActivate(page)	set_bit(PG_activate, &(page)->flags)
+#define ClearPageActivate(page)	clear_bit(PG_activate, &(page)->flags)
+#define TestClearPageActivate(page) test_and_clear_bit(PG_activate, &(page)->flags)
+#define TestSetPageActivate(page) test_and_set_bit(PG_activate, &(page)->flags)
+
+#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)
+#define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
+#define ClearPageReadahead(page) clear_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+#define TestSetPageReadahead(page) test_and_set_bit(PG_readahead, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
diff -rup linux-2.6.13/include/linux/sysctl.h linux-2.6.13ra/include/linux/sysctl.h
--- linux-2.6.13/include/linux/sysctl.h	2005-08-29 07:41:01.000000000 +0800
+++ linux-2.6.13ra/include/linux/sysctl.h	2005-09-13 17:04:42.000000000 +0800
@@ -180,6 +180,7 @@ enum
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_READAHEAD_RATIO=29, /* ratio of readahead request size to backward-looking size */
 };
 
 
diff -rup linux-2.6.13/kernel/sysctl.c linux-2.6.13ra/kernel/sysctl.c
--- linux-2.6.13/kernel/sysctl.c	2005-08-29 07:41:01.000000000 +0800
+++ linux-2.6.13ra/kernel/sysctl.c	2005-09-13 17:04:42.000000000 +0800
@@ -66,6 +66,7 @@ extern int min_free_kbytes;
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
 extern int pid_max_min, pid_max_max;
+extern int readahead_ratio;
 
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
 int unknown_nmi_panic;
@@ -851,6 +852,16 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_READAHEAD_RATIO,
+		.procname	= "readahead_ratio",
+		.data		= &readahead_ratio,
+		.maxlen		= sizeof(readahead_ratio),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
 	{ .ctl_name = 0 }
 };
 
diff -rup linux-2.6.13/mm/filemap.c linux-2.6.13ra/mm/filemap.c
--- linux-2.6.13/mm/filemap.c	2005-08-29 07:41:01.000000000 +0800
+++ linux-2.6.13ra/mm/filemap.c	2005-09-15 15:51:53.000000000 +0800
@@ -699,6 +699,8 @@ grab_cache_page_nowait(struct address_sp
 
 EXPORT_SYMBOL(grab_cache_page_nowait);
 
+extern int readahead_ratio;
+
 /*
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level
@@ -726,10 +728,12 @@ void do_generic_mapping_read(struct addr
 	unsigned long prev_index;
 	loff_t isize;
 	struct page *cached_page;
+	struct page *prev_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
+	prev_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -758,16 +762,35 @@ void do_generic_mapping_read(struct addr
 		nr = nr - offset;
 
 		cond_resched();
-		if (index == next_index)
+		
+		if (readahead_ratio <= 9 && index == next_index)
 			next_index = page_cache_readahead(mapping, &ra, filp,
 					index, last_index - index);
 
 find_page:
 		page = find_get_page(mapping, index);
+		if (readahead_ratio > 9) {
+			if (unlikely(page == NULL)) {
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, NULL,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+				page = find_get_page(mapping, index);
+			} else if (PageReadahead(page)) {
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, page,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+			}
+		}
 		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, &ra, index);
+			if (readahead_ratio <= 9)
+				handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -802,7 +825,6 @@ page_ok:
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 
-		page_cache_release(page);
 		if (ret == nr && desc->count)
 			continue;
 		goto out;
@@ -814,7 +836,6 @@ page_not_up_to_date:
 		/* Did it get unhashed before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);
-			page_cache_release(page);
 			continue;
 		}
 
@@ -839,7 +860,6 @@ readpage:
 					 * invalidate_inode_pages got it
 					 */
 					unlock_page(page);
-					page_cache_release(page);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -860,7 +880,6 @@ readpage:
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			page_cache_release(page);
 			goto out;
 		}
 
@@ -869,7 +888,6 @@ readpage:
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
 			if (nr <= offset) {
-				page_cache_release(page);
 				goto out;
 			}
 		}
@@ -879,7 +897,6 @@ readpage:
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
 		desc->error = error;
-		page_cache_release(page);
 		goto out;
 
 no_cached_page:
@@ -904,6 +921,9 @@ no_cached_page:
 		}
 		page = cached_page;
 		cached_page = NULL;
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
 		goto readpage;
 	}
 
@@ -913,6 +933,8 @@ out:
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
 		page_cache_release(cached_page);
+	if (prev_page)
+		page_cache_release(prev_page);
 	if (filp)
 		file_accessed(filp);
 }
@@ -1210,19 +1232,33 @@ retry_all:
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
+	if (readahead_ratio <= 9 && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
+
 	/*
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
 	page = find_get_page(mapping, pgoff);
+	if (VM_SequentialReadHint(area) && readahead_ratio > 9) {
+		if (!page) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, NULL,
+						pgoff, pgoff, pgoff + 1);
+			page = find_get_page(mapping, pgoff);
+		} else if (PageReadahead(page)) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, page,
+						pgoff, pgoff, pgoff + 1);
+		}
+	}
 	if (!page) {
 		unsigned long ra_pages;
 
 		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+			if (readahead_ratio <= 9)
+				handle_ra_miss(mapping, ra, pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
diff -rup linux-2.6.13/mm/readahead.c linux-2.6.13ra/mm/readahead.c
--- linux-2.6.13/mm/readahead.c	2005-08-29 07:41:01.000000000 +0800
+++ linux-2.6.13ra/mm/readahead.c	2005-09-15 15:56:06.000000000 +0800
@@ -15,6 +15,9 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 
+int readahead_ratio = 0;
+EXPORT_SYMBOL(readahead_ratio);
+
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
 }
@@ -254,20 +257,24 @@ out:
  */
 static int
 __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			unsigned long offset, unsigned long nr_to_read)
+			unsigned long offset, unsigned long nr_to_read,
+			unsigned long lookahead_size)
 {
 	struct inode *inode = mapping->host;
-	struct page *page;
+	struct page *page = NULL;
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
 	int ret = 0;
 	loff_t isize = i_size_read(inode);
 
+	if (page && !TestClearPageReadahead(page))
+		return 0;
+
 	if (isize == 0)
 		goto out;
 
- 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
 	/*
 	 * Preallocate as many pages as we will need.
@@ -280,8 +287,12 @@ __do_page_cache_readahead(struct address
 			break;
 
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
-		if (page)
+		if (page) {
+			if (readahead_ratio > 9 &&
+					page_idx == nr_to_read - lookahead_size)
+				SetPageReadahead(page);
 			continue;
+		}
 
 		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
@@ -289,9 +300,16 @@ __do_page_cache_readahead(struct address
 		if (!page)
 			break;
 		page->index = page_offset;
+		if (readahead_ratio > 9 &&
+				page_idx == nr_to_read - lookahead_size)
+			SetPageReadahead(page);
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
+	if (page && readahead_ratio > 9 && (readahead_ratio % 3) == 0)
+		get_page(page);
+	else
+		page = NULL;
 	read_unlock_irq(&mapping->tree_lock);
 
 	/*
@@ -301,6 +319,11 @@ __do_page_cache_readahead(struct address
 	 */
 	if (ret)
 		read_pages(mapping, filp, &page_pool, ret);
+	if (page) {
+		lock_page(page);
+		__put_page(page);
+		unlock_page(page);
+	}
 	BUG_ON(!list_empty(&page_pool));
 out:
 	return ret;
@@ -326,7 +349,7 @@ int force_page_cache_readahead(struct ad
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
 		err = __do_page_cache_readahead(mapping, filp,
-						offset, this_chunk);
+						offset, this_chunk, 0);
 		if (err < 0) {
 			ret = err;
 			break;
@@ -373,7 +396,7 @@ int do_page_cache_readahead(struct addre
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 }
 
 /*
@@ -393,9 +416,15 @@ blockable_page_cache_readahead(struct ad
 	if (!block && bdi_read_congested(mapping->backing_dev_info))
 		return 0;
 
-	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+	if (readahead_ratio <= 9 && (readahead_ratio & 1)) {
+		printk(KERN_DEBUG
+			"blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+			mapping->host->i_ino, offset, nr_to_read, actual);
+	}
 
-	return check_ra_success(ra, nr_to_read, actual);
+	return (readahead_ratio > 9) ? actual : check_ra_success(ra, nr_to_read, actual);
 }
 
 static int make_ahead_window(struct address_space *mapping, struct file *filp,
@@ -555,3 +584,349 @@ unsigned long max_sane_readahead(unsigne
 	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
+
+
+/* STATUS	VALUE		TYPE
+ *  ___ 	0		not in inactive list
+ *  L__ 	1		fresh
+ *  L_R 	2		stale
+ *  LA_ 	4		disturbed once
+ *  LAR 	8		disturbed twice
+ */
+static inline int get_sequential_type(struct page *page)
+{
+	if (page && PageLRU(page) && !PageActive(page)) {
+		if (!PageActivate(page))
+			return 1 + PageReferenced(page);
+		else
+			return 4 + 4*PageReferenced(page);
+	}
+	else
+		return 0;
+}
+
+/*
+ * Look back to estimate safe readahead size.
+ * It will normally be around min(nr_lookback, offset), unless either memory or
+ * read speed is low.  A rigid implementation would be a simple loop to scan
+ * page by page backward, though this may be unnecessary and inefficient, so
+ * the stepping forward scheme is used.
+ */
+static int count_sequential_pages(struct address_space *mapping,
+			unsigned long offset, unsigned long nr_lookback,
+			int sequential_type)
+{
+	int step;
+	int count;
+	unsigned long index;
+	struct page *page;
+
+	index = offset - nr_lookback;
+	if (unlikely(index > offset))
+		index = 0;
+
+	read_lock_irq(&mapping->tree_lock);
+	for(step = (offset - index + 3) / 4, count = 0;
+			index < offset;
+			index += step) {
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (get_sequential_type(page) >= sequential_type) {
+			if (++count >= 3)
+				break;
+		} else {
+			count = 0;
+			step = (offset - index + 3) / 4;
+		}
+	}
+	read_unlock_irq(&mapping->tree_lock);
+
+	return (offset > index) ? (4 * step) : 0;
+}
+
+/*
+ * Scan forward/backward for the first non-present page.
+ * It takes advantage of the adjacency of pages in inactive_list.
+ */
+static unsigned long lru_scan(struct page *page, int dir,
+				int nr_chunks, int nr_pages)
+{
+	unsigned long index = page->index;
+	struct address_space *mapping = page->mapping;
+	struct page *head_page = NULL;
+	struct zone *zone;
+
+	BUG_ON(nr_pages <= 0);
+	BUG_ON(nr_chunks <= 0);
+	BUG_ON(dir != 1 && dir != -1);
+
+	for(;;) {
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		if (!PageLRU(page))
+			goto out;
+
+		do {    
+			index += dir;
+			if (!--nr_pages)
+				goto out;
+
+			page = list_entry(dir == 1 ?
+					page->lru.prev : page->lru.next,
+					struct page, lru);
+		} while (page->mapping == mapping && page->index == index);
+
+		if (!--nr_chunks)
+			goto out;
+		spin_unlock_irq(&zone->lru_lock);
+
+		if (head_page)
+			page_cache_release(head_page);
+
+		head_page = page = find_get_page(mapping, index);
+		if (!page)
+			return index;
+	}
+
+out:    
+	spin_unlock_irq(&zone->lru_lock);
+	if (head_page)
+		page_cache_release(head_page);
+
+	return (nr_chunks && nr_pages) ? index : 0;
+}
+
+static unsigned long get_readahead_index(struct address_space *mapping,
+				unsigned long index, unsigned long ra_size)
+{
+	struct page *page;
+
+	page = find_get_page(mapping, index + ra_size);
+	if (page)
+		goto scan;
+
+	page = find_get_page(mapping, index + ra_size - 1);
+	if (page) {
+		page_cache_release(page);
+		return index + ra_size;
+	}
+
+	for(ra_size -= ra_size / 4;; ra_size = (ra_size + 1) / 2) {
+		if (ra_size == 1)
+			return index + 1;
+		page = find_get_page(mapping, index + ra_size);
+		if (page)
+			break;
+	}
+scan:
+	index = lru_scan(page, 1, ra_size, ra_size);
+	page_cache_release(page);
+	return index;
+}
+
+/*
+ * Rotate old cached pages in inactive_list to prevent readahead thrashing.
+ * Not expected to happen too much.
+ */
+static void rotate_old_pages(struct address_space *mapping,
+			unsigned long offset, unsigned long nr_scan)
+{
+	struct page *page;
+	struct zone *zone;
+
+	for (; nr_scan--; offset++) {
+		page = find_get_page(mapping, offset);
+		if (unlikely(!page))
+			break;
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		if (PageLRU(page) && !PageLocked(page) && !PageActive(page)) {
+			list_del(&page->lru);
+			list_add(&page->lru, &zone->inactive_list);
+			/* inc_page_state(pgrotated); */
+		}
+		spin_unlock_irq(&zone->lru_lock);
+		page_cache_release(page);
+	}
+}
+
+
+/*
+ * ra_size is mainly determined by:
+ * 1. sequential-start: min(KB(32 + mem_mb/4), KB(128))
+ * 2. sequential-max:	min(KB(64 + mem_mb*8), KB(2048))
+ * 3. sequential:	(history pages) * (readahead_ratio / 100)
+ *
+ * Table of concrete numbers:
+ *  (inactive + free) (in MB):   8  16  32  64  128  256  1024  2048
+ *    initial ra_size (in KB):  34  36  40  48   64   96   128   128
+ *	  max ra_size (in KB): 128 192 320 576 1088 2048  2048  2048
+ */
+static inline void get_readahead_bounds(struct address_space *mapping,
+					unsigned long *ra_min,
+					unsigned long *ra_max)
+{
+	unsigned long mem_mb;
+
+#define KB(size)	(((size) * (1<<10)) / PAGE_CACHE_SIZE)
+	mem_mb = max_sane_readahead(KB(1024*1024)) * 2 *
+				PAGE_CACHE_SIZE / 1024 / 1024;
+
+	*ra_max = min(min(KB(64 + mem_mb*8), KB(2048)),
+			mapping->backing_dev_info->ra_pages); 
+
+	*ra_min = min(min(KB(32 + mem_mb/4), KB(128)), *ra_max);
+#undef KB
+}
+
+/* 
+ * Adaptive readahead based on page cache context
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			unsigned long first_index,
+			unsigned long index, unsigned long last_index)
+{
+	unsigned long eof_index;
+	unsigned long ra_index = index;
+	unsigned long ra_size;
+	unsigned long la_size = 0;
+	unsigned long ra_min;
+	unsigned long ra_max;
+	int ret;
+	int release_prev_page = 0;
+	int sequential_type;
+
+	get_readahead_bounds(mapping, &ra_min, &ra_max);
+
+	/* do not run across EOF */
+	eof_index = ((i_size_read(mapping->host) - 1)
+			>> PAGE_CACHE_SHIFT) + 1;
+	if (last_index > eof_index)
+		last_index = eof_index;
+
+	/*
+	 * readahead disabled?
+	 */
+	if (unlikely(!ra_max)) {
+		ra_size = max_sane_readahead(last_index - index);
+		goto readit;
+	}
+	if (unlikely(!readahead_ratio))
+		goto read_random;
+
+	/*
+	 * Start of file.
+	 */
+	if (index == 0) {
+		if (eof_index >= 2 * ra_min)
+			ra_size = ra_min;
+		else
+			ra_size = eof_index;
+		la_size = ra_size / 2;
+		goto readit;
+	}
+
+	if (!prev_page) {
+		prev_page = find_get_page(mapping, index - 1);
+		release_prev_page = 1;
+	}
+
+	/*
+	 * Bare support for read backward.
+	 */
+	if (!prev_page && first_index == index &&
+			last_index - index < ra_min) {
+		struct page *last_page;
+
+		last_page = find_get_page(mapping, last_index);
+		if (get_sequential_type(page) == 2) {
+			ra_index = ((index > 8) ? (index - 8) : 0);
+			ra_size = last_index - ra_index;
+			page_cache_release(last_page);
+			goto readit;
+		}
+		if (last_page)
+			page_cache_release(last_page);
+	}
+
+	/*
+	 * Random read.
+	 */
+	if (!prev_page)
+		goto read_random;
+
+	/* 
+	 * Sequential readahead?
+	 */ 
+	sequential_type = get_sequential_type(prev_page);
+	if (sequential_type > 1 && (!page || !PageActive(page)) &&
+				get_sequential_type(page) < sequential_type) {
+		ra_size = count_sequential_pages(mapping, index,
+				ra_max * 100 / readahead_ratio,
+				sequential_type) * readahead_ratio / 100;
+		if (last_index - first_index < ra_max &&
+			ra_size < ra_min && sequential_type != 2)
+			goto read_random;
+
+		if (ra_size > ra_max)
+			ra_size = ra_max;
+		else if (ra_size < ra_min)
+			ra_size = ra_min;
+
+		if (page) { /* ahead window */
+			ra_index = get_readahead_index(mapping, index, ra_size);
+			if (unlikely(!ra_index))
+				goto out;
+			if (ra_index + ra_size + ra_min <= eof_index)
+				la_size = ra_size;
+			else
+				ra_size = eof_index - ra_index;
+		} else { /* current + ahead window */
+			if (ra_index + 2 * ra_size + ra_min <= eof_index) {
+				la_size = ra_size;
+				ra_size *= 2;
+			}
+			else
+				ra_size = eof_index - ra_index;
+		}
+
+		goto readit;
+	}
+
+read_random:
+	ra_size = min(last_index - index, ra_max);
+
+readit: 
+	ret = __do_page_cache_readahead(mapping, filp, ra_index, ra_size, la_size);
+
+	/* 
+	 * If found hole, rotate old pages to prevent readahead-thrashing.
+	 */
+	if (ret != la_size + ra_size && ra_size < ra_max)
+		rotate_old_pages(mapping, ra_index, ra_size);
+
+	/* if (la_size) {
+		page = find_get_page(mapping, ra_index + ra_size - la_size);
+		if (page) {
+			SetPageReadahead(page);
+			page_cache_release(page);
+		}
+	}                                                                       */
+
+	if (readahead_ratio & 1)
+		printk(KERN_DEBUG "readahead(ino=%lu, index=%lu-%lu-%lu, "
+				"ra=%lu+%lu-%lu) = %d\n",
+				mapping->host->i_ino,
+				first_index, index, last_index,
+				ra_index, ra_size, la_size, ret);
+
+out:
+	if (release_prev_page && prev_page)
+		page_cache_release(prev_page);
+
+	return ra_index + ra_size;
+}
+
diff -rup linux-2.6.13/mm/swap.c linux-2.6.13ra/mm/swap.c
--- linux-2.6.13/mm/swap.c	2005-08-29 07:41:01.000000000 +0800
+++ linux-2.6.13ra/mm/swap.c	2005-09-15 14:29:11.000000000 +0800
@@ -96,6 +96,8 @@ int rotate_reclaimable_page(struct page 
 	return 0;
 }
 
+extern int readahead_ratio;
+
 /*
  * FIXME: speed this up?
  */
@@ -103,6 +105,11 @@ void fastcall activate_page(struct page 
 {
 	struct zone *zone = page_zone(page);
 
+	if (readahead_ratio > 9 || (readahead_ratio & 1)) {
+		SetPageActivate(page);
+		return;
+	}
+
 	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page)) {
 		del_page_from_inactive_list(zone, page);
@@ -122,7 +129,8 @@ void fastcall activate_page(struct page 
  */
 void fastcall mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+	if (!PageActive(page) && !PageActivate(page) &&
+			PageReferenced(page) && PageLRU(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
diff -rup linux-2.6.13/mm/vmscan.c linux-2.6.13ra/mm/vmscan.c
--- linux-2.6.13/mm/vmscan.c	2005-08-29 07:41:01.000000000 +0800
+++ linux-2.6.13ra/mm/vmscan.c	2005-09-15 10:20:35.000000000 +0800
@@ -407,6 +407,9 @@ static int shrink_list(struct list_head 
 		if (PageWriteback(page))
 			goto keep_locked;
 
+		if (TestClearPageActivate(page))
+			goto activate_locked;
+
 		referenced = page_referenced(page, 1, sc->priority <= 0);
 		/* In active use or really unfreeable?  Activate it. */
 		if (referenced && page_mapping_inuse(page))

-- 
WU Fengguang
Dept. of Automation
University of Science and Technology of China
Hefei, Anhui
