Re: sendfile() with 100 simultaneous 100MB files

On Fri, Jan 20, 2006 at 04:53:44PM -0500, Jon Smirl wrote:
> I was reading this blog post about the lighttpd web server.
> http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
> It describes problems they are having downloading 100 simultaneous 100MB files.

    "more than 100 files of each more than 100 MB"

> In this post they complain about sendfile() getting into seek storms and
> ending up in 72% IO wait. As a result they built a user space
> mechanism to work around the problems.
> 
> I tried looking at how the kernel implements sendfile(), I have
> minimal understanding of how the fs code works but it looks to me like
> sendfile() is working a page at a time. I was looking for code that
> does something like this...
> 
> 1) Compute an adaptive window size and read ahead the appropriate
> number of pages.  A larger window would minimize disk seeks.

Or maybe not; more main memory would help more.  But there is
another issue...

> 2) Something along the lines of as soon as a page is sent age the page
> down in to the middle of page ages. That would allow for files that
> are repeatedly sent, but also reduce thrashing from files that are not
> sent frequently and shouldn't stay in the page cache.
> 
> Any other ideas why sendfile() would get into a seek storm?


Deep inside do_generic_mapping_read() there is a loop that reads the
source file with read-ahead, one page at a time.  For each page it
calls the actor (which does the sending) and then drops its reference
to the page, with some convoluted handling for pages that are not
yet in the page cache:


                /*
                 * Ok, we have the page, and it's up-to-date, so
                 * now we can copy it to user space...
                 *
                 * The actor routine returns how many bytes were actually used..
                 * NOTE! This may not be the same as how much of a user buffer
                 * we filled up (we may be padding etc), so we can only update
                 * "pos" here (the actor routine has to update the user buffer
                 * pointers and the remaining count).
                 */
                ret = actor(desc, page, offset, nr);
                offset += ret;
                index += offset >> PAGE_CACHE_SHIFT;
                offset &= ~PAGE_CACHE_MASK;

                page_cache_release(page);
                if (ret == nr && desc->count)
                        continue;
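
To make the index/offset bookkeeping in that excerpt concrete, here is
a small user-space sketch of the same arithmetic.  The 4 kB page size,
the struct and the name advance() are assumptions for illustration
only; this is not kernel code.

```c
/* Assumed for illustration: 4 kB pages, as on i386. */
#define PAGE_CACHE_SHIFT 12
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)
#define PAGE_CACHE_MASK  (~(PAGE_CACHE_SIZE - 1))

/* Hypothetical holder for the two loop variables in the excerpt. */
struct file_pos {
        unsigned long index;    /* page number within the file */
        unsigned long offset;   /* byte offset within that page */
};

/* Advance the position by the `ret` bytes the actor consumed. */
static void advance(struct file_pos *p, unsigned long ret)
{
        p->offset += ret;
        p->index  += p->offset >> PAGE_CACHE_SHIFT;  /* carry whole pages */
        p->offset &= ~PAGE_CACHE_MASK;               /* keep the remainder */
}
```

So if the actor consumes a whole page, index moves on by one and
offset returns to 0; a short send leaves index alone and only moves
offset.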


That is, if machine memory is so limited (file pages + network
TCP buffers!) that the source file pages are constantly being purged,
there is not much that one can do.

The workaround described there is essentially to read the file into
server process memory through a sliding window of half a MB, and then
writev() from there to the socket.  Most importantly, it does the
reading in _large_ chunks.
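
A minimal sketch of that kind of user-space copy loop, as I understand
the description; it is not lighttpd's actual code.  The names
WINDOW_BYTES, write_all() and copy_windowed() are made up for this
example, and blocking descriptors are assumed (a real server using
non-blocking sockets would also have to handle EAGAIN).

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define WINDOW_BYTES (512 * 1024)  /* "half a MB", per the description */

/* Send all of buf with writev(), retrying after short writes.
 * Returns 0 on success, -1 on error. */
static int write_all(int out_fd, char *buf, size_t len)
{
        while (len > 0) {
                struct iovec iov = { .iov_base = buf, .iov_len = len };
                ssize_t n = writev(out_fd, &iov, 1);

                if (n < 0) {
                        if (errno == EINTR)
                                continue;
                        return -1;
                }
                buf += n;
                len -= (size_t)n;
        }
        return 0;
}

/* Copy in_fd to out_fd one large chunk at a time.
 * Returns the number of bytes copied, or -1 on error. */
static long long copy_windowed(int in_fd, int out_fd)
{
        char *buf = malloc(WINDOW_BYTES);
        long long total = 0;
        ssize_t n;

        if (!buf)
                return -1;
        for (;;) {
                n = read(in_fd, buf, WINDOW_BYTES);  /* one big read */
                if (n < 0 && errno == EINTR)
                        continue;
                if (n <= 0)
                        break;                       /* EOF or error */
                if (write_all(out_fd, buf, (size_t)n) < 0) {
                        n = -1;
                        break;
                }
                total += n;
        }
        free(buf);
        return n < 0 ? -1 : total;
}
```

The point of the design is that each read() asks the disk for a long
contiguous run, and whatever has been read stays in the process's own
buffer until the socket has taken it.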

The read-ahead in sendfile is done by page_cache_readahead(), and
via fairly complicated circumstances it ends up using the per-file
window size f_ra.ra_pages, which posix_fadvise() adjusts like this:

        bdi = mapping->backing_dev_info;

        switch (advice) {
        case POSIX_FADV_NORMAL:
                file->f_ra.ra_pages = bdi->ra_pages;
                break;
        case POSIX_FADV_RANDOM:
                file->f_ra.ra_pages = 0;
                break;
        case POSIX_FADV_SEQUENTIAL:
                file->f_ra.ra_pages = bdi->ra_pages * 2;
                break;
	....


The default value of ra_pages is equivalent to 128 kB, which
should be enough...

Why does it go into seek thrashing?  Because the read-ahead buffer
memory is consumed in very small fragments, and the sendpage logic
that writes to the socket pauses frequently; during those pauses the
read-ahead buffers get recycled...

In the writev() solution, pauses on the socket sending side do not
hit the source file reading side so heavily, because the data is
buffered in the non-discardable memory of the userspace process.
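
A back-of-the-envelope sketch of what that buffering costs, using the
figures from this thread (100 streams, 128 kB default read-ahead,
half a MB user-space window).  The helper name window_footprint() is
made up for the example.

```c
/* Total bytes held across all streams for a per-stream window. */
static unsigned long long window_footprint(unsigned int streams,
                                           unsigned long long window_bytes)
{
        return (unsigned long long)streams * window_bytes;
}
```

For 100 streams that is 12.5 MB of read-ahead pages that the kernel is
free to recycle, versus 50 MB of process memory that it will not
discard while a send is stalled.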

> --
> Jon Smirl
> [email protected]

/Matti Aarnio