Re: using splice/vmsplice to improve file receive performance

On 1/4/07, Jens Axboe <[email protected]> wrote:
On Thu, Jan 04 2007, Jens Axboe wrote:
> On Wed, Jan 03 2007, saeed bishara wrote:
> > On 12/22/06, Jens Axboe <[email protected]> wrote:
> > >On Fri, Dec 22 2006, saeed bishara wrote:
> > >> On 12/22/06, Jens Axboe <[email protected]> wrote:
> > >> >On Fri, Dec 22 2006, saeed bishara wrote:
> > >> >> On 12/22/06, Jens Axboe <[email protected]> wrote:
> > >> >> >On Thu, Dec 21 2006, saeed bishara wrote:
> > >> >> >> Hi,
> > >> >> >> I'm trying to use the splice/vmsplice system calls to improve the
> > >> >> >> samba server write throughput, but before touching the smbd, I started
> > >> >> >> to improve the ttcp tool since it's simple and has the same flow. I'm
> > >> >> >> expecting to avoid the "copy_from_user" path when using those
> > >> >> >> syscalls.
> > >> >> >> so far, I couldn't make any improvement, actually the throughput gets
> > >> >> >> worse. the new receive flow looks like this (code also attached):
> > >> >> >> 1. read tcp packet (64 pages) to page aligned buffer.
> > >> >> >> 2. vmsplice the buffer to pipe with SPLICE_F_MOVE.
> > >> >> >> 3. splice the pipe to the file, also with SPLICE_F_MOVE.
> > >> >> >>
> > >> >> >> the strace shows that the splice takes a lot of time. also when
> > >> >> >> profiling the kernel, I found that memcpy() is called too often !!
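
(For concreteness, the receive flow described in that mail boils down to roughly
the loop below. This is only a sketch, not the attached test: sock, pipefd and
file_fd are assumed to be set up elsewhere, and splice()/vmsplice() are assumed
to be available either from glibc or via splice.h from the splice tools.)

#define _GNU_SOURCE
#include <fcntl.h>	/* splice(), vmsplice(), SPLICE_F_* */
#include <sys/uio.h>
#include <unistd.h>
#include <stdlib.h>

#define CHUNK	(64 * 4096)	/* 64 pages, as in the mail above */

/* hypothetical helper, not the attached test: one chunk of the described flow */
static int recv_chunk_to_file(int sock, int pipefd[2], int file_fd)
{
	struct iovec iov;
	void *buf;
	ssize_t n, moved = 0;

	/* 1. read one tcp chunk into a page aligned buffer */
	if (posix_memalign(&buf, 4096, CHUNK))
		return -1;
	n = read(sock, buf, CHUNK);
	if (n <= 0)
		return -1;			/* EOF or error */

	/*
	 * 2 + 3. alternate vmsplice into the pipe and splice out to the
	 * file; the pipe holds only 16 pages by default, so a 64 page
	 * buffer takes several rounds.
	 */
	while (moved < n) {
		ssize_t in, out = 0;

		iov.iov_base = (char *) buf + moved;
		iov.iov_len = n - moved;
		in = vmsplice(pipefd[1], &iov, 1, SPLICE_F_MOVE);
		if (in <= 0)
			return -1;
		while (out < in) {
			ssize_t ret = splice(pipefd[0], NULL, file_fd, NULL,
					     in - out, SPLICE_F_MOVE);
			if (ret <= 0)
				return -1;
			out += ret;
		}
		moved += in;
	}

	free(buf);
	return 0;
}
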
> > >> >> >
> > >> >> >(didn't see this until now, [email protected] doesn't work anymore)
> > >> >> >
> > >> >> >I'm assuming that you mean you vmsplice with SPLICE_F_GIFT, to hand
> > >> >> >ownership of the pages to the kernel (in which case SPLICE_F_MOVE will
> > >> >> >work, otherwise you get a copy)? If not, that'll surely cost you a data
> > >> >> >copy
> > >> >>   I'll try the vmsplice with SPLICE_F_GIFT and splice with MOVE. btw,
> > >> >> I noticed that the splice system call takes the bulk of the time,
> > >> >> does it mean anything?
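
(i.e. roughly the following on the vmsplice side; a sketch only, with pipefd and
file_fd set up elsewhere, the same headers as in the earlier sketch, and len no
larger than what the pipe can hold in one go:)

/* hypothetical helper, not part of the splice tools: gift whole,
 * page aligned pages into the pipe and move them on to the file */
static int gift_and_splice(int pipefd[2], int file_fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/*
	 * SPLICE_F_GIFT: hand the pages over to the kernel; don't modify
	 * them again afterwards, since a later steal may map these very
	 * pages into the page cache.
	 */
	if (vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT) != (ssize_t) len)
		return -1;

	/*
	 * SPLICE_F_MOVE: ask splice to steal the gifted pages into the
	 * page cache instead of copying them.
	 */
	if (splice(pipefd[0], NULL, file_fd, NULL, len, SPLICE_F_MOVE) != (ssize_t) len)
		return -1;

	return 0;
}
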
> > >> >
> > >> >Hard to say without seeing some numbers :-)
> > >> I'm out of the office, I'll send it later. btw, my test bed (the
> > >> receiver side) is an ARM9. does it matter?
> > >
> > >The vmsplice is basically vm intensive, so it could matter.
> > >
> > >> >> >This sounds remarkably like a recent thread on lkml, you may want to
> > >> >> >read up on that. Basically using splice for network receive is a bit of
> > >> >> >a work-around now, since you do need the one copy and then vmsplice that
> > >> >> >into a pipe. To realize the full potential of splice, we first need
> > >> >> >socket receive support so you can skip that step (splice from socket to
> > >> >> >pipe, splice pipe to file).
> > >> >> Ashwini Kulkarni posted patches that implement that, see
> > >> >> http://lkml.org/lkml/2006/9/20/272 . is that right?
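
(Presumably the receive path would then collapse to something like the sketch
below. This is hypothetical: splicing from a TCP socket as fd_in is exactly what
mainline doesn't support yet, and sock, pipefd and file_fd are assumed to be set
up as in the earlier sketch.)

	/* hypothetical: assumes splice() accepts the TCP socket as fd_in,
	 * which is the part that isn't in mainline yet */
	for (;;) {
		ssize_t n = splice(sock, NULL, pipefd[1], NULL, 65536,
				   SPLICE_F_MOVE | SPLICE_F_MORE);
		if (n <= 0)
			break;				/* EOF or error */
		if (splice(pipefd[0], NULL, file_fd, NULL, n,
			   SPLICE_F_MOVE) != n)
			break;
	}
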
> > >> >> >
> > >> >> >There was no test code attached, btw.
> > >> >> sorry, here it is.
> > >> >> can you please add a sample application to your test tools (splice, fio,
> > >> >> ...) that demonstrates my flow; socket to file using read & vmsplice?
> > >> >
> > >> >I didn't add such an example, since I had hoped that we would have
> > >> >splice from socket support sooner rather than later. But I can do so, of
> > >> >course.
> > >> do you have any preliminary patches? I can start playing with it.
> > >
> > >I don't, Intel posted a set of patches a few months ago though. I didn't
> > >have time to look at them at the time, but you should be able to find
> > >them in the archives.
> > >
> > >> >I'll try your test. One thing that sticks out initially is that you
> > >> >should be using full pages; the splice pipe will not merge page
> > >> >segments. So don't use a buflen less than the page size.
> > >>
> > >> yes, actually I run ttcp with -l65536 (64KB), and the buffer is
> > >> always page aligned. also, the splice/vmsplice with MOVE or GIFT will
> > >> fail if the buffer is not made up of whole pages. am I right?
> > >
> > >Yes.
> > >
> > >I added a simple splice-fromnet example in the splice git repo, see if
> > >you can repeat your results with that. Doing:
> > >
> > ># ./splice-fromnet -g 2001 | ./splice-out -m /dev/null
> > >
> > >and
> > >
> > ># cat /dev/zero | netcat localhost 2001
> > >
> > >gets me about 490MiB/sec; using a recv/write loop is around 413MiB/sec.
> > >Not migrating pages gets me around 422MiB/sec.
> > >
> > >--
> > >Jens Axboe
> > >
> > >
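
(For reference, the recv/write baseline in that comparison is presumably just the
plain copy loop, roughly as below, with sock and out_fd set up elsewhere and
<sys/socket.h>/<unistd.h> included.)

	/* plain copy path, roughly what the recv/write number measures */
	char buf[65536];
	ssize_t n;

	while ((n = recv(sock, buf, sizeof(buf), 0)) > 0) {
		if (write(out_fd, buf, n) != n)
			break;
	}
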
> > I've done some investigation in the splice flow and found the following:
> > even when using vmsplice with GIFT and splice with MOVE, the user
> > buffers are still copied; I see that the memcpy from pipe_to_file() is
> > called.
> > I added debug messages in this function and here is what I got:
> > 1. the generic_pipe_buf_steal always fails, this is because the
> > page_count is 2.
> > 2. after that, the find_lock_page fails as well.
> > 3. page_cache_alloc_cold succeeds.
> > 4. but, since the buf->page differs from the page (returned by
> > page_cache_alloc_cold) the memcpy function is called.
> >
> > this behavior is true for all the buffers that are vmspliced to an ext3 file.
> > is this the expected behavior? is there any way to make the steal
> > operation return with success?
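
In other words, the decision path I'm seeing boils down to roughly this
(pseudo-code reconstructed from the debug output above, not the actual
fs/splice.c source):

	/* pseudo-code of the observed pipe_to_file() behaviour, not the
	 * real fs/splice.c source */
	if (generic_pipe_buf_steal(pipe, buf) == 0) {
		page = buf->page;			/* never happens here: page_count == 2 */
	} else {
		page = find_lock_page(mapping, index);	/* fails: file page not cached */
		if (!page)
			page = page_cache_alloc_cold(mapping);	/* succeeds */
	}
	if (page != buf->page)
		memcpy(dst, src, this_len);		/* so we always end up copying */
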
>
> It works for me, with most pages. Using the vmsplice/splice-out from the
> splice tools, doing
>
> $ ./vmsplice -g | ./splice-out -m g
>
> about half of the pages have count==1 and the steal succeeds.
>
> find_lock_page() will only succeed if the file exists and is cached
> already. splice-out will truncate the file, so it should never succeed
> for that case. For both the find_lock_page() success and failure cases
> (page being allocated), it's a given that we need to copy the data.

Testing a simpler case (not switching buffers), all but one page was
stolen. I tested with on-stack and posix_memalign returned buffers.

--
Jens Axboe



your test (./vmsplice -g | ./splice-out -m) works for me. but I'm
trying to do the vmsplice and the splice in the same process. so I
modified the vmsplice test (attached) to do the splice to the file. in
this case no pages are stolen.
IMHO, when doing it in two processes as in your example, the count of
some of the pages is decreased when the first process exits, and this is
why those pages can be stolen.

saeed
/*
 * Modified vmsplice test: vmsplice user memory into a pipe, then splice
 * the pipe contents to the output file named on the command line
 * (instead of writing to stdout as the original tool does).
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <limits.h>
#include <string.h>
#include <getopt.h>
#include <sys/poll.h>
#include <sys/types.h>
#include <fcntl.h>

#include "splice.h"

#define ALIGN(buf)	(void *) (((unsigned long) (buf) + align_mask) & ~align_mask)

static int do_clear;
static int align_mask = 65535;
static int force_unalign;
static int splice_flags;
unsigned int count = 1;

int do_vmsplice(int *pfd, int fd, void *b1, void *b2, int len)
{
//	struct pollfd pfd = { .fd = fd, .events = POLLOUT, };
	struct iovec iov[] = {
		{
			.iov_base = b1,
			.iov_len = len / 2,
		},
		{
			.iov_base = b2,
			.iov_len = len / 2,
		},
	};
	int written, idx = 0, spliced = 0;

	while (len) {
		/*
		 * in a real app you'd be more clever with poll of course,
		 * here we are basically just blocking on output room and
		 * not using the free time for anything interesting.
		 */
		//	if (poll(&pfd, 1, -1) < 0)
		//return error("poll");

		written = vmsplice(pfd[1], &iov[idx], 2 - idx, splice_flags);

		if (written <= 0)
			return error("vmsplice");

		/* drain what we just vmspliced from the pipe into the file */
		spliced = 0;
		while (spliced < written) {
			int ret;

			ret = splice(pfd[0], NULL, fd, NULL, written - spliced,
				     SPLICE_F_MOVE);
			if (ret <= 0)
				return error("splice");
			spliced += ret;
		}
		len -= written;
		if ((size_t) written >= iov[idx].iov_len) {
			int extra = written - iov[idx].iov_len;

			idx++;
			iov[idx].iov_len -= extra;
			iov[idx].iov_base += extra;
		} else {
			iov[idx].iov_len -= written;
			iov[idx].iov_base += written;
		}
	}

	return 0;
}

static int usage(char *name)
{
	fprintf(stderr, "%s: [-c(lear)] [-u(nalign)] [-g(ift)] [-n count] outfile\n", name);
	return 1;
}

static int parse_options(int argc, char *argv[])
{
	int c;

	while ((c = getopt(argc, argv, "n:cug")) != -1) {
		switch (c) {
		case 'c':
			do_clear = 1;
			break;
		case 'u':
			force_unalign = 1;
			break;
		case 'g':
			splice_flags = SPLICE_F_GIFT;
			break;
		case 'n':
			count = atoi(optarg);
			break;
		default:
			return -1;
		}
	}

	/*
	 * use optind rather than counting options by hand, so that "-n <count>"
	 * (which consumes two argv entries) doesn't throw the file index off
	 */
	return optind;
}

int main(int argc, char *argv[])
{
	unsigned char *b1, *b2;
	int fd, index;
	int pfd[2];


	index = parse_options(argc, argv);
	if (index < 0 || index + 1 > argc)
		return usage(argv[0]);

//	if (check_output_pipe())
//	return usage(argv[0]);
	if (pipe(pfd) < 0)
		return error("pipe");

	fd = open(argv[index], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return error("open");

	b1 = ALIGN(malloc(SPLICE_SIZE + align_mask));
	b2 = ALIGN(malloc(SPLICE_SIZE + align_mask));

	if (force_unalign) {
		b1 += 1024;
		b2 += 1024;
	}

	memset(b1, 0xaa, SPLICE_SIZE);
	memset(b2, 0xbb, SPLICE_SIZE);

	do {
		int half = SPLICE_SIZE / 2;

		/*
		 * vmsplice the first half of each buffer into the pipe;
		 * do_vmsplice() also splices the data on to the file
		 */
		if (do_vmsplice(pfd, fd, b1, b2, SPLICE_SIZE))
			break;

		/*
		 * the data has already been spliced out of the pipe and
		 * into the file at this point
		 */

		/*
		 * vmsplice (and splice) the second half of each buffer
		 */
		if (do_vmsplice(pfd, fd, b1 + half, b2 + half, SPLICE_SIZE))
			break;

		/*
		 * We still don't know when we can reuse the second half of
		 * the buffer, but we do now know that all parts of the first
		 * half have been consumed from the pipe - so we can reuse that.
		 */

		/*
		 * Test option - clear the buffers. do_vmsplice() has already
		 * drained everything out of the pipe and into the file, so
		 * nothing is left in the pipe at this point.
		 */
		if (do_clear) {
			memset(b1, 0x00, SPLICE_SIZE);
			memset(b2, 0x00, SPLICE_SIZE);
		}
	} while (count--);

	return 0;
}
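
(The attached test builds against the splice tools tree, since it needs splice.h
for SPLICE_SIZE and error(); something like:

	gcc -O2 -o vmsplice-tofile vmsplice-tofile.c
	./vmsplice-tofile -g -n 100 /tmp/outfile

where the vmsplice-tofile name, the iteration count and the output path are of
course arbitrary.)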
