Re: Kernel Development & Objective-C

Willy Tarreau wrote:

With 10Gbit/s ethernet working you start to care about every cycle.

If you have 10M packets/sec no amount of cycle-saving will help you. You
need high level optimizations like TSO. I'm not saying we should
sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.

Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
route those packets, those cycles may just be what you need to look up a
forwarding table and perform a few MMIO accesses on an accelerated chip which will
take care of the transfer. But you need those cycles. If you start to waste
them 30 by 30, the performance can drop by a critical factor.

I really doubt Linux spends only 400 cycles routing a packet. Look at what
an skbuff looks like.

A flood ping to localhost on a 2GHz system takes 8 microseconds, that's
16,000 cycles. Sure it involves userspace, but you're about two orders
of magnitude off. And the localhost interface is nicely cached in L1
without mmio at all, unlike real devices.
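
The two cycle figures quoted above are plain arithmetic, and can be checked with a small sketch. The helper names below are made up for illustration; the only inputs are the clock rates, packet rate and ping latency mentioned in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available per packet at a given clock rate and packet rate. */
uint64_t cycles_per_packet(uint64_t clock_hz, uint64_t packets_per_sec)
{
        assert(packets_per_sec != 0);
        return clock_hz / packets_per_sec;
}

/*
 * Cycles elapsed during a latency given in nanoseconds.
 * The clock is rounded down to whole MHz to keep the
 * intermediate product small.
 */
uint64_t cycles_in_ns(uint64_t clock_hz, uint64_t ns)
{
        uint64_t clock_mhz = clock_hz / 1000000ULL;

        return clock_mhz * ns / 1000ULL;
}
```

With the numbers from the thread, cycles_per_packet(4000000000, 10000000) gives 400 and cycles_in_ns(2000000000, 8000) gives 16000.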

Another simple noticeable case is Unix
sockets and your X server communication.

Your reflexes are *much* better than mine if you can measure half a
nanosecond on X.


It just depends how many times a second it happens. For instance, consider
this trivial loop (fct is an array of two functions which just return 1 and 2):

        i = 0;
        for (j = 0; j < (1 << 28); j++) {
                k = (j >> 8) & 1;
                i += fct[k]();
        }

It takes 1.6 seconds to execute on my athlon-xp 1.5 GHz. If, instead of
changing the function once every 256 calls, you change it to every call :

        i = 0;
        for (j = 0; j < (1 << 28); j++) {
                k = (j >> 0) & 1;
                i += fct[k]();
        }

Then it takes 4.3 seconds, which is nearly 3 times slower. The number
of calls per function remains the same (128M calls each), it's just the
branch prediction which is wrong every time. The very few nanoseconds added
at each call are enough to slow down a program from 1.6 to 4.3 seconds while
it executes the exact same code (it may even save one shift). If you have
such stupid code, say, to compute the color or alpha of each pixel in an
image, you will certainly notice the difference.
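
The two loops above are fragments: fct and the variables are not declared in the message. A self-contained version might look like the sketch below. The names ret1, ret2 and run_loop are assumptions made for illustration, and the iteration count is a parameter so that the loop can also be run briefly.

```c
#include <assert.h>

static int ret1(void) { return 1; }
static int ret2(void) { return 2; }

/* Two-entry table of function pointers, as described in the message. */
static int (*const fct[2])(void) = { ret1, ret2 };

/*
 * Call fct[k]() iters times, where k flips every (1 << shift)
 * iterations. shift = 8 corresponds to the first loop above,
 * shift = 0 to the second.
 */
long run_loop(unsigned int shift, unsigned long iters)
{
        long i = 0;
        unsigned long j;

        assert(shift < 8 * sizeof(j));
        for (j = 0; j < iters; j++) {
                unsigned long k = (j >> shift) & 1;

                i += fct[k]();
        }
        return i;
}
```

To compare timings, one could wrap run_loop(8, 1UL << 28) and run_loop(0, 1UL << 28) in clock() calls. The ratio will depend on the CPU's indirect branch predictor and on compiler flags; an optimizing compiler may turn the indirect calls into direct ones, so the figures quoted above should not be expected to reproduce exactly.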

This happens very often in HPC, and when it does, it is often worthwhile
to invest in manual optimizations or even assembly coding.
Unfortunately it is very rare in the kernel (memcmp, raid xor, what
else?). Loops with high iteration counts are very rare, so any
attention you give to the loop body is not amortized over a large number
of executions.

And such poorly efficient code may happen very often when you blindly rely
on function pointers instead of explicit calls.

Using an indirect call where a direct call is sufficient will also
reduce the compiler's optimization opportunities. However, I don't see
anyone recommending it in the context of systems programming.
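
As a rough illustration of that point, here is the same loop written with a direct call and with a call through a function pointer. This is only a sketch; add_one, sum_direct and sum_indirect are hypothetical names, not anything from the kernel.

```c
#include <assert.h>
#include <stddef.h>

static inline int add_one(int x) { return x + 1; }

/* Direct call: the compiler sees the callee and is free to inline it. */
int sum_direct(int n)
{
        int i, s = 0;

        for (i = 0; i < n; i++)
                s = add_one(s);
        return s;
}

/*
 * Indirect call: the target is a run-time value, so inlining is only
 * possible if the compiler can prove which function op points to.
 */
int sum_indirect(int n, int (*op)(int))
{
        int i, s = 0;

        assert(op != NULL);
        for (i = 0; i < n; i++)
                s = op(s);
        return s;
}
```

Both functions compute the same result; what differs is how much the optimizer is allowed to know about the loop body.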

It is not true that the number of indirect calls necessarily increases
if you use a language other than C.


(Actually, with templates you can reduce the number of indirect calls)

Here, it's scheduling that matters, avoiding large transfers, and
avoiding ping-pongs, not some cycles on the unix domain socket. You
already paid 150 cycles or so by issuing the syscall and thousands for
copying the data, 50 more won't be noticeable except in nanobenchmarks.

You are forgetting something very important : once you start stacking
functions to perform the dirty work for you, you end up with so much
abstraction that even new stupid code cannot be written at all without
relying on them, and it's where the problem takes its roots, because
when you need to write a fast function and you notice that you cannot
touch a variable without passing through a slow pinhole, your fast
function will remain slow whatever you do, and the worst of all is that
you will think that it is normally fast and that it cannot be written
faster.


I don't understand.  Can you give an example?

There are two cases where abstraction hurts performance: the first is
where the mechanisms used to achieve the abstraction (functions instead
of direct access to variables, function pointers instead of duplicating
the caller) introduce performance overhead. I don't think C has any
advantage here -- actually a disadvantage as it lacks templates and is
forced to use function pointers for nontrivial cases. Usually the
abstraction penalty is nil with modern compilers.
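
The first case, "functions instead of direct access to variables", can be sketched as follows. The struct and function names are invented for the example. With optimization enabled, small static inline accessors like these are normally inlined, so the two sum functions would typically compile to much the same code; that is an expectation about common compilers, not a guarantee.

```c
#include <assert.h>

struct counter {
        long value;
};

/* Accessors that hide the field behind function calls. */
static inline long counter_get(const struct counter *c)
{
        return c->value;
}

static inline void counter_add(struct counter *c, long n)
{
        c->value += n;
}

/* Sum an array going through the accessors. */
long sum_via_accessors(const long *v, int n)
{
        struct counter c = { 0 };
        int i;

        assert(n >= 0);
        for (i = 0; i < n; i++)
                counter_add(&c, v[i]);
        return counter_get(&c);
}

/* Same computation touching the field directly. */
long sum_direct_field(const long *v, int n)
{
        struct counter c = { 0 };
        int i;

        assert(n >= 0);
        for (i = 0; i < n; i++)
                c.value += v[i];
        return c.value;
}
```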

The second case is where too much abstraction clouds the programmer's
mind. But this is independent of the programming language.

And there are some special cases where block IO is also pretty critical.
A popular one is TPC-* benchmarking, but there are also others and it
looks likely in the future that this will become more critical
as block devices become faster (e.g. highend SSDs)

And again the key is batching, improving cpu affinity, and caching, not
looking for a faster instruction sequence.

Every cycle burned is definitely lost. The time cannot go backwards. So
for each cycle that you lose to laziness, you have to become more and more
clever to find out how to write an alternative. Lazy people simply put
caches everywhere and after that they find it normal that "hello world" requires
2 Gigs of RAM to be displayed.

A 100 byte program will print "hello world" on a UART and stop. A
modern program will load a vector description of a font, scale it to the
desired size, render it using anti aliasing and sub-pixel positioning,
lay it out according to the language rules of wherever you live, and
place it on a multi-megabyte frame buffer. Yes it needs hundreds of
megabytes and lots of nasty algorithms to do that.

The only true solution is to create better
algorithms, but you will find even fewer people capable of creating efficient
algorithms than you will find capable of coding correctly.

That is true, that is why we see a lot more microoptimizations than
algorithmic progress.

But if you want a fast streaming filesystem you choose XFS over ext3,
even though the latter is much smaller and easier to optimize. If you
write a network server you choose epoll() instead of trying to optimize
select() somehow. True algorithmic improvements are rare but they are
the ones that are actually measurable.
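
For readers who have not used it, a minimal epoll sketch follows. It is Linux-only, the function name epoll_ready_count is made up, and a pipe stands in for the sockets a real server would register. The point of the interface is that epoll_wait() hands back only the descriptors that are ready, instead of having the caller rescan the whole set as with select().

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Register the read end of a pipe with epoll, make it readable,
 * and return how many descriptors epoll_wait() reports as ready.
 * Returns -1 if any setup step fails.
 */
int epoll_ready_count(void)
{
        struct epoll_event ev = { 0 }, ready[4];
        int pfd[2], epfd, n = -1;

        if (pipe(pfd) < 0)
                return -1;
        epfd = epoll_create1(0);
        if (epfd < 0)
                goto out_pipe;

        ev.events = EPOLLIN;
        ev.data.fd = pfd[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
                goto out_epoll;

        /* Make the registered descriptor readable. */
        if (write(pfd[1], "x", 1) != 1)
                goto out_epoll;

        /* Only ready descriptors are returned; wait at most one second. */
        n = epoll_wait(epfd, ready, 4, 1000);
        assert(n != 1 || ready[0].data.fd == pfd[0]);

out_epoll:
        close(epfd);
out_pipe:
        close(pfd[0]);
        close(pfd[1]);
        return n;
}
```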

For example there are some CPUs that are relatively slow at indirect
function calls and there are actually cases where this can be measured.

That is true. But any self-respecting systems language will let you
choose between direct and indirect calls.

If adding an indirect call allows you to avoid even 1% of I/O, you save
much more than you lose, so again the high level optimizations win.

It depends which type of I/O. If the I/O is non-blocking, you end up doing
something else instead of actively burning cycles.

Unless you are I/O bound, which is usually the case when you have 2GHz
cpus driving 200Hz disks.

Nanooptimizations are fun (I do them myself, I admit) but that's not
where performance as measured by the end user lies.


I do not agree. It's not uncommon to find 2- or 3-fold performance factors
between equivalent components when one is carefully optimized and the other
one is not. Granted it takes an awful lot of time doing all those nano-opts
at the beginning, but the more you learn about how the hardware reacts to
your code, the more efficiently you write future code, with the least bloat.
End users notice bloat a lot (especially when CPU and RAM are excessively
wasted).

Can you give an example of a 2- or 3-fold factor on an end-user
workload achieved by microopts?


I agree about bloat.

