[Corrected Arjan's address, which I messed up earlier.]

Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>
>>> curious, do you have any (relatively) simple-to-run testcase that
>>> clearly shows the "scalability issues" you mention above, when going
>>> from rwlocks to spinlocks? I'd like to give it a try on an 8-way box.
>>
>> Arjan van de Ven wrote:
>>
>>> I'm curious what scalability advantage you see for rw spinlocks vs real
>>> spinlocks ... since for any kind of moderate hold time the opposite is
>>> expected ;)
>>
>> It actually surprised me too, but Peter Chubb (who IIRC provided the
>> motivation to merge the patch) showed some fairly significant
>> improvement at 12-way:
>>
>> https://www.gelato.unsw.edu.au/archives/scalability/2005-March/000069.html
>
> I think that workload wasn't analyzed well enough (by us, not by Peter,
> who sent a reasonable analysis and suggested a reasonable change), and
> we went with whatever magic change appeared to make a difference,
> without fully understanding the underlying reasons. Quote:
>
>   "I'm not sure what's happening in the 4-processor case."
>
> Now history appears to be repeating itself, just in the other direction
> ;) And we didn't get one inch closer to understanding the situation for
> real. I'd vote for putting a change-moratorium on tree_lock and only
> allowing patches that tweak it if they come with a full analysis of the
> workload :-)
>
> One thing off the top of my head: doesn't lockstat introduce significant
> overhead? Is this reproducible with lockstat turned off too? Is the same
> scalability problem visible if all read_lock()s are changed to
> write_lock()? [like I did in my patch] I.e. can other explanations (like
> unlucky alignment of certain rwlock data structures / functions) be
> excluded?

Yes, it would need re-testing.
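
If it helps, that comparison is also easy to approximate from userspace.
Here is a rough sketch (my own illustration, not Ingo's patch): pthread
locks stand in for the kernel's spinlock_t/rwlock_t, the thread and
iteration counts are arbitrary, and lockstat is nowhere near it. Building
it three times with LOCK_MODE 0/1/2 gives the plain-spinlock case, the
read_lock case, and the "all read_lock()s changed to write_lock()" case:

/*
 * Crude userspace approximation of the tree_lock read path: pthread
 * locks stand in for the kernel's spinlock_t/rwlock_t, and there is
 * no lockstat anywhere in the picture.
 * Build: gcc -O2 -DLOCK_MODE=1 lock-bounce.c -o lock-bounce -lpthread
 * LOCK_MODE: 0 = spinlock, 1 = rwlock taken for read,
 *            2 = rwlock taken for write
 */
#include <pthread.h>
#include <stdio.h>

#ifndef LOCK_MODE
#define LOCK_MODE 1
#endif

#define NR_THREADS 12                   /* arbitrary; match the box */
#define ITERATIONS 1000000UL

static pthread_spinlock_t spin;
static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static volatile unsigned long shared;   /* dummy lookup target */

static void *worker(void *arg)
{
        unsigned long i, sum = 0;

        (void)arg;
        for (i = 0; i < ITERATIONS; i++) {
#if LOCK_MODE == 0
                pthread_spin_lock(&spin);
                sum += shared;          /* read-only critical section */
                pthread_spin_unlock(&spin);
#elif LOCK_MODE == 1
                pthread_rwlock_rdlock(&rwlock);
                sum += shared;
                pthread_rwlock_unlock(&rwlock);
#else
                pthread_rwlock_wrlock(&rwlock);
                sum += shared;
                pthread_rwlock_unlock(&rwlock);
#endif
        }
        return (void *)sum;             /* keep sum from being optimized out */
}

int main(void)
{
        pthread_t threads[NR_THREADS];
        int i;

        pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(threads[i], NULL);
        printf("mode %d done\n", LOCK_MODE);
        return 0;
}

Timing the three binaries (e.g. with /usr/bin/time) on the 8-way and
12-way boxes should at least show whether the read-lock advantage
survives with lockstat completely out of the picture.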

> Another thing: average hold times in the spinlock case on that workload
> are below 1 microsecond - probably in the range of cache-miss bounce
> costs on such a system.

It's the wait time that I'd be more worried about. As I said, my wild
guess is that the wait times are creeping up.

> I.e. it's the worst possible case for a spinlock->rwlock conversion!
> The only reasons I can believe would make a difference are cycle-level
> races and small random micro-differences that cause heavier bouncing in
> the spinlock workload but happen to avoid it in the read-lock case, not
> any fundamental advantage of rwlocks.

I'd say the 12-way results show that there is a fundamental advantage
(although that's pending confirmation that lockstat isn't wrecking the
results). I'd even go out on a limb ;) and say that it will only become
more pronounced at higher CPU counts.

Correct me if I'm wrong, but... a read-lock requires at most a single
cacheline transfer per lock acquisition and a single one per release, no
matter how much concurrency there is on the lock (so long as it is taken
read-only).

A spinlock is going to take more. If the hardware perfectly round-robins
the cacheline, it will take lockers+1 transfers per lock+unlock pair. Of
course, hardware might be pretty unfair for efficiency's sake, but there
will still be some probability of the cacheline bouncing to other lockers
while it is held, and that probability will increase in proportion to the
number of lockers.
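
To put a number on it: with 12 lockers, the round-robin assumption above
works out to roughly 13 line transfers per spinlock acquire/release pair,
versus two per reader on the read side. A toy model (C11 atomics, read
side only, deliberately not the kernel implementation) of where those
transfers come from:

#include <stdatomic.h>

/*
 * Read side of an rwlock, reduced to the bare reader count: each
 * reader does one atomic RMW to take the lock and one to drop it, so
 * it pulls the cacheline over exactly twice, no matter how many other
 * readers there are. (No writer path here -- this only shows the
 * traffic pattern, it is not a usable lock.)
 */
static atomic_int readers;

static void toy_read_lock(void)
{
        atomic_fetch_add(&readers, 1);  /* one exclusive grab of the line */
}

static void toy_read_unlock(void)
{
        atomic_fetch_sub(&readers, 1);  /* and one more on release */
}

/*
 * Test-and-set spinlock: every waiter keeps issuing RMWs for the whole
 * hold time, so the line can keep migrating between the holder and the
 * spinners -- roughly lockers+1 transfers per acquire/release pair if
 * the hardware hands it out round-robin.
 */
static atomic_flag lock_word = ATOMIC_FLAG_INIT;

static void toy_spin_lock(void)
{
        while (atomic_flag_test_and_set(&lock_word))
                ;                       /* each failed attempt can yank the line over */
}

static void toy_spin_unlock(void)
{
        atomic_flag_clear(&lock_word);
}

int main(void)
{
        toy_read_lock();
        toy_read_unlock();
        toy_spin_lock();
        toy_spin_unlock();
        return 0;
}
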
--
SUSE Labs, Novell Inc.