Re: [PATCH 1b/7] dlm: core locking

On Thursday 28 April 2005 08:55, Lars Marowsky-Bree wrote:
> On 2005-04-28T02:49:04, Daniel Phillips <[email protected]> wrote:
> > > Just some food for thought how this all fits together rather
> > > neatly.
> >
> > It's actually the membership system that glues it all together.  The
> > dlm is just another service.
>
> Membership is one of the lowest level and high privileged inputs to the
> whole picture, of course.
>
> However, "membership" is already a pretty broad term, and one must
> clearly state what one is talking about. So we're clearly focused on
> node membership here, which is a special case of group membership; the
> top-level, sort of.

Indeed, you caught me being imprecise.  By "membership system" I mean cman, 
which includes basic cluster membership, service groups, socket interface, 
event messages, PF_CLUSTER, and a few other odds and ends.  Really, it _is_ 
our cluster infrastructure.  And it has warts, some really giant ones.  At 
least it did the last time I used it.  There is apparently a new, 
much-improved version I haven't seen yet.  I have heard that the re-rolled 
cman is in cvs somewhere.  Patrick?  Dave?

> Then every node has its local view of node membership, constructed
> typically from observing node heartbeats.

Actually, it is constructed from observing cman events over the socket.
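
For what it's worth, here is a minimal sketch of that model in C.  The 
struct, the event names and the delivery callback are all made up for 
illustration; this is not the real cman wire format or API:

    #include <stdint.h>
    #include <unistd.h>

    enum member_event { NODE_JOIN, NODE_LEAVE, NODE_DEAD, RECOVERY_FINISH };

    struct cluster_event {                  /* hypothetical wire format */
            uint32_t type;                  /* one of member_event */
            uint32_t nodeid;                /* node the event concerns */
    };

    /* Pull each event off the membership socket, hand it to the service. */
    static void event_loop(int sock, void (*deliver)(struct cluster_event *))
    {
            struct cluster_event ev;

            while (read(sock, &ev, sizeof(ev)) == sizeof(ev))
                    deliver(&ev);
    }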

I see that some fantastical /sys/ filesystem has wormed itself into the 
machinery.  I need to check that this hasn't compromised the basic beauty of 
the event messaging model.

Fencing is a whole 'nother issue.  It's sort of unclear how it is actually 
supposed to work, and judging from the number of complaints I see about it on 
mailing lists, it doesn't work very well.  We need to take a good look at 
that.

> Then the nodes communicate to reach consensus on the coordinated
> membership, which will usually be a set of nodes with full N:N
> connectivity (via the cluster messaging mechanism); and they'll also
> usually aim to identify the largest possible set.

Yes.  "Reaching consensus" is signalled to each node by cman sending a 
"finish" event, as in "finish recovering".  (To be sure, this is misleading 
terminology.  We should kill it before it has a chance to reproduce.)

> Eventually, there'll be a membership view which also implies certain
> shared data integrity guarantees if appropriate (ie, fencing in case a
> node didn't go down cleanly, and granting access on a clean join).

Each node's membership view is simply the cumulative state implied by the cman 
events.  Necessarily, this view will suffer some skew across the cluster.  
All cluster algorithms _must_ recognize and accommodate that.  This is where 
barriers come into play, though that mechanism is buried inside cman, and 
each node's view of barrier operations consists of cman events.  (The way 
this is actually implemented smells a little scary to me, but it seems to 
work ok for small numbers of nodes.)
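
To make "cumulative state" concrete, here is how a node might fold those 
events into its local view, reusing the made-up event struct from the 
sketch above (again, an illustration, not the actual cman code):

    #include <stdbool.h>

    #define MAX_NODES 256

    static bool member[MAX_NODES];  /* this node's current, possibly skewed view */
    static bool finished;           /* set once cman signals "finish" */

    static void deliver(struct cluster_event *ev)
    {
            if (ev->nodeid >= MAX_NODES)
                    return;                 /* ignore bogus node ids */

            switch (ev->type) {
            case NODE_JOIN:
                    member[ev->nodeid] = true;
                    finished = false;       /* recovery under way again */
                    break;
            case NODE_LEAVE:
            case NODE_DEAD:
                    member[ev->nodeid] = false;
                    finished = false;
                    break;
            case RECOVERY_FINISH:
                    finished = true;        /* consensus reached, resume work */
                    break;
            }
    }

The point is that there is no global view object anywhere: each node's 
member[] is whatever its own event stream says, nothing more, which is 
exactly why the skew has to be designed for rather than wished away.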

> All these steps but the last usually happen completely internally to the
> membership layer; the last one requires coordination already, because
> the fencing layer itself might need recovery before it can fence
> something after a node failure.

Right, we need to do a lot more work on the fencing interface.  For example, I 
haven't even begun to analyze it from the point of view of memory inversion 
deadlock.  My spider sense tells me there is some of that in there.  Fencing 
is currently done via bash scripts, which alone sucks nearly beyond belief.

> And then there's quorum computation.

Aha!  There is a beautiful solution in the case of ddraid, i.e., any cluster 
with (m of n) redundant shared disks resident on the nodes themselves:

   http://sourceware.org/cluster/ddraid/

For ddraid order 1 and higher, there is no quorum ambiguity at all, because 
you _require_ a quorum of data nodes in order for any node to access the 
cluster filesystem data.  For example, for a five node ddraid distributed 
data cluster, you need four data nodes active or the cluster will only be 
able to sit there stupidly doing nothing.  Four data nodes is therefore the 
quorum group ordained by God.  Non-data nodes can come and go as they please, 
without ever worrying about split brain or other nasty quorum-related 
diseases.
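
In code the rule is a single comparison.  A tiny sketch, assuming (as in 
the example above) that an order-k ddraid over n data nodes tolerates k 
failures; the function name is mine, not ddraid's:

    #include <assert.h>
    #include <stdbool.h>

    /* Quorum exists iff at least n - k of the n data nodes are active. */
    static bool ddraid_has_quorum(int data_nodes, int order, int active)
    {
            return active >= data_nodes - order;
    }

    int main(void)
    {
            assert(ddraid_has_quorum(5, 1, 4));   /* four of five up: quorum */
            assert(!ddraid_has_quorum(5, 1, 3));  /* three up: sit there stupidly */
            return 0;
    }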

> Certainly you could also try looking at it from a membership-centric
> angle, but the piece which coordinates the recovery of the various
> components which makes sure the right kind of membership events are
> delivered in the proper order, and errors during component recovery are
> appropriately handled, is, I think, pretty much distinct from the
> "membership" and a higher level component.

Sorry for the red herring.  Where I wrote "membership" I meant to write 
"cman", that is, cluster management.

> So I'm not sure I'd buy "the membership is what glues it all together"
> on eBay even for a low starting bid.

Though I'm not sure the concept is for sale, your buy-in will be appreciated 
nonetheless, no matter how many limp jokes we need to put up with on the way 
there.

Regards,

Daniel