MOO-cows Mailing List Archive


Re: mysterious LambdaMOO startup stall



>         I wrote to the MOO-Cows mailing list, but not a peep has
> come back.

I saw the posting, but until now I didn't have enough information to even
propose a theory about what's happening.

> Apr 24 23:39:31: VALIDATE: Phase 1: Check for invalid objects ...
> Apr 24 23:40:11: VALIDATE: Phase 2: Check for cycles ...
> Apr 24 23:41:17: VALIDATE: Phase 3: Check for inconsistencies ...
> 
> as you can see, the first two phases take about a minute. Then
> the third phase trips merrily along --- at first.
> The first 84,000 objects take less than a minute per thousand.
> Then suddenly it starts to take 3 HOURS per thousand.

This suggests a theory to me, but to explain it I'll have to describe something
of how the parent/children and location/contents hierarchies are actually
represented in memory.

For each hierarchy, each object has three object-number fields:
	1) this object's parent in the hierarchy,
	2) this object's *first* child in the hierarchy, and
	3) the *next* sibling of this object in the list of
	   children of this object's parent in the hierarchy

For example, suppose the object A had these values for the fields relating to
the parent/children hierarchy:

	A.parent = B
	 .child = C
	 .sibling = D

Then A and D are adjacent members of B's children list, and C is the first
member of A's children list (we would follow the chain of `sibling' fields
starting from C to discover all of A's children).
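
To make the layout concrete, here is a minimal Python sketch of the three
fields and of walking a children list.  It is only an illustration, not the
server's code: the names Obj, children and add_child are made up, None stands
in for "no object", and add_child pushes onto the front of the list purely for
brevity.

```python
class Obj:
    """One object's links in a single hierarchy (hypothetical names)."""
    def __init__(self):
        self.parent = None   # this object's parent
        self.child = None    # this object's *first* child
        self.sibling = None  # next object on the parent's children list


def children(db, parent):
    """Yield the members of parent's children list, in list order."""
    n = db[parent].child
    while n is not None:
        yield n
        n = db[n].sibling


def add_child(db, parent, kid):
    """Push kid onto the front of parent's children list (an assumption)."""
    db[kid].parent = parent
    db[kid].sibling = db[parent].child
    db[parent].child = kid


# Rebuild the example above: A.parent = B, A.child = C, A.sibling = D.
db = {name: Obj() for name in "ABCD"}
add_child(db, "B", "D")
add_child(db, "B", "A")
add_child(db, "A", "C")

print(list(children(db, "B")))  # ['A', 'D']
print(list(children(db, "A")))  # ['C']
```

Note that finding all of an object's children means starting at its `child'
field and then following `sibling' fields one at a time; there is no way to
jump into the middle of the list.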

Note that the parent and child links in the hierarchy are stored separately and
can therefore, theoretically, get out of sync.  For each hierarchy, phase 3 of
the server's validation suite checks that these two conditions hold:

	1) If A.parent = B, then A appears somewhere in B's children list, and
	2) If A appears somewhere in B's children list, then A.parent = B

For each object, then, phase 3 performs four tests (two tests each for two
hierarchies), each of which is linear in the length of some object's children
or contents list.
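
The two conditions can be sketched for a single hierarchy as follows.  This is
a hedged illustration rather than the server's actual validation code: the
function name check_hierarchy, the dictionary layout and the field names are
all assumptions, and it relies on phase 2 having already ruled out cycles.

```python
def check_hierarchy(db, up, first, nxt):
    """Return (kind, object, other) tuples for each inconsistency found.

    up, first and nxt name the parent, first-child and next-sibling
    fields of the hierarchy being checked (hypothetical field names).
    """
    def members(o):
        # Walk o's list by following the next-sibling links.
        n = db[o][first]
        while n is not None:
            yield n
            n = db[n][nxt]

    problems = []
    for o in db:
        p = db[o][up]
        # Condition 1: o names p as its parent, so o must be on p's list.
        # This is a linear search of p's list.
        if p is not None and o not in members(p):
            problems.append(("missing from list", o, p))
        # Condition 2: everything on o's list must name o as its parent.
        # This is a linear walk of o's own list.
        for c in members(o):
            if db[c][up] != o:
                problems.append(("wrong parent", c, o))
    return problems


# The A/B/C/D example from above, as plain dictionaries.
db = {
    "B": {"parent": None, "child": "A", "sibling": None},
    "A": {"parent": "B", "child": "C", "sibling": "D"},
    "D": {"parent": "B", "child": None, "sibling": None},
    "C": {"parent": "A", "child": None, "sibling": None},
}
clean = check_hierarchy(db, "parent", "child", "sibling")

# Now let D's parent field drift out of sync with B's children list.
db["D"]["parent"] = "A"
broken = check_hierarchy(db, "parent", "child", "sibling")
```

Calling the same function a second time with the location/contents field names
would give the four tests per object mentioned above.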

Now, let's get back to the behavior you've seen.  Suppose that nearly *all* of
the roughly 27,000 objects above #84000 have the same parent and/or
location.  That is, suppose there's some one object with roughly 27,000
children and/or contents and that nearly all of those 27,000 are above #84000.
Then, for each of those 27,000 objects, the server would have to do at least
one linear traversal of a 27,000-object list.  27,000 squared is about 729
million, which is a *lot*, and this lot could perhaps account for the huge
amount of time the validation is (was?) taking.
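
As a back-of-the-envelope check on that arithmetic, the sketch below counts
sibling links followed when every one of n objects has to be found on the same
n-long list.  The function name is made up, and it assumes the search stops as
soon as the object is found, which I have not verified against the server
source.

```python
def membership_steps(n):
    """Links followed to find each of n children on one n-long list.

    Finding the child at position k (1-based) costs k steps, so the
    total is 1 + 2 + ... + n.
    """
    return n * (n + 1) // 2


n = 27_000
print(membership_steps(n))  # 364513500
print(n * n)                # 729000000 (upper bound: full walk every time)
```

Either way the count is in the hundreds of millions for that one list, against
only a few thousand steps if the same objects were spread over many short
lists.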

You said in your MOO-Cows posting that this behavior happens under 1.7.8p4 as
well as 1.8.0p4, but that it only started happening recently.  How long has it
been since you rebooted your server?  Could it be that something like what I
described above has happened *just since that previous reboot*?

This is the best theory I've come up with so far.  If you were to confirm its
correctness and then claim that the server's current children/contents list
representation is perhaps sub-optimal for lists as large as these, I think I
would not disagree... :-)

If I'm right, then after you wait the requisite 60 more hours, you should
(eventually) be able to fix the problem by somehow breaking up this mob of
objects into smaller sub-mobs.  It may take a while to do that, though... :-)

	Pavel


