MOO-cows Mailing List Archive


Re: LambdaMOO Upgrades to 1.8.0p5





On Fri, 13 Sep 1996, Judy Anderson wrote:

> Film at 11.
> 
> Sep 13 09:09:27: *** PANIC: Caught signal 10
> Sep 13 09:09:27: #153:task_valid, line 5:  server panic
> Sep 13 09:09:27: ... called from #56789:do_reset (this == #5720), line 8
> Sep 13 09:09:27: ... called from #5678:disfunc (this == #5720), line 18
> Sep 13 09:09:27: ... called from #0:user_disconnected
> user_client_disconnected, line 14
> Sep 13 09:09:27: (End of traceback)
> Sep 13 09:09:28: PANIC-DUMPING on lambda.db.new.PANIC ...
> Sep 13 09:09:48: PANIC-DUMPING: Writing 109330 objects...
> 
> Any ideas?

Hmmm, did it give you a recursive panic? It died referencing a task_id, so it
seems logical it would die during the panic dump too... (and your next mail
seems to imply that :) A while back MetroMOO had similar problems, though our
errors were more often signal 11 (segmentation violation) than signal 10 (bus
error). We got both, though, so you might too, eventually.

Also, MetroMOO was running on Solaris 2.4 at the time, and @version on Lambda
told me you were running Solaris 2.3 -- is this still true? The news told me
that the server was going down for some hardware upgrades as well, which
might have something to do with it... I have to warn you though: MetroMOO's
crashes were with 1.7.9, so if it's the same problem, downgrading probably
won't help.

One of my C manuals states that 'A bus error is almost always caused by a
misaligned read or write.' In C, variables are normally stored at addresses
that are multiples of their size, so a value never spans a word boundary (or
a cache-line boundary). If you try to retrieve a value from the wrong spot
(or too big a value from the right spot), a machine that enforces alignment
-- like the SPARCs Solaris runs on -- gives you a bus error. A segmentation
violation is similar: you get it when you try to access memory that is
invalid -- an address that isn't mapped at all.

However, it's hard to say whether something really went wrong in the MOO
server or in the kernel itself (the (virtual) memory manager, the disk
interface), because a fault in the kernel would look the same :( The fact
that the error only occurred on our setup (last year, at least) made me
believe it was either a hardware or a kernel error, or maybe some strange
incompatibility between the kernel and the MOO. One of our sysadmins claimed
it might have been the SCSI interface that was giving the errors.

Fortunately, our errors disappeared, just like that. Maybe one of the
sysadmins patched the kernel or plugged in more memory/cache; I'm not sure,
since both sysadmins who worked with us back then have changed jobs :)
Anyway, we run Solaris 2.5 now, and it works perfectly... Maybe talk to your
sysadmins, or to Sun, and ask if they know of any strange Solaris bug which
might cause such errors... Maybe try to run a second, small 1.8.0 MOO to see
if that gives similar errors, or try to run the Lambda core (notice the space
:) on a different machine (*shiver*, imagine the FTP session it would take,
and the type of machine on the other end... no, you can't do this at home on
your Linux box :-)

Regards, and please, let us know how it ends ;-)
Thomas







