At Wit's End on system freezes

Frank da Cruz (fdc@watsun.cc.columbia.edu)
Tue, 24 Nov 1998 14:50:20 EST


> Hi,
> 
> On Tue, Nov 24, 1998 at 04:18:40PM +0000, Frank da Cruz wrote:
> > To isolate the problem, you must separate Kermit and mgetty.  My long
> > experience has told me that bidirectional terminal ports on Unix rarely
> > work as desired, and when problems such as this occur, they can almost
> > always be attributed to the bidirectionality.  Remove that and the problems
> > go away.
> 
> I strongly disagree.
> 
> I *wrote* mgetty to make bidirectional port usage work robustly and
> reliably, and can say that I succeeded.  My SCO OpenServer 3.0 system
> has served three modems, bidirectionally, for about 5 years now, without a
> single problem that wasn't caused by bad hardware.  Some of our systems at
> work have really busy lines (about 200 outgoing faxes and 100 incoming
> data calls on a given modem per day, for a total of > 1000 faxes a day),
> and I don't see any system locks there either.
> 
Sorry if I gave offense.  I was speaking from many years of experience with
complaints about bidirectional ttys in all forms of UNIX -- SunOS, Solaris,
SCO, HP-UX, etc., etc. -- which have a pretty sorry history.  I would be
delighted if mgetty is solid.  I should add that on some of these systems,
problems related to bidirectional ports could have symptoms all the way up
to and including kernel panics (e.g. in Solaris 2.3).

> [OTOH, to nail down the problem, it might be a good idea to run the ports
> unidirectionally for a couple of days and see what will happen.]
> 
> What is known to cause problems like the ones observed:
> 
>  - compiling linux 2.0.x kernels with C compilers other than gcc 2.7.2.3
>    (egcs for sure breaks 2.0 kernels [because kernel ASM and egcs ASM just
>    don't work together, I'm not blaming anyone, just stating facts]).
> 
>    It's *unlikely* that this is the reason here, because a freeze caused
>    by the compiler wouldn't go away when you do something to the modem.
> 
>  - bad modem cabling - if you have noise on the cable, especially the
>    RTS/CTS lines, and weak signals, the serial port might generate too
>    many IRQs and things might freeze.   This is more likely, as it
>    will be influenced by switching off the modem.
> 
And for that matter, the stupid PC architecture itself, with its shortage of
interrupts.  (I think if IBM had known that this would be the architecture for
all eternity, they might have done things a bit differently...)

>    Actually, I *don't think* this is the cause.  Why?  The Rocketport
>    card does not *use* an IRQ [as far as I know] and thus the system
>    itself shouldn't be affected by RTS/CTS noise at all.  "Bad things" on
>    the modem lines could freeze the serial ports, but not the host
>    computer.
> 
>  - bad modems.  I'm not really sure how that could lead to a system
>    freeze, but I've had bad experience with Zoom in the past.  If you
>    can, borrow a handful of USR Courier modems somewhere, and see whether
>    that changes things.  From my experience, USR Couriers are absolutely
>    perfect for 24x7 mostly-data operations.
> 
Preferably external ones.  Internal USR modems these days are problematic.
Many of them are not even modems at all.  I recently bought a PC that comes
with a no-name Winmodem as standard equipment, but with an extra-cost option
to replace it with a "real modem" from USR.  It was a Winmodem too.  (I know
that's not the problem in this case, but as far as I'm concerned, USR ==
quality applies only to external modems).

> [..]
> > : out.  I obtained the latest and greatest RocketPort driver.  The
> > : behavior stayed the same.  I played with the initialization strings in
> > : mgetty and the behavior stayed the same.
> > :
> > This is a critical area.  There might very well be a setting that makes
> > the problem go away, but you didn't hit upon it in your experimentation.
> > For example, did you try all possible DSR-behavior selections?  (&S0,
> > &S1, &S2, ...)
> 
> DSR should be set to "on at all times".  More interesting is DTR sensitivity.
> I usually use AT&D2 or &D3, but some modems don't like AT&D3 and lock up.
> 
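When experimenting with the &S and &D settings, it can help to look at what
the port itself reports for DSR, DCD and the other control lines rather than
guessing.  The following is only a minimal sketch, assuming Linux and
Python's standard fcntl and termios modules; the function names and the
device path in the comment are illustrative and are not part of mgetty,
Kermit, or the RocketPort driver.

```python
import fcntl
import struct
import termios

# Modem-control bits as reported by the TIOCMGET ioctl (constants from termios).
LINE_BITS = {
    "DTR": termios.TIOCM_DTR,
    "RTS": termios.TIOCM_RTS,
    "CTS": termios.TIOCM_CTS,
    "DCD": termios.TIOCM_CAR,
    "DSR": termios.TIOCM_DSR,
}


def decode_modem_lines(bits):
    """Turn a TIOCMGET bitmask into a {signal name: asserted?} dict."""
    return {name: bool(bits & mask) for name, mask in LINE_BITS.items()}


def read_modem_lines(fd):
    """Read the current modem-control lines of an already-open tty."""
    buf = fcntl.ioctl(fd, termios.TIOCMGET, struct.pack("I", 0))
    return decode_modem_lines(struct.unpack("I", buf)[0])


# Hypothetical usage (the device name is an assumption; adjust for your system):
#   import os
#   fd = os.open("/dev/ttyR0", os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
#   print(read_modem_lines(fd))
#   os.close(fd)
```

With the modem idle and set for "DSR always on", one would expect DSR to be
reported as asserted and DCD as not asserted; anything else suggests the
modem's saved profile does not match what was intended.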
> [..]
> > : In the middle of all this, I
> > : discovered that if I turned on and off the offending modem my system
> > : "unfroze" without rebooting.  This saved a lot of time and frustration
> > : for all concerned.
> > :
> > Because it resets the modem to its factory or saved state, which agrees
> > with what mgetty and the port drivers need.
> 
> The port driver shouldn't care about the state the modem is in.
> 
> Mgetty cares, but all it will do if the modem is hosed is to complain 
> into the log files (possibly filling up your disk), but never "freeze 
> the system" - this isn't possible for a user-mode program.
> 
That's what I said :-)

> [..]
> > : Sometimes when the script is restarted and the modem is then accessed
> > : the system freezes, as if the modem is already being used but no lock
> > : file exists.
> > :
> > When you say "the system freezes", do you mean the entire system, or do you
> > mean the process that is trying to open the modem?  If you mean the whole
> > system, then there is a serious problem in the system itself, since no
> > user program should be able to freeze Linux.
> 
> Yes, exactly.  You're voicing my question :-) -- I could easily imagine a
> frozen process trying to access the serial port, but hardly a frozen
> system (especially not a "frozen system that unfreezes if the modem is
> switched off").
> 
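To tell "a process is stuck on the port" from "the port is locked by
something that has gone away", one could inspect the lock file directly.
This is a rough sketch only, assuming UUCP-style lock files that hold the
owner's PID as ASCII text (some systems write a binary PID instead, which
this reports as "unreadable"); the function names and the lock path in the
comment are hypothetical.

```python
import os


def lock_owner(lock_path):
    """Return the PID stored in an ASCII lock file, or None if not parseable."""
    try:
        with open(lock_path, "rb") as f:
            data = f.read().strip()
    except FileNotFoundError:
        return None
    try:
        pid = int(data)
    except ValueError:
        return None
    return pid if pid > 0 else None


def process_alive(pid):
    """Probe a PID with signal 0, which checks for existence without signalling."""
    try:
        os.kill(pid, 0)
    except (ProcessLookupError, OverflowError):
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_state(lock_path):
    """Classify a lock file as 'free', 'held', 'stale', or 'unreadable'."""
    if not os.path.exists(lock_path):
        return "free"
    pid = lock_owner(lock_path)
    if pid is None:
        return "unreadable"
    return "held" if process_alive(pid) else "stale"


# Hypothetical usage (the path is an assumption; lock directories vary):
#   print(lock_state("/var/lock/LCK..ttyR0"))
```

A result of "free" while opening the device still hangs would point away
from lock-file handling and toward the driver or the modem signals.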
> [..]
> > So again, my first recommendation would be to separate the inbound and
> > outbound modems.  Configure the inbound modems for answering calls, and let
> > Kermit handle the outbound modems.  Pay very careful attention to the modem
> > signal configurations on the modems (&Sn, &Cn, &Dn, etc), which MUST agree
> > in every respect with what your port drivers require.
> 
> Actually, this is a good start for nailing down whether the problem is
> caused by inbound calls, outbound calls, or the combination of both.
> 
> gert
>
Glad to meet you!

- Frank