Les Mikesell wrote:
Roger Heflin wrote:
I can't recall ever being in a position of "having to bring in new
hardware". What scenario forces this issue on you? I haven't
noticed a shortage of vendors who will sell RHEL supported boxes.
But it sounds like you have an interesting job...
More CPU power needed to do the job. And the new boxes aren't
officially RHEL supported (and sometimes won't even boot with the
latest update, but will work with the latest Fedora/kernel.org kernel).
Something faster than IBM could sell you?
At the time, yes; this was before IBM sold AMD stuff, and the early,
though troublesome, Athlons were faster than the Intel stuff.
I had a subset of machines (about 250) all of which had
reached 500+ days of uptime (the uptime counter rolled over).
Wasn't that fixed circa RH8? I had some 7.3 machines roll over twice.
It was pre-RH8.
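For reference, that roughly-500-day figure lines up with a 32-bit tick
(jiffies) counter wrapping; a quick back-of-the-envelope check, assuming
HZ=100, which was the default on those 2.4-era kernels:

    # When does a 32-bit jiffies counter wrap?  HZ=100 assumed (2.4 default).
    HZ = 100
    wrap_days = 2**32 / HZ / 86400
    print(round(wrap_days, 1))   # ~497.1 days, i.e. the "500+ days" rollover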
The issue with all OSes is that no one tests enough to catch these
high-MTBF issues. In a big environment, a machine crashing once per
1000 days of uptime works out to about one machine a day crashing
because of software once you have on the order of a thousand machines.
Typically the enterprise OSes aren't even close to that level, and
while Fedora is worse, it is just not that much worse.
I don't think RH7.3 with its final updates or Centos3.x (where x>1) had
anything approaching a software crash per 1000 days - at least not in
the base system and common services. I mostly skipped the 4.x series
because I didn't trust the early 2.6 kernels at all, but 5.1 seems solid.
Both of them have issues if you are running NFS servers with lots of clients;
other than that they are pretty stable, but if you are relying on NFS heavily
that is a show-stopper. Once you get a working, stable setup, if you really
want stability you don't touch it: no matter how well anyone tests things, they
will miss something, and things get worse the more different applications you
are running, all doing different odd things, each of which may find one of the
bugs no one at Red Hat/SUSE found in their testing.
And on top of that, I have had trivial driver changes in the enterprise OSes
cause huge performance regressions. An FC driver update changed the queue depth
to 64, which dropped throughput to 30% of what it was before on certain external
FC RAID disk arrays. This affected SLES9 SP3 (the 9 SP1/SP2 kernel was OK),
SLES10, any kernel.org kernel with the newer driver, and RHEL4 (all of them at
the time), so no update can be counted on not to cause issues. The error was not
seen by the driver maintainer until they got one of the external arrays to test
with and saw it compared to a competitor's board that was 3x faster under the
newer kernel but almost identical under the older kernel; both RHEL and SLES
testing missed it. To fix it we actually had to update to an unreleased driver
that allowed the queue depth to be changed back down (none of the released
updates at the time fixed it), and wait for an update from SLES. To get this
fixed it was far easier to work with the upstream driver maintainer and get them
to push the update to the enterprise vendors than to try to get the enterprise
vendors to find and fix the problem. I was told by a different upstream
maintainer that the enterprise vendors typically pushed any serious issues
directly to them and did very little with the issues themselves, assuming you
could get past their first-line support people.
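For what it's worth, on reasonably current kernels you can usually inspect (and
sometimes lower) the per-LUN queue depth through sysfs without waiting for a new
driver; a rough sketch, where the device name and target depth are placeholders
and writability depends on the HBA driver:

    # Sketch: read, and optionally lower, the SCSI queue depth for one LUN
    # via sysfs.  "sdb" and 16 are placeholders; whether the attribute is
    # writable depends on the HBA driver, and lowering it needs root.
    from pathlib import Path

    DEV = "sdb"
    TARGET_DEPTH = 16
    attr = Path("/sys/block/%s/device/queue_depth" % DEV)
    print(DEV, "queue depth =", attr.read_text().strip())
    # attr.write_text(str(TARGET_DEPTH))   # uncomment to actually change it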
The big problem is that the testing has to cover all of: it does not crash, it
runs at roughly the same speed as before, and it still gives the same answer;
and even if one runs every test they know about, some configuration will still
slip through for a given setup.
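As a concrete illustration of those three checks, a minimal sketch of the kind
of harness I mean; run_workload(), the baseline numbers, and the 30% speed
tolerance are all placeholders for whatever job you actually care about:

    import time

    def run_workload():
        # stand-in for the real job; replace with the actual workload
        return sum(range(10_000_000))

    BASELINE_SECONDS = 2.0            # timing from the known-good kernel (assumed)
    BASELINE_RESULT = 49999995000000  # answer from the known-good kernel

    def check_update():
        start = time.time()
        try:
            result = run_workload()                # 1. it does not crash
        except Exception as exc:
            return "FAIL: crashed (%s)" % exc
        elapsed = time.time() - start
        if elapsed > 1.3 * BASELINE_SECONDS:       # 2. roughly the same speed
            return "FAIL: slower (%.1fs vs %.1fs)" % (elapsed, BASELINE_SECONDS)
        if result != BASELINE_RESULT:              # 3. still the same answer
            return "FAIL: wrong answer"
        return "OK"

    print(check_update())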
I guess my experience is that even with enterprise updates, at least 25-50% of
the time there is a serious regression (speed, crash, wrong answer), so one has
to carefully consider what is gained by doing that update. The testing required
for a full update is really no better than the testing required to go from F7 to
F8, and Fedora puts out new kernels faster, so getting a fix into the stream is
a lot quicker than with the enterprise OSes. Once you get one that works
correctly on a given piece of HW, you stop updating. Some of the things I have
run into on an update are things you would never have thought to test for, so
you have to watch out on any update, and it is best to only update when
required to.
Some of the customers I used to support typically stayed on what was shipped
with the machine, because their validation procedures were fairly extensive and
the update was not worth it when no useful features were to be gained.