Fedora Users — Filesystem corruption, data loss on usb drives, and more problems with FC2

Hi all,

I'm posting here first to find out if anyone else is seeing similar problems, before filing these on bugzilla.

System description
==================

Clean, custom (over FTP) install of Fedora2 Final.

System: Dell Optiplex GX-270 (Pentium 4, 2.8GHz, 120 GB Hard disk, 1 GB RAM,
NVidia GeForce 4MX video card, Dell 2000FP LCD, correctly identified and
working at 1600x1200 resolution).  This machine was previously running
Fedora1, but a clean install (not an upgrade) was done.

Filesystems:

planck[~]> df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda2             7.6G  4.6G  2.7G  64% /
/dev/hda6              68G   33M   65G   1% /backups
none                  506M     0  506M   0% /dev/shm
/dev/hda5              29G   33M   27G   1% /scratch/local
/dev/hda3             5.7G  2.2G  3.2G  41% /usr/local
littlewood:/home       19G   15G  2.8G  84% /home
littlewood:/local     4.9G  2.3G  2.3G  50% /local

/home is NFS served by a RedHat9 box, and this setup had been successfully
used for about a year (first with RedHat and then with Fedora1 on the
client).

This system has Hyperthreading activated in the BIOS.  Fedora installed both
an SMP and a UP kernel, and I have tested with both.  All critical bugs listed
below have been reproduced with both kernels.

I am running the stock kernels from the Fedora install, along with the Xorg
'nv' NVidia driver, which works ok (albeit with no 3d acceleration).  So no
issues can be attributed to third-party kernel modules.


BUGS
====

Below I list the bugs I've encountered so far (2 days of use), in decreasing
order of severity.  In 10 years of Linux use, I had NEVER seen a release from
any vendor with data-loss bugs of this kind out of the box.


Root filesystem corruption - CRITICAL, DATA LOSS
------------------------------------------------

/ became 'read-only' after a very large build (rebuilding a large .src.rpm, ~ 3 hours of very cpu-intensive C++ compilation). I have never seen this before. Obviously once the system thought that '/' was read-only, shit hit the fan pretty quickly.

From /var/log/messages:

Jun 3 13:01:51 planck gconfd (root-8244): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only config source at position 0
Jun 3 13:01:51 planck gconfd (root-8244): Resolved address "xml:readwrite:/root/.gconf" to a writable config source at position 1
Jun 3 13:01:51 planck gconfd (root-8

After that point, the log file turns into _binary_ garbage. What is going on? dmesg shows nothing but lines like this:

EXT3-fs error (device hda2) in start_transaction: Journal has aborted

A reboot was fairly painful, and upon restarting '/' required a manual fsck.
Tons and tons of 'yes, fix it' later, the filesystem came back online with a
very full lost+found directory.  Fortunately, all of that was junk from the
rpm build process, so I could safely remove it.  No user data was lost, but
only by pure luck: all lost data was in the /usr/src/redhat directories in this
case.
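
For anyone trying to reproduce this, here is a minimal Python sketch for
spotting the moment a filesystem flips to read-only.  It assumes the kernel's
view in /proc/mounts shows 'ro' in the options field once the journal has
aborted; the function name and the sample text are made up for illustration
and are not output from my machine.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample in /proc/mounts format (illustration only).
SAMPLE = (
    "/dev/hda2 / ext3 ro 0 0\n"
    "/dev/hda3 /usr/local ext3 rw 0 0\n"
)

for mount_point in read_only_mounts(SAMPLE):
    print("read-only:", mount_point)
```

On a live system, the text to check would come from reading /proc/mounts
rather than from the sample string.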


Problems with USB -- CRITICAL, DATA LOSS
----------------------------------------

I use a 512MB USB flash disk for keeping my home and office machines in sync,
with a custom-made rsync script.  This worked for months with RH9 and Fedora
1, without a single glitch ever.  It continues to work fine on a physically
identical Fedora 1 computer connected to the same /home NFS server, so the
issue is with NEITHER the USB flash disk NOR the network/NFS server.

With FC2 I have seen all kinds of different problems, generally causing the rsync process to fail catastrophically. Here is a sample of dmesg output from the last time I tried to use it:

...
scsi3 (0:0): rejecting I/O to offline device
printk: 261 messages suppressed.
Buffer I/O error on device sda2, logical block 1
lost page write due to I/O error on sda2
EXT2-fs error (device sda2): ext2_get_inode: unable to read inode block - inode=16208, block=65540
scsi3 (0:0): rejecting I/O to offline device
scsi3 (0:0): rejecting I/O to offline device
Buffer I/O error on device sda2, logical block 1
lost page write due to I/O error on sda2
EXT2-fs error (device sda2): read_inode_bitmap: Cannot read inode bitmap - block_group = 4, inode_bitmap = 32770
scsi3 (0:0): rejecting I/O to offline device
Buffer I/O error on device sda2, logical block 81281
lost page write due to I/O error on sda2
scsi3 (0:0): rejecting I/O to offline device
Buffer I/O error on device sda2, logical block 81921
lost page write due to I/O error on sda2
Buffer I/O error on device sda2, logical block 81922
lost page write due to I/O error on sda2
Buffer I/O error on device sda2, logical block 81923
lost page write due to I/O error on sda2
Buffer I/O error on device sda2, logical block 81924
lost page write due to I/O error on sda2
scsi3 (0:0): rejecting I/O to offline device
scsi3 (0:0): rejecting I/O to offline device
... # More of the same

From /var/log/messages, here's the start of the mess:

Jun 3 18:05:15 planck kernel: usb 1-6: new high speed USB device using address 3
Jun 3 18:05:15 planck kernel: scsi3 : SCSI emulation for USB Mass Storage devices
Jun 3 18:05:15 planck kernel: Vendor: USB Model: Flash Drive Rev: 1.12
Jun 3 18:05:15 planck kernel: Type: Direct-Access ANSI SCSI revision: 02
Jun 3 18:05:15 planck kernel: SCSI device sda: 1015805 512-byte hdwr sectors (520 MB)
Jun 3 18:05:15 planck kernel: sda: assuming Write Enabled
Jun 3 18:05:15 planck kernel: sda: assuming drive cache: write through
Jun 3 18:05:16 planck kernel: sda: sda1 sda2
Jun 3 18:05:16 planck kernel: Attached scsi removable disk sda at scsi3, channel 0, id 0, lun 0
Jun 3 18:05:16 planck scsi.agent[5065]: disk at /devices/pci0000:00/0000:00:1d.7/usb1/1-6/1-6:1.0/host3/3:0:0:0
Jun 3 18:09:53 planck kernel: usb 1-6: control timeout on ep0in
Jun 3 18:09:53 planck kernel: scsi: Device offlined - not ready after error recovery: host 3 channel 0 id 0 lun 0
Jun 3 18:09:53 planck kernel: SCSI error : <3 0 0 0> return code = 0x50000
Jun 3 18:09:53 planck kernel: end_request: I/O error, dev sda, sector 127490
Jun 3 18:09:53 planck kernel: scsi3 (0:0): rejecting I/O to offline device
Jun 3 18:09:53 planck kernel: Buffer I/O error on device sda2, logical block 24085
Jun 3 18:09:53 planck kernel: lost page write due to I/O error on sda2
Jun 3 18:09:53 planck kernel: scsi3 (0:0): rejecting I/O to offline device
Jun 3 18:09:53 planck last message repeated 123 times
Jun 3 18:09:53 planck kernel: EXT2-fs error (device sda2): read_inode_bitmap: Cannot read inode bitmap - block_group = 17, inode_bitmap = 139266
Jun 3 18:09:53 planck kernel: scsi3 (0:0): rejecting I/O to offline device

And then much more of the same...  The key messages seem to be:

Jun 3 18:09:53 planck kernel: usb 1-6: control timeout on ep0in
Jun 3 18:09:53 planck kernel: scsi: Device offlined - not ready after error recovery: host 3 channel 0 id 0 lun 0

Why the timeout, I don't know.  But after this, there's no hope for the device
to continue working properly.
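
If anyone wants to check their own logs for the same pattern, here is a rough
Python sketch that scans syslog-style lines for the two messages I flagged as
key.  The substrings come straight from my log; the function name is made up,
and I am only assuming that other affected machines would log the same
wording.

```python
# Substrings taken from the two key kernel messages in my log.
KEY_MESSAGES = (
    "control timeout on ep",
    "Device offlined",
)


def find_key_events(log_lines):
    """Return (line_number, line) pairs for lines containing a key message."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(message in line for message in KEY_MESSAGES):
            hits.append((number, line.rstrip("\n")))
    return hits


# A few lines copied from the /var/log/messages excerpt in this report.
EXCERPT = [
    "Jun 3 18:05:16 planck kernel: sda: sda1 sda2",
    "Jun 3 18:09:53 planck kernel: usb 1-6: control timeout on ep0in",
    "Jun 3 18:09:53 planck kernel: scsi: Device offlined - not ready "
    "after error recovery: host 3 channel 0 id 0 lun 0",
    "Jun 3 18:09:53 planck kernel: scsi3 (0:0): rejecting I/O to offline device",
]

for number, line in find_key_events(EXCERPT):
    print(number, line)
```

Feeding it the lines of /var/log/messages instead of the excerpt should show
where the trouble starts, which is more useful than the flood of 'rejecting
I/O' lines that follows.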


OpenOffice won't start -- CRITICAL
----------------------------------

It simply won't start.  The splash screen comes up, and when the progress bar
finishes, it just stays there.  I tried removing the old ~/.openoffice
directory and all other dot-files related to any previous OpenOffice version,
but this did not help.

Starting ooffice from the cli opens up the shell, but from there any attempt
to open one of the actual programs freezes the window.  Starting oowrite,
oocalc, etc, simply shows the splash dialog running to completion, which then
stays in place forever.

This has been tested both for several normal users (whose home dirs are served via NFS) and for root (with no NFS access). The same users can start OO.o on other machines served by the same /home server (those other clients run either RedHat9 or Fedora1).


Display doesn't shut off
------------------------

I have display power management enabled in KDE.  The screen does go black, but
the monitor does not really turn off.  If I manually issue

xset dpms force off

it goes black, but the monitor doesn't detect that it's in 'power save' mode,
where the power light would switch from green to orange.

This used to work just fine in FC1: the display would truly go into power save mode after the specified delay.

The dpms option is correctly specified in the xorg.conf file, so that's not
the problem.

=============================================================================

I'd very much appreciate any feedback on any of these, before I file bug reports in bugzilla.

Cheers,

Fernando Perez.