-----Original Message-----
From: fedora-test-list-bounces@xxxxxxxxxx [mailto:fedora-test-list-bounces@xxxxxxxxxx] On Behalf Of Alan Cox
Sent: Tuesday, April 27, 2004 12:58 PM
To: For testers of Fedora Core development releases
Subject: Re: OT - Journaling File Systems?

> On Tue, Apr 27, 2004 at 12:03:35PM -0500, Edwards, Scott (MED, Kelly IT Resources) wrote:
>> After the second 'plug pull' it took 1 minute and 16 seconds to boot.
>> But it claimed corrupted metadata and that the superblock was trashed
>> and could not even mount the partition. I found this comment in the
>> Gentoo installation instructions.
>
> XFS caches lots of data, so you can lose stuff if you don't sync it in
> bigger chunks than you would expect. Corrupt metadata suggests a bug.
> You may want to file it upstream at bugs.kernel.org. XFS should only
> ever lose unsynced data.
>
>> I'm completely confused now. I have been under the impression that this
>> was the main purpose of these (journaling) file systems. I knew a guy
>> who worked on BeOS, and he claimed that you could flick the power on and
>> off all day and it wouldn't lose data. Am I doing something wrong? Do
>> I need to set a different mode or something on these file systems so
>> that they can recover?
>
> Ext3 gets intensive testing, as does JFFS2 on flash (people have done
> thousands of random reboot tests on them). Red Hat doesn't do much with
> reiserfs, although the code should match the upstream tree.

I finally managed to turn off the write caching on the SATA drives (see the patch below) and have re-run the tests on all four journaling file systems. Once again I am completely stumped by the results.

The test is really pretty simple. The test box is hooked to a machine that cycles its power: the system runs for 5 minutes, then the power is cut for 1 minute (to simulate the plug being pulled). For this last set of tests the box had three hard drives connected: one parallel IDE and two Serial ATA.
The system boots a minimal install of Fedora Core 2 from the IDE drive. No tests are run on the IDE drive. From rc.local it starts the fsstress test (http://ltp.sf.net/nfs/fsstress.tgz) and the three scripts below (which simulate writing into a log file) on each of the SATA drives.

Ext3 has an almost perfect record with the write cache off: I have run over 300 cycles on each of the two drives and only had two corrupted lines in the output files. So out of 600 total cycles across the two drives there were only two lines with bad data; I think that is a pretty good record.

None of the other journaling file systems came anywhere near this. After 3 or 4 power cycles, ReiserFS became corrupted to the point that the system would not boot (the fsck failed and the bootup stopped there). XFS never got corrupted to the point that it wouldn't boot, but with approximately 100 power cycles on each drive, one drive had 73 corrupted lines and the other had 82. With JFS, after 15 power cycles one of the drives was corrupted and the system would no longer boot (fsck failed again).

I just can't understand what is happening; it makes no sense to me that one file system would be almost perfect and three would fail so dramatically. I am going to re-run the tests on all 4 file systems to verify that the results are repeatable. Should I report these problems to the upstream projects (Reiser, XFS, JFS)?
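For what it's worth, a corrupted line in the output files can be detected mechanically: every good line is plain `date` output, so anything that no longer parses as a date is a torn or garbled write. A minimal sketch, assuming GNU date's -d option; the log name fs.log and its contents are stand-ins, not real test output:

```shell
# Count lines in a test log that no longer parse as `date` output.
# fs.log and its contents are stand-ins for the real output files.
LOGFILE=fs.log

# Fake log: two intact lines and one simulated torn write.
printf '%s\n' 'Tue Jun 15 20:36:09 UTC 2004' \
              'Tue Jun 15 20:3###garbage' \
              'Wed Jun 16 20:36:09 UTC 2004' > "$LOGFILE"

bad=0
total=0
while IFS= read -r line; do
    total=$((total + 1))
    # GNU date -d exits non-zero when the line is not a parseable date.
    date -d "$line" > /dev/null 2>&1 || bad=$((bad + 1))
done < "$LOGFILE"

echo "$bad corrupted of $total lines"
```

Pointing the same loop at the real output files is how the per-drive corrupted-line counts were tallied.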
Thanks,
-Scott

--- linux-2.6.5-1.358.orig/include/linux/ata.h	2004-05-08 06:56:41.000000000 -0600
+++ linux-2.6.5-1.358.dwc/include/linux/ata.h	2004-06-15 20:36:09.924515048 -0600
@@ -129,6 +129,8 @@
 	XFER_UDMA_0		= 0x40,
 	XFER_PIO_4		= 0x0C,
 	XFER_PIO_3		= 0x0B,
+	ENABLE_WRITE_CACHE	= 0x02,
+	DISABLE_WRITE_CACHE	= 0x82,
 
 	/* ATAPI stuff */
 	ATAPI_PKT_DMA		= (1 << 0),
--- linux-2.6.5-1.358.orig/drivers/scsi/libata-core.c	2004-05-08 06:56:41.000000000 -0600
+++ linux-2.6.5-1.358.dwc/drivers/scsi/libata-core.c	2004-06-15 20:40:24.703782712 -0600
@@ -58,6 +58,7 @@
 static void ata_host_set_udma(struct ata_port *ap);
 static void ata_dev_set_pio(struct ata_port *ap, unsigned int device);
 static void ata_dev_set_udma(struct ata_port *ap, unsigned int device);
+static void ata_dev_disable_wcache(struct ata_port *ap, unsigned int device);
 static void ata_set_mode(struct ata_port *ap);
 
 static unsigned int ata_unique_id = 1;
@@ -1093,14 +1094,20 @@
 		dev->n_sectors = ata_id_u32(dev, 60);
 	}
 
+	/* IDENTIFY word 85, bit 5: write cache is currently enabled */
+	if (dev->id[85] & (1 << 5)) {
+		dev->flags |= ATA_DFLAG_WCACHE;
+	}
+
 	ap->host->max_cmd_len = 16;
 
 	/* print device info to dmesg */
-	printk(KERN_INFO "ata%u: dev %u ATA, max %s, %Lu sectors%s\n",
+	printk(KERN_INFO
+	       "ata%u: dev %u ATA, max %s, %Lu sectors%s wcache:%s\n",
 	       ap->id, device,
 	       ata_udma_string(udma_modes),
 	       (unsigned long long)dev->n_sectors,
-	       dev->flags & ATA_DFLAG_LBA48 ? " (lba48)" : "");
+	       dev->flags & ATA_DFLAG_LBA48 ? " (lba48)" : "",
+	       dev->flags & ATA_DFLAG_WCACHE ? "on" : "off");
 }
 
 /* ATAPI-specific feature tests */
@@ -1279,6 +1286,12 @@
 		ata_dev_set_udma(ap, 1);
 	}
 
+	if (ap->device[0].flags & ATA_DFLAG_WCACHE)
+		ata_dev_disable_wcache(ap, 0);
+
+	if (ap->device[1].flags & ATA_DFLAG_WCACHE)
+		ata_dev_disable_wcache(ap, 1);
+
 	if (ap->flags & ATA_FLAG_PORT_DISABLED)
 		return;
@@ -1703,6 +1716,43 @@
 	DPRINTK("EXIT\n");
 }
 
+static void ata_dev_disable_wcache(struct ata_port *ap, unsigned int device)
+{
+	struct ata_taskfile tf;
+	struct ata_device *dev = &ap->device[device];
+
+	if (!ata_dev_present(dev) || (ap->flags & ATA_FLAG_PORT_DISABLED))
+		return;
+
+	printk(KERN_INFO "disabling write cache on dev: %u\n", device);
+
+	/* set up a SET FEATURES taskfile; the subcommand (0x82 =
+	 * disable write cache) goes in the feature register */
+	DPRINTK("set features - disable write cache\n");
+	ata_tf_init(ap, &tf, dev->devno);
+	tf.ctl |= ATA_NIEN;
+	tf.command = ATA_CMD_SET_FEATURES;
+	tf.feature = DISABLE_WRITE_CACHE;
+	tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
+	tf.protocol = ATA_PROT_NODATA;
+
+	/* issue the command */
+	ata_tf_to_host(ap, &tf);
+
+	/* crazy ATAPI devices... */
+	if (dev->class == ATA_DEV_ATAPI)
+		msleep(150);
+
+	ata_busy_sleep(ap, ATA_TMOUT_BOOT_QUICK, ATA_TMOUT_BOOT);
+
+	ata_irq_on(ap);	/* re-enable interrupts */
+
+	ata_wait_idle(ap);
+
+	DPRINTK("EXIT\n");
+}
+
 /**
  *	ata_dev_set_udma -
  *	@ap:

sleep1.sh:

for ((a = 0; a < 1000000000; a++))
do
	sleep 1
	date >> "$1"
done
exit

sleep100.sh:

for ((a = 0; a < 1000000000; a++))
do
	sleep 0.1
	date >> "$1"
done
exit

nosleep.sh:

for ((a = 0; a < 1000000000; a++))
do
	date >> "$1"
done
exit