-----Original Message-----
From: fedora-test-list-bounces@xxxxxxxxxx [mailto:fedora-test-list-bounces@xxxxxxxxxx] On Behalf Of Alan Cox
Sent: Tuesday, April 27, 2004 12:58 PM
To: For testers of Fedora Core development releases
Subject: Re: OT - Journaling File Systems?

> On Tue, Apr 27, 2004 at 12:03:35PM -0500, Edwards, Scott (MED, Kelly IT Resources) wrote:
>> After the second 'plug pull' it took 1 minute and 16 seconds to boot.
>> But it claimed corrupted metadata and that the superblock was trashed
>> and could not even mount the partition. I found this comment in the
>> Gentoo installation instructions.
>
> XFS caches lots of data, so you can lose stuff if you don't sync it in
> bigger chunks than you would expect. Corrupt metadata suggests a bug.
> You may want to file it upstream at bugs.kernel.org. XFS should only
> ever lose unsynced data.
>
>> I'm completely confused now. I have been under the impression that this
>> was the main purpose of these (journaling) file systems. I knew a guy
>> who worked on BeOS, and he claimed that you could flick the power on and
>> off all day and it wouldn't lose data. Am I doing something wrong? Do
>> I need to set a different mode or something on these file systems so
>> that they can recover?
>
> Ext3 gets intensive testing, as does JFFS2 on flash (people have done
> thousands of random reboot tests on them). Red Hat doesn't do much with
> reiserfs, although the code should match the upstream tree.

I finally managed to turn off the write caching on the SATA drives (see the patch below) and have re-run the tests on all four journaling file systems. Once again I am completely stumped by the results.

The test is really pretty simple. The test box is hooked to a machine that cycles its power: the system runs for 5 minutes, then the power is cut for 1 minute (to simulate the plug being pulled). For this last set of tests the box had three hard drives connected: one parallel IDE and two Serial ATA.
The system boots a minimal install of Fedora Core 2 from the IDE drive. No tests are run on the IDE drive. From rc.local it starts the fsstress test (http://ltp.sf.net/nfs/fsstress.tgz) and the three scripts below (which simulate writing into a log file) on each of the SATA drives.

Ext3 has an almost perfect record with the write cache off: I have run over 300 cycles on each of the two drives and only had two corrupted lines in the output files. So out of 600 total cycles across the two drives there were only two lines with bad data; I think that is a pretty good record.

None of the other journaling file systems came anywhere near this. After 3 or 4 power cycles, ReiserFS became corrupted to the point that the system would not boot (the fsck failed and the bootup stopped there). XFS never got corrupted to the point that it wouldn't boot, but with approximately 100 power cycles on each drive, one drive had 73 corrupted lines and the other had 82. With JFS, after 15 power cycles one of the drives was corrupted and the system would no longer boot (fsck failed again).

I just can't understand what is happening; it makes no sense to me that one file system would be almost perfect and three would fail so dramatically. I am going to re-run the tests on all 4 file systems to verify that the results are repeatable. Should I report these problems to the upstream projects (Reiser, XFS, JFS)?
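For what it's worth, a corrupted line in the output files can be detected mechanically: every good line is plain `date` output, so anything that no longer parses as a date is a torn or garbled write. A minimal sketch, assuming GNU date's -d option; the log name fs.log and its contents are stand-ins, not real test output:

```shell
# Count lines in a test log that no longer parse as `date` output.
# fs.log and its contents are stand-ins for the real output files.
LOGFILE=fs.log

# Fake log: two intact lines and one simulated torn write.
printf '%s\n' 'Tue Jun 15 20:36:09 UTC 2004' \
              'Tue Jun 15 20:3###garbage' \
              'Wed Jun 16 20:36:09 UTC 2004' > "$LOGFILE"

bad=0
total=0
while IFS= read -r line; do
    total=$((total + 1))
    # GNU date -d exits non-zero when the line is not a parseable date.
    date -d "$line" > /dev/null 2>&1 || bad=$((bad + 1))
done < "$LOGFILE"

echo "$bad corrupted of $total lines"
```

Pointing the same loop at the real output files is how the per-drive corrupted-line counts were tallied.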
Thanks,
-Scott

--- linux-2.6.5-1.358.orig/include/linux/ata.h	2004-05-08 06:56:41.000000000 -0600
+++ linux-2.6.5-1.358.dwc/include/linux/ata.h	2004-06-15 20:36:09.924515048 -0600
@@ -129,6 +129,8 @@
 	XFER_UDMA_0		= 0x40,
 	XFER_PIO_4		= 0x0C,
 	XFER_PIO_3		= 0x0B,
+	ENABLE_WRITE_CACHE	= 0x02,
+	DISABLE_WRITE_CACHE	= 0x82,
 
 	/* ATAPI stuff */
 	ATAPI_PKT_DMA		= (1 << 0),
--- linux-2.6.5-1.358.orig/drivers/scsi/libata-core.c	2004-05-08 06:56:41.000000000 -0600
+++ linux-2.6.5-1.358.dwc/drivers/scsi/libata-core.c	2004-06-15 20:40:24.703782712 -0600
@@ -58,6 +58,7 @@
 static void ata_host_set_udma(struct ata_port *ap);
 static void ata_dev_set_pio(struct ata_port *ap, unsigned int device);
 static void ata_dev_set_udma(struct ata_port *ap, unsigned int device);
+static void ata_dev_disable_wcache(struct ata_port *ap, unsigned int device);
 static void ata_set_mode(struct ata_port *ap);
 
 static unsigned int ata_unique_id = 1;
@@ -1093,14 +1094,20 @@
 		dev->n_sectors = ata_id_u32(dev, 60);
 	}
 
+	/* IDENTIFY word 85, bit 5: write cache is currently enabled */
+	if (dev->id[85] & (1 << 5)) {
+		dev->flags |= ATA_DFLAG_WCACHE;
+	}
+
 	ap->host->max_cmd_len = 16;
 
 	/* print device info to dmesg */
-	printk(KERN_INFO "ata%u: dev %u ATA, max %s, %Lu sectors%s\n",
+	printk(KERN_INFO
+	       "ata%u: dev %u ATA, max %s, %Lu sectors%s wcache:%s\n",
 	       ap->id, device,
 	       ata_udma_string(udma_modes),
 	       (unsigned long long)dev->n_sectors,
-	       dev->flags & ATA_DFLAG_LBA48 ? " (lba48)" : "");
+	       dev->flags & ATA_DFLAG_LBA48 ? " (lba48)" : "",
+	       dev->flags & ATA_DFLAG_WCACHE ? "on" : "off");
 }
 
 /* ATAPI-specific feature tests */
@@ -1279,6 +1286,12 @@
 		ata_dev_set_udma(ap, 1);
 	}
 
+	if (ap->device[0].flags & ATA_DFLAG_WCACHE)
+		ata_dev_disable_wcache(ap, 0);
+
+	if (ap->device[1].flags & ATA_DFLAG_WCACHE)
+		ata_dev_disable_wcache(ap, 1);
+
 	if (ap->flags & ATA_FLAG_PORT_DISABLED)
 		return;
@@ -1703,6 +1716,43 @@
 	DPRINTK("EXIT\n");
 }
 
+static void ata_dev_disable_wcache(struct ata_port *ap, unsigned int device)
+{
+	struct ata_taskfile tf;
+	struct ata_device *dev = &ap->device[device];
+
+	if (!ata_dev_present(dev) || (ap->flags & ATA_FLAG_PORT_DISABLED))
+		return;
+
+	printk(KERN_INFO "disabling write cache on dev: %u\n", device);
+
+	/* set up a SET FEATURES taskfile; the subcommand (0x82 =
+	 * disable write cache) goes in the feature register */
+	DPRINTK("set features - disable write cache\n");
+	ata_tf_init(ap, &tf, dev->devno);
+	tf.ctl |= ATA_NIEN;
+	tf.command = ATA_CMD_SET_FEATURES;
+	tf.feature = DISABLE_WRITE_CACHE;
+	tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
+	tf.protocol = ATA_PROT_NODATA;
+
+	/* issue the command */
+	ata_tf_to_host(ap, &tf);
+
+	/* crazy ATAPI devices... */
+	if (dev->class == ATA_DEV_ATAPI)
+		msleep(150);
+
+	ata_busy_sleep(ap, ATA_TMOUT_BOOT_QUICK, ATA_TMOUT_BOOT);
+
+	ata_irq_on(ap);	/* re-enable interrupts */
+
+	ata_wait_idle(ap);
+
+	DPRINTK("EXIT\n");
+}
+
 /**
  *	ata_dev_set_udma -
  *	@ap:

sleep1.sh:

for ((a = 0; a < 1000000000; a++))
do
	sleep 1
	date >> "$1"
done
exit

sleep100.sh:

for ((a = 0; a < 1000000000; a++))
do
	sleep 0.1
	date >> "$1"
done
exit

nosleep.sh:

for ((a = 0; a < 1000000000; a++))
do
	date >> "$1"
done
exit