Re: OCFS2 Filesystem inconsistency across nodes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

We are testing OCFS2 on a 2 node cluster.  Both the nodes are Sun v40z
with OpenSuSE10.1 Beta3 (x86_64) installed (kernel-smp-2.6.16_rc2_git5-3)
OCFS2 version in OpenSuSE kernel is 1.1.7-SLES.  OCFS2 fences the
system with the default heartbeat threshold.  So heartbeat threshold
is configured to 31 on both the nodes (O2CB_HEARTBEAT_THRESHOLD=31).
Using Sun StorEdge 6920 for shared storage. Have also setup multipath
on both the nodes. Formatted a 200GB shared volume with the following
command:

    mkfs.ocfs2 -b 4K -C 32K -N 4 -L Raid5 /dev/mapper/3600015d00004b200000000000000050e-part1

Performed the following 2 tests:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mounted the formatted partition on both the nodes.
    sun5:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2
    sun4:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2

Created a tar file on Sun5.
    sun5:~ # tar cvf /raid5/test.tar /etc

File viewable on both the nodes.
    sun5:~ # ls -l /raid5/test.tar
    -rw-r--r-- 1 root root 21749760 Feb 13 19:57 /raid5/test.tar
    sun4:~ # ls -l /raid5/test.tar
    -rw-r--r-- 1 root root 21749760 Feb 13 19:57 /raid5/test.tar

Created a file on sun4 using touch. 
    sun4:~ # touch /raid5/abcd
    touch: cannot touch `/raid5/abcd': Input/output error

OCFS2 marks the filesystem read only on sun4. Relevant messages from
/var/log/messages of sun4:

<----------------------------------------------------------->
<Sun4 /var/log/messages>
<----------------------------------------------------------->
Feb 13 19:56:18 sun4 klogd: (4200,0):o2net_set_nn_state:418 connected to node sun5 (num 0) at 196.1.1.240:7777
Feb 13 19:56:22 sun4 klogd: OCFS2 1.1.7-SLES Mon Jan 16 11:58:10 PST 2006 (build sles)
Feb 13 19:56:22 sun4 klogd: (4784,3):ocfs2_initialize_super:1379 max_slots for this device: 4
Feb 13 19:56:22 sun4 klogd: (4784,3):ocfs2_fill_local_node_info:1044 I am node 2
Feb 13 19:56:22 sun4 klogd: (4784,3):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 19:56:22 sun4 klogd: (4784,3):__dlm_print_nodes:388  node 0
Feb 13 19:56:22 sun4 klogd: (4784,3):__dlm_print_nodes:388  node 2
Feb 13 19:56:22 sun4 klogd: (4784,3):ocfs2_find_slot:267 taking node slot 1
Feb 13 19:56:22 sun4 klogd: kjournald starting.  Commit interval 5 seconds
Feb 13 19:56:22 sun4 klogd: ocfs2: Mounting device (253,4) on (node 2, slot 1)
Feb 13 19:58:30 sun4 klogd: OCFS2: ERROR (device dm-4): ocfs2_search_chain: Group Descriptor # 4485090715960753726 has bad signature >>>>>>>
Feb 13 19:58:30 sun4 klogd: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
Feb 13 19:58:30 sun4 klogd: (4890,0):ocfs2_claim_suballoc_bits:1157 ERROR: status = -5
Feb 13 19:58:30 sun4 klogd: (4890,0):ocfs2_claim_new_inode:1255 ERROR: status = -5
Feb 13 19:58:30 sun4 klogd: (4890,0):ocfs2_mknod_locked:479 ERROR: status = -5
Feb 13 19:58:30 sun4 klogd: (4890,0):ocfs2_mknod:384 ERROR: status = -5
<----------------------------------------------------------->

<----------------------------------------------------------->
<Sun5 /var/log/messages>
<----------------------------------------------------------->
Feb 13 19:56:03 sun5 klogd: OCFS2 1.1.7-SLES Mon Jan 16 11:58:10 PST 2006 (build sles)
Feb 13 19:56:03 sun5 klogd: (4914,2):ocfs2_initialize_super:1379 max_slots for this device: 4
Feb 13 19:56:03 sun5 klogd: (4914,2):ocfs2_fill_local_node_info:1044 I am node 0
Feb 13 19:56:03 sun5 klogd: (4914,2):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 19:56:03 sun5 klogd: (4914,2):__dlm_print_nodes:388  node 0
Feb 13 19:56:03 sun5 klogd: (4914,2):ocfs2_find_slot:267 taking node slot 0
Feb 13 19:56:03 sun5 klogd: kjournald starting.  Commit interval 5 seconds
Feb 13 19:56:03 sun5 klogd: ocfs2: Mounting device (253,5) on (node 0, slot 0)
Feb 13 19:56:18 sun5 klogd: (4198,0):o2net_set_nn_state:418 accepted connection from node sun4 (num 2) at 196.1.1.229:7777
Feb 13 19:56:22 sun5 klogd: (4198,0):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 19:56:22 sun5 klogd: (4198,0):__dlm_print_nodes:388  node 0
Feb 13 19:56:22 sun5 klogd: (4198,0):__dlm_print_nodes:388  node 2
<----------------------------------------------------------->




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mounted the formatted partition on both the nodes.
This time mounted it first on Sun4 and then on Sun5
    sun4:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2
    sun5:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2

Created 2 files:
    sun4:~ # touch /raid5/abcd
    sun4:~ # touch /raid5/abcd1

Files viewable on both the nodes.
    sun4:~ # ls -l /raid5/abcd*
    -rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd
    -rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd1

    sun5:~ # ls -l /raid5/abcd*
    -rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd
    -rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd1

Issued "rm" command on Sun5.
    sun5:~ # rm /raid5/abcd
    sun5:~ # rm /raid5/abcd1

    sun5:~ # ls -l /raid5/abcd*
    /bin/ls: /raid5/abcd*: No such file or directory

This time no error message on Sun5. However files exist on Sun4

    sun4:~ # ls -l /raid5/abcd*
    -rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd
    -rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd1

Sun4 kernel reports the following:

<----------------------------------------------------------->
<Sun4 /var/log/messages>
<----------------------------------------------------------->
Feb 13 20:20:18 sun4 klogd: OCFS2 1.1.7-SLES Mon Jan 16 11:58:10 PST 2006 (build sles)
Feb 13 20:20:18 sun4 klogd: (4774,0):ocfs2_initialize_super:1379 max_slots for this device: 4
Feb 13 20:20:18 sun4 klogd: (4774,0):ocfs2_fill_local_node_info:1044 I am node 2
Feb 13 20:20:18 sun4 klogd: (4774,0):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 20:20:18 sun4 klogd: (4774,0):__dlm_print_nodes:388  node 2
Feb 13 20:20:18 sun4 klogd: (4774,0):ocfs2_find_slot:267 taking node slot 0
Feb 13 20:20:18 sun4 klogd: kjournald starting.  Commit interval 5 seconds
Feb 13 20:20:18 sun4 klogd: ocfs2: Mounting device (253,3) on (node 2, slot 0)
Feb 13 20:20:28 sun4 klogd: (4235,0):o2net_set_nn_state:418 connected to node sun5 (num 0) at 196.1.1.240:7777
Feb 13 20:20:32 sun4 klogd: (4235,0):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 20:20:32 sun4 klogd: (4235,0):__dlm_print_nodes:388  node 0
Feb 13 20:20:32 sun4 klogd: (4235,0):__dlm_print_nodes:388  node 2
<----------------------------------------------------------->

<----------------------------------------------------------->
<Sun5 /var/log/messages>
<----------------------------------------------------------->
Feb 13 20:20:28 sun5 klogd: (4310,0):o2net_set_nn_state:418 accepted connection from node sun4 (num 2) at 196.1.1.229:7777
Feb 13 20:20:32 sun5 klogd: OCFS2 1.1.7-SLES Mon Jan 16 11:58:10 PST 2006 (build sles)
Feb 13 20:20:32 sun5 klogd: (5284,0):ocfs2_initialize_super:1379 max_slots for this device: 4
Feb 13 20:20:32 sun5 klogd: (5284,2):ocfs2_fill_local_node_info:1044 I am node 0
Feb 13 20:20:32 sun5 klogd: (5284,2):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 20:20:32 sun5 klogd: (5284,2):__dlm_print_nodes:388  node 0
Feb 13 20:20:32 sun5 klogd: (5284,2):__dlm_print_nodes:388  node 2
Feb 13 20:20:32 sun5 klogd: (5284,2):ocfs2_find_slot:267 taking node slot 1
Feb 13 20:20:32 sun5 klogd: kjournald starting.  Commit interval 5 seconds
Feb 13 20:20:32 sun5 klogd: ocfs2: Mounting device (253,5) on (node 0, slot 1)
<----------------------------------------------------------->

Filesystem is in read-write mode on both the nodes.

Unmounted filesystem on Sun4 and remounted. 
    sun4:~ # umount /raid5
    sun4:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2
    sun4:~ # ls -l /raid5/abcd*
    /bin/ls: /raid5/abcd*: No such file or directory

Now the filesystems are in sync on both the nodes. 

~~~~~~~~~~~~~~~~~~~~~~~
/etc/cluster/ocfs2.conf (conf file same on both the servers,
                         sun1 mentioned in conf file is not powered on)
~~~~~~~~~~~~~~~~~~~~~~~
node:
        ip_port = 7777
        ip_address = 196.1.1.240
        number = 0
        name = sun5
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 196.1.1.118
        number = 1
        name = sun1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 196.1.1.229
        number = 2
        name = sun4
        cluster = ocfs2

cluster:
        node_count = 3
        name = ocfs2
~~~~~~~~~~~~~~~~~~~~~~~

Regards

Nohez
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux