ZFS issues, disk unavail and corruption

10:52 am Solaris ZFS

I had corruption in my huge ZFS pool last week. After many hours of troubleshooting I concluded that it was corrupted beyond repair, so I destroyed the pool and started again. I won’t go into the details of what happened here, but it appears to be related to the way I connect and disconnect the D1000.

I spoke to a friend of mine who is a hardware expert, and this is what he told me about hot-plugging the D1000.

“The SCSI connector of the cable is not hot-plug capable. Period. The first requirement would be that, during insertion:
1) ground is established
2) power is enabled, and circuitry is initialised.
3) data lines are connected.

Since the pins of the SCSI cable connector are all equal in length, there is no way that the above can happen in the correct sequence.

Look at the connector of the SCSI drive itself. You’ll see protruding pins of different lengths. That’s why the drives in the D1000 are hot-pluggable, but the chassis/cable isn’t.

You run the risk of frying the logic on either the HBA or the D1000. If you want to do this sensibly, you power off the box and the D1000, then either connect or disconnect the cable. We’ve seen plenty of HBAs fried because they were not plugged according to spec (especially the qus-driver ones). Fair enough, the old PCI dual-channel wide SCSI-2 adapters are pretty resilient from what we’ve seen, but then again they weren’t maltreated every week like this.”
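
The advice above boils down to an ordered shutdown before the cable is touched. Here is a minimal sketch of the host side of that sequence. Exporting the pool first is my own assumption rather than part of the quote, and the names `POOL`, `DRY_RUN`, `run` and `safe_detach` are made up for illustration. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch: release the pool and power the host off before the SCSI cable is moved.
# POOL, DRY_RUN, run and safe_detach are illustrative names, not from the post.
POOL="${POOL:-storage}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = really run them

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

safe_detach() {
    # 1) Release the pool so nothing is in flight on the bus (assumed step).
    run zpool export "$POOL" || return 1
    # 2) Shut the host down and power it off (run level 5 on Solaris).
    run init 5
}

safe_detach
```

Once the host is off, the D1000 is powered off by hand and only then is the cable connected or disconnected. Setting `DRY_RUN=0` makes the script execute the commands instead of printing them.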

So, this morning I was staring at this:

anouk:/ # zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
storage 136G 71.4G 64.6G 52% ONLINE -
anouk:/ # zpool export storage

The system went into immediate panic.

Once the system had finished rebooting, I started troubleshooting.

anouk:/ # mount -F zfs storage /storage/
cannot open 'storage': I/O error

anouk:/ # zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
storage - - - - FAULTED -

anouk:/ # zpool status -x
pool: storage
state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-D3
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
storage UNAVAIL 0 0 0 insufficient replicas
c1t0d0 UNAVAIL 0 0 0 cannot open

The disk itself was there, though: I could see it, touch it and play with it.

anouk:/ # format
… <snip>
Specify disk (enter its number): 2
selecting c1t0d0
[disk formatted]
/dev/dsk/c1t0d0s0 is part of active ZFS pool storage. Please see zpool(1M).

anouk:/ # zpool online storage c1t0d0
cannot open 'storage': pool is unavailable

anouk:/ # zpool status -v storage
pool: storage
state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-D3
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
storage UNAVAIL 0 0 0 insufficient replicas
c1t0d0 UNAVAIL 0 0 0 cannot open
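
For a quick check of the kind done by eye above, pool health can also be filtered in a script. This is only a sketch: `unhealthy_pools` is a made-up helper name, and it assumes the tab-separated output of `zpool list -H -o name,health` is piped into it.

```shell
# Print every pool whose health is anything other than ONLINE.
# Expects lines of "<name><TAB><health>" on stdin, as produced by
# `zpool list -H -o name,health`. The function name is illustrative.
unhealthy_pools() {
    awk '$2 != "ONLINE" { print $1 ": " $2 }'
}

# On a live system:
#   zpool list -H -o name,health | unhealthy_pools
```

Empty output means every pool reported ONLINE. In the state shown above it would have printed the `storage` pool together with its health value.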

I tried rebooting, to no avail. Solaris could see the disk but ZFS could not. From my ZFS troubleshooting a few days earlier I remembered the zpool.cache file in /etc/zfs.

anouk:/ # ls -la /etc/zfs
total 14
drwxr-xr-x 2 root sys 512 Dec 26 23:28 .
drwxr-xr-x 82 root sys 4608 Dec 28 09:34 ..
-rw-r--r-- 1 root root 852 Dec 26 23:28 zpool.cache

anouk:/ # rm /etc/zfs/zpool.cache
anouk:/ # init 6

anouk:/ # zpool status -x
no pools available
anouk:/ # zpool list
no pools available
anouk:/ # zpool import
pool: storage
id: 10619813594259770929
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

storage ONLINE
c1t0d0 ONLINE

anouk:/ # zpool import storage
cannot import 'storage': pool may be in use from other system
use '-f' to import anyway
anouk:/ # zpool import -f storage
anouk:/ # zpool status -v
pool: storage
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0

errors: No known data errors
anouk:/ # mount -F zfs storage /storage/
anouk:/ # df -k
… <snip>
storage 140378112 74860993 65516734 54% /storage
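
Put together, the recovery above has two phases with a reboot in between. The sketch below mirrors those steps. The function names and the `POOL`, `CACHE` and `DRY_RUN` variables are made up for illustration, and it moves the cache file aside instead of deleting it, which is my own variation on the `rm` used above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the recovery steps, split around the reboot.
# All names here are illustrative; nothing runs for real unless DRY_RUN=0.
POOL="${POOL:-storage}"
CACHE="${CACHE:-/etc/zfs/zpool.cache}"
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Phase 1: make ZFS forget its stale view of the pool, then reboot.
forget_pool() {
    # Keep a copy of the cache file rather than removing it (assumption).
    run mv "$CACHE" "$CACHE.bak" || return 1
    run init 6
}

# Phase 2 (after the reboot): rediscover the pool and import it.
import_pool() {
    # List the pools that are available for import.
    run zpool import || return 1
    # -f is needed here because the pool was never cleanly exported.
    run zpool import -f "$POOL"
}

# Usage: call forget_pool first, and import_pool once the system is back up.
```

The two functions are kept separate because the reboot ends the first one. After `import_pool` succeeds, `zpool status -v` and the mount can be checked as shown above.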

And all is well again. With all the ZFS problems of the last week, I think I will listen to the SCSI expert and not hot-plug the D1000 again. Fortunately, the data I lost was only a backup, and the original data is intact.
