Recovering a large file system from a crashed RAID

(If you don’t care for the story, head straight to the conclusion. There you’ll find all the tools I used and how I used them.)

Think of a scenario where you ask your server hosting provider to perform maintenance and later discover that the file system on your hardware RAID is dead. This happened to me, and this is the story of how I recovered the data and what challenges I faced.

The configuration:

The first check on the unmounted file system and its dreaded outcome:

# fsck.ext4 -f /dev/sdc1
fsck from util-linux 2.21.2
e2fsck 1.42.4 (12-June-2012)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sdc1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

The disks appear to be fine

In order to rule out configuration errors I verified the state of the RAID using Adaptec’s setup tool arcconf: both disks were online and active.

# arcconf GETCONFIG 1 LD | grep -E '\s+Segment'
   Segment 0    : Present (Controller:1,Connector:0,Device:0)
   Segment 1    : Present (Controller:1,Connector:0,Device:1)

Still, there could be errors on the disks (which were the reason for the maintenance in the first place). Smartmontools to the rescue! Since the individual disks are hidden from view and Adaptec’s RAID driver doesn’t allow S.M.A.R.T. commands to pass through (easily?), I needed a way to reach the disks directly: the generic SCSI driver sg. A subsequent call to smartctl on the last two devices (the first one represented the controller itself) reported healthy disks.

# modprobe sg
# sg_scan -i
/dev/sg0: scsi6 channel=0 id=0 lun=0 [em]
    Adaptec   RAID1             V1.0
/dev/sg1: scsi6 channel=1 id=0 lun=0 [em]
              ST33000650NS      0002
/dev/sg2: scsi6 channel=1 id=1 lun=0 [em]
              ST33000650NS      0003

# smartctl -d auto -a /dev/sg1
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.4.28-2.20-default] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

Vendor:
Product:              ST33000650NS
Revision:             0003
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Serial number:
Device type:          disk
Transport protocol:   SAS
Local Time is:
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Error Counter logging not supported
Device does not support Self Test logging

Only later did I find out that with -d sat I can read the full attribute table, which reveals Raw_Read_Error_Rate, Seek_Error_Rate and Hardware_ECC_Recovered and gives a far more detailed picture. So much for the aggregated “Health Status”.
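
In hindsight the call is a one-liner; a sketch using the same sg device as above:

# smartctl -d sat -a /dev/sg1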

Where did my Superblocks go?

Due to the nature of the problem and the customised configuration I expected any recovery attempt to bind resources for an extended period. This is where I usually ask: do we really need that disk? To my concern the answer was yes: the data it contained bore high business value and could not be recovered by any other means.

Fortunately the system’s configuration had been applied to several other servers, too, which allowed me to follow a lead I picked up from nixCraft: assuming the first superblock is corrupt (e.g. after a power failure), use one of its backups. The superblock positions are a function of the file system’s block size and its number of inodes, and since that (non-default) number of inodes was identical on the reference system, I could simply fetch the superblock positions there and feed them to fsck. If any of those checks passed, I’d be done.

# on reference system
dumpe2fs /dev/sdc1 | grep -i superblock

# simulate check on ruptured system
mysuperblocks="0 31952 95856 159760 223664 ..."

for mysb in $mysuperblocks; do
  echo "On superblock $mysb";

  # stop if a valid superblock was found
  if fsck.ext4 -n -B 4096 -b "$mysb" /dev/sdc1; then
    break;
  fi;
done

This failed for all expected superblock positions. Of course. Why should it be that simple? In a different post (which I failed to rediscover) I was reminded that it is wise to verify that the data you are searching for is actually present. After all, the disks could be blank, and from what I knew the data loss happened during maintenance. Time for some geek magic: hexdump!

# skip dd's info clutter
# print human-readable characters in a separate column
dd if=/dev/sdc1 ibs=2k count=1 2>/dev/null | hexdump -C

First try… null values. Too bad.

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000800

Second try… Random data? Non-deterministic results? What is this nonsense?

00000000  |................|
*
00000400  |INA.............|
00000410  |................|
00000420  |........O..+"c`X|
00000430  |O..+"c`X........|
00000440  |................|
00000450  |................|
00000460  |............0tes|
00000470  |tfile.zero......|
00000480  |................|
*
00000500  |IN..............|

This was going nowhere. Without knowing what to expect I went back to inspect the reference system. The default superblock should lie in block group 0, within the first few blocks.

dd if=/dev/sdc1 ibs=2k count=100 2>/dev/null | hexdump -C | less

To my surprise I found the most recent mount point just within the very first block. Finally something I could grep for: my silver lining on the horizon!

00000000  |................|
*
00000400  |...,...+f.-.....|
00000410  |..U!............|
00000420  |.|...|........HQ|
00000430  |..HQ..$.S.......|
00000440  |.==O.N..........|
00000450  |............<...|
00000460  |F...{......a.DG.|
00000470  |.......B........|
00000480  |......../media/s|
00000490  |dc1.............|
000004a0  |.....@df........|
000004b0  |.......U^.......|
000004c0  |.......=......h.|
000004d0  |................|
000004e0  |..............S.|
000004f0  |..O...Y..3......|
00000500  |.........==O....|
00000510  |................|
00000520  |................|
00000530  |................|
00000540  |................|
00000550  |................|
00000560  |................|
00000570  |........8..m....|
00000580  |................|
*
00000800

Performing a search for my mount point on the dead file system yielded a single hit at 0xF8680:

# dd if=/dev/sdc1 ibs=2k count=500 2>/dev/null | hexdump -C | grep '/media/s'
000f8680  |......../media/s|

Success! My dead file system may have just turned into a bad file system, with the lost first superblock dislocated by an offset of 0xF8200. I even found some valid file contents.
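
A quick check of the arithmetic behind that number (the mount-point string sits at 0x488 in the reference dump and at 0xF8688 on the dead disk):

printf '0x%X\n' $(( 0xF8688 - 0x488 ))   # prints 0xF8200, the dislocation in bytes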

At first glance the superblocks on both systems look similar. Thanks to the people at kernel.org and their explanation of the ext4 disk layout I could verify both that I had found a proper superblock and that the configuration of the bad file system (block size, inodes and blocks per group) was as expected; consequently the previously gathered superblock positions apply relative to the dislocation offset.
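
Another hypothetical cross-check relies on the ext4 magic number 0xEF53 sitting at byte 0x38 of the superblock, i.e. at relative offset 0x438 of a dump that starts at the dislocation:

# the matching line should contain the bytes "53 ef" (0xEF53 in little-endian)
dd if=/dev/sdc1 ibs=512 skip=1985 count=8 2>/dev/null | hexdump -C | grep '^00000430'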

As a side note: it took me a good night’s sleep to arrive at that final clause. Before that I took my finding for a backup superblock, which is to be found in block groups 1, 3, 5, 7, … if a certain flag is set (which happens to be the case here). However, after trying all dislocated positions nothing supported this assumption any more.

# superblock positions from the reference system, counted in 4k blocks
mysuperblocks="0 31952 95856 159760 223664 ..."

for mysb in $mysuperblocks; do
  # the input block size (ibs) must divide both the dislocation and the fs block size
  # the dislocation 0xf8200 equals 1985 blocks of 512 bytes (used as skip below)
  # one 4k block of mysb spans 8 such 512-byte blocks
  mydisloc=$(( 1985 + $mysb*8 ))

  echo "On superblock $mydisloc";
  dd if=/dev/sdc1 ibs=512 count=8 skip=$mydisloc 2>/dev/null | hexdump -C
done

What do I make of all this? Three TB of data moved by roughly 1 MB during a power failure? Surely not. A prefix that changes when you look at it? A damaged controller? Maybe; the observed prefix would support this assumption. A RAID controller maintains its configuration of logical volumes in such a prefix, and since the RAID is mirrored, only a single RAID chunk per disk was to be expected, which is supported by the latest superblock search incorporating the dislocation.

Mounting the unmountable

All that remained was to convince fsck and mount to accept a dislocation given in bytes. Such an option does not exist, though: both commands operate on blocks, not bytes. And since the dislocation is not an integral multiple of the file system’s block size, I couldn’t simply add the value to the known superblock positions (as you’ve seen in the scan above). That’s what loop devices are made for.

# losetup -f
/dev/loop0

# losetup --read-only --offset 1016320 /dev/loop0 /dev/sdc1

# fsck.ext4 -nf /dev/loop0
e2fsck 1.42.4 (12-June-2012)
Warning: skipping journal recovery because doing a read-only filesystem check.
The filesystem size (according to the superblock) is 731378688 blocks
The physical size of the device is 536869954 blocks
Either the superblock or the partition table is likely to be corrupt!
Abort? no

Pass 1: Checking inodes, blocks, and sizes
Error reading block 547018272 (Invalid argument) while getting next inode from scan.
Ignore error? no

Error while scanning inodes (477695894): Can't read next inode
e2fsck: aborted

Another side note: at first I was so happy that fsck accepted the offset and considered the loop device a valid ext4 file system that I overlooked the words “partition table is likely to be corrupt”. I erroneously started blaming losetup and blockdev --getsize for being limited to 2 TB. Some fresh air blowing through my head (and explaining the case to a colleague) led me to inspect the partition record, which turned out to be truncated.
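
For reference, one way to produce the byte-exact listing below is parted with byte units:

# parted /dev/sdc unit B print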

Number  Start           End             Size            Type     File system  Flags
 1      32256B          2199020382719B  2199020350464B  primary               type=83
        2199020382720B  2995729203199B  796708820480B            Free Space

That turned out to be a minor problem: simply point losetup at the disk instead of the partition, add the partition’s offset to the dislocation and provide a sizelimit in case the trailing free space gets in the way.

# losetup -d /dev/loop0

# losetup --read-only \
          --offset $((32256 + 1016320)) \
          --sizelimit $((731378688 * 4096)) \
          /dev/loop0 /dev/sdc

# fsck.ext4 -nf /dev/loop0
e2fsck 1.42.4 (12-June-2012)
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/loop0: 190781054/750059520 files (0.1% non-contiguous), 348544758/731378688 blocks

Now it was just me and the damaged journal, which stubbornly prevented me from mounting the loop device. How difficult could things get now that even a read-only fsck passed? Well, you never know. All this mess should not have happened in the first place, so better safe than sorry. A two-part, opportunistic approach sounded most efficient: copy the raw data to a backup machine (to repair the file system there) and, in the meantime, access the bad file system without repairing it to inspect any damage.

Over at sans.org I found a way to mount my dirty file system without its journal (the noload option). That way I could access the file system before the raw-data copy job, estimated at five days, had finished.

mount -t ext4 -o ro,noexec,noload /dev/loop0 /media/sdc1

And it worked like a charm!

Fully recovered

As for the backup, I used a combination of netcat and SSH to copy the raw data to another machine. This unfortunate detour was inevitable since the available space on the host system was already maxed out. I will skip the details here: they shift the focus of this article, and there is enough to say about netcat over SSH to justify an article devoted to that topic alone.

Once all data had been transmitted it was “safe” to apply a full file system recovery to the backup. Safe enough, I should say, since at that point I already knew from my first SMART check that my assumption of a steady state of the RAID was not quite right.
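
For the record, a minimal sketch of that final step, assuming the copy was taken from the already-offset loop device so that the image starts right at the file system (backup.img is a placeholder name):

# attach the image and let e2fsck actually repair it this time
losetup -f --show backup.img     # prints the allocated device, e.g. /dev/loop1
fsck.ext4 -fy /dev/loop1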

Conclusion

At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.

Twelve Principles of Agile Software

A reminiscence of agility

Don’t be afraid of details. If you feel lost then get yourself a map. And make an effort to feel success often. See Crystal Clear.

Always challenge your own assumptions and expectations.

Explain the case to others for the sake of explicitness and feedback.

Gained tools

Use hexdump -C to read the primary superblock on a reference machine. Find a comprehensive description of the ext4 disk layout and understand what you see.

dd if=/dev/sdc1 ibs=1k skip=1 count=1 2>/dev/null | hexdump -C

Establish a good signature1 to find all superblocks. A candidate is the 12 bytes starting at offset 0x20 (up to +0xB) of a reference superblock. These represent the defining parameters of a file system: blocks per group, fragments per group and inodes per group.

# Capture the signature bytes (0x20..0x2B) from a reference system.
# Strip the leading offset before reusing the line as a search signature!
dd if=/dev/sdc1 ibs=1k skip=1 count=1 2>/dev/null | hexdump | grep '^0000020 '

Verify the signature on the reference system; it is good if and only if all superblocks were found.

# Get block size and superblock positions for reference
tune2fs -l /dev/sdc1 | grep -i '^block size'
dumpe2fs /dev/sdc1 | grep -i superblock

# Search the disk for occurrences and compare them to the reference
dd if=/dev/sdc1 ibs=$myblocksize | hexdump | grep "$mysignature"

Assuming the data is still properly aligned (an assumption that should be challenged at least once), you may expect an exact match for each “lost” superblock on the bad file system. Otherwise a substring search is in order: a misalignment breaks the signature at unknown positions across several lines, so do not expect a line-wise grep to find anything sensible.
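
One alignment-agnostic fallback (a sketch, not a command from the original run) is to skip hexdump and search the raw bytes for a plain-text marker from the primary superblock, such as the last mount point; its byte offset pins down the dislocation even when alignment is off:

# -a treats the binary stream as text, -o prints only the match, -b its byte offset
# the first few GiB are usually enough to locate the primary superblock
dd if=/dev/sdc1 ibs=1M count=4096 2>/dev/null | grep -aob '/media/sdc1'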

Once you have found the offset of your lost superblock, use losetup to create your own logical volume. But remember: the first superblock always sits 1024 bytes past the start of the file system (normally the start of the partition), so the offset you hand to losetup is the superblock’s position minus 1024. First test-drive your finding and gather the number of blocks of the file system.

# Get a free loop device and bind it read-only to the partition
losetup -f
losetup --read-only --offset $mybyteoffset /dev/loop0 /dev/sdc1

# Gather block count and size (always challenge assumption!)
tune2fs -l /dev/loop0 | grep -i '^block'
losetup -d /dev/loop0

If your partition record reports a size other than the block count you found (remember: the two counts may be based on different block sizes), do not point losetup at the partition but at the disk itself. Use parted with unit B to get the partition’s byte offset and add it to $mybyteoffset.
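
A short sketch of that step, using the byte values from the story above as examples:

# print the partition table in bytes and note where the data partition starts
parted /dev/sdc unit B print

# example values from this recovery
mypartition=32256      # partition start in bytes
mybyteoffset=1016320   # dislocation of the file system within the partition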

Create your final volume with proper offsets and limits. Verify the volume.

losetup --read-only \
        --offset $(( $mypartition + $mybyteoffset )) \
        --sizelimit $(( $myblockcount * $myblocksize )) \
        /dev/loop0 /dev/sdc

fsck.ext4 -nf /dev/loop0

Your journal will probably be corrupt, and you don’t want to repair something this broken without a backup (by now you should have enough knowledge to devise a backup strategy that fits your needs). You may, however, mount your bad file system read-only using noload (anticipating inconsistencies).

mount -t ext4 -o ro,noexec,noload /dev/loop0 /media/sdc1

Open questions

Why did the RAID controller let me see its configuration prefix? Is the controller damaged? Would a new one still expose this behaviour? How could that have happened? Would Adaptec comment on this incident?

The partition record was changed, but this structure should be transparent to a hardware RAID controller. Could the incident be (in part) human error, maybe while fixing the software RAID with mdadm? A broken software RAID and damaged disks were the reason for the original maintenance, so this is a reasonable assumption.

You may wonder why I emphasised “large file system” in the beginning. During the recovery I often thought I had run into the dubious 32-bit threshold (roughly 2 TiB when 512-byte sectors are addressed with 32 bits). Upon completion of this article I realised that all evidence for it had vanished and my assumptions crumbled. It is just another reminder of how easily you can connect things the wrong way. I think that this is the most important lesson I learned. Again.

Acknowledgement

Special thanks to Catherine for helping me with proper English, to Katrin for the proper looks, and to Tobias and Tom for better storytelling.


  1. A perfect signature neither reports false positives nor misses desired elements. A good signature is a close approximation of that.