Recovering a large file system from a crashed RAID
(If you don’t care for the story then just head over to the conclusion. There you’ll find all the tools I’ve used and how.)
Think of a scenario where you ask your server hosting provider to perform maintenance and later discover that the file system on your hardware RAID is dead. This happened to me, and this is the story of how I recovered the data and what challenges I faced.
The configuration:
- Adaptec 5405 RAID controller (Firmware 5.2-0, 18252)
- RAID 1 with two large disks over 2 TB (2x 3TB to be precise)
- GPT partition table
- EXT4 file system spanning the whole volume (here: sdc1)
- Custom inode configuration
- An identical system for reference
The first check on the unmounted file system and its dreaded outcome:
# fsck.ext4 -f /dev/sdc1
fsck from util-linux 2.21.2
e2fsck 1.42.4 (12-June-2012)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sdc1
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
The disks appear to be fine
In order to rule out errors in the configuration I verified the state of the RAID using Adaptec’s setup tool arcconf: both disks were online and active.
# arcconf GETCONFIG 1 LD | grep -E '\s+Segment'
Segment 0 : Present (Controller:1,Connector:0,Device:0)
Segment 1 : Present (Controller:1,Connector:0,Device:1)
Still, there could be errors on the disks (which were the reason for the maintenance in the first place). Smartmontools to the rescue! Since the individual disks are hidden from view and Adaptec’s RAID driver doesn’t (easily?) allow S.M.A.R.T. commands to pass through, I needed a way around it: the generic SCSI driver sg. A subsequent call to smartctl on the last two devices (the first one represents the controller itself) reported healthy disks.
# modprobe sg
# sg_scan -i
/dev/sg0: scsi6 channel=0 id=0 lun=0 [em]
Adaptec RAID1 V1.0
/dev/sg1: scsi6 channel=1 id=0 lun=0 [em]
ST33000650NS 0002
/dev/sg2: scsi6 channel=1 id=1 lun=0 [em]
ST33000650NS 0003
# smartctl -d auto -a /dev/sg1
smartctl 6.0 2012-10-10 r3643 [x86_64-linux-3.4.28-2.20-default] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
Vendor:
Product: ST33000650NS
Revision: 0003
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Logical block size: 512 bytes
Serial number:
Device type: disk
Transport protocol: SAS
Local Time is:
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
Error Counter logging not supported
Device does not support Self Test logging
Only later did I find out that with -d sat I can read the full attribute table, which reveals Raw_Read_Error_Rate, Seek_Error_Rate and Hardware_ECC_Recovered and gives a far more detailed picture. So much for the aggregated “Health Status”.
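In concrete terms, reading the full table boils down to a call along these lines (a sketch; the device node is the one reported by sg_scan above):
# smartctl -d sat -a /dev/sg1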
Where did my Superblocks go?
Due to the nature of the problem and the customised configuration I estimated that any recovery attempt would bind resources for an extended period. This is where I usually ask: do we really need that disk? To my concern the answer was yes: the data it contained bore high business value and were not recoverable by other means.
Fortunately the system’s configuration had been applied to several other servers, too, which allowed me to follow a lead I gathered from nixCraft: assuming the first superblock is corrupt due to a power failure, use one of its backups. However, since the superblock positions are a function of the file system’s number of inodes and its block size, and said (non-default) number of inodes was identical on the reference system, I could simply fetch the superblock positions there and apply them to fsck. If any of those checks passed, I’d be done.
# on reference system
dumpe2fs /dev/sdc1 | grep -i superblock
# simulate check on ruptured system
mysuperblocks="0 31952 95856 159760 223664 ..."
for mysb in $mysuperblocks; do
echo "On superblock $mysb";
# stop if a valid superblock was found
if fsck.ext4 -n -b $mysb -B 4096 /dev/sdc1; then
break;
fi;
done
This failed for all expected superblock positions. Of course. Why should it be that simple? In a different post (which I failed to rediscover) I was reminded that it is wise to verify that the data you are searching for is actually present. After all, the disks could be blank, and from what I knew the data loss happened during the maintenance. Time for some geek magic: hexdump!
# skip any of dd's info clutter
# print human readables in a separate column
dd if=/dev/sdc1 ibs=2k count=1 2>/dev/null | hexdump -C
First try… null values. Too bad.
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000800
Second try… Random data? Non-deterministic results? What is this nonsense?
00000000 |................|
*
00000400 |INA.............|
00000410 |................|
00000420 |........O..+"c`X|
00000430 |O..+"c`X........|
00000440 |................|
00000450 |................|
00000460 |............0tes|
00000470 |tfile.zero......|
00000480 |................|
*
00000500 |IN..............|
This was going nowhere. Without knowing what to expect I went to inspect the reference system, again. The default superblock should lie somewhere in block group 0 within the first few blocks.
dd if=/dev/sdc1 ibs=2k count=100 2>/dev/null | hexdump -C | less
To my surprise I found the most recent mount point just within the very first block. Finally something I could grep for. That was my silver lining on the horizon!
00000000 |................|
*
00000400 |...,...+f.-.....|
00000410 |..U!............|
00000420 |.|...|........HQ|
00000430 |..HQ..$.S.......|
00000440 |.==O.N..........|
00000450 |............<...|
00000460 |F...{......a.DG.|
00000470 |.......B........|
00000480 |......../media/s|
00000490 |dc1.............|
000004a0 |.....@df........|
000004b0 |.......U^.......|
000004c0 |.......=......h.|
000004d0 |................|
000004e0 |..............S.|
000004f0 |..O...Y..3......|
00000500 |.........==O....|
00000510 |................|
00000520 |................|
00000530 |................|
00000540 |................|
00000550 |................|
00000560 |................|
00000570 |........8..m....|
00000580 |................|
*
00000800
Performing a search for my mount point on the dead file system yielded a single hit at 0xF8680:
# dd if=/dev/sdc1 ibs=2k count=500 2>/dev/null | hexdump -C | grep '/media/s'
000f8680 |......../media/s|
Success! My dead file system may have just turned into a bad file system with the lost first superblock dislocated at an offset of 0xF8200. I even found some valid file contents.
At first glance the superblocks on both systems looked similar. Thanks to the guys at kernel.org and their explanation of the Ext4 disk layout I could verify both that I had found a proper superblock and that the configuration of the bad file system (block size, inodes and blocks per group) was as expected, and consequently that the previously gathered superblock positions apply relative to the dislocation offset.
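How exactly you compare the two is a matter of taste; here is a minimal sketch that dumps just the superblock on either side (the skip arithmetic assumes the 0xF8200 dislocation found above plus the usual 1 KiB superblock offset; field offsets are taken from the Ext4 layout documentation):
# bad system: the dislocation is 1985 sectors of 512 bytes, the superblock sits 2 sectors further
dd if=/dev/sdc1 ibs=512 skip=$(( 1985 + 2 )) count=2 2>/dev/null | hexdump -C
# reference system: the superblock sits at the usual 1 KiB offset
dd if=/dev/sdc1 ibs=512 skip=2 count=2 2>/dev/null | hexdump -C
# fields to compare (offsets within the superblock):
#   0x38  magic number 0xEF53 ("53 ef" in the dump)
#   0x18  log2(block size / 1024), 0x02 means 4096 bytes
#   0x20  blocks per group, 0x24 fragments per group, 0x28 inodes per group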
As a side note: it took me a good night’s sleep to arrive at those last four words. Before that I considered my finding a backup record which is to be found in block group 0, 3, 5, 7, … if a certain flag is set (which happens to be the case). However, after trying all dislocated positions nothing supported this assumption any more.
# superblock positions, counted in blocks of 4k
mysuperblocks="0 31952 95856 159760 223664 ..."
for mysb in $mysuperblocks; do
# the input block size (ibs) must be a common divisor of the
# dislocation and the file system block size: 512 works, the
# dislocation 0xf8200 is a skip of 1985 sectors and each 4k
# superblock position is 8 sectors
mydisloc=$(( 1985 + $mysb*8 ))
echo "On superblock $mydisloc";
dd if=/dev/sdc1 ibs=512 count=8 skip=$mydisloc 2>/dev/null | hexdump -C
done
What do I make of all this? Three TB of data moved by roughly 1 MB during a power failure? Surely not. A prefix that changes when you look at it? A damaged controller? Maybe; the observed prefix would support this assumption. A RAID maintains its configuration of logical volumes in such a prefix, and since the RAID is mirrored, only a single RAID chunk per disk was expected, which is supported by the latest superblock search incorporating the dislocation.
Mounting the unmountable
All that remained was to convince fsck and mount to accept a byte-value dislocation. Such an option does not exist, though. Both commands operate on blocks, not bytes. And since the dislocation is not an integral multiple of the file system’s block size I couldn’t simply add the value to the known superblock positions (as you’ve seen in the scan above). That’s what loop devices are made for.
# losetup -f
/dev/loop0
# losetup --read-only --offset 1016320 /dev/loop0 /dev/sdc1
# fsck.ext4 -nf /dev/loop0
e2fsck 1.42.4 (12-June-2012)
Warning: skipping journal recovery because doing a read-only filesystem check.
The filesystem size (according to the superblock) is 731378688 blocks
The physical size of the device is 536869954 blocks
Either the superblock or the partition table is likely to be corrupt!
Abort? no
Pass 1: Checking inodes, blocks, and sizes
Error reading block 547018272 (Invalid argument) while getting next inode from scan.
Ignore error? no
Error while scanning inodes (477695894): Can't read next inode
e2fsck: aborted
Another side note: at first I was so happy that fsck accepted the offset and considered the loop device a valid ext4 file system that I overlooked the words “partition table is likely to be corrupt”. I erroneously started blaming losetup and blockdev --getsize for being limited to 2 TB. Some fresh air blowing through my head (and explaining the case to a colleague) led me to inspect the partition record, which turned out to be shortened.
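The listing below is parted’s view of the disk in byte units; a sketch of the call (print free makes the trailing gap visible):
# parted /dev/sdc unit B print free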
Number Start End Size Type File system Flags
1 32256B 2199020382719B 2199020350464B primary type=83
2199020382720B 2995729203199B 796708820480B Free Space
That was just a minor problem: simply point losetup to the disk instead of the partition, add the partition’s offset to the dislocation, and provide a sizelimit in case some free space gets in the way.
# losetup -d /dev/loop0
# losetup --read-only \
--offset $((32256 + 1016320)) \
--sizelimit $((731378688 * 4096)) \
/dev/loop0 /dev/sdc
# fsck.ext4 -nf /dev/loop0
e2fsck 1.42.4 (12-June-2012)
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/loop0: 190781054/750059520 files (0.1% non-contiguous), 348544758/731378688 blocks
Now it was just me and the damaged journal, which insistently prohibited me from mounting the loop device. How difficult could things get now that even a read-only fsck passed? Well, you never know. All this mess should not have happened in the first place, so better safe than sorry. A two-part, opportunistic approach sounded most efficient: copy the raw data to a backup machine (to repair the file system there) and access the bad file system without repair to inspect any damage.
Over at sans.org I found a solution to mount my dirty file system without the journal (the noload option). That way I could access the file system before the raw data copy finished, which was estimated to take five days.
mount -t ext4 -o ro,noexec,noload /dev/loop0 /media/sdc1
And it worked like a charm!
Fully recovered
As for the backup, I used a combination of netcat and SSH to copy the raw data to another machine. This unfortunate nuisance was inevitable since the available space on the host system was already maxed out. I will skip those details: they would shift the focus of this article, and there is plenty to say about netcat over SSH to justify an article devoted to that topic alone.
Once all data was transmitted it was “safe” to apply a full file system recovery to the backup copy. Safe enough, I should say, since at that point I already knew from my first SMART check that my assumption of a steady state of the RAID was not quite right.
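For completeness, the repair on the backup machine is just the same loop device setup again, this time writable; a sketch, assuming the raw copy landed in an image file (the path is made up):
# same offsets as before, now against the copied image
losetup --offset $(( 32256 + 1016320 )) \
--sizelimit $(( 731378688 * 4096 )) \
/dev/loop0 /backup/sdc.img
# full check and repair, including journal recovery
fsck.ext4 -f /dev/loop0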
Conclusion
At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.
Twelve Principles of Agile Software
A reminiscence of agility
Don’t be afraid of details. If you feel lost then get yourself a map. And make an effort to feel success often. See Crystal Clear.
Always challenge your own assumptions and expectations.
- Verify the assumption: dump the sectors under scrutiny, compare them to a known-good reference, understand their meaning.
- Disprove counter-assumptions (at least try): don’t give in to the temptation that the first occurrence is sufficient proof – find them all!
- Even the “impossible” may be more likely than something well understood: it may be “impossible” for a hardware RAID to show you things it was not designed to show, but that is still more likely than 3 TB of data moving by 1 MB.
- Interpret data yourself. Rules of interpretation differ. Don’t rely on other people’s thoughts. “SMART Health Status” may indicate failures but it is by no means a proper indicator of good health.
Explain the case to others for the sake of explicitness and feedback.
- Re-iterating out loud reveals details and reasoning you were missing before.
- Value and expected outcome are external measures. Let them be applied by a helping pair of ears.
Gained tools
Use hexdump -C to read the superblock in the first block group on a reference machine. Find a comprehensive description (such as the Ext4 disk layout documentation) and understand what you see.
dd if=/dev/sdc1 ibs=1k skip=1 count=1 2>/dev/null | hexdump -C
Establish a good signature1 to find all superblocks. A candidate is the 12 bytes at offsets 0x20 through 0x2B of a reference superblock. These represent the defining parameters of a file system: blocks per group, fragments per group and inodes per group.
# Capture the line holding the 12 signature bytes on a reference system.
# Keep only those 12 bytes: drop the offset column and the trailing
# four bytes (the mount time), which differ on the bad file system.
dd if=/dev/sdc1 ibs=1k skip=1 count=1 2>/dev/null | hexdump | grep '^0000020 '
Verify the signature on the reference system; it is good if and only if all superblocks were found.
# Get block size and superblock positions for reference
tune2fs -l /dev/sdc1 | grep -i '^block size'
dumpe2fs /dev/sdc1 | grep -i superblock
# Search the disk for occurrences and compare to reference
dd if=/dev/sdc1 ibs=$myblocksize | hexdump | grep "$mysignature"
Assuming the data is still properly aligned — this assumption should be challenged at least once — you may expect an exact match for each “lost” superblock on the bad file system. Otherwise a substring search is in order: a misalignment breaks the signature at unknown positions across hexdump’s lines, so do not expect a line-wise grep to find anything sensible.
dd if=/dev/sdc1 ibs=$myblocksize | hexdump | grep "$mysignature"
Once you have found the offset to your lost superblock, use losetup to create your own logical volume. But remember: the first superblock always sits 1 KiB past the start of the file system, so the offset you hand to losetup must point 1 KiB before the superblock you found. First test-drive your finding and gather the number of blocks of the file system.
# Get a free loop device and bind it read-only to the partition
losetup -f
losetup --read-only --offset $mybyteoffset /dev/loop0 /dev/sdc1
# Gather block count and size (always challenge assumptions!)
tune2fs -l /dev/loop0 | grep -i '^block'
losetup -d /dev/loop0
If your partition record tells you a size other than the block count you found (remember: the two counts may be based on different block sizes), do not point losetup at the partition but at the disk itself. Use parted with unit B to get the partition’s offset in bytes and add it to $mybyteoffset.
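A sketch of that lookup; the value assigned below is the one from this very case and only serves as an example:
# the Start column of your partition is the byte offset to add
parted /dev/sdc unit B print
mypartoffset=32256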
Create your final volume with proper offsets and limits. Verify the volume.
losetup --read-only \
--offset $(( $mypartoffset + $mybyteoffset )) \
--sizelimit $(( $myblockcount * $myblocksize )) \
/dev/loop0 /dev/sdc
fsck.ext4 -nf /dev/loop0
Your journal will probably be corrupt and you don’t want to repair something that broken without a backup (by now you should have enough knowledge to devise a backup strategy fitting your needs). You may however mount your bad file system read-only using noload (anticipating inconsistencies).
mount -t ext4 -o ro,noexec,noload /dev/loop0 /media/sdc1
Open questions
Why did the RAID controller let me see its configuration prefix? Is the controller damaged? Would a new one still expose this behaviour? How could that have happened? Would Adaptec comment on this incident?
The partition record was changed, but this structure should be transparent to a hardware RAID controller. Could the incident be (in part) human error? Maybe while fixing the software RAID with mdadm? A broken software RAID and damaged disks were the reason for the original maintenance, so this is a reasonable assumption.
You may wonder why I emphasised “large file system” in the beginning. During the recovery I often thought I had come across the dubious 32-bit threshold (roughly 2 TB when 512-byte sectors are addressed with 32-bit values). Upon completion of this article I realised that all evidence had vanished and my assumptions crumbled. It is just another reminder of how you can connect things in the wrong way. I think that this is the most important lesson I learned. Again.
Sources
- About Smartmontools. sourceforge.net. Retrieved 2013-03-31
- Cockburn, Alistair; Crystal Clear distilled. alistair.cockburn.us. 2004, retrieved 2013-03-31
- ext2: Ext2 data structures. wikipedia.org. 2013-02-21, retrieved 2013-03-31
- Ext4 Disk Layout: The Super Block. wiki.kernel.org. 2013-02-01, retrieved 2013-03-31
- S.M.A.R.T.: Visibility to host systems. wikipedia.org. 2013-03-29, retrieved 2013-03-31
- Pomeranz, Hal; How to Mount Dirty EXT4 File Systems. computer-forensics.sans.org. 2011-06-14, retrieved 2013-03-31
- nixCraft; “Linux: Find Alternative Superblocks”. cyberciti.biz. 2010-12-06, retrieved 2013-03-31
- Manifesto for Agile Software Development: Twelve Principles of Agile Software. agilemanifesto.org. 2001, retrieved 2013-03-31
- The Linux SCSI Generic (sg) HOWTO: Appendix A. Sg3 utils package. faqs.org. 2002-05-03, retrieved 2013-03-31
Acknowledgement
Special thanks to Catherine for helping me with proper English, to Katrin with proper looks, and to Tobias and Tom with better storytelling.
-
A perfect signature neither reports false positives nor misses desired elements. A good signature is a close approximation. ↩