Recovering from an accidental disk crash
I had a very bad habit of testing the speed of my disks with dd, very simply:
root@server:/mnt# dd if=/dev/md0 of=/dev/null bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 4.56589 s, 23.0 MB/s
But one day, short on sleep, I accidentally swapped the arguments and put my disk in the ‘of’ argument instead of ‘if’, so dd wrote to it. My disks are in RAID5, which provides redundancy and tolerates one disk failure. Had the target been a single physical disk, that would have been OK: I would just have had to resync the array. But that would have been too easy. The mistake hit the array device itself, so the redundancy was of no help. And the first gigabyte of a disk holds critical data… The repair was quite difficult: it took me one day to recover the minimum and get the service back up (step 1), but six months to recover fully (step 2). In case it is useful, the main parts are below.
Step 1 : recover the disk
With the first gigabyte of data overwritten, the ext4 filesystem was broken, and fsck could not repair it because the primary superblock was gone. The solution is to use a backup superblock. The easiest way to find where the backup superblocks are located is to run mke2fs in test mode (this only gives the right positions if mke2fs picks the same parameters, such as the block size, as when the filesystem was created). DO NOT FORGET the -n flag… Without it, your disk would be overwritten once more, and the backup superblocks would be destroyed…
mke2fs -n /dev/lvm/main
Once you have the superblock positions, you can repair the filesystem:
fsck.ext3 -b 1934917632 -B 4096 /dev/lvm/main
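As a cross-check on the block number passed to -b above, here is a minimal Python sketch that computes where ext2/3/4 normally keeps its backup superblocks. It assumes the default layout (the sparse_super feature, and 8 × block size blocks per group), which is not guaranteed for every filesystem. The function name is made up for this example, and the output of mke2fs -n remains the reference.

```python
def backup_superblock_blocks(total_blocks, block_size=4096):
    """Return the block numbers of the backup superblocks.

    Assumes the default mke2fs layout: sparse_super enabled and
    8 * block_size blocks per group. Other layouts will differ.
    """
    blocks_per_group = 8 * block_size
    # With 1 KiB blocks the first data block is 1, otherwise 0.
    first_data_block = 1 if block_size == 1024 else 0
    usable = total_blocks - first_data_block
    group_count = (usable + blocks_per_group - 1) // blocks_per_group

    # sparse_super keeps backups in group 1 and in the groups whose
    # number is a power of 3, 5 or 7.
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        group = base
        while group < group_count:
            groups.add(group)
            group *= base
    return [first_data_block + g * blocks_per_group for g in sorted(groups)]


# Hypothetical filesystem of 2,000,000,000 blocks of 4 KiB.
blocks = backup_superblock_blocks(2000000000)
print(blocks[:5])
print(1934917632 in blocks)
```

With these assumptions, 1934917632 is simply block group 59049 (3 to the power 10) multiplied by 32768 blocks per group, which is why it shows up as a valid backup location.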
fsck will rebuild the primary superblock and find a bunch of errors caused by the garbage in the first gigabyte… Unless you are a supernatural drive hacker, you will have no choice but to accept all the changes proposed by fsck, hoping it does not pick the wrong fix… In my case it worked pretty well.
You will need to rerun fsck several times:
fsck.ext4 -y -f -C0 /dev/lvm/main
You may also hit some recurring errors that fsck does not manage to fix. The only way I found to fix them was to use debugfs and remove the bad entries (clri clears the contents of the inode).
root@server:~# debugfs -w /dev/lvm/main
debugfs 1.42.12 (29-Aug-2014)
debugfs: blocks 61901
61901: File not found by ext2_lookup
debugfs: blocks <61901>
34764 34769 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 2393 [...] 2393
debugfs: clri <61901>
debugfs: blocks <61901>
debugfs: quit
Do not forget to rerun fsck several times, until no more errors are found or fixed.
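If you get tired of relaunching it by hand, the loop can be sketched in Python as below. This is only a sketch: the function name and the run parameter (which lets you plug in a fake runner for a dry run) are made up for this example, and the exit status handling follows the e2fsck convention where 0 means no errors and 8 or more means fsck itself failed.

```python
import subprocess


def fsck_until_clean(device, run=subprocess.call, max_passes=10):
    """Run fsck.ext4 repeatedly until it reports a clean filesystem.

    Returns the number of passes used, or raises RuntimeError if the
    filesystem is still not clean after max_passes.
    """
    cmd = ["fsck.ext4", "-y", "-f", "-C0", device]
    for attempt in range(1, max_passes + 1):
        status = run(cmd)
        # e2fsck exit status is a bit mask: 0 means no errors,
        # 1 and 2 mean errors were corrected, 4 means errors remain,
        # 8 and above mean fsck itself could not do its job.
        if status == 0:
            return attempt
        if status >= 8:
            raise RuntimeError("fsck failed with status %d" % status)
    raise RuntimeError("still not clean after %d passes" % max_passes)
```

It would be called as fsck_until_clean("/dev/lvm/main"), as root and with the filesystem unmounted.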
After all these steps, you should have a working ext filesystem again. Because the first gigabyte was overwritten, the root directory (/) has most likely been wiped, and you may find an apparently empty filesystem. But do not panic, your files are in lost+found! The names of the top-level entries are most likely lost (everything is renamed #xxxxx), but it should be easy to recognise them from their contents, rename them, and move them back to the root folder (as they are mostly folders, this is easier than it sounds).
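To recognise the #xxxxx entries more quickly, a small helper can list each one with a sample of what it contains. This is a minimal sketch, not what I used at the time: the function names and the /mnt/lost+found path in the usage note are made up for this example.

```python
import os


def describe_lost_found(lost_found, sample=5):
    """Map each entry of lost+found to its kind and a few child names."""
    summary = {}
    for entry in sorted(os.listdir(lost_found)):
        path = os.path.join(lost_found, entry)
        if os.path.isdir(path):
            # A few child names are usually enough to recognise a folder.
            children = sorted(os.listdir(path))[:sample]
            summary[entry] = ("dir", children)
        else:
            summary[entry] = ("file", [])
    return summary


def print_summary(summary):
    """Print one line per entry: name, kind, sample of children."""
    for entry, (kind, children) in sorted(summary.items()):
        print("%-12s %-4s %s" % (entry, kind, ", ".join(children)))
```

Calling print_summary(describe_lost_found("/mnt/lost+found")) gives one line per entry. An entry containing bin, etc and home, for example, is easy to identify, and can then be moved back by hand.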
Phew! You now have a working root folder back. You may take a snapshot of it, and then use it normally.
But you may have lost or damaged some files (one gigabyte was overwritten, which can affect several thousand files…). This is where the second part comes in.
Step 2 : recover files
As I am a bit paranoid about my files, I had several safety nets:
- A simple file list backup (weekly): to be able to detect file loss
- An md5 hash backup of all my files (weekly): to be able to detect file corruption (viruses, …)
- A CrashPlan offsite backup (immediate or daily): an offsite copy, in case anything goes wrong…
The second one was very useful for assessing the damage, but the hashes obviously need to be rebuilt completely after the crash, which takes quite a long time. The file list can therefore provide a rough result more quickly. By diffing the old and new md5 hashes, you get a list of files that may be corrupted and should therefore be restored from CrashPlan.
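The diffing step can be sketched in Python as follows. The sketch assumes the hashes are stored as md5sum-style lines (hash, two spaces, path), which is not necessarily how my own backup was laid out. The function names and sample paths are made up for this example.

```python
def parse_md5_manifest(lines):
    """Parse md5sum-style lines ("<hash>  <path>") into {path: hash}."""
    hashes = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        digest, _, path = line.partition(" ")
        # md5sum puts a second space (text mode) or "*" (binary mode)
        # just before the path.
        if path[:1] in (" ", "*"):
            path = path[1:]
        hashes[path] = digest
    return hashes


def compare_manifests(before, after):
    """Return (missing, changed) paths between two {path: hash} dicts."""
    missing = sorted(p for p in before if p not in after)
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    return missing, changed


old = parse_md5_manifest(["aaa  /data/a.txt", "bbb  /data/b.txt", "ccc  /data/c.txt"])
new = parse_md5_manifest(["aaa  /data/a.txt", "xxx  /data/b.txt"])
missing, changed = compare_manifests(old, new)
print(missing)  # ['/data/c.txt']
print(changed)  # ['/data/b.txt']
```

The missing list is what the plain file list already gives you quickly. The changed list is the set of candidates to restore from the offsite backup.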
CrashPlan has a powerful GUI, but you must select each file to restore manually, which can be very tedious. If you have enough space and bandwidth, you can obviously restore all files, but that was not my case. Moreover, restore speed can be quite slow; I did not work out why. CrashPlan has an incredible API that would have been great for restoring files, but sadly it is reserved for the enterprise version. So I tried to automate this task through the GUI. I tested several automation frameworks on Linux and Windows, and the least problematic one was PyWinAuto on Windows. Performance is not good, but it proved to be quite reliable.
Here is the script (prerequisite: PyWinAuto):
"""CrashPlan Restore Automation (win32)"""
from __future__ import unicode_literals
from __future__ import print_function
import imp
from pywinauto.controls import common_controls
imp.reload(common_controls)
from pywinauto import application
from time import sleep
from pprint import pprint
import sys

app = application.Application()
app.connect(title_re=".*CrashPlan", class_name="SWT_Window0")
win = app.top_window()
# Activate Restore Tab
win.type_keys("^r")
win.wait_for_idle()
sleep(1)
#win.print_control_identifiers()


def getChild(node, name):
    """Return the child of node whose text is name, or None."""
    if node is None:
        return None
    for item in node.children():
        text = item.Text()
        if text == "":
            # The text is empty until the item has been made visible
            item.EnsureVisible()
        if name == item.Text():
            return item
    return None


def getChildTimeout(node, name, timeout):
    """Wait (up to timeout seconds) for the children to load, then look up name."""
    sleepdef = 0.1
    n = 0
    text = ""
    # while ( getChild(node, name) is None) and n < timeout:
    while (text == "") and n < timeout:
        try:
            text = node.children()[0].Text()
        except Exception:
            pass
        n = n + sleepdef
        if (n % (1 / sleepdef)) == 0:
            print(".", end="")
            sys.stdout.flush()
        sleep(sleepdef)
    return getChild(node, name)


def getItemOne(node, name):
    """Expand node and return its child called name, or None."""
    if node is None:
        return None
    node.EnsureVisible()
    node.Expand()
    # 'chargement en cours...' is the French UI text for 'loading...'
    #while getChild(node,'chargement en cours...') is not None:
    #    sleep(1)
    item = getChildTimeout(node, name, 60)
    if item is None:
        print("ERROR : Not found " + name, end="")
        sys.stdout.flush()
    else:
        print("/", end="")
        sys.stdout.flush()
    return item


def getItemTree(node, path):
    """Walk down the tree from node, one path component at a time."""
    names = path.split('/')
    item = node
    for name in names:
        if len(name) > 0:
            item = getItemOne(item, name)
    return item


def checkItem(item):
    """Tick the checkbox of item; return False if item is None."""
    if item is not None:
        item.select()
        win.type_keys('{SPACE}')
        return True
    return False


def checkItemTree(node, path):
    return checkItem(getItemTree(node, path))


tree = win.child_window(class_name='SysTreeView32')
root = tree.Root()
root.select()
#checkItemTree(root, "/space/Backups/Server/encfs-find.pgp")
#sys.exit(0)
fname = "batch"
prefix = "/mnt/space_encfs/"
errors = []
print("Selecting file " + fname + " with prefix " + prefix)
for line in open(fname):
    line = line.rstrip("\n\r").rstrip()
    print("Check item " + line + " ... ", end="")
    sys.stdout.flush()
    if checkItemTree(root, prefix + line):
        print(" OK!")
    else:
        print(" ERROR !")
        errors.append(line)
print("Done.")
print("List of errors :")
for line in errors:
    print(line)
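The script above reads a file named batch, with one path per line, relative to the prefix. A minimal sketch to produce that file from a list of absolute paths could look like this (the function names are made up for this example, and paths outside the prefix are simply skipped):

```python
def to_batch_lines(paths, prefix="/mnt/space_encfs/"):
    """Turn absolute paths into the prefix-relative lines the script reads."""
    lines = []
    for path in paths:
        if path.startswith(prefix):
            lines.append(path[len(prefix):])
    return lines


def write_batch(paths, fname="batch", prefix="/mnt/space_encfs/"):
    """Write the prefix-relative paths to fname, one per line."""
    with open(fname, "w") as handle:
        for line in to_batch_lines(paths, prefix):
            handle.write(line + "\n")
```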
And then a script to put the restored files back in place (it checks the md5 and restores the file attributes):
#
SRCPATH=/space/Temp/crashplan/restored
BCKPATH=/space/Temp/crashplan/backup
SRCMD5=/space/Temp/crashplan/md5-restored
CHKMD5=/mnt/extra/md5-2016-12-09
#CHKMD5=/mnt/extra/md5-2016-07-03
RMMD5=/mnt/extra/md5
find "$SRCPATH" -type f | while IFS= read -r FILE
do
    # Path of the file in the live tree (restored path minus $SRCPATH)
    DSTFILE=`echo "$FILE" | sed -e "s?^$SRCPATH??"`
    BCKFILE="$BCKPATH$DSTFILE"
    # echo "$FILE -> $DSTFILE ($BCKFILE)"
    # ls -l "$FILE" && ls -l "$DSTFILE"
    if [ ! -f "$DSTFILE" ]
    then
        echo "$DSTFILE does not exist ; Moving restored $FILE to $DSTFILE"
        [ -d "`dirname "$DSTFILE"`" ] || mkdir -p "`dirname "$DSTFILE"`"
        mv "$FILE" "$DSTFILE"
    else
        FILEMD5=`cat "$SRCMD5$FILE" | cut -d " " -f 1`
        FCKMD5=`cat "$CHKMD5$DSTFILE" | cut -d " " -f 1`
        # if [ "$FILEMD5" != "$FCKMD5" ] && [ -n "$FCKMD5" ]
        if [ "$FILEMD5" != "$FCKMD5" ]
        then
            echo "MD5 difference on $DSTFILE : $FCKMD5"
        else
            # The restored file matches the reference md5: keep the
            # current file as a backup and put the restored one in place
            echo "Backing up $DSTFILE to $BCKFILE"
            chmod --reference="$DSTFILE" "$FILE"
            chown --reference="$DSTFILE" "$FILE"
            mkdir -p "`dirname "$BCKFILE"`"
            mv "$DSTFILE" "$BCKFILE"
            echo "Moving restored $FILE to $DSTFILE"
            mv "$FILE" "$DSTFILE"
            # Set aside the current md5 entry of this file
            mv "$RMMD5$DSTFILE" "$RMMD5$DSTFILE.bk"
        fi
    fi
done
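The script above relies on md5 values computed beforehand. If you prefer to check a restored file directly, the comparison can be sketched in Python like this (the function names are made up for this example, and the file is read in chunks so that large files do not have to fit in memory):

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal md5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restored_file_is_valid(path, expected_md5):
    """True if the file at path has the expected (known-good) md5."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Only a file for which restored_file_is_valid returns True should replace the current one; anything else is better restored again or inspected by hand.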
By the way, here is a simple script to merge folders:
#!/bin/bash
# The last argument is the destination, all the others are sources
DEST="${@:${#@}}"
ABS_DEST="$(cd "$(dirname "$DEST")"; pwd)/$(basename "$DEST")"
for SRC in "${@:1:$((${#@} -1))}"; do (
    cd "$SRC" || exit 1;
    find . -type d -exec mkdir -p "${ABS_DEST}"/{} \;
    find . -type f -exec mv {} "${ABS_DEST}"/{} \;
    find . -type d -empty -delete
) done
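For those who prefer Python, roughly the same merge can be sketched as below. It is not a strict equivalent of the bash script: the function name is made up for this example, and it assumes a POSIX system, where moving a file onto an existing one overwrites it, as mv does.

```python
import os
import shutil


def merge_tree(src, dest):
    """Move every file under src to the same relative path under dest."""
    # Bottom-up walk, so that emptied sub-folders can be removed as we go.
    for dirpath, dirnames, filenames in os.walk(src, topdown=False):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dest, rel))
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        for name in filenames:
            # Like mv, this overwrites a file that already exists in dest.
            shutil.move(os.path.join(dirpath, name),
                        os.path.join(target_dir, name))
        # Remove the source folder once it is empty.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
```

Calling merge_tree once per source folder gives the same result as passing several sources to the bash script.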
And there you go, your files are now restored! (It took me six months…)
Conclusions
- Prevention is always best: you need to set up a few things in advance to be prepared
- The three levels are all useful:
  - Detect the files you lost (file list)
  - Detect file corruption (md5 hash)
  - Back up the contents (offsite)
- CrashPlan does not back up files reliably: in my case, approximately 5% of the files were corrupted