I recently built a new fileserver for home. It’s pretty nice, with 3.7TB of RAID5 storage. I’ll never have to worry about losing data again (mostly).
Unfortunately, in the process of transferring data, I found that, well, I had already lost data. My old backup drive (an external, FAT32 disk) contained a directory that pointed off into gobbledygook, containing garbage as “files” and “directories.” I noticed this when my rsync went off the deep end. This “broken copy” problem is simple to fix: rsync has an --exclude option. To get the copy to finish, I just --exclude the broken directory and recopy. Thankfully, the directory was junk (eclipse metadata), so I could ignore it, and all was well.
But. Once I finish the copy, I’d like to continue using this drive for moving data around, etc. With this corrupt directory, though, I can’t quite trust it. When you realize what causes this corruption, it becomes clear that no off-the-shelf tools will repair this without potentially mangling other data. So, there’s no way to fix it short of formatting it. But, understanding the corruption pretty intimately, I saw that it might be possible to fix this, if I hand-edit the disk.
So I did. Knowing what the underlying problem was, I wrote some tools to inspect the filesystem manually, then went in and excised this directory. Once that was done, the FS came up “clean enough,” and worked. It might be fun to go through the process of building these tools, so here’s a quick overview.
First off, let’s identify the problem more precisely. The big symptom is that ‘find’ returns tons of the following:
It’s clearly “corrupt,” but what does that mean? First, we’ll need to learn a little bit about how FAT works. To do that, you can read any of a myriad of difficult-to-follow documents, all of which talk a lot about things that are only relevant on floppy disks (no, really). Examples: Microsoft’s Description of the FAT File System (which is incomplete: it doesn’t cover FAT32), Wikipedia’s File Allocation Table (good, but tl;dr), Andries Brouwer’s The FAT filesystem (great, a little incomplete), and Paul Stoffregen’s Understanding FAT32 Filesystems (aha, finally helpful! But lots of disk-ish stuff, and a few missing bits).
Once you’ve digested all of those, you’ve got a good idea of how FAT works. Or you can sit down and listen to Josh talk story.
First and foremost: FAT was designed for floppies. There’s lots of cruft in there, and a bit of weirdness, from that heritage. A lot of things we think of as constants are variables in the land of 3.5″ vs 5.25″, DS vs SS, DD or not… all that great floppy crap. Just be prepared for this.
So, without further ado, the structure of a FAT filesystem. I’m going to assume you’re working with a partition on a hard disk, or a ‘dd’ copy of one. The first chunk of the disk is the Boot Sector. This is where the CPU boots from, so it’s got some actual code in it. This is important for floppies. We don’t care. What we do care about are the following fields stored in the Boot Sector:
- bytes per sector (0x0b-0x0c)
- sectors per cluster (0x0d)
- number of FAT copies (0x10)
- number of sectors per FAT (0x24-0x27)
- first cluster of FS (0x2c-0x2f)
- FS type (0x52-0x59)
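If you want to follow along, here’s a minimal sketch in Python of pulling those fields out of the boot sector. The offsets are the ones listed above (root cluster at 0x2c, per Microsoft’s FAT32 spec); the function name is mine, not part of any standard tool.

```python
import struct

def parse_boot_sector(sector: bytes) -> dict:
    """Extract the FAT32 geometry fields we care about from the
    first 512 bytes of the partition. All values are little-endian."""
    return {
        "bytes_per_sector": struct.unpack_from("<H", sector, 0x0B)[0],
        "sectors_per_cluster": sector[0x0D],
        "num_fats": sector[0x10],
        "sectors_per_fat": struct.unpack_from("<I", sector, 0x24)[0],
        "root_cluster": struct.unpack_from("<I", sector, 0x2C)[0],
        "fs_type": sector[0x52:0x5A].decode("ascii", "replace"),
    }
```

Feed it `open('/dev/sdX1', 'rb').read(512)` (or the first 512 bytes of your `dd` image) and you’ve got everything needed for the cluster math later on.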
If you’re playing the home game, begin by checking the FS Type. It should be “FAT32 “. If it’s not, you’re in trouble, because I’m only explaining FAT32.
This is just the header structure, which lets us determine the geometry of the drive, and calculate where data goes. After this, there’s the eponymous File Allocation Table, then a second copy of the FAT, and then we get to the actual data. Let’s talk a little about how the data is organized, then come back to the FAT.
The actual directories and files in FAT32 are stored in “clusters.” A cluster is a contiguous block of sectors; a sector is a contiguous block of bytes (corresponding to a floppy disk sector, once upon a time). All of these things are variables, though it’s really typical that sectors are 512 bytes, and clusters are 32Kbytes. You should read in the appropriate variables above, and compute them for your disk, though. The cluster is the base unit of allocation of your filesystem. When you store a file, its content gets put into a set of clusters. If the file is under 32K long, it goes into a single cluster; if it’s more than 32K, it goes into multiple clusters. Similarly, directories are stored in clusters. Most directories fit within one cluster, but sometimes they, too, span multiple clusters.
The content of files are stored “raw”: a file cluster is just a 32K chunk of that file. All the metadata about the file is stored in a directory cluster. Directory clusters, on the other hand, are nothing but metadata. They’re a list of entries with names, types (directory or file), and cluster numbers. So, you start out at the root cluster (“first cluster of FS”), and read it as a directory. Walking through this, you find the name of all files and directories in the root of the FS, and pointers to the clusters containing those files and directories. From there, you can go to those clusters and read their contents.
But what if a file (or directory) spans more than one cluster? Normally, you’d assume the OS would allocate a contiguous range of clusters to any file, and just write the file there. However, let’s assume you’re the OS, and you’re working on a disk that’s been in use for a while. What if the user wants to write a file that’s bigger than any contiguous range of clusters, but fits into the free space of the disk? You could shuffle things around to make that work, but, haven’t I heard this before? Oh, yeah, the bin packing problem. It’s hard. So, instead, we fragment the file into multiple clusters, and spread them to the four corners of the disk. This lets us use all the free space on the disk, without having to solve NP problems every time the user writes a new WordPerfect document. (Yay floppies!)
So, now we need a way to keep track of all these clusters. Putting them into the directory entry itself is tricky: we’d really prefer it if our directory tables were made of constant-length entries. We could use some space at the end of each cluster to point at its next entry, but there must be some good reason not to do it that way, which probably involves floppies (it smells like a DMA/alignment thing to me, but I dunno. If you do, please comment). Why must there be a good reason not to do it that way? Because MS chose to solve it in a way that makes absolutely no sense otherwise.
They created the File Allocation Table. If you looked at the Boot Sector of your disk, you might’ve noticed that the “number of sectors per FAT” is a kinda big number. On my 320GB drive, it comes out to 37MB. If all the file contents are in clusters, and the directories and filesystem metadata are in other clusters, what the heck is in the FAT?
Naturally, it’s the linked list of what cluster comes after the current one. Every cluster on the disk gets a 32-bit entry in the FAT (hence FAT32), which is the index of the next cluster after it in the current file. These are coindexed, so cluster 2 gets the 2nd FAT entry: whatever number is in there is the next cluster in the chain. There’s a few special values: anything over 0x0ffffff7 is an “END” marker. So, let’s assume we have a directory that points to “NIBBLE.BAS” of length 35K at cluster 13, and that FAT[13] = 37, and FAT[37] = 0x0fffffff. Then NIBBLE.BAS is made up of all 32K of cluster 13, and then the first 3K of cluster 37. The remainder of 37 gets wasted, which is a real shame on a 720K floppy.
So, to recap: file metadata is stored in directory tables, which are stuck in clusters. File contents are also stuck in clusters. If any one entity spans more than one cluster, you start at the first cluster in its chain, then go to the FAT entry for that cluster to find out the next one. If the next one is greater than 0x0ffffff7, you’ve hit the end of the chain.
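The chain-following described above is a few lines of code. This sketch assumes you’ve already loaded the FAT into a list of 32-bit entries (so `fat[n]` is the entry for cluster n); the function name is mine. The NIBBLE.BAS example from earlier works as a sanity check: with FAT[13] = 37 and FAT[37] set to an end marker, walking from 13 yields the chain [13, 37].

```python
def cluster_chain(fat, first_cluster):
    """Follow a FAT cluster chain from first_cluster to its END
    marker, returning the list of clusters in order. Only the low
    28 bits of each entry are meaningful; anything >= 0x0FFFFFF8
    ends the chain."""
    chain = []
    cluster = first_cluster
    while cluster < 0x0FFFFFF8:
        chain.append(cluster)
        cluster = fat[cluster] & 0x0FFFFFFF
    return chain
```

On a corrupt disk like mine, a chain can loop, so a real tool would also bail out if it revisits a cluster; I’ve left that guard out to keep the sketch readable.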
How is that metadata stored? Well, it’s not too bad until we get to long file names. It’s just a packed struct, of a fixed length. Every directory table entry is 0x20 bytes long. The fields you care about:
- Filename+Ext (0x00-0x0a): 8.3, minus the dot
- Attributes (0x0b): directory/readonly/system/etc. (an attribute value of 0x0f is a wacky VFAT long-name entry; if bit 0x10 is set, it’s a directory; see Andries for a complete bitmask)
- Cluster high: 0x14-0x15
- Cluster low: 0x1a-0x1b
- Filesize: 0x1c-0x1f
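Decoding one of these 0x20-byte entries is, again, a small struct-unpacking exercise. A sketch, using the offsets above (the function and field names are my own):

```python
import struct

def parse_dir_entry(entry: bytes) -> dict:
    """Decode one 32-byte FAT directory table entry."""
    name = entry[0:8].decode("ascii", "replace").rstrip()
    ext = entry[8:11].decode("ascii", "replace").rstrip()
    attr = entry[0x0B]
    hi = struct.unpack_from("<H", entry, 0x14)[0]
    lo = struct.unpack_from("<H", entry, 0x1A)[0]
    return {
        "name": f"{name}.{ext}" if ext else name,
        "is_lfn": attr == 0x0F,            # VFAT long-name entry
        "is_dir": bool(attr & 0x10),
        "cluster": (hi << 16) | lo,        # high word is the top 16 bits
        "size": struct.unpack_from("<I", entry, 0x1C)[0],
    }
```

Walk a directory cluster 32 bytes at a time through this and you get names, attributes, and the starting cluster of everything in it.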
Notice that the cluster is split up? The “high” part is in a previously reserved part of the directory entry structure. They used to only have the low part. Floppies.
So, you read in those bits, multiply high by 65536 (it’s the top 16 bits of the cluster number), add it to low, and Bob’s your uncle. The filenames are probably awful (BOBSY~12TXT), but, for the purposes of “find the place to excise a tumor from my FS,” I can deal with ugly path components.
At this point, we can see what went wrong above: the directory entry for …/8f/ got mangled somehow, and its clusterhi+clusterlo now points off into the middle of a file. The contents of that file are being dutifully interpreted as a directory. Better still, the file goes on for a reasonable number of clusters, and, best of all: since those clusters contain random stuff, some of it is being interpreted as pointers to subdirectories, which point off to random clusters. Which are probably other files, and so on, and so forth.
This explains the behavior I was seeing quite nicely. So, what do I do to fix it? If I tried to delete it recursively (rm -rf and/or deltree), it could be Very Bad: it’s going to go through and crap all over the clusters randomly pointed to, breaking the chains they’re in, and potentially mangling the whole FS. That would be pretty tragic.
Knowing what we know above, though, there’s a pretty obvious answer: let’s just remove the bogus directory entry by hand! Since the contents of 8f are a lost cause anyway, I could just go in and nuke its directory entries. But, honestly, I kinda hate eclipse. I only care about that workspace directory for the code in it; the eclipse metadata can go to hell for all I care. So, I’m going to cut the tree off there. Now I just need to find the 32 bytes on the disk to overwrite to make this happen.
To do that, we need to find the cluster on disk. We know the first cluster comes right after the FAT, so that’s easy: len(boot sector) + len(FAT) + N * sizeof(cluster), right? No, that would make sense, and this is FAT32. Remember the variable “first cluster of FS” in the boot sector? Didn’t it seem like that should be zero? Yeah, it’s two. It’s always two. Not sure why, it just is, but, remember, it’s stored in a variable. So, that’s the cluster you go to read the root filesystem, but it’s actually the first cluster immediately after the FATs. Yes, FATs, because there’s probably two of them. So, in the end, you wind up with:
offset_of_cluster(N) = 512 + 2*sectors_per_fat*bytes_per_sector + (N - first_cluster) * (sectors_per_cluster * bytes_per_sector)
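The same formula in code, with the constants parameterized. One caveat I’ll hedge on: the leading 512 assumes a single reserved sector before the first FAT; in general, the reserved-sector count lives at offset 0x0e of the boot sector, so I’ve made it a parameter defaulting to 1 to match the formula above.

```python
def offset_of_cluster(n, sectors_per_fat, bytes_per_sector,
                      sectors_per_cluster, first_cluster=2,
                      reserved_sectors=1, num_fats=2):
    """Byte offset of cluster n from the start of the partition:
    skip the reserved sector(s), skip all copies of the FAT, then
    count (n - first_cluster) whole clusters into the data area."""
    fat_bytes = num_fats * sectors_per_fat * bytes_per_sector
    cluster_bytes = sectors_per_cluster * bytes_per_sector
    return (reserved_sectors * bytes_per_sector + fat_bytes
            + (n - first_cluster) * cluster_bytes)
```

Seek there, read sectors_per_cluster * bytes_per_sector bytes, and you’re looking at the raw cluster.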
Ain’t FAT grand? This is the most widely-deployed FS in the world, folks.
With all that in hand, we can finally write code to walk down these things and find out where they live on disk. Once you’ve done that, you can find the offending entry in the directory table and pluck it out by writing zeros over it. Well, assuming it’s after anything else you might want in that directory. If the zeroth byte of a filename is 0x00, it’s assumed to be the end of the directory listing. Supposedly. So, if you want to access stuff beyond the excised directory, you might be in trouble. Good luck with that. Happily (and oddly), Linux’s implementation of vfat skips over zeroed out entries, probably assuming that they’re problematic in this kind of a way. Actually, it only does that when I mount the disk I modified; the disk image treats this as a directory terminator. There’s an awkward question I don’t want to ask there, so I won’t.
Also note that the FAT is probably wonky at this point. We pruned the directory entry, but it might still have clusters allocated to it. Since we didn’t (and can’t) clean those up, there are probably a bunch of clusters marked as used which aren’t actually connected to the current FS tree. There’s not any good way to identify these, short of walking the whole FS and keeping track of which clusters are part of something. At that point, we’ve written our own baby fsck (the stock fsck.vfat totally fails on drives this size for me, and I’m definitely not hooking this thing up to a windows machine at this point).
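The bookkeeping half of that baby fsck is simple once you have the hard part (walking the whole tree to collect every reachable cluster) done. A sketch of just the comparison, with names of my own invention: given the FAT and the set of clusters you actually reached, the leftovers are the orphans.

```python
def orphaned_clusters(fat, reachable):
    """Clusters marked allocated in the FAT (nonzero entry) but not
    connected to anything in the directory tree. Entries 0 and 1
    are reserved, so only clusters >= 2 count."""
    allocated = {n for n, entry in enumerate(fat)
                 if n >= 2 and (entry & 0x0FFFFFFF) != 0}
    return sorted(allocated - set(reachable))
```

A real fsck would then mark those entries free; I stopped at counting them, since wasted clusters were a price I was happy to pay.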
After all was said and done, I got my drive back. It’s probably got a bunch of wasted clusters, but, you know what? I don’t care. There’s a lot of stuff in this process that, when it comes down to it, I don’t care about. All that matters is that, at the end of the day, I got my drive back.