Too Many Backups?

22:55 Sun 25 Nov 2012

Given how often I’ve stressed the need to back up your stuff, it may seem odd for me to claim that it’s possible to have too many of them. But in some senses it is, which is why I’m writing this post as I’m copying files to my main hard drive from a virtual machine running Ubuntu that’s mounting an OpenBSD drive via a USB-SATA adapter[1].

Backups

“Backups” normally means addressing one problem, redundancy. That is, you have some number of files on your hard drive, and you want to still have them if your hard drive melts. So, you copy them to some external device at regular intervals. Now you have redundancy, and presumably if you copied your files to more devices, you would only have more redundancy.

Really dealing with backups, however, means tackling at least the following five areas:

  • Redundancy.
  • Versioning.
  • Security.
  • Accessibility.
  • Information management.

Versioning

If you make a major change to file x, and your backup system catches the change and propagates it, you’re out of luck if you want the prior version of the file. Unless you have versioning. More and more systems, including Time Machine, offer some ability to go back to an earlier version of a file, at least up to a point. If you really care, you should be using a real version control system such as Git or Mercurial. Having some system manage versioning for you also makes long-term backups easier to handle, and a cross-platform system is clearly more future-proof. If you’re using one, your efforts at redundancy should be focused on making the version control repository redundant[2].
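
As an illustration of what using one can look like (a minimal sketch of my own, not anything from the original workflow, and the directory path is a placeholder): with Git installed, a few lines of Python are enough to turn each backup run into a commit in a single history.

    import subprocess
    from datetime import datetime

    DOCS_DIR = "/home/me/documents"  # placeholder: the directory whose history you want to keep

    def snapshot():
        """Commit the current state of DOCS_DIR to a Git repository inside it."""
        # Safe to run repeatedly: git init on an existing repository just reinitialises it.
        subprocess.run(["git", "init"], cwd=DOCS_DIR, check=True)
        # Stage every change, including new files and deletions.
        subprocess.run(["git", "add", "--all"], cwd=DOCS_DIR, check=True)
        # Commit with a timestamped message; this simply exits non-zero if nothing changed.
        message = "snapshot " + datetime.now().strftime("%Y-%m-%d %H:%M")
        subprocess.run(["git", "commit", "-m", message], cwd=DOCS_DIR)

    if __name__ == "__main__":
        snapshot()

Run from cron or any other scheduler, something like this gives even irregular backups one coherent history instead of a pile of ad hoc copies.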

Versioning is more important than it might sound. Even if you don’t have that many files, or that many versions of files, right now, how long do you think you’re going to be using computers? How many files and versions of those files are you going to have over that time? Do you really want to manage them using guesswork and filenames ending with version1, version2, final, final2, etc.?

Security

How important this is will vary, but maximizing redundancy can mean sacrificing some security. The more machines your information is on, the easier it is for others to access. One of the answers to this problem is encryption, but that introduces an additional point of failure: if you lose or forget your key, your backups are lost to you. While this is a matter for each individual, I suspect the best approach is to use encryption on some things but not everything.
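
To make that single point of failure concrete, here’s a minimal sketch using the third-party Python cryptography package (my choice of tool, and the filenames are placeholders): lose backup.key and backup.tar.enc is unrecoverable.

    from cryptography.fernet import Fernet  # third-party: pip install cryptography

    # Generate the key once and store it somewhere you will not lose it;
    # without it the encrypted archive below is permanently unreadable.
    key = Fernet.generate_key()
    with open("backup.key", "wb") as f:
        f.write(key)

    cipher = Fernet(key)

    # Encrypt a backup archive before copying it to less-trusted storage.
    with open("backup.tar", "rb") as f:
        encrypted = cipher.encrypt(f.read())
    with open("backup.tar.enc", "wb") as f:
        f.write(encrypted)

    # Restoring later requires the same key: Fernet(key).decrypt(encrypted)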

Accessibility

Meaning: how quick and easy is it to restore everything from your backups? Also: how easy is it to make those backups? The answer to the second question absolutely needs to be “very”; otherwise you won’t do it often enough. The first depends on what you’re backing up, and for some cases slow restoration isn’t a problem.

Information Management

This is cheating a little, as this category encompasses all the others. Backups are one aspect of information management. But your backups are going to be less and less useful over time if the rest of your information management isn’t handled well—a lesson I’ve been learning repeatedly over time.

Poor information management is what leads to the “too many backups” problem that I’m currently tackling. Irregular backups that I’ve made over the last couple of decades are present in various forms around me: external hard drives, internal hard drives such as the one I managed to finally gain access to this evening, USB flash drives, DVDs, CDs, Zip disks, and even some 3.5-inch floppies. While figuring out how to access them is difficult, and may be prohibitively so in some cases, the true difficulty is that I just don’t know what I actually need from them.

Do some of them contain journal entries that were lost in a prior hard drive disaster? Writing that I didn’t manage to get into the main branch of organization for some reason? Photos that never got backed up elsewhere? I don’t know, and the only way to find out is the slow way: going through what’s on them.

When I first started dealing with digital files, I handled versions by copying files and changing their names, had some rudimentary organization, and made ad hoc backups at irregular intervals. I started using real version control well after I should have. I didn’t switch away from binary data formats early enough, and didn’t switch to a true plain text format until embarrassingly late. I didn’t make a real effort to organize my files rationally from the outset; I also didn’t include dates and other metadata in my files as a matter of course.

Doing any of those earlier would have eased my current predicament: real version control would have imposed a structure on further backups even if they were irregular; a true plain text format makes comparisons between versions trivial[3]; rational organization from the outset makes comparisons between backups far easier.
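
As an aside on what “trivial” means here, comparing two plain text versions of a file is a few lines of standard-library Python (the filenames are hypothetical); with a binary format, the same comparison usually means finding an application that still opens both files.

    import difflib

    # Two versions of the same document, recovered from different backups.
    with open("journal-2009-backup.txt") as f:
        old = f.readlines()
    with open("journal-current.txt") as f:
        new = f.readlines()

    # Print a unified diff showing exactly what changed between the versions.
    for line in difflib.unified_diff(old, new, fromfile="2009 backup", tofile="current"):
        print(line, end="")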

Digital Hoarding

All of this only matters if you care about your files. Not caring is a perfectly defensible position, but it doesn’t work for me. I have pack rat tendencies, but in many ways these are less problematic for digital assets. The biggest problem with being a pack rat is that physical things take up space. One of the ways I’ve tackled that problem is to move increasingly to being a digital pack rat instead. Random stuff that I want to keep for some (normally not very good) reason I now turn into a digital file, normally by scanning or photographing it. I was a digital pack rat anyway, even before making any effort to shift, but that mostly works in my favor: it’s easier to structure digital assets, and the structures I’ve created already make it simple to add things like scans.

The other advantage of strong structure is that it’s easier to see what matters and what doesn’t. Because it’s harder to impose structure on physical things, they’re much more likely to end up in an undifferentiated mass, and with an undifferentiated mass it’s easier to give in to the fear that you’re going to get rid of “something important”. At least for me, putting things into a structure makes it much more obvious that I’m being ridiculous, and that makes it easier to get rid of things.

The fear of getting rid of “something important” is also greater with an undifferentiated mass of things because it’s less clear what precisely those things are—which is why I don’t simply wipe the too many backups I have and get rid of them. I don’t know what’s on them, and while they provided needed redundancy at times in the past, my past poor information management means I now have to invest time in going through them.

What Do You Want to Keep?

Are there documents, photos, or other files[4] you have now that you think you’ll want in 10, 20, or 50 years? If not, I salute you and your probably correct carefree attitude about the relative importance of such things. If you do want to keep some things, though, you should come up with a strategy. In addition to everything above, I also offer this advice:

  • Prune ruthlessly. The more you keep, the harder it’ll be to manage[5].
  • Centralize. Make sure the things you want to keep are in specific locations on your devices; otherwise backups are too hard.
  • Use a consistent structural system. This can be anything you want, but the easier you make it for yourself to simply know where a new category of things will go, and where a half-forgotten category of things might be, the more likely you are to have a useful system. This also means that if a machine breaks, it’s easier to restore your backups to the next machine, as your backups won’t be simply “everything on the old machine”, including things you might not actually want on the new one.
  • Use version control. Yes, I’m repeating myself.
  • Use plain text formats. They’re smaller, they’re more likely to be accessible decades from now, and they work better with version control.
  • Figure out what you want to keep, figure out a system that doesn’t create additional work for every backup, then back up regularly and automatically (see the sketch after this list). The last part is fairly easy now; if you don’t know how to do it, it’s almost guaranteed that a friend of yours does.
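
For the “regularly and automatically” part, here is a minimal sketch, assuming placeholder paths and that something like cron, launchd, or Task Scheduler runs it on a schedule: a dated copy of the directories you care about onto an external drive.

    import shutil
    from datetime import date
    from pathlib import Path

    SOURCES = [Path.home() / "Documents", Path.home() / "Photos"]  # placeholder: what to keep
    DEST = Path("/Volumes/BackupDrive")                            # placeholder: the external drive

    def backup():
        target = DEST / date.today().isoformat()
        target.mkdir(parents=True, exist_ok=True)
        for src in SOURCES:
            # A full copy into a dated folder each run; an incremental tool is more
            # space-efficient, but the point is that nothing here needs a human.
            shutil.copytree(src, target / src.name, dirs_exist_ok=True)

    if __name__ == "__main__":
        backup()

A dedicated tool (Time Machine, rsync, and so on) will do this better; the point is only that once it’s scheduled, no willpower is involved.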

[1] It was not a trivial task to get access to the data on the old drives, but I eventually got it to work using the following parts: VirtualBox for the VM, Ubuntu 12.04 desktop for the guest OS, OpenBSD who-knows-what (circa 2005) for the filesystem of the mounted drive, and a cheap CablesToGo-branded device that provides USB access to SATA and IDE drives—and the latter ability is important for me because I still have IDE drives I need to get data from…

[2] Or both, really; for the paranoid there’s always the possibility that the repository might be corrupt in some way, so having multiple checkouts as well as multiple repositories is safer.

[3] It also makes old files more accessible: the only way to open some of my older files is using OpenOffice.org, and I’m lucky that project has a focus on supporting old formats.

[4] Or things that can be transformed into files in some way.

[5] However, err on the side of keeping things—the additional cost of keeping something you may want later is fairly low. While it’s still better to be ruthless, it’s not worth deleting something and regretting it later. Keep just 10 of the 300 photos you took of the hike; don’t gamble on never wanting any record of that hike again by deleting all 300.

One Response to “Too Many Backups?”

  1. alex Says:

    Tadhg,
    I think you should also mention regular random verification of any backups. Any backup regime should have scheduled restoring of randomly selected backed-up files (or versions of files) to confirm backup system, restore system, and file data integrity. Without this you have no idea if your system is actually working; running regular backups is a lot less reliable if they are never checked until disaster strikes.
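
A rough sketch of the kind of spot check alex describes, assuming placeholder paths and a backup that mirrors the source tree: pick a handful of files at random and compare checksums between source and backup.

    import hashlib
    import random
    from pathlib import Path

    SOURCE = Path.home() / "Documents"               # placeholder paths; assumes the backup
    BACKUP = Path("/Volumes/BackupDrive/Documents")  # mirrors the source tree

    def sha256(path):
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def spot_check(sample_size=10):
        files = [p for p in SOURCE.rglob("*") if p.is_file()]
        for src in random.sample(files, min(sample_size, len(files))):
            copy = BACKUP / src.relative_to(SOURCE)
            if not copy.exists() or sha256(copy) != sha256(src):
                print("MISMATCH:", src.relative_to(SOURCE))
            else:
                print("ok:", src.relative_to(SOURCE))

    if __name__ == "__main__":
        spot_check()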
