A becomes B

I've been spending a lot of time recently taking all of my bulky old financial, academic and miscellaneous paperwork and scanning it into my computer. Another ongoing project of mine is to "go legit", whittling my stockpile of pirated films and music down to zero by either buying or deleting each file in turn - this has involved a hefty amount of CD and DVD ripping. I've got a lot of valuable data now which means it's become ever more important to have a clear, systematic backup system. Between my computer's hard drive, problematic network attached storage, burnable DVDs and miscellaneous resources online and offline, and the wildly varying levels of value and replaceability and size of my various files, this hasn't been trivial.

The core issue I've been mulling over is how to efficiently back up my computer's hard drive to a remotely-attached hard drive. Given unlimited bandwidth or a reasonably small file size, it's trivial to just copy everything over everything else once a week. But when we're talking about close to 500GiB of data it becomes less so. Naive folder synchronisation software might just look at file modification dates and filenames when making a comparison between a local and a remote directory. What about a file which was renamed? If that file was a 700MiB movie, or a directory containing 15GiB of Alias, I don't particularly want to have to upload all of that again just because the software was too stupid to spot it. Likewise duplicated data. Sure, it's possible to take MD5 hashes of data to perform swifter content comparisons, but doing that requires my computer to read every last bit of that file, and since we've established that the total size of the file is 15GiB, that doesn't actually save me any time, because all of that data has to be sent back to my computer for the hash to be calculated. And what about a huge file which was only modified in a small part? A completely different hash. What simple digest algorithms are there? Is it possible to make a piece of storage calculate these itself? What would be the ideal size of file to "chunk up" into hashes? Even knowing these, how do you 1) take the current, live filesystem and 2) the remote, out-of-date filesystem and turn that into the shortest possible sequence of file operations which would bring the remote filesystem up to date, including simple binary difference edits and file renaming? Adding in a cost for retrieving information about the remote files makes this much more complex.

I'm putting together a spider of sorts which can monitor a web site for changes. I want it to take an initial snapshot of the site, store this in a database, and then maybe a year later take a second snapshot. I want it to be able to figure out exactly how the site was restructured during that time. What links changed their targets? Which pages are new, which disappeared, how has the navigation changed, how has the content?

I'm always working on the content management system that makes up this website. At the moment I'm trying to wrap my head around the model-view-controller design pattern and figure out 1) how, exactly, it differs from what I'm already doing and 2) how to make what I'm already doing fit the pattern if it doesn't. As part of this I'm investigating ways to read data out of, and commit back into, the database, particularly when I edit pages. I don't want to have to write the entire database row back into the database every time I change the slightest thing. Okay, this problem is trivially easy compared to the others, but bear with me.

At work I routinely have to analyse test output from a grand total test suite numbering in the millions, which exercise thousands of distinct components making up the Bus software in dozens of distinct hardware/operating system environments. I have to take the vast swathes of passes and failures and make sense out of them. I have to find the common features which might have given rise to multiple simultaneous regressions. Were all these tests relating to SQL Server 2008? Were all those tests invoking TCP/IP calls? Did these other ones involve character encoding problems, or line-ending issues, or a specific machine whose hard drive has crashed its cog and needs junking? What new code has gone in? What parts of the product does that code affect? I want to start from a network of 100% passes, and figure out the simplest set of explanations which give rise to all of my observed failures. I want to automate this process.

I'm educating myself on formal language theory. I'm learning about taking an empty string (well, a starting character S) and applying a finite collection of operations to turn it into a terminal string.

I'm taking a structure A, and another structure B, and a set of costed mutator functions F, and trying to figure out the simplest, best, fastest or cheapest way to apply those functions to turn A into B.

Everywhere I look.

In everything I do in life.

I don't know if this is just my mighty brain's stupendous innate powers of pattern recognition on the blink, or whether there's deeper meaning here. I do know that these problems are all wildly different and that the variation in F particularly is what makes each of these tasks trivial or monumental, but I don't know how similar all the Fs are or how much of this much greater meta-problem has already been solved. This could be a fertile ground for research. Or, it may be old hat. Or, the metaphor may be too loose to be useful.

In university I took a course in Optimisation. That wasn't this. There were passing resemblances, though.

Back to Blog
Back to Things Of Interest

Discussion (25)

2010-03-02 00:45:35 by JeremyBowers:

rsync is a word that should have come up in that post. If you don't know what it is, there's actually a paper on the core algorithm that you can read: http://samba.anu.edu.au/rsync/tech_report/

I consider MVC to be grotesquely overrated and completely superseded by the more general DRY (Don't Repeat Yourself) principle. If your code is factored so that concepts only live in one place, you win. Laying on the further constraint that it has to be MVC just means you're going to push your code around to jam it into a format that doesn't necessarily offer any compensating benefits. I am in the minority on this... well, sort of. The majority of people will claim it's a good thing, then fail to be able to actually define it or understand the historical context of where it came from, then claim that something is "MVC" when it actually isn't. So actually when it gets down to it I'm not sure it's fair to say the majority even supports MVC, either, if they don't even really know what it means.

2010-03-22 22:49:34 by Pi:

I use Dirvish: http://www.dirvish.com/

It uses rsync to solve your problem, essentially. The only major issue is that if you change 1 byte of a 1 gigabyte file, the filesystem will store two files only differing by one byte. It still only transfers the one byte (plus overhead) over the network, though.

2010-03-22 23:35:40 by Val:

One of the big problems is that file systems do not work that way. When the NTFS and ext3 were born, there were not that many needs for synchronizations of files so large. A solution would be to store the last few alterations of a file in the file system itself. This would create overhead, of course, but for very large files the gain in synchronization speed would be worth the extra size.
Until such a file system is introduced or an OS supports it natively, there remains only the software solution: write (or search for) a script that does this. It should run on the other side es well, so I don't know weather it can be used if there is no computer there (or not one you can install programs on), just a dumb storage device.

2010-03-23 06:04:22 by eneekmot:

You could try Subversion as a file-control system. Once you've made your changes to your file system, you just Commit the top-level folder. I use Tortoise on Windows.

2010-03-23 10:34:51 by skztr:

If you are making a backup, but part of your solution involves "if there is a small change in this file, propagate the change as efficiently as possible", What you've set up is a very efficient way to invalidate your backups.

But as for your larger point, the entire universe is a computer which only allows your to Move things, never Copy them. So in a way, every aspect of life is about trying to find the most efficient way to re-arrange one thing so that it looks like another.

2010-03-23 11:22:00 by Val:

@skztr: The main reason behind transistor-transistor logic and digital electronics derived from it is that the copy is, for all practical purposes, exactly the same as the original. It's because you work with sampled and quantized data.

2010-03-23 11:25:19 by frymaster:

for windows computers, I store all my documents on a file server and use offline files to ensure there's a local copy on my PC. Aside from the initial copying to this file server, a gigabit network is sufficient to not slow down use of my documents (I'm not entirely sure what happens in the case where you are connected to a file server using an offline-files-enabled share and you copy data on faster than the net connection can handle i.e. I'm not sure if I'm limited to the network speed or not)

I'm not sure if there's a remote file share system for linux that behaves the same way; I'm also not sure if you use linux

Anyway, once all my files were in the one centralised place, backups become easier (the "distributed" backup effect of using offline files acts more as redundancy (like RAID) than as a true backup, as it doesn't prevent me deleting a file). As someone has pointed out, backups should, in general, not be incremental

oh, and make sure the file share is covered by the volume shadow copy service, that way you also get the windows built-in "restore previous versions of this files" safety-net (though again, it's not a substitue for backups)

sorry this is so windows-centric, I've never done anything like that on linux (only straightforward total backups using rsync)

2010-03-23 14:21:04 by qntm:

What's wrong with incremental backups? I have to point out that doing a full backup of a 500GiB hard drive takes literally three and a half days at the speed that the remote drive runs at. I would be forever backing up.

2010-03-23 14:26:38 by pozorvlak:

eneekmot: Subversion doesn't do automatic rename detection, IIRC. Git does, and has the additional bonus of needing much less disk space to store the same history. Speaking of git, Sam, you might like to investigate git-bisect, which tracks down the commit in which a bug was introduced using binary search. It should be possible to script something of the sort on top of whatever VCS you're using.

More generally, I think version-control systems are going to have the most to offer as a starting point for the general problem. You could try reading up on modern DVCS merge algorithms and extending them to arbitrary trees, perhaps.

2010-03-23 14:28:16 by pozorvlak:

The problem with incremental backups is that recovering your data involves getting the first backup and then downloading and applying all diffs. A standard strategy is to do full backups every week or so, and then incremental backups more frequently.

2010-03-23 14:30:46 by pozorvlak:

Val: this breaks down at a quantum level. You might be interested to read up on linear logic (http://en.wikipedia.org/wiki/Linear_logic), which disallows copying and destruction of data. The hope is that it will provide the theoretical underpinnings of quantum computation.

2010-03-23 14:59:46 by Val:

pozorvlak: Of course, that's why I wrote "for all practical purposes". When planning a system we usually don't consider the astoundingly small probability of the atoms of our computer suddenly rearranging themselves to form a bowl of petunia. Thresholding to 0 and 5 Volts respectively is pretty robust, as the probability of all electrons suddenly changing direction together is small enough.

2010-03-23 20:06:16 by skztr:

Val: That's pretty much my point. Finding the most efficient way to move things around so that one grouping of things looks like another grouping of things is often more about finding out when to declare things as "close enough". For example- propogation is not backup, and it's a good idea to make sure you know what changes you're pushing.
I think Incremental backups are a great solution to the problem, provided you have multiple full backups as well. I know we're not talking about accounting data or anything, but the level of annoyance data loss can cause rarely scales downward with the importance of data as much as one might expect.

Snapshotting filesystems exist, though I expect they wouldn't play nicely with large blobs.
Speaking of which, anyone who would actually suggest using Subversion or Git as a backup solution for 500GB of binary data has clearly never tried to use either as any sort of solution for as little as 30MB of not-set-in-stone binary data.

Git is not for backups. Subversion is not for backups. Rsync is not for backups (but is a very good transport layer)

From what I'm hearing, it seems your backup drive is slow enough that you want it to be "write-only" during the backup process. I don't know the solution, but it's certainly not a full-fledged version control system written with text in mind.

2010-03-23 23:42:29 by qntm:

I was about to add similar comments re: using any version control system for a 500GiB file base. That is not what version control has ever been about.

2010-03-23 23:56:37 by Andrew:

You took optimisation? Was it still taught by the vaguely incomprehensible Russian guy? Maybe he would do your backups.

2010-03-24 00:10:13 by pozorvlak:

Oh, I wasn't suggesting using Git for the backup problem, but it should at least make a decent starting point for the "how has this website changed" problem.

2010-03-25 22:26:39 by TJSomething:

I personally use a script that calls plain rsync with a boatload of parameters to backup my ~250 GB media collection to an eSATA hard drive, which works quite fine.

2010-03-26 11:12:00 by pozorvlak:

Here's a guy who did write a backup system based on Git:

http://eigenclass.org/hiki/gibak-backup-system-introduction

He /claims/ it scales linearly with repo size, which would mean snapshotting would take about an hour and a half for a 500GB repository.

I'd be tempted to treat media (large, rarely-changing binary files) and everything else differently; use plain rsync for ~/Music and ~/Video, and gibak for everything else.

2010-03-27 19:13:43 by TJSomething:

pozorvlak: I actually do something like that, except I use Duplicity (http://duplicity.nongnu.org/) to backup everything outside of a "bigfiles" directory. The directories in there are selectively backed up using rsync. The whole thing takes about 30 minutes, most of which is Duplicity running.

2010-03-28 18:59:56 by drh:

@val

VMS actually had filesystem versioning built in, exactly as you described.

2010-03-30 06:29:45 by neil:

What about rdiff-backup?

Supposedly it gets the best of both worlds of incremental and full backups: changed files are copied in full and the difference to the previous version stored as a diff. Therefore the most recent backup can be obtained instantly, while old ones are still there. Also, any corruption in one of the diffs still leaves your most recent backup usable, as opposed to "normal" incremental backups.

It uses rsync under the hood as well. Of course this is (mostly) only an advantage if you have rsync or shell access to the storage to which you're backing up.

2010-04-12 23:02:53 by DTal:

It seems to me that if the thing Sam is backing up to is unintelligent (i.e. a consumer NAS) and can't perform md5s, etc, the way to do this particular problem would be to hash every file on the disk, and simply *keep track* of what's on the drive. If anything's changed on disk, just re-upload that file (or, depending on bandwidth cost versus cpu cost of more granular hashing) parts of files.

Regarding the generic optimization problem, it reminded me of something I'd read in "How The Mind Works" about vision - apparently we resolve visual ambiguities (such as the parity of a necker cube) with weighted transform functions. I think there really is something deeper here, and I have a tickle in the back of my head that stuff has been done on it under a different name, but I can't recall..

2010-05-20 18:52:18 by Dmytry:

500 GB / snail pace 10 megabytes per second = 50000 seconds = under 14 hours. I don't understand where can you get a drive that is slower than 10 megabytes per second, yet is >= 500 gigabytes. I'm using rsync (from shell script) to backup my files to another computer, and I keep everything important under GIT.

2010-06-06 19:50:03 by NuclearDog:

I'll toss my vote in for rsync/rdiff-backup/that family of apps. I use them to back up our web server at work (few hundred GB of data) to another internal hard-drive, and to another server in the same data center. It's quite bandwidth efficient for the amount of data we have (and how frequently it changes).

Caveat is you do need some sort of intelligent storage on the other end. If you look around Google, I'm sure there's probably some very polished way to handle backing up as efficiently as possible to a non-intelligent device. (Out there, right now, there's someone who lives and breathes this stuff. I'm sure they're much smarter than all of us in this domain :)).

I can give you a simple solution to your issues with using MD5 hashes to detect changes, though... You've already calculated it when you decide to send it over, so just store the hash somewhere. Save it in "movie.avi.md5", save it in a database locally, whatever. No longer need to transfer the file back to hash it again - it's already there. And if you don't mind me offering a pro-tip: Store the file size as well. You can avoid wasting a lot of time hashing a file to find out it's changed if you just compare file sizes first :)

2010-07-19 14:34:29 by green:

Perhaps bup deserves mention here. It is new and immature but promises fast backups with deduplication; faster and better than rsync.

http://apenwarr.ca/log/?m=201001
http://github.com/apenwarr/bup