I've been spending a lot of time recently taking all of my bulky old financial, academic and miscellaneous paperwork and scanning it into my computer. Another ongoing project of mine is to "go legit", whittling my stockpile of pirated films and music down to zero by either buying or deleting each file in turn - this has involved a hefty amount of CD and DVD ripping. I now have a lot of valuable data, which means it's become ever more important to have a clear, systematic backup system. Between my computer's hard drive, problematic network-attached storage, burnable DVDs and miscellaneous resources online and offline, and the wildly varying value, replaceability and size of my various files, this hasn't been trivial.
The core issue I've been mulling over is how to efficiently back up my computer's hard drive to a remotely attached hard drive. Given unlimited bandwidth or a reasonably small data set, it's trivial to just copy everything over everything else once a week. But when we're talking about close to 500GiB of data it becomes less so.

Naive folder synchronisation software might just look at filenames and modification dates when comparing a local and a remote directory. What about a file which was renamed? If that file was a 700MiB movie, or a directory containing 15GiB of Alias, I don't particularly want to have to upload all of that again just because the software was too stupid to spot the rename. Likewise duplicated data. Sure, it's possible to take MD5 hashes of files to perform swifter content comparisons, but calculating a hash requires reading every last bit of a file, and since the remote storage can't calculate hashes itself, all 15GiB of that data has to be sent back to my computer for the hash to be calculated - which doesn't actually save me any time. And what about a huge file which was only modified in a small part? A completely different hash.

What simple digest algorithms are there? Is it possible to make a piece of storage calculate these itself? What would be the ideal size of chunk to "chunk up" files into for hashing? Even knowing these, how do you take 1) the current, live filesystem and 2) the remote, out-of-date filesystem and turn them into the shortest possible sequence of file operations - including simple binary difference edits and file renames - which would bring the remote filesystem up to date? Adding in a cost for retrieving information about the remote files makes this much more complex.
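The chunking idea can be sketched briefly. This is a minimal illustration, not any real sync tool's algorithm: it hashes a file in fixed-size pieces (the 4MiB chunk size is an arbitrary assumption), so that a small edit to a huge file changes only one chunk's digest rather than the digest of the whole thing.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MiB per chunk - an arbitrary choice

def chunk_digests(path, chunk_size=CHUNK_SIZE):
    """Return one MD5 digest per fixed-size chunk of the file.

    If only one chunk of a huge file changes, only that chunk's digest
    changes, so only that chunk would need re-uploading.
    """
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).hexdigest())
    return digests
```

Fixed-size chunks still suffer when bytes are inserted near the start of a file (every subsequent chunk boundary shifts), which is why real tools lean on rolling hashes instead.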
I'm putting together a spider of sorts which can monitor a web site for changes. I want it to take an initial snapshot of the site, store this in a database, and then maybe a year later take a second snapshot. I want it to be able to figure out exactly how the site was restructured during that time. What links changed their targets? Which pages are new, which disappeared, how has the navigation changed, how has the content?
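A first pass at the snapshot comparison might look like this. It's a hypothetical sketch in which a snapshot is just a dict mapping each page's URL to its set of outgoing link targets - a real spider's snapshot would also carry content hashes, titles and so on.

```python
def diff_snapshots(old, new):
    """Compare two site snapshots, each a dict mapping page URL to the
    set of link targets on that page.

    Returns (added pages, removed pages, changed pages), where each
    changed page maps to (links removed, links added).
    """
    old_pages, new_pages = set(old), set(new)
    added = new_pages - old_pages
    removed = old_pages - new_pages
    changed = {
        page: (old[page] - new[page], new[page] - old[page])
        for page in old_pages & new_pages
        if old[page] != new[page]
    }
    return added, removed, changed
```

Note what this can't do: a page that moved from one URL to another shows up as one removal plus one addition, which is exactly the rename-detection problem from the backup scenario wearing a different hat.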
I'm always working on the content management system that makes up this website. At the moment I'm trying to wrap my head around the model-view-controller design pattern and figure out 1) how, exactly, it differs from what I'm already doing and 2) how to make what I'm already doing fit the pattern if it doesn't. As part of this I'm investigating ways to read data out of, and commit back into, the database, particularly when I edit pages. I don't want to have to write the entire database row back into the database every time I change the slightest thing. Okay, this problem is trivially easy compared to the others, but bear with me.
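The "don't rewrite the whole row" part at least has a tidy shape: diff the old row against the new one and emit an UPDATE touching only the changed columns. This is a hypothetical helper, not any particular ORM's API; the table and column names in the usage are made up.

```python
def build_update(table, key_column, key, old_row, new_row):
    """Build a parameterised UPDATE statement covering only the columns
    whose values actually changed.

    old_row and new_row are dicts mapping column name to value.
    Returns (sql, params), or (None, []) if nothing changed.
    """
    changed = {col: val for col, val in new_row.items()
               if old_row.get(col) != val}
    if not changed:
        return None, []
    assignments = ", ".join(f"{col} = ?" for col in changed)
    sql = f"UPDATE {table} SET {assignments} WHERE {key_column} = ?"
    return sql, list(changed.values()) + [key]
```

So editing only a page's body yields `UPDATE pages SET body = ? WHERE id = ?` rather than a statement writing back every column.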
At work I routinely have to analyse output from test suites numbering millions of tests in total, which exercise thousands of distinct components making up the Bus software in dozens of distinct hardware/operating system environments. I have to take the vast swathes of passes and failures and make sense of them. I have to find the common features which might have given rise to multiple simultaneous regressions. Were all these tests relating to SQL Server 2008? Were all those tests invoking TCP/IP calls? Did these other ones involve character encoding problems, or line-ending issues, or a specific machine whose hard drive has crashed its cog and needs junking? What new code has gone in? What parts of the product does that code affect? I want to start from a network of 100% passes and figure out the simplest set of explanations which give rise to all of my observed failures. I want to automate this process.
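"The simplest set of explanations covering all observed failures" is, squinted at, the set cover problem, which is NP-hard - but a greedy approximation is easy to sketch. Assume (hypothetically) each failing test comes tagged with a set of features such as its database, protocol or host machine:

```python
def explain_failures(failures):
    """Greedy set-cover sketch: repeatedly pick the feature shared by the
    most still-unexplained failures, until every failure is accounted
    for by at least one chosen feature.

    failures: dict mapping test name to the set of that test's features.
    Returns the chosen features. Greedy cover only approximates the
    truly smallest explanation, but the exact problem is NP-hard.
    """
    uncovered = set(failures)
    chosen = []
    while uncovered:
        # Count how many unexplained failures each feature touches.
        counts = {}
        for test in uncovered:
            for feature in failures[test]:
                counts[feature] = counts.get(feature, 0) + 1
        if not counts:
            break  # some failure has no features at all; give up on it
        best = max(counts, key=counts.get)
        chosen.append(best)
        uncovered = {t for t in uncovered if best not in failures[t]}
    return chosen
```

A real version would also want to weigh explanations against each other - "new code went into component X" should probably beat a coincidental pile of per-machine excuses.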
I'm educating myself on formal language theory. I'm learning about taking a start symbol S and applying a finite collection of production rules to turn it into a string of terminals.
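Derivation in a grammar is the same shape as everything above: start from S, apply rules until you reach a terminal string. A toy sketch, using the textbook grammar for a^n b^n (the grammar and depth cap here are my own assumptions, purely for illustration):

```python
import random

# Toy context-free grammar: S is the start symbol,
# lowercase letters are terminals. Generates a^n b^n for n >= 1.
GRAMMAR = {
    "S": [["a", "S", "b"], ["a", "b"]],
}

def derive(symbol="S", depth=0, max_depth=5):
    """Expand a symbol by repeatedly applying production rules until
    only terminal characters remain."""
    if symbol not in GRAMMAR:  # terminal: nothing left to expand
        return [symbol]
    rules = GRAMMAR[symbol]
    # Force termination: past max_depth, always take the base rule.
    rule = rules[-1] if depth >= max_depth else random.choice(rules)
    out = []
    for sym in rule:
        out.extend(derive(sym, depth + 1, max_depth))
    return out
```

Every run produces some string "aabb", "aaabbb", and so on - a sequence of rule applications turning S into a terminal string.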
I'm taking a structure A, and another structure B, and a set of costed mutator functions F, and trying to figure out the simplest, best, fastest or cheapest way to apply those functions to turn A into B.
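The classic fully solved instance of this abstract problem is string edit distance, where A and B are strings and F is {insert a character, delete a character, substitute a character}, each with a cost. Dynamic programming finds the cheapest transformation; this sketch is the standard Levenshtein algorithm, with the per-operation costs exposed as parameters:

```python
def edit_distance(a, b, ins=1, dele=1, sub=1):
    """Cheapest total cost of turning string a into string b using
    insertions, deletions and substitutions, each with its own cost.
    Standard dynamic programming over prefixes.
    """
    m, n = len(a), len(b)
    # d[i][j] = cheapest cost of turning a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele          # delete everything
    for j in range(1, n + 1):
        d[0][j] = j * ins           # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,       # delete a[i-1]
                          d[i][j - 1] + ins,        # insert b[j-1]
                          d[i - 1][j - 1] + cost)   # match or substitute
    return d[m][n]
```

The trouble with the problems above is precisely that their F is richer than this: renames, chunk-level binary diffs, structural moves - and in the backup case, even *inspecting* B has a cost.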
Everywhere I look.
In everything I do in life.
I don't know if this is just my mighty brain's stupendous innate powers of pattern recognition on the blink, or whether there's deeper meaning here. I do know that these problems are all wildly different and that the variation in F particularly is what makes each of these tasks trivial or monumental, but I don't know how similar all the Fs are or how much of this much greater meta-problem has already been solved. This could be a fertile ground for research. Or, it may be old hat. Or, the metaphor may be too loose to be useful.
In university I took a course in Optimisation. That wasn't this. There were passing resemblances, though.