git-annex awesomeness

So a few days ago, there was this:

21:24 < wouter> hum.
21:24 < wouter> Anyone know of a tool to manage scanned documents?
21:25 < wouter> the idea being that I can tell this tool "here's a bunch of newly-scanned documents", and it will upload them to a server
21:25 < wouter> and it should allow me to easily find a specific file later on
21:25 < wouter> and I'd also like version control there
21:26 < wouter> and I do _not_ want to download the entire repository of scanned documents on my laptop (that's why I have a server)
21:26 < wouter> and perhaps I'd also like a pony to go with that.
21:29 < wouter> oh, yes, and I do _not_ want a webbrowser as the primary interface (that might be okay to look things up, but not to store stuff)

The answer, as it turned out, was git-annex: a tool to manage files with git, without checking them into git.

What, I hear you say? Yes, that sounds a little weird, doesn't it?

Perhaps it's easiest to explain with a little example.

$ git annex add 2011-11-07-belgacom.pdf
$ ls -l 2011-11-07-belgacom.pdf
lrwxrwxrwx 1 wouter wouter 191 nov  7 14:46 2011-11-07-belgacom.pdf -> ../.git/annex/objects/xx/3F/SHA256-s1537334--c44e1a057e247bfe7c196ac146c8a0ca32096c0b10df6c18fd3f1c2e99ecddbf/SHA256-s1537334--c44e1a057e247bfe7c196ac146c8a0ca32096c0b10df6c18fd3f1c2e99ecddbf
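
That link target looks scary, but the long name is just a key built from the file's contents: the name of the hash, an "s" followed by the size in bytes, and the SHA-256 checksum. As a rough sketch (this is not git-annex's own code; annex_key is a made-up helper name, and sha256sum from GNU coreutils is assumed to be installed), the same key can be rebuilt by hand:

```shell
# Hypothetical helper: rebuild a git-annex style SHA256 key for a file.
# Key format as seen in the listing above: SHA256-s<size in bytes>--<sha256 hex>
annex_key() {
    size=$(wc -c < "$1" | tr -d ' ')          # byte count, stray whitespace stripped
    hash=$(sha256sum "$1" | cut -d ' ' -f 1)  # keep only the hex digest
    printf 'SHA256-s%s--%s\n' "$size" "$hash"
}

# Usage: annex_key 2011-11-07-belgacom.pdf
```

Since the key depends only on the content, two files with identical content end up with the same key, whatever they are called.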

The file is now known to git-annex, and I can have it do all kinds of useful things with the file:

$ git annex drop 2011-11-07-belgacom.pdf
drop 2011-11-07-belgacom.pdf (unsafe)
  Could only verify the existence of 0 out of 1 necessary copies

  No other repository is known to contain the file.

  (Use --force to override this check, or adjust annex.numcopies.)
failed
git-annex: drop: 1 failed

Oops, we hadn't copied it to anywhere else yet. We don't want to lose our data!
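
The "server" used in the next command is just an ordinary git remote. For completeness, here is a sketch of how such a remote might be registered. The URL is a made-up placeholder, and it assumes the server side already holds a clone of the repository with git-annex set up. The sketch runs in a scratch repository so it is safe to try:

```shell
# Sketch only: register a git remote named "server" in a scratch repository.
# In real use, run just the `git remote add` line inside the annex repository.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Placeholder URL, not the real server from this post.
git remote add server ssh://server.example.com/~/scans

# Show where "server" points.
git remote get-url server
```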

$ git annex move --to server 2011-11-07-belgacom.pdf
move 2011-11-07-belgacom.pdf (checking server...) (to server...)
SHA256-s1537334--c44e1a057e247bfe7c196ac146c8a0ca32096c0b10df6c18fd3f1c2e99ecddbf
     1537334 100%    9.22MB/s    0:00:00 (xfer#1, to-check=0/1)

sent 30 bytes  received 1537668 bytes  1025132.00 bytes/sec
total size is 1537334  speedup is 1.00
ok
$

What just happened? git-annex copied the file to a git remote called "server", and then dropped it from my local repository. It's no longer here! The symlink in my local directory is now a dead link; I cannot open the file anymore.

But, no worries! If we ever need it again, it's just a single command away.

$ git annex get 2011-11-07-belgacom.pdf
get 2011-11-07-belgacom.pdf (from server...) 
SHA256-s1537334--c44e1a057e247bfe7c196ac146c8a0ca32096c0b10df6c18fd3f1c2e99ecddbf
     1537334 100%    9.58MB/s    0:00:00 (xfer#1, to-check=0/1)

sent 30 bytes  received 1537668 bytes  3075396.00 bytes/sec
total size is 1537334  speedup is 1.00
ok
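
Because a file whose content lives elsewhere is simply a dangling symlink, plain shell is enough to see what is missing on the laptop. A minimal sketch, with missing_here being a hypothetical helper name (git-annex has its own commands for this, such as "git annex whereis"):

```shell
# Hypothetical helper: print the files whose content is not on this machine.
# An annexed file is a symlink; if its target is absent, the link dangles.
missing_here() {
    for f in "$@"; do
        # -L: the name is a symlink; -e follows the link and fails if the target is gone
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage: missing_here *.pdf
```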

This allows me to save space on my laptop while not having to care where the files are -- they're just there. And it gets more awesome if you know that git-annex can store multiple copies of each file (so you have automatic distributed backups, as with regular git), and can enforce a minimum number of copies through the annex.numcopies setting. Also, git-annex is not limited to ordinary git remotes -- through what it calls special remotes, you can store your data in Amazon S3, or on an encrypted USB drive, or whatever, and have git-annex manage it transparently for you.
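
The minimum number of copies is the annex.numcopies setting that the failed drop above complained about. A small sketch of setting it through plain git configuration follows. The value 2 is only an example, and newer git-annex versions may offer other ways to set it. Again this uses a scratch repository:

```shell
# Sketch in a scratch repository; in real use, run the `git config` line
# inside the annex repository itself.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ask git-annex to insist on at least two copies before dropping anything.
git config annex.numcopies 2

# Read the setting back.
git config --get annex.numcopies
```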

I said this already on IRC, but: Joey, I owe you beer.