27 October 2014
Versioning Data Revisited

As a developer in these modern times, you’re probably quite familiar with versioning. You’ve either seen it before or, better yet, are using it yourself now to do your work.

Versions of a system are created for release, versions of code and documents are created to safeguard progress, and more than likely you interact with version control systems such as git or subversion on a daily basis. Even Google Drive lets you manage versions of your writing and calculations. If you aren’t using versioning to keep the progression of changes in your digital work safe, start now. You never know when - a day, a week, a month, or even a year from now - you’ll need to look back and see what you had.

Note I’m not talking about backups here - though you should certainly back up your work in case of catastrophe. Backups are generally kicked off at regular intervals, irrespective of what state your work may be in. Versioning is about intentionally committing your work at a logical point in its progress. When you recover a backup, you get back the state of the world at the time the snapshot was taken, complete with any inconsistencies that were present in whatever you were working on at the time. When you go back to a previous version, you get a logically consistent instance of your work.

Looking at it from a high level, you assign a version to a collection of your interrelated work, which, for lack of a better description, I’ll consider to be a document. This sort of versioning has been done for years; when you change a significant part of your document - how big or how small being at your discretion - you save it off as a new version. If the document represents a large amount of effort by many people, there may be hundreds or even thousands of versions recorded.

A Network of Formats

A while back, many developers who wrote programs that created files used versioning in a slightly different way, but for much the same purpose. The idea was that program A would take in data and output data in some file in format A1. Then program B would read the A1 data from that file and output data in another format B1. And so on, down the line for programs C, D, and E. You get the picture: programs chained together by reading the files output from the previous stage and creating the files that were input for the next stage. These didn’t have to be straight lines of processing either. Many programs might be dependent on the output of many others. Webs of interconnected complexity emerged.

It was basically a simple world. Consider program M

write: (file) ->  # format M1
  file.writeInt @a
  file.writeString @b

and program N

read: (file) ->   # format M1
  @a = file.readInt()
  @b = file.readString()

These systems still exist - programs linked by files - so don’t think that this sort of architecture has gone away. It still lives in the computational hearts of government and industry.

But architectures like these had to solve a problem that cropped up as these patterns became more popular. The folks in charge of program X would get a requirement that another program, say Y, needed some data added to the X1 files. That meant a new file format X2 had to be output by X, so there were new programming tasks to be done: the X team needed to make program X output the new format X2, and the Y team needed to change program Y to be able to read X2.

What’s more, any other team using X1 had to switch to X2. If ten programs used X1 data, ten programs had to be changed to use X2. Some clever tricks were tried, of course, like just appending the new data onto the end of the file - a trick that tended to get very painful as formats X3, X4, and X5 inevitably appeared.

Systems still exist that use this trick and others like it. It shouldn’t surprise you that many hacks like this were never taken out because everything just kept working. Programming was harder in those days. And some technical debt is never repaid.

What is Storage?

So, starting from the chain of X-writing-files-for-Y and so on down the line, let’s imagine we connect the end of the loop to the start. Now imagine that we tighten the loop by taking out programs, one at a time, until there’s just one program and one file left. Program X writes files for reading by program X. Effectively the file becomes storage for program X.

Now certainly this is a silly way to approach storage, but it’s completely viable. When a program stores its work in a file so it may later be loaded from that file, it amounts to the same thing as storage - just reformulated in terms of a file with a given format.

If we then imagine a file that was written by the program ten years ago, while program X underwent perhaps several rounds of redesign - literal upheavals of reworking the processing - will the ten-year-old file still be readable? The answer, if everyone was careful, should be “Yes, of course.”

Backward Compatible Formats

At some point, someone got the bright idea that it’d be a lot easier to bottle up the historical reading and writing of files in format F into a library that any program needing F could use. This made file access backward compatible with all versions of the F format: a program could read any version of an F file and use it. Inside the library, the data in older versions of F was converted to the new data needed to move forward. And if a program didn’t need the new data, and the old data was still present in the new format, the logic in the program didn’t have to change at all.
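
As a rough sketch of what such a library might look like - using the same hypothetical file helpers as programs M and N above, assuming the writer puts a version number at the front of the file, and inventing a format F with a name and a count - the reader dispatches on the stored version and always hands callers data in the current shape:

readF = (file) ->    # reads any version of format F
  version = file.readInt()
  switch version
    when 1
      # version 1 had only a name; supply a default for the count added later
      name: file.readString()
      count: 0
    when 2
      name: file.readString()
      count: file.readInt()
    else
      throw new Error "unknown format F version: #{version}"

A matching writeF would always emit the newest version, so older formats survive only on disk and in this one reader.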

The same trick was extended to versioning objects. We can have a packer:

currentObjVersion: 5

pack: ->
  version: currentObjVersion
  a: @a
  b: @b

and an unpacker:

unpack: (obj) ->
  switch obj.version
    when 1
      ...
    when 2
      ...
    ...
    when 5
      @a = obj.a
      @b = obj.b

where the cases in the unpacker are all adjusted to advance their data up to the current version. Indeed, if we want to go a step further, we can enhance the packer to be able to pack old versions as well:

pack: (version = currentObjVersion) ->
  switch version
    when 1
      ...
    when 2
      ...
    ...
    when 5
      a: @a
      b: @b

which parallels the unpacker. Note that as versions increase the older cases stay the same - this preservation is wonderful for testing. However, realize that writing old versions can be lossy, and new data may need to be properly recomposed into old data to get accurate old formats.
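
As a hypothetical illustration of that recomposition: suppose the current version keeps first and last names separately, while version 1 stored a single combined name. Packing version 1 has to rebuild the combined string, and the split is lost on any version 1 reader:

pack: (version = 2) ->
  switch version
    when 1
      version: 1
      name: "#{@first} #{@last}"    # recompose the new fields into the old one; lossy
    when 2
      version: 2
      first: @first
      last: @last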

Lazy Cat Skinning

What we just wrote was the lazy approach to skinning the change-cat of our data. By building in backward-compatible unpackers of old formats, we defer the modernizing translation of our data to use-time, when it is requested, rather than blotting out the sun with a one-time conversion. Versioning is what makes that possible.

Consider a mechanism that doesn’t do that - take database migrations, for instance. If you’re a Rails fan, you know that relational database migrations work in lockstep with versions of the code. If you add a field to a relational table, or change how a value is stored, you write the migration that changes the database schema and then build a rake task that reads all of the affected data and writes the tables out in the new format. That’s very typical in Rails development, and is practiced by developers near and far.

Note that versioning is not precluded here, but until the version of the data is determined, a record is like the contents of a pants pocket: it could have anything in it. The version determines what the contents may be, and until that’s known, relational methods are intractable. Also, in order to read other versions, you need to understand the alternate formats. Since the migrations are effectively outside the space of the running program, very special code is needed to handle versioning, and the variations are generally only along the lines of polymorphic entries, which technically replace versioning with indirection.

Versioning internalizes what would be Rails’ schema migrations and transformative rake tasks, doing them actively, on demand. It also presents the possibility of doing more: knowing when the file was logically preserved can trigger other processing beyond unpacking.
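
To make the contrast concrete, here is a minimal sketch of the eager equivalent - the one-time pass a migration would force - where packedDocs, Thing, and save are placeholders for whatever storage and object types you actually have:

migrateAll = (packedDocs) ->
  for raw in packedDocs
    thing = new Thing()    # Thing carries the pack/unpack pair shown above
    thing.unpack raw       # unpacking upgrades the record to the current version
    save thing.pack()      # write it back out in the newest format

With lazy versioning this pass never has to run; unpack does the same upgrade one record at a time, whenever a record is actually read.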

Efficiency vs. Understanding

While a ‘switch’ is probably the most efficient way to do the unpacking, it may not be the best choice if understanding the progression of changes is important. Given that many programmers are just working fast to get things done, full understanding is not generally sought - but I contend that this cavalier attitude can cause whole swaths of bugs to spring into being. There is no substitute for understanding.

So consider a set of changes where the length of an object started out in yards in version 1:

unpack: (obj) ->
  switch obj.version
    when 1
      @length = obj.length    # yards

switched to feet in version 2:

unpack: (obj) ->
  switch obj.version
    when 1
      @length = obj.length*3    # yards to feet
    when 2
      @length = obj.length    # feet

and to inches in version 3:

unpack: (obj) ->
  switch obj.version
    when 1
      @length = obj.length*36    # yards to inches
    when 2
      @length = obj.length*12    # feet to inches
    when 3
      @length = obj.length    # inches

versus the same progression done with straight conditionals:

unpack: (obj) ->
  if obj.version == 1
    @length = obj.length    # yards

then:

unpack: (obj) ->
  if obj.version == 1
    obj.length = obj.length*3    # yards to feet
    obj.version = 2
  if obj.version == 2
    @length = obj.length    # feet

and finally:

unpack: (obj) ->
  if obj.version == 1
    obj.length = obj.length*3    # yards to feet
    obj.version = 2
  if obj.version == 2
    obj.length = obj.length*12    # feet to inches
    obj.version = 3
  if obj.version == 3
    @length = obj.length    # inches

The second group costs more in terms of processing, but can provide much more insight into the sometimes step-by-step nature of versioning. Though this is a simple example, in more elaborate situations the approach can be very valuable, and it can also help track down errors in conversions.

Hierarchical Documents

Once objects could be versioned, it stood to reason that sub-objects could be versioned as well, separately from their parents. This is when versioning started to get really powerful.

pack: ->
  version: 1
  foo: packFoo @foo
  bar: packBar @bar

unpack: (obj) ->
  switch obj.version
    when 1
      @foo = unpackFoo(obj.foo)
      @bar = unpackBar(obj.bar)

where foo and bar use their own specialized pack and unpack functions. These objects can have their own versioning:

packFoo: (foo) ->
  version: 3
  mumble: foo.mumble
  bast: foo.bast

unpackFoo: (foo) ->
  switch foo.version
    when 1
      ...
    when 2
      ...
    when 3
      mumble: foo.mumble
      bast: foo.bast

or not:

packBar: (bar) ->
  bally: bar.bally
  mump: bar.mump

unpackBar: (bar) ->
  bally: bar.bally
  mump: bar.mump

While versioning can be set aside, if the structure could change, then without it you may be condemning a large set of documents to immediate rather than lazy conversion - if you can even do it immediately, as noted in the discussion about migrations above. Some old documents might even be archived and out of your reach. Yes, versioning is not only a path to backward compatibility; it is the simple way to do it lazily, and it lets you version data incrementally. This allows you to grow what’s in storage organically, making the whole the sum of its parts independent of the age of those parts.

It should be clear that the composition of versioned objects may change radically - version 2 of the obj packer could leave out the bar object completely. Or, just as radical, you may have older objects with a version 1 obj and version 1 or version 2 foo objects. All will unpack just fine.
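
If bar later needs a structural change of its own, the unversioned records already in storage don’t have to be rewritten either. Here is a sketch of retrofitting a version onto bar, treating a record with no version field as the original shape, and inventing a hypothetical change where the single mump becomes a list of mumps:

packBar: (bar) ->
  version: 2
  bally: bar.bally
  mumps: bar.mumps

unpackBar: (bar) ->
  switch bar.version ? 1    # records without a version are the old, unversioned shape
    when 1
      bally: bar.bally
      mumps: [bar.mump]     # promote the single mump into a list
    when 2
      bally: bar.bally
      mumps: bar.mumps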

Post-unpacking Processing

On occasion, objects may need a little adjustment after unpacking. If quite radical changes have gone on that require redistribution of an object’s contents, you have to do what you have to do. For instance:

unpack: (obj) ->
  switch obj.version
    when 1
      @foo = unpackFoo(obj.foo)
      @bar = unpackBar(obj.bar)
  if @foo.ballyShouldReplaceBarBally(@bar)
    @bar.bally = @foo.bally

The conditional serves to patch up anything that versioning within the two objects themselves can’t cover. Here the logic is exposed inline, but it could be put in a reconcile method and called instead. I recommend keeping it close to the unpacker either way, since there’s a higher chance of confusion if it’s relatively hidden.

Of course, there’s a price for this added complexity, and without tests it can lead to versioned data that is schematically correct yet logically meaningless. The older a piece of data is, the more prone it is to misinterpretation - no matter how careful you are. An ancient language without active speakers might be understood, but you will never be able to pronounce its words correctly, and the subtleties of conversations in that language may be important, but lost forever. Upgrading data from old to new versions can suffer the same losses.

Even with the best post-processing, sometimes data needed in the modern version is missing from old versions and you have to ask the user what to do. Though this may be somewhat painful and error prone, throwing up your hands and asking questions may be necessary. At least versioning gives you the opportunity to ask rather than make a possibly wrong decision in the code.
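
One way to surface the question instead of guessing - a sketch, with a hypothetical material field that version 1 never stored - is to have unpack record what it couldn’t recover so the UI can ask for it later:

unpack: (obj) ->
  @missing = []
  switch obj.version
    when 1
      @length = obj.length*36    # yards to inches
      @material = null
      @missing.push 'material'   # version 1 never stored a material
    when 2
      @length = obj.length       # inches
      @material = obj.material

After unpacking, anything left in @missing becomes a prompt for the user rather than a silent default.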

Relevance

If you know that I write Meteor applications and use Mongo to store my data, you might get the impression that I’ve been doing this recently. You would be correct. I’ve written hierarchies of versioned objects to files in several other applications over the last few decades, and that solution fits certain classes of document data problems in Meteor.

However, this is all fairly moot if you use Mongo in the traditional Meteor way. In my situation I’ve been storing a working set of data in a single hierarchical Mongo object; the sets are published and subscribed to, but not the inner data. Versioning isn’t free, and packing and unpacking versions may be good for long-term storage but terrible for short-term use.

When a large hierarchical object is intended to be shared by multiple collaborators, it should be unpacked and reconstituted in a more traditional Meteor data breakdown, so each collaborator can observe changes on individual objects. Once collaboration is complete, the objects can be repacked into a versioned object.
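
Here is a sketch of that round trip, assuming hypothetical Documents and WorkingParts collections and a Doc object carrying the pack and unpack shown earlier:

checkOut = (docId) ->
  doc = new Doc()
  doc.unpack Documents.findOne(docId).packed
  WorkingParts.insert {docId, kind: 'foo', data: doc.foo}    # collaborators observe these
  WorkingParts.insert {docId, kind: 'bar', data: doc.bar}

checkIn = (docId) ->
  doc = new Doc()
  doc.foo = WorkingParts.findOne(docId: docId, kind: 'foo').data
  doc.bar = WorkingParts.findOne(docId: docId, kind: 'bar').data
  Documents.update docId, $set: packed: doc.pack()
  WorkingParts.remove docId: docId

Each collaborator subscribes to the WorkingParts records for the document, so fine-grained changes propagate; the packed, versioned object only changes at check-in.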

Takeaways