based on untested ideas or techniques
Development log on art.gnome.org(AGO) and a journal about my Summer of Code 2007 experience

Versioning System

Bruno Santos on August 8th, 2007

So this was the subject I’ve spent a big chunk of my day thinking in!

AGO3 ,like its predecessor(AGO2) , is going to allow variations of themes(or backgrounds). But I want to take a step further and have something more complex (and powerful).

In AGO2 every time a user submits a update of a theme, only one version is kept in the system. So no real history is kept. This is one of the things I want to change. Users will have at their disposal all the versions of a theme since it was released(if the owner doesn’t remove).

There will also be the possibility to create a branch of a previous theme. However, we haven’t decided yet how this will work. Will the new branch be defined as a new tree of versioning or stays connected to the current tree? Drom suggested to allow users to branch themes from other users, and if so, what about the licenses? Some may not allow those actions!
As you can see there are some decisions yet to be made.

Now, how am I thinking in implementing this?

In paper this can be drawn as a tree of nodes. Which quite different of the representation I have in an DB (flat tables). So I have to find a solution to store hierarchical data in a Database.

The bad way

The easier way would be to use an adjacency list model. With this approach I only need to keep the ‘parent’ of the current node. Whenever I need to find the ’sons’ of a node a single query to the DB is enough. The disadvantages of this solutions is in having to get all the nodes of the tree. For each node I need to do a query for its ’sons’. For each single ’son’ I’ll do a query for their ’sons’ and so on… This nice recursion is way to heavy in the DB! This definitely isn’t a solution. At least not a good one!

The Cool Way

I’ll be implementing an Modified Preorder Tree Transversal Nested Sets.
You can find more information on that subject in the nice article on Storing Hierarchical Data in a Database.
With this method, each node will have two values associated to it, left and right. The way those values are set gives certain properties that allows to only need one DB query to get all the tree nodes in the proper order.
This isn’t that easy to explain in few words, so if you are interested I recommend you to read the article I’ve mentioned

One of those properties is: If you have two nodes, A and B, and B’s left and right values are between A’s left and right values, it means that B is somehow descendent of A. Thus, if I want to retrieve all the descendants of a node I only need one query to get all the nodes where the left value is between my node left and right value (something like 'Select * from table_name where lft between 1 AND 11')

I’m introducing a variation in the method and keeping a field pointing to the ‘parent’ node. I will be needing to get the immediate ’sons’ of an artwork, or by other words, it’s variations. And if I don’t keep the ‘parent’ field, the only way I can get them is by getting the whole tree bellow the ‘parent’ node I want, and then process it to retrieve the nodes I really want.

I haven’t decided yet how I’m going to manage the deletion of artwork in the versioning system. But the I’m thinking in keeping the version node flagged as deleted if the corresponding artwork was deleted. Since I pictured a few cases in which I don’t know how to handle the removal of nodes, this seems to me the best solution for now. Later if I want I can implement a purging method for the flagged nodes.

4 Responses to “Versioning System”

What you describe as “Modified Preorder Tree Transversal” sounds lot like a nested set. The problem with nested sets is that they do not scale, because inserts are expensive. If you expect more than, say 100.000 records in that table, I am pretty sure that this model is NOT suitable for a.g.o.

One solution includes a nested set with “insert holes”, where the neighbors are not increased by +1 but by a value of (for example) +1000, such that you do not have to increase all node values on each and every insert operation.

But IMO a much better model would be the “materialized path”. This also allows for branching very easily.

-Samuel

knipknap:
Well, I’m not expecting a user to have 100.000 works in ago :p

Nonetheless, I’ll give a look in the “materialized path”.

Thanks!

Well, unless you are planning to create one table per user, that would be 100.000 for *all* users.

Of course, you could create multiple nested sets within one table, but then your database has to support constraints over more than one column. You could also NOT use constraints, but that would be a hack.

Provided you are using MySQL, I am pretty sure that using multiple Nested Sets in one table is bound to get you into trouble sooner or later.

Anyway, the nested set would already no longer work as soon as you want to be able to delete specific single versions or branches from the history. Believe me, the MP algorithm is definitely what you want.

Look for the “tefinch_message” table in this file:

http://cvs.gna.org/cvsweb/tefinch/mysql_matpath.sql?rev=1.10;cvsroot=tefinch

for an example MySQL model. Beware, there are some tricky issues to consider in MySQL (like, MySQL strips trailing spaces even from varbinary fields). The corresponding code that loads the database is here:

http://cvs.gna.org/cvsweb/tefinch/src/services/forumdb.class.php?rev=1.8;cvsroot=tefinch

Now that I think about it, did you consider using a simple filesystem structure instead? Remember that reading a BLOB from a database is probably 10 times slower than reading a file directly from the filesystem.

I already think I’m making this way too complex to what it is going to be used.

Plus I’ve already coded the MP model and I think it does the job well.

The idea of the Versioning system is just to maintain all the versions users submit related. And probably do some neat map tree way of showing that evolution.

Wanna share (your opinion)?