INNOVATOR: September 2009

Tuesday, September 15, 2009

How do you store a tree in a database table

Every once and a while I will need to need to store a hierarchical structure in a database table. Recently, I needed to do this for a Facebook.com application I am building. In this particular application, I needed to categorize pieces of content in a hierarchical form, which a users can access by drilling down one level at a time. This blog details the solution I implemented on how to store the data and access it. Aren’t there are other articles on this? Yes, but their approach is typically very trivial and has issues when building out the application.

Included at the bottom of this blog entry is a zip of a sample application that demonstrates how to build product explorer using the technique described.

The problem statement!

Below is the hierarchy of data that users can navigate through. Its a standard data structure that one would see in applications like; shopping carts, employee databases, and content management system.

Books
- Non-fiction
  - Politics
  - Sports
Electronics
- TV
- Speakers
- Computers
Cars
- Dodge
  - Magnum
  - Nitro
- Chevy

Each category can have sub categories and/or products, take for instance the sample above; “Books” has a sub category called “Non-fiction” which has two sub categories. Not shown in the data above is that each category has a set products it is holding, for instance:

Politics
- “Politics for Dummies”
- “Student Atlas of World Politics”

Magnum
- 2007 STX mint condition

How do I store the information in a database table that has just rows and columns?

I need to be able to perform the following operations; insert, update, delete, and select.
The solution should be easy to create. Little code that is easy to understand.
The solution needs to be easily stored in a database.
Selection of items needs to be possible with SQL.
Inserting and updating a record should affect no other data in the database.
Delete should only affect the deleted item and its descendants.

A non optimal approach?

I have read a few articles on approaches to solve this problem and each one has a number of drawbacks that makes their approach hard to deal with when actually building out the application.

One approach is to use the name of the parent as the column in the database.

parent	name
	Books
	Electronics
	Cars
Cars	Dodge
Books	Non-fiction
…

This seems simple, logical and will work but once you start to build the application you will notice a major short coming. You loose any ancestorial information other than the parent. This may not seem like a big deal - I know my parent, so I can find their parent. True. The real problem comes in the complexity of the code you are going to need; recursion or other to trace one’s roots.

A couple questions you way name want to know quickly and easily that make ancestorial information important. How many products do me or my descendants have? Where would you use this? Display the number of products contained below a category “Books (2)” based on the information above.

Another issues is what happens if the name of the category changes; “Cars” now becomes “Automobiles” because marketing thinks it sounds more sophisticated - meaning - intelligently appealing - I looked it up. This can easily be resolved by using the id of the parent, or updating all items that have a certain parent when the name changes.

id	parent	name
1		Books
2		Electronics
3		Cars
4	3	Dodge
5	1	Non-fiction
…

Above is what the data would look like using the id of the parent in the parent column. This solves the name change issue but still has the ancestry problem of the above solution.

There are also other solutions that store the location in the tree, who is to the left and to the right of me. But this solution again is difficult to manage and code against when developing an application.

These solutions fail a few of the solutions requirements.

Fails - The solution should be easy to create. Little code that is easy to understand. You will need to use recursion which is slow and can be complicated.
Fails - Inserting and updating a record should affect no other data in the database. When using the left and right approach you will need to adjust the data for other rows of the database outside the affected rows.
Fails - Delete should only affect the deleted row and its descendants. As above you will need to adjust the left and right when changing the structure of the tree

Resources on this approach:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
http://www.sitepoint.com/article/hierarchical-data-database

So what is the best way to do this?

In a past life, I developed network management software, which had many complicated problems that need solutions. TCP, UDP, SMNP, MIB were all acronyms that you needed to know in order to get the job done.

Most of the applications displayed data queried from a network device. Accessing the information was done by requesting data based on an OID. OIDs are simply a string in dot-notation for a specific piece of information. To pull the “device name” you would ask for OID “1.3.6.1.2.1.1.5″.

So what does this have to do with this article? Everything. The OID was the way to flatten the data which is hierarchical. How would you use this in our category sample? The picture below shows the data displayed with the parent being the “OID” of the parent node.

The root items “Books”, “Cars”, and “Electronics” have a parent which is empty to denote they are a root. All other items have parent value equal to id of all parent’s separated by a “.”.

Lets see how this solution works against the list of requirements:

I need to be able to perform the following operations; insert, update, delete, and select.
- Included in the sample is code to perform all of the functionality that is needed to; insert, update, delete and select the data.
The solution should be easy to create. Little code that is easy to understand.
- Solution is very easy; each operation is no more than two database operations and contains no recursion or rebuilding of the database.
The solution needs to be easily stored in a database.
- The parent information is stored in a single column in the database as alphanumeric characters.
Selection of items needs to be possible with SQL.
- Selecting a category or product is done with a simple equality operator. Selection of descendants is done in MySQL with the “LIKE” operator.
Inserting and updating a record should affect no other data in the database.
- In the sample, look at the “insert.php” and “update.php” to see how to perform this functions.
Delete should only affect the deleted row and its descendants.
- In the sample, look at the “delete.php” file to see how to delete both a category and a leaf (product) from the database

A couple questions that you might have:

Is this hard to work with?

Actually it is very easy. In the sample application, I have functionality for adding, updating, deleting and selecting categories and products as well as displaying a bread crumb of the ancestors (This could also be used to display a breadcrumb trail and make them links in any website).
- You just need this parent’s database record to create the child record.
- Can delete all descendants easily.
- Updating the record requires just the id of the element.
- Less columns using other left and right techniques

How do I create the parent column?

The parent id is created using the following code:
$parentPrefix = $currentCategoryRecord[”parent”];
if ($parentPrefix != null){
$parentPrefix .= $currentCategoryRecord[”id”].”.”;
}else{
$parentPrefix = $currentCategoryRecord[”id”].”.”;
}
This code creates the parent prefix for the currently displayed category and is need only to query descendants.

How do you find all the categories in a certain category?

“SELECT * FROM product_table where parent=’”++”‘” in the data pictured above, if I want to display the categories TVs the select statment would look like “SELECT * FROM product_table where type=1 AND parent=’11.13.’

Why the “.” at the end of the parent?

This makes it possible to distinguish between a category that starts with “1″ and a category that starts “11″. Not have the ending “.” means a startswith(’1′) or “WHERE parent LIKE ‘1%’ will find both “1″ and “11″ to be the ancestor.

Why not just combine the id and parent fields?

By having two separate fields the data can be inserted into the database using a single statement. If they were a single field you would need to find the highest numbered product or category already entered at that level and increment it by one before inserting the new record.
By having two fields, the “id” column can be a simple auto incrementing integer which can be joinde with other tables of data.

How do you tell what level a node is at?

Using the PHP split function you can just capture the level:
$level = count(split($parent) ) - 1; In the sample I display a breadcrumb trail for the currently selected category. It is very simple use the split function and select all the ancestors from the database based on the id.

Why is there a type column?

In my sample application, I am storing both categories (Tree Nodes) and products (Leafs). The type field differentiates the tree nodes from the leafs. In another application is may demote folders and files.

How do I find how many products a node or its descendants have?

Determining how many products are under a node is performed by using a single SQL query on the database. “SELECT COUNT(id) as count from product_table where parent LIKE ‘%’.

Isn’t the “LIKE” operator a performance hog?

Performance is something that may take a slight hit, but only slightly. I did a bunch of research on this and found a good resource on using the “LIKE” operator. Click here to read a write up by John Nelsonhttp://myitforum.com/cs2/blogs/jnelson/archive/2007/11/16/108354.aspx
You also reduce the number of queries need overall and you never need to rebuild the tree if the structure changes.

Content Copied from http://www.rockstarapps.com/wordpress/?p=82

Storing Hierarchical Data in a Database

Storing trees is a common problem, with multiple solutions. There are two major approaches: the adjacency list model, and the modified preorder tree traversal algorithm.

In this article, we'll explore these two methods of saving hierarchical data. I'll use the tree from a fictional online food store as an example. This food store organizes its food by category, by colour and by type. The tree looks like this:

This article contains a number of code examples that show how to save and retrieve data. Because I use that language myself, and many other people use or know that language too, I chose to write the examples in PHP. You can probably easily translate them to your own language of choice.

The Adjacency List Model

The first, and most elegant, approach we'll try is called the ‘adjacency list model' or the ‘recursion method'. It's an elegant approach because you'll need just one, simple function to iterate through your tree. In our food store, the table for an adjacency list looks like this:

As you can see, in the adjacency list method, you save the ‘parent' of each node. We can see that ‘Pear' is a child of ‘Green', which is a child of ‘Fruit' and so on. The root node, ‘Food', doesn't have a parent value. For simplicity, I've used the ‘title' value to identify each node. Of course, in a real database, you'd use the numerical id of each node.

Give Me the Tree

Now that we've inserted our tree in the database, it's time to write a display function. This function will have to start at the root node -- the node with no parent -- and should then display all children of that node. For each of these children, the function should retrieve and display all the child nodes of that child. For these children, the function should again display all children, and so on.

As you might have noticed, there's a regular pattern in the description of this function. We can simply write one function, which retrieves the children of a certain parent node. That function should then start another instance of itself for each of these children, to display all their children. This is the recursive mechanism that gives the ‘recursion method' its name.

// $parent is the parent of the children we want to see

// $level is increased when we go deeper into the tree,

//        used to display a nice indented tree

function display_children($parent, $level) {

   // retrieve all children of $parent

   $result = mysql_query('SELECT title FROM tree '.

                          'WHERE parent="'.$parent.'";');



   // display each child

   while ($row = mysql_fetch_array($result)) {

       // indent and display the title of this child

       echo str_repeat('  ',$level).$row['title']."\n";



       // call this function again to display this

       // child's children

       display_children($row['title'], $level+1);

   }

}

?>

To display our whole tree, we'll run the function with an empty string as $parent and $level = 0: display_children('',0); For our food store tree, the function returns:

Food

Fruit

Red

Cherry

Yellow

Banana

Meat

Beef

Pork

Note that if you just want to see a subtree, you can tell the function to start with another node. For example, to display the ‘Fruit' subtree, you would run display_children('Fruit',0);

The Path to a Node

With almost the same function, it's possible to look up the path to a node if you only know the name or id of that node. For instance, the path to ‘Cherry' is ‘Food' > ‘Fruit' > ‘Red'. To get this path, our function will have to start at the deepest level: ‘Cherry'. It then looks up the parent of this node and adds this to the path. In our example, this would be ‘Red'. If we know that ‘Red' is the parent of ‘Cherry', we can calculate the path to ‘Cherry' by using the path to ‘Red'. And that's given by the function we've just used: by recursively looking up parents, we'll get the path to any node in the tree.

// $node is the name of the node we want the path of

function get_path($node) {

   // look up the parent of this node

   $result = mysql_query('SELECT parent FROM tree '.

                          'WHERE title="'.$node.'";');

   $row = mysql_fetch_array($result);



   // save the path in this array

   $path = array();



   // only continue if this $node isn't the root node

   // (that's the node with no parent)

   if ($row['parent']!='') {

       // the last part of the path to $node, is the name

       // of the parent of $node

       $path[] = $row['parent'];



       // we should add the path to the parent of this node

       // to the path

       $path = array_merge(get_path($row['parent']), $path);

   }



   // return the path

   return $path;

}

?>

This function now returns the path to a given node. It returns that path as an array, so to display the path we can use print_r(get_path('Cherry')); If you do this for 'Cherry', you'll see:

Array

(

   [0] => Food

   [1] => Fruit

   [2] => Red

)

Disadvantages

As we've just seen, this is a great method. It's easy to understand, and the code we need is simple, too. What then, are the downsides of the adjacency list model? In most programming languages, it's slow and inefficient. This is mainly caused by the recursion. We need one database query for each node in the tree.

As each query takes some time, this makes the function very slow when dealing with large trees.

The second reason this method isn't that fast, is the programming language you'll probably use. Unlike languages such as Lisp, most languages aren't designed for recursive functions. For each node, the function starts another instance of itself. So, for a tree with four levels, you'll be running four instances of the function at the same time. As each function occupies a slice of memory and takes some time to initiate, recursion is very slow when applied to large trees.

Modified Preorder Tree Traversal

Now, let's have a look at another method for storing trees. Recursion can be slow, so we would rather not use a recursive function. We'd also like to minimize the number of database queries. Preferably, we'd have just one query for each activity.

We'll start by laying out our tree in a horizontal way. Start at the root node (‘Food'), and write a 1 to its left. Follow the tree to ‘Fruit' and write a 2 next to it. In this way, you walk (traverse) along the edges of the tree while writing a number on the left and right side of each node. The last number is written at the right side of the ‘Food' node. In this image, you can see the whole numbered tree, and a few arrows to indicate the numbering order.

We'll call these numbers left and right (e.g. the left value of ‘Food' is 1, the right value is 18). As you can see, these numbers indicate the relationship between each node. Because ‘Red' has the numbers 3 and 6, it is a descendant of the 1-18 ‘Food' node. In the same way, we can say that all nodes with left values greater than 2 and right values less than 11, are descendants of 2-11 ‘Fruit'. The tree structure is now stored in the left and right values. This method of walking around the tree and counting nodes is called the ‘modified preorder tree traversal' algorithm.

Before we continue, let's see how these values look in our table:

Note that the words ‘left' and ‘right' have a special meaning in SQL. Therefore, we'll have to use ‘lft' and ‘rgt' to identify the columns. Also note that we don't really need the ‘parent' column anymore. We now have the lft and rgt values to store the tree structure.

Retrieve the Tree

If you want to display the tree using a table with left and right values, you'll first have to identify the nodes that you want to retrieve. For example, if you want the ‘Fruit' subtree, you'll have to select only the nodes with a left value between 2 and 11. In SQL, that would be:

SELECT * FROM tree WHERE lft BETWEEN 2 AND 11;

That returns:

Well, there it is: a whole tree in one query. To display this tree like we did our recursive function, we'll have to add an ORDER BY clause to this query. If you add and delete rows from your table, your table probably won't be in the right order. We should therefore order the rows by their left value.

SELECT * FROM tree WHERE lft BETWEEN 2 AND 11 ORDER BY lft ASC;

The only problem left is the indentation.

To show the tree structure, children should be indented slightly more than their parent. We can do this by keeping a stack of right values. Each time you start with the children of a node, you add the right value of that node to the stack. You know that all children of that node have a right value that is less than the right value of the parent, so by comparing the right value of the current node with the last right node in the stack, you can see if you're still displaying the children of that parent. When you're finished displaying a node, you remove its right value from the stack. If you count the elements in the stack, you'll get the level of the current node.

function display_tree($root) {

   // retrieve the left and right value of the $root node

   $result = mysql_query('SELECT lft, rgt FROM tree '.

                          'WHERE title="'.$root.'";');

   $row = mysql_fetch_array($result);



   // start with an empty $right stack

   $right = array();



   // now, retrieve all descendants of the $root node

   $result = mysql_query('SELECT title, lft, rgt FROM tree '.

                          'WHERE lft BETWEEN '.$row['lft'].' AND '.

                          $row['rgt'].' ORDER BY lft ASC;');



   // display each row

   while ($row = mysql_fetch_array($result)) {

       // only check stack if there is one

       if (count($right)>0) {

           // check if we should remove a node from the stack

           while ($right[count($right)-1]<$row['rgt']) {

               array_pop($right);

           }

       }



       // display indented node title

       echo str_repeat('  ',count($right)).$row['title']."\n";



       // add this node to the stack

       $right[] = $row['rgt'];

   }

}

?>

If you run this code, you'll get exactly the same tree as with the recursive function discussed above. Our new function will probably be faster: it isn't recursive and it only uses two queries.

The Path to a Node

With this new algorithm, we'll also have to find a new way to get the path to a specific node. To get this path, we'll need a list of all ancestors of that node.

With our new table structure, that really isn't much work. When you look at, for example, the 4-5 ‘Cherry' node, you'll see that the left values of all ancestors are less than 4, while all right values are greater than 5. To get all ancestors, we can use this query:

SELECT title FROM tree WHERE lft < 4 AND rgt > 5 ORDER BY lft ASC;

Note that, just like in our previous query, we have to use an ORDER BY clause to sort the nodes. This query will return:

+-------+

| title |

+-------+

| Food |

| Fruit |

| Red |

+-------+

We now only have to join the rows to get the path to ‘Cherry'.

How Many Descendants

If you give me the left and right values of a node, I can tell you how many descendants it has by using a little math.

As each descendant increments the right value of the node with 2, the number of descendants can be calculated with:

descendants = (right – left - 1) / 2

With this simple formula, I can tell you that the 2-11 ‘Fruit' node has 4 descendant nodes and that the 8-9 ‘Banana' node is just a child, not a parent.

Automating the Tree Traversal

Now that you've seen some of the handy things you can do with this table, it's time to learn how we can automate the creation of this table. While it's a nice exercise the first time and with a small tree, we really need a script that does all this counting and tree walking for us.

Let's write a script that converts an adjacency list to a modified preorder tree traversal table.

function rebuild_tree($parent, $left) {

   // the right value of this node is the left value + 1

   $right = $left+1;



   // get all children of this node

   $result = mysql_query('SELECT title FROM tree '.

                          'WHERE parent="'.$parent.'";');

   while ($row = mysql_fetch_array($result)) {

       // recursive execution of this function for each

       // child of this node

       // $right is the current right value, which is

       // incremented by the rebuild_tree function

       $right = rebuild_tree($row['title'], $right);

   }



   // we've got the left value, and now that we've processed

   // the children of this node we also know the right value

   mysql_query('UPDATE tree SET lft='.$left.', rgt='.

                $right.' WHERE title="'.$parent.'";');



   // return the right value of this node + 1

   return $right+1;

}

?>

This is a recursive function. You should start it with rebuild_tree('Food',1); The function then retrieves all children of the ‘Food' node.

If there are no children, it sets its left and right values. The left value is given, 1, and the right value is the left value plus one. If there are children, this function is repeated and the last right value is returned. That value is then used as the right value of the ‘Food' node.

The recursion makes this a fairly complex function to understand. However, this function achieves the same result we did by hand at the beginning of this section. It walks around the tree, adding one for each node it sees. After you've run this function, you'll see that the left and right values are still the same (a quick check: the right value of the root node should be twice the number of nodes).

Adding a Node

How do we add a node to the tree? There are two approaches: you can keep the parent column in your table and just rerun the rebuild_tree() function -- a simple but not that elegant function; or you can update the left and right values of all nodes at the right side of the new node.

The first option is simple. You use the adjacency list method for updating, and the modified preorder tree traversal algorithm for retrieval. If you want to add a new node, you just add it to the table and set the parent column. Then, you simply rerun the rebuild_tree() function. This is easy, but not very efficient with large trees.

The second way to add, and delete nodes is to update the left and right values of all nodes to the right of the new node. Let's have a look at an example. We want to add a new type of fruit, a ‘Strawberry', as the last node and a child of ‘Red'. First, we'll have to make some space. The right value of ‘Red' should be changed from 6 to 8, the 7-10 ‘Yellow' node should be changed to 9-12 etc. Updating the ‘Red' node means that we'll have to add 2 to all left and right values greater than 5.

We'll use the query:

UPDATE tree SET rgt=rgt+2 WHERE rgt>5;

UPDATE tree SET lft=lft+2 WHERE lft>5;

Now we can add a new node ‘Strawberry' to fill the new space. This node has left 6 and right 7.

INSERT INTO tree SET lft=6, rgt=7, title='Strawberry';

If we run our display_tree() function, we'll see that our new ‘Strawberry' node has been successfully inserted into the tree:

Food

 Fruit

   Red

     Cherry

     Strawberry

   Yellow

     Banana

 Meat

   Beef

   Pork

Disadvantages

At first, the modified preorder tree traversal algorithm seems difficult to understand. It certainly is less simple than the adjacency list method. However, once you're used to the left and right properties, it becomes clear that you can do almost everything with this technique that you could do with the adjacency list method, and that the modified preorder tree traversal algorithm is much faster. Updating the tree takes more queries, which is slower, but retrieving the nodes is achieved with only one query.

Conclusion

You're now familiar with both ways to store trees in a database. While I have a slight preference for the modified preorder tree traversal, in your particular situation the adjacency list method might be better. I'll leave that to your own judgement.

One last note: as I've already said I don't recommend that you use the title of a node to refer to that node. You really should follow the basic rules of database normalization. I didn't use numerical ids because that would make the examples less readable.

Further Reading

More on Trees in SQL by database wizard Joe Celko:
http://searchdatabase.techtarget.com/tip/1,289483,sid13_gci537290,00.html [1]

Two other ways to handle hierarchical data:
http://www.evolt.org/article/Four_ways_to_work_with_hierarchical_data/17/4047/index.html [2]

Xindice, the ‘native XML database':
http://xml.apache.org/xindice/ [3]

An explanation of recursion:
http://www.strath.ac.uk/IT/Docs/Ccourse/subsection3_9_5.html [4]