Skip to content
majidkabir edited this page Oct 22, 2018 · 9 revisions

Company Tree Wiki

Here is a description about implementing this application.

Definition

Based on the definition of this problem, we have a tree structure where each node in the tree has the following info:

  • Id
  • Parent node
  • Root node
  • Height

And we have three basic operations that should be served from an HTTP API, and they need to answer quickly: (I separated it into two APIs because in the definition of the challenge noted 'all children')

  1. Get children of a node - first level children
  2. Get all children of a node - all children in every level
  3. Change parent of a node

And this data should be persisted in a storage.

Since the tree models a company structure, I assume that:

  • We have lots of read data requests
  • We have lots of insert requests but less than read data
  • We have few 'change parent' (changing in company structure is not very frequent)
  • The tree is nearly balanced (the growth of the company is mostly at levels rather than the depth)

Discussion

One of the important decision in solving this challenge is choosing the right data structure for the tree, it's fairly dependent to the usage of data, in this situation we have some challenges. First challenge that affects all other things, is changing the parent of a node, by changing the parent of a node, height of the node and all children of this node in all level will be changed. Second challenge is getting all children of a node quickly. Third challenge is getting the height of each node.

There are different data structure for a tree:

  1. Storing parent of each node
  2. Storing children of each node
  3. Storing path to root for each node

Each of the above structure is suitable for a specific situation, in this document, I explain the pros and cons of each structure in our situation, and whether it can solve our problems or not.

Storing parent of each node

Without storing the height of each node

In this data structure, each node knows about its parent

Pros

  • Changing parent of a node in this structure is very simple by changing the parent property
  • Getting the first level children is simple by one query on parent field
  • Data size of each node is very small

Cons

  • Getting all children is expensive and we should run a recursive query for this
  • Getting height of a node is expensive and we should run a recursive query for this

It's not a good choice because of expensive read. (All children, height)

With storing the height of each node

In this data structure, each node knows about its parent and its height.

Pros

  • Getting the parent is very simple
  • Getting the first level children is simple by one query on parent field
  • Data size of each node is very small
  • Getting height of a node is simple

Cons

  • Getting all children is expensive and we should run a recursive query for this
  • Changing parent of a node is expensive and we should change height of all children of that node (Getting them is expensive)

It's not a good choice because of expensive read for all children and expensive change operation.

Storing the children of each node

In this structure each node know about his first level children. It's completely like storing the parent and just have two difference:

  1. For changing parent of a node we should remove that node from children of current parent node, and add it to children of new parent
  2. Data of all first level children are in the node

It does not have any advantage over storing parent.

Storing path to root for each node

In this structure, each node knows its path from the root to its parent.

Pros

  • Getting the first level children is simple by one query on the path field (all nodes that their path are equal to this node path plus this node)
  • Getting all children is simple by one query on path field(all nodes that their path starts with this node path plus this node)
  • Getting height of a node is simple by counting the nodes in the path field
  • Each data node know about his root node (first node in path)
  • Each data node know about his parent node (last node in path)
  • Changing parent is done by one query (find all children and replace prefix of each node's path)

Cons

  • In near linear large trees path field is very large
  • In large data set changing parent can be expensive

Based on my assumption about this challenge, it's a good structure.

  • Reading the data is quick
  • Inserting the data is quick
  • It's good for near balanced trees
  • We don't have a lot of change queries

There are some other structures for storing a tree, for example 'storing first child and right sibling' but they are not suitable for this situation.

Implementation

For implementation of this project I used Spring Boot with PostgreSQL. Spring Boot because it makes it easy to create stand-alone, production-grade Spring-based Applications, that you can run.
For database part, first I chose MongoDB a NoSQL database, but I found out that the change parent query in this database is not efficient, and after some research I realized that sql databases like MySQL and PostgreSQL do this job better. And because of my background I chose PostgreSQL.
In the implementation of this code, there isn't any challenge but the indexing on the path field of a node which solved by using BRIN index for this field, as it's good for the field where the data has redundant prefix values.

Scaling

This solution is implemented as a single node application and database, but it is fairly easy to scale it out. The web application can be easily replicated by having multiple nodes behind a load balancer. To scale the database we can use replication and data sharding in PostgreSQL. Of course there are things to note in data sharding. For example, using 'path' as the sharding key may result in insert contention because many nodes have a common prefix. Techniques like inverting the path before saving to database can be used to solve this, and this change leads to a change in type of index on this field.