Tosser II
Previously codename "Koria"
An improved data structure library
by Steven Noonan (Tycho) and Alastair Lynn (prophile)
article written by Steven Noonan

Why design a new Tosser library?
The original Tosser library has serious performance drawbacks and design flaws which defeat the purpose of the library. The alternative libraries written by Alastair and myself are designed to avoid the serious potential issues in Tosser as well as improve the performance tenfold.

Alastair and I both wrote our own implementations. Mine is written in hand-optimized C++, utilizing my own red-black tree implementation, and a dynamic stack class; Alastair took an easier (and in theory, better) route to write his. His is a wrapper for the standard C++ library (making use of the std::map and std::list classes).

That's a bit vague. What design flaws?
You will realize the major performance flaws when you see the benchmarks below, but the original Tosser classes has a very dangerous design flaw which could cause a stack overflow, given enough entries. I actually noticed this flaw right away when looking at Tosser's design, but didn't expect to ever actually encounter such a problem. However, with a large enough data set, I did accidentally cause a crash during the benchmarks (see round 2).

What is different in Tosser II?
Both Tosser II implementations contain a new class, a Red-Black tree implementation, a much more efficient data structure than the binary search tree used in the original BTree class written by Chris Delay. I will discuss how the new implementation is more efficient later on.

How about some benchmarks?
Certainly. But before I tell the results, you need to know a couple things. First of all, for rounds one and three, we used a dataset containing 23,823 entries, totaling 2,111KB. For rounds two and four, we used a dataset containing 120,420 entries, totaling 10,272KB. In rounds one and two, the lists are presorted to demonstrate worst-case scenarios in the BTree implementation, and in rounds three and four, the datasets are unsorted (the latter probably being a more realistic example of practical application).

Our dataset was an MD5 hash list (created on a Gentoo Linux box):

	07f7210782c173c0ff8f7901924ff2dd /var/cache/edb/dep/usr/portage/sec-policy/selinux-dante-20050201
	5e4694fb64beaa131a2230d4db490eb0 /var/cache/edb/dep/usr/portage/sec-policy/selinux-dante-20050308
	cbf972c955da12077ba0a49262f2600c /var/cache/edb/dep/usr/portage/sec-policy/selinux-openvpn-20050618
	...

So for each entry, we used the MD5 hash as the item's key and used the file name as the corresponding data. Below are the results we got. The winning libraries are in green text, ties (notice some aren't exact ties, but we figure the difference margin is small enough that a tie is deserved) are in gray text,and the losing classes are in red text.

ROUND 1: 23,823 alphabetically sorted items (2,111KB)

  Tree creation time Search time DArray conversion Tree Deletion
Tosser II (Steven Noonan) 00.080717s 00.000003s 00.006409s 00.006919s
Tosser II (Alastair Lynn) 00.110927s 00.000010s 00.472471s 00.441879s
Tosser (Chris Delay) 11.675687s 00.002066s 01.830826s 01.837730s



ROUND 2: 120,420 alphabetically sorted items (10,272KB)
NOTE: The failure noted in this table is a call stack overflow. Chris' BTree recursively does things, rather than using one call to deal with the action. While this may be a working method for small datasets, it clearly is not substantial here.

  Tree creation time Search time DArray conversion Tree Deletion
Tosser II (Steven) 00.415105s 00.000011s 00.027074s 00.035057s
Tosser II (Alastair) 00.556005s 00.000013s 09.538365s 09.418940s
Tosser I (Chris) *FAILED* *FAILED* *FAILED* *FAILED*



ROUND 3: 23,823 unsorted items (2,111KB)

  Tree creation time Search time DArray conversion Tree Deletion
Tosser II (Steven) 00.090151s 00.000003s 00.007964s 00.008738s
Tosser II (Alastair) 00.117588s 00.000010s 00.472753s 00.441949s
Tosser I (Chris) 00.195050s 00.000003s 01.862227s 01.862373s



ROUND 4: 120,420 unsorted items (10,272KB)
NOTE: Tosser v1.x takes a significant amount of time to convert to a DArray in preparation for deletion. The red-black tree is much faster at deletion because it has no need to convert to a DArray.

  Tree creation time Search time DArray conversion Tree Deletion
Tosser II (Steven) 00.494684s 00.000007s 00.038732s 00.051472s
Tosser II (Alastair) 00.639059s 00.000013s 09.297594s 09.261480s
Tosser I (Chris) 00.595013s 00.000011s 44.762491s 44.698571s

In every case, the red-black tree either wins or (in one case) ties with the binary search tree. It's also notable that the BTree class failed when trying round 2. In all 4 rounds, the red-black tree shows to be the best implementation.

The main reason the red-black tree does so well with presorted data is because the red-black tree maintains the balance of all the nodes. When you have an unbalanced tree, it takes (at worst) O(n) time to find the result. In a red-black tree, it always takes (at worst) O(log n) time to find the result. I will now explain this.

B-trees are almost always fast with randomly-ordered data. Below is a diagram of a well-balanced binary search tree. In such a case, in order to get to the worst-case node 'a', which is furthest from the root node 'e', it takes three hops to reach the desired node. Note that there are eight items in this tree.
A typical, balanced B-tree

Below, a very unbalanced binary search tree, which has to go through the nodes in order (as if it were a linked list) to find the worst-case requested data (which would be node 'e' for this diagram). It takes four hops to get to node 'e', even though there are only five items in the list. The best balance for this tree would be to have node 'c' as the root node, whereas it would take only two hops to get to the worst-case nodes of 'a' or 'e'.
A very badly balanced tree

When you add data in an order, the normal B-tree becomes horribly unbalanced. This is referred to as a degenerate tree. Since red-black trees maintain a balance, there is not as much node-hopping to locate the specified data.