Tosser II
Previously codename "Koria"
An improved data structure library
by Steven Noonan (Tycho) and Alastair Lynn (prophile)
article written by Steven Noonan
Why design a new Tosser library?
The original Tosser library has serious performance drawbacks and design flaws which defeat the purpose
of the library. The alternative libraries written by Alastair and myself are designed to avoid the
serious potential issues in Tosser as well as improve the performance tenfold.
Alastair and I both wrote our own implementations. Mine is written in hand-optimized C++, utilizing
my own red-black tree implementation, and a dynamic stack class; Alastair took an easier (and in
theory, better) route to write his. His is a wrapper for the standard C++ library (making use of the
std::map and std::list classes).
That's a bit vague. What design flaws?
You will realize the major performance flaws when you see the benchmarks below, but
the original Tosser classes has a very dangerous design flaw which could cause a stack
overflow, given enough entries. I actually noticed this flaw right away when looking at
Tosser's design, but didn't expect to ever actually encounter such a problem. However,
with a large enough data set, I did accidentally cause a crash during the benchmarks
(see round 2).
What is different in Tosser II?
Both Tosser II implementations contain a new class, a
Red-Black tree implementation,
a much more efficient data structure than the
binary search tree
used in the original BTree class written by Chris Delay. I will discuss how the
new implementation is more efficient later on.
How about some benchmarks?
Certainly. But before I tell the results, you need to know a couple things. First
of all, for rounds one and three, we used a dataset containing 23,823 entries, totaling
2,111KB. For rounds two and four, we used a dataset containing 120,420 entries,
totaling 10,272KB. In rounds one and two, the lists are presorted to demonstrate
worst-case scenarios in the BTree implementation, and in rounds three and four,
the datasets are unsorted (the latter probably being a more realistic example of
practical application).
Our dataset was an MD5 hash list (created on a Gentoo Linux box):
07f7210782c173c0ff8f7901924ff2dd /var/cache/edb/dep/usr/portage/sec-policy/selinux-dante-20050201 5e4694fb64beaa131a2230d4db490eb0 /var/cache/edb/dep/usr/portage/sec-policy/selinux-dante-20050308 cbf972c955da12077ba0a49262f2600c /var/cache/edb/dep/usr/portage/sec-policy/selinux-openvpn-20050618 ...
So for each entry, we used the MD5 hash as the item's key and used the file name as the corresponding data. Below are the results we got. The winning libraries are in green text, ties (notice some aren't exact ties, but we figure the difference margin is small enough that a tie is deserved) are in gray text,and the losing classes are in red text.
ROUND 1: 23,823 alphabetically sorted items (2,111KB)
| Tree creation time | Search time | DArray conversion | Tree Deletion | |
| Tosser II (Steven Noonan) | 00.080717s | 00.000003s | 00.006409s | 00.006919s |
| Tosser II (Alastair Lynn) | 00.110927s | 00.000010s | 00.472471s | 00.441879s |
| Tosser (Chris Delay) | 11.675687s | 00.002066s | 01.830826s | 01.837730s |
ROUND 2: 120,420 alphabetically sorted items (10,272KB)
NOTE: The failure noted in this table is a call stack overflow.
Chris' BTree recursively does things, rather than using one call to deal with the action.
While this may be a working method for small datasets, it clearly is not substantial
here.
| Tree creation time | Search time | DArray conversion | Tree Deletion | |
| Tosser II (Steven) | 00.415105s | 00.000011s | 00.027074s | 00.035057s |
| Tosser II (Alastair) | 00.556005s | 00.000013s | 09.538365s | 09.418940s |
| Tosser I (Chris) | *FAILED* | *FAILED* | *FAILED* | *FAILED* |
ROUND 3: 23,823 unsorted items (2,111KB)
| Tree creation time | Search time | DArray conversion | Tree Deletion | |
| Tosser II (Steven) | 00.090151s | 00.000003s | 00.007964s | 00.008738s |
| Tosser II (Alastair) | 00.117588s | 00.000010s | 00.472753s | 00.441949s |
| Tosser I (Chris) | 00.195050s | 00.000003s | 01.862227s | 01.862373s |
ROUND 4: 120,420 unsorted items (10,272KB)
NOTE: Tosser v1.x takes a significant amount of time to convert
to a DArray in preparation for deletion. The red-black tree is
much faster at deletion because it has no need to convert to a
DArray.
| Tree creation time | Search time | DArray conversion | Tree Deletion | |
| Tosser II (Steven) | 00.494684s | 00.000007s | 00.038732s | 00.051472s |
| Tosser II (Alastair) | 00.639059s | 00.000013s | 09.297594s | 09.261480s |
| Tosser I (Chris) | 00.595013s | 00.000011s | 44.762491s | 44.698571s |
In every case, the red-black tree either wins or (in one case) ties with the binary search tree. It's also notable that the BTree class failed when trying round 2. In all 4 rounds, the red-black tree shows to be the best implementation.
The main reason the red-black tree does so well with presorted data is because the red-black tree maintains the balance of all the nodes. When you have an unbalanced tree, it takes (at worst) O(n) time to find the result. In a red-black tree, it always takes (at worst) O(log n) time to find the result. I will now explain this.
B-trees are almost always fast with randomly-ordered data. Below is a diagram
of a well-balanced binary search tree. In such a case, in order to get to the
worst-case node 'a', which is furthest from the root node 'e', it takes three
hops to reach the desired node. Note that there are eight items in this tree.
Below, a very unbalanced binary search tree, which has to go through the
nodes in order (as if it were a linked list) to find the worst-case requested
data (which would be node 'e' for this diagram). It takes four hops to get to
node 'e', even though there are only five items in the list. The best balance
for this tree would be to have node 'c' as the root node, whereas it would take
only two hops to get to the worst-case nodes of 'a' or 'e'.
When you add data in an order, the normal B-tree becomes horribly unbalanced. This is referred to as a degenerate tree. Since red-black trees maintain a balance, there is not as much node-hopping to locate the specified data.