What's Missing In .NET and Other Collection Libraries

2/24/2004 5:50:43 PM

What's Missing In .NET and Other Collection Libraries

Collections are probably one of the most important components of any library. The framework's collection classes, as of Everett, are somewhat weak, especially compared with Java and STL. With the recent work with collections in Whidbey, much of the gap in functionality with Java will be reduced but not eliminated, although the integration and performance of generics collection should be superior to Java's collection libraries. Some new collections, like linked lists, will be introduced, but I haven't seen any indications of sets or other tree-based data structures from Java being introduced.

STL.NET, a .NET port of STL, will also be introduced by the VC++ teams, and indications are that the library will be usable by other managed languages; unfortunately, the style and naming conventions of that library are the same as that of native STL and will be incompatible with the framework's own style and naming guidelines.

All of these libraries, including STL, are missing an important type of collection--an indexable list with fast (logarithmic or better) insert and deletion times. These libraries simply offer a tradeoff--either the indexing time or the insertion/deletion time must take linear time. Many applications will simply stick with a simple array or ArrayList and accept the linear time performance. The result is the widespread phenomenon where a user will supply a large, but reasonable, amount of data into an application, and that application will soon slow to a crawl.

Many commercial applications will try to be a little more sophisticated by creating a list of lists, which still delivers linear time performance, but improves insertion time substantially by a constant factor. Such an approach still doesn't scale well after a certain point. For random access of elements in large sequential data, true industrial-strength applications will use a logarithmic-time list implementation, which is usually based on a tree or possibly a skip-list.

I wrote a general-purpose list, based on a balanced tree implementation. The list is broken up into chunks, each of which are of length 1 to some maximum specified chunk size. The chunks are all stored as nodes in the binary tree. In addition to the chunks, the nodes also contain a relative index value for keeping track of each node.

Insertion and deletion time are O( lg(n/chunkSize) ), which is practically instantaneous. Indexing time, a little slower, is also logarithmic O( lg(n/chunkSize) ). However, using an enumerator to sequentially access a range of elements via a tree walk yields constant time access per element on average. This leads to the interesting conclusion, that foreach is actually faster than indexing inside a for loop. Foreach takes O(m) on average to access m items, while for takes O( m lg n ) for the same set of items.

This implementation took me less than a week; in addition, I was able to add additional capabilities such as the ability to compactly store a run of elements, making this list also suitable for a run array or a sparse array. With trees, a lot of nice features become possible.

For the small amount of time it took me to write this implementation, I am surprised we don't have such list implementations in any library. It seems to me that a general-purpose scalable list class should be in all the standard collection libraries. It will probably be the case five years from now, but, for now, I am considering publishing my own implementation, as soon as I have converted it to use Whidbey generics and have worked with it for a few months to be sure all the bugs are out.






Net Undocumented is a blog about the internals of .NET including Xamarin implementations. Other topics include managed and web languages (C#, C++, Javascript), computer science theory, software engineering and software entrepreneurship.

Social Media