Big Lists

8/14/2006 9:12:56 PM

Big Lists

As part of my goal of programming in a more functional programming style in C#, I have been looking at scalable, immutable representations of lists. Some may feel bothered by the log(n) allocations required for each operation to a persistent data structure. However, an operation can be arbitrarily complex like a set of assignments or an insertion of an array of values. Plus, persistent data structures enable a whole set of high-level optimizations.

Having built immutable sets and maps, I already had an implementation idea for lists, but I decided to look around on the web. I did build a few years ago a scalable, mutable implementation of a list using trees as detailed in my post on .NET Collections. The ideas from that post were also picked up by Peter Golde in his post on tree-based lists and eventually became the BigList class in the PowerCollections library. Peter’s BigList class can be efficiently used as an immutable list in a functional programming style or as a traditional mutable list.

His list is not as highly optimized for functional programming as I would like though: Immutable list deletion requires two substrings and one concatenation operation; this could have easily been made one operation. In addition, some other operations, such as equality testing remain slow. I liked the ability to perform differencing and merge operations very quickly, since I anticipate writing algorithms that work on two lists derived from a common list.

BigLists are based on ropes, elegantly described in this paper, Ropes: An Alternative to Strings.  A rope is an immutable representation of a list (in this particular case, a string of characters), which provides for logarithmic-time modifications of data and efficient sharing of data. The idea of ropes made me rethink my position on piece tables in an earlier post of mine, Programmer’s Myopia; the use of linked lists is still a bad design decision. Ropes were used as an alternative representation of strings in the original SGI implementation of the STL library:

Ropes are a scalable string implementation: they are designed for efficient operation that involve the string as a whole. Operations such as assignment, concatenation, and substring take time that is nearly independent of the length of the string. Unlike C strings, ropes are a reasonable representation for very long strings such as edit buffers or mail messages.

Though ropes can be treated as Containers of characters, and are almost Sequences, this is rarely the most efficient way to accomplish a task. Replacing an individual character in a rope is slow: each character replacement essentially consists of two substring operations followed by two concatenation operations. Ropes primarily target a more functional programming style.

Advantages:

  • Much faster concatenation and substring operations involving long strings. Inserting a character in the middle of a 10 megabyte rope should take on the order of 10s of microseconds, even if a copy of the original is kept, e.g. as part of an edit history. In contrast, this would take on the order of a second for conventional "flat" string representation. The time required for concatenation can be viewed as constant for most applications. It is perfectly reasonable to use a rope as the representation of a file inside a text editor.
  • Potentially much better space performance. Minor modifications of a rope can share memory with the original… Assignment is simply a (possibly reference counted) pointer assignment. . It is very inexpensive to checkpoint old versions of a string, e.g. in an edit history.
  • It is possible to view a function producing characters as a rope. Thus a piece of a rope may be a 100 MByte file, which is read only when that section of the string is examined. Concatenating a string to the end of such a file does not involve reading the file….

Disadvantages:

  • Single character replacements in a rope are expensive. A character update requires time roughly logarithmic in the length of the string. It is implemented as two substring operations followed by two concatenations.
  • A rope can be examined a character at a time through a const_iterator in amortized constant time, as for vector<char>. However this is slower than for vector<char> by a significant constant factor (roughly a factor of 5 or 10 if little processing is done on each character and the string is long). Nonconst iterators involve additional checking, and are hence a bit slower still…
  • Iterators are on the order of a dozen words in size. This means that copying them, though not tremendously expensive, is not a trivial operation. Avoid postincrementing iterators; use preincrement whenever possible...

Ropes provide locality benefits as well as their own rebalancing scheme, based on Fibonacci series. I’ll likely be using ropes for my own immutable list implementation.

I rediscovered an algorithm in Joe Duffy’s blog that may improve performance of my persistent data structures. Joe describes an algorithm called “Bloom Filters,”  and points to the original paper and the Wikipedia article. I also came across a well-written paper, “Network Applications of Bloom Filters.” Bloom Filters are very good for speeding up a certain category of software problems. They have been used in spelling applications.

I was intrigued by Joe’s remark that filters could improve the best-case performance time of binary-tree based structures to constant time. Operations on my implementation of persistent sets and maps take O(log n). A treap-based averages about 1.7 log2 n traversals—for a 100–element set, that would be about 12 comparisons.  In tight loops in which the same values may appear often, I cache values into a constant-time hashtable. The Bloom Filter provides an alternative screening method with dramatic space savings; however, the filter does require precomputation beforehand.

The Bloom Filter would also have special application in my natural language work. I maintain an ontology, which essentially is graph of about 30 different relationships between all the words of the English language. It is expensive for me to test for a transitive relationship, or even a direct relationship, between any two words, because I have to do a graph search each time. The probability of any two words having any relationship is very small, and the Bloom filter would allow me to quickly eliminate most negative matches from consideration.

Comments

 

Navigation

Categories

About

Net Undocumented is a blog about the internals of .NET including Xamarin implementations. Other topics include managed and web languages (C#, C++, Javascript), computer science theory, software engineering and software entrepreneurship.

Social Media