Tuesday, May 08, 2012

Practical Patterns 2: Iterators


As mentioned last time, this installment of Practical Patterns will focus on a pattern that is (hopefully) familiar to most programmers, even if they don't realize it: the Iterator pattern. This is another broadly useful utility pattern, that shows up in some for or another in almost any production application. It is such a useful pattern that many languages have built it into their standard libraries. So, lets take a look at the Iterator.


Who

The Iterator Pattern - A pattern for accessing contents of a container.

What

The Iterator pattern is an object behavioral pattern that models access to an aggregate object, for the purposes of reading the individual elements contained within. The pattern consists of two parts. The aggregate object, or Container, contains a discrete set of smaller elements that are somehow related to each other. The Iterator is a distinct object that provides a mechanism for extracting individual elements from the container, one at a time, as requested by a client object.

The Iterator object is an abstraction layer that sits on top of various types of containers, and provides a uniform access mechanism for those objects. The implementation of those containers are hidden behind the iterator itself, which is aware of such details as ordering, random vs. sequential access, and the internal structure of the container. The consumer of the Iterator, on the other hand, merely needs to know how to ask for more elements, regardless of how they are obtained.

Depending on the context, you will often hear the Iterator Pattern referred to as an enumerator. The two terms are very closely related, but there is a subtle distinction. (There was an unfortunate choice of names made in the .NET BCL that tends to blur this distinction, which we'll see in a minute.) An enumerator is, in fact, one specific kind of iterator, but it is not the only kind.

Iteration, in general, means to perform a particular set of operations repeatedly for a fixed number of times. An Iterator Pattern applies this concept to the operation of "get an element from a collection", and an object which performs this iterative collection access is called an iterator. Enumeration, on the other hand, is the act of listing out the contents of a set of objects, typically in some order. An object which can perform this operation on some collection is called an enumerator. If a collection has a finite, well-defined set of elements, an obvious choice for an iterator over that collection would be to use an enumerator and return an element at a time. However, a collection may not contain concrete elements, but rather instructions for how to produce elements on demand (LINQ queries are implemented this way, for example.) In this case, the collection does not have an enumerator defined for it, but rather a generator - an object that knows how to produce a single element upon request. The iterator for such a collection would repeatedly ask the generator for elements, and there would be no enumeration involved.

The Iterator Pattern doesn't distinguish between these two cases; any object capable of providing elements in a sequence can be used to implement an Iterator. More formally, the Iterator has the following requirements:
  • Provides access to the elements of an aggregate object (the container),
  • One at a a time as requested,
  • In a particular sequence (but not necessarily a well-defined order),
  • Starting from the first element and continuing until there are no more elements.

When

Iterators are very broadly useful, in the sense that any operation acting on a container object can benefit from an Iterator. However, the abstract, uniform nature of the Iterator Pattern lends itself particularly well to two common use cases: algorithms and built-in looping constructs.

Computer scientists who study in the fields related to computation theory spend a lot of their time discussing and deriving algorithms to help solve problems. These algorithms are created independent of any particular hardware or language or run-time environment, and typically operate on theoretical concepts like "sets" and "groups". Ultimately, though, an algorithm needs to be implemented in some concrete form before it is truly useful. This means translating the mathematical definition into a form that a compiler can understand, which (among other things) means selecting the appropriate data structures on which the algorithm operates.

Ideally, we would like our algorithms to work the same across as many such implementation choices as possible. For example, a quick sort algorithm should work the same, with the same fundamental speed and space constraints, when used on a linked list, an array, or whatever other data structure we might imagine. More importantly, we would really like it if we could write the algorithm code once, and then apply it to whatever data structures we happen to have. For this to work, we need to abstract as many operations as possible from those data structures, into a form we can use to write our algorithms.

The concept of enumerating through a set of items is one of the most basic operations performs by any algorithm that operates on a collection. If our algorithms needed to know the details of accessing elements in an array (by indexing) vs. a list (by following pointers) vs. a hash table (by running a hash function), they would lose much of their re-usability. It also means we'd have to write effectively the same code multiple times, leaving ourselves open to introduce bugs in one or more of those implementations that do not appear in the others. In C++, this is called the "m * n" problem: without iterators, if we have m algorithms and n containers, we need to write m * n concrete methods to implement them. By generating a common Iterator pattern for each such underlying collection, our algorithms can be programmed to use that object, and we now only need to write m + n methods - one per algorithm plus one per iterator. That is a huge net win.

Even absent a formally defined algorithm, though, looping over the contents of a collection-type object is something that almost any decent-sized program is going to have to do very frequently. Many modern languages provide a built-in keyword or other construct that simplifies this operation, essentially serving up each element one at a time to be processed.

For built-in container types (e.g. most languages have a concept of an array, some languages have more than one type) writing a built-in keyword is relatively easy. But the real power of these language elements is that they also work on user-defined container types, which the language compiler knows nothing about. This is only possible if every container provides a uniform means to access their contents - an Iterator. Besides providing the same code reuse benefits we've already talked about, it means that developers only need to remember one mechanism for accessing the contents of a collection. This reduces the number of moving parts in a program, and reduces the chances that you will get something wrong. It also provides a stable target for library developers who want to share their custom container objects with other people: if your objects act just like the built-in ones, people will understand them better, be more likely to use them properly, and become comfortable with them much more quickly.

Where

As mentioned, iterators show up all over the place in programming. Almost any modern programming language or framework will include some mechanism for iterating over collections. The "biggest " of these are the C++ standard library, and the Java and .NET base class libraries. The C++ STL, for example, was built around the core concept of separating algorithms, containers, and iterators from each other. Algorithms are templated on iterators, and containers provide iterators, making it possible to use any of the half-dozen or so containers with any of the hundreds of algorithms.

In C#, iterators are implemented via the (unfortunately named) IEnumerable and IEnumerator interfaces. Note that, unlike Java, all iterators in C# are read-only. (There is no separate Iterator interface with a remove() method.) In early versions of C#, almost all iterators were, in fact, enumerators, and the interface names were likely chosen with this in mind. C# 2.0 introduced a new form of iterator, a generator, by way of the yield keyword. Behind the scenes, a generator function is still implemented as an IEnumerator, but conceptually is does not have to be an enumerator. More importantly, since both are implementations of the Iterator Pattern, from the outside it makes no difference.

As far as other related patterns, Iterators are used quite often with Composite pattern (to enumerate child elements) and the Visitor pattern (to enumerate over the elements being visited). They are a key part of the Memento pattern, which is essentially an iterator with a memory. (If you are familiar with SQL, cursors are a typical implementation of a Memento pattern.) Iterators themselves are frequently produced by a Factory method (as they are in C#).

Why

Iterators are probably the most obvious way to access multiple values from within a collection of objects. Modern languages strongly encourage their use over alternate methods whenever possible. But there are trade-offs to make when using an iterator.

For starters, in C#, iterators are read-only. The Iterator Pattern doesn't require this to be true, but rather defines a concept called a robust iterator which is capable of dealing with underlying changes to the collection. IEnumerable iterators are not robust, by this definition: their contract requires that changes to the underlying collection invalidate the iterator and throws an exception if the iterator is used any further. (Java enumerators and iterators also behave this way, with the exception that an Iterator can use remove() to delete the single most recent element that it produced. The STL has a lot of different iterators, some of which are truly robust.)

In practice, this means that certain common operations on collections don't work with iterators (at least, not in the most obvious implementation). For example, the following code will throw an exception:
foreach (var x in lotsofx)
{
  if (x % 16 == 0)
  {
    lotsofx.Remove(x);
  }
}
Instead, we would need to know something about the collection to know how to safely remove elements from the collection. In most cases, we end up using some form of manual iteration, like the following:
for (var i = lotsofx.Count - 1; i >= 0; i--)
{
  if (lotsofx[i] % 16 == 0)
  {
    lotsofx.RemoveAt(i);
  }
}
Notice that we traverse the list backwards; this is needed because removing an item shifts all of the subsequent items back one toward the beginning. If we traversed our list forward, as the Iterator would try to do, we'd end up skipping elements, and probably falling off the end of the list.

Another problem with iterators is that they are sequential-access only. For some data structures, there are faster and more efficient ways to access elements besides sequentially. For example, if we were to use an Iterator to find elements in a linked list, which only has sequential access, that would work fine. But using an iterator to search a sorted array would probably be a bad idea, when we could use a random access algorithm like a binary search instead. (The C++ STL actually provides something called a random access iterator but it is only available for random access containers, and is basically an Iterator with an additional operation tacked on. The STL algorithms do not use this iterator directly.)

But if we limit ourselves to just the problem domain at hand - sequentially accessing the elements of a collection in order - then there really isn't a good alternative to using an Iterator. Part of the power of this pattern is that it is so universal that any other attempt to solve this same problem essentially implements Iterator under some other name. Particularly in languages (like C#) which provide a foreach-style keyword, using iterators is almost always the correct way to go.

How

We've already seen the easiest way to consume an iterator in C#: we use the foreach keyword. This language feature abstracts away the use of the pattern. Although there's usually no need to do so, we could access the Iterator directly. In the BCL, this requires using the IEnumerable interface to gain access to the IEnumerator that a given collection implements. Typically this is two separate classes: the collection class returns a specialized enumerator class. (It's possible to merge them into a single class but it severely limits what can be done with the enumerator.) Using the pattern directly would look like this:
var list = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
var iter = list.GetEnumerator();
var sum = 0;
while (iter.MoveNext())
{
    sum += iter.Current;
}
The type of list is a List<int>, while the type of iter is an instance of the nested List<int>.Enumerator class. The key is the MoveNext operation; this advances the Iterator to the next element in the set, setting the Current property to that value. The operation returns false once it has exhausted the entire collection.

There are two ways to implement a new iterator in C#: implement the interfaces, and with the yield keyword. The code produced by these two approaches is roughly identical, but the yield keyword hides most of the scaffolding required to implement the interfaces. First, though, lets look at the traditional long-form way.

As mentioned, there are two interfaces needed to make this pattern. In a typical scenario, we will have a class which behaves as some type of set or collection, which implements IEnumerable. This class will include a nested class, which implements IEnumerator, that acts as the class's iterator. Typically, in new code these would be implemented as generic interfaces, specifying the element type of the collection. However, the generic interfaces derive from the non-generic versions, so any implementation will have to implement both. (This is done mostly for backward compatibility; it allows newly created generic collections to operate with older code that expects the non-generic version.) We generally don't want to encourage use of the older non-generic version, so we'll implement it explicitly.

A simple, silly example might look like this:
public class IntegerList : IEnumerable<int>
{
    public class Enumerator : IEnumerator<int>
    {
        public int Current
        {
            get;
            private set;
        }

        object IEnumerator.Current
        {
            get
            {
                return this.Current;
            }
        }

        public bool MoveNext()
        {
            if (this.Current < 100)
            {
                this.Current++;
                return true;
            }

            return false;
        }

        public void Reset()
        {
            this.Current = 0;
        }
    }

    public Enumerator GetEnumerator()
    {
        return new Enumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return (IEnumerator)this.GetEnumerator();
    }
}
As you can see, that's a whole lot of code that does a very little: it returns the integers from 1 through 100 then stops. Almost every implementation of an Iterator looks exactly like this; the only change is the MoveNext procedure logic. Recognizing this, Microsoft added the yield keyword into C#, which triggers the compiler to write all of this IEnumerator plumbing for you. The same example, written using these new features, would look like this:
public class IntegerList : IEnumerable<int> 
{   
    public IEnumerable<int> GetList()
    {
        for ( var x = 0; x < 100; x++ )
        {
             yield return x;
        }
    }
}
The compiler, when seeing this keyword, produces an internally generated class which implements IEnumerator, and includes a state machine implementation of MoveNext. When the code reaches a yield keyword, it sets the Current member of the iterator and returns. When MoveNext is called again, it returns to the place where it left off and keeps going. The method can also include more than one yield keyword, and the state machine code will track where your code was at each point. (You can also yield break, to break out of the enumeration early.)

Conclusion

Iterators are a fundamental part of just about any software application that does something meaningful. The good news is, they are so important and so common, that C# gives us a lot of tools to both consume and produce Iterator objects quickly. But recognizing the pattern behind the keywords can give you a greater understanding of how your code really behaves, and what's really happening behind the scenes.

Next time, we'll look at another pattern that is deeply embedded into C# programming, especially when using UI Frameworks like WPF: The Observer pattern.

0 comments: