Allan Woloshin

Subscribe to Allan Woloshin: eMailAlertsEmail Alerts
Get Allan Woloshin: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn


Related Topics: C++ Developer

CPlusPlus: Article

Taking the Leap - C++ containers vs C# Collections

Taking the Leap - C++ containers vs C# Collections

While moving from C++ to C# means giving up template-based containers, that doesn't mean you can't effectively organize your data. And like C++, C# collections have some unique benefits.

The concept of computerized arrays has been around almost as long as computers themselves. It allows a program to deal with large quantities of data almost as simply as dealing with a single unit of data. It underlies almost all sorting algorithms. C++, like most other languages, has built-in language support for arrays.

In C++, arrays are always one-dimensional - but you can allocate arrays of arrays to counter that fact. The name of an array is almost always converted into a pointer to its first element, and most array operations work equally well on pointers. For nonrectangular arrays, C++ works equally well with arrays of pointers - allocating and freeing the odd-shaped array can be inconvenient, but otherwise it works extremely well.

Why would anybody want anything more? There are several problems with arrays.

For one thing, you cannot pass an array to a subroutine; instead, C++ automatically converts the array name into a pointer to the first element. This means that arrays are always passed by reference rather than value - just the opposite of most C++ arguments. If the subroutine changes the content of the array, the caller sees the changes, like it or not. This also means that the subroutine doesn't automatically know the size of the array! Many security holes in C and C++ programs can be attributed to functions such as gets(), which accepts the address of a character buffer but not the maximum length.

Converting an array address to a pointer solves many problems, but it causes some, too. C++ has special rules governing the use of pointers, and not all of them are designed to work correctly with pointers. For instance, you can pass an array of Derived to a function that expects an array of Base and the compiler won't complain - but your program won't work correctly.

Arrays are also inflexible. Your program must know the maximum size in advance or else it must be prepared to re-allocate the array (and copy all elements) whenever the array is full. The latter usually means keeping track of two different sizes - the amount of space reserved so far, and the amount of space actually used. In order to "insert" an element into the array, you must be prepared to shift every existing element after that location. Any time you expand an array or insert elements, you also change the address of elements - if another part of the program is a pointer or reference to a moved element, that pointer or reference is now "dangling" and a potential source of major program errors.

Finally, arrays are pure memory devices - unless your operating system has a "virtual memory" capability, every element takes up memory all of the time.

Most of these problems can be solved. You can pass a length parameter along with your array pointer; you can use special naming conventions that make it easy to spot when you've passed the wrong type of array; and so on. The thing is, you have to know what you need. In other words, it's error-prone.

An alternative to the built-in array is the linked list. C and C++ make it easier to build linked lists than many other languages, and this can solve the problem of having to extend the array and of inserting new elements without moving the others. Also, linked lists usually allow mixed types; putting a Derived object in a linked list of Base probably isn't an error. But the list still needs to fit into memory; the size is still unknown unless it's separately maintained; and it's still not possible to pass the linked list by value. Furthermore, linked lists are very error-prone, especially on multi-threaded environments or in the presence of exceptions, and linked lists cannot be processed in arbitrary order.

In the late 1980s and throughout the 1990s, Alexander Stepanov and David Musser, and later Meng Lee, developed a library of template classes and algorithms. They had several goals, including algorithms that were fast, powerful, and extensible. The library had many different components, but among the most useful were their container classes. In 1997 they offered their work to the ISO/IEC C++ standards committee, which adopted it, with changes.

C++ Containers
The C++ language standard (ISO/IEC 14882-1998[E]) says in Clause 23 that containers are "components that C++ programs may use to organize collections of information," and adds: "Containers are objects that store other objects. They control allocation and deallocation of these objects through constructors, destructors, insert and erase operations." It goes on to list the two general types of containers: sequences and associative containers.

The C++ standard says, "A sequence is a kind of container that organizes a finite set of objects, all of the same type, into a strictly linear arrangement." The three types of sequences are vector, list, and deque.

  • vector: The closest approximation to a built-in array, it avoids most of the problems listed above. A vector maintains both a size and a capacity. vector<bool> is specialized to minimize space by packing true/false values in memory, at the cost of speed and code bloat.
  • list: The closest approximation to the linked list I spoke of above. It's much slower for random access, but it allows items to be inserted quickly at any point
  • deque: A compromise between the two. Like a vector, it's quick for random access. It can also insert elements at either end quickly, although inserting elements in the middle isn't as quick as with list.

    When you iterate through all the elements of a sequence, you get them "in order" - the first element comes first.

    The C++ standard says associative containers "provide an ability for fast retrieval of data based on keys." Practically speaking, this means that the index doesn't have to take an integer. (If it does take an integer, the values don't have to be contiguous; this is known as a "sparse array.") The four types of associative containers are map, multimap, set, and multiset.

  • map: Holds data that can be accessed by a key. For instance, a map of customers might be accessed by customer number.
  • multimap: Much the same thing as a map, except that it allows duplicate keys. For instance, two library books might have the same Dewey Decimal number.
  • set: Much like a map, except that the data is the key. A good example is an English-language dictionary; if the word is in the set, it is properly spelled.
  • multiset: Much the same thing as a set, except that data can be in the multiset more than once.

    The associative containers can only contain elements that can be sorted - you can either make sure that operator less-than works correctly, or else you can construct a class object that knows how to sort the contained objects. When you iterate through all the elements of an associative container, you get them in that sorted order.

    The library also defines container adapters such as queue, which accepts a sequence such as list or deque, and ensures that access is always at the beginning or end; stack, which does much the same thing, except that elements are popped off in reverse order from when they were pushed on; and priority_queue, which returns the lowest element first.

    These standardized containers solved a lot of problems. With such a rich assortment to choose from, there is bound to be one that satisfies performance requirements for almost any need. Standard C++ library routines understand how to access data in these containers, and there is a rich variety of algorithms included in the library - algorithms for copying, sorting, even I/O.

    Unfortunately, there were several problems that standardized C++ containers simply couldn't solve. For instance, the standard containers require all elements to be in memory at once. (However, they provide a framework you can use to develop your own container without this requirement - and your container will work with all Standard C++ algorithms.) Mixed data types are still a major issue.

    Also, it's still a bad idea to have a reference or pointer to a contained element, except under certain very well-defined circumstances. To make it possible to iterate through all of the elements in a container, we need some other mechanism. These are called iterators.

    C++ Containers Cannot Contain Mixed Types!
    We hinted above that C++ containers have a problem with mixed data types, such as a hierarchy of classes. We've written some demonstration code in Listing 1 (Listings 1-4 are available on the Web at www.sys-con.com/webservices/sourcec.cfm). The programmer wanted to display employee information for Wilma, and then the customer information for Barney. But that probably isn't what he'll get. people is a vector of person, so you can't put anything into people except objects convertible to person. employee and customer are convertible to person, so the code above compiles without error or warning messages. Actually, the person subobject is copied into the vector. In other words, the integer i is copied, but not the cname or ename.

    The same thing happens with arrays:

    void bar() {
    Person people[5];
    // Copies the person subobject of Wilma
    People[0] = Wilma;
    // Copies the person subobject of Barney
    People[1] = Barney;
    }
    If you must store hierarchical items in an array or container, you do so by creating a container (or array) of pointers to person, as we've shown in Listing 2. The last part of the code sample illustrates a problem with this technique. You must delete all of the elements whenever the vector is going to be deleted, or else you get memory leaks. This is both tedious and error-prone.

    How C++ Iterators Work
    In C++, an iterator is a generalization of a pointer. In fact, any standard library algorithm that accepts an iterator will also accept pointers to an in-memory array. But iterators can do much more than simple pointers do.

    The C++ standard defines five different types of iterators:

  • Input iterator: Allows you to read data. In other words, you can dereference it on the right side of an assignment expression. You must increment an input iterator once (and only once) between each read.
  • Output iterator: Allows you to write data. In other words, you can dereference it on the left side of an assignment expression. You must increment an output iterator once (and only once) between each write.
  • Forward iterator: Allows you to read and write data and increment the pointer. You can move arbitrary distances using the algorithm std::advance(), but be aware that the time taken might be proportional to the distance you move. For instance, std:: advance(I,5) might take five times as long as std::advance(I,1).
  • Bidirectional iterator: A forward iterator that also allows you to decrement the pointer.
  • Random iterator: A bidirectional iterator that also allows you to move arbitrary distances in constant time. For instance, pointers are random iterators. Besides algorithms such as std::advance(), you can also use the index operator[], and you can do the same type of math operations that you can with pointers. (For instance, you can subtract two random iterators to find the distance between them.)

    All C++ containers have methods (member functions) that return iterators that point into the container. The type of iterator returned depends on the container type.

  • deque and vector have random iterators.
  • list has bidirectional iterators, as do map, multimap, set, and multiset.
  • Nothing in the standard library creates forward iterators that aren't also bidirectional iterators; however, if you create one yourself, it will work with many different standard algorithms.
  • The library defines input iterators that work with an istream, and output iterators that work with an ostream.

    Most algorithms take not one, but two iterators to define a "range" of data. One iterator marks the beginning of the data and another marks the position past the end. This arrangement seems strange at first, but it turns out to be rather convenient. If the iterators are bidirectional or random, you can determine the number of included elements by subtracting the two. There is no need to use special flags to define an empty range; simply pass the same iterator value for beginning and past-the-end. Also, it makes testing for the end of a range simple; with some iterator types, you cannot use less-than or greater-than to tell if the range is finished, but a special sentinel value can be used to mark the end, and when input or output is finished, this value is returned.

    When manipulating a container, it's important to consider its effect on existing iterators. For instance, if you add an element to a vector and the internal capacity has to increase, iterators, except for end(), are not necessarily valid any more. This is especially inconvenient when you try to examine and manipulate a container at the same time; for instance, if you write a program that erases entries from a deque of customers, the act of deleting a customer might make you "lose your place." Many operations return new iterators; for instance, erase() might return an iterator to the element after the one that was deleted. Other operations are more problematic.

    There is a lot more to C++ iterators; we've only scratched the surface in this article. Consult a good C++ textbook to find out more.

    List and Explain C# Collections
    Perhaps the most useful C# collection isn't considered a collection at all. C# arrays are much more powerful than C++ arrays, and they natively support many operations that require a class in C++. For instance, the Length property tells you the number of elements in the array, like the size() method in C++ containers.

    C# arrays can be made multidimensional in the same way that C++ arrays are: by creating an array of arrays.

    int[][] jagged = new int[3][];
    jagged[0] = new int[3]; // 3 elements
    jagged[1] = new int[8]; // 8 more
    jagged[2] = new int[4]; // 4 more, total 15
    int q = jagged[1][4];
    They can also be multidimensional in the classic sense, although of course this must be rectangular.

    // Allocate 20 elements.
    int[] normal = new int[4,5];
    int r = normal[1,4];
    However, C# arrays don't support dynamic resizing or inserting in the middle, any more than C++ arrays do. If you want to do this type of operation, you need to use one of the collection classes. The list of standard C# collection types is even more rich than the list of standard C++ container classes.

  • ArrayList: A dynamically sized array that can contain any sort of object
  • BitArray: A compact array of bit values whose value can be either true or false
  • HashTable: Maintains a sorted list of key/ value pairs that can be accessed by key value
  • Queue: Represents a first-in, first-out collection of objects
  • SortedList: A sorted list of key/value pairs that can be accessed by either key or index
  • Stack: Represents a last-in, first-out list of objects
  • StringCollection: Maintains a collection of strings

    C# Collections Are Weakly Typed
    Unlike C++ containers, C# collections can contain any type of object. You can easily mix employees, customers, and even nonperson objects in any of the C# collections without worrying about memory corruption or unintentional object splicing.

    This eliminates an entire classification of hard-to-explain, hard-to-find errors. But this freedom doesn't come for free. C# collections return the contained object as a pointer to an object; that means most of your custom methods won't work correctly until you fix the pointer type.

    CollectionOfFoo[j].bar(j); // Won't compile
    ((Foo)CollectionOfFoo[j]).bar(j); //
    Okay
    At the root of this minor inconvenience is a more devilish problem. Even though you know that your collection holds only Foo objects, the compiler doesn't know this. Nothing prevents you from inserting an integer into a collection of Foo objects. Every time you access one of the collected objects, the program silently renegotiates the object type. This means that using C# collections will be at least slightly slower than equivalent C++ containers, although the difference will probably be quite minor.

    Comparison of C++ Container Classes with C# Collections
    There are some obvious parallels between C++ containers and C# collections. Although a C++ queue<> isn't the same as a C# queue, the two are equivalent enough that they can usually serve the same purpose. Table 1 shows the comparison.

    C# does support the equivalent of a C++ forward iterator, but this is rarely used. Instead, elements in C# collections are generally accessed in one of two ways: through the indexer or through a foreach statement.

    To use the indexer, simply use familiar array notation to access an element. Like C++, different C# collections can use different data types for the index. For instance, a HashTable might let you look up a customer by Customer Number.

    If you wish to process an entire collection or search through the elements in order, you should use the foreach statement.

    How foreach Works
    One feature that sometimes makes C++ programmers jealous of their VB counterparts is the for each loop.

    Consider a simple for loop and array access in C++:

    for(int j = 0; j < mylist.Count(); j++)
    {
    dosomething(mylist[j]);
    }
    The programmer that writes this has to pay careful attention to a variety of minor issues.

  • Initialization: If j is not initialized properly, the loop may not execute at all.
  • Termination: If the Boolean expression (j < list.Count()) is not correct, the loop might terminate early - or not at all.
  • Incrementing: With integers it won't matter, but sometimes the difference between preincrement and postincrement is phenomenal!
  • Standards: Is count() the correct way to get the number of elements, or should it be size() or getcount() or something else?

    With the introduction of the foreach loop in C#, problems like these are outdated. foreach iterates through each element in a collection. A simple foreach loop may look like:

    int[] numbers = {1,2,3,4};
    foreach (int j in numbers) {
    dosomething(j);
    }
    You must have already noticed the ease of use of foreach loop and how it took care of lot of potential problem areas in the for loop. With the foreach loop, there is no special setup required for the loop, no special exit condition, no need for a function to get the size of the collection, and no need to fetch individual items in the collection. Deciding how many times to loop is the responsibility of collection, relieving the programmer of this burden.

    A foreach loop allows iteration for any class implementing the IEnumerable interface. This includes all C# collections and collection classes included in namespace System.Collec- tions. You can also write your own collection class and make it work with foreach by following IEnumerable guidelines.

    foreach will automatically place the fetched element into a temporary variable. Note that the temporary variable j cannot be modified, i.e., j is read-only. Attempts to modify j will produce a compile error.

    foreach(int j in numbers) {
    j++; //will fail to compile.
    dosomething(j);
    }
    How to Write Your Own C# Collection
    Our first advice is: Don't do it if you don't need to! The collections included with C# are quite comprehensive, and handle a wide variety of different uses. Not every conceivable use, of course. If you have some strange memory constraints, or some need to use more than one key at the same time, you might have good reason to write your own collection.

    The easiest way to create your own collection is to implement the indexer property. A simple indexer property might look like this:

    public object this[int index] {
    get {
    if (index > -1 && index < this.List.Count) {
    return (this.List [index]);
    }
    }

    set {
    if (index > -1 && index < this.List.Count) {
    this.List [index] = value;
    }
    }
    }

    The index does not have to be an int; it can be any data type you wish to support. You use this value to access the correct element. A get access occurs when your indexer is used anywhere in an expression except on the left-hand side of an assignment.

    Foo f = myCollection[3];

    C# automatically calls your property get code with index set to 3. The set access occurs, naturally enough, when the index expression is on the left-hand-side of an assignment statement.

    myCollection[3] = f;

    In Listing 3, we implement an indexer and access objects using it. What happens when client code tries to use a foreach loop instead of using an indexer?

    class MyClass {
    static void Main(string[] args) {
    MyCollection words = new MyCollection();
    foreach(string str in words) {
    Console.WriteLine(str);
    }
    }
    }
    This code will fail to compile, sending a message that it cannot find GetEnumerator(), which is a method exposed by the IEnumerable interface. For a class to be usable in a foreach loop, it must support the IEnumerable interface, which is defined in System.Collections.

    Perhaps surprisingly, IEnumerable has only one method; IEnumerator GetEnumerator(). GetEnumerator() returns an instance of interface IEnumerator, which is also defined in the System.Collections namespace. The IEnumerator allows unidirectional iteration (read-only access) of items in a collection. IEnumerator publishes two methods: bool MoveNext() and void Reset(); and one property, object Current {get;}.

    MoveNext(), like the name suggests, moves the enumerator to the next element in the collection. It returns true if the operation was successful, or false: if there aren't any elements left to enumerate.

    Reset() sets the enumerator to the initial position, which prepares the Enumerator for iteration over the collection. Note: Initial position is not the first element in the collection; rather, it's a virtual position before the first element in the collection.

    Current {get;} gets the current element in the collection. If the last call to MoveNext() returned false, or if MoveNext() hasn't even been called after Reset(), then Current {get;} throws an IndexOutOfRangeException.

    To make our class support all of the same types of access that other C# collections do, all we need to do is derive our class from IEnumerable, define a support class from IEnumerator, and then implement GetEnumerator() by returning an instance of our new class. (Note: It's possible to have our class derive from both IEnumerator and IEnumerable, and GetEnumerator() would just return this. We don't recommend this, because it isn't thread-safe.)

    Listing 4 shows all the changes required to implement an enumerator for use with the foreach construct. This class can be used the same way as any other C# collection class.

    Conclusion
    C# doesn't have templates or the STL, so it can't use C++ container classes. C# does have a rich assortment of collection classes. In C++ it's usually dangerous to mix types within one container; C# always allows this, but there are disadvantages to this approach. The C# for each statement is a convenient way to process all elements in a collection. It's surprisingly easy to write your own C# collections that work just like the library versions.

  • More Stories By Allan Woloshin

    Allan Woloshin, a system architect at SEI Information Technology, has become proficient in a wide array of technical languages and tools during his 25 years as a developer and analyst. He is experienced in the full lifecycle development of accounting, project management, and resource management applications, for industries including finance, education, security, computer services, and real estate. He holds a Bachelor's degree in computer information systems, and is a Microsoft Certified Solutions Developer

    Comments (0)

    Share your thoughts on this story.

    Add your comment
    You must be signed in to add a comment. Sign-in | Register

    In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.