The Icon Bar: Building the Dream 1 - Container data structures

Hello and welcome to the Building the Dream, a new series of (regular!) articles at The Icon Bar in which I will be educating you in how to turn your programming dreams into reality. First off, let's get one thing clear - this isn't a beginner's course to programming, or a tutorial in a specific language. Instead it's the place to go once you've finished your programming tutorial and are wondering what to do next. If you have an idea for a program, but are confused about how to implement it, then this is the series for you, as I'll be covering everything from data structures and program design through to project management, optimisation, how to make sure your programs maintain the RISC OS look and feel, and even provide case studies of how certain well-known programs do their stuff.

Container data structures

Deciding which container data structures to use is often one of the most important decisions to make when designing a program. And yet they are a subject that is rarely covered properly in programming tutorials - for the simple fact that tutorials teach you how to read and write a specific programming language, rather than how to actually design a program. For example, English lessons given to a foreigner will teach them how to read and write English, but it will not teach them how to write good jokes, or compelling dramas.

Selection of the correct container data structures for use in a program can do anything from make the code a hundred times faster, to reduce memory requirements, or to make the source code cleaner (in turn making it easier to write, debug, and expand).

Typically, the most commonly used data structures are ones which are implementations of abstract data types. Whereas an abstract data type (or ADT for short) specifies what operations can be performed on a data structure (and in turn some of the ways in which the data can be processed), a data structure is the concrete implementation of that type, specifying how much memory is required and, crucially, how fast the execution of each operation will be. However for most purposes the terms ADT and data structure are interchangeable, so don't worry too much about understanding the differences between them.

Broadly speaking, container data structures are implemented using two core data types/structures - arrays and linked lists.

Arrays

I'm hoping that you already know what an array is, so I won't bother explaining it. However you may not know that a computer's RAM is essentially one big array. The memory management code in the OS then splits this array into sections, providing each program with its own area of workspace. This is an important fact to remember when writing programs, and its importance will become clearer as we go deeper into the workings of the computer in the following articles.

Since arrays are so fundamental to the construction of modern computers, it's no surprise that most programming languages have builtin support for arrays. However not all languages provide the same level of support - BBC BASIC, for example, does not allow you to resize an array once it has been created. C does allow you to resize arrays, but this can only be done to dynamically allocated arrays, and will often require you to manually update pointers (since the location in memory of the array has changed). Additionally, BBC BASIC provides range checks on array indicies, but C typically does not.

Linked lists

Linked lists can be seen as a 'dispersed array'. A typical linked list contains an ordered sequence of data elements, much like an array; but these elements are not necessarily stored in adjacent memory locations. Instead, each element in the list has a pointer to the location of the next element. A pointer to the first element of the list is also maintained - without this pointer, a program would usually have no idea where any of the elements in the list are. Some lists (called doubly-linked lists) also have pointers to the previous element, which makes some operations simpler to implement or quicker to execute. Similarly, some list implementations may keep pointers to both the first and the last element of the list, to allow the elements at either end of the list to be quickly indexed.

Because of their more complex nature, few programming languages provide direct support for linked lists. However linked lists can be implemented in almost all programming languages (either by the use of pointers or arrays).

Differences

There are several key differences between linked lists and arrays - these are surmised below:

An array takes up a contiguous block of memory, whereas a linked list does not
Arrays have their size dictated at creation, whereas linked lists start empty and can grow until all available memory is full
Accessing the Nth element of an array is very quick, but with linked lists it is very slow.
However inserting an element into the middle of an array is very slow, whereas with linked lists it is very quick (providing you have already looked up the address of the element you want to insert before/after)
Similarly, lists can easily be split into sublists or inserted into the middle of other lists, whereas with arrays it is a bit harder (i.e. takes longer)

Very quick and very slow

A formal notation has been developed to indicate how fast algorithms and data structures are - it is known as Big O notation, and it's important that you understand it before we go any further. Big O notation is generally used to indicate how fast an operation on a data structure is, in relation to how many items are currently being stored by that structure. For example, accessing the Nth element of an array always takes O(1) time - because the number of elements in the array has no bearing upon how fast the array lookup is. But accessing the Nth element of a linked list takes (up to) O(n) time, where n is the number of elements within the list at that point in time. (Typically, lowercase n is used to indicate the number of elements within the data structure). Big O notation is usually used to indicate the maximum time the operation will take (as opposed to the average time). Common Big O denotations for data structures are O(1), O(log n), O(n), and O(n²). Depending on how large your data set is, and how frequently you wish to perform specific operations on it, the way the Big O timings affect your choice of data structure will differ. For example, you would usually want to stay away from any data structure that has O(n²) timing for an operation that you want to perform regularly, because the time taken to perform the operation can easily grow from seconds, to hours, to days, to weeks as the number of data items increases.

Abstract data types

Although I could continue this article by listing some of the more exotic data structures, I think it's more worthwhile to you if I instead list the ADTs that those data structures are implementations of. This is because you will usually first choose the ADT that meets your requirements (in terms of how the data can be manipulated), before then going on to choose the implementation that meets your timing requirements (which is where the Big O notation comes in). Or at least, it's a lot simpler to explain it this way.

Although there are many ADTs in total, I'll only be aiming to cover the core group:

List-based ADTs

These are lists, queues, and stacks. These three types are usually very similar in their implementations, which is why I've grouped them together like this. A list data type, in its most general form, allows you to insert and remove items at any point, as well as to navigate back and forth through the list, either looking for a specific item, or performing a specific operation on each item. Queues and stacks are often implemented using a list (or an array if the maximum size is known); but they restrict usage so that elements can only be inserted and removed from specific ends of the list. These restrictions allow for a faster and cleaner implementation of the data type, but at the cost of flexibility should the program's needs change in the future.

Queues are, as their name suggest, like a queue of people at a bank or a theme park - new people (elements) join at the back of the queue, but only the people (elements) at the front are removed. They are known as FIFO (First-in-first-out) data types. Similarly, stacks are often explained as being like a stack of plates - plates can only be added and removed from the top of the stack, else it falls over and someone gets shouted at by their wife. These are known as FILO (or LIFO) data types - First-in-last-out/Last-in-first-out.

The two main ways of implementing lists, queues and stacks are using either a linked list or an array. These two choices provide different tradeoffs between memory usage, insertion/removal speed, and indexing (searching for the Nth element).

Although the core specifications for most ADTs provide limited functionality (typically the minimum level required for use of the type), most implementations provide further functionality, such as checking how many items are in the data structure, checking whether a specific item is contained, or the ability to examine the next item in a stack or queue without actually removing it. It's important to bear this in mind when choosing an implementation of an ADT, as if the implementation doesn't provide the interfaces your code will require, then it's probably quite useless to you.

Maps

Maps (also known as associative arrays) are essentially a list, where each element within the list contains both a key and a value. The key part is used to access the value part, in a similar way to how an integer array index is used to access a specific item in an array. However unlike an array, the key can be of almost any data type, often one specified by the user. Most map implementations even allow keys of different data types to be used within the same map. Similarly the value can be of almost any data type, and does not usually need to the the same type for every entry in the map. Most implementations, however, place a restriction on the content of the map such that each key maps to a maximum of one value (the exception to this rule being multimaps).

There are many different implementations of maps available, each offering different execution speeds, which probably goes to explain why maps have only recently started appearing as built-in types in programming languages. For example Perl and PHP have built-in support for maps. This makes them very easy to use for certain operations, serving to increase their popularity as scripting languages. Unlike list-based ADTs, maps exist only to help you store and organise data, rather than to place restrictions on how that data is processed.

A good-quality map implementation will provide a Big O time of O(log n) or better for insertion, removal, and indexing of elements.

Sets

Sets are probably the simplest collection of objects (i.e. items of data) you can get. Typical set implementations provide three interfaces - to add an item, to remove an specific item, and to list all the items that are present within the set. Usually multiplie copies of the same item can be inserted, and usually no guarantees are made about what order the items will be listed in. Because of this there are many different ways of implementing sets, each resulting in different timing values for operations.

Sets are typically implemented in the same way as maps (using a hash table or (self-balancing) binary search tree), so you can expect performance of O(log n) or better for the common operations.

Trees

Trees are typically used within programs to increase the execution speed of operations such as searching for data or performing operations on data, while at the same time allowing for rapid insertions and removals of data items from the tree structure. The name 'tree' was chosen because, diagramatically, the organisation of data within a tree represents the organisation of branches and leaves on a tree. A tree is a series of nodes; each node has zero or more children; each child is either another node, or a leaf. Depending on implementation, either the nodes or the leaves contain the data elements of the tree. Sometimes both contain data, in which case there is usually little or no distinction between a leaf and a node.

Because trees are such a generic data type, there are many different implementations, each with different restrictions on usage. Binary search trees are a common example - in which each node has a data element and a maximum of two children. Rather than manipulatiing the tree directly, the most common usage of a binary search tree is as the core data structure in the implementation of another data type, such as sorted lists, maps, or sets. This is because binary search trees provide rapid access to items in a sorted data set, as well as rapid insertion and deletion (O(log n) time for a "well-balanced" tree).

Some tree implementations allow the tree to be split into subtrees, and for those subtrees to be glued back together again. As with splitting a list into sublists, this can be a very efficient way of changing the structure of your data, should your application require it.

Priority queues

Priority queues are queues in which each element has a priority associated with it. This priority is used to order the entries in the queue, so that the entry with the highest priority is always the one to be removed next. Because the contents of the queue needs to be kept in a sorted order, linked lists are not usually used for the implementation (because sorting a linked list can take a long time). Instead, trees or heaps are often used instead, providing much better performance.

Support in programming languages

As previously mentioned, different programming languages have different support for different data structures/types. This means that in some cases you may want to base your decision about which programming language to use around what ADTs are directly supported by that language, in terms of what ADTs you think you will be needing to use in your program.

In general, older programming languages, such as assembler, BASIC, and C, have the weakest built-in support for ADTs, so you will have to either write your own code or rely upon 3rd-party libraries (such as GLib for C - a RISC OS port of which can be found here). Newer languages, such as Perl, PHP, C++, Java and C# have better support - whether through the use of ADTs as language primitives (such as maps in Perl and PHP), or through well-defined standard implementations and interfaces that come with the compiler (such as the C++ standard template library, or the Java class library).

Conclusion

Although I haven't gone into as much detail as I'd hoped - in particular about how fast specific operations are on specific implementations of data types, or even how the implementation of each data type works - I hope I've opened your eyes a bit to the different ways that you can store and manipulating data within your programs.

Next time...

Next time I'll be talking about the RISC OS sound system - everything from understanding the terminology used to writing your own sample player.