Comments on programming assignment #0

CSC 323: Software design · Spring, 2012

Department of Computer Science · Grinnell College

Choosing the element type

Regardless of whether you decide to implement queues with arrays or with lists, the type of the values you store into that underlying data structure should be void *, since that is the form in which the enqueue function receives them and the form in which dequeue and front_of_queue are expected to deliver them. It is pointless to use any other pointer type for this purpose and dangerous to use int, since C makes no guarantee that casting a void pointer to an int value and back again preserves its value.

Indeed, on a computer that uses a 64-bit architecture, a void pointer is likely to occupy eight bytes of storage and an int only four, so that casting a void pointer to int drops the four most significant bytes, with no hope of recovery. Bad idea.

Nor may you dereference the void pointer and store whatever is on the other end by casting it to, say, int. The problem with an expression like

*(int *) value

is that the caller did not necessarily store an int at the other end of that pointer. It may have started out as a char * or as a pointer to some complicated struct, and so that chopping out four bytes from the value at the other end of the pointer and storing those four bytes in a int often loses part of the datum. In any case, storing only this int value during enqueueing will not enable you to reconstruct the void pointer that the dequeue function should return.

Nor can you use malloc to allocate storage dynamically for the values at the other end of void pointers. You have no way to figure out how large that value is. Writing something like

malloc(sizeof(new_item->data))

is incorrect, because the size of the pointer is likely to be different from (and less than) the size of the value to which it points. Writing something like

unsigned int size = 0; char *traverser = (char *) new_item->data; while (*traverser++ != '\0') size++; void *result = malloc(size); (void) memcpy(result, new_item->data, size); /* or char *result = malloc(size + 1); (void) strncpy(result, (char *) new_item->data, size + 1); */

is very likely to fail unless there really is a string at the other end of the data pointer, since a zero-valued byte can occur in the middle of a struct; and if it doesn't, there's no reason to expect one to occur immediately after the struct.

Dereferencing a void pointer is almost always the wrong thing to do.

Choosing a data structure (0): Linked-list implementations

Since theoretically there is no upper bound on the number of elements that a queue can contain, and that number can vary widely and erratically during the execution of a program, it is natural to store the elements of a queue in a singly linked list, so that a new, dynamically allocated component can be added every time a value is enqueued. Such a component will be freed when the value, having passed through the queue, reaches the front and is dequeued, or when the entire queue is freed by a call to free_queue.

Each component of a singly linked list is a structure, looking something like this:

struct queue_node { void *stored_value; struct queue_node *next; }

An empty singly linked list is then conventionally represented by (struct queue_node *) NULL, and a non-empty singly linked list by a pointer to a struct queue_node in which the stored_value field is the first element of the list (the car, in Scheme terminology) and the next field is the rest of the list (the cdr).

It is possible to write a queue implementation in which struct queue_structure and struct queue_node are identical, but more commonly the fields of struct queue_structure provide information that speed up some common operations on queues. For instance, the interface to a queue type sometimes includes a queue_size function that reports the number of values currently stored in a given queue, with the prototype

unsigned int queue_size(queue q);

To support this function, struct queue_structure might well include a size field that make_queue initializes to 0 (don't forget this step!), enqueue increments, and dequeue decrements. In the singly linked list implementation, one would otherwise have to compute the size by traversing the list, counting the components along the way. Under that approach, queue_size would have an O(n) (linear) running time, whereas maintaining a size field allows the function to return a value in O(1) (constant) time.

Similarly, a naive algorithm for enqueue would traverse the list to get the last component before attaching the component containing the new value, again yielding an O(n) running time. A better idea is to add a field to struct queue_structure containing a pointer to the most recently appended component: a rear pointer. The definition of struct queue_structure might look like this:

struct queue_structure { struct queue_node *front; struct queue_node *rear; // unsigned int size; /* for an O(1) queue_size function */ }

The front pointer points to the struct queue_node containing the least recently enqueued value in a non-empty queue; in an empty queue, front will be (struct queue_node *) NULL. It doesn't make any difference what value is stored in the rear field when a queue is empty, but it is good practice to have make_queue store a null pointer and also to have dequeue store a null pointer there when removing the last component of the linked list, just so that only valid, non-dangling pointers occur in the structure.

For the same reason, the enqueue function should store a null pointer in the next field of the struct queue_node that it allocates.

It's a little dangerous to have three kinds of pointers around (void pointers for the actual values stored in the queues, pointers to linked-list components, and pointers to struct queue_structure headers), so programmers who adopt the linked-list implementation need to take care that their local variables and, especially, return values are conceptually of the correct type. (The C compiler is no help here, since it tacitly casts any pointer to whichever type it thinks is needed.)

Choosing a data structure (1): Array implementations

Alternatively, the elements of a queue can be stored in a (dynamically allocated) array, in which case the struct queue_structure typically comprises two fields, each of which holds an index into that array. In one good implementation, the “front” index is the position occupied by the least recently added element, and the “rear” index is the (vacant) position that will be occupied, when the next enqueue operation occurs, by the new element that it adds. In this representation, a queue is empty if the two indices are equal (note: not if the values stored at those indices are equal, but if the indices themselves are equal). The rear index is incremented every time an element is enqueued, the front index every time an element is dequeued. In each case, the incrementation occurs after the array operation has been completed.

It is usual for both indices to “wrap around” to 0 when the high end of the array is reached, so that the positions at the low end of the array, occupied by values that are no longer in the queue, can be recycled for new values. The size of the array is then an upper bound on the number of elements that can be stored in the queue simultaneously, rather than on the total number of values that can pass through the queue during the entire execution of the program.

The array implementation is simpler and faster when the programmer can guarantee that the upper bound on the size of the queue will never be exceeded. In the absence of such a guarantee, the programmer must decide what course of action to take if enqueue is called at a time when there is no vacant position in which to store the new value. (In the wraparound implementation described above, a problem arises even when there is one vacant position remaining, because filling it and incrementing the rear index would cause the two indices to be equal, a situation indistinguishable from an empty queue.)

One possibility is to dynamically allocate a new, larger array, copy the values in the existing array into it, and free the existing array. Since this is a slow process, it is probably a good idea to allocate an array that is, say, twice as large as the existing array, thus leaving room for a lot more elements, in order to avoid or at least postpone another resizing operation. On this plan, the struct queue_structure needs a third field, to keep track of the size of the currently allocated array (not the number of elements currently in the queue, but the amount of storage allocated for queue elements, including both the positions that are occupied by current elements and those that are not).

If it is inconvenient or impossible to allocate storage for a larger array, the enqueue function may have to report an error and halt. It is better to do this than to access storage locations outside of the array (by allowing the indices to grow too large) or to overwrite queue elements (by wrapping around into array positions that are already in use).

If you adopt this approach, don't forget to free the old array once the elements have been copied into the new one!

Handling precondition failure

The dequeue and front_of_queue functions (and, less obviously, the free_queue function) have preconditions that the caller is supposed to satisfy. It is almost inevitable, in the use of any program library, that someone will eventually call the procedure without ensuring that the precondition is satisfied. The programmer of the queue library must consider what her code should do in such a situation.

One possibility, of course, is to do nothing. In the case of dequeue or front_of_queue, the function then attempts to dereference a null pointer and the caller's program will crash with the unhelpful error message “Segmentation fault”. Well, after all, it is the caller's fault -- he should have checked the documentation and made sure that the documented precondition was satisfied!

Another possibility is to use an assertion to check that the precondition is met, writing something like

assert(q->front != (struct queue_structure *) NULL);

or even

assert(q != (queue) NULL); assert(q->front != (struct queue_structure *) NULL);

to catch the case in which the user tries to dequeue from a queue that has been freed, or one that was never allocated in the first place.

A more traditional approach, which gives the programmer greater flexibility in reporting and handling precondition failures, is to test the precondition explicitly and write a warning or error message to standard error:

if (q->front == (struct queue_structure *) NULL) { (void) fputs("dequeue failed because the queue was empty\n", stderr); return (void *) NULL; }

If you want program execution to stop when a precondition is not met, invoke the library procedure exit, giving it a nonzero argument:

if (q->front == (struct queue_structure *) NULL) { (void) fputs("dequeue failed because the queue was empty\n", stderr); exit(EXIT_FAILURE); }

EXIT_FAILURE is defined in stdlib.h to be a standard nonzero value. You can define your own status codes, of type int, if you want to give a more precise diagnosis of the cause of the error. On GNU/Linux and Unix systems, the shell running in a terminal window can report the status code of the last program that it ran. On MathLAN, for instance, typing the command

echo $?

reports that status code. (It's typically 0, which is the conventional status code for a program that terminated normally.)

Off-by-one traversal of a linked list

If you're trying to perform some operation on each of the elements of a singly linked list, it won't work to write

struct list_node *traverser = ls; while (traverser->next != (struct list_node *) NULL) { /* Operate on *traverser here. */ traverser = traverser->next; }

This is wrong for two reasons: (0) It skips over the last element of a non-empty list, which never gets operated on, and (1) it crashes if the list is empty. (There is no traverser->next when traverser is a null pointer.)

The code you want is

struct list_node *traverser = ls; while (traverser != (struct list_node *) NULL) { /* Operate on *traverser here. */ traverser = traverser->next; }

Freeing the queue

The code at the end of the preceding section, for traversing a singly linked list, doesn't quite work if the operation that you're performing on each element is to free the storage that it is occupying! Since this operation invalidates the value of traverser itself, you can't turn around and use that value to advance to the next node in the list. Instead, write

struct list_node *leader = ls; struct list_node *trailer; while (leader != (struct list_node *) NULL) { trailer = leader; leader = leader->next; free(trailer); }

It is important to traverse the linked-list structure and free each component of the linked list, since the queue module itself allocated those components (in calls to enqueue) and knows how to get their addresses. The caller has no other way to free the components except by invoking dequeue repeatedly until the queue is empty, and not all users of the queue module will even consider that approach.

Note, that it is not necessary and not desirable to include the step

free(trailer->datum);

to deallocate the storage to which the void pointer in the datum field points. The queue functions are not responsible for that storage. They didn't allocate it, and they have no way of knowing where it came from or even whether it was dynamically allocated to begin with. Moreover, the caller may still have another copy of the pointer stored in the datum and may have some use for the value at the other end. Don't take it away just because the queue no longer needs access to it!

The free_queue function should also deallocate the storage for the "header" structure:

free(*address_of_queue);

However, it is an error to write

free(address_of_queue);

in the free_queue function, since address_of_queue itself is not dynamically allocated (it does not acquire its value from malloc, either directly or indirectly).

In an array implementation, free_queue should deallocate the space for the array and for the header:

queue q = *address_of_queue; /* ... */ free(q->elements); free(q);

but, again, not for the stored data (free(q->elements[index]); would be incorrect) and not for the storage location containing the address of the queue (free(address_of_queue); would still be incorrect).

Nulling out address_of_queue

The specification for the free_queue function says that, after the queue has been freed, the pointer to that queue should be overwritten with a null pointer. (This practice helps to prevent the circulation of “dangling” pointers -- pointers to storage locations that have been deallocated, so that the pointers are no longer valid -- and also makes it easier to detect and correct the common error of attempting to free the same queue more than once.)

The code that performs this step is simply

*address_of_queue = (queue) NULL;

and it would be conventional to place this step at the end of the free_queue procedure, after the traversal and deallocation of the queue components.

Incidentally, it's probably a good idea to program defensively against users of the library who try to free the same queue twice by enclosing the body of free_queue in an if-statement:

if (*address_of_queue != (queue) NULL) { /* ... Free the queue ... */ }

or, more aggressively, by writing

assert(*address_of_queue != (queue) NULL);

at the beginning.

I saw a few variations like this:

address_of_queue = NULL;

(without the initial *). This is pointless. It changes the value of the parameter within the free_queue function, but has no effect on the caller's variables.

The argument to malloc

In order to allocate enough storage for a value, the malloc function needs to know how big that value will be -- that is, how many bytes such a value occupies. You can use the sizeof operator to compute the size of a value of a specified type. For instance,

sizeof(struct queue_structure)

gives you the number of bytes that one value of type struct queue_structure occupies. Do not attempt to compute this explicitly by adding up the sizes of the fields of the structure -- the C compiler is allowed to leave holes in the middle of a structure, to rearrange its fields, or to pad it with additional bookkeeping information, all of which can affect the value that sizeof will return.

The expression

sizeof(queue)

gives you the number of bytes needed to store one value of type queue -- that is, one value of type struct queue_structure *. This is unlikely to be what you want. An allocation statement like

queue new_queue = malloc(sizeof(queue));

where the pointer that malloc returns is of the type that is provided as the argument to sizeof, is almost certainly wrong, since normally the type of a pointer differs from the type of the object to which it points, and there is no reason to expect the sizes of values of these types to be equal.

Refactoring

If you find yourself typing in the same code sequence, or nearly the same code sequence, over and over again, stop it! You are creating code that is tedious to write, tedious to read, difficult to maintain, and extremely error-prone.

Look around for some way to rearrange or refactor the code so that the part you're repeating occurs only once. For instance, if the consequent and the alternative of an if-statement are both long and are almost alike, reduce the scope of the if-statement so that it encloses only the part that differs. If the same loop occurs in several different places, with only the starting value of the loop control variable and the number of iterations being different, put the loop in a function and pass in the two varying items as arguments. And so on -- there are lots of refactoring techniques.

If you find yourself using copy and paste to write large swatches of code quickly, stop it! Really, just stop it! -- you're abusing the editing tools. The resulting code is slightly less tedious to write, but it's still just as tedious to read, difficult to maintain, and error-prone as if you had retyped it. Stop and think instead.

Checking the result of malloc

The specifications for the library functions malloc, calloc, and realloc acknowledge that it may be impossible for these functions to allocate dynamic storage on demand. If any of these functions fails in its attempt to allocate dynamic storage, it returns (void *) NULL.

On modern computers, the amount of storage available is very large. On the other hand, many of the programs that we tend to write demand extraordinary amounts of storage. On balance, it's probably a good idea to get into the habit of checking the value returned by a call to a storage-allocation function before trying to dereference the pointer:

queue q = (queue) malloc(sizeof(struct queue_structure)); assert(q != (queue) NULL);

Casting pointers

I have the habit of writing explicit casts in many cases where C allows implicit ones, for instance in

queue q = (queue) malloc(sizeof(struct queue_structure));

and

assert(q != (queue) NULL);

and even

(void) fputs("Here be dragons!\n", stdout);

These make absolutely no difference in the execution of compiled code, since such casts are no-ops at the machine level anyway. They are basically reminders to me and my readers to bear in mind the data types to which the expressed values belong and to acknowledge the need for adapters to accommodate the differences between what libraries provide and what applications need. You can adopt this practice if you find it useful, but a “reading knowledge” is sufficient if you find that it gets in your way.

Standard libraries

If you use identifiers that are defined in one or more of the standard libraries, you must include the header files for those libraries in your code, preferably using the #include directive.

Some C compilers will fill in the missing #include directives for you automatically, at least in some cases. Don't rely on this.

If you find it difficult to remember which of the libraries is the one that contains, say, malloc, open a terminal window and type

man 3 malloc

The synopsis at the top of the manual page tells you which header file to include.

The main function

When you're writing a library that will be compiled separately and linked to a variety of application programs, don't include a function called main. C reserves that function name as the entry point for a stand-alone executable.

To write the queue library to which queue.h is the interface and a test program that tests the functions in the library, you should create two separate files, queue.c and test-queue.c. Both would contain the line

#include "queue.h"

to ensure that they agree on the interface. The command to compile the library should be something like

gcc -Wall -c queue.c

and the (subsequent) command to compile the test program, linking the compiled library to it, should be something like

gcc -Wall -o test-queue test-queues.c queue.o

This yields a standalone executable program, test-queue, which you can run just by typing its name:

test-queue

Modifying queue.h

Two submitters were foolish enough to modify queue.h, even though the assignment unconditionally says not to do that. In particular, it is important to separate the implementation of the opaque data type struct queue_structure and of the six specified functions from their public interface, which is all that a .h file should contain.

If you do rewrite some code file that I supply, amend the authorship comment at the top of the file to reflect your contributions.

Concision

Many C programmers strive for concision and efficient execution, sometimes even at the expense of clarity. For instance, a C programmer is not likely to write

q->rear++; int rear_index = q->rear; q->elements[rear_index] = value;

when she can instead write

q->elements[++(q->rear)] = value;

with the same effect.

Implicit return type in functions

C allows programmers to omit the return type from the function prototype and header, but this is probably bad style. Many readers will assume that the unspecified return type is void, but it's actually int.

Failing to return a value in a non-void function

Many of you did some precondition testing and executed an early function return when the precondition was not met, often after printing an error message. Unless the function is declared with the result type void, however, you should not do this by executing a return-statement with no expression in it! The danger in this practice is that the caller, expecting to receive a useful value, will get whatever garbage bits happen to be in the position on the run-time stack where the return value should have gone, and is likely to try to treat those garbage bits as if they constituted a useful value.

If you're going to return after detecting an error (instead of invoking the exit function or failing an assertion), you must return some value or other -- usually a null pointer, if the result type is some pointer type, or 0 or -1 if it is an integer type, etc.

Range checking in arrays

Nothing in the definition of the C language stops you from attempting to index into an array with a negative index expression, or with one that has a value greater than or equal to the declared size of the array. Nevertheless, such an attempt usually ends unhappily.

Format specifications in printf

To bring up the manual page on MathLAN that describes format specifications for the library function printf, type

man 3 printf

at the prompt in a terminal window. Use the format specification appropriate to the type of value that you want to print.

In particular, if you want to print out a pointer, cast it to void * and use the %p format specification. On MathLAN workstations, you'll get a base-sixteen numeral specifying the address of the storage location to which the pointer points.

Assignment to nowhere

It is an error to write

int *sample; *sample = 42;

The first line declares sample to be a variable, of type “pointer to integer,” but it does not initialize that variable, by storing a valid address into it; it leaves that variable uninitialized, so that the bits stored in the variable are whatever random garbage happens to be in the memory location associated with that variable.

The second line then tries to perform the dereferencing operation on the value of sample, so as to recover the address stored in that variable and to place the integer 42 at that address. Since the “address” consists of garbage bits, this is unlikely to succeed.

There are two ways to get a valid address to store into sample. One can call the dynamic storage allocator to get one:

int *sample = (int *) malloc(sizeof(int)); *sample = 42;

or one can take the address of a variable or parameter of the relevant type, provided that it is in scope:

int target; int *sample = ⌖ *sample = 42;

In the latter case, the value of target becomes 42, as a side effect of the assignment through sample.

Conventional prose mechanics

Many programmers use a “telegraphic” prose style when writing comments. In this style, the subject of a sentence is omitted when it would be obvious or conventional (e.g., “This function”), as are articles and full stops at the ends of declarative and imperative sentences. For instance, Rob Pike's comments in the regular-expression matcher discussed in chapter 1 of Beautiful code are written in telegraphic style.

In my opinion, the use of this style is an obsolete convention, dating from a time when there was a measurable cost for every byte consumed by a program's source code, so that there was an economic incentive to keep comments very short. Those days are long gone. My advice today is to follow the same conventions when writing comments as when writing expository essays.