» Overview» Initial Considerations» Implementation Details» Results of Implementation» Final Observations
Input/output (I/O) is a performance killer, there's no denying it. Consider that a fast disk can read data at 100MB/sec. under ideal circumstances. More typical are rates of about one-third that level. Meanwhile, main memory can move data at more than 10GB/sec. This gives the processor at least a 100-to-1 advantage over standard disk I/O, meaning in effect the processor can handle in 1 minute what a disk would take two hours to read. The comparisons deteriorate quickly when slower forms of I/O are taken into account. For example, common wireless connections today run at 54Mbps or 5.4MB per second-a ratio of 1000:1 to the processor internal data transfer rate-which is actually closer to 2000:1, when actual throughput is considered. That's a big disparity, and it can have significant impact on application performance, unless you plan for it.
Parallel programming is ideally suited to address this problem of slow I/O, especially as it occurs with wireless, standard Ethernet, CD-ROMs and similarly slow devices. Threading enables your program to process data, while the I/O is ongoing. The basic approach is to have one thread handle I/O, while another thread, operating in parallel, processes the incoming data. The relationship between these two threads is known as producer-consumer: the producer generates the data that the consumer thread uses.
An important consideration of this particular solution is to figure out how the producer thread hands off data to the consumer thread. The most common approach is to have the producer thread read data in blocks and place those blocks in a queue from which one or more consumer threads can retrieve them.
Here, we'll discuss how such a threaded queue is implemented. I presume that you are familiar with the basics of threaded programming, including familiarity with threads in general, locks, and signaling mechanisms. The code implementation is in C/C++ and uses the Windows threading API. However, it would not be difficult to change the code to use POSIX threads on Linux. All the operations are basic threading tasks.
We'll use a circular queue, which is a common data structure that is designed so that items are placed in the queue at one end (the head) and removed from the other end (the tail). Each slot in our queue contains: the data block from the producer thread, an integer specifying the amount of data in the block (useful since the last block will probably be partially filled), and a sequence number. The sequence number is particularly important if more than one producer or consumer thread is used.
In such a case, it's not a given that the threads blocks will be processed in the same order in which they were read. So, if the sequencing matters, these sequence numbers are required. In our implementation, however, we will use only one producer and only one consumer thread for purposes of simplicity. In some situations, you might want to add more consumer threads. This will greatly add to the complexity of this code, so do so only if you are sure this is the best way to improve performance. Adding producer threads won't help much: I/O is a single threaded operation most of the time.
Our implementation consists of a producer thread that reads blocks of 100 bytes and dumps them into a queue that contains 50 slots. (As we'll see shortly, this particular configuration is for demo purposes. In real life, you would use much larger blocks of data.) A separate consumer thread removes entries from the queue and optionally prints the data to the screen. It also logs to the screen the state of the queue at the time of the fetch. In our particular implementation, there are exactly two threads. The consumer thread is started up by the main program thread and that main thread then becomes the producer thread: reading from a file that is specified on the command line and writing the data to the queue. At EOF, the main thread waits for the consumer thread to finish. Then, the program exits.
For the queue to work correctly, several threading constructs are used. The first is locking, which keeps two threads from competing for the same resource. The head and tail are not shared: only the producer accesses or changes the head, and only the consumer changes the tail. However, both threads must update the queue's counter: the producer when adding a block, the consumer when removing it.
There are several ways to implement this locking, the most common of which is a mutex or critical section around the code that updates the counter. This would permit only one thread to update the counter at any given time. In our code, we use an alternative device, available in Microsoft Windows' threading API, called InterlockedExchangeAdd(&queue_counter, longIntegerToAdd). This function adds the value indicated in the second argument to a long integer located at the address specified in the first argument.
Windows assures that this is done as an atomic operation, which means that only one thread can perform this update at a time. This call is much faster than a mutex, but less portable. In POSIX threads, you would just use a standard mutex to do this.
If this were all there were to it, we'd be done. But there is one problem: what happens when the producer thread wants to add a block of data when the queue is full? A facile solution would be for it to enter a loop that repeatedly reads the queue counter until it dips below the maximum. At that point, it can add the new block and resume reading. This approach, known as a spin-lock, is highly discouraged. It has several key drawbacks: the first is that it chews up CPU cycles doing nothing but waiting. It also fools the operating system, which unknowingly thinks that serious work is being done. Because of this, the OS will not want to swap out the thread, and so other threads are blocked from processor access, even though this thread is really just waiting.
A better solution is to have the producer thread wait for a notification from the queue that a slot is available. In parallel programming, this is called waiting for an event. Another is a means provided by the native threading interface. In POSIX, this mechanism is called a condition variable; in Windows it's called simply an event.
In general terms, when the event occurs, a signal is broadcast and all threads waiting for that event then respond. In our code, we broadcast that a slot is available anytime the consumer thread removes a block and detects that prior to removal the queue was full. In our code, this is done with: SetEvent(hSpaceOpen). SetEvent is the Windows function that announces the event has occurred. The lone parameter is the event itself, which is created earlier in the program by another Windows call, CreateEvent().
A similar situation to the full queue occurs when the queue is empty and the consumer thread wants to retrieve a block. Again, it could use a spin-lock, but this would be wasteful. So, an event is again employed. This time, the event is triggered by the producer thread, when it adds a block to a previously empty queue.
The accompanying program (included are the source code and an executable version) expects a command line argument, which is an input file to process. It then prints out the status of the queue each time the consumer thread goes to fetch a block.
By removing the highlighted comment, the code can print the contents of the file to the screen, so you can verify that the queuing mechanism actually works. The attached executable has this turned off, so that we can focus on the queue output. Below are the partial results of running this code on a file located on a remote system that was accessed via a fairly slow wireless link. (pq->count refers to the counter for the queue.)
If you are running this under Windows, be sure to execute it from a command line, so that you can see the output. The application is run by typing queue1 <filename>, where filename is any file on your hard disk which can be read,
getQueue: pq->count: 0
getQueue: pq->count: 0
getQueue: pq->count: 0
getQueue: pq->count: 0
getQueue: pq->count: 35
getQueue: pq->count: 34
getQueue: pq->count: 33
getQueue: pq->count: 32
getQueue: pq->count: 50
getQueue: pq->count: 50
...
getQueue: pq->count: 50
getQueue: pq->count: 49
getQueue: pq->count: 50
getQueue: pq->count: 49
getQueue: pq->count: 48
getQueue: pq->count: 47
getQueue: pq->count: 46
getQueue: pq->count: 45
getQueue: pq->count: 44
getQueue: pq->count: 43
getQueue: pq->count: 42
getQueue: pq->count: 41
getQueue: pq->count: 40
getQueue: pq->count: 39
getQueue: pq->count: 38
getQueue: pq->count: 37
getQueue: pq->count: 36
getQueue: pq->count: 35
getQueue: pq->count: 34
getQueue: pq->count: 33
getQueue: pq->count: 32
getQueue: pq->count: 31
getQueue: pq->count: 30
getQueue: pq->count: 29
getQueue: pq->count: 28
getQueue: pq->count: 27
getQueue: pq->count: 26
getQueue: pq->count: 50
getQueue: pq->count: 50
getQueue: pq->count: 50
getQueue: pq->count: 50
...steady decline from 50 to 10...
getQueue: pq->count: 10
getQueue: pq->count: 9
getQueue: pq->count: 8
getQueue: pq->count: 7
getQueue: pq->count: 6
getQueue: pq->count: 5
getQueue: pq->count: 4
getQueue: pq->count: 3
getQueue: pq->count: 2
getQueue: pq->count: 1
Read a total of 21504 bytes
As you can see, there was a start-up delay and then a sudden jump to 35 entries. Recall that files are typically stored in multi-KB blocks, so the first block in-fills many slots before the consumer thread can receive the signal and start removing them. After that, block counts move up and down, until they start a steady decline from 50 to 0, which corresponds to the period of the queue being emptied after the last I/O has completed. What is important to notice is what the fluctuating queue counts mean: work is being done while I/O is occurring. In other words, time is being saved.
The fact that the queue is frequently full might suggest that we need more consumer threads. This is not so in this case; recall the very small block size used here so that there would be greater variance in the numbers shown in this example. In real life, the blocks should be the size of the blocks on disk, typically 4KB or 8KB. When run configured that way, performance improves and there is less variation in results.
This approach is ideal for I/O on slow devices or for programs that must read very substantial amounts of data into memory. For small or medium sized files read from a local hard disk, these routines provide only a minor lift, so the added coding and complexity probably are not worth it.
Also, this approach will provide only a minor performance lift where the computation portion of a program is minimal. This seems counterintuitive. However, suppose you have a program with slow I/O and very fast computation. If the computation on one block can be completed before the next block has finished being read, then the system is in a state of constant waiting on the I/O process and performance approaches that of the serial, non-queued I/O implementation. (You save only the minimal time spent on computation.) However, let's say the processing is time-consuming such that half the program's time is spent in I/O and half in computation, now the time saved by being able to process blocks before the I/O completes is substantial.
So, whenever slow I/O and considerable computation describe your program, think of using a threaded queue – for nearly all situations, it's your best tool. But of course, there's always another way: In a future article in this series, we'll discuss another approach which you can add to your programmer's toolbox: I/O Completion Ports, which are a set of threads created solely for processing async I/O requests. But more on those later.
Anderson Bailey is a developer with a longstanding interest in the techniques for using code to exploit processor features. He can be reached at chip.coder@gmail.com.