AMD Logo AMD Developer Central

The File I/O Subsystem and Its Affect on Software Performance 

Skip Navigation LinksHome > Docs & Articles > Articles & Whitepapers
Greg Fry  10/21/2008 
» Filesystems and Storage Devices
» Filesystem Metadata and Operating Overhead
» Filesystem Fragmentation
» Storage Device Interfaces and Command Overhead
» Filesystem Caching
» Filesystem Read-ahead
» Optimal Filesystem Use Techniques
» Conclusion
» Resources

Accessing data stored in a filesystem (typically on a disk drive) takes much longer than accessing the same data already in memory – by an orderof magnitude or more. However, software that uses file I/Ocan be designed using certain rules, overcoming potential performance bottlenecks. In order to understand how and when to apply these guidelines, a clear understanding of the file I/O subsystem isneeded. After exploring the workings of the file I/O subsystem, this article will go over the rules of how to structure software that uses file I/O to help achieve improved performance. These rules will be applied in an example application that splits up the work amongst three threads, thus using the AMD multi-core processor capabilities to realize optimal performance.

Our three threaded C example

The C code embedded in this article is offered as an example that makes use of the filesystems knowledge provided by this article and applies several of the rules for good filesystem usage covered in this article. The sample code (the full example is available for download) was created for use on a 32-bit Windows® platform, and its purpose is to encrypt a file and store the encrypted contents in a new file. The example is simplified to focus on the design of the file I/O and threading. There is no error checking to avoid cluttering the example, and the encryption used is very simplistic and shouldn’t actually be used to secure data.It is purely to show how something can be done to the data in between the reading and writing without being directly in the execution flow of those threads.


Filesystems and Storage Devices

Filesystems exist to allow computer users to more easily store and retrieve information in an organized manner. And in typical situations, a filesystem is also the way of storing and retrieving this organized data on a computer's permanent storage device, typically the hard disk (a.k.a the disk drive or hard drive).Permanent storage devices can also be other things such as a CD or DVD disc, or a USB “flash” drive, or a solid state (memory) disk. In all of these cases, the storage devices are structured so that they are accessed as fixed size blocks of data.

Although other sizes are possible, 512 byte blocks are by far the most common size on storage devices, with CD and DVD devices being the exception, having typically 2048 byte blocks. All of the disk drive storage devices can be thought of as big arrays of fixed size blocks, where each block holds 512 bytes. The storage device does not “understand” what the bytes mean that are stored in the blocks. It simply takes 512 bytes that the filesystem gives it and puts it in the block the filesystem specifies. Thus, it is the filesystem’s job to keep track of which blocks of the storage device it has used, and for what purpose.

In our three threaded example for this article we want to ensure that our buffer size is a multiple of the block size. For the purposes of minimizing the number of file I/O operations required to get the data and to maximize the amount of data transferred with each I/O, we will use 128 blocks for each I/O request, which ends up using 64KBs:

#define IO_BUF_SIZE (128*512)

Next, we need to decide how many buffers to use. This can vary depending upon the application. For our example 256 buffers times 64KB per buffer gives us a maximum of 16MBs of data buffered at any one time and seems reasonable:

#define MAX_BUFS(256)

BufType DataBufs[MAX_BUFS] = { { 0, NULL } };

Filesystem Metadata and Operating Overhead

Aside from the blocks holding the data that make up the contents of the files, the filesystem needs to store other management information about each file. In many filesystems, the filesystem will set aside one block for each file that exists (a “file info block”) and store in it information about which other blocks hold the contents of the file. The file info block will typically also hold other info such as a file modification timestamp, usage (read-only / hidden / etc.), and security (i.e. which users can access the file).

Beyond file info blocks, a filesystem will dedicate some blocks to hold the directories of the files. These directory blocks are used by the filesystem to store the names of files and sub-directories as well as the locations of the files’ info blocks.All of these blocks can be lumped under the term filesystem “metadata”. Managing the metadata blocks is what adds overhead to the filesystem.

Each time a program requests that the filesystem do something (an operation), the filesystem generally has to access some of the metadata to perform the requested operation. Because the software that runs a filesystem typically exists as a kernel mode driver, most filesystem operations cause a transition from user-mode execution to kernel-mode execution, which has a fairly significant cost to it. Often a filesystem driver will be processing more than one operation at a time (from different processes) and thus it has to guard against one operation changing the metadata while another is using the metadata. To do this, it must acquire and release operating locks, all of which add time to completing the requested operation. For these reasons, filesystem operations tend to be fairly expensive with regards to execution time.

Filesystem Fragmentation

As files and directories are created and deleted, and as data in files is added and removed, the blocks that make up a file and its contents can end up getting scattered all over the hard disk instead of being neatly laid out in a contiguous area of the hard disk. This is referred to as “fragmentation”. Because most hard disks are mechanical devices, there can be a significant amount of time involved when a hard disk has to move internal mechanical parts to access non-adjacent blocks (commonly referred to as “seeking”).A good filesystem will try to optimize which blocks are used for the various purposes mentioned in the previous paragraphs in order to limit the amount of mechanical motion required to access a file and its data. Although filesystems hide the seeking and low level block manipulation from most types of software, it is important to keep in mind that it is occurring behind the scenes.

Let's take a look at our example again. To minimize seeking we will open the input file one way (no buffering) and the output file normally (i.e. with caching).This means that data that is written will end up being stored in system memory until there is a lot of it and the filesystem will wait until it has a reasonably large amount of it before actually writing to the device. This saves the storage device from doing a read in one place, then immediately having to seek to another place to do a write. By only using one output file at a time, we have a very good chance of the filesystem placing all of its data optimally, thus avoiding fragmentation:

hReadFile = CreateFile(inputFilename, GENERIC_READ, 0, NULL, OPEN_EXISTING,

FILE_FLAG_NO_BUFFERING, NULL);

We will also allow the filesystem to buffer the writes, thus not causing contention with the reads. When enough data has been written to make it worthwhile, the filesystem will interrupt the reads to write all of the cached/buffered data:

hWriteFile = CreateFile(outputFilename, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,

FILE_ATTRIBUTE_NORMAL, NULL);

Storage Device Interfaces and Command Overhead

Aside from the physical motion mentioned above, all storage devices have to pass data across a physical interface (a.k.a. a “bus”) of one kind or another (IDE, SATA, USB, etc.).These interface busses typically run at speeds much lower than system memory (RAM) speeds. While system memory may run at 5 to 10GB/s, typical storage devices work on interface busses with theoretical speeds of 40MB/s to 200MB/s.Aside from the speed of the interface bus, setting up the controllers that run these busses as well as indicating to the storage device what is wanted (the device “command”), all adds time to getting (reading) or putting (writing) the actual block(s).This is known as “device command overhead”. Because of this overhead, the actual throughput rates are lower than the theoretical rates. One additional part of command overhead is that the buffers holding the data must have their virtual addresses locked into physical memory while the data transfer is going on. Locking buffers also adds overhead to setting up and executing a data transfer to/from a storage device.

For our example we will allocate ahead of time all of the buffers we will need and lock them immediately into memory to avoid having the I/O subsystem do this each time a buffer is used for a read or write:

for (idx = 0; idx < MAX_BUFS; idx++ )

{

We will also use the VirtualAlloc function to get buffers that are aligned on page boundaries as well as can be locked into memory:

DataBufs[idx].buf = (unsigned char *)VirtualAlloc(NULL, IO_BUF_SIZE,

MEM_COMMIT, PAGE_READWRITE);

By locking the buffers in physical memory once, at the start, we avoid the filesystem having to lock them repeatedly:

VirtualLock(DataBufs[idx].buf, IO_BUF_SIZE);

}

Filesystem Caching

Filesystems also employ caching or buffering as a means of overcoming the performance issues caused by accessing a storage device. The blocks making up the directory information as well as frequently used file info blocks (metadata) will often get cached in system memory (RAM) by the filesystem. In this way it doesn’t have to constantly re-read those blocks from the storage device when common filesystem operations occur. This caching of metadata results in much better overall filesystem performance.

Filesystems will also often cache the contents of files as they are read, just in case a software application needs to reuse the contents. In order to do this the file data (the file’s contents) ends up being copied twice – once from the storage device into the system memory set aside by the filesystem for caching (not to be confused with processor cache (L1 / L2 / L3 cache)) and once into the buffer that the requesting software application is using. Although under the right circumstances, this all saves time, “double buffering” of data does require some up-front cost of time and extra processor cycles.

Keep in mind that in our example we want to avoid filesystem caching of the input file’s data since it will only be used once then discarded. To this end we'll use the FILE_FLAG_NO_BUFFERING on the open call:

hReadFile = CreateFile(inputFilename, GENERIC_READ, 0, NULL, OPEN_EXISTING,

FILE_FLAG_NO_BUFFERING, NULL);

Caching comes with another price tag as well – it can use a lot of the system’s RAM. Unless filesystems were to do some caching, most of the commonly used operating systems would perform miserably slow. This is because of how much longer it takes to read data from the storage device as opposed to accessing it in main memory. A typical end-user computer system will have a 200GB to 500GB hard disk, but only 2GB or maybe 4GB of RAM. So there is no way that the filesystem can cache all of the data / metadata on the hard drive. It must constantly switch which blocks it is caching in order to keep the right ones for the current set of operations being requested.

If the filesystem ends up using too much RAM for caching, or if the demand for the system’s RAM increases (i.e. the user starts running another application program), the OS will let the filesystem know that it needs some RAM back for other uses. The filesystem dutifully obeys and gives back sections of RAM for the OS to use for these other purposes. This means that the data or metadata that was cached in that RAM will have to be re-read from the storage device next time it is needed.

Filesystem Read-ahead

Most filesystems also use a technique called “read-ahead” to boost performance when it is recognized that a program is reading a file sequentially. Read-ahead is basically the filesystem guessing that the program will also want the next section of the file’s contents and therefore starts a read of that data while the program is still processing the data from its last read. This provides for some automatic overlapped I/O processing on the program’s behalf and tends to boost filesystem performance. Once again this performance enhancement method comes with a cost of double buffering, since the filesystem doesn’t know at the time of the read-ahead, where the program will request the data to be placed in memory.One other issue with read-ahead is that there may be times when the filesystem guesses wrong and the pre-read data doesn’t end up being requested by the program, thus wasting the time and effort used to read that data. In typical use cases there are very few read-aheads that are wasted compared to the many read-aheads that do get used. But situations can occur where the filesystem’s read-ahead technique repeatedly guesses wrong and slows things down.

In our example we will essentially do our own read-ahead by using our own buffers and having this thread do nothing but read data:

DWORD WINAPI ReadingThread(LPVOID param1)

{

bool eofHit = false;

size_treadBufHead = 0;

while (!eofHit)

{

WaitForSingleObject(hReadWakeUpEvent, INFINITE);

We will keep reading the data from the file in this thread for as long as possible. As each buffer is filled we will signal the processing thread (the encryptor) to begin working on that buffer, rather than wait for it. We want to continue filling the buffers for as long as we can and hopefully the other threads will have finished using the buffer by the time we get back to needing to fill it again. This provides our own version of read-ahead, without requiring the filesystem to double buffer:

while ( DataBufs[readBufHead].readyFlag == ReadyForRead )

{

DWORD bytesRead = 0;

size_t lclidx = readBufHead++;

if (readBufHead >= MAX_BUFS)

readBufHead = 0;

ReadFile(hReadFile, DataBufs[lclidx].buf,

IO_BUF_SIZE, &bytesRead, NULL);

if (bytesRead != IO_BUF_SIZE)

{

finalBufLen = bytesRead;

eofHit = true;

DataBufs[lclidx].readyFlag = ReadyForEncryptFinal;

break;

}

else

{

DataBufs[lclidx].readyFlag = ReadyForEncrypt;

}

SetEvent(hEncryptWakeUpEvent);

}

}

SetEvent(hEncryptWakeUpEvent);

return 0;

}

Optimal Filesystem Use Techniques

There are a number of rules that can be derived from the filesystem concepts described so far in this article. Applying these rules should allow software to achieve good performance when using a filesystem to store data. Although it would be nice to have examples and sample code for each of these rules, with performance data showing how much of a difference they each make, there simply is not the time or space in a single article to cover all of that. It is also understood that a given piece of software’s purpose as well as its other design constraints may preclude following all of these rules. So, here we simply state the rules and then give any examples where appropriate rules (not necessarily all of them) have been applied to our example software’s initial design:

  1. Reads and Writes should take place in buffers that are multiples of 512 bytes in size, preferably in buffers that are 64KB (or multiples thereof) being optimal for most systems. This helps minimize operation and command overhead up and down the chain of filesystem and storage device drivers.
  2. The read/write buffers should be aligned on 4KB memory boundaries and locked in memory if they are going to be used more than once – this is designed to save the filesystem, the OS, and storage device drivers a great deal of time in locking the buffers over and over again. This really only makes a difference if your reads/writes are by-passing the filesystem cache.
  3. Avoid using the C run-time file I/O functions as these may do additional buffering of data.
  4. When using the C run-time file functions (open, fopen) use binary mode for opening files “_O_BINARY” or “b”, even when the files are ASCII/text files. You may have to do your own newline handling to make this work as you are used to.
  5. If you are reading a file from beginning to end that you know your software will not need to re-read:
    1. Set the appropriate flags during the file open so the filesystem will skip caching the data and skip read-ahead to avoid double buffering (in the example below, this is the “FILE_FLAG_NO_BUFFERING” flag).
    2. Use a dedicated thread to do the reads which allows for optimized reads (the disk drive will be doing its own read-ahead into its own cache thus providing good performance still).
    3. Use multiple aligned read buffers so that the next read can take place without having to wait for the data from the previous read to be processed.
  1. If writing large files that you don’t plan on re-opening immediately:

a. Set appropriate hint flags on the file open so that the data won’t be cached, thus avoiding the overhead of double buffering

b. If you are also reading a file simultaneously, allow the writes to be cached, thus avoiding seeking between each read and write. The double buffering overhead is much smaller than all the extra seek time.

c. Use a separate writing thread that uses multiple buffers so that each buffer can be getting filled independent of the whether the last write has completed.

For our example, we write the data as soon as it becomes available by using this dedicated thread:

DWORD WINAPI WritingThread(LPVOID param1)

{

bool eofHit = false;

DWORD bytesToWrite = IO_BUF_SIZE;

size_twriteBufHead = 0;

while (!eofHit)

{

WaitForSingleObject(hWriteWakeUpEvent, INFINITE);

while ( DataBufs[writeBufHead].readyFlag == ReadyForWrite ||

DataBufs[writeBufHead].readyFlag == ReadyForWriteFinal)

{

DWORD bytesWritten = 0;

size_t lclidx = writeBufHead++;

if (writeBufHead >= MAX_BUFS)

writeBufHead = 0;

if (DataBufs[lclidx].readyFlag == ReadyForWriteFinal)

{

bytesToWrite = (DWORD)finalBufLen;

eofHit = true;

if (bytesToWrite == 0)

break;

}

WriteFile(hWriteFile, DataBufs[lclidx].buf, bytesToWrite,

&bytesWritten, NULL);

DataBufs[lclidx].readyFlag = ReadyForRead;

SetEvent(hReadWakeUpEvent);

}

}

return 0;

}

  1. If the software will re-read or seek around in the file, try reading the entire file sequentially the first time it is used by the software, thus increasing the chance of the filesystem providing good caching and optimal read performance.
  2. If your access patterns for reading the file do require seeking around in the file, and the file is small enough (< 100MB), try using memory mapped files. The filesystem and OS will work together to try and read them all into memory and keep them that way while your program accesses them. In this case don’t worry about the filesystem doing caching and double buffering – you want that to occur in order to get maximum performance.
  3. Do as much as possible with your data in memory, before handing it over to the filesystem, thus avoiding seeking around in the file after it has been written.

In our example, we do all of our data manipulation here, in this dedicated thread, which avoids stalling I/O reads while we process the data:

unsigned char encryptKey[IO_BUF_SIZE] = { 0 };

void EncryptData(unsigned char *dataBuf, unsigned char *keyBuf, size_t bufLen)

{

size_t idx = 0;

for (idx = 0; idx < bufLen; idx++)

{

unsigned char tmpch = *dataBuf;

*dataBuf++ = tmpch ^ *keyBuf++;

}

}

DWORD WINAPI EncryptingThread(LPVOID param1)

{

bool eofHit = false;

size_t lclReadyFlag = ReadyForWrite;

size_t bytesToEncrypt = IO_BUF_SIZE;

size_t encryptBufHead = 0;

while (!eofHit)

{

WaitForSingleObject(hEncryptWakeUpEvent, INFINITE);

while ( DataBufs[encryptBufHead].readyFlag == ReadyForEncrypt ||

DataBufs[encryptBufHead].readyFlag == ReadyForEncryptFinal)

{

size_t lclidx = encryptBufHead++;

if (encryptBufHead >= MAX_BUFS)

encryptBufHead = 0;

if (DataBufs[lclidx].readyFlag == ReadyForEncryptFinal)

{

bytesToEncrypt = finalBufLen;

lclReadyFlag = ReadyForWriteFinal;

eofHit = true;

}

EncryptData(DataBufs[lclidx].buf, encryptKey, bytesToEncrypt);

DataBufs[lclidx].readyFlag = lclReadyFlag;

SetEvent(hWriteWakeUpEvent);

}

}

return 0;

}

  1. Avoid using lots of temporary files to hold data since this incurs a lot of overhead and adds to filesystem fragmentation. Just use system memory to hold temp data if possible (there are obvious limits to this of course – there is only so much system memory).
  2. Reading multiple files simultaneously will almost always cause significantly more seeks by the storage device. Although it is not always possible, try to structure the code so one file is read, then another, etc.
  3. Avoid filesystem fragmentation by writing each file one at a time instead of writing two or more files at the same time. This allows the filesystem the best chance of placing the file contents contiguously on the storage device.
  4. Run a defragmentor occasionally – or suggest that the end-user does.
Conclusion

So take some time when doing software design to consider how the software will use the filesystem. Planning appropriately can make the difference between having a slow program that people have to constantly wait for versus having a fast program that people enjoy using.

Resources

Using Threaded Queues to Get Around Slow I/O - This article discusses how to use a circular queue in C++ to address an I/O bottleneck.

About the Author

Greg Fry of Portland, Oregon, A 20+ year veteran of professional software development, including a decade in commercial software development (large volume, end-user/shrink-wrapped software). B.S. in Computer Science, Instrument rated private pilot, and outdoor enthusiast.

Back to top
© 2009 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.