Mac/PC Abstraction, Part 3:
Enumerating Files

Overview

Mac and Windows have different mechanisms for enumerating the files in a directory. This article discusses one approach to creating an abstraction class that presents a unified interface which allows files to be enumerated on either platform.

Grab this zip file: crossplat.zip. It contains test projects that can be built with both DevStudio 7 and Xcode 3. There's quite a bit of extra framework in there for something I've just started working on (which you can ignore), along with some platform-specific abstractions I've used on a number of other projects built on both Mac and PC.

The code I will be discussing here is defined in QzDirList.h, QzDirListWin.cpp, and QzDirListMac.cpp. This code gets a bit of exercise in QzUnitTest.cpp, in the TestDirList function, which scans the project directory and dumps a list of all files and folders out to the log file.

This code also relies on lower-level conventions implemented in QzSystem.cpp and UtfString.cpp, which include converting all strings to UTF-8 to avoid any dependency on a particular platform's favored string representation (UTF-16 on Windows, UTF-8 and UTF-32 on Macs, and plain old ASCII on some versions of Linux/Unix). If you need to write platform-independent code, you will have to find a string library that abstracts away these issues (especially if your app stores strings in files, and needs those files to be readable on multiple platforms).

The intent of this article is to document how to enumerate files, not to provide a single implementation that everyone can use as-is.

Physical Drives

On Windows, extra work needs to be done to allow enumerating physical drives (as well as mapped drives and network drives). For MacOS, this issue does not exist, since all devices are mounted as yet another directory under the root folder.

I enumerate drive names with the GetLogicalDriveStrings function. This fills a buffer with all of the drive names, each terminated by '\0', with one extra '\0' marking the end of the list.
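
As a rough sketch of how that buffer can be unpacked: the helper below walks a block of '\0'-terminated names that ends with an empty string. SplitMultiString and ListDriveNames are hypothetical names for illustration, not part of QzDirList, and the sketch calls the ANSI version of the API (GetLogicalDriveStringsA) for brevity, whereas code that works in UTF-8 would more likely call the wide version and convert. The Win32 call is wrapped in #ifdef _WIN32 so the parsing logic can be tried on any platform.

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

// Splits a buffer of NUL-terminated strings, ending with an empty
// string (an extra NUL), into individual strings.
std::vector<std::string> SplitMultiString(const char *pBuffer)
{
    std::vector<std::string> names;

    while ('\0' != *pBuffer) {
        std::string name(pBuffer);
        pBuffer += name.size() + 1;   // skip the name and its terminator
        names.push_back(name);
    }

    return names;
}

#ifdef _WIN32
// Returns the drive roots (such as "C:\") as formatted by Windows.
// An empty result means the call failed.
std::vector<std::string> ListDriveNames()
{
    // With a zero-length buffer, the call reports the length needed,
    // in characters, not counting the final terminator.
    DWORD needed = GetLogicalDriveStringsA(0, NULL);
    std::vector<char> buffer(needed + 1, '\0');

    DWORD length = GetLogicalDriveStringsA(needed, &buffer[0]);
    if ((0 == length) || (length > needed)) {
        return std::vector<std::string>();
    }

    return SplitMultiString(&buffer[0]);
}
#endif
```

Keeping the parsing separate from the system call makes it possible to test the string handling without needing a Windows machine.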

An alternative is to use GetLogicalDrives and GetDriveType, iterating over the bitmask returned by GetLogicalDrives. I personally prefer GetLogicalDriveStrings, since it allows Windows to format the name string, leaving the system configuration to be responsible for capitalization. This helps keep the capitalization of the drive names consistent with other apps on the system.

When traversing up the directory structure, FindFirstFile and FindNextFile stop when reaching the root directory of the current logical drive. A Linux-based implementation can stop here, but a Windows implementation would need to allow going one level higher and listing the names of all logical drives (both physical drives and mapped drives to remote machines).

The Windows implementation of QzDirList::ScanPath handles this by treating an empty path as a request to enumerate all drive labels instead of as being the root directory of a drive.

As a side note: some systems may list the A: drive as being valid, even when there is no physical device. This is a hardware/motherboard configuration issue. Attempting to access the A: drive on one of these systems will result in a timeout (during which the app stops responding) while the system attempts to read from the missing drive. From a programming point of view, it will appear as if there is no disk in the A: drive, when in fact there is no disk drive there at all. Do not assume there is a physical device just because the system claims that A: exists. At the same time, do not ignore this drive label, since the drive does exist on some systems.

Basic Enumeration

For Win32, the basic enumeration loop uses FindFirstFile and FindNextFile to iterate over all files in a directory:

    WIN32_FIND_DATA data;
    HANDLE hFind = FindFirstFile(m_Path, &data);

    if (INVALID_HANDLE_VALUE != hFind) {
        do {
            // ... do something with data.cFileName ...
        } while (FindNextFile(hFind, &data));

        FindClose(hFind);
    }

Under MacOS/Linux, opendir and readdir are used in much the same way:

    DIR *pDir = opendir(reinterpret_cast<const char*>(m_Path));

    if (NULL != pDir) {
        dirent *pInfo;
        while (NULL != (pInfo = readdir(pDir))) {
            // ... do something with pInfo->d_name ...
        }

        closedir(pDir);
    }

The one subtle difference in the looping logic is that FindFirstFile returns a valid file entry, so the loop needs to process this entry before calling FindNextFile for the first time. Since opendir does not return any data, readdir must be called to obtain the first file entry.
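
To make the opendir/readdir loop concrete, here is a minimal self-contained sketch that collects the entry names from a folder. ListFolder is a hypothetical helper for illustration, not a function from QzDirList, and it assumes a POSIX system. It skips the "." and ".." entries that readdir reports for every folder.

```cpp
#include <dirent.h>
#include <cstddef>
#include <string>
#include <vector>

// Collects the names of all entries in a folder, skipping "." and "..".
// Returns false (with an empty list) if the folder could not be opened.
bool ListFolder(const char *pPath, std::vector<std::string> &names)
{
    names.clear();

    DIR *pDir = opendir(pPath);
    if (NULL == pDir) {
        return false;
    }

    // opendir returns no entry of its own, so the first readdir call
    // fetches the first entry.
    dirent *pInfo;
    while (NULL != (pInfo = readdir(pDir))) {
        std::string name(pInfo->d_name);
        if ((name != ".") && (name != "..")) {
            names.push_back(name);
        }
    }

    closedir(pDir);
    return true;
}
```

The order in which readdir returns entries is not specified, so sort the list if the app needs a stable ordering.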

And it bears mentioning: FindFirstFile is one of the few Win32 functions that returns INVALID_HANDLE_VALUE on failure instead of NULL. Make certain you are testing against the correct symbol.

For MacOS/Linux, there is another function that can be used: readdir_r. This is intended as a thread-safe version of readdir, which can be used if you have multiple threads doing file transactions. Some programmers consider it to be harmful, since using it may result in race conditions or buffer overruns. However, this may only be true for certain implementations. I am not familiar enough with the subject to make any recommendations. If you think you may need to use readdir_r, do some research on it and make your own evaluation.

Name Issues

Do not use wildcards in the paths your app passes around. Windows needs to have *.* appended to the path when calling FindFirstFile. (You could use just * as the wildcard, but I have had problems with it in the past on older versions of Windows. The legacy *.* still works, and in my experience it is the safer choice for compatibility.) On the other hand, opendir needs the exact folder name for the search, without any wildcards. As such, keep the platform-specific wildcard hidden inside the implementation, with the rest of your app using only the folder name for the search.
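
One way to keep the wildcard hidden is a small helper that only the Windows implementation calls. This is a sketch under assumptions: MakeWin32SearchPattern is a hypothetical name, not something from QzDirListWin.cpp, and it assumes the folder name uses forward slashes and that FindFirstFile accepts them for ordinary paths.

```cpp
#include <string>

// Builds the pattern passed to FindFirstFile from a plain folder name.
// The rest of the app only ever sees the folder name itself.
std::string MakeWin32SearchPattern(const std::string &folder)
{
    std::string pattern(folder);

    // Add a separator unless the folder name already ends with one.
    if (!pattern.empty()) {
        char last = pattern[pattern.size() - 1];
        if (('/' != last) && ('\\' != last)) {
            pattern += '/';
        }
    }

    pattern += "*.*";
    return pattern;
}
```

The Mac/Linux implementation would pass the folder name to opendir unchanged.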

Win32 functions can use wildcards and other routines to scan only files that have a specific file extension (or in the case of higher level functions, arrays of file extensions). This is not directly supported by opendir, so filtering files according to their extension should be done by the app itself. By using the same filtering code on all platforms, you can avoid platform-specific quirks in filtering behavior.
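
A minimal sketch of app-side extension filtering follows. HasExtension is a hypothetical helper, not part of QzDirList. It expects a bare file name (no folder), compares ASCII characters without regard to case, and treats a name such as ".profile" as having no extension. Those are policy choices for the sketch, not requirements.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if the file name ends with ".ext", ignoring ASCII case.
// The extension is given without the leading dot.
bool HasExtension(const std::string &name, const std::string &ext)
{
    std::string::size_type dot = name.rfind('.');

    // No dot at all, or a leading dot (a hidden file like ".profile").
    if ((std::string::npos == dot) || (0 == dot)) {
        return false;
    }

    if ((name.size() - dot - 1) != ext.size()) {
        return false;
    }

    for (std::size_t i = 0; i < ext.size(); ++i) {
        int a = std::tolower(static_cast<unsigned char>(name[dot + 1 + i]));
        int b = std::tolower(static_cast<unsigned char>(ext[i]));
        if (a != b) {
            return false;
        }
    }

    return true;
}
```

Because the same routine runs on every platform, a file named photo.JPG is accepted or rejected identically on Mac and Windows.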

Always use forward slashes ('/') for directory names. Windows will correctly handle these in filenames (at least when using fopen and other standard routines), whereas MacOS does not handle backslashes ('\') in file names. Avoid having double slashes ('//') in file paths. Most operating systems will ignore these, but I have seen some systems reject path names containing double slashes. For complete generality, it is best to condition the file name to remove duplicate slashes.
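
Conditioning a path this way takes only a few lines. The sketch below uses a hypothetical NormalizeSlashes helper and makes two assumptions worth checking against your own needs: every backslash is treated as a separator (which is wrong for a Mac file name that really contains one), and a leading double slash is collapsed too (which would break Windows network paths of the //server/share form).

```cpp
#include <cstddef>
#include <string>

// Converts backslashes to forward slashes and collapses runs of
// slashes into a single slash.
std::string NormalizeSlashes(const std::string &path)
{
    std::string result;
    result.reserve(path.size());

    for (std::size_t i = 0; i < path.size(); ++i) {
        char c = ('\\' == path[i]) ? '/' : path[i];

        // Skip this slash if the previous output character was one.
        if (('/' == c) && !result.empty() &&
            ('/' == result[result.size() - 1])) {
            continue;
        }

        result += c;
    }

    return result;
}
```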

Parent Folders

A significant issue with Linux-based systems is the common use of soft links to folders in other parts of the directory structure. The problem arises when changing directory through a soft link: attempting to return to the parent directory will not return to the previous folder. Instead, it changes to the parent of the linked folder.

An example of this from the command line would look something like this:

    $ pwd
    /home/lee
    $ cd foo
    $ pwd
    /home/bob/some/other/path/foo
    $ cd ..
    $ pwd
    /home/bob/some/other/path

In this example, "/home/lee/foo" is a link to "/home/bob/some/other/path/foo". Issuing the commands "cd foo" and "cd .." does not return the user back to the original directory.

This same problem arises when traversing directory trees from within a program. The QzDirList class handles this by initially finding the absolute path of the starting directory, then using this string for all relative traversals. When traversing up to the parent directory with UpOneLevel, the code trims the current folder name off of the absolute path, which returns to the previous directory, even after traversing through a link to some other location within the directory tree.
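
The trimming step can be sketched as a plain string operation. TrimLastComponent below is a hypothetical stand-in for what UpOneLevel does to the stored path, not the actual QzDirList code, and it assumes an absolute path that uses forward slashes. Because it never asks the file system where ".." leads, a soft link cannot redirect it.

```cpp
#include <string>

// Removes the last folder name from an absolute, forward-slash path.
// The root folder "/" is returned unchanged.
std::string TrimLastComponent(const std::string &absPath)
{
    std::string path(absPath);

    // Ignore trailing slashes, but never strip the root slash itself.
    while ((path.size() > 1) && ('/' == path[path.size() - 1])) {
        path.erase(path.size() - 1);
    }

    std::string::size_type slash = path.rfind('/');
    if (std::string::npos == slash) {
        return path;            // no separator, nothing to trim
    }
    if (0 == slash) {
        return "/";             // parent is the root folder
    }

    return path.substr(0, slash);
}
```

With the link from the example above, trimming "/home/lee/foo" yields "/home/lee", which is the folder the user actually came from.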

On the other hand, if you really did need to traverse to the parent folder of a soft-linked folder, you would need a different implementation of UpOneLevel that would first update the absolute path, then traverse up from there. I have never needed that kind of functionality, so it is not supported by QzDirList.

This problem is not as significant an issue on Windows. The most likely source of aliasing on Windows is creating a drive mapping, either to a networked drive, or to a folder within a local drive. It is possible to create a mapped drive to a directory on a local drive, so two different paths can lead to the same physical directory. But path names are always relative to the drive name on Windows (even when the drive name is not explicitly stated in the file name), so Linux's traverse-to-parent problem does not exist.

File Attributes

For Windows, file attributes are returned as part of the WIN32_FIND_DATA structure (as well as GetFileAttributes). MacOS, however, does not return file attributes from calls to readdir. You will need to call stat to find out additional information about the file, which stores the attributes information in struct stat. (Aside: Yes, some addlepated programmer really did use the exact same name for both a function and a struct.)
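
Here is a minimal sketch of the stat call for one entry. IsDirectoryEntry is a hypothetical helper, not taken from QzDirListMac.cpp. The detail that is easy to miss is that readdir reports only the entry's name, so the folder path has to be joined to it before calling stat. Note also that stat follows soft links; lstat would report on the link itself.

```cpp
#include <sys/stat.h>
#include <string>

// Looks up one directory entry with stat. Returns false if the entry
// could not be examined; otherwise sets isDir and returns true.
bool IsDirectoryEntry(const std::string &folder, const char *pName,
                      bool &isDir)
{
    // Build "folder/name", avoiding a doubled slash.
    std::string fullPath(folder);
    if (!fullPath.empty() && ('/' != fullPath[fullPath.size() - 1])) {
        fullPath += '/';
    }
    fullPath += pName;

    struct stat info;
    if (0 != stat(fullPath.c_str(), &info)) {
        return false;
    }

    isDir = S_ISDIR(info.st_mode) ? true : false;
    return true;
}
```

The same struct stat also carries the file size (st_size) and modification time (st_mtime), so one call can fill in most of what WIN32_FIND_DATA provides on Windows.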

Windows allows files to be marked as "system". Mostly this is used to denote files that are part of the OS runtime, or are hidden files that the user should not touch (such as the pagefile or recycle bin). Essentially, this is just a second "hidden" flag, which can be used to hide system files from naïve end users. There is no direct equivalent on MacOS, so this flag is ignored in that implementation.

Detecting the read-only flag is useful when preparing to write files: if the file is read-only, the user can be prompted with a meaningful error message instead of a generic "write failed" error. On Windows, the flag also makes it easy to filter out read-only files when enumerating a directory.

However, MacOS/Linux does not have a single "read-only" flag for files. This information is stored in the st_mode field of struct stat, which includes the read/write/execute permissions for owner/group/others (in other words, the values set by chmod). I have left this test out of the MacOS code, since it is more of a policy question as to which permission bits should be used to determine the read-only-ness of a file.
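
To illustrate why this is a policy question, here are two equally defensible tests on the permission bits that stat reports in st_mode. Both function names are hypothetical, and neither is what QzDirList does (it skips the test entirely). A third option, not shown, is to call access with W_OK, which asks whether the current process can write the file rather than what the bits say.

```cpp
#include <sys/stat.h>
#include <sys/types.h>

// Policy A: a file is read-only if nobody has write permission.
bool IsReadOnlyForEveryone(mode_t mode)
{
    return 0 == (mode & (S_IWUSR | S_IWGRP | S_IWOTH));
}

// Policy B: a file is read-only if its owner lacks write permission,
// regardless of what group and others may do.
bool IsReadOnlyForOwner(mode_t mode)
{
    return 0 == (mode & S_IWUSR);
}
```

A file with mode 0464 shows the difference: policy B calls it read-only, while policy A does not, because the group can still write to it.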

Hidden files are denoted by a file attribute in Win32, whereas MacOS/Linux considers any file that starts with a '.' to be hidden. Since hidden files are easy to detect on both platforms, both implementations of QzDirList can filter out hidden files.

File Timestamps

Most file timestamps on Windows are reported using the FILETIME structure, which contains a 64-bit timestamp. However, FAT-16 and FAT-32 file systems (which include some flash devices) store the timestamps with lower precision. The problem here is that FAT timestamps are truncated to multiples of 2 seconds, with the fraction of the time being stored separately at 10 ms precision. The fractional time is not always reported, and is discarded when moving files from NTFS to FAT, so the timestamp can be of higher precision, but this is not reliable.

Another inconsistency is that all of the FAT file systems timestamp the files with the local time (as opposed to NTFS, which uses UTC time), so you need to know the timezone and DST setting of the system on which the file was created to recover the correct time.

On MacOS/Linux, file timestamps are stored using time_t values, which are only accurate to the nearest second.

For compatibility, you need to decide on the precision of timestamps to store with files. I round everything to the nearest second, since higher precision is seldom necessary. On Windows, my code uses time64_t, but stores the result as a full (unsigned) 32-bit value, so roll-over issues can be ignored until 2106. But a 32-bit Mac build only has the 31-bit (signed) time_t value, which will roll over in 2038.
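
As an illustration of rounding to whole seconds, the sketch below converts a 64-bit FILETIME value (a count of 100 ns ticks since January 1, 1601 UTC) to unsigned 32-bit seconds since January 1, 1970. FileTimeToSeconds is a hypothetical helper rather than the conversion QzDirList uses, and the clamping at both ends of the range is an assumption of the sketch. On Windows, the 64-bit input would be assembled from the two halves of the structure as (uint64_t(ft.dwHighDateTime) << 32) | ft.dwLowDateTime.

```cpp
#include <stdint.h>

// Number of 100 ns ticks between 1601-01-01 and 1970-01-01 (UTC).
static const uint64_t kTicksToUnixEpoch = 116444736000000000ULL;
static const uint64_t kTicksPerSecond   = 10000000ULL;

// Converts a FILETIME tick count to seconds since 1970, rounded to
// the nearest second. Values outside the 32-bit range are clamped.
uint32_t FileTimeToSeconds(uint64_t ticks)
{
    if (ticks <= kTicksToUnixEpoch) {
        return 0;               // at or before 1970
    }

    uint64_t seconds =
        (ticks - kTicksToUnixEpoch + (kTicksPerSecond / 2)) / kTicksPerSecond;

    if (seconds > 0xFFFFFFFFULL) {
        return 0xFFFFFFFFu;     // past the 2106 roll-over
    }

    return static_cast<uint32_t>(seconds);
}
```

When comparing a timestamp that has passed through a FAT volume against one that has not, allowing a difference of up to 2 seconds avoids false mismatches caused by the truncation described above.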

The Code

I'm not going to reproduce all of the code in this article. The two versions of the QzDirList class are found in this zip file, which is close to a thousand lines of code in total. Look at QzDirListMac.cpp and QzDirListWin.cpp to see how the same functions are implemented differently for the two platforms.

The following code snippet shows how the example class could be used:

    QzDirList dir;

    // Before scanning the directory, set these flags to
    // indicate what types of properties should be reported.
    dir.m_ShowDirs      = true;
    dir.m_ShowFiles     = true;
    dir.m_ShowHidden    = false;
    dir.m_ShowReadOnly  = true;
    dir.m_ShowSystem    = false;

    // Now scan the current directory.  This will build up
    // a table of names that is stored in the object.
    dir.ScanDirectory(CharToUtf("."), NULL);

    printf("path: %s\n", dir.m_Path);

    // Now loop over the array of file names and print them out. 
    for (U32 i = 0; i < dir.m_EntryCount; ++i) {
        printf("%s: %s\n",
            dir.m_pList[i].IsDir ? "dir " : "file",
            dir.m_pList[i].pName);
    }