DRAFT DRAFT DRAFT DRAFT DRAFT

Close-To-Open Cache Consistency
in the Linux NFS Client

Chuck Lever, Network Appliance, Inc.
cel@netapp.com

DRAFT DRAFT DRAFT DRAFT DRAFT

Abstract

To support close-to-open cache consistency, the Linux NFS client aggressively times out its DNLC entries. This doesn't guarantee total consistency, however, and results in unpredictable behavior. In this report, we describe the current Linux DNLC entry revalidation mechanism, compare the network behavior of the Linux NFS client implementation with other client implementations, and discuss possible improvements. We describe a solution that makes the Linux NFS client guarantee cache consistency. Finally, we show how to shut off consistency checking for the sake of performance in specific environments.


Introduction

UNIX(tm) was one of the first operating systems to support hierarchical file systems [3]. A hierarchical file system allows users to organize files with a common purpose into directories. It also allows directories to contain other directories, creating a hierarchy of directories that often resembles the branches of an inverted tree. Generally, an operating system invokes a lookup operation to find files contained within a directory. File system operations on files stored in a hierarchical file system traverse the directory structure via multiple lookup operations, one directory at a time.

Lookup operations are so numerous on active systems that a mechanism is required to speed them up so that they do not become a bottleneck for system performance. Modern flavors of UNIX(tm) maintain a cache of results from recent file system directory lookup operations [9]. In this report we refer to this cache as the operating system's directory name lookup cache, or DNLC for short. In the Linux kernel, this cache is called the directory entry cache, or dcache [1]. In most UNIX(tm) systems, the DNLC is only part of the pathname resolution logic, but the dcache is integrated into the Linux kernel's virtual file system (VFS) layer.

For file systems where data is accessed on the same system where it is stored permanently, entries in a system's DNLC can last as long as there is room to keep them in the cache. In this instance, applications run on the same operating system that controls the disk and file system metadata. The operating system is fully aware of any change to local filenames, so the DNLC is always kept up-to-date.

However, when files are stored on remote systems, some kind of cache coherency must be maintained for any file metadata cached on the systems where the remote file data is accessed and modified. Clients of NFSv2 and v3 file servers, for example, usually expire file system metadata periodically so that it will be revalidated the next time it is accessed. This applies to any entries in a client's DNLC, and to file attributes cached by the client. Network file systems such as AFS go to great lengths to help a client maintain a coherent view of file systems it shares with other clients [2, 7].

On Linux, every lookup operation that results in a DNLC hit invokes a file system dependent operation to revalidate the cached entry before the entry is made available for use by other parts of the operating system. Most file systems that maintain file data locally do not need any cache entry revalidation. The Linux NFS client, however, takes this opportunity to revalidate the cached entry. If the entry is considered invalid, the NFS client issues a fresh on-the-wire lookup to validate the file's name and parent directory, its file handle, and any file attributes corresponding to the cached entry.

To support certain aspects of the NFS standard, the Linux client aggressively times out its DNLC entries under certain circumstances. This is not enough to guarantee cache consistency, however. In this report, we describe the current Linux dcache entry revalidation mechanism, compare the network behavior of the Linux NFS client with other client implementations, and discuss possible improvements.


Dcache Operation

Dcache entries are linked together in a tree-like structure. Each entry points to its parent. Each entry that refers to a directory contains a list of its known children (i.e. children discovered via previous lookup operations). Each entry contains a name, and a pointer to an associated inode (the in-memory structure that Linux uses like a vnode).

Dcache entries are also inserted into a hash table, which constitutes the system's directory entry cache. They are hashed via their name and the address of their parent dcache entry. A lookup operation in this cache starts with the parent's dentry and the name to be looked up, and returns a dentry that matches on the name and the parent.
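For reference, the dcache entry fields that matter for this discussion look roughly like the sketch below. This is a simplified rendering of the 2.4 struct dentry; several fields are omitted and the exact declaration differs.

    struct dentry {
            struct dentry            *d_parent;   /* parent directory entry */
            struct qstr               d_name;     /* name of this entry */
            struct inode             *d_inode;    /* associated inode; NULL for a negative entry */
            struct list_head          d_subdirs;  /* known children of this directory */
            struct list_head          d_hash;     /* chain in the dcache hash table */
            struct dentry_operations *d_op;       /* optional per-file system operations vector */
            unsigned long             d_time;     /* reserved for file system use */
    };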


Dcache entry life cycle

Linux dcache entries behave in some ways like vnodes behave on other flavors of UNIX(tm) [4]. On Linux, in-memory file metadata is split between dcache entries and inodes. The Linux VFS layer stores filename metadata in the dcache, and per-file metadata in the inode cache. In most instances, the VFS layer first looks up a file's dcache entry, then retrieves the inode from the d_inode field in the dcache entry. This makes inode cache lookups rare.

Inodes can outlive dcache entries. If a file is renamed, its inode stays, but its dcache entry is deleted and a new one is created for the new name. Dcache entries can also represent negative lookup results; in that case the entry's inode pointer is NULL.


Dcache entry operations vector

Every dcache entry has an optional operations vector associated with it. This vector allows file system implementors to overload standard dentry operations. File systems can choose to leave this operations vector NULL, in which case the VFS layer uses default behavior. File systems may also choose to implement only some of the operations, in which case they leave the unimplemented function pointers as NULL values.

In the Linux kernel current as of this writing (2.4.2), this vector contains six virtual functions, each described below.
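Collected into a C structure, these look roughly like the sketch below, reconstructed from the per-operation signatures that follow; the exact 2.4 declaration may differ slightly.

    struct dentry_operations {
            int  (*d_revalidate)(struct dentry *, int);
            int  (*d_hash)(struct dentry *, struct qstr *);
            int  (*d_compare)(struct dentry *, struct qstr *, struct qstr *);
            int  (*d_delete)(struct dentry *);
            void (*d_release)(struct dentry *);
            void (*d_iput)(struct dentry *, struct inode *);
    };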

int d_revalidate(struct dentry * dentry, int flags)
When a VFS lookup request finds a result in the dcache, this operation, if it exists, is invoked to ensure the cached result is still valid. The flags argument indicates the type of lookup operation (last component in the pathname, return the parent of this dentry, require a directory as the final component, and so on).

This function returns an integer value: one if the dcache entry is valid or has been revalidated, and zero if the dcache entry should be invalidated and refreshed with a file system dependent lookup operation. When the VFS layer encounters a zero return from d_revalidate, it unhashes the dentry from its parent and does a fresh real lookup to attempt to replace it.

Most file systems leave this operation NULL, because all of their dentries in the dcache are always valid. The NFS client defines this operation, using it to expire and revalidate dcache entries.

int d_hash(struct dentry * dentry, struct qstr * name)
If it exists, d_hash is invoked in place of the dcache's standard hash function. It can be used for file systems that have special naming requirements which make the standard hash function inefficient or unusable. The dcache's hash function hashes on a file's name to determine its location in the cache. This function returns zero normally, and a negative errno-value if some error occurs.

The NFS client leaves this operation as NULL, since it can use POSIX naming conventions supported by the VFS layer by default.

int d_compare(struct dentry * dentry, struct qstr * name1, struct qstr * name2)
This function, if it exists, is invoked to determine whether two names are equivalent. It can be used for file systems that have special naming requirements which make the standard name comparison function inefficient or unusable (for example, if the file system uses case insensitive file names). The dentry argument is the parent directory that contains the two names. This function returns one if the two names are considered the same by the file system, or zero if the two names are not equivalent.

The NFS client leaves this operation as NULL, since it can use POSIX naming conventions supported by the VFS layer by default.

int d_delete(struct dentry * dentry)
This function, if it exists, is invoked when a dcache entry's reference count drops to zero. If this function returns one, the dcache entry is immediately removed from the dcache. If this function returns zero, the dcache entry is made dormant but is left in the cache, unless it is unreachable, in which case it is removed from the dcache.

The NFS client defines this operation to clean up after silly renames.

void d_iput(struct dentry * dentry, struct inode * inode)
This function, if it exists, is invoked just before the VFS layer unbinds the associated inode from the dcache entry. This happens when a file is deleted or when the kernel's memory manager needs to reclaim dcache entries and inodes.

The NFS client defines this operation to clean up after silly renames.

void d_release(struct dentry * dentry)
This function, if it exists, is invoked when a dcache entry is returned to the dentry SLAB cache (that is, when it is deallocated).

The NFS client, like most file systems, leaves this operation as NULL.


Linux NFS Client Implementation


The difference between dcache entries and file attributes

The dcache is responsible for caching filename information, while the file attribute cache retains per-file metadata information. This distinction is made very clear when considering the rename operation. During a rename, the file attributes remain the same (except for ctime), but the name of the file, and hence any cached directory information, must change.

The file attribute cache and the dcache use separate time out values. Attribute cache time out logic uses time out values stored in the inode field nfs_i.attrtimeo. Dcache time out logic uses time out values stored in the dentry field d_time. This field is reserved specifically for use by file systems; the VFS layer does not touch this field. In certain special cases, of course, the time out values can be ignored. These time out values themselves vary as the client discovers how often an object changes.


Ramifications of stale file handles

A stale file handle represents a file object that was deleted, after which a new file object was created with the same name as the deleted file. What does this mean for the dcache and for file attribute caches? When a client discovers a file handle is stale, it should invalidate the DNLC entry for the file and re-read the file attributes. For Linux, this means the client must unhash the dcache entry, release the inode, then instantiate a new dcache entry and a new inode.
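In kernel terms, the recovery looks roughly like the sketch below. The helper name is hypothetical; d_drop is the standard VFS routine that unhashes a dcache entry.

    /* Hypothetical sketch: recover after the server reports a stale file handle. */
    static void nfs_zap_stale_dentry(struct dentry *dentry)
    {
            d_drop(dentry);         /* unhash the dcache entry from its parent */
            /*
             * The old inode is released along with the dropped dentry; the
             * next lookup of this name goes over the wire and instantiates
             * a fresh dcache entry and a fresh inode.
             */
    }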


Close-to-open cache coherency

The NFS standard requires clients to maintain close-to-open cache coherency when multiple clients access the same files [5, 6, 10]. This means flushing all file data and metadata changes when a client closes a file, and immediately and unconditionally retrieving a file's attributes when it is opened via the open() system call API. In this way, changes made by one client appear as soon as a file is opened on any other client.

The ext2 file system is the standard local file system on Linux, and is the most frequent local file system exported by the Linux NFS server. It uses 32-bit wide timestamps that count the number of seconds since January 1, 1970. Thus ext2 on-disk inodes don't resolve changes that happen during the same second. This is acceptable for local file access. However, the Linux NFS server exports timestamps with a resolution of only one second; changes to a file or directory that happen during the same second are not reflected in the timestamps. Because it cannot rely on timestamps to detect sub-second changes to directories, the Linux NFS client currently uses dcache entry revalidation to achieve close-to-open cache coherency.

Each operation that involves altering a directory, such as rmdir, create, and so on, time stamps the parent's directory entry. These operations store updated directory attributes returned by server requests into the attribute cache. Whenever a directory's inode attributes are updated as a result of one of these operations, its dcache entry time stamp is updated to the current time on the client.

When a dcache entry is revalidated, the dcache entry's time stamp is compared with the current time on the client. In most cases, if the difference is larger than the directory's attribute timeout value, the dcache entry is revalidated by executing an on-the-wire lookup request and comparing the result to information cached on the client. Normally this information doesn't change, so the dentry may be used as-is. If the information has changed (for example, if the file has been renamed), the dentry is invalidated, and another on-the-wire lookup is requested by the VFS layer to acquire the new information.

The last component of a pathname lookup is a special case, however. If the last component's parent directory has changed recently, the time out value is set to zero, causing the dcache entries of files in active directories to be revalidated immediately.
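A condensed sketch of this decision follows. The helper names and the flag test are illustrative only; the real logic lives in the NFS client's d_revalidate implementation and in nfs_dentry_force_reval.

    /* Illustrative sketch: does this dcache entry need an on-the-wire lookup? */
    static int nfs_dentry_expired(struct dentry *dentry, int flags)
    {
            struct inode *dir = dentry->d_parent->d_inode;
            unsigned long timeout = NFS_ATTRTIMEO(dir); /* parent's attribute timeout */

            /*
             * Last component of a pathname: if the parent directory changed
             * recently, force immediate revalidation (hypothetical helpers).
             */
            if (is_last_component(flags) && nfs_dir_changed_recently(dir))
                    timeout = 0;

            return time_after(jiffies, dentry->d_time + timeout);
    }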


Simple Benchmark

First, we present the results of a simple benchmark that compares the number of lookup operations required by an unmodified client with the number required by a client that is less aggressive about timing out dcache entries. The Linux kernel, in both cases, is 2.4.1. The client is an AMD K6 running at 233MHz, with 128MB of memory, attached to the server via a 100Mbps Ethernet switch. The client mounts the server via NFSv3, and uses an rsize and wsize of 8,192 bytes.

Our workload is generated by building the Linux kernel in an NFS-mounted file system. The build configuration is held fixed for each build. The kernel is built immediately after the system is booted, providing a cold cache. NFS statistics are gathered via nfsstat.

The modification consists of removing logic in nfs_dentry_force_reval that shortens the attribute time-out value of the last component of pathnames.

Operation type        Linux kernel, unmodified    Linux kernel, modified
Total packets                          108,810                    62,718
Fragments                               16,388                    16,407
Lookup requests                         54,405                    11,176
Getattr requests                           284                       384
Write requests                           7,913                     7,914
Read requests                            3,361                     3,364
Create requests                            770                       770
Readdir requests                           409                       410

While the number of read, write, create, and readdir requests remain about equal for both runs, the modified kernel generated considerably fewer lookup requests, resulting in a packet count that is nearly half that of the unmodified client.

This test illustrates an artificial lower bound for client packet count. In the case of a single client and a single server, the client can trust that it is the only accessor of these files, and thus can safely dispense with extra lookup operations. Our results show how much network traffic could be reduced without these operations. Our goal is to create a client that approaches this lower bound, but still effectively implements close-to-open cache coherency.


Workload Analysis

What workloads exacerbate this problem, and what workloads and applications require the extra time-out logic we removed? A compilation-intensive workload is especially onerous in this case, but other common workloads would also be impacted by excessive DNLC timeouts.

We feel that, in fact, most common workloads don't share directories between NFS clients, and that when they do, the applications themselves can easily be responsible for notifying remote instances of changes. Thus this type of excessive timeout is likely unnecessary for all but a few unique types of workload.

The principle of least surprise, however, requires that close-to-open cache consistency be maintained by default. System administrators might find a mount option useful to identify file systems that don't require strict close-to-open cache consistency.


Possible Solutions

The fundamental problem here is that many operations in the Linux VFS layer cause a dcache lookup, but only some of them require an immediate entry revalidation. Consider these possible solutions:

Make f_op->open revalidate the dentry
Summary: Remove faulty timeout trashing logic from nfs_dentry_force_reval. Add logic to nfs_open to revalidate the dcache entry before returning to the VFS layer.
Pros: This is the most straightforward design. It makes it clear that a file's attributes are refreshed immediately and unconditionally whenever a file is opened on a client.
Cons: The f_op->open method in the case of the NFS client is nfs_open. Invoking this function occurs late during open processing; the dentry has already been looked up. If nfs_open should find that the file handle is stale, it must dissociate the dcache entry's current inode and get a new one; the only safe way for this to happen is for nfs_open to return -ESTALE and have the VFS layer handle the problem.

Note that the Solaris VFS layer recovers from this by invalidating DNLC and dropping file attributes, then reacquires them. If we don't expect recovering from stale file handles in open processing to be a performance path, this might be the cleanest solution.
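A rough sketch of this approach is shown below; the attribute refresh helper is an assumption about the client's internals, not the actual 2.4 code.

    /* Sketch: f_op->open for the NFS client, refreshing attributes at open time. */
    static int nfs_open(struct inode *inode, struct file *filp)
    {
            int error;

            /* Force an over-the-wire GETATTR, ignoring the attribute cache
             * timeout (hypothetical helper name). */
            error = nfs_refresh_inode_attributes(inode);
            if (error)
                    return error;   /* -ESTALE is left for the VFS layer to handle */
            return 0;
    }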

Add more nfs_revalidate calls to the VFS layer
Summary: Remove faulty timeout trashing logic from nfs_dentry_force_reval. Add nfs_revalidate to open_namei and open_exec.
Pros: This takes a familiar approach.
Cons: The VFS layer invokes nfs_revalidate before calls such as stat. (Why doesn't it use this before open?)

If nfs_revalidate discovers a stale file handle, it must dissociate the dcache entry's current inode and get a new one. Extra logic must be added to recover a new file handle; see above. Finally, nfs_revalidate uses the normal timeout mechanism, so some indication that the timeout should be ignored must be passed to it.

Zero the d_time field when a file is closed
Summary: Remove faulty timeout trashing logic from nfs_dentry_force_reval. Zero the d_time field when closing a file to force d_revalidate to revalidate a dcache entry immediately if it is looked up again.
Pros: No changes are necessary to the VFS layer.
Cons: The first open of a file will find no dcache entry, so the entry will be looked up properly. A close on one client will cause that client to retrieve the file's attributes again properly. However, a second lookup that follows only an open, with no intervening close, will not cause the file's attributes to be retrieved from the server.
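The mechanism itself would be simple; a sketch follows (the release method shown here is illustrative, with normal close processing omitted).

    /* Sketch: force revalidation of this name on its next lookup. */
    static int nfs_file_release(struct inode *inode, struct file *filp)
    {
            filp->f_dentry->d_time = 0;  /* next d_revalidate sees an expired stamp */
            /* ... normal close processing (flushing dirty data, and so on) ... */
            return 0;
    }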

Set a flag during lookups that require immediate revalidation
Summary: Define a new flag that open_namei and open_exec can pass to d_revalidate to indicate to file system specific routines that the dentry being looked up needs immediate revalidation. Replace the faulty timeout trashing logic in nfs_dentry_force_reval with a check for the new flag. If the new flag is present, trigger an immediate revalidation.
Pros: This is an easy-to-implement solution, requiring few changes to the VFS layer and NFS client.
Cons: This solution relies on a side effect of on-the-wire lookup requests. The lookup request revalidates cached filename information, but also returns a fresh set of file attributes.

Note that only open and fopen need to guarantee that they get a consistent handle to a particular file for reading and writing. In fact, stat and friends are not required to retrieve fresh attributes. Thus, for the sake of close-to-open cache coherence, only open and fopen are considered an "open event" where fresh attributes need to be fetched immediately from the server.

Solaris handles the case where client A has an open file, and tries to open the same file again, but discovers that it has been replaced by client B, thereby making the file handle cached on client A "stale." In this case, Solaris's VFS layer invalidates any DNLC entries and attributes for the file, then rebuilds its state.
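A sketch of the flag-based approach appears below. The flag name and value are hypothetical, and the real d_revalidate implementation performs the on-the-wire lookup itself rather than simply returning zero to the VFS layer.

    #define LOOKUP_OPEN_REVAL  0x100   /* hypothetical flag set by open_namei/open_exec */

    /* Illustrative sketch of the NFS client's d_revalidate operation. */
    static int nfs_lookup_revalidate(struct dentry *dentry, int flags)
    {
            unsigned long timeout = NFS_ATTRTIMEO(dentry->d_parent->d_inode);

            if (flags & LOOKUP_OPEN_REVAL)  /* open event: ignore the cached timeout */
                    timeout = 0;

            if (time_after(jiffies, dentry->d_time + timeout))
                    return 0;       /* stale: the VFS layer performs a fresh lookup */
            return 1;               /* cached entry is still considered valid */
    }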


Other interesting behaviors

The case of open(".") is interesting. On Linux, when a pathname containing simply "." is resolved, the implementation simply retrieves the dentry for the process's current working directory from it's task structure, and returns it to open(). Because neither a lookup nor a revalidation is done to obtain this dentry, it is possible for the pathname resolution logic to return a dentry for a deleted directory. This is the only case on Linux where an application is allowed to open a deleted directory.

This is a problem both for local file system implementations such as ext2 and for NFS. If a directory that is a current working directory of some process is deleted, that process is still allowed to open("."). If the directory is deleted on a remote client, there is no way to tell it is gone until something tries to use the directory.

The absence of a lookup for "." also means that the NFS client implementation is not invoked to retrieve or refresh the directory's attributes. With the current implementation of pathname resolution on Linux, it is impossible to guarantee close-to-open cache consistency for current working directories.

We also note that the extra do_revalidate code in the VFS layer's support for stat and friends is, at this time, redundant. Each of these system calls uses path_walk to find the dentry for the target object, and path_walk eventually invokes cached_lookup, which revalidates both the DNLC entry and the cached inode. Following the path_walk call in each of these system routines, there is a do_revalidate call that invokes the inode's i_op->revalidate method.


Conclusions

We implemented the last solution (passing a flag to the dentry revalidation operation), and found that, while NFS close-to-open behavior improved, performance was slightly worse due to an increase in the number of on-the-wire lookups.

We mitigated the performance problem by implementing support for a pre-existing client mount option called "nocto" (which stands for "no close-to-open"). For certain workloads where we know there will be little or no data sharing, we can dispense with the extra lookup operations that verify file attributes during open() processing, and rely simply on attribute and dcache timeouts. Using this mount option, we obtain on-the-wire lookup counts that are very close to optimal.
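For example, a file system known to be accessed by only one client could be mounted with close-to-open checking disabled (the server name and paths here are placeholders):

    mount -t nfs -o nocto,rsize=8192,wsize=8192 server:/export /mnt/export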

Next we tried implementing the first solution from above; namely, adding logic to nfs_open to retrieve attributes from the server during open processing. This solution was easier to implement than we had estimated, and provided three benefits over our first attempt. First, open(".") is correctly supported. Second, we are closer to removing nfs_lookup_revalidate entirely. Finally, instead of on-the-wire lookups, this client implementation uses on-the-wire GETATTR requests, which results in a measurable performance improvement for lookup-intensive workloads.


Future Work

Dentry revalidation is a performance path for lookup-intensive workloads like compilation. Special attention to the efficiency of this part of the NFS client could have as much payoff as making the read and write paths fast. Entirely eliminating the need to revalidate lookup results could improve NFS performance.

Allowing servers and clients to form a special agreement about directories such that clients can have exclusive access to them might help tremendously. Clients would no longer be burdened by checking back with the server to see if directories have changed, reducing the number of on-the-wire lookup requests significantly.


References

1. Gooch, Richard. "Overview of the Virtual File System." www.atnf.csiro.au/~rgooch/linux/vfs.txt.
2. Kazar, Michael Leon. "Synchronization and Caching Issues in the Andrew File System." USENIX Conference Proceedings, pp. 27-36. Winter 1988.
3. Ritchie, Dennis M., and Thompson, Ken. "The UNIX Time-Sharing System." Communications of the ACM, 17(7):365-375. July 1974.
4. Kleiman, S. R. "Vnodes: An Architecture for Multiple File System Types in Sun UNIX." USENIX Conference Proceedings. Atlanta, 1986.
5. Sun Microsystems, Inc. "RFC 1094 - NFS: Network File System Protocol Specification." IETF Network Working Group. March 1989.
6. Sun Microsystems, Inc. "RFC 1813 - NFS: Network File System Version 3 Protocol Specification." IETF Network Working Group. June 1995.
7. Howard, John H.; Kazar, Michael L.; Menees, Sherri G.; Nichols, David A.; Satyanarayanan, M.; Sidebotham, Robert N.; West, Michael J. "Scale and Performance in a Distributed File System." ACM Transactions on Computer Systems, 6(1). February 1988.
8. McKusick, Marshall Kirk; Joy, William N.; Leffler, Samuel J.; Fabry, Robert S. "A Fast File System for UNIX." ACM Transactions on Computer Systems, 2(3), pp. 181-197. August 1984.
9. Leffler, Samuel J.; McKusick, Marshall Kirk; Karels, Michael J.; Quarterman, John S. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley Publishing Company, 1990.
10. Callaghan, Brent. NFS Illustrated. Addison-Wesley Longman, Inc., 2000.

Last Modified: Tue Jul 10 11:14:52 EDT 2001