Projects: Linux scalability: Brief: bforget() bug fix Projects: Linux scalability: Brief: bforget() bug fix

Brief: Linux 2.2.3 bforget() bug fix

Chuck Lever, Netscape Communications Corp.
chuckl@netscape.com

Andrea Arcangeli
andrea@e-mind.com

$Id: bforget.html,v 1.5 1999/11/12 20:12:54 cel Exp $

Abstract

Linux 2.2.x sports a redesigned VM and buffer cache system that can self-tune based on offered system load. This brief describes a bug in the buffer cache that was identified and fixed by the authors.



This document is Copyright © 1999 Netscape Communications Corp., all rights reserved. Trademarked material referenced in this document is copyright by its respective owner.


Introduction

Linux is an open-source POSIX-compliant operating system that runs on commodity hardware. A feature of recent releases of Linux is its ability to self-tune system parameters, such as buffer cache size, according to offered system load. Part of this self-tuning ability is implemented in a varying partitioning of physical RAM between the traditional-style buffer cache and the virtual memory system. However, allowing the buffer cache to grow and shrink on-demand introduces some interesting design problems, and, in this case, a significant bug.

In this brief, we describe the bug, and how the bug was identified. We provide performance measurements to show the significance of the problem. Finally, we detail the fix as we proposed it and as it was adopted by the Linux kernel, and report on the performance improvement.


Identifying the bug

Several developers, including the authors, had noticed that early releases of Linux 2.2.x were experiencing a memory leak of some kind under heavy file system and VM loads. Symptoms included poor system performance after copying or removing large files, sporadic "out of memory" errors, and performance degradation on long-running jobs. In order to stress the system and perhaps identify the cause of the problem, the SPEC S-DET benchmark suite was used to generate significant loads on a 4-way Dell PowerEdge 6300-450 with 512M of RAM and an 18G Ultra2 LVD SCSI hard drive.

The SPEC S-DET benchmark consists of a script that is designed to emulate a software developer by running programs such as cc, nroff, cpio, and ed. The software developer workload stresses many aspects of an operating system, including the file and VM subsystems. Multiple concurrent instances of the script can run to simulate reproducible and increasing amounts of system load. The S-DET benchmark is carefully designed to provide a consistent workload across runs, and to error-check the output of each script to quickly catch problems.

It became quickly apparent that running S-DET with a large number of scripts could reproduce the performance degradation scenario quickly.

Figure 1. Consecutive runs of 128 scripts on Linux 2.2.3 [ raw data ]

Notice below how the buffer cache continues to grow without bounds during the benchmark runs. Eventually, the buffer cache causes the system to begin flushing pages unnecessarily. This pressure on the VM system results in, among other things, pages being stolen back from the buffer cache. Since the buffer cache doesn't have any aging or replacement policy, any buffer can be stolen, even buffers for heavily-used data. The system doesn't usually recover from this condition until it is rebooted.

 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
 1 0 0     0 413404 74064  7940   0   0    0  759  178 1852  14  13  73
 0 0 0     0 412948 74508  7940   0   0    1   79  119 1754  14  14  72
 1 0 0     0 412572 74944  7940   0   0    1   89  117 1749  13  14  72
161 0 0     0 375724 75916 10572   0   0   69   88  140 1529  19  30  50
149 3 0     0 319992 97796 34716   0   0  697  144  256  793  15  85   0
116 0 0     0 255888 136664 49480   0   0  284 2184 1217 3828  37  53  10
119 4 0     0 232072 153492 46196   0   0   17  138  277  688  15  85   0
139 0 0     0 234232 161756 49000   0   0   25  408  228  638  21  79   0

...

 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
15 107 0    88  2892 388056 42528   0   0  480   81  320 10401  24  43  33
29 92 1    88  3092 387488 43832   0   0  376  116  310 13246   9  14  77
10 111 0    88  1584 387376 45524   0   0  360   94  307 13222   2   9  89
13 108 1    88  6076 385548 42692   0   0  359  224  310 12467   3   9  88
 2 122 1    88  4812 383928 44592   0   0  444   67  304 10600   6  12  82
 9 111 0    88  1504 381636 48740   0   0  400  143  303 10244   3  13  83
 3 117 1    88  3040 380332 47396   0   0  392  122  299 7481   4   6  91
43 78 1    88  3936 380432 50056   0   0  427  180  301 10556  19  28  52
21 98 1    88  2724 382424 48456   0   0  455   55  311 12743   9  10  81
19 100 1    88  7484 379236 46612   0   0  336   84  302 11554   7  11  83

Figure 2. "vmstat" output excerpts of benchmark runs on Linux 2.2.3

Other features of this scenario include low CPU utilization and a large number of blocked processes. The size of the buffer cache and the size of the free memory list are continually fluctuating, suggesting a high flow of pages into and out of the buffer cache.

After some number of unsuccessful guesses at what might be the problem, it was noticed that the rate at which blocks were being read was low during benchmark runs with good performance, but increased significantly during poorly performing runs. This suggested that the buffer cache was somehow becoming ineffective over time. The elevated read rate interferes with disk write bandwidth, since Linux prioritizes read requests over write requests, since usually a read request is more time-critical.

The original bforget()

The authors, independently of one another, discovered that buffers were being orphaned. In other words, many buffers were ending up unlinked from the hash, such that they would not be found during a find_buffer() request. Each time a buffer is orphaned, another buffer is allocated for the same logical block of data, and another read operation is requested to pull the same data in from the disk. This also causes the buffer cache to grow in size as more and more copies of the same data blocks appear in the cache.

A review of the source that manages buffers (linux/fs/buffer.c) revealed that the bforget() function, used by ext2 when files are truncated or deleted, removes buffers from the hash table, but then it abandons them without recycling them.

/*
 * bforget() is like brelse(), except it removes the buffer
 * from the hash-queues (so that it won't be re-used if it's 
 * shared).
 */
void __bforget(struct buffer_head * buf)
{
        mark_buffer_clean(buf);
        clear_bit(BH_Protected, &buf->b_state);
        remove_from_hash_queue(buf);
        buf->b_dev = NODEV;
        refile_buffer(buf);
        if (!--buf->b_count)
                return;
        printk("VFS: forgot an in-use buffer! (count=%d)\n",
                buf->b_count);
}

Figure 3. The original bforget()

In Linux 2.0, refile_buffer() was probably responsible for ensuring that the buffer was properly added to the free list. However, rewrites of the VM and buffer cache subsystems in Linux have since removed that functionality from refile_buffer().

Proposed and final versions

The authors proposed the following replacement to correct this problem.
void __bforget(struct buffer_head * buf)
{
        clear_bit(BH_Protected, &buf->b_state);
        remove_from_queues(buf);
        put_last_free(buf);
        if (!--buf->b_count)
                return;
        printk("VFS: forgot an in-use buffer! (count=%d)\n",
                buf->b_count);
}

Figure 4. Proposed fix

After more analysis and testing, Linus Torvalds' final version of the routine looks like this (from Linux 2.2.5):
/*
 * bforget() is like brelse(), except it puts the buffer on the
 * free list if it can.. We can NOT free the buffer if:
 *  - there are other users of it
 *  - it is locked and thus can have active IO
 */
void __bforget(struct buffer_head * buf)
{
        if (buf->b_count != 1 || buffer_locked(buf)) {
                __brelse(buf);
                return;
        }
        buf->b_count = 0;
        remove_from_queues(buf);
        put_last_free(buf);
}

Figure 5. Final fix

The "if" construct is simply defensive programming. Studying the current implementation of ext2, it appears unlikely that bforget() will be called with a buffer that has a usage count other than one. However, it was thought that eventually, other file systems may want to use bforget(), or ext2 may change how it uses bforget(), so leaving the safety check in there would help catch software problems that may be introduced later.

The rest of the routine ensures that the buffer's usage count is zero before finally inserting it into the free list. In general, a non-zero usage count prevents try_to_free_buffers() from releasing pages containing buffers. If buffers with non-zero usage counts appear in the buffer free lists, they will be skipped over by try_to_free_buffers() in favor of buffers that may still have useful data. Having a large number of these buffers in the free list can even cause a severe system-wide memory shortage.

 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
 1 0 0     0 414080 73704  7636   0   0    1  123  130 1759  13  14  73
 0 0 0     0 413516 74140  7636   0   0    1   89  117 1745  13  15  72
 0 0 0     0 413136 74576  7636   0   0    1   90  117 1749  13  15  72
159 0 0     0 372752 75372 11088   0   0  130   88  145 1379  19  40  41
134 1 0     0 332664 89228 32172   0   0  688   75  259  749  13  87   0
110 30 0     0 322096 89232 38256   0   0   79  466  337  746  22  78   0
132 0 0     0 303544 89232 47352   0   0  104   68  166  751  54  46   0
144 0 0     0 301464 89232 45012   0   0   36  285  244  777  39  61   0
139 1 0     0 305996 89232 49180   0   0    6  211  235  666  36  65   0
142 0 0     0 307616 89232 52908   0   0   15  141  161  606  31  69   0
140 0 0     0 303904 89232 49568   0   0   27  199  249  669  27  73   0
139 3 0     0 299460 89232 51920   0   0    5  317  310  924  31  69   0
139 0 0     0 307004 89232 51676   0   0   33   61  167  458  26  74   0
138 2 0     0 304716 89232 49880   0   0    5  415  287  728  34  66   0

...

138 0 0     0 286980 91856 53820   0   0   10   20  126  510  33  67   0
137 7 0     0 291312 91856 56588   0   0    8  374  293  716  31  69   0
147 0 0     0 293088 91856 58432   0   0    7   78  144  593  38  62   0
141 0 0     0 287608 91856 57548   0   0    0  290  230  667  36  64   0
142 1 0     0 306696 91856 51956   0   0    0  178  203  643  35  65   0
135 0 0     0 303112 91856 54436   0   0    1  127  144  480  27  73   0

Figure 6. "vmstat" output excerpts of benchmark runs on Linux 2.2.3 with final bforget()

The final bforget() has corrected the performance degradation scenario. CPU utilization is maximized, and few processes are blocked. Block read rate is low, and the buffer cache size expands to a reasonable working set, then remains at a steady size. The following graph shows that performance is flat for each consecutive run of the 128 script benchmark.

Figure 7. Consecutive runs of 128 scripts on Linux 2.2.3 with final bforget() [ raw data ]

The fixed version of bforget() is contained in Linux 2.2.5 and later.


This document was written as part of the Linux Scalability Project. For more information, see our home page.
If you have comments or suggestions, email
linux-scalability@umich.edu Copyright 1999 Netscape Communications Corporation. All rights reserved.