
Projects: Linux scalability: Annotated bibliography

T. E. Anderson, B. N. Bershad, E. D. Lazowska, H. M. Levy, "Scheduler activations: effective kernel support for the user-level management of parallelism," ACM Transactions on Computer Systems, Vol. 10, No. 1 (Feb. 1992), pp. 53-79.

Threads are the vehicle for concurrency in many approaches to parallel programming. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory.

This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the problems encountered in integrating user-level threads with other system services are a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; kernel threads are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.
[ ACM Citation ]

M. Aron, P. Druschel, "Soft timers: efficient microsecond software timer support for network processing," Proceedings of the 17th Symposium on Operating Systems Principles (SOSP'99), Kiawah Island Resort, SC, December 1999.

This paper proposes and evaluates soft timers, a new operating system facility that allows the efficient scheduling of software events at a granularity down to tens of microseconds. Soft timers can be used to avoid interrupts and reduce context switches associated with network processing without sacrificing low communication delays.

More specifically, soft timers enable transport protocols like TCP to efficiently perform rate-based clocking of packet transmissions. Experiments show that rate-based clocking can improve HTTP response time over connections with high bandwidth-delay products by up to 89% and that soft timers allow a server to employ rate-based clocking with little CPU overhead (2-6%) at high aggregate bandwidths.

Soft timers can also be used to perform network polling, which eliminates network interrupts and increases the memory access locality of the network subsystem without sacrificing delay. Experiments show that this technique can improve the throughput of a Web server by up to 25%.
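
The core mechanism is simple: keep pending timers in a sorted structure and check it opportunistically at existing "trigger states" (system-call return, exception handling) instead of programming a hardware interrupt per timer. A minimal sketch of that idea in Python follows; the class and method names are illustrative, not the paper's kernel interface.

```python
import heapq
import itertools

class SoftTimerFacility:
    """Sketch of the soft-timer idea: timers sit in a heap and are run
    by an explicit poll() at trigger states, not by per-timer interrupts."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal deadlines

    def schedule(self, deadline, callback):
        heapq.heappush(self._heap, (deadline, next(self._counter), callback))

    def poll(self, now):
        """Called at each trigger state: run every timer whose deadline
        has passed.  Expiry is therefore slightly late on average, which
        the paper shows is acceptable at microsecond granularities."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            _, _, cb = heapq.heappop(self._heap)
            fired.append(cb(now))
        return fired
```

Because poll() runs on code paths the kernel executes anyway, a pending timer costs nothing until a trigger state happens to pass by, which is what makes the facility cheap at high timer densities.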

[ Postscript ]

H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, R. H. Katz, "TCP Behavior of a Busy Internet Server: Analysis and Improvements," Proceedings of IEEE Infocom, March 1998.

The rapid growth of the World Wide Web in recent years has caused a significant shift in the composition of Internet traffic. Although past work has studied the behavior of TCP dynamics in the context of bulk-transfer applications and some studies have begun to investigate the interactions of TCP and HTTP, few have used extensive real-world traffic traces to examine the problem. This interaction is interesting because of the way in which current Web browsers use TCP connections: multiple concurrent short connections from a single host.

In this paper, we analyze the way in which Web browsers use TCP connections based on extensive traffic traces obtained from a busy Web server (the official Web server of the 1996 Atlanta Olympic Games). At the time of operation, this Web server was one of the busiest on the Internet, handling tens of millions of requests per day from several hundred thousand clients. We first describe the techniques used to gather these traces and reconstruct the behavior of TCP on the server. We then present a detailed analysis of TCP's loss recovery and congestion control behavior from the recorded transfers. Our two most important results are: (1) short Web transfers lead to poor loss recovery performance for TCP, and (2) concurrent connections are overly aggressive users of the network. We then discuss techniques designed to solve these problems. To improve the data-driven loss recovery performance of short transfers, we present a new enhancement to TCP's loss recovery. To improve the congestion control and loss recovery performance of parallel TCP connections, we present a new integrated approach to congestion control and loss recovery that works across the set of concurrent connections. Simulations and trace analysis show that our enhanced loss recovery scheme could have eliminated 25% of all timeout events, and that our integrated approach provides greater fairness and improved startup performance for concurrent connections. Our solutions are more general than application-specific enhancements such as the use of persistent connections in P-HTTP and HTTP/1.1, and address issues, such as improved TCP loss recovery, that are not considered by them.
[ Postscript ]

A. Balsa, "Linux Benchmarking HOWTO," Linux Documentation Project, August 1997.

The Linux Benchmarking HOWTO discusses some issues associated with the benchmarking of Linux systems and presents a basic benchmarking toolkit, as well as an associated form, which enables one to produce significant benchmarking information in a couple of hours.
[ HTML ]

G. Banga, J. C. Mogul, "Scalable kernel performance for Internet servers under realistic load," Proceedings of USENIX Annual Technical Conference, June 1998.

UNIX Internet servers with an event-driven architecture often perform poorly under real workloads, even if they perform well under laboratory benchmarking conditions. We investigated the poor performance of event-driven servers. We found that the delays typical in wide-area networks cause busy servers to manage a large number of simultaneous connections. We also observed that the select() system call implementation in most UNIX kernels scales poorly with the number of connections being managed by a process. The UNIX algorithm for allocating file descriptors also scales poorly. These algorithmic problems lead directly to the poor performance of event-driven servers.

We implemented scalable versions of the select() system call and the descriptor allocation algorithm. This led to an improvement of up to 58% in Web proxy and Web server throughput, and dramatically improved the scalability of the system.
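
The scaling problem the paper identifies is visible in the select() interface itself: the caller hands the kernel its entire interest set on every call, so the cost of finding one active connection grows with the number of idle ones. A small, hedged illustration using Python's wrapper around the same system call (the "connections" here are just pipes, and the set size of 100 is arbitrary):

```python
import os
import select

# Build 100 mostly-idle "connections" out of pipes.
read_ends, write_ends = [], []
for _ in range(100):
    r, w = os.pipe()
    read_ends.append(r)
    write_ends.append(w)

os.write(write_ends[42], b"x")    # exactly one descriptor becomes active

# select() must be handed -- and the kernel must scan -- the entire
# interest set on every call, even though only one FD has data.  This
# per-call linear work is what the paper's scalable replacement avoids.
readable, _, _ = select.select(read_ends, [], [], 0)
active_count = len(readable)
payload = os.read(readable[0], 1)

for fd in read_ends + write_ends:
    os.close(fd)
```

An event-driven server managing thousands of slow wide-area connections repeats this full scan on every iteration of its event loop, which is why the authors' fix yielded such large throughput gains.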
[ Postscript ]

G. Banga, P. Druschel, J. C. Mogul, "Better operating system features for faster network servers," SIGMETRICS Workshop on Internet Server Performance, June 1998.

Widely-used operating systems provide inadequate support for large-scale Internet server applications. Their algorithms and interfaces fail to efficiently support either event-driven or multi-threaded servers. They provide poor control over the scheduling and management of machine resources, making it difficult to provide robust and controlled service. We propose new UNIX interfaces to improve scalability, and to provide fine-grained scheduling and resource management.
[ Postscript ]

G. Banga, P. Druschel, J. C. Mogul, "Measuring the Capacity of a Web Server," Proceedings of USENIX Symposium on Internet Technologies and Systems, 1997.

The widespread use of the World Wide Web and related applications places interesting performance demands on network servers. The ability to measure the effect of these demands is important for tuning and optimizing the various software components that make up a Web server. To measure these effects, it is necessary to generate realistic HTTP client requests. Unfortunately, accurate generation of such traffic in a testbed of limited scope is not trivial. In particular, the commonly used approach is unable to generate client request-rates that exceed the capacity of the server being tested even for short periods of time. This paper examines pitfalls that one encounters when measuring Web server capacity using a synthetic workload. We propose and evaluate a new method for Web traffic generation that can generate bursty traffic, with peak loads that exceed the capacity of the server. Finally, we use the proposed method to measure the performance of a Web server.
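
The key distinction is between closed-loop clients, which wait for each response before issuing the next request and therefore stall when the server saturates, and open-loop generation, where request start times are drawn independently of server progress. A minimal sketch of open-loop arrival generation under that assumption (the function name and Poisson arrival model are illustrative, not the paper's exact generator):

```python
import random

def open_loop_arrivals(rate, duration, seed=0):
    """Generate request start times from a Poisson process at `rate`
    requests/second over `duration` seconds.  Because arrivals do not
    depend on when earlier requests complete, the offered load can
    exceed the capacity of the server under test -- the property the
    commonly used closed-loop approach lacks."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)     # exponential inter-arrival gaps
        if t >= duration:
            return times
        times.append(t)
```

A load generator built this way keeps issuing requests on schedule even while earlier ones are still outstanding, which is what lets it sustain genuine overload.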
[ HTML Postscript ]

M. Beck, H. Boehme, M. Dziadzka, U. Kunitz, R. Magnus, D. Verworner, Linux Kernel Internals, 2nd ed., trans., Addison-Wesley Longman, 1998. ISBN 0-201-33143-8.

This is the second edition of our book about the LINUX kernel. The book has been updated to cover the 2.0 version of the kernel, which is a milestone in the development of LINUX.
[ Info ]

E. D. Berger, R. D. Blumofe, "Hoard: A Fast, Scalable, and Memory-Efficient Allocator for Shared-Memory Multiprocessors," The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-99-22. September 1999.

In this paper, we present Hoard, a memory allocator for shared-memory multiprocessors. We prove that its worst-case memory fragmentation is asymptotically equivalent to that of an optimal uniprocessor allocator. We present experiments that demonstrate its speed and scalability.

[ Postscript ]

L. S. Brakmo, L. L. Peterson, "TCP Vegas: End to End Congestion Avoidance on a Global Internet," IEEE Journal on Selected Areas in Communication, 13(8):1465-1480, October 1995.

Vegas is an implementation of TCP that achieves between 37 and 71% better throughput on the Internet, with one-fifth to one-half the losses, as compared to the implementation of TCP in the Reno distribution of BSD Unix. This paper motivates and describes the three key techniques employed by Vegas, and presents the results of a comprehensive experimental performance study -- using both simulations and measurements on the Internet -- of the Vegas and Reno implementations of TCP.

[ Postscript ]

E. Bugnion, J. M. Anderson, T. C. Mowry, M. Rosenblum, and M. S. Lam., "Compiler-Directed Page Coloring for Multiprocessors," Proceedings of The Seventh International Symposium on Architectural Support for Programming Languages and Operating Systems, 1998.

This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This technique uses the compiler's knowledge of the access patterns of the parallelized applications to direct the operating system's virtual memory page mapping strategy. We demonstrate that this technique can lead to significant performance improvements over two commonly used page mapping strategies for machines with either direct-mapped or two-way set-associative caches. We also show that it is complementary to latency-hiding techniques such as prefetching.

We implemented compiler-directed page coloring in the SUIF parallelizing compiler and on two commercial operating systems. We applied the technique to the SPEC95fp benchmark suite, a representative set of numeric programs. We used the SimOS machine simulator to analyze the applications and isolate their performance bottlenecks. We also validated these results on a real machine, an eight-processor 350MHz Digital AlphaServer. Compiler-directed page coloring leads to significant performance improvements for several applications. Overall, our technique improves the SPEC95fp rating for eight processors by 8% over Digital UNIX's page mapping policy and by 20% over page coloring, a standard page mapping policy. The SUIF compiler achieves a SPEC95fp ratio of 57.4, the highest ratio to date.
[ Postscript ]

B. Calder, C. Krintz, S. John, T. Austin, "Cache-Conscious Data Placement", Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache performance by mapping code with temporal locality to different cache blocks in the virtual address space eliminating cache conflicts. These code placement techniques can be applied directly to the problem of placing data for improved data cache performance.

In this paper we present a general framework for Cache Conscious Data Placement. This is a compiler-directed approach that creates an address placement for the stack (local variables), global variables, heap objects, and constants in order to reduce data cache misses. The placement of data objects is guided by a temporal relationship graph between objects generated via profiling. Our results show that profile-driven data placement significantly reduces the data miss rate by 24% on average.
[ Postscript ]

M. Crovella, R. Frangioso, M. Harchol-Balter, "Connection Scheduling in Web Servers," Proceedings of the USENIX Conference on Internet Technologies, October 1999.

Under high loads, a Web server may be servicing many hundreds of connections concurrently. In traditional Web servers, the question of the order in which concurrent connections are serviced has been left to the operating system. In this paper we ask whether servers might provide better service by using non-traditional service ordering. In particular, for the case when a Web server is serving static files, we examine the costs and benefits of a policy that gives preferential service to short connections. We start by assessing the scheduling behavior of a commonly used server (Apache running on Linux) with respect to connection size and show that it does not appear to provide preferential service to short connections. We then examine the potential performance improvements of a policy that does favor short connections (shortest-connection-first). We show that mean response time can be improved by factors of four or five under shortest-connection-first, as compared to an (Apache-like) size-independent policy. Finally, we assess the costs of shortest-connection-first scheduling in terms of unfairness (i.e., the degree to which long connections suffer). We show that under shortest-connection-first scheduling, long connections pay very little penalty. This surprising result can be understood as a consequence of heavy-tailed Web server workloads, in which most connections are small, but most server load is due to the few large connections. We support this explanation using analysis.
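
Both effects -- the large win in mean response time and the tiny penalty to long connections -- can be seen in a toy single-server model. This is a deliberately simplified batch sketch (all jobs present at time zero, sizes chosen to be heavy-tailed-ish), not the paper's trace-driven analysis:

```python
def mean_response(sizes):
    """Mean completion time when one server runs the jobs to
    completion in the given order; all jobs arrive at time 0."""
    t = total = 0.0
    for s in sizes:
        t += s          # this job finishes at the running sum of work
        total += t
    return total / len(sizes)

# Most connections are tiny, but the few huge ones carry most of the
# load.  Arrival order puts the huge ones first -- the worst case for
# a size-blind policy.
arrival_order = [500] * 5 + [1] * 95
size_blind = mean_response(arrival_order)                 # 2495.6
shortest_first = mean_response(sorted(arrival_order))     # 125.35
```

Shortest-first cuts mean response time by roughly 20x here, yet the last long job finishes at 2595 instead of 2500 -- under a 4% delay -- because the shorts it must now wait behind contribute almost no total work.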
[ Postscript ]

G. Ganger, Y. Patt, "Metadata Update Performance in File Systems," USENIX Symposium on Operating Systems Design and Implementation, November, 1994.

Structural changes, such as file creation and block allocation, have consistently been identified as file system performance problems in many user environments. We compare several implementations that maintain metadata integrity in the event of a system failure but do not require changes to the on-disk structures. In one set of schemes, the file system uses asynchronous writes and passes ordering requirements to the disk scheduler. These scheduler-enforced ordering schemes outperform the conventional approach (synchronous writes) by more than 30 percent for metadata update intensive benchmarks, but are suboptimal mainly due to their inability to safely use delayed writes when ordering is required. We therefore introduce soft updates, an implementation that asymptotically approaches memory-based file system performance (within 5 percent) while providing stronger integrity and security guarantees than most UNIX file systems. For metadata update intensive benchmarks, this improves performance by more than a factor of two when compared to the conventional approach.
[ Postscript ]

R. Gooch, "I/O Event Handling Under Linux," Part of the LINUX FAQ, June 1998.

I/O Event handling is about how your Operating System allows you to manage a large number of open files (file descriptors in UNIX/POSIX, or FDs) in your application. You want the OS to notify you when FDs become active (have data ready to be read or are ready for writing). Ideally you want a mechanism that is scalable. This means a large number of inactive FDs cost very little in memory and CPU time to manage.

First I should start off by stating my goal: to develop a thin interface that makes best use of the facilities the OS provides, scaling as much as possible. Where possible, use of standard POSIX facilities is preferred. A secondary goal is to provide a convenient interface for applications. The reason for preferring standard POSIX interfaces is that it introduces fewer conditional blocks in the interface code.
[ HTML ]

J.L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, 1996.

The task the computer designer faces is a complex one: Determine what attributes are important for a new machine, then design a machine to maximize performance while staying within cost constraints. This task has many aspects, including instruction set design, functional organization, logic design, and implementation. The implementation may encompass integrated circuit design, packaging, power, and cooling. Optimizing the design requires familiarity with a very wide range of technologies, from compilers and operating systems to logic design and packaging.

J. C. Hoe, "Improving the Start-up Behavior of a Congestion Control Scheme in TCP," Proceedings of ACM SIGCOMM, pp. 270-280, August 1996.

Based on experiments conducted in a network simulator and over real networks, this paper proposes changes to the congestion control scheme in current TCP implementations to improve its behavior during the start-up period of a TCP connection.

The scheme, which includes Slow-start, Fast Retransmit, and Fast Recovery algorithms, uses acknowledgments from a receiver to dynamically calculate reasonable operating values for a sender's TCP parameters governing when and how much a sender can pump into the network. During the start-up period, because a TCP sender starts with default parameters, it often ends up sending too many packets too fast, leading to multiple losses of packets from the same window. This paper shows that recovery from losses during the start-up period is often unnecessarily time-consuming.

In particular, using the current Fast Retransmit algorithm, when multiple packets in the same window are lost, only one of the packet losses may be recovered by each Fast Retransmit; the rest are often recovered by Slow-start after a usually lengthy retransmission timeout. Thus, this paper proposes changes to the Fast Retransmit algorithm so that it can quickly recover from multiple packet losses without waiting unnecessarily for the timeout. These changes, tested in the simulator and on the real networks, show significant performance improvements, especially for short TCP transfers. The paper also proposed other changes to help minimize the number of packets lost during the start-up period.
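
The limitation is easy to see from the standard Fast Retransmit rule itself: a retransmission is triggered only when the same cumulative ACK arrives three extra times, and each trigger resends just the one segment at that ACK point. A hedged sketch of that trigger logic (this models the standard algorithm the paper criticizes, not Hoe's proposed changes):

```python
DUP_ACK_THRESHOLD = 3

def fast_retransmit_trigger(acks):
    """Given the stream of cumulative ACK numbers a sender observes,
    return the sequence numbers standard Fast Retransmit would resend:
    the segment just above any ACK value seen 3 duplicate times."""
    retransmits, seen = [], {}
    for a in acks:
        seen[a] = seen.get(a, 0) + 1
        if seen[a] == DUP_ACK_THRESHOLD + 1:   # original + 3 duplicates
            retransmits.append(a)              # resend segment starting at a
    return retransmits
```

With segments 3 and 7 both lost from one window, the receiver's ACK stream stays pinned at 3 -- [1, 2, 3, 3, 3, 3, 3, 3] -- so only segment 3 is fast-retransmitted; the loss of segment 7 is discovered only later, typically via a lengthy retransmission timeout, which is exactly the behavior the paper's changes target.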
[ Postscript ]

J. C. Hu, I. Pyarali, D. C. Schmidt, "Measuring the Impact of Event Dispatching and Concurrency Models on Web Server Performance Over High-Speed Networks," Proceedings of the 2nd IEEE Global Internet Conference, November 1997.

This paper provides two contributions to the study of high-performance Web servers. First it outlines the optimizations necessary to build efficient and scalable Web servers and illustrates how we've applied some of these optimizations to create JAWS. JAWS is a high-performance Web server that is explicitly designed to alleviate overheads incurred by existing Web servers on high-speed networks. It consistently outperforms existing Web servers (such as Apache, Java Server, PHTTPD, Zeus, and Netscape Enterprise) over 155 Mbps ATM networks on UNIX platforms.

Second, this paper describes how we have customized JAWS to leverage advanced features of Windows NT for multi-processor hardware over ATM. The Windows NT features used in JAWS include asynchronous mechanisms for connection establishment and data transfer. Our previous benchmarking studies demonstrate that once the overhead of disk I/O is reduced to a negligible constant factor (e.g., via memory caches), the primary determinants of Web server performance are the concurrency and event dispatching strategies.

Our performance results over a 155Mbps ATM link indicate that certain Windows NT asynchronous I/O mechanisms (i.e., TransmitFile) provide superior performance for large file transfers compared with conventional synchronous multi-threaded servers. On the other hand, synchronous event dispatching performed better for files less than 50 Kbytes. Thus, to provide optimal performance, Web servers should be adaptive, choosing to use different mechanisms (e.g., TransmitFile) to handle requests for large files, while using alternative I/O mechanisms (e.g., synchronous event dispatching) for requests for small files.
[ Postscript PDF ]

J. C. Hu, D. C. Schmidt, "JAWS: A Framework for High-Performance Web Servers"

Developers of communication software face many challenges. Communication software contains both inherent complexities, such as fault detection and recovery, and accidental complexities, such as the continuous re-discovery and re-invention of key concepts and components. Meeting these challenges requires a thorough understanding of object-oriented application frameworks and patterns. This paper illustrates how we have applied frameworks and patterns for communication software to develop a high-performance Web server called JAWS.

JAWS is an object-oriented framework that supports the configuration of various Web server strategies, such as a Thread Pool concurrency model with asynchronous I/O and LRU caching vs. a Thread-per-Request concurrency model with synchronous I/O and LFU caching. Because JAWS is a framework, these strategies can be customized systematically and measured both independently and collaboratively to determine the best strategy profiles. Using these profiles, JAWS can statically and dynamically adapt its behavior to deploy the most effective strategies for a given software/hardware platform and workload. JAWS' adaptive software features make it a powerful application framework for constructing high-performance Web servers.
[ Postscript PDF ]

K. Lai, M. Baker, "A Performance Comparison of UNIX Operating Systems on the Pentium," Proceedings of USENIX Technical Conference, January 1996.

This paper evaluates the performance of three popular versions of the UNIX operating system on the x86 architecture: Linux, FreeBSD, and Solaris. We evaluate the systems using freely available micro- and application benchmarks to characterize the behavior of their operating system services. We evaluate the currently available major releases of the systems "as is," without any performance tuning.

Our results show that the x86 operating systems and system libraries we tested failed to deliver the Pentium's full memory write performance to applications. On small-file workloads, Linux is an order of magnitude faster than other systems. On networking software, FreeBSD provides two to three times higher bandwidth than Linux. In general, Solaris performance usually lies between that of the other two systems.

Although each operating system out-performs the others in some area, we conclude that no one system offers clearly better overall performance. Other factors, such as extra features, ease of installation, or freely available source code, are more convincing reasons for choosing a particular system.
[ Postscript HTML ]

D. Maltz, P. Bhagwat, "TCP Splicing for Application Layer Proxy Performance," IBM Research Report RC 21139, March 1998.

Application layer proxies already play an important role in today's networks, serving as firewalls and HTTP caches -- and their role is being expanded to include encryption, compression, and mobility support services. Current application layer proxies suffer major performance penalties as they spend most of their time moving data back and forth between connections, context switching and crossing protection boundaries for each chunk of data they handle. We present a technique called TCP Splice that provides kernel support for data relaying operations which runs at near router speeds. In our lab testing, we find SOCKS firewalls using TCP Splice can sustain a data throughput twice that of normal firewalls, with an average packet forwarding latency 30 times less.
[ Postscript ]

M. Mathis, J. Mahdavi, "Diagnosing Internet Congestion with a Transport Layer Performance Tool," INET '96, Montreal, Quebec.

The Internet is once again suffering from its own success. The massive increase in presented load has left many large users with insufficient bandwidth to support their applications. This problem is exacerbated by the lack of performance measures by which users can compare IP providers.

We have focused on one particular metric, bulk transfer capacity, and a tool, "TReno," which can function as a basis for a formal bulk transfer metric. Bulk transfer capacity requires a very clean network, and has other properties that make it an important gauge of network performance. We introduce our tool, TReno, and describe its relation to current TCP implementations. Our research with TReno led us to key observations about TCP behavior and influenced details of the new specifications for the TCP Selective acknowledgment (SACK) option. Currently TReno does not precisely estimate TCP's performance in the Internet. As SACK is deployed, and the research community reaches consensus regarding its proper behavior, TReno and state-of-the-art TCP will converge to identical performance and behavior.

The IP Provider Metrics subcommittee of the IETF is developing standardized formal metrics for providers. There is a substantial amount of work to be completed in the areas of measurement procedures and statistics before TReno can become a standardized metric. Our goal is for the TReno development effort and the IPPM standards efforts to converge.

TReno will maximize its potential as a metric if it also functions well as a diagnostic. If a long multi-provider path between two users is not performing as well as desired, it would be extremely valuable to use TReno to measure the path hop-by-hop or provider-by-provider to localize the problem.
[ HTML ]

M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP Selective Acknowledgment Options," RFC 2018, October 1996.

TCP may experience poor performance when multiple packets are lost from one window of data. With the limited information available from cumulative acknowledgments, a TCP sender can only learn about a single lost packet per round trip time. An aggressive sender could choose to retransmit packets early, but such retransmitted segments may have already been successfully received.

A Selective Acknowledgment (SACK) mechanism, combined with a selective repeat retransmission policy, can help to overcome these limitations. The receiving TCP sends back SACK packets to the sender informing the sender of data that has been received. The sender can then retransmit only the missing data segments.

This memo proposes an implementation of SACK and discusses its performance and related issues.
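
The sender-side computation the memo enables is straightforward: the cumulative ACK plus the receiver's SACK blocks partition the window into received ranges and holes, and only the holes need retransmission. A minimal sketch under that reading (the function name and tuple representation are illustrative, not the RFC's wire format):

```python
def sack_holes(cum_ack, sack_blocks):
    """Given the cumulative ACK point and the receiver's SACK blocks
    as (left_edge, right_edge) ranges of data received above it,
    return the holes -- the ranges the sender can selectively
    retransmit instead of resending everything from cum_ack."""
    holes, edge = [], cum_ack
    for left, right in sorted(sack_blocks):
        if left > edge:
            holes.append((edge, left))   # gap between received ranges
        edge = max(edge, right)
    return holes
```

With cumulative acknowledgments alone, the sender in this situation would learn about only one hole per round trip; the SACK blocks expose all of them at once, which is what makes selective repeat possible.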
[ Text ]

M. Mathis, J. Mahdavi, G. L. Huntoon, "TCP Enhancements to Support High Performance Applications," research proposal published on the web, August 1998.

Existing TCP implementations are inadequate to support high performance applications and services over national scale, high speed networks. The Pittsburgh Supercomputing Center requests funding for a project to design, develop, implement and test performance enhancements to TCP implementations. These enhancements will improve TCP based application performance over high latency, high bandwidth production networks. To accomplish this, we will focus on the Center's existing high performance computer and network environment: production supercomputers attached to the present and future wide area high speed networks.

Our ultimate goal is to eliminate TCP itself as a bottleneck, so that applications use appropriate shares of the underlying IP infrastructure with only minor latency or bandwidth penalties beyond those present in the lower layers. In other words, for all TCP connections, performance should be limited only by the application itself or congestion on an underlying network link. The only exceptions are during transients in the data rate, such as during connection startup. Furthermore it should not require network expertise to write such applications. Ultimately, all straightforward, non-privileged application codes using the standard network socket interface on a high performance computer will benefit from this work.

To accomplish this goal, we propose a three year research and development project. Our primary objective is to produce an enhanced TCP that will better support high performance applications over national scale, high speed networks. We will address other important objectives and produce valuable tools for future work in this area.
[ HTML ]

L. McVoy, "Splice -- a push pull I/O model for Unix"

This document describes a strawman proposal for a new model for I/O in a Unix Operating System. The goals of the new model include low overhead I/O, scatter-gather for zero copy networking to and from disks, async I/O, simple implementation, and compatibility with the existing kernel implementation. The model is fundamentally different from the existing model, but easily provides compatibility with the standard read and write interfaces. If desired, a kernel could move entirely to this model, building the old interfaces on top, somewhat similarly to what Sun did with mmap. If this idea ever lives up to its full potential, it could be as useful as the mmap work.
[ Postscript ]

J. C. Mogul, K. K. Ramakrishnan, "Eliminating Receive Livelock in an Interrupt-driven Kernel," Proceedings of USENIX Technical Conference, January 1996.

Most operating systems use interface interrupts to schedule network tasks. Interrupt-driven systems can provide low overhead and good latency at low offered load, but degrade significantly at higher arrival rates unless care is taken to prevent several pathologies. These are various forms of receive livelock, in which the system spends all its time processing interrupts, to the exclusion of other necessary tasks. Under extreme conditions, no packets are delivered to the user application or the output of the system.

To avoid livelock and related problems, an operating system must schedule network interrupt handling as carefully as it schedules process execution. We modified an interrupt-driven networking implementation to do so; this eliminates receive livelock without degrading other aspects of system performance. We present measurements demonstrating the success of our approach.
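
One of the paper's central remedies is to bound input processing: on an interrupt the driver disables further interrupts and polls, but processes at most a fixed quota of packets before yielding the CPU, so packet arrival can never fully starve the application. A toy sketch of that quota discipline (names and the list-based queue are illustrative):

```python
def process_with_quota(input_queue, quota):
    """Drain at most `quota` packets from the input queue, then yield.
    Bounding the work done per polling pass is what prevents receive
    livelock: under overload, arriving packets can no longer consume
    100% of the CPU to the exclusion of the application."""
    delivered = []
    while input_queue and len(delivered) < quota:
        delivered.append(input_queue.pop(0))
    return delivered, input_queue
```

In a purely interrupt-driven kernel the analogue of this loop has no quota, so at high arrival rates every cycle goes to input processing and delivered work drops to zero -- the livelock the measurements in the paper demonstrate.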
[ Postscript ]

D. Mosberger, T. Jin, "httperf -- A Tool for Measuring Web Server Performance," SIGMETRICS Workshop on Internet Server Performance, June 1998.

This paper describes httperf, a tool for measuring web server performance. It provides a flexible facility for generating various HTTP workloads and for measuring server performance. The focus of httperf is not on implementing one particular benchmark but on providing a robust, high-performance tool that facilitates the construction of both micro- and macro-level benchmarks. The three distinguishing characteristics of httperf are its robustness, which includes the ability to generate and sustain server overload, support for the HTTP/1.1 protocol, and its extensibility to new workload generators and performance measurements. In addition to reporting on the design and implementation of httperf this paper also discusses some of the experiences and insights gained while realizing this tool.
[ HTML Postscript ]
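The ability to "generate and sustain server overload" comes from open-loop load generation: requests are issued on a fixed schedule regardless of how fast the server responds. A minimal model of the distinction (my own illustration, not httperf's implementation):

```python
def offered_vs_completed(rate, capacity, duration):
    """Open-loop generation (httperf's model): requests are issued at a
    fixed rate independent of responses, so offered load can exceed
    server capacity and sustain genuine overload. A closed-loop
    generator cannot: it slows to the server's own pace."""
    offered = rate * duration
    completed = min(offered, capacity * duration)
    return offered, completed
```

For example, offering 500 req/s to a 100 req/s server for 10 seconds yields 5000 offered and only 1000 completed requests; a closed-loop generator would never have offered more than the server could absorb.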

B. J. Murphy, S. Zeadally, C. J. Adams, "An Analysis of Process and Memory Models to Support High-Speed Networking in a UNIX Environment," Proceedings of USENIX Technical Conference, January 1996.

In order to reap the benefits of high-speed networks, the performance of the host operating system must at least match that of the underlying network. A barrier to achieving high throughput is the cost of copying data within current host architectures. We present a performance comparison of three styles of network device driver designed for a conventional monolithic UNIX kernel. Each driver performs a different number of copies. The zero-copy driver works by allowing the memory on the network adapter to be mapped directly into user address space. This maximises performance at the cost of: 1) breaking the semantics of existing network APIs such as BSD sockets and SVR4 TLI; 2) pushing responsibility for network buffer management up from the kernel into the application layer. The single-copy driver works by copying data directly between user space and adapter memory obviating the need for an intermediate copy into kernel buffers in main memory. This approach can be made transparent to existing application code but, like the zero-copy case, relies on an adapter with a generous quantity of on-board memory for buffering network data. The two-copy driver is a conventional STREAMS driver. The two-copy approach sacrifices performance for generality. We observe that the STREAMS overhead for small packets is significant. We report on the benefit of the hardware cache in ameliorating the effect of the second copy, although we note that streaming network data through the cache reduces the level of cache residency seen by the rest of the system.

A barrier to achieving low jitter is the non-deterministic nature of many operating system schedulers. We describe the implementation and report on the performance of a kernel streaming driver that allows data to be copied between a network adapter and another I/O device without involving the process scheduler. This provides performance benefits in terms of increased throughput, increased CPU availability and reduced jitter.
[ Postscript ]

V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: A unified I/O buffering and caching system," Rice University CS Technical Report TR97-294.

This paper presents the design, implementation, and evaluation of IO-Lite, a unified I/O buffering and caching system. IO-Lite unifies all buffering and caching in the system, to the extent permitted by the hardware. In particular, it allows applications, interprocess communication, the filesystem, the file cache, and the network subsystem to share a single physical copy of the data safely and concurrently. Protection and security are maintained through a combination of access control and read-only sharing. The various subsystems use (mutable) buffer aggregates to access the data according to their needs. IO-Lite eliminates all copying and multiple buffering of I/O data, and enables various cross-subsystem optimizations. Performance measurements show significant performance improvements on Web servers and other I/O intensive applications.
[ Postscript ]
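The phrase "(mutable) buffer aggregates" is the key mechanism: subsystems hold a cheaply editable list of slices over immutable, shared buffers, so data crosses subsystem boundaries without being copied. A minimal sketch of that shape (class and method names are mine, not the paper's API):

```python
class BufferAggregate:
    """IO-Lite-style buffer aggregate: a mutable sequence of zero-copy
    views over immutable shared byte buffers. Subsystems edit the
    aggregate cheaply; the underlying data stays shared read-only, and
    bytes are copied only at final export."""
    def __init__(self):
        self.chunks = []

    def append(self, buf, start=0, end=None):
        # store a zero-copy view into the shared buffer
        self.chunks.append(memoryview(buf)[start:end])

    def tobytes(self):
        # the only place a copy happens, at the consumer's request
        return b"".join(self.chunks)
```

A web server, for instance, could build a response by appending its own header plus a slice of a cached page, so the file cache, network stack, and application all reference one physical copy of the page.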

V. Paxson, "Automated Packet Trace Analysis of TCP Implementations," Proceedings of ACM SIGCOMM, 1997.

We describe tcpanaly, a tool for automatically analyzing a TCP implementation's behavior by inspecting packet traces of the TCP's activity. Doing so requires surmounting a number of hurdles, including detecting packet filter measurement errors, coping with ambiguities due to the distance between the measurement point and the TCP, and accommodating a surprisingly large range of behavior among different TCP implementations. We discuss why our efforts to develop a fully general tool failed, and detail a number of significant differences among 8 major TCP implementations, some of which, if ubiquitous, would devastate Internet performance. The most problematic TCPs were all independently written, suggesting that correct TCP implementation is fraught with difficulty. Consequently, it behooves the Internet community to develop testing programs and reference implementations.
[ Postscript ]

M. Russinovich, "Inside I/O Completion Ports," August 1998.

The goal of a server is to incur as few context switches as possible by having its threads avoid unnecessary blocking, while at the same time maximizing parallelism by using multiple threads. The ideal is for there to be a thread actively servicing a client request on every processor and for those threads not to block if there are additional requests waiting when they complete a request. For this to work correctly however, there must be a way for the application to activate another thread when one processing a client request blocks on I/O (like when it reads from a file as part of the processing).

Windows NT 3.5 introduced a set of APIs that make this goal relatively easy to achieve. The APIs are centered on an object called a completion port. In this article I'm going to provide an overview of how completion ports are used and then go inside them to show you how Windows NT implements them.
[ HTML ]
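The queueing discipline behind a completion port, parked worker threads blocking on a shared queue, with each completion packet waking exactly one of them, can be mimicked in a few lines. The real Windows NT API is CreateIoCompletionPort/GetQueuedCompletionStatus; this Python sketch only mirrors the dispatch pattern, not the kernel's concurrency throttling:

```python
import queue
import threading

def serve(requests, workers=4):
    """Toy analogue of a completion port: a shared queue that a pool of
    worker threads blocks on; each queued completion wakes one worker."""
    port = queue.Queue()
    results = []

    def worker():
        while True:
            item = port.get()          # like GetQueuedCompletionStatus
            if item is None:
                return                 # shutdown sentinel
            results.append(item * 2)   # "service the client request"

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for r in requests:
        port.put(r)                    # like PostQueuedCompletionStatus
    for _ in threads:
        port.put(None)
    for t in threads:
        t.join()
    return results
```

What this sketch does not capture is the part the article highlights: a real completion port also tracks how many of its threads are runnable and releases waiters only when the active count drops below the target concurrency, which is what keeps context switches low.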

M. L. Schmit, Pentium(tm) Processor Optimization Tools, Academic Press, Inc., 1995.

This book shows C/C++ and assembly language programmers how to improve the performance of their programs by taking advantage of the superscalar architecture of the Intel(R) Pentium(tm) CPU. Programs written originally for the 8088 through 80486 can be optimized for the Pentium, resulting in a 100% to 400% performance increase.

J. Semke, J. Mahdavi, M. Mathis, "Automatic TCP Buffer Tuning," Proceedings of ACM SIGCOMM, pp. 315-323, September 1998.

With the growth of high performance networking, a single host may have simultaneous connections that vary in bandwidth by as many as six orders of magnitude. We identify requirements for an automatically-tuning TCP to achieve maximum throughput across all connections simultaneously within the resource limits of the sender. Our auto-tuning TCP implementation makes use of several existing technologies and adds dynamically adjusting socket buffers to achieve maximum transfer rates on each connection without manual configuration.

Our implementation involved slight modifications to a BSD-based socket interface and TCP stack. With these modifications, we achieved drastic improvements in performance over large bandwidth*delay paths compared to the default system configuration, and a significant reduction in memory usage compared to hand-tuned connections, allowing many more simultaneous connections.
[ Postscript ]
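The quantity being tuned is the bandwidth-delay product: a connection's socket buffer must hold at least bandwidth × RTT bytes of in-flight data to keep the pipe full. A static sketch of the calculation (the paper's contribution is doing this dynamically per connection within sender memory limits; the numbers below are illustrative):

```python
import socket

def tune_rcvbuf(sock, bandwidth_bps, rtt_s):
    """Size the receive buffer to the path's bandwidth*delay product,
    the minimum window needed to keep the pipe full. (The kernel may
    clamp the requested value to its own limits.)"""
    bdp_bytes = int(bandwidth_bps / 8 * rtt_s)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    return bdp_bytes

# e.g. a 100 Mbit/s path with 50 ms RTT needs ~625 kB of buffering,
# while a modem-speed path needs a few hundred bytes -- the six orders
# of magnitude spread that motivates per-connection auto-tuning
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
needed = tune_rcvbuf(s, 100_000_000, 0.05)
s.close()
```

A single hand-tuned global default must be sized for the fastest path, wasting memory on every slow connection, which is the over-allocation the paper's dynamic approach eliminates.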

C. Staelin, L. McVoy, "lmbench: Portable tools for performance analysis," Proceedings of USENIX Technical Conference, pp. 279-284, January 1996.

lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking. In almost all cases, the individual tests are the result of analysis and isolation of a customer's actual performance problem. These tools can be, and currently are, used to compare different system implementations from different vendors. In several cases, the benchmarks have uncovered previously unknown bugs and design flaws. The results have shown a strong correlation between memory system performance and overall performance. lmbench includes an extensible database of results from systems current as of late 1995.
[ Postscript ]
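The basic shape of an lmbench-style micro-benchmark is a tight timed loop around one primitive operation, reporting per-operation latency. A minimal harness (a sketch of the method, not lmbench itself; it does not subtract loop overhead as a careful benchmark would):

```python
import time

def latency_ns(op, iters=200_000):
    """Time a tight loop around a single primitive operation and
    report the mean per-operation latency in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iters):
        op()
    return (time.perf_counter_ns() - start) / iters

# e.g. the cost of reading the clock, one of the cheapest kernel
# crossings available from user space:
clock_cost = latency_ns(time.monotonic)
```

Comparing such numbers across machines is how lmbench isolates building-block costs (syscall entry, context switch, memory latency) rather than whole-application throughput.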

B. Zorn, "Malloc/Free and GC Implementations," March 1998.

This is an annotated table of contents for well-known memory allocation implementations, including Wolfram Gloger's ptmalloc, used in glibc-2.0.
[ HTML ]

Copyright 1999 University of Michigan Regents. All rights reserved.