Projects: Linux scalability: Annotated bibliography
T. E. Anderson, B. N. Bershad, E. D. Lazowska, H. M. Levy,
"Scheduler activations : effective kernel support
for the user-level management of parallelism,"
ACM Transactions on Computer Systems Vol.10,
No. 1 (Feb. 1992), pp. 53-79
Threads are the vehicle for concurrency in many approaches to
parallel programming. Threads can be supported either by the
operating system kernel or by user-level library code in the
application address space, but neither approach has been fully
This paper addresses this dilemma. First, we argue that
the performance of kernel threads is inherently worse than that of
user-level threads, rather than this being an artifact of
existing implementations; managing parallelism at the user level is
essential to high-performance parallel computing. Next, we argue that
the problems encountered in integrating user-level threads
with other system services is a consequence of the lack of kernel support
for user-level threads provided by contemporary
multiprocessor operating systems; kernel threads are
the wrong abstraction on which to support user-level management of
parallelism. Finally, we describe the design, implementation,
and performance of a new kernel interface and user-level thread
package that together provide the same functionality as kernel threads
without compromising the performance and flexibility
advantages of user-level management of parallelism.
M. Aron, P. Druschel, "Soft timers: efficient microsecond
software timer support for network processing,"
Proceedings of the 17th Symposium on Operating Systems Principles
(SOSP'99), Kiawah Island Resort, SC, December 1999.
This paper proposes and evaluates soft timers, a new operating system
facility that allows the efficient scheduling of software events
at a granularity down to tens of microseconds.
Soft timers can be used to avoid interrupts and reduce context
switches associated with network processing without sacrificing
low communication delays.
More specifically, soft timers enable transport protocols like TCP
to efficiently perform rate-based clocking of packet transmissions.
Experiments show that rate-based clocking can improve HTTP response
time over connections with high bandwidth-delay products by up to
89% and that soft timers allow a server to employ rate-based clocking
with little CPU overhead (2-6%) at high aggregate bandwidths.
Soft timers can also be used to perform network polling, which
eliminates network interrupts and increases the
memory access locality of the network subsystem without
Experiments show that this technique can improve the throughput
of a Web server by up to 25%.
H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, R. H. Katz, "TCP
Behavior of a Busy Internet Server: Analysis and Improvements,"
of IEEE Infocom, March 1998.
The rapid growth of the World Wide Web in recent years has caused
a significant shift in the composition of Internet traffic.
Although past work has studied the behavior of TCP dynamics in the context
of bulk-transfer applications and some studies have begun to investigate
the interactions of TCP and HTTP, few have used extensive
real-world traffic traces to examine the problem.
This interaction is interesting because of the way
in which current Web brouwsers use TCP connections:
multiple concurrent short connections from a single host.
In this paper, we analyze the way in which Web browsers use TCP
connections based on extensive traffic traces obtained from a busy
Web server (the offical Web server of the 1996 Atlanta Olympic
games). At the time of operation, this Web server was one of the
busiest on the Internet, handling tens of millions of requests per day
from several hundred thousand clients. We first describe the techniques
used to gather these traces and reconstruct the behavior of TCP on the
server. We then present a detaied analysis of TCP's loss
recovery and congestion control behavior from the recorded
transfers. Our two most important results are: (1) short Web transfers
lead to poor loss recovery performance for TCP, and (2) concurrent
connections are overly aggressive users of the network.
We then discuss techniques designed to solve these problems.
To improve the data-driven loss recovery performance of short
transfers, we present a new enhancement to TCP's loss recovery.
To improve the congestion control and loss recovery performance
of parallel TCP connections, we present
a new integrated approach to congestion control and loss recovery
that works across the set of concurrent connections.
Simulations and trace analysis show that our enhanced loss recovery scheme could
have eliminated 25% of all timeout events, and that
our integrated approach provides greater fairness
and improved startup performance for concurrent connections.
Our solutions are more general than application-specific enhancements
such as the use of persistent connections in P-HTTP and HTTP/1.1,
and addresses issues, such as improved TCP loss recovery,
that are not considered by them.
A. Balsa, "Linux Benchmarking HOWTO,"
Linux Documentation Project,
The Linux Benchmarking HOWTO discusses some issues associated with the
benchmarking of Linux systems and presents a basic benchmarking toolkit,
as well as an associated form, which enable one to produce significant
benchmarking information in a couple of hours.
G. Banga, J. C. Mogul, "Scalable kernel performance for Internet servers
under realistic load,"
Proceedings of USENIX Annual Technical Conference,
UNIX Internet servers with an event-driven architecture often perform
poorly under real workloads, even if they perform well under laboratory
We investigated the poor performance of event-driven servers.
We found that the delays typical in wide-area networks cause busy servers
to manage a large number of simultaneous connections.
We also observed that the select() system call implementation
in most UNIX kernels scales poorly with the number of connections being
managed by a process.
The UNIX algorithm for allocating file descriptors also scales poorly.
These algorithmic problems lead directly to the poor performance
of event-driven servers.
We implemented scalable versions of the select() system call and the
descriptor allocation algorithm.
This led to an improvement of up to 58% in Web proxy and Web server throughput,
and dramatically improved the scalability of the system.
G. Banga, P. Drushel, J. C. Mogul, "Better operating system features for
faster network servers,"
SIGMETRICS Workshop on Internet Server Performance, June 1998.
Widely-used operating systems provide inadequate support for
large-scale Internet server applications.
Their algorithms and interfaces fail to efficiently support
either event-driven or multi-threaded
They provide poor control over the scheduling and management of
machine resources, making it difficult to provide robust
and controlled service.
We propose new UNIX interfaces to improve scalability, and to
provide fine-grained scheduling and resource management.
G. Banga, P. Drushel, J. C. Mogul, "Measuring the Capacity of a Web Server,"
of USENIX Symposium on Internet Technologies and Systems, 1997.
The widespread use of the World Wide Web and related applications places
interesting perforrnance demand on network servers. The ability to measure
the effect of these demands is important for tuning and optimizaing the
various software components that make up a Web server. To measure these
effects, it is necessary to generate realistic HTTP client requests.
accurate generation of such traffic in a testbed of limited scope is not
trivial. In particular, the commonly used approach is unable to generate
client request-rates that exceed the capacity of the server being tested
even for short periods of time. This paper examines pitfalls that one encounters
when measuring Web server capacity using a synthetic workload. We propose
and evaluate a new method for Web traffic generation that can generate
bursty trafic, with peak loads that exceed the capacity of the server.
Finally, we use the proposed method to measure the performance of a Web
M. Beck, H. Boehme, M. Dziadzka, U. Kunitz, R. Magnus,
D. Verworner, Linux Kernel Internals, 2nd ed., tr.
Addison Wesley Longman, 1998. ISBN 0-201-33143-8
This is the second edition of our book about the LINUX kernel.
The book has been updated to cover the 2.0 version of the kernel,
which is a milestone in the development of LINUX.
E. D. Berger, R. D. Blumofe,
"Hoard: A Fast, Scalable, and Memory-Efficient Allocator
for Shared-Memory Multiprocessors,"
The University of Texas at Austin, Department of Computer Sciences.
Technical Report TR-99-22. September 1999.
In this paper, we present Hoard,
a memory allocator for shared-memory multiprocessors.
We prove that its worst-case memory fragmentation
is asymptotically equivalent to that
of an optimal uniprocessor allocator.
We present experiments that
demonstrate its speed and scalability.
L. S. Brakmo, L. L. Peterson, "TCP Vegas: End to End Congestion
Avoidance on a Global Internet," IEEE Journal on Selected Areas
in Communication, 13(8):1465-1480, October 1995.
Vegas is an implementation of TCP that achieves between 37 and 71%
better throughput on the Internet, with one-fifth to one-half the
losses, as compared to the implementation of TCP in the Reno
distribution of BSD Unix. This paper motivates and describes the
three key techniques employed by Vegas, and presents the results
of a comprehensive experimental performance study-- using both
simulations and measurements on the Internet-- of the Vegas and
Reno implementations of TCP.
E. Bugnion, J. M. Anderson, T. C. Mowry, M. Rosenblum, and M. S. Lam.,
"Compiler-Directed Page Coloring for Multiprocessors,"
Proceedings of The Seventh International Symposium on Architectural
Support for Programming Languages and Operating Systems, 1998.
This paper presents a new technique, compiler-directed page coloring,
that eliminates conflict misses in
multiprocessor applications. It enables applications to make better use
of the increased aggregate cache size
available in a multiprocessor. This technique uses the compiler's knowledge
of the access patterns of the
parallelized applications to direct the operating system's
virtual memory page mapping strategy. We demonstrate
that this technique can lead to significant performance improvements
over two commonly used page mapping
strategies for machines with either direct-mapped or
two-way set-associative caches. We also show that it is
complementary to latency-hiding techniques such as prefetching.
We implemented compiler-directed page coloring in the SUIF parallelizing
compiler and on two commercial
operating systems. We applied the technique to the SPEC95fp benchmark
suite, a representative set of numeric
programs. We used the SimOS machine simulator to analyze the applications
and isolate their performance
bottlenecks. We also validated these results on a real machine,
an eight-processor 350MHz Digital AlphaServer.
Compiler-directed page coloring leads to significant performance
improvements for several applications. Overall,
our technique improves the SPEC95fp rating for eight processors
by 8% over Digital UNIX's page mapping policy
and by 20% over a page coloring, a standard page mapping policy.
The SUIF compiler achieves a SPEC95fp ratio of
57.4, the highest ratio to date.
B. Calder, C. Krintz, S. John, T. Austin, "Cache-Conscious Data
Placement", Proceedings of the Eighth International Conference
on Architectural Support for Programming Languages and Operating
Systems, October 1998.
As the gap between memory and processor speeds continues to widen,
cache efficiency is an increasingly important component of processor
performance. Compiler techniques have been used to improve instruction
cache performance by mapping code with temporal locality to different
cache blocks in the virtual address space eliminating cache conflicts.
These code placement techniques can be applied directly
to the problem of placing data for improved data cache performance.
In this paper we present a general framework for
Cache Conscious Data Placement.
This is a compiler directed approach that
creates an address palcement for the stack (local variables),
global variables, heap objects, and constants in order to reduce data cache
misses. The placement of data objects is guided by a temporal
relationship graph between objects generated via profiling.
Our results show that profile driven data placement significantly reduces
the data miss rate by 24% on average.
M. Crovella, R. Frangioso, M. Harchol-Balter,
"Connection Scheduling in Web Servers,"
Proceedings of the USENIX Conference on Internet Technologies,
Under high loads, a Web server may be servicing many hundreds of
In traditional Web servers, the question of the order in which
concurrent connections are serviced
has been left to the operating system.
In this paper we ask whether servers might provide better
service by using non-traditional service ordering.
In particular, for the case when a Web server
is serving static files, we examine the costs and benefits
of a policy that gives preferential
service to short connections.
We start by assessing the scheduling behavior of a commonly
used server (Apache running on Linux) with respect to connection size
and show that it does
not appear to provide preferential service to short connections.
We then examine the potential
performance improvements of a policy that does favor short connections
We show that mean response time can be improved by factors of four
or five under shortest-connection-first, as compared to an
(Apache-like) size-independent policy.
Finally we assess the costs of shortest-connection-first scheduling
in terms of unfairness (i.e., the degree to
which long connections suffer).
We show that under shortest-connection-first scheduling,
long connections pay very little penalty.
This surprising result can be understood as a consequence
of heavy-tailed Web server workloads,
in which most connections are small, but most server load
is due to the few large connections.
We support his explanation using analysis.
G. Ganger, Y. Patt, "Metadata Update Performance in File Systems,"
USENIX Symposium on Operating Systems Design and Implementation,
Structural changes, such as file creation and block allocation,
have consistently been identified as file system performance
problems in many user environments.
We compare several implementations that maintain metadata integrity
in the event of a system failure but do not require
changes to the on-disk structures.
In one set of schemes, the file system uses asynchronous writes
and passes ordering requirements to the disk scheduler.
These scheduler-enforced ordering schemes outperform the
conventional approach (synchronous writes) by more than 30 percent
for metadata update intensive benchmarks, but are suboptimal
mainly due to their inability to safely use delayed writes when
ordering is required.
We therefore introduce soft updates, an implementation that
asymptotically approaches memory-based file system performance
(within 5 percent) while providing stronger integrity and security
guarantees than most UNIX file systems.
For metadata update intensive benchmarks, this improves performance
by more than a factor of two when compared to the conventional approach.
R. Gooch, "I/O Event Handling Under Linux," Part of the LINUX FAQ,
I/O Event handling is about how your Operating System allows you
to manage a large number of open files (file descriptors in UNIX/POSIX,
or FDs) in your application.
You want the OS to notify you when FDs become active (have
data ready to be read or are ready for writing).
Ideally you want a mechanism that is scalable. This means a large number
of inactive FDs cost very little in memory and CPU time to manage.
First I should start off by stating my goal: to develop a thin interface
that makes best use of the facilities the OS provides
scaling as much as possible.
Where possible, use of standard POSIX facilities are preferred.
A seconday goal is to provide a convenient interface for applications.
The reason for preferring standard POSIX interfaces is that it introduces
less conditional blocks in the interface code.
J.L. Hennessy and D. A. Patterson, Computer Architecture:
A Quantitative Approach, 2nd ed., Morgan Kaufmann, 1996.
The task the computer designer faces is a complex one:
Determine what attributes are important for a new machine,
then design a machine to maximize performance while staying
within cost constraints.
This task has many aspects, including instruction set design,
functional organization, logic design, and implementation.
The implementation may encompass integrated circuit design,
packaging, power, and cooling.
Optimizing the design requires familiarity with a very wide range
of technologies, from compilers and operating systems to logic
design and packaging.
J. C. Hoe, "Improving the Start-up Behavior of a Congestion Control Scheme
Proceedings of ACM SIGCOMM, pp. 270-280, August 1996.
Based on experiments conducted in a network simulator and over
real networks, this paper proposes changes to the congestion control
scheme in current TCP implementations to improve its behavior during
the start-up period of a TCP connection.
The scheme, which includes Slow-start, Fast Retransmit, and Fast
Recovery algorithms, uses acknowledgments from a receiver to
dynamically calculate reasonable operating values for a sender's
TCP parameters governing when and how much a sender can pump into the
During the start-up period, because a TCP sender starts with default
parameters, it often ends up sending too many packets and
too fast, leading to multiple losses of packets from the same window.
This paper shows that recovery from losses during the start-up
period is often unnecessarily time-consuming.
In particular, using the current Fast Retransmit algorithm,
when multiple packets in the same window are lost, only one of the
packet losses may be recovered by each Fast Retransmit; the rest are
often recovered by Slow-start after a usually lengthy retransmission
Thus, this paper proposes changes to the Fast Retransmit algorithm so
that it can quickly recover from multiple packet losses without
waiting unnecessarily for the timeout.
These changes, tested in the simulator and on the real networks, show
significant performance improvements, especially for short TCP transfers.
The paper also proposed other changes to help minimize the number of
packets lost during the start-up period.
J. C. Hu, I. Pyarali, D. C. Schmidt, "Measuring the Impact of Event Dispatching
and Concurrency Models on Web Server Performance Over High-Speed Networks,"
of the 2nd IEEE Global Internet Conference, November 1997.
This paper provides two contributions to the study of high-performance
Web servers. First it outlines the optimizations necessary to build efficient
and scalable Web servers and illustrates how we've applied some of these
optimizations to create JAWS. JAWS is a high-performance Web server that
is explicitly designed to alleviate overheads incurred by existing Web
servers on high-speed networks. It consistently outperforms existing Web
servers (such as Apache, Java Server, PHTTPD, Zeus, and Netscape Enterprise)
over 155 Mbps ATM networks on UNIX platforms.
Second, this paper describes how we have customized JAWS to leverage advanced
features of Windows NT for multi-processor hardware over ATM. The Windows
NT features used in JAWS include asynchronous mechanisms for connection
establishment and data transfer. Our previous benchmarking studies demonstrate
that once the overhead of disk I/O is reduced to a negligible constant
factor (e.g., via memory caches), the primary determinants of Web server
performance are the concurrency and event dispatching strategies.
Our performance results over a 155Mbps ATM link indicate that certain Windows
NT asynchronous I/O mechanisms (i.e., TransmitFile) provide superior
performance for large file transfers compared with conventional synchronous
mutli-threaded servers. On the other hand, synchronous event dispatching
performed better for files less than 50 Kbytes. Thus, to provide optimal
performance, Web servers should be adaptive, choosing to use different
mechanisms (e.g., TransmitFile) to handle requests for large files,
while using alternative I/O mechanisms (e.g., synchronous event dispatching)
for requests for small files.
J. C. Hu, D. C. Schmidt,
"JAWS: A Framework for High-Performance Web Servers"
Developers of communication software face many challenges.
Communication software contains both inherent complexities,
such as fault detection and recovery, and accidental complexities,
such as the continuous re-rediscovery and re-invention
of key concepts and components.
Meeting these challenges requires a thorough understanding of
object-oriented application frameworks and patterns.
This paper illustrates how we have applied frameworks and patterns
for communication software to develop
a high-performance Web server called JAWS.
JAWS is an object-oriented framework that supports the
configuration of various Web server strategies, such as a
Thread Pool concurrency model with asynchronous I/O and
LRU caching vs. a Thread-per-Request concurrency model
with synchronous I/O and LFU caching. Because JAWS is a
framework, these strategies can be customized systematically
and measured both independently and collaboratively to
determine the best strategy profiles.
Using these profiles, JAWS can statically and dynamically adapt
its behavior to deploy the most effective strategies for
a given software/hardware platform and workload.
JAWS' adaptive software features make it a powerful application framework
for consturcting high-performance Web servers.
K. Lai, M. Baker, "A Performance Comparison of UNIX Operating Systems on
Proceedings of USENIX Technical Conference, January
This paper evaluates the performance of three popular versions of the UNIX
operating system on the x86 architecture: Linux, FreeBSD, and Solaris.
We evaluate the systems using freely available micro- and application
benchmarks to characterize the behavior of their operating system services.
We evaluate the currently available major releases of the systems
"as is," without any performance tuning.
Our results show that the x86 operating systems and system libraries we
tested failed to deliver the Pentium's full memory write performance to
On small-file workloads, Linux is an order of magnitude faster than
On networking software, FreeBSD provides two to three times higher
bandwidth than Linux.
In general, Solaris performance usually lies between that of the other two
Although each operating system out-performs the others in some area,
we conclude that no one system offers clearly better overall
Other factors, such as extra features, ease of installation,
or freely available source code, are more convincing reasons for choosing
a particular system.
D. Maltz, P. Bhagwat, "TCP Splicing for Application Layer Proxy Performance,"
IBM Research Report RC 21139, March 1998.
Application layer proxies already play an important role in today's networks,
serving as firewalls and HTTP caches -- and their role is being expanded to
include encryption, compression, and mobility support services. Current
application layer proxies suffer major performance penalties as they spend
most of their time moving data back and forth between connections, context
switching and crossing protection boundaries for each chunk of data they
handle. We present a technique called TCP Splice that provides kernel
support for data relaying operations which runs at near router speeds. In
our lab testing, we find SOCKS firewalls using TCP Splice can sustain a
data throughput twice that of normal firewalls, with an average packet
forwarding latency 30 times less.
[ Postscript ]
M. Mathis, J. Mahdavi, "Diagnosing Internet Congestion with a Transport Layer
Performance Tool," INET '96, Montreal, Quebec.
The Internet is once again suffering from its own success.
The massive increase in presented load has left many large
users with insufficient bandwidth to support their applications.
This problem is exacerbated by the lack of performance
measures by which by which users can compare IP providers.
We have focused on one particular metric, bulk transfer capacity,
and a tool, "TReno," which can function as a basis for
a formal bulk transfer metric.
Bulk transfer capacity requires a very clean network,
and has other properties that make it
an important gauge of network performance.
We introduce our tool, TReno, and describe its relation to current TCP
implementations. Our research with TReno led us to key observations
about TCP behavior and influenced details of the
new specifications for the TCP Selective acknowledgment (SACK) option.
Currently TReno does not precisely estimate
TCP's performance in the Internet.
As SACK is deployed, and the research community reaches consensus
regarding its proper behavior, TReno and state-of-the-art TCP
will converge to identical performance and behavior.
The IP Provider Metrics subcommittee of the IETF is developing standardized
formal metrics for providers. There is a
substantial amount of work to be completed in the areas
of measurement procedures and statistics before TReno can
become a standardized metric.
Our goal is for the TReno development effort
and the IPPM standards efforts to converge.
TReno will maximize its potential as a metric if it also functions
well as a diagnostic. If a long multi-provider path
between two users is not performing as well as desired,
it would be extremely valuable to use TReno to measure the
path hop-by-hop or provider-by-provider to localize the problem.
M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP Selective Acknowledgment
Options," RFC 2018, October 1996.
TCP may experience poor performance when multiple packets are lost from
one window of data. With the limited information available from cumulative
acknowledgments, a TCP sender can only learn about a single lost packet
per round trip time. An aggressive sender could choose to retransmit packets
early, but such retransmitted segments may have already been successfully
A Selective Acknowledgment (SACK) mechanism, combined with a selective
repeat retransmission policy, can help to overcome these limitations. The
receiving TCP sends back SACK packets to the sender informing the sender
of data that has been received. The sender can then retransmit only the
missing data segments.
This memo proposes an implementation of SACK and discusses its performance
and related issues.
[ Text ]
M. Mathis, J. Mahdavi, G. L. Huntoon, "TCP Enhancements to Support
High Performance Applications," research proposal published on the web,
Existing TCP implementations are inadequate to support
high performance applications and services over national scale,
high speed networks. The Pittsburgh Supercomputing Center
requests funding for a project to design, develop, implement and test
performance enhancements to TCP implementations.
These enhancements will improve
TCP based application performance over high latency,
high bandwidth production networks. To accomplish this,
we will focus on the Center's existing high performance
computer and network environment:
production supercomputers attached to the present
and future wide area high speed networks.
Our ultimate goal is to eliminate TCP itself as a bottleneck,
so that applications use appropriate shares of the underlying
IP infrastructure with only minor latency or bandwidth penalties
beyond those present in the lower layers. In other words, for all
TCP connections, performance should be limited only by the
application itself or congestion on an underlying network link.
The only exceptions are during transients in
the data rate, such as during connection startup.
Furthermore it should not require network expertise
to write such applications.
Ultimately, all straightforward, non-privileged application codes
using the standard network socket interface on a high performance
computer will benefit from this work.
To accomplish this goal, we propose a three year research and development
project. Our primary objective is to produce an enhanced TCP that will
better support high performance applications over national scale, high
speed networks. We will address other important objectives and produce
valuable tools for future work in this area.
[ HTML ]
L. McVoy, "Splice -- a push pull I/O model for Unix"
This document describes a strawman proposal for a new model for I/O in a
Unix Operating System. The goals of the new model include low
overhead I/O, scatter-gather for zero copy networking to and from disks,
async I/O, simple implemenation, and compatibility with the existing
kernel implementation. The model is fundamentally different than the
existing model, but easily provides compatibility with the standard read
and write interfaces. If desired, a kernel could move entirely to this
model, building the old interfaces on top, somewhat similarly to what Sun
did with mmap. If this idea ever lives up to its full potential, it could
be as useful as the mmap work.
[ Postscript ]
J. C. Mogul, K. K. Ramakrishnan, "Eliminating Receive Livelock in an Interrupt-driven
Proceedings of USENIX Technical Conference, January 1996.
Most operating systems use interface interrupts to schedule network tasks.
Interrupt-driven systems can provide low overhead and good latency at low
offered load, but degrade significantly at higher arrival rates unless
care is taken to prevent several pathologies. These are various forms of
receive livelock, in which the system spends all its time processing interrupts,
to the exclusion of other necessary tasks. Under extreme conditions, no
packets are delivered to the user application or the output of the system.
To avoid livelock and related problems, an operating system must schedule
network interrupt handling as carefully as it schedules process execution.
We modified an interrupt-driven networking implementation to do so; this
eliminates receive livelock without degrading other aspects of system
We present measurements demonstrating the success of our approach.
Mosberger, D., Jin, T. "httperf--- A Tool for Measuring Web Server Performance,"
SIGMETRICS Workshop on Internet Server Performance, June 1998.
This paper describes httperf, a tool for measuring web server performance.
It provides a flexible facility for generating various HTTP workloads and
for measuring server performance. The focus of httperf is not on implementing
one particular benchmark but on providing a robust, high-performance tool
that facilitates the construction of both micro- and macro-level benchmarks.
The three distinguishing characteristics of httperf are its robustness,
which includes the ability to generate and sustain server overload, support
for the HTTP/1.1 protocol, and its extensibility to new workload generators
and performance measurements. In addition to reporting on the design and
implementation of httperf this paper also discusses some of the experiences
and insights gained while realizing this tool.
B. J. Murphy, S Zeadally, C. J. Adams, "An Analysis of Process and Memory
Models to Support High-Speed Networking in a UNIX Environment,"
of USENIX Technical Conference, January 1996.
In order to reap the benefits of high-speed networks, the performance of
the host operating system must at least match that of the underlying network.
A barrier to achieving high throughput is the cost of copying data within
current host architectures. We present a performance comparison of three
styles of network device driver designed for a conventional monolithic
UNIX kernel. Each driver performs a different number of copies. The zero-copy
driver works by allowing the memory on the network adapter to be mapped
directly into user address space. This maximises performance at the cost
of: 1) breaking the semantics of existing network APIs such as BSD sockets
and SVR4 TLI; 2) pushing responsibility for network buffer management up
from the kernel into the application layer. The single-copy driver works
by copying data directly between user space and adapter memory obviating
the need for an intermediate copy into kernel buffers in main memory. This
approach can be made transparent to existing application code but, like
the zero-copy case, relies on an adapter with a generous quantity of on-board
memory for buffering network data. The two-copy driver is a conventional
STREAMS driver. The two-copy approach sacrifices performance for generality.
We observe that the STREAMS overhead for small packets is significant.
We report on the benefit of the hardware cache in ameliorating the effect
of the second copy, although we note that streaming network data through
the cache reduces the level of cache residency seen by the rest of the
A barrier to achieving low jitter is the non-deterministic nature of many
operating system schedulers. We describe the implementation and report
on the performance of a kernel streaming driver that allows data to be
copied between a network adapter and another I/O device without involving
the process scheduler. This provides performance benefits in terms of increased
throughput, increased CPU availability and reduced jitter.
V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: A unified I/O buffering
and caching system," Rice University CS Technical Report TR97-294.
This paper presents the design, implementation, and evaluation
of IO-Lite, a unified I/O buffering and caching system.
IO-Lite unifies all buffering and caching in the system,
to the extent permitted by the hardware.
In particular, it allows applications, interprocess communication,
the filesystem, the file cache, and the network subsystem
to share a single physical copy of the data safely and concurrently.
Protection and security are maintained through a combination
of access control and read-only sharing.
The various subsystems use (mutable) buffer aggregates to access
the data according to their needs.
IO-Lite eliminates all copying and multiple buffering of I/O data,
and enables various cross-subsystem optimizations.
Performance measurements show significant performance improvements
on Web servers and other I/O intensive applications.
V. Paxson, "Automated Packet Trace Analysis of TCP Implementations,"
Proceedings of ACM SIGCOMM, 1997.
We describe tcpanaly,
a tool for automatically analyzing a TCP implementation's behavior
by inspecting packet traces of the TCP's activity.
Doing so requires surmounting a number of hurdles, including
detecting packet filter measurement errors, coping with ambiguities
due to the distance between the measurement point and
the TCP, and accomodating a surprisingly large range of behavior
among different TCP implementations.
We discuss why our efforts to develop a fully general tool failed,
and detail a number of significant differences among 8 major TCP
implementations, some of which, if ubiquitous, would devastate Internet
The most problematic TCPs were all independently written, suggesting
that correct TCP implementation is fraught with difficulty.
Consequently, it behooves the Internet community to develop
testing programs and reference implementations.
M. Russinovich, "Inside I/O Completion Ports," http://www.sysinternals.com,
The goal of a server is to incur as few context switches as possible
by having its threads avoid unnecessary blocking,
while at the same time maximizing parallelism by using multiple threads.
The ideal is for there to be a thread actively servicing a client request
on every processor and for those threads not to block if there are
additional requests waiting when they complete a request.
For this to work correctly however, there must be a way for
the application to activate another thread when one processing
a client request blocks on I/O (like when it reads from a file
as part of the processing).
Windows NT 3.5 introduced a set of APIs that make this goal relatively easy
The APIs are centered on an object called a completion port.
In this article I'm going to provide an overview of how completion ports
are used and then go inside them to show you how Windows NT implements them.
[ HTML ]
M. L. Schmit, Pentium(tm) Processor Optimzation Tools,
Academic Press, Inc., 1995.
This book shows C/C++ and assembly language programmers how
to improve the performance of their programs by taking advantage
of the superscalar architecture of the Intel(R) Pentium(tm) CPU.
Programs written originally for the 8088 through 80486 can be
optimized for the Pentium, resulting in 100% to 400% performance
J. Semke, J. Mahdavi, M. Mathis, "Automatic TCP Buffer Tuning,"
Proceedings of ACM SIGCOMM, pp. 315-323, September 1998.
With the growth of high performance networking, a single host may have
simultaneous connections that vary in bandwidth by as many as six orders
of magnitude. We identify requirements for an automatically-tuning TCP
to achieve maximum throughput across all connections simultaneously within
the resource limits of the sender. Our auto-tuning TCP implementation makes
use of several existing technologies and adds dynamically adjusting socket
buffers to achieve maximum transfer rates on each connection without manual
Our implementation involved slight modifications to a BSD-based socket
interface and TCP stack. With these modifications, we achieved drastic
improvements in performance over large bandwidth*delay paths compared to
the default system configuration, and a significant reduction in memory
usage compared to hand-tuned connections, allowing many more simultaneous
C. Staelin, L. McVoy, "lmbench: Portable tools for performance analysis,"
of USENIX Technical Conference, pp. 279-284, January 1996.
lmbench is a micro-benchmark suite designed to focus attention
on the basic building blocks of many common system applications, such as
databases, simulations, software development, and networking. In almost
all cases, the individual tests are the result of analysis and isolation
of a customer's actual performance problem. These tools can be, and currently
are, used to compare different system implementations from different vendors.
In several cases, the benchmarks have uncovered previously unknown bugs
and design flaws. The results have shown a strong correlation between memory
system performance and overall performance. lmbench includes an
extensible datsbase of results from systems current as of late 1995.
B. Zorn, "Malloc/Free and GC Implementations,"
This is an annotated table of contents for well-known memory allocation
implementations, including Wolfram Gloger's ptmalloc, used in glibc-2.0.
If you have comments or suggestions, email
Revision: $Id: annbib.html,v 1.17 2000/02/21 23:32:59 cel Exp $