CITI: Projects: Linux scalability: January/February 1999 Status Report

The primary goal of this research is to improve the scalability and robustness of the Linux operating system to support greater network server workloads more reliably. We are specifically interested in single-system scalability, performance, and reliability of network server infrastructure products running on Linux, such as LDAP directory servers, IMAP electronic mail servers, and web servers, among others.

Summary

January 5th, 1999 was officially Day One for this project. We've continued our efforts to build strong relationships with Netscape server developers and the Linux community. Several small technical projects have been completed. And we've begun building our technological infrastructure.

Milestones

Peter has contacted NLNet and Intel with official requests for funding and hardware. The initial response has been positive, and he is continuing to pursue both requests.
Professor Gary Tyson lent an Intel-donated Dell PowerEdge 6300 with four 450 MHz Xeon CPUs and 512 MB of RAM. The machine has been installed in the CITI lab, and is running Linux 2.2.2. We hope to connect it soon to switched 100 Mb/s Ethernet routed through ATM. It will be available to this project until summer.
Chuck acquired beta Linux ports of the Directory Server and the Messaging Server, and has installed them on CITI's Dell PowerEdge server for testing.
Niels has implemented several patches that improve the performance and scalability of poll(). A report on these patches appears here.
Chuck visited Mountain View again in the second week of February to meet with server developers. During this visit, Chuck met with John Paul, SVP and GM of the Server Product Division, and with Linux Product Manager Kevin Tsurutome, as well as developers from the Directory Server team, the Messaging Server team, and the NSPR team.
During these two months, both the 2.2 kernel series and version 2.1 of the C library were released. This means that our development platforms will change radically in the next few months as commercial Linux vendors integrate these new technologies into their distributions. Part of the new technology includes clustered page-out and page-in read-ahead, both of which enhance overall system performance under constrained memory conditions. This will help mmap() performance on busy servers. We are still advocating for the inclusion of large fdset support; we expect the support to appear in the stock 2.2 kernel distribution within a few months.
Chuck constructed a benchmark for malloc() performance under multithreaded load. The results are in his report. The graphs below show that elapsed time grows fairly linearly as more concurrent threads are added.
A new linux-* mailing list was created in February called "linux-perf" to discuss tools and standards for measuring performance on Linux. The CITI-Netscape project will participate in this effort by writing Linux versions of commonly used Unix performance measurement tools, such as "sar" and the /usr/proc/bin tools on Solaris.
Chuck and Tim are working with Netscape's legal team to understand how code written by Netscape employees, and thus owned by Netscape, can be included in Linux under the GPL.
The SPEC S-DET and KENBUS benchmarks are now running cleanly under Linux. See the report for details. We're using these benchmark suites to drive Linux to the breaking point, rather than to measure its performance and compare it with that of other operating systems. There are many subtle incompatibilities between POSIX-compliant utilities and the GNU versions widely available in Linux distributions that have made the port more difficult than expected.
Chuck has been studying the causes of wide performance variations during benchmarks and under real stress. The way the Linux kernel maps physical pages to virtual pages can be improved to reduce CPU cache collisions. There also appear to be one or more unnecessary serializations in the page cache implementation. More on this as research progresses.
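
To illustrate why the physical-to-virtual page mapping can affect cache collisions, here is a minimal sketch of the usual "page color" arithmetic, assuming a physically indexed, set-associative cache. The function names and the cache parameters are hypothetical and chosen only for illustration; this is not code from the Linux kernel or from the study mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only: names and parameters are assumptions,
 * not taken from the Linux kernel.
 *
 * One cache "way" covers cache_bytes / associativity bytes. Physical
 * pages whose page numbers are equal modulo the number of pages per
 * way compete for the same cache sets; that residue is the page color.
 */
size_t num_page_colors(size_t cache_bytes, size_t associativity,
                       size_t page_bytes)
{
    if (associativity == 0 || page_bytes == 0)
        return 1; /* degenerate input: treat all pages as one color */

    size_t way_bytes = cache_bytes / associativity;
    return way_bytes > page_bytes ? way_bytes / page_bytes : 1;
}

/* Color of the physical page that contains phys_addr. */
size_t page_color(uint64_t phys_addr, size_t cache_bytes,
                  size_t associativity, size_t page_bytes)
{
    size_t colors = num_page_colors(cache_bytes, associativity, page_bytes);

    if (page_bytes == 0)
        return 0;
    return (size_t)((phys_addr / page_bytes) % colors);
}
```

Under these assumptions, two pages that a process uses heavily are more likely to evict each other's cache lines when they share a color, so a run-to-run difference in which physical pages are handed out is one plausible source of the timing variation described above.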

Challenges

Understanding the purpose of research (project workscope)

During these last two months, our focus has shifted from determining and describing our project's goals and workscope to beginning our efforts. Even though we have a fairly complete working draft of the project workscope, there are several issues that continue to prevent the workscope from completely solidifying. Needless to say, there are many ambiguities about this work, and many complicating factors.

"Scope creep" is an ever-present danger. Few have a clear picture of Linux's true performance and scalability issues, and the image is always shifting as new Linux kernel releases are made. Linux kernel developers operate by feel, rather than on quantitative or historical analysis, since everyone knows that benchmarks can easily mislead even the most well-intentioned. Unfortunately, this prevents narrowly-focused development effort, since distraction is only the next bug fix away. And we all have our own agendas, from getting our products to market in a timely fashion, to proving that our way of analyzing a problem is the right way.

So many who are involved with our project are unfamiliar with, or untrusting of, the underlying rules of research projects like these. Netscape, as a company, hires many, many product developers, but few researchers. Netscape wants results, deliverables, execution. It's not clear to product-oriented managers exactly what value research can add. Often, researchers are asked to produce product deliverables, rather than to chase the results of exploration. Frustration results on both sides because there is a mismatch of expectations.

And the Linux community is almost anti-academic, charging that academics create unportable and unmaintainable code. Their suspicion is that once the measurements are taken and the simulations have completed, an academic's usefulness is finished. An academic, in this view, has never had to work with development methodology, defect counts, coding conventions, software portability issues, or within the constraints of a market.

The Linux kernel learning curve

The learning curve is still steep. Developers often don't respond to e-mail, problem reports, or technical questions because they are busy, or for other reasons. Documentation in the code or produced separately doesn't begin to help one understand some of the obscure techniques used to speed up kernel functions.

But we do have a clearer window on what Linus will accept into the stock kernel distribution. He has made plain several guidelines that he uses to judge a modification or new feature.

Does it make the code cleaner and simpler? Does it remove old cruft?
Is there clear and unambiguous evidence that real applications will benefit, in terms of performance or scalability, from the proposed modification or new feature?
Does it pave the way for innovation and expansion, not just for next year's new stuff, but five or even ten years down the line?

As many have suggested, the kernel development community should take these guidelines to heart, rather than having only Linus police the code.

Threaded signals v. NT completion ports

There are many complications involved with pushing out a software release as complex as a kernel. The problems of combining threads and asynchronicity at the application level have slipped off the Linux kernel development radar screen while the latest production branch of the Linux kernel stabilizes.

It appears that there may be some room to crack the "must be POSIX-compliant" wagon-circle. As we become more familiar with Linux, especially in the area of threads and signals, it is clear that Linux does not implement a wholly POSIX-compliant API. But again, we must produce the numbers to breach the politics of "no NT allowed here", in order to suggest, and be believed, that completion ports are a superior software technology.

The CITI lab is now in a better position to begin analyzing the true nature of thread/event scalability. We now have a four-CPU machine running Linux, a recent build of the server applications we want to test, and a reproducible way to stress large systems like this one. Reports from the directory team indicate that Linux DS performs as well as the other Unix ports; that is, somewhat less well than the same server running on NT. In the coming month, we hope to bring our test harness on-line to discover where Linux can offer better performance and scalability.

Performance graphs

Our experimental benchmark program ran multiple threads allocating and freeing heap memory at the same time. These graphs show nearly linear growth of elapsed time when adding more threads. This is what we would expect in a correctly operating two-CPU system.
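
A benchmark of the kind described above can be sketched roughly as follows. This is not the program used to produce the graphs; the function and structure names, the iteration count, and the block size are all assumptions made for illustration. Each thread performs a fixed number of malloc()/free() pairs, and the caller measures the wall-clock time for all threads to finish.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical per-thread parameters; the report does not give the
 * values used in the original benchmark. */
struct worker_args {
    long iterations;   /* malloc/free pairs to perform */
    size_t block_size; /* bytes requested per allocation */
    long completed;    /* pairs actually completed by this thread */
};

static void *worker(void *p)
{
    struct worker_args *a = p;

    for (long i = 0; i < a->iterations; i++) {
        /* volatile keeps the compiler from eliding the allocation */
        volatile char *buf = malloc(a->block_size);
        if (buf == NULL)
            break;
        buf[0] = (char)i;
        free((void *)buf);
        a->completed++;
    }
    return NULL;
}

/*
 * Run nthreads workers concurrently. Returns elapsed wall-clock seconds,
 * or -1.0 on error. If total_out is non-NULL, it receives the number of
 * malloc/free pairs completed across all threads.
 */
double run_malloc_bench(int nthreads, long iterations, size_t block_size,
                        long *total_out)
{
    if (nthreads <= 0 || block_size == 0)
        return -1.0;

    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    struct worker_args *args = calloc((size_t)nthreads, sizeof *args);
    if (tids == NULL || args == NULL) {
        free(tids);
        free(args);
        return -1.0;
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    int started = 0;
    for (int i = 0; i < nthreads; i++) {
        args[i].iterations = iterations;
        args[i].block_size = block_size;
        args[i].completed = 0;
        if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }

    long total = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        total += args[i].completed;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    free(tids);
    free(args);
    if (started != nthreads)
        return -1.0;
    if (total_out != NULL)
        *total_out = total;
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling run_malloc_bench() with 1, 2, 4, 8, and more threads, keeping the per-thread iteration count fixed, and plotting the returned time against the thread count gives a curve comparable in shape to the ones discussed here. Because total work grows with the thread count, elapsed time is expected to rise once there are more threads than CPUs, even with a perfectly scalable allocator.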

If you have comments or suggestions, email linux-scalability @ citi.umich.edu