ELSC: Scalable Linux Scheduling on a Symmetric Multi-Processor Machine

 

Chris King, Scott Lathrop, Steve Molloy, Paul Moore

 

Department of Electrical Engineering and Computer Science

University of Michigan

Ann Arbor, MI 48109

 

 


ABSTRACT

 

Concerns about the scalability of multithreaded network servers running on Linux have prompted us to investigate possible improvements to the Linux scheduler.  The purpose of our research is to improve the scalability of the Linux scheduler in order to prepare it for industrial-strength computational chores.  Our problem focuses on determining why time spent in the Linux scheduler increases with the number of threads executing in the system.  We determine that this problem is a direct result of the scheduler’s task selection process.  Every time the schedule() function is called, the scheduler calculates a “goodness” value for each ready task, a Θ(n) operation where n is the number of ready tasks.  The task with the highest “goodness” value is the next task to execute.

 

Given this information, we focus on incrementally improving the existing scheduling architecture in the Linux kernel rather than fundamentally redesigning it.  We propose a scheduler design alternative based on the static and dynamic portions of “goodness”, implement that design, and compare our implementation with the current Linux scheduler.  Our solution demonstrates a 90% increase in performance when there are 1200 ready tasks in the system and compares favorably to the present scheduler when there are only a few ready threads.

 

1 Introduction

 

The Linux operating system is increasingly being used not only as a personal computer (PC) operating system, but also as a cost-efficient alternative for network servers, web servers, distributed workstations, and other large-scale platforms.  Its popularity for such large-scale tasks stems mainly from its low price/commodity ratio, but also from its multitasking environment, stable platform, and familiarity to users of other Unix-like systems, specifically application developers.  Many companies, such as IBM, are beginning to offer software support for their products on Linux [3], while other organizations use Linux for web servers [6], for network print servers providing access to over 2000 printers [10], or as an enterprise-class network service provider for their customers.

 

However, there is a problem facing wide-scale Linux deployment: Linux’s inability to scale.  The shortfall specific to the Linux scheduler is its inability to manage a large number of concurrent processes, such as is typically found in a high-load web server environment.  Studies show that when a large number (defined as 1000) of threads are running, up to 50% of the kernel’s time is spent in its scheduler.

 

To be fair, we must point out that Linux was not designed for operation in a large-scale environment.  Its original target was the Intel 80386 personal computer [8].  We also wish to emphasize that the focus of our efforts is on scalability in terms of performance.  Our question of interest is “how does the performance of the scheduler degrade as the load on it increases?”

 

We originally proposed to investigate and determine the reason(s) causing this bottleneck, recommend alternative designs, and implement a solution.  Our implementation (ELSC) is based on the “static” and “dynamic” properties of a task’s goodness.  ELSC improves the scalability of the Linux operating system, showing a 90% increase in performance over the current scheduler when there are 1200 ready threads.

 

The remainder of the paper is organized as follows.  We begin in Section 2 by describing previous work surrounding our research.  In Section 3, we detail some specific scheduler data structures and algorithms that we find hindering scalability.  These hindrances ultimately affect our design decisions.  Section 4 discusses our design proposal and details the implementation.  Section 5 shows the results of benchmark tests comparing the current scheduler with our implementation.  Section 6 details potential future work, and Section 7 provides our conclusions.

 

2 Background and Related Work

 

The basis of our research begins with observations of various Linux scalability problems by IBM and the Center for Information Technology Integration (CITI) at the University of Michigan.  IBM’s Linux Technology Center in Austin, Texas observed problems with the Linux scheduler and thread architecture when they measured the performance of their Java Virtual Machine (JVM) using the VolanoMark benchmark.  VolanoMark, a well-known Java benchmark that simulates a chat room application, is characterized by long-lasting (i.e. several minutes) network connections and high thread counts [5].

 

IBM profiled two Linux kernels running VolanoMark on their Linux JVM implementation in order to better understand where performance bottlenecks occurred.  Because their JVM implementation uses a one-to-one threading model (one kernel thread is created for each Java thread) and VolanoMark uses a pair of Java threads on each end of a chat room connection, four native Linux threads (or Linux “tasks”) are created for each client-server connection.  As the number of threads increased from 400 to 2000, IBM’s kernel profiles revealed that the percentage of kernel execution time spent in the scheduler function alone increased from 17% to 25% [2].  These results indicated a more general scalability problem with the Linux kernel that would affect any heavily multithreaded application.

 

IBM conducted further tests to evaluate the scheduler’s performance while running VolanoMark.  They inserted instrumentation code to measure run queue length and found that, on average, there were 30 threads in the queue, but occasionally that number increased to 400 threads.  They then reorganized the Linux “task” (i.e. thread) data structure so that the fields used to compute a task’s “goodness” value were in the same cache lines.  Since the “goodness” value of a task determines its potential for scheduling and is calculated for every task each time the scheduler is called, this adjustment reduced the number of cache accesses required to calculate goodness.  This structural reorganization resulted in a 35% reduction in the “goodness” calculation time and a 7% increase in VolanoMark throughput.  Based on these observations, IBM recommended that (1) the Linux scheduler be modified to efficiently schedule a large number of tasks, and (2) Linux support a many-to-many threading model.  We chose to focus our research on the scheduler.

 

CITI, an applied research and development center, has been researching Linux scalability since 1998.  Its focus is primarily on supporting greater network server loads, improving file descriptor scalability, improving memory and TCP bandwidth, and multiprocessor scalability.  To date, we are CITI’s lead investigators of the scheduler’s scalability problem.

 

Considerable recent research effort has gone into real-time scheduling for Linux.  Atlas [1] designed and implemented a statistical rate monotonic scheduler for periodic scheduling of real-time tasks in Linux.  Wang [9] presents a general framework for scheduling real-time applications in their version of Linux using an allocator and dispatcher abstraction.  Regehr [7] proposes incorporating flexibility into the scheduler by using dynamically loadable scheduler modules to fit an application’s requirements.  Although this research focuses on the scheduler, none of it addresses the scheduler’s scalability.

 

The Linux discussion groups provide evidence that the scheduler function has been, and continues to be, investigated, with the most recent changes being made to a task’s scheduling policy.  Similar to our findings about what causes the scheduler scalability problem, there has been previous discussion within the Linux community concerning the Θ(n) scheduling algorithm.  Counterarguments cite evidence that even heavily used web sites show only one runnable process 90% of the time, and never more than about fifteen.  We believe this assumption is incorrect based on the evidence presented by IBM running a typical web server application: chat rooms.  We show that a scheduler design that satisfies both a small and a large number of ready tasks is possible.

 

To the best of our knowledge, besides the research identified in this section, no other group has published work on the scheduler’s scalability problem or proposed changes enabling it to handle a large number of ready tasks.

 

Unfortunately, given Linus Torvalds’ original decision to keep the kernel simple and small, coupled with his belief that there are not “a lot of major new innovations in store for the kernel” [8], the Linux developer community is justifiably hesitant to incorporate any modifications that may introduce extra overhead.  Our work shows that such changes are justified and cost little.

 

3 Current Scheduler

 

This section discusses some of the data structures and algorithms implemented in the current scheduler as of linux_2.3.99-pre3 (24 March 2000) [11].  This information helps explain our observations and design decisions.

 

3.1 Task Structure

 

The Linux execution context is referred to as a task.  A task maintains all of the state associated with its execution.  This state includes pointers to its address space, the task state relative to the kernel, the processor state (i.e. register usage), task statistics (used for memory management and resource limit enforcement), task credentials, file system information (including file descriptors), IPC data, signal handlers, and thread group tracking.  Figure 1 shows the task structure’s key fields used in scheduling.

 

The state field holds the current state of a task.  A task can be in one of six states, with TASK_RUNNING being the task’s state when it is in the run queue.  A task’s counter indicates the time remaining in its current quantum, while a task’s policy is SCHED_OTHER for user tasks and SCHED_FIFO or SCHED_RR (round robin) for real-time tasks.  Additionally, a task’s policy can be flagged with SCHED_YIELD when an interrupt handler wants the task to yield its processor.  The has_cpu field is set to 1 when the task is currently executing and 0 otherwise; processor is the number of the processor on which the task last ran; and run_list contains the next and prev pointers that link the task into the run queue’s circular doubly linked list.  The next_task and prev_task fields link together all tasks in the system, and mm points to the task’s virtual memory mapping.
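
The scheduling-related fields named above can be pictured roughly as follows.  This is an abridged sketch of the 2.3.x-era task structure, not a verbatim copy of the kernel header; types, ordering, and comments are illustrative only.

/* Abridged sketch of struct task_struct scheduling fields (2.3.x era).
 * Illustrative only; see include/linux/sched.h for the real definition. */
struct task_struct {
    volatile long       state;      /* e.g. TASK_RUNNING while on the run queue   */
    long                counter;    /* ticks remaining in the current quantum     */
    long                priority;   /* static priority, added into goodness()     */
    unsigned long       policy;     /* SCHED_OTHER, SCHED_FIFO, SCHED_RR, SCHED_YIELD */
    int                 has_cpu;    /* 1 while the task is executing on a CPU     */
    int                 processor;  /* CPU on which the task last ran             */
    struct list_head    run_list;   /* next/prev links for the run queue          */
    struct task_struct *next_task;  /* links all tasks in the system              */
    struct task_struct *prev_task;
    struct mm_struct   *mm;         /* the task's virtual memory mapping          */
    /* ... credentials, file system state, signals, etc. ... */
};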

 

Linux, unlike other UNIX implementations, does not package kernel threads into a lightweight process.  Instead, every thread is a process as far as the kernel is concerned.  This means that every Java thread created under a one-to-one threading model using Linux native threads is mapped to a user process and shows up as such in the output of the process status (ps) command.  This gives rise to possible alternatives for grouping these tasks.

 

3.2 Run Queue

 

The run queue in Linux is a circular doubly linked list containing all the tasks in the TASK_RUNNING state.  Each call to schedule() results in a traversal of the entire linked list.  This traversal is necessary in order to calculate a goodness value for each task in the run queue.  This goodness value determines which task runs next; it ranges from -1000 at the low end to 1000 and above, with values of 1000 or more reserved for real-time tasks.  If the highest calculated weight is zero (i.e. every ready task has exhausted its quantum), the scheduler recalculates each task’s counter and repeats the goodness search.  This is a Θ(n) operation, where n is the number of tasks in the run queue.

 

Unfortunately, the run queue’s simplicity is its undoing when a large number of tasks are on the queue.  We believe that if the Linux scheduler is to scale well, we have to implement a replacement that allows sub-linear insertion and lookup of tasks.

 

3.3 Goodness Calculation

 

The “goodness()” function returns a value associated with a task’s relative desirability for scheduling.  The calculation depends on the following factors.  The first is processor affinity.  If the last processor the task ran on is the processor for which we are scheduling, the task is given a significant advantage because it may still have some of its memory in that processor’s cache.

 

The second factor affecting the goodness() calculation is the task’s memory map address.  If a task shares the same address space as the previously running task, its goodness value is increased by one because of the reduced context switch overhead involved.

 

The third and fourth factors affecting the calculation are the task’s counter value and its priority.  If its counter value is zero (i.e. its time quantum is expended), that task’s goodness value is set to zero and the goodness() function returns to the scheduler function with no further calculations.  Otherwise the task’s priority and counter values are added to its goodness value.

 

Finally, a task’s policy plays a role in its goodness value.  Any real-time task is given the value of 1000 plus its real-time priority, which guarantees that it is scheduled unless another real-time task with a higher priority is in the run queue.

 

We believe that this goodness() function considers the appropriate factors to make an intelligent scheduling decision.  However, it is not necessary, and in fact wasteful, to calculate goodness() for each ready task whenever the scheduler is invoked.
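
To make the combination of these factors concrete, the sketch below shows how a goodness()-style weight could be computed for an ordinary (SCHED_OTHER) task.  It is a simplification of the kernel’s function, not a copy of it; the structure of the code is ours and PROC_CHANGE_PENALTY stands in for the kernel’s processor-affinity bonus.

/* Simplified sketch of a goodness()-style weight for a SCHED_OTHER task.
 * Not the kernel's exact code. */
static int goodness_sketch(struct task_struct *p, int this_cpu,
                           struct mm_struct *prev_mm)
{
    int weight = p->counter;            /* time remaining in the quantum     */

    if (weight == 0)
        return 0;                       /* quantum expended: not desirable   */

    weight += p->priority;              /* static priority                   */

    if (p->processor == this_cpu)
        weight += PROC_CHANGE_PENALTY;  /* affinity: cache may still be warm */
    if (p->mm == prev_mm)
        weight += 1;                    /* same address space: cheap switch  */

    return weight;                      /* a real-time task would instead get
                                           1000 plus its real-time priority  */
}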

 

3.4 Scheduler Algorithm

 

The Linux kernel function schedule(), as in other operating systems, is called from over 500 other functions within the Linux kernel, indicating its significance to overall system performance.  The scheduler function first executes any active bottom halves.  Bottom halves are functions that are too substantial to run during an interrupt.  Upon an interrupt, the interrupt handler performs a minimal amount of work and saves the state necessary so that the rest of the work can execute later in a bottom-half handler.  We decided not to modify any bottom-half code because doing so would require extensive changes that do not affect our goal of improving the scheduler’s scalability.

 

After handling some additional administrative work while interrupts are turned off, the scheduler executes the heart of its code: a while loop that traverses the entire run queue (Fig 2).  This loop is the bottleneck, and it is what we focused on eliminating.

     

Linux schedule() function loop

while (tmp != &runqueue_head) {
  p = list_entry(tmp, struct task_struct, run_list);
  if (can_schedule(p)) {
    int weight = goodness(p, this_cpu, prev->active_mm);
    if (weight > c)
      c = weight, next = p;
  }
  tmp = tmp->next;
}
                                                Fig 2

 

 

4 Design and Implementation

 

This section discusses our scheduler’s design criteria, presents our design and implementation details, and then evaluates our design based on the pre-established criteria.

 

4.1 Design Criteria

 

Based on our previously discussed observations, we decided on five criteria to guide our design decisions and evaluation.  These design criteria are listed below in priority order, followed by a brief discussion of our reasons for choosing each criterion.

 

1.       Goodness is good so use it.

 

2.       KISS (Keep It Simple Stupid).  Keep the implementation simple in order to limit overhead and kernel size.

 

3.       Make the common case fast.  Must not change performance when there are only a few runnable tasks in the system.

 

4.       Must limit changes to current task scheduling behavior.

 

5.       Ignore real time tasks and other task policies for now.

 

We respect the goodness() calculation because not only does it account for a task's priority, but goodness() also considers its processor affinity and address space.  We believe that these considerations benefit the system more than any extra cycles we could save by optimizing the function.  In order to test our theory that goodness() is not the system’s bottleneck, we implemented a patch simplifying its calculation and ran preliminary benchmarks.  We saw no change in the benchmarks’ performance.  Furthermore, IBM’s experiments showed that by modifying the scheduler algorithm to take the first task in the run queue rather than calculating goodness() for each task, time in the scheduler is reduced, but so is system throughput [2].  Therefore, we use the factors in goodness() to guide scheduling decisions.

 

What we did observe about the goodness value, however, is that its calculation consists of two parts: (1) a static part and (2) a dynamic part.  The static part of the goodness calculation comprises a task’s priority and its counter.  Although these values may change while the task is executing, they remain constant while the task waits in the run queue for processor dispatch.  We call these two values the static part of goodness.

 

Processor affinity and shared address space are the parts of the goodness() calculation that depend on the freed processor and the previously running task.  Since this portion of a task’s goodness could change while the task is waiting in the run queue for processor dispatch, we call these two values the dynamic part of goodness.
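
Under the simplifications described above (real-time policies set aside), this split can be expressed as two small helpers.  The function names are ours, not the kernel’s; the dynamic part reuses the affinity and address-space bonuses sketched in Section 3.3.

/* Sketch of the static/dynamic split of goodness.  Names are ours. */

/* Static part: fixed while the task waits in the run queue. */
static inline int static_goodness(struct task_struct *p)
{
    return p->counter + p->priority;
}

/* Dynamic part: depends on the CPU being scheduled and on the task that
 * ran previously, so it can only be evaluated at schedule() time. */
static inline int dynamic_goodness(struct task_struct *p, int this_cpu,
                                   struct mm_struct *prev_mm)
{
    int bonus = 0;

    if (p->processor == this_cpu)
        bonus += PROC_CHANGE_PENALTY;   /* processor affinity   */
    if (p->mm == prev_mm)
        bonus += 1;                     /* shared address space */
    return bonus;
}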

 

Because the Linux community is adamant about limiting overhead and pays meticulous attention to the details of cache alignment and constant running time in other areas of the scheduler, we state up front that our goal is to limit the overhead of our implementation and keep it as simple as possible.  This criterion caused us to discard many initial, well-conceived ideas.

 

Another principle we wish to maintain is the running time of the operating system in the common case, that is, when a single user runs Linux on a desktop PC.  Since this is the reason Linus created Linux and the reason so many developers are committed to maintaining it, we do not wish to degrade its excellent performance when there are only a few ready tasks in the run queue.

 

We desire to limit any changes to scheduling behavior so that applications currently running on Linux will not see any drastic modifications to the ordering of their tasks.  That is, tasks should be scheduled in roughly the same order as they are now.

 

Finally, and probably subject to the most criticism, is our choice to ignore real-time tasks and the other task scheduling policies for now.  The reasons are simple.  After studying the kernel, we found no indication that policies other than SCHED_YIELD were ever assigned or changed anywhere else in the kernel.  Only SCHED_YIELD is modified, to indicate the yielding of a processor, but this value does not affect the current goodness calculation.  Furthermore, these policies are inconsistent in how a task is placed in the run queue.  In the current implementation, when a process is woken up it is always placed at the front of the run queue, even though its position has no bearing on whether the scheduler will select it as the next task to run.  The round-robin policy, however, moves a task to the end of the run queue when its counter reaches zero.  This movement accomplishes nothing because, as noted above, every task is re-examined anyway to recalculate its “goodness” for task selection.

 

Our understanding is that these fields and policies are parts of the scheduler still under development and are mainly used when Linux runs in a real-time environment.  Since the policy field can be considered part of the static goodness, it would be trivial to add it to our design, although we do not know at this time what the side effects might be.  Finally, because our goal is to improve the scheduler’s scalability and not fine-grained scheduling of specific task types, we decided to disregard a task’s policy when making scheduling decisions.

 

Although these criteria are aggressive, and perhaps contradictory in some ways, we believe that in order to satisfy the large base of Linux users, we must meet our top three criteria if we want the Linux community to accept our recommended changes.

 

4.2 ELSC Design and Implementation

           

In order to obtain a running time that is sub-linear in the number of tasks in the system, we had to either divide the run queue in some fashion using a variation of a multilevel feedback queue, or group together similar tasks that share the same address space.  Then, rather than compute goodness() for all tasks in the system, we compute it for a subset of tasks, knowing that the goodness values of the remaining tasks are not high enough for selection in that scheduling round anyway.

 

Our design, ELSC, uses the static and dynamic parts of goodness to create this division of tasks.  Rather than store the tasks in a circular doubly linked list, we use a chained hash table as the run queue.  Tasks are indexed into the table by the most significant bits of their total static goodness values (Fig 3).

 

The hash function takes into account the task’s counter and priority, hashing it to one of 512 buckets.  The function is very simple, consisting of two shifts and an addition: the priority is shifted 22 bits to the left and added to the counter, and the sum is then shifted 23 bits to the right.  We chose this hash because it is inexpensive to compute and because a task’s bucket reflects its desirability.  To reduce lookup time when choosing a task to run, we keep an index to the highest occupied bucket.
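
In code, the hash described above might look like the sketch below.  The identifiers are ours; the shift amounts and the bucket count are those given in the text.

/* Sketch of the ELSC hash: two shifts and an addition over the static
 * goodness inputs.  Identifiers are ours. */
#define ELSC_HASH_BITS    9
#define ELSC_HASH_BUCKETS (1 << ELSC_HASH_BITS)   /* 512 buckets */

static inline unsigned int elsc_hash(struct task_struct *p)
{
    unsigned long key = ((unsigned long)p->priority << 22) + p->counter;

    return (unsigned int)(key >> 23);             /* bucket index */
}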

 

Each bucket in the table contains a circular, doubly linked list of tasks built using the current implementation’s list functions.  Each task in a list shares the same upper nine bits of static goodness, where 2^9 is the hash table size.  Upon entering the run queue, or after completing its CPU burst, a task’s hash value is computed and the task is inserted at the tail of that bucket’s list.  The schedule() function then traverses the list in the table’s top bucket, calculating the dynamic goodness of each task in that bucket and selecting the task with the highest dynamic goodness.  If all tasks in the top bucket are equally desirable, the scheduler chooses the first task, since it is the oldest task in the bucket.  To protect against the worst case, when all ready tasks hash to the same bucket, we bound the search to the first 16 tasks.  Finally, if all tasks currently in the top bucket are executing or have a counter value of zero, the scheduler searches the next lower bucket for candidate tasks, and continues until it finds a task it can dispatch, giving preference to tasks in the highest bucket.  Again, to limit the search through the buckets, a bound of two buckets is set.

 

 

 

Another way to look at this design is as a radix sort on a runnable task’s static goodness value.  All tasks sorted into the highest bucket are candidates for selection in the next scheduling cycle, with the task having the highest dynamic goodness being selected.  The general algorithm for our design is then (a code sketch of the selection step follows the list):

 

(1)     Remove the interrupted task from the run queue.

 

(2)     Calculate the static goodness of the interrupted task.

 

(3)     Re-insert the interrupted task and update the top pointer if necessary.

 

(4)     Select the next task from the top hash bucket based on the highest dynamic goodness and/or using the previously mentioned heuristics.  If all ready tasks are executing, run the idle task.
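
A rough sketch of how step (4) could be implemented is shown below, reusing the helpers sketched earlier.  The identifiers, the exact handling of the bounds, and the omission of locking, SMP bookkeeping, and real-time tasks are all ours; the sketch illustrates the selection pass rather than reproducing our patch.

/* Sketch of the ELSC selection pass (step 4).  Illustrative only. */
#define BUCKET_SEARCH_BOUND 16   /* at most 16 tasks examined per bucket     */
#define BUCKET_DEPTH_BOUND   2   /* search at most 2 buckets below the top   */

struct list_head elsc_table[ELSC_HASH_BUCKETS];   /* chained-hash run queue  */
int top_bucket;                                   /* highest occupied bucket */

static struct task_struct *elsc_pick_next(int this_cpu, struct mm_struct *prev_mm)
{
    int b, depth, examined, w;
    int best_weight = -1;
    struct task_struct *p, *best = NULL;
    struct list_head *pos;

    for (b = top_bucket, depth = 0;
         b >= 0 && depth <= BUCKET_DEPTH_BOUND; b--, depth++) {
        examined = 0;
        list_for_each(pos, &elsc_table[b]) {
            p = list_entry(pos, struct task_struct, run_list);
            if (p->has_cpu || p->counter == 0)
                continue;                 /* running elsewhere or quantum spent */
            w = dynamic_goodness(p, this_cpu, prev_mm);
            if (w > best_weight) {        /* ties keep the oldest (first) task  */
                best_weight = w;
                best = p;
            }
            if (++examined >= BUCKET_SEARCH_BOUND)
                break;                    /* bound the search within a bucket   */
        }
        if (best)
            return best;                  /* prefer the highest occupied bucket */
    }
    return NULL;                          /* nothing dispatchable: run idle     */
}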

 

4.3 ELSC Design Evaluation

 

How does this design meet our criteria?  Its advantages are that we continue to use the idea of goodness to select the next task, keep the run queue data structure relatively simple, and do not change the performance of the system when there are a small number of runnable tasks.  Its disadvantage is that it may not exactly replicate the scheduling behavior of the current system, although this is difficult to determine.  We believe that even though the scheduler’s behavior may have changed, the change is for the better.  Our tests show that the current Linux scheduler is not always fair under certain conditions, but that our implementation promotes fairness among equally desirable tasks.  We discuss these findings in Section 5.

 

The difference between our design and the current scheduler’s goodness calculation is that we do not recalculate the static part of goodness on each call to schedule().  Instead, static goodness is calculated only when a task enters the run queue after yielding its processor.  Dynamic goodness is then the final selection criterion for determining the next task to dispatch.  The number of goodness calculations is reduced to the number of tasks examined in the highest bucket, plus the static goodness calculation for the previously running task.  Furthermore, the goodness calculation performed when selecting a task is not the full goodness calculation, but only its dynamic portion.

 

Our implementation is kept relatively simple.  Since our design allocates memory for the maximum hash table size up front, the only additional overhead of our data structure over the current implementation is the 512 (2^9) doubly linked list sentinels (8 bytes each) that we introduce, for an additional overhead of 512 * 8 = 4K bytes.  This allows the hash table to fit nicely onto one virtual memory page and results in only a 7% increase in the scheduler’s size.

 

Since bit shifting is a fast operation, our hash function is efficient.  Insertion and deletion of tasks are trivial operations.  We wrapped our hash insertion and deletion functions with the existing scheduler run queue functions.

 

In keeping with our original criteria, the new scheduler is comparable to the current scheduler for everyday desktop use.  In fact, we used the new scheduler during the last few days of development work and did not notice any difference in terms of speed or interactivity.  Informal tests involving kernel compiles showed the two schedulers to be statistically indistinguishable.

 

An average running time analysis of our algorithm yields a Θ(1 + n/2^k) operation, where 2^k is the hash table size.  Because all tasks in the top bucket have the same upper k bits of static goodness, we limit the search of the top bucket to a constant number of tasks.  Thus, our algorithm becomes effectively a constant-time operation, much better than Θ(n).  Our design defaults to the current scheduler’s algorithm when there are only a small number of ready tasks.

 

Judged against our design criteria, the only possible disadvantage of our system is that, as of now, we cannot guarantee that task scheduling behavior is exactly the same as the current scheduler’s, although we argue that this is not important.  The only difference between our design and the current scheduler is that if a task hashes to a bucket significantly lower than the top bucket, the scheduler will not select it as the next task to run, even if processor affinity would give it enough “points” to make it more desirable than the tasks in the top bucket.

 

Still, we believe the ELSC scheduler exhibits more correct behavior by grouping tasks based upon time quantum and desired priority and then selecting, from within that group, the task that can be executed most efficiently.  We could easily modify our scheduler to hash based upon address space and processor affinity, grouping the tasks into levels of efficiency and then, within a group, selecting a task based upon desirability.

 

5 Tests and Results

 

In order to compare the scalability of the current scheduler with our implementation, we tested five benchmarks using Linux version 2.3.99-pre3 and our version 2.3.99-pre3-elsc.  Version 2.3.99-pre3 was the most recent version of Linux as of 24 March 2000 [11].

 

We used two machines for our tests.  The first is a generic desktop computer with a 400 MHz AMD K6-2.  The second is an SMP machine with four 500 MHz Pentium III Xeon processors.

 

In analyzing our results, we wanted to answer two questions: does our scheduler show the same performance when there are a small number of threads in the system, and does it scale when there are a large number of threads?

 

The remaining paragraphs in this section describe each benchmark and its purpose, followed by the results obtained running the benchmark on both the current scheduler and our implementation.

 

 5.1 gcc Compiler.

 

Our first benchmark is the gcc compiler.  The benchmark measures the time to compile the Linux kernel in order to determine how our scheduler compares with the original scheduler in the common case, that is, when there are only a few tasks running in the system.  Tables I and II show our results.

 

Time Breakdown        2.3.99-pre3    2.3.99-pre3-elsc
User Time (secs)      443.29         444.09
System Time (secs)    35.73          35.33
CPU Time              99%            99%
Total Time            08:00.0        08:00.1

                       Table I - Uniprocessor

 

 

Time Breakdown        2.3.99-pre3    2.3.99-pre3-elsc
User Time (secs)      262.78         265.05
System Time (secs)    19.61          19.03
CPU Time              332%           332%
Total Time            01:24.9        01:25.4

                Table II - Symmetric Multiprocessor

 

The tables show nearly identical performance between the two schedulers, with the current scheduler compiling the kernel slightly faster than our implementation.  The system time, however, is interesting: on both the uniprocessor and the SMP machine, our implementation spends less time in the kernel, indicating that its scheduling algorithm runs slightly faster than the current version.

 

 

5.2 Thread Spawning.

 

The second benchmark spawns 20 to 1000 threads and counts the number of times they yield in one second.  To ensure that all threads are on the run queue when the benchmark begins, all threads RSVP via a semaphore to a control thread.  Those threads then wait for a pipe file descriptor by calling select().  Thus, all threads have similar thread control blocks when they begin execution, as if they had all executed the same code.  Once all the threads have RSVPed, the control thread records the current time and writes a byte to the pipe.  This action wakes up all threads and places them into the ready state simultaneously.  An alarm handler signals the finish time by setting a shared stop byte that stops the benchmark.  The benchmark measures the average number of microseconds it takes the kernel to handle a yield call.
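
A condensed user-space sketch of a benchmark of this shape is shown below.  It is our reconstruction from the description above, not the authors’ actual test program: the RSVP semaphore, the select() on a pipe, the one-second alarm, and the per-thread yield counters follow the text, while all names, the error handling (omitted), and the timing arithmetic are ours.

/* Sketch of the yield benchmark described above (POSIX threads).
 * Build with -lpthread; error handling omitted for brevity. */
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/select.h>
#include <unistd.h>

static sem_t rsvp;                     /* workers signal readiness here     */
static int pipefd[2];                  /* control thread wakes workers      */
static volatile sig_atomic_t stop = 0; /* set by the alarm handler          */
static long *yields;                   /* per-thread yield counts (fairness) */

static void on_alarm(int sig) { (void)sig; stop = 1; }

static void *worker(void *arg)
{
    long id = (long)arg;
    fd_set rd;

    sem_post(&rsvp);                   /* RSVP to the control thread        */
    FD_ZERO(&rd);
    FD_SET(pipefd[0], &rd);
    select(pipefd[0] + 1, &rd, NULL, NULL, NULL);  /* wait for the go byte  */

    while (!stop) {                    /* yield until the alarm fires       */
        sched_yield();
        yields[id]++;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 20;
    pthread_t *tids = malloc(nthreads * sizeof(*tids));
    long i, total = 0;

    yields = calloc(nthreads, sizeof(*yields));
    sem_init(&rsvp, 0, 0);
    pipe(pipefd);
    signal(SIGALRM, on_alarm);

    for (i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (i = 0; i < nthreads; i++)
        sem_wait(&rsvp);               /* all workers are parked in select() */

    alarm(1);                          /* run for one second                 */
    write(pipefd[1], "g", 1);          /* wake every worker at once          */

    for (i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    for (i = 0; i < nthreads; i++)
        total += yields[i];
    printf("%d threads, %ld yields, ~%.2f us/yield\n", nthreads, total,
           total ? 1e6 / (double)total : 0.0);  /* rough per-yield cost      */
    return 0;
}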

The purpose of this benchmark is to see how fast the scheduler makes decisions as the number of threads increases.  The results, shown in Figure 4, list the current Linux scheduler as “base” and our implementation as “elsc”.  The results clearly indicate that the current scheduler runs in linear time while ELSC runs in roughly constant time.  This confirms IBM’s earlier finding that the current Linux scheduler does not scale.  Additionally, when there are fewer than 50 ready tasks in the run queue, the two schedulers show similar performance.

Furthermore, while running this benchmark we also observed that ELSC made fairer scheduling decisions than the current scheduler: the number of yield calls made per thread was relatively consistent with ELSC, while under the current scheduler certain threads dominated the number of yield calls made.

 

 

 

 

 

 

 

5.3 Talking Threads.

 

The third benchmark spawns an increasing number of server processes operating in separate address spaces.  Each server process then spawns ten threads executing within the same address space as the server process.  Those ten threads send a byte to the server thread at random intervals.  The server thread receives each byte and then echoes it back to the ten threads.

As with the previous benchmark, this benchmark uses the RSVP/select() method to ensure all threads are on the run queue prior to starting the timer.  Spawning 1000 threads required regulating their creation with semaphores.  The benchmark measures the average number of microseconds required to generate a message.

The purpose of this benchmark, like the previous one, is to measure how each scheduler handles an increase in the number of threads.  This benchmark also stresses the address space portion of the dynamic goodness calculation, since it creates tasks operating in separate address spaces.  Finally, because of the number of synchronization elements used, the benchmark also increases our confidence that ELSC does not disturb other areas of the operating system.

Even with all these ongoing actions, ELSC still handles 500 more threads than the current scheduler version (Figure 5).  The results again show that our scheduler runs in relatively constant time, even as the number of ready tasks in the system increases.  The slight slope is likely due to the increasing number of servers broadcasting messages back to their threads.  Once again, when there are only a small number of ready tasks, the two schedulers show similar performance.



5.4 Java Counting Threads.

 

The fourth micro-benchmark is written in Java.  This program takes as input the number of Java threads to create.  The benchmark first creates all the threads, and then each thread starts incrementing a counter.  When a thread’s counter reaches 10,000 it stops.  The count is high enough that a thread’s time quantum expires at least once before it reaches 10,000, forcing a minimum of one yield per thread during the simulation.  The program stops executing when all threads have counted to 10,000, and it records the total time for all threads to reach this limit.  We used IBM’s Java Development Kit (JDK) 1.18 for Linux to obtain our results, since its JVM uses native threads to implement Java threads.  Java is run with a maximum heap size of 256 MB in order to minimize garbage collection time.  Like the second and third benchmarks, this benchmark stresses the scheduler because it forces it to choose among a large number of threads, resulting in a long run queue.  Since this is a Java implementation, the JVM/kernel interface is also stressed.

 

The benchmark results are shown in Table III and Figure 6.  We ran this benchmark only on the uniprocessor machine, to obtain some initial results on the performance of our scheduler in a JVM/kernel environment.  The graph shows that between 10 and 800 threads ELSC performs slightly better than the current scheduler.  However, the graph clearly indicates that when 1000 threads are running, ELSC scales much better.

 

Number of Threads    Elapsed Time (msecs)
                     base      elsc
10                   12        11
100                  38        26
200                  53        34
400                  85        57
800                  114       108
1000                 645       135

       Table III - Java Counting Threads

 

 

5.5 VolanoMark.

 

The final benchmark we ran is VolanoMark.  As discussed in Section 2, VolanoMark is a well-known Java benchmark currently used to compare JVM implementations because it simulates a real-world scenario: chat rooms.  The benchmark creates client connections in groups of twenty.  After establishing an initial connection to the server, the clients send messages to the server.  The server receives messages from each client and then broadcasts them so that every client can see every other client’s messages.  VolanoMark measures how long it takes for clients to broadcast their messages to the group.  At the end of the test, it reports a score as the average number of messages transferred by the server per second [5].

 

Because we ran the benchmark using IBM’s JVM, VolanoMark also tested the schedulers’ ability to service the client and server threads.  We ran all VolanoMark tests in loopback mode in order to factor out any network latency, and only ran them on the SMP machine because page swapping dominated the time on the uniprocessor.  Additionally, we set the thread stack size to 256K and the maximum heap size to 256 MB in order to minimize the frequency of Java’s garbage collection (java -ms 64MB -mx 128 MB).

 

Each VolanoMark room creates 80 threads (20 client connections per room x 4 threads per connection, as described in Section 2).  We varied the number of rooms from 1 to 15, resulting in 80 to 1200 threads, and had each client send 100 messages.  The total elapsed time for each thread to send 100 messages and the total message throughput are shown in Figures 7 and 8 and Tables IV and V.

 

Again, as with the previous Java micro-benchmark, our scheduler performs slightly better than the current scheduler when there are fewer than 800 threads, and it scales considerably better than the current Linux scheduler, showing almost a 2x speedup and a 90% message throughput improvement when there are 1200 threads running in VolanoMark.

 


Number of Threads    Elapsed Time (secs)
                     pre3      elsc
80                   4.58      4.55
160                  15.31     11.86
400                  63.30     41.51
800                  195.78    126.34
1200                 438.55    224.37

      Table IV - VolanoMark Elapsed Time

 

 

 

Number of Threads    Average Throughput (messages/sec)
                     pre3      elsc
80                   8737      8795
160                  5337      6744
400                  3162      4818
800                  2046      3166
1200                 1387      2674

              Table V - VolanoMark Throughput


Our results running a real-world benchmark show that our scheduler performs as well as, if not slightly better than, the current scheduler when there are a small number of ready threads, and that it scales much more gracefully when there are more than 400 threads in the system.

 

6 Future Work

 

Based upon what we have learned in implementing the ELSC scheduler, we see two opportunities for future work: further optimization and a slightly different approach.  We would first like to optimize the ELSC scheduler.  There are a few implementation constants, such as the search bounds, that we chose arbitrarily.  A closer study of running systems would allow us to fine-tune these parameters and yield the most effective system.  Other optimizations might involve the use of an extensible hash function to better scale the structure for both a large and a small number of running tasks.  Finally, we could improve the implementation through careful code inspection, optimizing for both uniprocessor and SMP machines.

 

Another design proposal, which we considered but did not have time to implement, is to sort the ready tasks within each hash bucket by an address space/processor combination in an attempt to “predict” the dynamic portion of a goodness value.  Further, we would order each run queue using a priority queue or similar structure.  This design has the advantage of extremely fast lookups and comparisons.  However, it is our opinion that the overhead required to maintain the structures correctly may outweigh the advantages.  We do believe, though, that “everything is worth trying once”, and the Linux scheduler is no exception.

 

7 Conclusions

           

An increasing number of organizations continue to evaluate, test, and use the Linux operating system to address their computing requirements because of its low price/commodity ratio and ease of upgrading with new releases.  Several of these organizations are large corporations and Internet service providers interested in using Linux as their corporate operating system and web server.  We have shown in this paper, however, that when the Linux scheduler is confronted with a large number of ready tasks, overall system performance and user responsiveness decline rapidly.  In a large-scale enterprise or web server environment this decrease in performance is unacceptable.

 

In this paper we set out to incrementally improve the Linux scheduler’s scalability, seeking modifications that would not change its excellent desktop performance yet would scale appropriately when faced with a large number of ready tasks.  While there is still work to do in optimizing our current design, we demonstrated that the ELSC scheduler handles both a small and a large number of ready tasks well and offers a viable alternative to the current Linux scheduler.

 

 


 

Acknowledgements

 

The authors would like to acknowledge Peter Honeyman and Chuck Lever of CITI for their initial idea, guidance, and assistance; Ray Bryant and Bill Hartner of the IBM Linux Technology Center in Austin, Texas, for their expertise and their assistance in acquiring and using VolanoMark and the IBM Kernel Trace Facility for Linux; Brian Noble for sparking our interest in operating systems research; and finally, all the past, present, and future developers of Linux.


Copyright and Trademark Information

 

            CITI is a registered trademark of the Center for Information Technology Integration, part of the Information Technology Division (ITD) at the University of Michigan.

 

            Linux is a registered trademark of Linus Torvalds.

 

            IBM Kernel Trace Facility is a trademark of IBM Corporation.

 

            IBM Java Development Kit 1.18 (JDK 1.18) is a trademark of IBM Corporation.

 

            Java is a trademark of Sun Microsystems, Inc. and refers to Sun’s Java programming language.

 

            VolanoChat and VolanoMark are registered trademarks of Volano LLC.  The VolanoMark benchmark is Copyright © 1996-1999 by Volano LLC, All Rights Reserved.

 

References

 

[1]  Atlas, A.  Design and implementation of statistical rate monotonic scheduling in KURT Linux.  In Proceedings 20th IEEE Real-Time Systems Symposium.  Phoenix, AZ, December, 1999.

 

[2]  Bryant, Ray and Hartner, Bill.  Java, Threads, and Scheduling in Linux.  IBM Linux Technology Center, IBM Software Group.  http://www-4.ibm.com/software/developer/library/java2/index.html.

 

[3]   Lohr, Steve.  IBM goes countercultural with Linux.  The New York Times On The Web, 20 March 2000. http://www10.nytimes.com/library/tech/00/03/biztech/articles/20soft.html  

 

[4]   Molnar, Ingo.  Re: scheduling.   mingo@chiara.csoma.elte.hu, 1 May 1998.  http://www.uwsg.indiana.edu/hypermail/linux/kernel/9805.0/0056.html

 

[5]  Neffenger,  John.  The Volano Report.  Volano LLC, 24 March 2000.   http://www.volano.com/report.html.

 

[6]  Orr, G.  Building a Library Web Server on a Budget.  Library Software Review, 17:3, Sep 1998, 171-176.

 

[7]  Regehr,  John.  Reply:  A different approach to scheduling issues.  jdr8d@cs.virginia.edu, 30 September 1998.  http://www.uwsg.indiana.edu/hypermail/linux/kernel/9809.3/0933.html.

 

[8]  Torvalds, Linus.  The Linux Edge.  Communications of the ACM 42, 4 (April 1999), 38-39.

 

[9]  Wang, Y.C.  Implementing a general real-time scheduling framework in the RED-Linux real-time kernel.  In Proceedings 20th IEEE Real-Time Systems Symposium.  Los Alamitos, CA, 1999, 246-55.

 

[10] Woodard, B.  Building an enterprise printing system.  In Proceedings of the Twelfth Systems Administration Conference (LISA XII).   USENIX Association, Berkeley, CA, 1998, 219-28.

 

 

[11]  Available:  http://www.kernel.org/pub/linux/kernel/