Sprite on Mach

Mike Kupfer
University of California, Berkeley

Abstract

Sprite is a distributed operating system that supports a fast single-image network file system and transparent process migration. Over a period of 19 months I ported Sprite to run as a server on top of the Mach 3.0 microkernel. Although the resulting server does not implement some Sprite features, it can run in an existing Sprite cluster, and it supports standard UNIX programs like vi, gcc, and make.

Porting Sprite to Mach was generally straightforward, though there were some difficulties. Many of the problems were related to asynchronous interactions between the Sprite server, Mach, and Sprite user processes. Others resulted from trying to maintain native Sprite's internal interfaces in the Sprite server.

Although machine-dependent code accounts for 14% of the source lines in the Sprite kernel, it accounts for only 1% of the source lines in the Sprite server. I believe that this will significantly simplify porting Sprite to new hardware platforms. Unfortunately, the Sprite server runs the Andrew benchmark at only 38% of the speed of native Sprite. None of the performance problems appear insurmountable, but they could require a long time to track down and fix.

1. Introduction

Sprite is a distributed operating system that was developed at the University of California, Berkeley [1]. It features a fast single-image network file system [2], transparent process migration [3], and a high-performance log-structured local file system [4]. Sprite was originally written to support the SPUR multiprocessor project at Berkeley [5], so the kernel is multi-threaded and uses fine-grain locking.

I have ported the Sprite kernel to run as a server on top of Mach 3.0. The server does not have complete Sprite functionality, but it can run most UNIX commands as a client of native Sprite file servers.
I used the modified Andrew benchmark [8] to partially tune the server and to analyze the remaining performance problems.

In the following sections I will explain why I ported Sprite to Mach; sketch the design of the Sprite server, with particular attention to some of the problem areas; discuss the relative sizes and portability of native Sprite and the Sprite server; analyze the performance of the Sprite server; evaluate the success of the Sprite server; and present some conclusions that I have drawn.

2. Why mix Sprite and Mach?

The Sprite project became interested in Mach for three reasons: to make Sprite more portable, to make Sprite easier to distribute to external sites, and to determine whether Mach is a suitable platform for building distributed systems.

% Sprite currently runs on the DECstation 5000 and on Sun SPARC
% systems. In the past it has also run on the DECstation 3100, the
% Sun 2, the Sun 3, prototype SPUR systems, and the Sequent Symmetry.
% Most of these ports were performed by students in the Sprite group.
% These ports were time-consuming and delayed progress on the students'
% research work. It was hoped that by putting Sprite on top of Mach,
% the Sprite group could leave the hardware ports to the Mach
% community, freeing the Sprite group to concentrate on more
% interesting research projects.

(The full paper will elaborate on these reasons. The portability and software distribution issues are a result of the small size of the Sprite group. The discussion about Mach being a suitable platform will mention the UX server and the shared memory server as being the only relevant prior art for a highly networked system like Sprite. I will claim that Sprite stresses Mach more than either of these servers, and we in the Sprite group were concerned about possible performance problems in the network code.)

These project goals led to the following design goals. First, the revised Sprite system should be highly portable.
This led to implementing Sprite as a server, under the assumption that machine-dependent code such as device drivers and memory management would be written by the Mach community, not the Sprite group. A second goal was simplicity, so as not to complicate future Sprite research work. A third goal was to minimize changes to existing Sprite code, both to reduce development time and to simplify synchronizing native Sprite code changes with the Sprite server; this goal led to a single-server, rather than a multi-server, approach. The fourth goal was performance comparable to native Sprite's. Slight performance degradation was considered acceptable in return for improved portability.

3. Design of the Sprite server

As a first approximation, Mach can be thought of as a high-level abstract machine. It provides processes, scheduling, a simple interface for accessing a process's memory, and a machine-independent interface for accessing local devices such as disks or networks. The C Threads library provides threads, locks, and condition variables. Thus the first step in porting Sprite was to rip out all the low-level native code whose functionality was already provided by Mach, replacing it with a facade that maintains Sprite's internal interfaces. This process is similar to the transformation that was used to implement the ``UX'' UNIX server [6, 7].

On the user side, emulation code was added to the C runtime library. This code translates Sprite system calls into Matchmaker RPC requests to the Sprite server. Again, this setup is similar to the one used in the UNIX server.

One important difference between the UNIX and Sprite servers is how they use external pagers. The UNIX server provides an external pager for mapped files, including program text, but it uses the Mach default pager for swap storage (i.e., a process's heap and stack).
The Sprite server provides an external pager that backs the entire address space of a user process, including its heap and stack. This design lets a process page from a network file server, as is done in native Sprite. Sprite uses this feature to support diskless operation (almost all Sprite workstations are diskless) and to support process migration.

Most of the changes to port Sprite to Mach were straightforward. Nonetheless, some changes presented unexpected difficulties. Many of the problems were related in some way to asynchronous interactions between the Sprite server, Sprite user processes, and Mach. For example, some data structures, such as memory objects and process table entries, are logically shared between the Sprite server and Mach, and special care is needed to ensure that these data structures are neither freed prematurely nor leaked. (The full paper will elaborate on one or both of these examples, depending on space availability, with pictures. If there is space, it will also discuss a race in the implementation of sbrk() and complications in handling copyin errors.) No single problem was insurmountable, but each required extra time to get the details of the design and implementation correct.

% For example, consider what happens when a process maps a file and
% then exits. The Sprite server cannot free its internal
% representation of the mapped file until it receives a
% memory_object_terminate notification from the Mach kernel. Now
% consider what happens if a new user process tries to map the same
% file before the previous mapping has been cleaned up. Should the
% server create a new object to represent the mapping? Should it wait
% for the old one to be destroyed?

A similar problem was how to invoke a user signal handler for signals that are generated asynchronously. (The full paper will explain this in more detail. The current implementation does not support asynchronous signal delivery when there is a user handler for the signal.
Instead the server passes back a ``pending signal'' flag for almost all requests. If this flag is set, the emulation library gets the signal information and invokes the handler.)

% In native Sprite the kernel can simply wait until the user process
% returns from a time slice interrupt or system call. Strictly
% speaking, this is not an option for the Sprite server because the
% process could be in an infinite loop waiting for a signal. Instead
% the server can freeze the process, change its registers and stack to
% invoke the signal handler, and resume the process. Unfortunately, if
% the user process is making a Mach system call, signal delivery
% should be delayed until the system call is complete. Because of the
% potential complexity and performance problems associated with a
% general solution (e.g., looking at the process's program counter),
% we decided not to attempt asynchronous signal delivery when there is
% a signal handler. Instead Sprite calls return a flag to the
% emulation library saying that there is a pending signal, and the
% emulation library makes a special call to invoke the signal handler.
% We expect this design to be acceptable in practice.

Maintaining Sprite's internal process interface was also complicated in places. Native Sprite is like UNIX in that user processes run in ``kernel mode'' after an interrupt or after executing a trap. As a consequence, native Sprite was designed so that a process can kill or suspend only itself. An attempt to kill or suspend a different process is simply a request, which the target process eventually acts on. In contrast, the Sprite server is a separate Mach task that user processes invoke via RPCs. The server can simulate the effects of traps and interrupts through an internal pool of threads, but there are some problems with this approach. First, special care is needed, e.g., in exit() and exec(), to ensure that the threads correctly manage internal resources like buffers.
Second, because the Sprite server does not handle time slice interrupts, it cannot guarantee that user processes will notice attempts to kill or suspend them. Thus, unlike in native Sprite, user processes must be able to kill or suspend other user processes. This requirement led to many changes in the locking strategy employed by Sprite's process management and signals code, and these changes were the source of many bugs.

% Deadlocks in the exit() code were common during development and
% testing.
% Special code to handle suspension is needed in the system request
% code, in case a process is suspended after it has submitted a
% request but before the request has been acted on. Also, native
% Sprite integrates suspension with the sched/sync code; the server
% requires that ugly wait loop.

4. Status and code measurements

This section explains the current status of the Sprite server and presents some code size measurements.

I developed the Sprite server over a period of 19 months. The first 7 months of the project were a half-time effort; the remaining 12 months were full-time, including 2.5 months of performance tuning. I began development on a Sun 3 but later moved to a DECstation 5000.

% Moved to ds5000 for performance reasons and to avoid having to deal
% with signals support for 2 architectures.

The Sprite server supports standard UNIX programs like vi, gcc, and make. Nonetheless, the implementation is still a prototype, and it lacks some important features. The server design includes support for these features; I simply lacked the time to implement them. The missing features include

- binary compatibility (with both the vendor operating system and native Sprite)
- local disk access
- support for debugging user processes
- process migration

A Sprite DECstation kernel with the same functionality as the Sprite server contains roughly 143,000 lines of code, of which 27,000 lines are machine-dependent, including 3600 lines of assembly code.
The Sprite server contains 111,000 lines of code, of which 1300 lines are machine-dependent. The server contains 4 lines of assembly code to help debug locking errors.

In going from native Sprite to the Sprite server, I preserved about 90,000 lines of code (63%). I threw away another 39,000 lines of code (27%), and I rewrote the remaining code, about 14,000 lines (10%), for use with Mach. The thrown-away code consisted of low-level (and often machine-dependent) code for device, process, and memory management, plus code that duplicated Mach functionality (e.g., the process scheduler and a kernel debugger). Most of the rewritten code was for process and memory management, ``system call'' support, and Sprite's internal lock package.

To support the missing functionality mentioned earlier in this section, I would have to port an additional 58,000 lines of code, of which 2400 lines are machine-dependent. I expect that about 52,000 lines of this code would not require change, another 2000 lines would get thrown away, and the remaining 4000 lines would get rewritten. (The full paper will use bar charts to present some or all of these numbers.)

5. Performance

Due to time constraints, I used only one benchmark to tune the Sprite server: the modified Andrew benchmark [8]. This benchmark copies a file tree, scans the tree several times, and then compiles some window system code. Unless otherwise noted, all the performance numbers in this section were obtained on a DECstation 5000 with at least 32 megabytes of memory, using files stored on a native Sprite file server.

As (will be) shown in the table, native Sprite runs the benchmark in 91 seconds. The UNIX server (UX34) and Ultrix 4.2, using only local files, need around 118 seconds. The time for Ultrix 4.2 goes up to 141 seconds when the files are on a Prestoserve-equipped NFS server. The Sprite server's fastest time for the benchmark is 237 seconds.
Early versions of the Sprite server required around 440 seconds to run the benchmark. Most of the performance fixes that produced the current time were in three areas: caching of text segments, reduced Sprite RPC latency, and more efficient string management in exec(). (The full paper will explain these in more detail and tell how much each contributed to the overall improvement.)

I have analyzed the 146-second gap between native Sprite and the Sprite server, and I can explain about 98 seconds of it. I know more or less where the remaining 48 seconds are being spent, but not why. Tuning stopped when I changed employers in July 1992. The paper will explain the current top 5 understood or partially understood bottlenecks. They are

(1) overhead in fork() from copying the heap and stack, which accounts for 45 seconds. This problem is present because Mach does not currently support copy-on-write for externally managed memory objects.

(2) copyin/copyout delays for read() and write(), which account for 18 seconds. The UNIX server uses mapped files to avoid this problem. A similar approach for Sprite is not straightforward, because Sprite doesn't support consistent access to mapped files that are shared by multiple clients.

(3) higher Sprite RPC latency (despite the improvements made during tuning), which accounts for 11 seconds. I suspect that most of the time is spent in the Mach networking code, though I was unable to verify this for the network input path.

(4) overhead after exec() from faulting in heap and stack pages, which accounts for 8 seconds. The Sprite server currently creates new heap and stack segments in exec(); I expect that a better approach is to zero out and reuse the old segments.

(5) unexplained Sprite ``system call'' time, which accounts for 8 seconds. I suspect that this problem results from unnecessary context switching, and that it can be fixed by changing how the Sprite server garbage collects process table entries.
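The figure in item (3) can be checked directly from the measured latencies: native Sprite performs a null RPC between two DECstation 5000s in 0.85 milliseconds, the same RPC between the Sprite server and a native Sprite system takes 2.2 milliseconds, and the benchmark issues roughly 8200 Sprite RPCs, so

```latex
8200 \times (2.2 - 0.85)\,\text{ms} \approx 11.1\,\text{s}
```

which agrees with the 11-second estimate above.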
% Put a summary (a short, coherent summary) of the 5 items here?

% Of the understood bottlenecks, the largest is in fork(). Although
% the Mach external pager interface supports copy-on-write copying of
% memory objects, the implementation is broken, so the benchmark
% spends 45 seconds just copying heap and stack pages. The UNIX
% server avoids this problem by using the default pager for user heaps
% and stacks, but as explained earlier, this is not an option for
% Sprite. Fortunately, there are plans to fix copy-on-write for
% externally managed memory objects [JSB private comm.].
%
% The next largest bottleneck (18 seconds) is spent copying bytes to
% and from user space, mostly for read() and write(). Our
% measurements indicate that the real culprit is the cost of the
% cross-address-space memory operations, rather than the bcopy cost
% just to move the bytes. The UNIX server uses mapped files to avoid
% this bottleneck [ref UX paper and Dean/Armand paper].
% Unfortunately, mapped files in Sprite do not enjoy the same network
% consistency guarantees that files accessed through a read/write
% interface enjoy. This is not an inherent flaw in Sprite's design,
% but fixing it would require changes to native Sprite as well as to
% the Sprite server. It may be worth exploring other alternatives
% (e.g., having the server cache mappings into user processes) before
% resorting to mapped files.
%
% Another large slowdown in the system is caused by an increase in RPC
% latency. Native Sprite can do a null RPC between two DECstation
% 5000s in 0.85 milliseconds. The null RPC time between the Sprite
% server and a native Sprite system is 2.2 milliseconds. For the
% Andrew benchmark, which causes about 8200 Sprite RPCs, this
% translates to a slowdown of about 11 seconds. It's not clear what
% the right solution is for this problem. (Adopt NORMA RPC? (See
% 1992 Mach Usenix paper; is it applicable?) Replace the packet
% filter architecture with the OSF proposal?
% Use the X-kernel?)
%
% Another slowdown is caused by unnecessary faults during exec(). The
% Sprite server currently destroys and recreates the heap and stack
% segments during exec(). The Andrew benchmark causes about 8300 heap
% and stack faults; at 1 millisecond per fault (where does this number
% come from??), this causes a delay of about 8 seconds. To fix this
% problem, the Sprite server can reuse heap and stack segments.
% Unfortunately, the problem will return once the Sprite server starts
% using copy-on-write for fork().
%
% Another large bottleneck is in the code that processes Sprite
% "system calls" (really MIG calls) from Sprite user processes. After
% comparing the UNIX server "system call" times with the Sprite server
% times, we estimate that this bottleneck is adding 8 seconds to the
% Andrew benchmark time. We think that this is due to excessive
% context switches in that part of the Sprite server, but this
% hypothesis has not yet been verified.

6. Evaluation

This section will evaluate the Sprite server according to the design goals presented earlier in the paper: portability, simplicity, minimizing changes to native Sprite, and performance.

The Sprite server appears to be highly portable. It contains less code than an equivalent Sprite kernel, and it has very little machine-dependent code and essentially no assembly code. The unimplemented code from native Sprite is mostly machine-independent, so a fully functional Sprite server should still be very portable.

Although the Sprite server is not as simple as I'd like (because of the design complications described in section 3), most of the Sprite server design and implementation were straightforward, and debugging the server with gdb was generally easy. In fact, I was able to track down some bugs from native Sprite that had long been puzzling the Sprite group.

Porting Sprite to Mach involved an acceptably small number of changes to native Sprite.
Most of the server code is identical to the kernel code from which it is derived, and I made few changes to Sprite's internal interfaces. Furthermore, the Sprite server runs in an existing Sprite cluster. No changes, other than adding a new machine type, were made to the native Sprite systems to accommodate the Sprite server.

Unfortunately, I was not able to meet my performance goal. Even assuming that I can apply the fixes that I know of or expect to see soon, I am still left with a time of about 165 seconds for the Andrew benchmark, which is still worse than Ultrix using NFS. I believe that additional tuning could reduce the benchmark time even further, though this would probably be a time-consuming task.

7. Future work

As mentioned earlier in the paper, I am no longer employed by the University of California, and it is unlikely that anyone will do additional work on the Sprite server. If development were to continue, the obvious first task would be to continue performance tuning. More work is needed to make the Sprite server perform adequately on the Andrew benchmark, and there are other benchmarks that the server should be tested on. Furthermore, a production-quality Sprite server would require the remaining functionality mentioned in section 4.

% Mention the WPI synthetic benchmarks [ref]? It would be nice if you
% could find a different reference. (Reread the Mach workshop paper;
% maybe it's better than you remember.)

In addition to mundane development work, it would be useful to conduct additional research using the Sprite server. For example, it would be interesting to compare the performance of the Sprite server with the performance of a similar server that uses Mach IPC for access to remote devices or process migration.

% Look at the FS/VM cache boundary stuff?

8. Conclusions

Porting Sprite to Mach was a mixed success.
On the one hand, the port greatly reduced the amount of machine-dependent code in Sprite, which should make Sprite much easier to port to new hardware. The asynchronous interfaces provided by Mach require some unpleasant complexity in the Sprite server, but this complexity is manageable. On the other hand, the server is almost unusably slow, and it appears that much work is still needed to bring performance within striking distance of the native system.

At least one third of the performance gap results from the distributed nature of Sprite. However, the slowdown is not primarily due to RPC latency or throughput problems. Rather, it is due to the Sprite server's heavy use of an external pager, plus problems such as its inability to use mapped files to avoid copy overhead. The lesson here seems to be that there is more to high-performance distributed systems than fast communication, and Mach 3.0 currently does not support some designs as well as others. For some problems (e.g., copy-on-write for external pagers) it should be possible to fix Mach, but in other cases (e.g., the use of mapped files for performance), it may be necessary to redesign the server instead.

Thus, if I ask whether it is worth porting an existing operating system to run on top of Mach, the answer is ``maybe.'' For a research system like Sprite, with its small development community, increased portability seems attractive enough to warrant a large one-time porting and tuning effort. For commercial systems, though, a more portable server seems less alluring, particularly if the vendor has to port Mach as well. I think that commercial vendors will have to consider the potential benefits of Mach other than portability before deciding to base their systems on it.

References

[1] John Ousterhout et al., ``The Sprite Network Operating System'', IEEE Computer, February 1988.
[2] Brent Welch, ``Naming, State Management, and User-Level Extensions in the Sprite Distributed File System'', University of California, Berkeley, Technical Report UCB/CSD 90/567, February 1990.

[3] Fred Douglis and John Ousterhout, ``Transparent Process Migration: Design Alternatives and the Sprite Implementation'', Software--Practice & Experience, July 1991.

[4] Mendel Rosenblum and John Ousterhout, ``The Design and Implementation of a Log-Structured File System'', Proc. Symposium on Operating Systems Principles, October 1991.

[5] M. D. Hill et al., ``Design Decisions in SPUR'', IEEE Computer, November 1986.

[6] David Golub et al., ``UNIX as an Application Program'', Proc. Summer 1990 USENIX Conference, June 1990.

[7] David L. Black et al., ``Microkernel Operating System Architecture and Mach'', Proc. USENIX Microkernel Workshop, April 1992.

[8] John Ousterhout, ``Why Aren't Operating Systems Getting Faster As Fast as Hardware?'', Proc. Summer 1990 USENIX Conference, June 1990.

% [9] Ken Shirriff and John Ousterhout, ``A Trace-driven Analysis of
% Name and Attribute Caching in a Distributed File System'',
% Proc. Winter 1992 USENIX Conference, January 1992.