Notes and ideas about performance

- Context switches -

Within the Sprite server, do you lose much when signalling a process
because Sync_WakeWaitingProcess does a broadcast on the condition
variable rather than waking only the one process?  (The sketch below
contrasts the two wakeup styles.)

How much do you lose by doing context switches in the RPC processing
path?
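A minimal sketch of the two wakeup styles, written against the C
threads interface (the Wakeup* names and the generation count are
illustrative, not actual sprited routines).  A broadcast makes every
waiter runnable; each one context-switches in, reacquires the mutex,
re-tests its predicate, and usually goes right back to sleep, so each
spurious wakeup costs at least one context switch plus mutex traffic:

    #include <cthreads.h>

    static mutex_t     monitor;      /* protects eventCount */
    static condition_t event;
    static int         eventCount;   /* generation count for waiters */

    void
    WakeupInit(void)
    {
        monitor = mutex_alloc();
        event = condition_alloc();
    }

    /* Waiter side: the re-test loop is what makes broadcast costly. */
    void
    WaitForEvent(int myGeneration)
    {
        mutex_lock(monitor);
        while (eventCount == myGeneration)
            condition_wait(event, monitor);
        mutex_unlock(monitor);
    }

    /* What Sync_WakeWaitingProcess effectively does now: wake everyone. */
    void
    WakeupAll(void)
    {
        mutex_lock(monitor);
        eventCount++;
        condition_broadcast(event);
        mutex_unlock(monitor);
    }

    /*
     * The cheaper alternative: wake exactly one waiter.  Correct only
     * if any single waiter can consume the event, which is the
     * property to verify before changing Sync_WakeWaitingProcess.
     */
    void
    WakeupOne(void)
    {
        mutex_lock(monitor);
        eventCount++;
        condition_signal(event);
        mutex_unlock(monitor);
    }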
- Segment management -

How often does segment lookup have to bail out and start over because
the desired segment was in the middle of being destroyed?

How much time would you save by having Sprite cache segments rather
than forcing them to be "immediately" destroyed?  (If you implement
this, look at Fs_GetSegPtr and how native Sprite stashes away the
segment handle.)

How much time do you win by renaming the port to be the segment
handle?

How much is exec sped up by using an initialization file for the
heap?

How much time is lost by always opening and closing the swap file?  A
short-lived process may never touch the swap file, so why bother
opening it?  [The cost is significant: it doubles the time of the
fork benchmark.]  (See the lazy-open sketch following the processes
notes.)

Try to understand the read-ahead done by native Sprite (vmServer.c).
How much does it buy you?

How much time do you spend releasing the reference on the control
port in the data_request, data_write, etc. routines?

How much time do you spend cleaning "anonymous" (heap and stack)
segments at process exit?  Understand the Fs_FileBeingMapped calls:
how they're used and how they affect performance.  See the comments
in VmSegmentCleanup.

How much time is spent waiting to process a request because the
server is already processing a request for the given memory object?
(For example, you currently can't page in multiple pages of a text
segment in parallel.)

How much time do you lose querying for the size of the swap file in
VmCopySwapFile?  Should you keep that information in the Vm_Segment
(and update it from memory_object_write)?

How much time is spent waiting for the VM monitor lock?  How often is
the lock held while doing an RPC?  (For example, as of 9-Jan-92, the
code path for destroying a segment would hold the VM monitor lock
while trying to notify the file server that the file was no longer
mapped.  Also, Vm_GetSwapSegment holds the monitor lock while calling
VmOpenSwapFile.)  (See the unlock-around-RPC sketch following the
processes notes.)

How much time in fork() is spent copying initialized heap pages?

- processes -

Have you allocated enough Proc_ServerProcs?  Too many?  Should you
split the FS cache and VM server processes into separate pools?  You
might want to look at Mendel's changes to allow an expandable pool of
Proc_ServerProcs (procServer.h, procServer.c, proc.h).

How much do you lose by only having one thread to get requests and do
pcb reaping?  Would it be better to have multiple threads, each of
which goes through "obtain lock -> process dead list -> get msg ->
release lock"?  (What is the cost of two mach_msg's and two context
switches compared to the overall request processing time?  The loop
is sketched following these notes.)  Also, there are a bunch of
messages from late October and early November 1991 about the
cthread_mach_mumble routines used in the UX server that you should
review.

Note: a possible alternative to locking (to avoid the process re-use
race) is to use no-senders notification.  You may need to take
advantage of sequence numbers; see Richard Draves's message of
August 9, 1991.
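A sketch of the lazy swap-file open suggested in the
segment-management notes above.  The swapFileOpen field and
VmSwapPageOut are assumptions for illustration; VmOpenSwapFile stands
in for the existing open routine:

    /* Hypothetical, much-simplified segment descriptor. */
    typedef struct {
        int swapFileOpen;        /* assumed new field */
        int swapStream;          /* stand-in for the real file handle */
    } Vm_Segment;

    /* Stand-in for the existing VmOpenSwapFile. */
    static int
    VmOpenSwapFile(Vm_Segment *segPtr)
    {
        segPtr->swapStream = 1;  /* pretend we opened the file */
        segPtr->swapFileOpen = 1;
        return 0;
    }

    /*
     * Pageout path: open the swap file only on first use.  A
     * short-lived process that never pages out then never pays for
     * the open/close pair that currently doubles the fork benchmark.
     */
    static int
    VmSwapPageOut(Vm_Segment *segPtr, int pageNum)
    {
        if (!segPtr->swapFileOpen) {
            int status = VmOpenSwapFile(segPtr);
            if (status != 0)
                return status;
        }
        /* ... write pageNum through segPtr->swapStream ... */
        return 0;
    }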
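The monitor-lock questions above have a standard fix worth recording:
drop the monitor around the remote call and revalidate afterwards.
Everything below is schematic; the lock functions, the fake RPC stub,
and the generation-count revalidation are all assumptions, not the
real sprited code:

    static void LockMonitor(void)   { /* stand-in for the monitor macro */ }
    static void UnlockMonitor(void) { }

    static int                       /* fake stand-in for the real RPC */
    NotifyFileServerUnmapped(int segNum)
    {
        (void) segNum;
        return 0;
    }

    static int segGeneration[1024];  /* hypothetical revalidation state */

    void
    DestroySegment(int segNum)
    {
        int gen;

        LockMonitor();
        gen = segGeneration[segNum];
        /* ... detach the segment from monitored state ... */
        UnlockMonitor();

        /* The RPC can block indefinitely; never do it under the lock. */
        (void) NotifyFileServerUnmapped(segNum);

        LockMonitor();
        if (segGeneration[segNum] != gen) {
            /* Slot was re-used while we were unlocked; revalidate. */
        }
        /* ... finish the teardown ... */
        UnlockMonitor();
    }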
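And a sketch of the multi-threaded request loop proposed in the
processes notes, keeping the lock held across the receive per the
"obtain lock -> process dead list -> get msg -> release lock"
ordering above (which also closes the process re-use race).  Only
mach_msg and the cthreads calls are real interfaces here; the port,
the lock, and the helper routines are placeholders:

    #include <mach.h>
    #include <cthreads.h>

    static mutex_t     procLock;     /* guards the dead list (placeholder) */
    static mach_port_t requestPort;  /* where requests arrive (placeholder) */

    static void ReapDeadProcs(void)                  { /* elided */ }
    static void ServiceRequest(mach_msg_header_t *m) { /* elided */ }

    static any_t
    ServerLoop(any_t arg)
    {
        union {
            mach_msg_header_t hdr;
            char space[8192];        /* room for the largest request */
        } msg;
        mach_msg_return_t mr;

        (void) arg;
        for (;;) {
            mutex_lock(procLock);
            ReapDeadProcs();         /* process the dead list */
            mr = mach_msg(&msg.hdr, MACH_RCV_MSG, 0, sizeof msg,
                          requestPort, MACH_MSG_TIMEOUT_NONE,
                          MACH_PORT_NULL);
            mutex_unlock(procLock);
            if (mr == MACH_MSG_SUCCESS)
                ServiceRequest(&msg.hdr);  /* outside the lock */
        }
        return (any_t) 0;            /* not reached */
    }

    /* Start N workers instead of the current single thread. */
    void
    StartServerThreads(int n)
    {
        while (n-- > 0)
            cthread_detach(cthread_fork(ServerLoop, (any_t) 0));
    }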
- network -

How much time is required for a null RPC?  How does that compare with
native Sprite?  Where is the time going?

Disable the RPC delay code?  [The way things are currently
configured, this shouldn't make a difference.  Sun 3's, Sun 4's, and
DECstations are all set with input and output times of 500 usec, and
the RPC output code uses the difference between the receiver input
rate and the sender output rate (i.e., 0) as the amount to delay.]

What is the efficiency of the FS and VM caches?  Would having cache
size negotiation make the caches more efficient?

Instrument the driver to find out how long the packet queue is.

Maybe you should have multiple ReadPortSet threads.

Don't bother with the UtilsMach_Delay calls?

Should you re-enable the Proc_SetServerPriority call in
Rpc_CreateServer?

When comparing native and server Sprite, get an RPC count for the
benchmark (i.e., find out where sprited is doing more RPC's than
native Sprite and figure out if there's some good way to fix it).

- server memory usage -

How much paging does the server do?  Are there data structures that
can be shrunk (e.g., VmFileMap)?  Are there different algorithms or
different ways of walking data structures that would reduce the
amount of paging?  Use the Sprite malloc (with Mem_Bin and callers)?

- other VM -

Make the Vm_Copy{In,Out} code avoid vm_{read,write} calls when
possible (when dealing with server addresses)?  Note that
copy-on-write can only be used when the destination is backed by the
default pager (rather than an external pager).  (A fast-path sketch
appears at the end of these notes.)

Avoid using CopyIn/CopyOut by using a bounded string argument (e.g.,
for file names and such)?

When copying in arguments and environment variables from user space,
it would probably be faster to ensure that the server's buffer is
page-aligned (assuming you're still using Vm_CopyIn).  In fact, it
might be worthwhile to revisit the interface presented by
Vm_Copy{In,Out} to see if you can change it into something that
doesn't cause so much byte copying.

Keep counts of the number of 1-page, 2-page, 3-page, etc. page-ins
and page-outs?

Reduce the number of copies by using memory_object_data_supply with
deallocate?

- timer -

Are you getting burned by having elements in the timer queue
processed too late?  (See the notes for 12 November 1991.)

Should you re-enable the Proc_SetPriority call in TimerThread()?

The current timer code tries to schedule wakeups to the nearest
millisecond, since that's what Mach advertises.  First, does the
implementation really meet the spec, or is the granularity for
wakeups coarser than 1 ms?  (A measurement sketch appears at the end
of these notes.)  Second, would you be better off reducing overhead
by upping the Sprite granularity to 10 ms or 20 ms?

For systems that don't have a mapped timer, how expensive is
Timer_GetTimeOfDay?  Should it always be called by TransferInProc?
(If not, an alternative is to put something in the timer queue that
runs every N seconds and checks whether there has been any console
input in the previous interval.)

- file system -

Look at Fsrmt_Read and Fsrmt_Write.  Notice the use of an
intermediary buffer between the user buffer and the RPC packet
(costing an extra alloc and copy).  Can this be fixed?  For example,
JO has suggested mapping one or two pages of each user process
directly into the server address space, keeping the mappings around
from call to call until a different address is needed.  Does this
alloc/copy problem show up in other stream types?

- signals -

Are there any applications where signal-handling performance is
critical (e.g., for SIGIO)?

- Sprite "system" calls -

How much do you lose from the extra context switch (between the
thread that reads messages and the thread that processes the
request)?

Is there a performance loss from creating/destroying the thread that
processes the request, rather than keeping a pool of them?  (Cf. the
server-loop sketch following the processes notes.)

Do you have too many paranoia checks?
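Two sketches referenced above.  First, the Vm_CopyIn fast path from
the "other VM" notes: when the source address is already in the
server's own task, a plain bcopy avoids the vm_read round trip and
the extra copy-and-deallocate.  The signature of Vm_CopyIn here is
guessed for illustration; vm_read and vm_deallocate are the real Mach
calls:

    #include <mach.h>
    #include <strings.h>

    kern_return_t
    Vm_CopyIn(mach_port_t sourceTask, vm_address_t srcAddr,
              vm_size_t numBytes, char *destPtr)
    {
        vm_offset_t data;
        mach_msg_type_number_t dataCnt;
        kern_return_t kr;

        if (sourceTask == mach_task_self()) {
            /* Fast path: the "remote" address is our own; no IPC. */
            bcopy((char *) srcAddr, destPtr, numBytes);
            return KERN_SUCCESS;
        }

        /* Slow path: vm_read maps a fresh copy into our space... */
        kr = vm_read(sourceTask, srcAddr, numBytes, &data, &dataCnt);
        if (kr != KERN_SUCCESS)
            return kr;

        /* ...which we copy again and free -- the cost to avoid. */
        bcopy((char *) data, destPtr, dataCnt);
        (void) vm_deallocate(mach_task_self(), data, dataCnt);
        return kr;
    }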
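Second, a small harness for the timer question: does Mach really
deliver 1 ms wakeups?  It sleeps via an empty mach_msg receive with
MACH_RCV_TIMEOUT (assuming mach_msg_empty_rcv_t from <mach/message.h>)
and measures the achieved delay with gettimeofday.  If the average
comes back near 10 ms, coarsening the Sprite granularity costs
nothing:

    #include <mach.h>
    #include <stdio.h>
    #include <sys/time.h>

    int
    main(void)
    {
        mach_port_t waitPort;
        mach_msg_empty_rcv_t msg;
        struct timeval start, end;
        long elapsedUsec = 0;
        int i, iterations = 100;    /* arbitrary */

        (void) mach_port_allocate(mach_task_self(),
                                  MACH_PORT_RIGHT_RECEIVE, &waitPort);

        for (i = 0; i < iterations; i++) {
            gettimeofday(&start, (struct timezone *) 0);
            /* No one sends to waitPort, so this just times out. */
            (void) mach_msg(&msg.header, MACH_RCV_MSG | MACH_RCV_TIMEOUT,
                            0, sizeof msg, waitPort, 1 /* ms */,
                            MACH_PORT_NULL);
            gettimeofday(&end, (struct timezone *) 0);
            elapsedUsec += (end.tv_sec - start.tv_sec) * 1000000
                         + (end.tv_usec - start.tv_usec);
        }
        printf("average wakeup: %ld usec (1000 requested)\n",
               elapsedUsec / iterations);
        return 0;
    }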