.RP .TL A Set of Benchmarks for Measuring IRAF System Performance .AU Doug Tody .AI .K2 "" "" "*" March 1986 .br (Revised July 1987) .AB .ti 0.75i This paper presents a set of benchmarks for measuring the performance of IRAF as installed on a particular host system. The benchmarks serve two purposes: [1] they provide an objective means of comparing the performance of different IRAF host systems, and [2] the benchmarks may be repeated as part of the IRAF installation procedure to verify that the expected performance is actually being achieved. While the benchmarks chosen are sometimes complex, i.e., at the level of actual applications programs and therefore difficult to interpret in detail, some effort has been made to measure all the important performance characteristics of the host system. These include the raw cpu speed, the floating point processing speed, the i/o bandwidth to disk, and a number of characteristics of the host operating system as well, e.g., the efficiency of common system calls, the interactive response of the system, and the response of the system to loading. The benchmarks are discussed in detail along with instructions for benchmarking a new system, followed by tabulated results of the benchmarks for a number of IRAF host machines. .AE .pn 1 .bp .ce \fBContents\fR .sp 3 .sp 1.\h'|0.4i'\fBIntroduction\fP\l'|5.6i.'\0\01 .sp 2.\h'|0.4i'\fBWhat is Measured\fP\l'|5.6i.'\0\02 .sp 3.\h'|0.4i'\fBThe Benchmarks\fP\l'|5.6i.'\0\03 .br \h'|0.4i'3.1.\h'|0.9i'Host Level Benchmarks\l'|5.6i.'\0\03 .br \h'|0.9i'3.1.1.\h'|1.5i'CL Startup/Shutdown [CLSS]\l'|5.6i.'\0\03 .br \h'|0.9i'3.1.2.\h'|1.5i'Mkpkg (verify) [MKPKGV]\l'|5.6i.'\0\04 .br \h'|0.9i'3.1.3.\h'|1.5i'Mkpkg (compile) [MKPKGC]\l'|5.6i.'\0\04 .br \h'|0.4i'3.2.\h'|0.9i'IRAF Applications Benchmarks\l'|5.6i.'\0\04 .br \h'|0.9i'3.2.1.\h'|1.5i'Mkhelpdb [MKHDB]\l'|5.6i.'\0\05 .br \h'|0.9i'3.2.2.\h'|1.5i'Sequential Image Operators [IMADD, IMSTAT, etc.]\l'|5.6i.'\0\05 .br \h'|0.9i'3.2.3.\h'|1.5i'Image Load [IMLOAD,IMLOADF]\l'|5.6i.'\0\05 .br \h'|0.9i'3.2.4.\h'|1.5i'Image Transpose [IMTRAN]\l'|5.6i.'\0\06 .br \h'|0.4i'3.3.\h'|0.9i'Specialized Benchmarks\l'|5.6i.'\0\06 .br \h'|0.9i'3.3.1.\h'|1.5i'Subprocess Connect/Disconnect [SUBPR]\l'|5.6i.'\0\07 .br \h'|0.9i'3.3.2.\h'|1.5i'IPC Overhead [IPCO]\l'|5.6i.'\0\07 .br \h'|0.9i'3.3.3.\h'|1.5i'IPC Bandwidth [IPCB]\l'|5.6i.'\0\07 .br \h'|0.9i'3.3.4.\h'|1.5i'Foreign Task Execution [FORTSK]\l'|5.6i.'\0\07 .br \h'|0.9i'3.3.5.\h'|1.5i'Binary File I/O [WBIN,RBIN,RRBIN]\l'|5.6i.'\0\07 .br \h'|0.9i'3.3.6.\h'|1.5i'Text File I/O [WTEXT,RTEXT]\l'|5.6i.'\0\08 .br \h'|0.9i'3.3.7.\h'|1.5i'Network I/O [NWBIN,NRBIN,etc.]\l'|5.6i.'\0\08 .br \h'|0.9i'3.3.8.\h'|1.5i'Task, IMIO, GIO Overhead [PLOTS]\l'|5.6i.'\0\09 .br \h'|0.9i'3.3.9.\h'|1.5i'System Loading [2USER,4USER]\l'|5.6i.'\0\09 .sp 4.\h'|0.4i'\fBInterpreting the Benchmark Results\fP\l'|5.6i.'\0\010 .sp \fBAppendix A: IRAF Version 2.5 Benchmarks\fP .sp \fBAppendix B: IRAF Version 2.2 Benchmarks\fP .nr PN 0 .bp .NH Introduction .PP This set of benchmarks has been prepared with a number of purposes in mind. Firstly, the benchmarks may be run after installing IRAF on a new system to verify that the performance expected for that machine is actually being achieved. In general, this cannot be taken for granted since the performance actually achieved on a particular system may depend upon how the system is configured and tuned. 
Secondly, the benchmarks may be run to compare the performance of different IRAF hosts, or to track the system performance over a period of time as improvements are made, both to IRAF and to the host system. Lastly, the benchmarks provide a metric which can be used to tune the host system. .PP All too often, the only benchmarks run on a system are those which test the execution time of optimized code generated by the host Fortran compiler. This is primarily a hardware benchmark and secondarily a test of the Fortran optimizer. An example of this type of test is the famous Linpack benchmark. .PP The numerical execution speed test is an important benchmark but it tests only one of the many factors contributing to the overall performance of the system as perceived by the user. In interactive use other factors are often more important, e.g., the time required to spawn or communicate with a subprocess, the time required to access a file, the response of the system as the number of users (or processes) increases, and so on. While the quality of optimized code is significant for cpu intensive batch processing, other factors are often more important for sophisticated interactive applications. .PP The benchmarks described here are designed to test, as fully as possible, the major factors contributing to the overall performance of the IRAF system on a particular host. A major factor in the timings of each benchmark is of course the IRAF system itself, but comparisons of different hosts are nonetheless possible since the code is virtually identical on all hosts (the applications and VOS are in fact identical on all hosts). The IRAF kernel (OS interface) is coded differently for each host operating system, but the functions performed by the kernel are identical on each host, and since the kernel is a very "thin" layer the kernel code itself is almost always a negligible factor in the final timings. .PP The IRAF version number, host operating system and associated version number, and the host computer hardware configuration are all important in interpreting the results of the benchmarks, and should always be recorded. .NH What is Measured .PP Each benchmark measures two quantities, the total cpu time required to execute the benchmark, and the total (wall) clock time required to execute the benchmark. If the clock time measurement is to be of any value the benchmarks must be run on a single user system. Given this "best time" measurement and some idea of how the system responds to loading, it is not difficult to estimate the performance to be expected on a loaded system. .PP The total cpu time required to execute a benchmark consists of the "user" time plus the "system" time. The "user" time is the cpu time spent executing the instructions comprising the user (IRAF) program, i.e., any instructions in procedures linked directly into the process being executed. The "system" time is the cpu time spent in kernel mode executing the system services called by the user program. On some systems there is no distinction between the two types of timings, with the system time either being included in the measured cpu time, or omitted from the timings. If the benchmark involves several concurrent processes, it may not be possible on some systems to measure the cpu time used by the subprocesses. .PP When possible we give both measurements, while in some cases only the user time is given, or only the sum of the user and system times. The cpu time measurements are therefore only directly comparable between different operating systems for the simpler benchmarks, in particular those which make few system calls. The cpu measurements given \fIare\fR accurate for the same operating system (e.g., some version of UNIX) running on different hosts, and may be used to compare such systems. Reliable comparisons between different operating systems are also possible, but only if one thoroughly understands what is going on.
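.PP For the host level benchmarks of section 3.1 these quantities are obtained with the \fBtime\fR facility of the UNIX cshell (or its equivalent on other systems). A minimal sketch follows; the exact form of the output varies from one csh version to another.
.DS
\fL% time mkpkg -n
	(output of the benchmark)
<user>u <system>s <elapsed> <percent cpu> ...\fR
.DE
.LP The first two fields are the user and system cpu times in seconds, and the third is the elapsed (wall clock) time. For compiled IRAF tasks the corresponding measurements are obtained by prefixing the task name with a $, as described in section 3.2.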
.PP The clock time measurement includes both the user and system times, plus the time spent waiting for i/o. Any minor system daemon processes executing while the benchmarks are being run may bias the clock time measurement slightly, but since these are a constant part of the host environment it is fair to include them in the timings. Major system daemons which run infrequently (e.g., the print symbiont in VMS) should be considered to invalidate the benchmark. .PP Assuming an otherwise idle system, a comparison of the cpu and clock times tells whether the benchmark was cpu bound or i/o bound. Those benchmarks involving compiled IRAF tasks do not include the process startup and pagein times (these are measured by a different benchmark), hence the task should be run once before running the benchmark to connect the subprocess and page in the memory used by the task. A good procedure to follow is to run each benchmark once to start the process, and then repeat the benchmark three times, averaging the results (an example is given at the end of this section). If inconsistent results are obtained, further iterations and/or monitoring of the host system are called for until a consistent result is achieved. .PP Many benchmarks depend upon disk performance as well as compute cycles. For such a benchmark to be a meaningful measure of the i/o bandwidth of the system it is essential that no other users (or batch jobs) be competing for disk seeks on the disk used for the test file. There are subtle things to watch out for in this regard, for example, if the machine is in a VMS cluster or on a local area network, processes on other nodes may be accessing the local disk, yet will not show up on a user login or process list on the local node. It is always desirable to repeat each test several times or on several different disk devices, to ensure that no outside requests were being serviced while the benchmark was being run. If the system has disk monitoring utilities, use these to find an idle disk before running any benchmarks which do heavy i/o. .PP Beware of disks which are nearly full; the maximum achievable i/o bandwidth may fall off rapidly as a disk fills up, due to disk fragmentation (the file must be stored in little pieces scattered all over the physical disk). Similarly, many systems (VMS, AOS/VS, V7 and Sys V UNIX, but not Berkeley UNIX) suffer from disk fragmentation problems that gradually worsen as a files system ages, requiring that the disk periodically be backed up onto tape and then restored to render the files and free space as contiguous as possible. In some cases, disk fragmentation can cause the maximum achievable i/o bandwidth to degrade by an order of magnitude. For example, on a VMS system one can use \fLCOPY/CONTIGUOUS\fR to render files contiguous (e.g., this can be done on all the executables in \fL[IRAF.BIN]\fR after installing the system, to speed process pagein times). If the copy fails for a large file even though there is substantial free space left on the disk, the disk is badly fragmented.
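.PP As an illustration of the above procedure, the following sketch shows how one of the host level benchmarks (MKPKGV, section 3.1.2) might be timed from the UNIX cshell: one untimed warm-up run, followed by three timed runs whose clock times are averaged. The secondary prompt and the form of the \fBtime\fR output vary from one system to another, and the equivalent commands on other hosts are system dependent.
.DS
\fL% cd $iraf/pkg
% mkpkg -n			# warm-up run, not timed
% foreach i (1 2 3)
? time mkpkg -n			# average the three timed runs
? end\fR
.DE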
.NH The Benchmarks .PP Instructions are given for running each benchmark, and the operations performed by each benchmark are briefly described. The system characteristics measured by the benchmark are briefly discussed. A short mnemonic name is associated with each benchmark to identify it in the tables given in the appendices, tabulating the results for actual host machines. .NH 2 Host Level Benchmarks .PP The benchmarks discussed in this section are run at the host system level. The examples are given for the UNIX cshell, under the assumption that a host dependent example is better than none at all. These commands must be translated by the user to run the benchmarks on a different system (hint: use \fLSHOW STATUS\fR or a stop watch to measure wall clock times on a VMS host). .NH 3 CL Startup/Shutdown [CLSS] .PP Go to the CL login directory (any directory containing a \fLLOGIN.CL\fR file), mark the time (the method by which this is done is system dependent), and start up the CL. Enter the "logout" command while the CL is starting up so that the CL will not be idle (with the clock running) while the command is being entered. Mark the final cpu and clock time and compute the difference. .DS
\fL% time cl
logout\fR
.DE .LP This is a complex benchmark but one which is of obvious importance to the IRAF user. The benchmark is probably dominated by the cpu time required to start up the CL, i.e., start up the CL process, initialize the i/o system, initialize the environment, interpret the CL startup file, interpret the user LOGIN.CL file, connect and disconnect the x_system.e subprocess, and so on. Most of the remaining time is the overhead of the host operating system for the process spawns, page faults, file accesses, and so on. \fIDo not use a customized \fLLOGIN.CL\fP file when running this benchmark\fR, or the timings will almost certainly be affected. .NH 3 Mkpkg (verify) [MKPKGV] .PP Go to the PKG directory and enter the (host system equivalent of the) following command. The method by which the total cpu and clock times are computed is system dependent. .DS
\fL% cd $iraf/pkg
% time mkpkg -n\fR
.DE .LP This benchmark does a "no execute" make-package of the entire PKG suite of applications and systems packages. This tests primarily the speed with which the host system can read directories, resolve pathnames, and return directory information for files. Since the PKG directory tree is continually growing, this benchmark is only useful for comparing the same version of IRAF run on different hosts, or the same version of IRAF on the same host at different times. .NH 3 Mkpkg (compile) [MKPKGC] .PP Go to the directory "iraf$pkg/bench/xctest" and enter the (host system equivalents of the) following commands. The method by which the total cpu and clock times are computed is system dependent. Only the \fBmkpkg\fR command should be timed. .DS
\fL% cd $iraf/pkg/bench/xctest
% mkpkg clean		# delete old library, etc., if present
% time mkpkg
% mkpkg clean		# delete newly created binaries\fR
.DE .LP This tests the time required to compile and link a small IRAF package. The timings reflect the time required to preprocess, compile, optimize, and assemble each module and insert it into the package library, then link the package executable. The host operating system overhead for the process spawns, page faults, etc. is also a major factor.
If the host system provides a shared library facility this will significantly affect the link time, hence the benchmark should be run linking both with and without shared libraries to make a fair comparison to other systems. Linking against a large library is fastest if the library is topologically sorted and stored contiguously on disk. .NH 2 IRAF Applications Benchmarks .PP The benchmarks discussed in this section are run from within the IRAF environment, using only standard IRAF applications tasks. The cpu and clock times of any (compiled) IRAF task may be measured by prefixing the task name with a $ when the command is entered into the CL, as shown in the examples. The significance of the cpu time measurement is not precisely defined for all systems. On a UNIX host, it is the "user" cpu time used by the task. On a VMS host, there does not appear to be any distinction between the user and system times (probably because the system services execute in the context of the calling process), hence the cpu time given probably includes both, but probably excludes the time for any services executing in ancillary processes, e.g., for RMS. .NH 3 Mkhelpdb [MKHDB] .PP The \fBmkhelpdb\fR task is in the \fBsoftools\fR package. The function of the task is to scan the tree of ".hd" help-directory files and compile the binary help database. .DS
\fLcl> softools
cl> $mkhelpdb\fR
.DE .LP This benchmark tests the speed of the host files system and the efficiency of the host system services and text file i/o, as well as the global optimization of the Fortran compiler and the MIPS rating of the host machine. Since the size of the help database varies with each version of IRAF, this benchmark is only useful for comparing the same version of IRAF run on different hosts, or the same version run on a single host at different times. Note that any additions to the base IRAF system (e.g., SDAS) will increase the size of the help database and affect the timings. .NH 3 Sequential Image Operators [IMADDS,IMADDR,IMSTATR,IMSHIFTR] .PP These benchmarks measure the time required by typical image operations. All tests should be performed on 512 square test images created with the \fBimdebug\fR package. The \fBimages\fR and \fBimdebug\fR packages should be loaded. Enter the following commands to create the test images. .DS
\fLcl> mktest pix.s s 2 "512 512"
cl> mktest pix.r r 2 "512 512"\fR
.DE .LP The following benchmarks should be run on these test images. Delete the output images after each benchmark is run. Once a command has been entered as shown, it can be repeated by typing \fL^\fR followed by return. Each benchmark should be run several times, discarding the first timing and averaging the remaining timings for the final result. .DS
.TS
l l.
[IMADDS]	\fLcl> $imarith pix.s + 5 pix2.s; imdel pix2.s\fR
[IMADDR]	\fLcl> $imarith pix.r + 5 pix2.r; imdel pix2.r\fR
[IMSTATR]	\fLcl> $imstat pix.r\fR
[IMSHIFTR]	\fLcl> $imshift pix.r pix2.r .33 .44 interp=spline3\fR
.TE
.DE .LP The IMADD benchmarks test the efficiency of the image i/o system, including binary file i/o, and provide an indication of how long a simple disk to disk image operation takes on the system in question. This benchmark should be i/o bound on most systems. The IMSTATR and IMSHIFTR benchmarks are normally cpu bound, and test primarily the speed of the host cpu and floating point unit, and the quality of the code generated by the host Fortran compiler. Note that the IMSHIFTR benchmark employs a true two dimensional bicubic spline, hence the timings are a factor of 4 greater than one would expect if a one dimensional interpolator were used to shift the two dimensional image.
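.PP For reference when interpreting the timings of these disk to disk benchmarks, the amount of pixel data moved is easily estimated (a sketch of the arithmetic only, assuming 2 bytes per short pixel and 4 bytes per real pixel, and ignoring image header and paging i/o):
.DS
\fL512 * 512 pixels * 2 bytes  =  0.52 Mb		(pix.s)
512 * 512 pixels * 4 bytes  =  1.05 Mb		(pix.r)

IMADDS:  reads pix.s, writes pix2.s	~1.0 Mb total
IMADDR:  reads pix.r, writes pix2.r	~2.1 Mb total\fR
.DE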
.NH 3 Image Load [IMLOAD,IMLOADF] .PP To run the image load benchmarks, first load the \fBtv\fR package and display something to get the x_display.e process into the process cache. Run the following two benchmarks, displaying the test image PIX.S (this image contains a test pattern of no interest). .DS
.TS
l l.
[IMLOAD]	\fLcl> $display pix.s 1\fR
[IMLOADF]	\fLcl> $display pix.s 1 zt=none\fR
.TE
.DE .LP The IMLOAD benchmark measures how long it takes for a normal image load on the host system, including the automatic determination of the greyscale mapping, and the time required to map and clip the image pixels into the 8 bits (or whatever) displayable by the image display. This benchmark measures primarily the cpu speed and i/o bandwidth of the host system. The IMLOADF benchmark eliminates the cpu intensive greyscale transformation, yielding the minimum image display time for the host system. .NH 3 Image Transpose [IMTRAN] .PP To run this benchmark, transpose the image PIX.S, placing the output in a new image. .DS \fLcl> $imtran pix.s pix2.s\fR .DE .LP This benchmark tests the ability of a process to grab a large amount of physical memory (large working set), and the speed with which the host system can service random rather than sequential file access requests. The user working set should be large enough to avoid excessive page faulting. .NH 2 Specialized Benchmarks .PP The next few benchmarks are implemented as tasks in the \fBbench\fR package, located in the directory "pkg$bench". This package is not installed as a predefined package as the standard IRAF packages are. Since this package is used infrequently, the binaries may have been deleted; if the file x_bench.e is not present in the \fIbench\fR directory, rebuild it as follows: .DS
\fLcl> cd pkg$bench
cl> mkpkg\fR
.DE .LP To load the package, enter the following commands. It is not necessary to \fIcd\fR to the bench directory to load or run the package. .DS
\fLcl> task $bench = "pkg$bench/bench.cl"
cl> bench\fR
.DE .LP This defines the following benchmark tasks. There are no manual pages for these tasks; the only documentation is what you are reading. .DS
.TS
l l.
FORTASK	- foreign task execution
GETPAR	- get parameter; tests IPC overhead
PLOTS	- make line plots from an image
RBIN	- read binary file; tests FIO bandwidth
RRBIN	- raw (unbuffered) binary file read
RTEXT	- read text file; tests text file i/o speed
SUBPROC	- subprocess connect/disconnect
WBIN	- write binary file; tests FIO bandwidth
WIPC	- write to IPC; tests IPC bandwidth
WTEXT	- write text file; tests text file i/o speed
.TE
.DE .NH 3 Subprocess Connect/Disconnect [SUBPR] .PP To run the SUBPR benchmark, enter the following command. This will connect and disconnect the x_images.e subprocess 10 times. Take the difference of the starting and final times printed by the task to obtain the result of the benchmark. The cpu time measurement may be meaningless (very small) on some systems. .DS \fLcl> subproc 10\fR .DE This benchmark measures the time required to connect and disconnect an IRAF subprocess. This includes not only the host time required to spawn and later shut down a process, but also the time required by the IRAF VOS to set up the IPC channels, initialize the VOS i/o system, initialize the environment in the subprocess, and so on.
A portion of the subprocess must be paged into memory to execute all this initialization code. The host system overhead to spawn a subprocess and fault in a portion of its address space is a major factor in this benchmark. .NH 3 IPC Overhead [IPCO] .PP The \fBgetpar\fR task is a compiled task in x_bench.e. The task will fetch the value of a CL parameter 100 times. .DS \fLcl> $getpar 100\fR .DE Since each parameter access consists of a request sent to the CL by the subprocess, followed by a response from the CL process, with a negligible amount of data being transferred in each call, this tests the IPC overhead. .NH 3 IPC Bandwidth [IPCB] .PP To run this benchmark enter the following command. The \fBwipc\fR task is a compiled task in x_bench.e. .DS \fLcl> $wipc 1E6 > dev$null\fR .DE This writes approximately 1 Mb of binary data via IPC to the CL, which discards the data (writes it to the null file via FIO). Since no actual disk file i/o is involved, this tests the efficiency of the IRAF pseudofile i/o system and of the host system IPC facility. .NH 3 Foreign Task Execution [FORTSK] .PP To run this benchmark enter the following command. The \fBfortask\fR task is a CL script task in the \fBbench\fR package. .DS \fLcl> fortask 10\fR .DE This benchmark executes the standard IRAF foreign task \fBrmbin\fR (one of the bootstrap utilities) 10 times. The task is called with no arguments and does nothing other than execute, print out its "usage" message, and shut down. This tests the time required to execute a host system task from within the IRAF environment. Only the clock time measurement is meaningful. .NH 3 Binary File I/O [WBIN,RBIN,RRBIN] .PP To run these benchmarks, make sure the \fBbench\fR package is loaded, and enter the following commands. The \fBwbin\fR, \fBrbin\fR and \fBrrbin\fR tasks are compiled tasks in x_bench.e. A binary file named BINFILE is created in the current directory by WBIN, and should be deleted after the benchmark has been run. Each benchmark should be run at least twice before recording the time and moving on to the next benchmark. Successive calls to WBIN will automatically delete the file and write a new one. .PP \fINOTE:\fR it is wise to create the test file on a files system which has a lot of free space available, to avoid disk fragmentation problems. Also, if the host system has two or more different types of disk drives (or disk controllers or bus types), you may wish to run the benchmark separately for each drive, as in the example at the end of this section. .DS
\fLcl> $wbin binfile 5E6
cl> $rbin binfile
cl> $rrbin binfile
cl> delete binfile		# (not part of the benchmark)\fR
.DE .LP These benchmarks measure the time required to write and then read a binary disk file approximately 5 Mb in size. This benchmark measures the binary file i/o bandwidth of the FIO interface (for sequential i/o). In WBIN and RBIN the common buffered READ and WRITE requests are used, hence some memory to memory copying is included in the overhead measured by the benchmark. A large FIO buffer is used to minimize disk seeks and synchronization delays; somewhat faster timings might be possible by increasing the size of the buffer (this is not a user controllable option, and is not possible on all host systems). The RRBIN benchmark uses ZARDBF to read the file in chunks of 32768 bytes, giving an estimate of the maximum i/o bandwidth for the system.
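.PP For example, to repeat the benchmark on a second disk, give a pathname on that disk in place of "binfile". The following is only a sketch; the directory name shown is an illustration, and a scratch directory on the drive to be tested should be substituted.
.DS
\fLcl> $wbin /tmp3/binfile 5E6
cl> $rbin /tmp3/binfile
cl> delete /tmp3/binfile\fR
.DE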
.NH 3 Text File I/O [WTEXT,RTEXT] .PP To run these benchmarks, load the \fBbench\fR package, and then enter the following commands. The \fBwtext\fR and \fBrtext\fR tasks are compiled tasks in x_bench.e. A text file named TEXTFILE is created in the current directory by WTEXT, and should be deleted after the benchmarks have been run. Successive calls to WTEXT will automatically delete the file and write a new one. .DS
\fLcl> $wtext textfile 1E6
cl> $rtext textfile
cl> delete textfile		# (not part of the benchmark)\fR
.DE .LP These benchmarks measure the time required to write and then read a text disk file approximately one megabyte in size (15,625 64 character lines). This benchmark measures the efficiency with which the system can sequentially read and write text files. Since text file i/o requires the system to pack and unpack records, text i/o tends to be cpu bound. .NH 3 Network I/O [NWBIN,NRBIN,NWNULL,NWTEXT,NRTEXT] .PP These benchmarks are equivalent to the binary and text file benchmarks just discussed, except that the binary and text files are accessed on a remote node via the IRAF network interface. The calling sequences are identical except that an IRAF network filename is given instead of referencing a file in the current directory. For example, the following commands would be entered to run the network binary file benchmarks on node LYRA (the node name and filename are site dependent). .DS
\fLcl> $wbin lyra!/tmp3/binfile 5E6	\fR[NWBIN]\fL
cl> $rbin lyra!/tmp3/binfile		\fR[NRBIN]\fL
cl> $wbin lyra!/dev/null 5E6		\fR[NWNULL]\fL
cl> delete lyra!/tmp3/binfile\fR
.DE .LP The text file benchmarks are equivalent with the obvious changes, i.e., substitute "text" for "bin", "textfile" for "binfile", and omit the null textfile benchmark. The type of network interface used (TCP/IP, DECNET, etc.), and the characteristics of the remote node should be recorded. .PP These benchmarks test the bandwidth of the IRAF network interfaces for binary and text files, as well as the limiting speed of the network itself (NWNULL). The binary file benchmarks should be i/o bound. NWBIN should outperform NRBIN since a network write is a pipelined operation, whereas a network read is (currently) a synchronous operation. Text file access may be either cpu or i/o bound depending upon the relative speeds of the network and host cpus. The IRAF network interface buffers textfile i/o to minimize the number of network packets and maximize the i/o bandwidth. .NH 3 Task, IMIO, GIO Overhead [PLOTS] .PP The \fBplots\fR task is a CL script task which calls the \fBprow\fR task repeatedly to plot the same line of an image. The graphics output is discarded (directed to the null file) rather than plotted since otherwise the results of the benchmark would be dominated by the plotting speed of the graphics terminal. .DS \fLcl> plots pix.s 10\fR .DE This is a complex benchmark. The benchmark measures the overhead of task (not process) execution and the overhead of the IMIO and GIO subsystems, as well as the speed with which IPC can be used to pass parameters to a task and return the GIO graphics metacode to the CL. .PP The \fBprow\fR task is all overhead and is not normally used to interactively plot image lines (\fBimplot\fR is what is normally used), but it is a good task to use for a benchmark since it exercises the subsystems most commonly used in scientific tasks. The \fBprow\fR task has a couple dozen parameters (mostly hidden), must open the image to read the image line to be plotted on every call, and must open the GIO graphics device on every call as well.
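.PP For illustration, a script task of the kind described above might look something like the following. This is only a hypothetical sketch, not the actual \fLpkg$bench/plots.cl\fR source; the choice of image line (256) and the graphics redirection syntax shown are assumptions and may differ in detail from the real script.
.DS
\fL# PLOTS -- sketch of a script task which plots the same image line
# repeatedly, discarding the graphics output.

procedure plots (image, nplots)

string	image		# image to plot lines from
int	nplots		# number of plots to make
int	i

begin
	for (i=1;  i <= nplots;  i+=1) {
	    # plot one line of the image, discarding the metacode
	    prow (image, 256, >G "dev$null")
	}
end\fR
.DE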
.NH 3 System Loading [2USER,4USER] .PP This benchmark attempts to measure the response of the system as the load increases. This is done by running large \fBplots\fR jobs on several terminals and then repeating the 10 plot \fBplots\fR benchmark. For example, to run the 2USER benchmark, log in on a second terminal and enter the following command, and then repeat the PLOTS benchmark discussed in the last section. Be sure to use a different login or login directory for each "user", to avoid concurrency problems, e.g., when reading the input image or updating parameter files. .DS \fLcl> plots pix.s 9999\fR .DE Theoretically, the benchmark should run approximately .5 (2USER) or .25 (4USER) times as fast as when the PLOTS benchmark was run on a single user system, assuming that cpu time is the limiting resource and that a single job is cpu bound. In a case where there is more than one limiting resource, e.g., disk seeks as well as cpu cycles, performance will fall off more rapidly. If, on the other hand, a single user process does not keep the system busy, e.g., because synchronous i/o is used, performance will fall off less rapidly. If the system unexpectedly runs out of some critical system resource, e.g., physical memory or some internal OS buffer space, performance may be much worse than expected. .PP If the multiuser performance is poorer than expected it may be possible to improve the system performance significantly once the reason for the poor performance is understood. If disk seeks are the problem it may be possible to distribute the load more evenly over the available disks. If the performance decays linearly as more users are added and then gets really bad, it is probably because some critical system resource has run out. Use the system monitoring tools provided with the host operating system to try to identify the critical resource. It may be possible to modify the system tuning parameters to fix the problem, once the critical resource has been identified. .NH Interpreting the Benchmark Results .PP Many factors determine the timings obtained when the benchmarks are run on a system. These factors include all of the following: .sp .RS .IP \(bu The hardware configuration, e.g., cpu used, clock speed, availability of floating point hardware, type of floating point hardware, amount of memory, number and type of disks, degree of fragmentation of the disks, bus bandwidth, disk controller bandwidth, memory controller bandwidth for memory mapped DMA transfers, and so on. .IP \(bu The host operating system, including the version number, tuning parameters, user quotas, working set size, files system parameters, Fortran compiler characteristics, level of optimization used to compile IRAF, and so on. .IP \(bu The version of IRAF being run. On a VMS system, are the images "installed" to permit shared memory and reduce physical memory usage? Were the programs compiled with the code optimizer, and if so, what compiler options were used? Are shared libraries used if available on the host system? .IP \(bu Other activity in the system when the benchmarks were run. If there were no other users on the machine at the time, how about batch jobs? If the machine is on a cluster or network, were other nodes accessing the same disks? How many other processes were running on the local node? Ideally, the benchmarks should be run on an otherwise idle system, else the results may be meaningless or next to impossible to interpret.
Given some idea of how the host system responds to loading, it is possible to estimate how a timing will scale as the system is loaded, but the reverse operation is much more difficult. .RE .sp .PP Because so many factors contribute to the results of a benchmark, it can be difficult to draw firm conclusions from any benchmark, no matter how simple. The hardware and software in modern computer systems are so complicated that it is difficult even for an expert with a detailed knowledge and understanding of the full system to explain in detail where the time is going, even when running the simplest benchmark. On some recent message based multiprocessor systems it is probably impossible to fully comprehend what is going on at any given time, even if one fully understands how the system works, because of the dynamic nature of such systems. .PP Despite these difficulties, the benchmarks do provide a coarse measure of the relative performance of different host systems, as well as some indication of the efficiency of the IRAF VOS. The benchmarks are designed to measure the performance of the \fIhost system\fR (both hardware and software) in a number of important areas, all of which play a role in determining the suitability of a system for scientific data processing. The benchmarks are \fInot\fR designed to measure the efficiency of the IRAF software itself (except parts of the VOS), e.g., there is no measure of the time taken by the CL to compile and execute a script, no measure of the speed of the median algorithm or of an image transpose, and so on. These timings are also important, of course, but should be measured separately. Also, measurements of the efficiency of individual applications programs are much less critical than the performance criteria dealt with here, since it is relatively easy to optimize an inefficient or poorly designed applications program, even a complex one like the CL, but there is generally little one can do about the host system. .PP The timings for the benchmarks for a number of host systems are given in the appendices which follow. Sometimes there will be more than one set of benchmarks for a given host system, e.g., because the system provided two or more disks or floating point options with different levels of performance. The notes at the end of each set of benchmarks are intended to document any special features or problems of the host system which may have affected the results. In general we did not bother to record things like system tuning parameters, working set, page faults, etc., unless these were considered an important factor in the benchmarks. In particular, few IRAF programs page fault other than during process startup, hence this is rarely a significant factor when running these benchmarks (except possibly in IMTRAN). .PP Detailed results for each configuration of each host system are presented on separate pages in the Appendices. A summary table showing the results of selected benchmarks for all host systems at once is also provided. The system characteristic or characteristics principally measured by each benchmark are noted in the table below. This is only approximate, e.g., the MIPS rating is a significant factor in all but the most i/o bound benchmarks. .KS
.TS
center;
ci ci ci ci ci
l c c c c.
benchmark	responsiveness	mips	flops	i/o
CLSS	\(bu
MKPKGV	\(bu
MKHDB	\(bu	\(bu
PLOTS	\(bu	\(bu
IMADDS		\(bu		\(bu
IMADDR			\(bu	\(bu
IMSTATR			\(bu
IMSHIFTR			\(bu
IMTRAN				\(bu
WBIN				\(bu
RBIN				\(bu
.TE
.KE .sp .PP By \fIresponsiveness\fR we refer to the interactive response of the system as perceived by the user. A system with a good interactive response will do all the little things very fast, e.g., directory listings, image header listings, plotting from an image, loading new packages, starting up a new process, and so on. Machines which score high in this area will seem fast to the user, whereas machines which score poorly will \fIseem\fR slow, sometimes frustratingly slow, even though they may score high in the areas of floating point performance, or i/o bandwidth. The interactive response of a system obviously depends upon the MIPS rating of the system (see below), but an often more significant factor is the design and computational complexity of the host operating system itself, in particular the time taken by the host operating system to execute system calls. Any system which spends a large fraction of its time in kernel mode will probably have poor interactive response. The response of the system to loading is also very important, i.e., if the system has trouble with load balancing as the number of users (or processes) increases, response will become increasingly erratic until the interactive response is hopelessly poor. .PP The MIPS column refers to the raw speed of the system when executing arbitrary code containing a mixture of various types of instructions, but little floating point, i/o, or system calls. A machine with a high MIPS rating will have a fast cpu, e.g., a fast clock rate, fast memory access time, large cache memory, and so on, as well as a good optimizing Fortran compiler. Assuming good compilers, the MIPS rating is primarily a measure of the hardware speed of the host machine, but all of the MIPS related benchmarks presented here also make a significant number of system calls (MKHDB, for example, does a lot of files accesses and text file i/o), hence it is not that simple. Perhaps a completely cpu bound pure-MIPS benchmark should be added to our suite of benchmarks (the MIPS rating of every machine is generally well known, however). .PP The FLOPS column identifies those benchmarks which do a significant amount of floating point computation. The IMSHIFTR and IMSTATR benchmarks in particular are heavily into floating point. These benchmarks measure the single precision floating point speed of the host system hardware, as well as the effectiveness of do-loop optimization by the host Fortran compiler. The degree of optimization provided by the Fortran compiler can affect the timing of these benchmarks by up to a factor of two. Note that the sample is very small, and if a compiler fails to optimize the inner loop of one of these benchmark programs, the situation may be reversed when running some other benchmark. Any reasonable Fortran compiler should be able to optimize the inner loop of the IMADDR benchmark, so the CPU timing for this benchmark is a good measure of the hardware floating point speed, if one allows for do-loop overhead, memory i/o, and the system calls necessary to access the image on disk. .PP The I/O column identifies those benchmarks which are i/o bound and which therefore provide some indication of the i/o bandwidth of the host system.
The i/o bandwidth actually achieved in these benchmarks depends upon many factors, the most important of which are the host operating system software (files system data structures and i/o software, disk drivers, etc.) and the host system hardware, i.e., disk type, disk controller type, bus bandwidth, and DMA memory controller bandwidth. Note that asynchronous i/o is not currently used in these benchmarks, hence higher transfer rates are probably possible in special cases (on a busy system all i/o is asynchronous at the host system level anyway). Large transfers are used to minimize disk seeks and synchronization delays, hence the benchmarks should provide a good measure of the realistically achievable host i/o bandwidth.
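.PP As a worked example of interpreting these timings (a sketch of the arithmetic only, using the nominal file sizes given in the benchmark descriptions above), the clock time of an i/o bound benchmark converts directly into an effective transfer rate:
.DS
\fLeffective bandwidth (Mb/sec)  =  Mb transferred / clock time (sec)

WBIN, RBIN, RRBIN	5 Mb transferred
WTEXT, RTEXT		1 Mb transferred\fR
.DE
.LP Thus a WBIN clock time of T seconds corresponds to an effective sequential write bandwidth of roughly 5/T Mb per second.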