Virtual Memory Caching Scheme
Mon Oct 25 1999 - Thu Jan 20 2000


OVERVIEW  [now somewhat dated]

    Most modern Unix systems implement ordinary file i/o by mapping files
into host memory, faulting the file pages into memory, and copying data
between process memory and the cached file pages.  This has the effect of
caching recently read file data in memory.  This scheme replaces the old
Unix buffer cache, with the advantage that there is no built-in limit on
the size of the cache.  The global file cache is shared by both data files
and the file pages of executing programs, and will grow until all physical
memory is in use.

    The advantage of the virtual memory file system (VMFS) is that it
makes maximal use of system memory for caching file data.  If a relatively
static set of data is repeatedly accessed it will remain in the system
file cache, speeding access and minimizing i/o and page faulting.

    The disadvantage is the same thing: VMFS makes maximal use of system
memory for caching file data.  Programs which do heavy file i/o, reading a
large amount of data, fault in a large number of file data pages which may
be accessed only once.  Once the free list is exhausted the system page
daemon runs to reclaim old file pages for reuse.  The system pages heavily
and becomes inefficient.

    The goal of the file caching scheme presented here is to continue to
cache file data in the global system file cache, but to control how data
is cached so as to minimize use of the pageout daemon, which runs when
memory is exhausted.  This scheme uses the ** existing operating system
kernel facilities ** to cache the file data and to serve general file
access from the cached data.  The trick is to control how data is loaded
into the cache, and when it is removed, so that cache space is reused
efficiently without invoking the system pageout daemon.  Since the data is
cached by the system, the cache benefits all programs which access the
cached file data, without requiring that programs explicitly use any cache
facility such as a custom library.
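    By way of illustration, the sketch below shows the kernel mechanism
the scheme builds on: a file is mapped with mmap, and madvise is used to
ask the kernel to fault its pages in, or to free them, on request.  This
is a minimal sketch only, not the actual vm_cachefile implementation; the
names cache_file and uncache_file are illustrative.

	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* Fault a whole file into the system file cache and return the
	 * mapping, or NULL on error.  Illustrative sketch only.
	 */
	void *
	cache_file (const char *fname, size_t *len)
	{
	    struct stat st;
	    void *addr;
	    int fd;

	    if ((fd = open (fname, O_RDONLY)) < 0)
		return (NULL);
	    if (fstat (fd, &st) < 0) {
		close (fd);
		return (NULL);
	    }
	    *len = (size_t) st.st_size;

	    addr = mmap (NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
	    close (fd);			/* mapping survives the close */
	    if (addr == MAP_FAILED)
		return (NULL);

	    /* Ask the kernel to read the pages in ahead of use. */
	    madvise (addr, *len, MADV_WILLNEED);
	    return (addr);
	}

	/* Release the cached pages and the mapping. */
	void
	uncache_file (void *addr, size_t len)
	{
	    madvise (addr, len, MADV_DONTNEED);
	    munmap (addr, len);
	}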
HOW IT WORKS

INTERFACE

    vm = vm_initcache (initstr)
    vm_closecache (vm)

    vm_cachefile (vm, fname, flags)
    vm_cachefd (vm, fd, flags)
    vm_uncachefile (vm, fname)
    vm_uncachefd (vm, fd)

    vm_cacheregion (vm, fd, offset, nbytes, flags)
    vm_uncacheregion (vm, fd, offset, nbytes)
    vm_reservespace (vm, nbytes)
    vm_sync (vm, fd)

vm_cacheregion (vm, fd, offset, nbytes, flags)

    o  Check whether the indicated region is mapped (vm descriptor).
    o  If not, free space from the tail of the cache and map the new
       region.
    o  Request that the mapped region be faulted into memory (madvise).
    o  Move the referenced file to the head of the cache.

    Redundant requests are harmless, but will reload any missing pages
    and cause the file to again be moved to the head of the cache list.
    The cache may need to be scanned periodically to adjust for files
    that have changed in size, or been deleted, while still in the cache.
    Cached regions may optionally be locked into memory until freed.  The
    cache controller may function either as a library within a process,
    or as a cache controller server process shared by multiple processes.

vm_uncacheregion (vm, fd, offset, nbytes)

    o  Check whether the indicated region is mapped.
    o  If so, unmap the pages.
    o  If no more pages remain mapped, remove the file from the cache
       list.

vm_reservespace (vm, nbytes)

    o  Unmap file segments from the tail of the list until the requested
       space (plus some extra space) is available for reuse.

Data structures

    The caching mechanism is file-oriented.

    o  Linked list of mapped regions (each from a file).  For each region
       keep track of the file descriptor, offset, and size.
    o  Linked list of file descriptors.  For each file keep track of the
       file size, mtime, type of mapping (full, region), and so on.

    Some dynamic properties, such as the size of a file or whether its
    pages are memory resident, can only be determined by querying the
    system at runtime.

Solaris VM Interface

    madvise (addr, len, advice)
    mmap (addr, len, prot, flags, fildes, off)
    munmap (addr, len)
    mlock (addr, len)
    munlock (addr, len)
    memcntl (addr, len, cmd, arg, attr, mask)
    mctl (addr, len, function, arg)
    mincore (addr, len, *vec)
    msync (addr, len, flags)

    Notes

    o  madvise can be used to request that a range of pages be faulted
       into memory (WILL_NEED), or freed from memory (DONT_NEED).
    o  mctl can be used to invalidate page mappings in a region.
    o  mincore can be used to determine if pages in a given address range
       are resident in memory.


VMCACHED -- December 2001
------------------------------

    Added the VMcache daemon and the IRAF interface to same.  Design
notes follow.

Various Cache Control Algorithms

    1. No Cache

       No VMcache daemon.  Clients use their built-in default i/o
       mechanism, e.g., either normal or direct i/o depending upon the
       file size.

    2. Manually or Externally Controlled Cache

       Files are cached only when directed.  Clients connect to the cache
       daemon to see if files are in the cache and, if so, use normal VM
       i/o to access the cached data.  If the file is not cached the
       client uses its default i/o mechanism, e.g., direct i/o.

    3. LRU Cache

       A client file access causes the accessed file to be cached.
       Normal VM i/o is used for file i/o.  As new files are cached the
       space used by the least recently used files is reclaimed.
       Accessing a file moves it to the head of the cache, if it is still
       in the cache; otherwise it is reloaded.

    4. Adaptive Priority Cache

       This is like the LRU cache, but the cache keeps statistics on
       files whether or not they have aged out of the cache, and raises
       the cache priority or lifetime of files that are more frequently
       accessed.
       Files that are only accessed once tend to pass quickly through the
       cache, or may not even be cached until the second access.  Files
       that are repeatedly accessed have a higher priority and tend to
       stay in the cache.  The caching mechanism and algorithm used are
       independent of the client programs, hence can easily be tuned or
       replaced with a different algorithm.

Factors determining if a file is cached

    o  user-assigned priority (0 = nocache; 1-N = cache priority)
    o  number of references
    o  time since last access (degrades nref)
    o  amount of available memory (cutoff point)

Cache priority

    priority = userpri * max (0, nref - refbase - (time - last_access) / tock)

    (A C sketch of this computation follows the Cache Management
    discussion below.)

Tunable parameters

    userpri     User-defined file priority.  Files with a higher priority
                stay in the cache longer.  A zero priority prevents a
                file from being cached.

    refbase     The number of file references must exceed refbase before
                the file will be cached.  For example, if refbase=0 the
                file is cacheable on the first reference; if refbase=1 a
                file becomes cacheable only if accessed two or more
                times.  Refbase can be used to exclude files from the
                cache that are referenced only once and hence are not
                worth caching.

    tock        While the number of accesses increases the cache priority
                of a file, the time interval since the last access
                decreases it.  A time interval of "tock" seconds cancels
                out one file reference.  In effect, tock=N means that a
                file reference raises the cache priority of a file for N
                seconds.  A frequently referenced file is relatively
                unaffected by tock, but infrequently referenced files age
                out of the cache within a few tocks.

Cache Management

    Manual cache control

        Explicitly caching or refreshing a file always maps the file into
        memory and moves it to the head of the cache.

    File access

        Accessing a file (vm_accessfile) allows cache optimization to
        occur.  The file nref and access time are updated, and the
        priorities of the current file and of all files (to a certain
        depth in the cache list) are recomputed.  If a whole-file access
        is being performed the file size is examined to see if it has
        changed; if the file has grown, a new segment is created.  The
        segment descriptor is then unlinked and relinked into the cache
        in cache priority order.  If the segment is above the VM cutoff
        it is loaded into the cache: lower priority segments are freed as
        necessary, and if the file is an existing file it is marked
        WILL_NEED to queue the file data to be read into memory.

        If the file is a new file it must already have been created
        externally to be managed under VMcache.  The file size at access
        time determines the size of the file entry in the cache.  Some
        systems (BSD, Sun) allow a mmap to extend beyond the end of a
        file, but others (Linux) do not.  To reserve space for a large
        file whose ultimate size is known in advance, one can write a
        byte where the last byte of the file will be (as with zfaloc in
        IRAF) before caching the file, and the entire memory space will
        be reserved in advance.  If a file is cached and later extended,
        re-accessing the file will automatically cache the new segment of
        the file (see above).
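    The priority computation referenced above is simple enough to state
directly in code.  The following C sketch implements the formula as
given; the structure and field names are illustrative, and refbase and
tock are shown as compile-time constants where the daemon would treat
them as tunable parameters.

	#include <time.h>

	#define	REFBASE	1	/* refs needed before a file is cacheable */
	#define	TOCK	600	/* seconds needed to cancel one reference  */

	/* Hypothetical per-file record; field names are illustrative. */
	struct vm_file {
	    int     userpri;		/* user priority (0 = never cache) */
	    int     nref;		/* number of references            */
	    time_t  last_access;	/* time of most recent access      */
	};

	/* Compute priority = userpri * max (0, nref - refbase -
	 * (time - last_access) / tock).
	 */
	int
	vm_priority (const struct vm_file *fp, time_t now)
	{
	    int pri = fp->nref - REFBASE -
		(int) ((now - fp->last_access) / TOCK);

	    return (fp->userpri * (pri > 0 ? pri : 0));
	}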
Data structures

    Segment descriptors

        o  List of segments linked in memory allocation order:
           -  the first N segments are cached (whatever will fit)
           -  the remainder are maintained in the list, but are not
              cached
           -  manually cached/refreshed segments go to the head of the
              list
           -  accessed files are inserted in the list based on priority

        o  List of segments belonging to the same file:
           -  a file can be stored in the cache in multiple segments

    File hash table

        o  provides fast lookup of an individual file
        o  hashes dev+ino to a segment
        o  a segment points to the next segment if a collision occurs
        o  only the initial/root file segment is indexed

        (A C sketch of these descriptors follows the FITS Kernel
        discussion below.)

    Cache management

        Relinking of the main list occurs only in certain circumstances:

        o  when a segment is manually cached/uncached/refreshed:
           -  the referenced segment moves to the head of the list
           -  a new segment is always cached
        o  when a file or segment is accessed:
           -  the priority of each element is computed and the segment is
              placed in priority order (only the referenced segment is
              moved)
        o  caching/uncaching may occur due to a new VM cutoff:
           -  when a new segment is added
           -  when an old segment is deleted

        Residency in memory is determined by link order.  Priority
        normally determines memory residency, but manual caching will
        override it (for a time).

File Driver Issues

    Image kernels

        Currently only OIF uses the SF driver; FXF, STF, and QPF (FMIO)
        all use the BF driver.  Some or all could be changed to use SF if
        it is made compatible with BF; otherwise the VM hooks need to go
        into the BF driver.  Since potentially any large file can be
        cached, putting the VM support into BF is a reasonable option.

        The FITS kernel is currently a problem, as it violates device
        block size restrictions by using a block size of 2880.

        It is always a good idea to use falloc to pre-allocate storage
        for a large imagefile when the size is known in advance.  This
        permits the VM system to reserve VM space for a new image before
        data is written to the file.

    Direct I/O

        Direct i/o is possible only if transfers are aligned on device
        blocks and are an integral number of blocks in length.  Direct
        i/o flushes any VM buffered data for the file.  This is not
        possible while a file is mapped into memory, hence direct i/o is
        disabled for a file while it is mapped.  This decision is made at
        read/write time, hence cannot be determined reliably when a file
        is opened.

    FITS Kernel

        Until the block size issues can be addressed, direct i/o cannot
        be used for FITS images.  Some VM cache control is still possible
        however.  Options include:

        o  Always cache a .fits image: either set vmcached to cache a
           file on the first access, or adjust the cache parameters based
           on the file type.  Use a higher priority for explicitly cached
           files (e.g., Mosaic readouts), so that running a sequence of
           normal i/o images through the cache does not flush the high
           priority images.

        o  Writing to new files which have not been pre-allocated is
           problematic, as a large amount of data can be written, causing
           paging.  One way to deal with this is to use large transfers
           (IMIO will already do this), and to issue a reservespace
           directive on each file write at EOF, to free up VM space as
           needed.  The next access directive would cause the new portion
           of the image to be mapped into the cache.  A possible problem
           is that the new file may initially be too small to reach the
           cache threshold.  Space could be reserved in any case, waiting
           for the next access to cache the file; the cache daemon could
           always cache new files of a certain type; or the file could be
           cached when it reaches the cache threshold.
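    As noted under "File hash table" above, the following C sketch shows
one way the segment descriptors and the dev+ino hash could be laid out.
The type, field, and function names are illustrative only and are not
taken from the actual VMcache code.

	#include <sys/types.h>
	#include <stddef.h>

	struct vm_segment {
	    struct vm_segment *next, *prev;   /* main list, cache order  */
	    struct vm_segment *fnext;	      /* next segment, same file */
	    struct vm_segment *hnext;	      /* hash collision chain    */
	    void    *addr;		      /* mapped address          */
	    off_t   offset;		      /* file offset of segment  */
	    size_t  nbytes;		      /* size of segment         */
	    dev_t   dev;		      /* device of file          */
	    ino_t   ino;		      /* inode of file           */
	};

	#define	NHASH	1024

	/* Hash chains; only the root segment of each file is indexed. */
	static struct vm_segment *hashtab[NHASH];

	/* Look up the root segment of a file by dev+ino. */
	struct vm_segment *
	seg_lookup (dev_t dev, ino_t ino)
	{
	    struct vm_segment *sp;

	    for (sp = hashtab[(dev + ino) % NHASH];  sp;  sp = sp->hnext)
		if (sp->dev == dev && sp->ino == ino)
		    return (sp);
	    return (NULL);
	}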
Kernel File Driver

    An environment variable will be used in the OS driver to define a
    cache threshold, or to disable use of VMcache entirely; we need to be
    able to specify these two things separately.  If a cache threshold is
    set, files smaller than this size will not result in a query to the
    cache daemon.  If there is no cache threshold but VMcache is enabled,
    the cache daemon decides whether the file is too small to be cached.
    It should also be possible to force the use of direct i/o if the file
    is larger than a certain size.

    Kernel file driver parameters:

        enable      Boolean; enables use of vmcache.

        vmcache     Use vmcache only if the file size equals or exceeds
                    the specified threshold.

        directio    If the file size equals or exceeds the specified
                    threshold, use direct i/o to access the file.  If
                    direct i/o is enabled in this fashion then vmcache is
                    not used (otherwise vmcache decides whether to use
                    direct i/o for a file).

        port        Socket number to be used.

    For example:

        VMPORT=8797
        VMCLIENT=enable,threshold=10m,directio=10m
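    To make the precedence of these parameters concrete, the following C
sketch shows the decision the driver would make when accessing a file.
The logic mirrors the rules above (directio overrides vmcache; a zero
threshold defers the size decision to the cache daemon); the names follow
the VMCLIENT example and are illustrative, not actual driver code.

	#include <stddef.h>

	enum iomode { IO_NORMAL, IO_VMCACHE, IO_DIRECT };

	/* Choose an i/o mode for a file of the given size.  Sizes are in
	 * bytes; a zero threshold means "let the cache daemon decide".
	 */
	enum iomode
	choose_iomode (size_t fsize, int enable,
	    size_t threshold, size_t directio)
	{
	    if (directio > 0 && fsize >= directio)
		return (IO_DIRECT);	/* forced direct i/o; no vmcache */
	    if (enable && (threshold == 0 || fsize >= threshold))
		return (IO_VMCACHE);	/* query the cache daemon */

	    return (IO_NORMAL);		/* default i/o mechanism */
	}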