Virtual Memory Caching Scheme
Mon Oct 25 1999 - Thu Jan 20 2000


OVERVIEW  [now somewhat dated]

    Most modern Unix systems implement ordinary file i/o by mapping files
into host memory, faulting the file pages into memory, and copying data
between process memory and the cached file pages.  This has the effect of
caching recently read file data in memory.  This scheme replaces the old
Unix buffer cache, with the advantage that there is no built-in limit on
the size of the cache.  The global file cache is shared by both data files
and the file pages of executing programs, and will grow until all physical
memory is in use.

    The advantage of the virtual memory file system (VMFS) is that it
makes maximal use of system memory for caching file data.  If a relatively
static set of data is repeatedly accessed it will remain in the system
file cache, speeding access and minimizing i/o and page faulting.

    The disadvantage is the same thing: VMFS makes maximal use of system
memory for caching file data.  Programs which do heavy file i/o, reading a
large amount of data, fault in a large number of file data pages which may
be accessed only once.  Once the free list is exhausted the system page
daemon runs to reclaim old file pages for reuse.  The system pages heavily
and becomes inefficient.

    The goal of the file caching scheme presented here is to continue to
cache file data in the global system file cache, but to control how data
is cached so as to minimize use of the pageout daemon, which runs when
memory is exhausted.  This scheme uses the ** existing operating system
kernel facilities ** to cache the file data and to serve general file
access from the cached data.  The trick is to control how data is loaded
into the cache, and when it is removed, so that cache space is reused
efficiently without invoking the system pageout daemon.  Since the data is
cached by the system, the cache benefits all programs which access the
cached file data, without requiring that programs explicitly use any cache
facility such as a custom library.
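    By way of illustration, the sketch below shows the kernel mechanism
the scheme builds on: a file is mapped with mmap, and madvise is used to
ask the kernel to fault its pages in, or to free them, on request.  This
is a minimal sketch only, not the actual vm_cachefile implementation; the
names cache_file and uncache_file are illustrative.

	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* Fault a whole file into the system file cache and return the
	 * mapping, or NULL on error.  Illustrative sketch only.
	 */
	void *
	cache_file (const char *fname, size_t *len)
	{
	    struct stat st;
	    void *addr;
	    int fd;

	    if ((fd = open (fname, O_RDONLY)) < 0)
		return (NULL);
	    if (fstat (fd, &st) < 0) {
		close (fd);
		return (NULL);
	    }
	    *len = (size_t) st.st_size;

	    addr = mmap (NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
	    close (fd);			/* mapping survives the close */
	    if (addr == MAP_FAILED)
		return (NULL);

	    /* Ask the kernel to read the pages in ahead of use. */
	    madvise (addr, *len, MADV_WILLNEED);
	    return (addr);
	}

	/* Release the cached pages and the mapping. */
	void
	uncache_file (void *addr, size_t len)
	{
	    madvise (addr, len, MADV_DONTNEED);
	    munmap (addr, len);
	}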
HOW IT WORKS

INTERFACE

    vm = vm_initcache (initstr)
    vm_closecache (vm)

    vm_cachefile (vm, fname, flags)
    vm_cachefd (vm, fd, flags)
    vm_uncachefile (vm, fname)
    vm_uncachefd (vm, fd)

    vm_cacheregion (vm, fd, offset, nbytes, flags)
    vm_uncacheregion (vm, fd, offset, nbytes)
    vm_reservespace (vm, nbytes)
    vm_sync (vm, fd)

vm_cacheregion (vm, fd, offset, nbytes, flags)

    o  Check whether the indicated region is mapped (vm descriptor).
    o  If not, free space from the tail of the cache and map the new
       region.
    o  Request that the mapped region be faulted into memory (madvise).
    o  Move the referenced file to the head of the cache.

    Redundant requests are harmless, but will reload any missing pages
    and cause the file to again be moved to the head of the cache list.
    The cache may need to be scanned periodically to adjust for files
    that have changed in size, or been deleted, while still in the cache.
    Cached regions may optionally be locked into memory until freed.  The
    cache controller may function either as a library within a process,
    or as a cache controller server process shared by multiple processes.

vm_uncacheregion (vm, fd, offset, nbytes)

    o  Check whether the indicated region is mapped.
    o  If so, unmap the pages.
    o  If no more pages remain mapped, remove the file from the cache
       list.

vm_reservespace (vm, nbytes)

    o  Unmap file segments from the tail of the list until the requested
       space (plus some extra space) is available for reuse.

Data structures

    The caching mechanism is file-oriented.

    o  Linked list of mapped regions (each from a file).  For each region
       keep track of the file descriptor, offset, and size.
    o  Linked list of file descriptors.  For each file keep track of the
       file size, mtime, type of mapping (full, region), and so on.

    Some dynamic properties, such as the size of a file or whether its
    pages are memory resident, can only be determined by querying the
    system at runtime.

Solaris VM Interface

    madvise (addr, len, advice)
    mmap (addr, len, prot, flags, fildes, off)
    munmap (addr, len)
    mlock (addr, len)
    munlock (addr, len)
    memcntl (addr, len, cmd, arg, attr, mask)
    mctl (addr, len, function, arg)
    mincore (addr, len, *vec)
    msync (addr, len, flags)

    Notes

    o  madvise can be used to request that a range of pages be faulted
       into memory (WILL_NEED), or freed from memory (DONT_NEED).
    o  mctl can be used to invalidate page mappings in a region.
    o  mincore can be used to determine if pages in a given address range
       are resident in memory.


VMCACHED -- December 2001
------------------------------

    Added the VMcache daemon and the IRAF interface to same.  Design
notes follow.

Various Cache Control Algorithms

    1. No Cache

       No VMcache daemon.  Clients use their built-in default i/o
       mechanism, e.g., either normal or direct i/o depending upon the
       file size.

    2. Manually or Externally Controlled Cache

       Files are cached only when directed.  Clients connect to the cache
       daemon to see if files are in the cache and, if so, use normal VM
       i/o to access the cached data.  If the file is not cached the
       client uses its default i/o mechanism, e.g., direct i/o.

    3. LRU Cache

       A client file access causes the accessed file to be cached.
       Normal VM i/o is used for file i/o.  As new files are cached the
       space used by the least recently used files is reclaimed.
       Accessing a file moves it to the head of the cache, if it is still
       in the cache; otherwise it is reloaded.

    4. Adaptive Priority Cache

       This is like the LRU cache, but the cache keeps statistics on
       files whether or not they have aged out of the cache, and raises
       the cache priority or lifetime of files that are more frequently
       accessed.
       Files that are only accessed once tend to pass quickly through the
       cache, or may not even be cached until the second access.  Files
       that are repeatedly accessed have a higher priority and tend to
       stay in the cache.  The caching mechanism and algorithm used are
       independent of the client programs, hence can easily be tuned or
       replaced with a different algorithm.

Factors determining if a file is cached

    o  user-assigned priority (0 = nocache; 1-N = cache priority)
    o  number of references
    o  time since last access (degrades nref)
    o  amount of available memory (cutoff point)

Cache priority

    priority = userpri * max (0, nref - refbase - (time - last_access) / tock)

    (A C sketch of this computation follows the Cache Management
    discussion below.)

Tunable parameters

    userpri     User-defined file priority.  Files with a higher priority
                stay in the cache longer.  A zero priority prevents a
                file from being cached.

    refbase     The number of file references must exceed refbase before
                the file will be cached.  For example, if refbase=0 the
                file is cacheable on the first reference; if refbase=1 a
                file becomes cacheable only if accessed two or more
                times.  Refbase can be used to exclude files from the
                cache that are referenced only once and hence are not
                worth caching.

    tock        While the number of accesses increases the cache priority
                of a file, the time interval since the last access
                decreases it.  A time interval of "tock" seconds cancels
                out one file reference.  In effect, tock=N means that a
                file reference raises the cache priority of a file for N
                seconds.  A frequently referenced file is relatively
                unaffected by tock, but infrequently referenced files age
                out of the cache within a few tocks.

Cache Management

    Manual cache control

        Explicitly caching or refreshing a file always maps the file into
        memory and moves it to the head of the cache.

    File access

        Accessing a file (vm_accessfile) allows cache optimization to
        occur.  The file nref and access time are updated, and the
        priorities of the current file and of all files (to a certain
        depth in the cache list) are recomputed.  If a whole-file access
        is being performed the file size is examined to see if it has
        changed; if the file has grown, a new segment is created.  The
        segment descriptor is then unlinked and relinked into the cache
        in cache priority order.  If the segment is above the VM cutoff
        it is loaded into the cache: lower priority segments are freed as
        necessary, and if the file is an existing file it is marked
        WILL_NEED to queue the file data to be read into memory.

        If the file is a new file it must already have been created
        externally to be managed under VMcache.  The file size at access
        time determines the size of the file entry in the cache.  Some
        systems (BSD, Sun) allow a mmap to extend beyond the end of a
        file, but others (Linux) do not.  To reserve space for a large
        file whose ultimate size is known in advance, one can write a
        byte where the last byte of the file will be (as with zfaloc in
        IRAF) before caching the file, and the entire memory space will
        be reserved in advance.  If a file is cached and later extended,
        re-accessing the file will automatically cache the new segment of
        the file (see above).
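    The priority computation referenced above is simple enough to state
directly in code.  The following C sketch implements the formula as
given; the structure and field names are illustrative, and refbase and
tock are shown as compile-time constants where the daemon would treat
them as tunable parameters.

	#include <time.h>

	#define	REFBASE	1	/* refs needed before a file is cacheable */
	#define	TOCK	600	/* seconds needed to cancel one reference  */

	/* Hypothetical per-file record; field names are illustrative. */
	struct vm_file {
	    int     userpri;		/* user priority (0 = never cache) */
	    int     nref;		/* number of references            */
	    time_t  last_access;	/* time of most recent access      */
	};

	/* Compute priority = userpri * max (0, nref - refbase -
	 * (time - last_access) / tock).
	 */
	int
	vm_priority (const struct vm_file *fp, time_t now)
	{
	    int pri = fp->nref - REFBASE -
		(int) ((now - fp->last_access) / TOCK);

	    return (fp->userpri * (pri > 0 ? pri : 0));
	}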
Data structures

    Segment descriptors

        o  List of segments linked in memory allocation order:
           -  the first N segments are cached (whatever will fit)
           -  the remainder are maintained in the list, but are not
              cached
           -  manually cached/refreshed segments go to the head of the
              list
           -  accessed files are inserted in the list based on priority

        o  List of segments belonging to the same file:
           -  a file can be stored in the cache in multiple segments

    File hash table

        o  provides fast lookup of an individual file
        o  hashes dev+ino to a segment
        o  a segment points to the next segment if a collision occurs
        o  only the initial/root file segment is indexed

        (A C sketch of these descriptors follows the FITS Kernel
        discussion below.)

    Cache management

        Relinking of the main list occurs only in certain circumstances:

        o  when a segment is manually cached/uncached/refreshed:
           -  the referenced segment moves to the head of the list
           -  a new segment is always cached
        o  when a file or segment is accessed:
           -  the priority of each element is computed and the segment is
              placed in priority order (only the referenced segment is
              moved)
        o  caching/uncaching may occur due to a new VM cutoff:
           -  when a new segment is added
           -  when an old segment is deleted

        Residency in memory is determined by link order.  Priority
        normally determines memory residency, but manual caching will
        override it (for a time).

File Driver Issues

    Image kernels

        Currently only OIF uses the SF driver; FXF, STF, and QPF (FMIO)
        all use the BF driver.  Some or all could be changed to use SF if
        it is made compatible with BF; otherwise the VM hooks need to go
        into the BF driver.  Since potentially any large file can be
        cached, putting the VM support into BF is a reasonable option.

        The FITS kernel is currently a problem, as it violates device
        block size restrictions by using a block size of 2880.

        It is always a good idea to use falloc to pre-allocate storage
        for a large imagefile when the size is known in advance.  This
        permits the VM system to reserve VM space for a new image before
        data is written to the file.

    Direct I/O

        Direct i/o is possible only if transfers are aligned on device
        blocks and are an integral number of blocks in length.  Direct
        i/o flushes any VM buffered data for the file.  This is not
        possible while a file is mapped into memory, hence direct i/o is
        disabled for a file while it is mapped.  This decision is made at
        read/write time, hence cannot be determined reliably when a file
        is opened.

    FITS Kernel

        Until the block size issues can be addressed, direct i/o cannot
        be used for FITS images.  Some VM cache control is still possible
        however.  Options include:

        o  Always cache a .fits image: either set vmcached to cache a
           file on the first access, or adjust the cache parameters based
           on the file type.  Use a higher priority for explicitly cached
           files (e.g., Mosaic readouts), so that running a sequence of
           normal i/o images through the cache does not flush the high
           priority images.

        o  Writing to new files which have not been pre-allocated is
           problematic, as a large amount of data can be written, causing
           paging.  One way to deal with this is to use large transfers
           (IMIO will already do this), and to issue a reservespace
           directive on each file write at EOF, to free up VM space as
           needed.  The next access directive would cause the new portion
           of the image to be mapped into the cache.  A possible problem
           is that the new file may initially be too small to reach the
           cache threshold.  Space could be reserved in any case, waiting
           for the next access to cache the file; the cache daemon could
           always cache new files of a certain type; or the file could be
           cached when it reaches the cache threshold.
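    As noted under "File hash table" above, the following C sketch shows
one way the segment descriptors and the dev+ino hash could be laid out.
The type, field, and function names are illustrative only and are not
taken from the actual VMcache code.

	#include <sys/types.h>
	#include <stddef.h>

	struct vm_segment {
	    struct vm_segment *next, *prev;   /* main list, cache order  */
	    struct vm_segment *fnext;	      /* next segment, same file */
	    struct vm_segment *hnext;	      /* hash collision chain    */
	    void    *addr;		      /* mapped address          */
	    off_t   offset;		      /* file offset of segment  */
	    size_t  nbytes;		      /* size of segment         */
	    dev_t   dev;		      /* device of file          */
	    ino_t   ino;		      /* inode of file           */
	};

	#define	NHASH	1024

	/* Hash chains; only the root segment of each file is indexed. */
	static struct vm_segment *hashtab[NHASH];

	/* Look up the root segment of a file by dev+ino. */
	struct vm_segment *
	seg_lookup (dev_t dev, ino_t ino)
	{
	    struct vm_segment *sp;

	    for (sp = hashtab[(dev + ino) % NHASH];  sp;  sp = sp->hnext)
		if (sp->dev == dev && sp->ino == ino)
		    return (sp);
	    return (NULL);
	}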
Kernel File Driver

    An environment variable will be used in the OS driver to define a
    cache threshold, or to disable use of VMcache entirely; we need to be
    able to specify these two things separately.  If a cache threshold is
    set, files smaller than this size will not result in a query to the
    cache daemon.  If there is no cache threshold but VMcache is enabled,
    the cache daemon decides whether the file is too small to be cached.
    It should also be possible to force the use of direct i/o if the file
    is larger than a certain size.

    Kernel file driver parameters:

        enable      Boolean; enables use of vmcache.

        vmcache     Use vmcache only if the file size equals or exceeds
                    the specified threshold.

        directio    If the file size equals or exceeds the specified
                    threshold, use direct i/o to access the file.  If
                    direct i/o is enabled in this fashion then vmcache is
                    not used (otherwise vmcache decides whether to use
                    direct i/o for a file).

        port        Socket number to be used.

    For example:

        VMPORT=8797
        VMCLIENT=enable,threshold=10m,directio=10m
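    To make the precedence of these parameters concrete, the following C
sketch shows the decision the driver would make when accessing a file.
The logic mirrors the rules above (directio overrides vmcache; a zero
threshold defers the size decision to the cache daemon); the names follow
the VMCLIENT example and are illustrative, not actual driver code.

	#include <stddef.h>

	enum iomode { IO_NORMAL, IO_VMCACHE, IO_DIRECT };

	/* Choose an i/o mode for a file of the given size.  Sizes are in
	 * bytes; a zero threshold means "let the cache daemon decide".
	 */
	enum iomode
	choose_iomode (size_t fsize, int enable,
	    size_t threshold, size_t directio)
	{
	    if (directio > 0 && fsize >= directio)
		return (IO_DIRECT);	/* forced direct i/o; no vmcache */
	    if (enable && (threshold == 0 || fsize >= threshold))
		return (IO_VMCACHE);	/* query the cache daemon */

	    return (IO_NORMAL);		/* default i/o mechanism */
	}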