.help dbss Sep85 "Design of the IRAF Database Subsystem"
.ce
\fBDesign of the IRAF Database Subsystem\fR
.ce
Doug Tody
.ce
September 1985
.sp 2
.nh
Preface

The primary purpose of this document is to define the interfaces
comprising the IRAF database i/o subsystem to the point where they can be
built rapidly and efficiently, with confidence that major changes will not
be required after implementation begins.  The document also serves to
inform all interested parties of what is planned while there is still time
to change the design.  A change which can easily be made to the design
prior to implementation may become prohibitively expensive as
implementation proceeds.  After implementation is completed and the new
subsystem has been in use for several months, the basic interfaces will be
frozen and the opportunity for change will have passed.

The description of the database subsystem presented in this document
should be considered no more than a close approximation to the system
which will actually be built.  The specifications of the interface can be
expected to change in detail as the implementation proceeds.  Any code
which is written according to the interface specifications presented in
this document may have to be modified slightly before system testing with
the final interfaces can proceed.
.nh 2
Scope of this Document

The scope of this document is the conceptual design and specification of
all IRAF packages and i/o interfaces directly involved with either user or
program access to binary data maintained in mass storage.  Versions of
some of the interfaces described are already in use; when this is the case
it will be noted in the text.

This document is neither a user's guide nor a reference manual.  The
reader is assumed to be familiar both with database technology and with
the IRAF system.  In particular, the reader should be familiar with the
concept of the IRAF VOS (virtual operating system), with the features of
the IMIO (image i/o), FIO (file i/o), and OS (host system interface)
interfaces, and with the architecture of the network interface.
.nh 2
Relationship to Previous Documents

This document supersedes the document "IRAF Database I/O", November 1984.
Most of the concepts presented in that document are still valid, but they
have been expanded upon greatly in the present document.  The scope of the
original document was limited to the DBIO interface alone, whereas the
scope of the present document has been expanded to encompass all
subsystems or packages directly involved with binary data access.  This
expansion in the scope of the project was necessary to meet our primary
goal of completing and freezing the program interface, of which DBIO is
only a small part.  Furthermore, it is difficult to have confidence in the
design of a single subsystem without working out the details of all
closely related or dependent subsystems.

In addition to expanding the scope of the database design project to cover
more interfaces, the requirements which the database subsystem must meet
have been expanded since the original conceptual design was done.  In
particular, it has become clear that data format conversions are
prohibitively expensive for our increasingly large datasets.  Conversions
such as those between FITS and internal format (for an image), or between
FITS table and internal format (for a database), are too expensive to be
performed routinely.  Data which is archived in a machine independent
format should not have to be reformatted to be accessed by the online
system.
The archival format may vary from site to site, and it should be possible
to read the different formats without reformatting the data.  Large
datasets should not have to be reformatted to be moved between machines
with incompatible binary data formats.  A change such as this in the
requirements for an interface can have a major impact on the design of the
final interface.  It is essential that all such requirements be identified
and dealt with in the design before implementation begins.
.nh
Introduction

In this section we introduce the database subsystem and summarize the
reasons why we need such a system.  We then introduce the major components
of the database subsystem and briefly mention some related subsystems.
.nh 2
The Database Subsystem

The database subsystem (DBSS) is conceived as a single comprehensive
system to be used to manage and access all binary (non-textfile) data
accessed by IRAF programs.  Simple applications are perhaps most easily
and flexibly dealt with using text files for the storage of data,
descriptors, and control information.  As the amount of data to be
processed grows, however, or as the data structures to be accessed grow in
complexity, the text file approach becomes seriously inefficient and
cumbersome.  Converting the text files to binary files makes processing
more efficient, but does little to address the problems of complex data
structures.  Efficient access to complex data structures requires complex
and expensive software.  Developing such software specially for each and
every application is prohibitively expensive in a large system; hence the
need for a general purpose database system becomes clear.

Use of a single central database system has significant additional
advantages.  A standard user interface can be used to examine, edit, list,
copy, etc., all data maintained under the database system.  Many technical
problems may be addressed in a general purpose system that would be too
expensive to address in a particular application, e.g., the problems of
storing variable size data elements, of dynamically and randomly updating
a dataset, of byte packing to conserve storage, of maintaining indexes so
that a record may be found efficiently in a large dataset, of providing
data independence so that storage formats may be changed without need to
change the programs accessing the data, and of transporting binary
datasets between incompatible machines.  All of these are examples of
problems which are \fInot\fR adequately addressed by the current IRAF i/o
interfaces, nor by the applications programs which use them.
.nh 2
Major Subsystems

The major subsystems comprising the IRAF DBSS are depicted in Figure 1.
At the highest level are the CL (command language) packages, each of which
consists of a set of user callable tasks.  The IMAGES package (consisting
of general image processing operators) is shown for completeness, but
since there are many such packages in the system they are not considered
part of the DBSS and will not be discussed further here.  The DBMS
(database management) package is the user interface to the DBSS, and may
some day be the largest part of the DBSS in terms of number of lines of
code.

In the center of the figure we see the VOS (virtual operating system)
packages IMIO, DBIO and FIO.  FIO (file i/o) is the standard IRAF file
interface and will not be discussed further here.  IMIO (image i/o) and
DBIO (database i/o) are the two major i/o interfaces in the DBSS and are
the topic of much of the rest of this document.
IMIO and DBIO are the two parts of the DBSS of interest to applications
programmers; these interfaces are implemented as libraries of subroutines
to be called directly by the applications program.  IMIO and FIO are
existing interfaces.

At the bottom of the figure is the DB Kernel.  The DB Kernel is the
component of the DBSS which physically accesses the data in mass storage
(via FIO).  The DB Kernel is called only by DBIO and hence is invisible to
both the user and the applications programmer.  There is a lot more to the
DB Kernel than is evident from the figure, and indeed the DB Kernel will
be the subject of another figure when we discuss the system architecture
in section 4.2.
.ks
.nf
        DBMS          IMAGES(etc)                        (CL)
           \              /
            \            /
     ---------\----------/---------------------------------
               \        /
                \     IMIO
                 \    /  \
                  \  /    \
                  DBIO     FIO                           (VOS)
                    |       |
                    |       |
     ---------------|-------|-------------------------------
                    |       |
               (DB Kernel)                 (VOS or Host System)
.fi
.ce
Figure 1.  Major Components of the Database Subsystem
.ke

With the exception of certain optional subsystems to be outlined later,
the entire DBSS is machine independent and portable.  The IRAF system may
be ported to a new machine without any knowledge whatsoever of the
architecture or functioning of the DBSS.
.nh 2
Related Subsystems

Several additional IRAF subsystems or packages are of interest from the
standpoint of the DBSS.  These are the PLOT package, the graphics
interface GIO, and the LISTS package.

The PLOT package is a CL level package consisting of general plotting
utilities.  In general PLOT tasks can accept input in a number of standard
formats, e.g., \fBlist\fR (text file) format and \fBimagefile\fR format.
The DBSS will provide an additional standard format which should perhaps
be directly accessible by the PLOT tasks.  Even if this is not done, a
very general plotting capability will automatically be provided by
"piping" the list format output of a DBMS task to a PLOT task.  Additional
graphics capabilities will be provided as built in functions in the DBMS
\fBquery language\fR, which will access GIO directly to make plots.  The
query language graphics facilities will be faster and more convenient to
use, but less extensive and less sophisticated than those provided by
PLOT.

The LISTS package is interesting because the facilities provided and
operations performed resemble those provided by the DBMS package in many
respects.  The principal difference between the two packages is that the
LISTS package operates on arbitrary text files, whereas the DBMS package
operates only upon DBIO format binary files.  The textual output of
\fIany\fR IRAF or non-IRAF program may serve as input to a LISTS operator,
as may any ordinary text file, e.g., the source files for a program or
package.  A typical LISTS database is a directory full of source files or
documentation; LISTS can also operate on tables of numbers, but the former
application is perhaps more common.  Using LISTS it is possible to
conveniently and rapidly perform operations (evaluate queries) which would
be cumbersome or impossible to perform with a conventional database system
such as DBMS.  On the other hand, the LISTS operators would be hopelessly
inefficient for the types of applications for which DBMS is designed.
.nh
Requirements

Requirements define the problem to be solved by a software system.  There
are two types of requirements: non-functional requirements, i.e.,
restrictions or constraints, and functional requirements, i.e., the
functions which the system must perform.
Since nearly all IRAF science software will be heavily dependent on the
DBSS, the requirements for this subsystem are as strict as those for any
subsystem in IRAF.
.nh 2
General Requirements

The general requirements which the DBSS must satisfy primarily take the
form of constraints or restrictions.  These requirements are common to all
mainline IRAF system software.  Note that these requirements are
\fInot\fR automatically enforced for all system software.  If a particular
subsystem is prototype or optional (not required for the normal
functioning of IRAF) then these requirements can be relaxed.  In
particular, certain parts of the DBSS (e.g., the host database interface)
are optional and are not subject to the same constraints as the mainline
software.  The primary functional requirements discussed in section 3.2,
however, must be met by software which satisfies all of the general
requirements discussed here.
.nh 3
Portability

All software in the DBMS, IMIO, and DBIO interfaces and in the DB kernel
must be fully portable under IRAF.  To meet this requirement the software
must be written in the IRAF SPP language, using only the facilities
provided by the IRAF VOS.  In particular, this rules out complicated
record locking schemes in the DB kernel, as well as any type of
centralized database server which relies on process control, IPC, or
signal handling facilities not provided by the IRAF VOS.  For most
processes the requirement is even more strict, i.e., ordinary IRAF
processes are not permitted to rely upon the VOS process control or IPC
facilities for their normal functioning (the IPC connection to the CL is
an exception, since it is not required to run an IRAF process standalone).
.nh 3
Efficiency

The database interface must be efficient, particularly when used for image
access and intermodule communication.  There are as many ways to measure
the efficiency of an interface as there are applications for the
interface, and we cannot address them all here.  The dimensions of the
efficiency matrix we are concerned with here are the cpu time consumed
during execution, the clock time consumed during execution (e.g., the
number of file opens and disk seeks required), and the disk space consumed
for table storage.  Where necessary, efficient cpu utilization will be
achieved at the expense of memory requirements for code and buffers.

A simple and well defined efficiency requirement is that the cpu and clock
time required to access the pixels of an image stored in the database from
a "cold start" (no open files) must not noticeably exceed that required by
the old IMIO interface.  The efficiency of the new interface for the case
when many images are to be accessed is expected to be a major improvement
over that provided by the old IMIO interface, since the old interface
stores each image in two separate files, whereas the new interface will be
capable of storing the entire contents of many (small) images in a single
file.  The amount of disk space required for image header storage is also
expected to decrease by a large factor when multiple images are stored in
a single physical file.
.nh 3
Code Size

We have already established that a process must directly access the
database in mass storage to meet our portability and efficiency
requirements.  This type of access requires that the necessary IMIO, DBIO
and DB Kernel routines be linked into each process requiring database
access.
Minimizing the amount of text space used by the database code is desirable
to minimize disk and memory requirements and process spawn time, but is
not critical, since memory is cheap and plentiful and is likely to become
even cheaper and more plentiful in the future.  Furthermore, the multitask
nature of IRAF processes allows the text segment used by the database code
to be shared by many tasks, saving both disk and memory.  The main problem
remaining today with large text segments seems to be the process spawn
time; loading the text segment by demand paging in a virtual memory
environment can be quite slow.  The fault here seems to lie more with the
operating system than with IRAF, and probably the solution will require
tuning either the IRAF system interface or the operating system itself.

Taking all these factors into account, it would seem that typical memory
requirements for the executable database code (not including data buffers)
in the range 50 to 100 Kb would be acceptable, with 50 Kb being a
reasonable goal.  This would make the database interface the largest i/o
interface in IRAF, but that seems inevitable considering the complexity of
the problem to be solved.
.nh 3
Use of Proprietary Software

A mainline IRAF interface, i.e., any interface required for the normal
operation of the system, must belong to IRAF and must be distributed with
the IRAF system at no additional charge and with no licensing
restrictions.  The source code must be part of the system and is subject
to strict configuration control by the IRAF group, i.e., the IRAF group is
responsible for the software and must control it.  This rules out the use
of a commercial database system for any essential part of the DBSS, but
does not rule out IRAF access to a commercial database system, provided
such access is optional, i.e., not required for the operation of the
standard applications packages.  The host database interface provided by
the DB kernel is an example of such an interface.
.nh 2
Special Requirements

In this section we present the functional requirements of the DBSS.  The
major applications for which the DBSS is intended are described, and the
desirable characteristics of the DBSS for each application are outlined.
The major applications thus far identified are catalog storage, image
storage, intermodule communication, and data archiving.
.nh 3
Catalog Storage

The catalog storage application is probably the closest thing in IRAF to a
conventional database application.  A catalog is a set of records, each of
which describes a single object.  Each record consists of a set of fields
of various datatypes describing the attributes of the object.  A record is
produced by numerical analysis of the object, represented as a region of a
digital array.  All records have the same structure, i.e., set of fields;
often the records are all the same size (but not necessarily).  A large
catalog might contain several hundred thousand records.

Examples of such catalogs are the SAO star catalog, the IRAS point source
catalog, and the catalogs produced by analysis programs such as FOCAS (a
faint object detection and classification program) and RICHFLD (a digital
stellar photometry program).  Many similar examples can be identified.
Generation of such a catalog by an analysis program is typically a cpu
bound batch operation requiring many hours of computer time for a large
catalog.  Once the catalog has been generated there are typically numerous
questions of scientific interest which can be answered using the data in
the catalog.
It is highly desirable that this phase of the analysis be interactive and
spontaneous, as one question will often lead to another in an
unpredictable fashion.  A general purpose analysis capability is required
which will permit the scientist to pose arbitrary queries of arbitrary
complexity, to be answered by the system in a few seconds (or minutes for
large problems), with the answer taking the form of a number or name, a
set or table of numbers or names, a plot, a subcatalog, etc.  Examples of
such queries are given below.  Clearly, the set of all possible queries of
this type is infinite, even assuming a limited number of operators
operating on a single catalog.  The set of potentially interesting queries
is equally large.
.ls 4
.ls [1]
Find all objects of type "pqr" for which X is in the range A to B and Z is
less than 10.
.le
.ls [2]
Compute the mean and standard deviation of attribute X for all objects in
the set [1].
.le
.ls [3]
Compute and plot (X-Y) for all objects in set [1].
.le
.ls [4]
Plot a circle of size (log2(Z-3.2) * 100) at the position (X,Y) of all
objects in set [1].
.le
.ls [5]
Print the values of the attributes OBJ, X, Y, and Z of all objects for
which X is in the range A to B and Y is greater than 30.
.le
.le

In the past, queries such as these have all too often been answered by
writing a program to answer each query, or worse, by wading through a
listing of the program output and manually computing the result or
manually plotting points on a graph.

Given the preceding description of the catalog storage application, we can
make the following observations about the application of the DBSS to
catalog storage.
.ls
.ls o
A catalog is typically written once and then read many times.
.le
.ls o
Both public and private catalogs are common.
.le
.ls o
Catalog records are infrequently updated, or are not updated at all once
the original entry has been made in the catalog.
.le
.ls o
Catalog records are rarely if ever deleted.
.le
.ls o
Catalogs can be very large, making efficient storage structures important
in order to minimize disk storage requirements.
.le
.ls o
Since catalogs can be very large, indexing facilities are required for
efficient record retrieval and for the efficient evaluation of queries.
.le
.ls o
A general purpose interactive query capability is required for the user to
effectively make use of the data in a catalog.
.le
.le

In DBSS terminology a user catalog will often be referred to as a
\fBtable\fR, to avoid confusion with the use of the DBSS term
\fBcatalog\fR, which refers to the system table which lists the contents
of a database.
.nh 3
Image Storage

A primary requirement for the DBSS, if not \fIthe\fR primary requirement,
is that the DBSS be suitable for the storage of bulk data or \fBimages\fR.
An image consists of two parts: an \fIimage header\fR describing the
image, and a multidimensional array of \fBpixels\fR.  The pixel array is
sometimes small and sometimes very large indeed.  For efficiency and other
reasons the actual pixel array is not required to be stored in the
database.  Even if the pixels are stored directly in the database they are
not expected to be used in queries.

We can make the following observations about the use of the DBSS for image
storage.  The reader concerned about how all this might map into the
storage structures provided by a relational database should assume that
the image header is stored as a single large, variable size record
(tuple), whereas a group of images is stored as one or more tables
(relations).  If the images are large, assume the pixels are stored
outside the DBSS in a file, with only the name of the file stored in the
header record.
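For example, a group of images might map onto a relation resembling the
following sketch (purely illustrative; the attribute names are
hypothetical):
.ks
.nf
	# one tuple per image in the group "nite1"
	nite1 (image, naxis, axlen1, axlen2, pixtype, ..., pixfile)
.fi
.ke

with \fIpixfile\fR naming the external file containing the pixel array
whenever the pixels are not stored inline in the header record.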
.ls
.ls o
Images tend to be grouped into sets that have some logical meaning to the
user, e.g., "nite1", "nite2", "raw", "reduced", etc.  Each group typically
contains dozens or hundreds of images (enough to require use of an index
for efficient retrieval).
.le
.ls o
Within a group the individual images are often referred to by a unique
ordinal number which is automatically assigned by some program (e.g.,
"nite1.10", "nite1.11", etc.).
.le
.ls o
Image databases tend to be private databases, created and accessed by a
single user.
.le
.ls o
The size of the pixel segment of an image varies enormously, e.g., from 1
kilobyte to 8 megabytes, even 40 megabytes in some cases.
.le
.ls o
Small pixel segments are most efficiently stored directly in the image
header, to minimize the number of file opens and disk seeks required to
access the pixels once the header has been accessed (as well as to
minimize file clutter).
.le
.ls o
Large pixel segments are most efficiently stored separately from the image
headers, to increase clustering and speed sequential searches of a group
of headers.
.le
.ls o
It is occasionally desirable to store either the image header or the pixel
segment on a special, non file-structured device.
.le
.ls o
The image header logically consists of a closed set of standard attributes
common to all images, plus an open set of attributes peculiar to the data
or to the type of analysis being performed on the data.
.le
.ls o
The operations performed on images are often functions which produce a
modified version of the input image(s) as a new output image.  It is
desirable for most header information to be automatically preserved in
such a mapping.  For this to happen automatically, without the DBSS
requiring knowledge of the contents of a header, it is necessary that the
header be a single object to the DBSS, i.e., a single record in some
table, rather than a set of related records in several tables.
.le
.ls o
Since the image header needs to be maintained as a single record, and
since the header may contain an unpredictable number of application or
data specific attributes, image headers can be quite large.
.le
.ls o
Not all image header attributes are simple scalar values or even fixed
size arrays.  Variable size attributes, i.e., arrays, are common in image
headers.  Examples of such attributes are the bad pixel list, history
text, and world coordinate system (more on this in a later section).
.le
.ls o
Image header attributes often form logical groupings, e.g., several
logically related attributes may be required to define the bad pixel list
or the world coordinate system.
.le
.ls o
The image header structure is often dynamically updated, and may change in
size when updated.
.le
.ls o
It is often necessary to add new attributes to an existing image header.
.le
.ls o
Images are often selectively deleted.  Any subordinate files logically
associated with the image should be automatically deleted when the image
header is deleted.  If this is not possible under the DBSS, then the DBSS
should forbid deletion of the image header unless special action is taken
to remove delete protection.
.le
.ls o
For historical or other reasons, a given site will often maintain images
in several different and completely incompatible formats.
It is desirable for the DBSS to be capable of directly accessing images
maintained in a foreign format without a format conversion, even if only
limited (e.g., read only) access is possible.
.le
.le

In summary, images are characterized by a header with a highly variable
set of fields, some of which may vary in size during the lifetime of the
image.  New fields may be added to the image header at any time.  Array
valued fields are common, and fields tend to form logical groupings.  The
image header is best maintained as a single structure under the DBSS.
Image headers can be quite large.  The pixel segment of an image can be
extremely large and may be best maintained outside the DBSS.  Since many
existing image archives exist, each with its own unique format, it is
desirable for the DBSS to be capable of accessing multiple storage
formats.

Storage of the pixel segment or any other portion of an image in a
separate file outside the DBSS causes problems which must be dealt with at
some level in the system, if not by the DBSS.  In particular, problems
occur if the user tries to backup, restore, copy, rename, or delete any
portion of an image using a host system utility.  These problems are
minimized if all logically related data is kept in a single data
directory, allowing the database as a whole to be moved or backed up with
host system utilities.  All pathnames should be defined relative to the
data directory, to permit relocation of the database to a different
directory.  Ideally, all binary datafiles in the database should be
maintained in a machine independent format, to permit movement of the
database between different machines without reformatting the entire
database.
.nh 3
Intermodule Communication

A large applications package consists of many separate tasks or programs.
These tasks are best defined and understood in terms of their operation on
a central package database.  For example, one task might fit some function
to an image, leaving a record describing the fit in the database.  A
second task might take this record as input and use it to control a
transformation on the original image.  Additional operators implementing a
range of algorithms, or optimized for a discrete set of cases, are easily
added, each relying upon the central database for intermodule
communication.

This application of the DBSS is a fairly conventional database application
except that array valued attributes and logical groupings of attributes
are common.  For example, assume that a polynomial has been fitted to a
data vector and we wish to record the fit in the database.  A typical set
of attributes describing a polynomial fit is shown below.
.ks
.nf
	image_name	char*30		# name of source image
	nfeatures	int		# number of features fitted
	features.x	real*4[*]	# x values of the features
	features.y	real*4[*]	# y values of the features
	curve.type	char*10		# curve type
	curve.ncoeff	int		# number of coefficients
	curve.coeff	real*4[*]	# coefficients
.fi
.ke

The data structure shown records the positions (X) and world coordinates
(Y) of the data features to which the curve was fitted, plus the
coefficients of the fitted curve itself.  There is no way of predicting
the number of features, hence the X and Y arrays are variable length.
Since the fitted curve might be a spline or some other piecewise function
rather than a simple polynomial, there is likewise no reasonable way to
place an upper limit on the amount of storage required to store the fitted
curve.  This type of record is common in scientific applications.
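For contrast, consider what would be required to store the same fit in a
conventional relational system providing only fixed size scalar fields.
The logical record would have to be normalized into several tables, along
the lines of the following sketch (the SQL-style definitions are purely
illustrative, and all names are hypothetical):
.ks
.nf
	CREATE TABLE fit     (fitid INTEGER, image_name CHAR(30),
	                      curve_type CHAR(10), ncoeff INTEGER)
	CREATE TABLE feature (fitid INTEGER, x REAL, y REAL)
	CREATE TABLE coeff   (fitid INTEGER, seq INTEGER, value REAL)
.fi
.ke

The fit can then no longer be read or written as a single object;
reassembling it requires a join of three tables on the \fIfitid\fR
attribute.  Direct support for variable size, array valued fields allows
the entire fit to be stored and retrieved as a single record.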
We can now make the following observations regarding the use of the DBSS
for intermodule communication.
.ls
.ls o
The number of fields in a record tends to be small, but array valued
fields of variable size are common, hence the physical size of a record
may be large.
.le
.ls o
A large table might contain several hundred records in typical
applications, requiring the use of an index for efficient retrieval.
.le
.ls o
Record access is usually random rather than sequential.
.le
.ls o
Random record updates will be rare in some applications, but common in
others.
.le
.ls o
Records will often change in size when updated.
.le
.ls o
Selective record deletion is rare, occurring mostly during cleanup
following an error.
.le
.ls o
New fields are rarely, if ever, added to existing records.  The record
structure is usually determined by the programmer rather than by the user,
and tends to be well defined.
.le
.ls o
This type of database is typically a private database created and used by
a single user to process a specific dataset with a specific applications
package.
.le
.le

Application specific information may sometimes be stored directly in the
header of the image being analyzed, but more often will be stored in one
or more separate tables, recording the name of the image analyzed in the
new record as a backpointer, as in the example.  Hence a typical
scientific database might consist of several tables containing the input
images, several tables containing intermodule records of various types,
and one or more tables containing either reduced images or catalog
records, depending on whether a reduction or analysis operation was
performed.
.nh 3
Data Archiving

Data archiving refers to the long term storage of raw or reduced data.
Data archiving is important for the following reasons.
.ls
.ls o
Archiving is currently necessary just to \fItransport\fR data from the
telescope to the site where reduction and analysis takes place.
.le
.ls o
Permanently archiving the raw (or pre-reduced) data is necessary in case
an error in the reduction process is later discovered, making it necessary
for the observer to repeat the reductions.
.le
.ls o
Archiving of the reduced data is desirable to save computer and human time
in case the analysis phase has to be repeated, or in case additional
analysis is later discovered to be necessary.
.le
.ls o
Archived data could conceivably be of considerable value to future
researchers who, given access to such data, might not have to make
observations of their own, or who might be able to use the archived data
to augment or plan their own observations.
.le
.ls o
Archived data could be invaluable for future projects studying the
variability of an object or objects over a period of years.
.le
.le

Ideally, data should be archived as it is taken at the telescope, possibly
performing some simple pipeline reductions before archiving takes place.
Subsequent reduction and analysis using the archived data should be
possible without the format conversion (e.g., FITS to IRAF) currently
required.  This conversion wastes cpu time and disk space as well as user
time.  The problem is already serious and is expected to grow by an order
of magnitude in the next several years, as digital detectors grow in size
and are used more frequently.

Archival data consists of the digital data itself (the pixels) plus
information describing the object, the observer, how the data was taken,
when and where the data was taken, and so on.  This is just the type of
information assumed to be present in an IRAF image.
In addition, one would expect the archive to contain one or more
\fBmaster catalogs\fR containing exhaustive information describing the
observations, but no data.  Since a permanent digital data archive can be
expected to be around for many years and to be read on many types of
machines, data images should be archived in a machine independent format;
this format would almost certainly be FITS.  It is also desirable, though
not essential, that the master catalogs be readable on a variety of
machines and hence be maintained and distributed in a machine independent
format.  The ideal storage medium for archiving and transporting large
amounts of digital data appears to be the optical disk.

Archival data and catalog access via the DBSS differs from conventional
image and catalog access only in the storage format, which is assumed to
be machine independent, and in the storage medium, which is assumed to be
an archival medium such as the optical disk.  Direct access to a database
on optical disk requires that the DBSS be able to read the machine
independent format directly.  To achieve acceptable performance for direct
access, it is necessary that the storage medium be randomly accessible
(unlike, say, a magnetic or optical tape) and that the hardware seek time
and transfer rate be comparable to those provided by magnetic disk
technology.  Note that current optical disk readers often do not have fast
seek times, and that those that do have fast seek times generally have a
lower storage density than sequential devices, due to the gaps between
sectors.  Even if a device is not fast enough to be used directly, it is
still possible to eliminate the expensive format conversion and do only a
disk to disk copy, accessing the machine independent format on magnetic
disk.

There is no requirement that the IRAF DBSS be used to support data
archiving, but the DBSS \fIis\fR required to be able to access the data in
an archive.  Accessing the master catalogs as well seems reasonable, since
such a catalog is no different from those described in sections 3.2.1 and
3.2.3; IRAF will have the capability to maintain, access, and query such a
catalog without developing any additional software.

The main obstacle likely to limit the success of data archiving may well
be the difficulty involved in gaining access to the archive.  If the
master catalogs were maintained on magnetic disk but released periodically
in optical disk format for astronomers to refer to at their home
institutions, access would be much easier (and probably more frequent)
than if all the astronomers in the country were required to access a
single distant computer via modem.  Telephone access by sites not on the
continent would probably be too expensive or problematic to be feasible.
.nh 2
Other Requirements

In earlier sections we have discussed the principal constraints and
primary requirements for the DBSS.  Several other requirements or
non-requirements deserve mention.
.nh 3
Concurrency

All of the applications identified thus far require either read-only
access to a public database or read-write access to a private database.
The DBSS is therefore not required to support simultaneous updating by
many users of a single centralized database, with all the overhead and
complication associated with record locking, deadlock avoidance and
detection, and so on.  The only exception occurs when a single user has
several concurrent processes requiring simultaneous update access to the
user's private database.
It appears that this case can be addressed adequately by distributing the
database in several datasets and using host system file locking to lock
the datasets, a technique discussed further in a later section.
.nh 3
Recovery

If a database update is aborted for some reason, a dataset can be
corrupted, possibly preventing further access to the dataset.  The DBSS
should of course protect datasets from corruption in normal circumstances,
but it is always possible for a hardware or software error (e.g., disk
overflow or reboot) to cause a dataset to be corrupted.  Some mechanism is
required for recovering a database that has been corrupted.  The minimum
requirement is that the DBSS, when asked to access a corrupted dataset,
detect that the dataset has been corrupted and abort, after which the user
runs a recovery task to rebuild the dataset minus the corrupted records.
.nh 3
Data Independence

Data independence is a fundamental property inherent in virtually all
database systems.  One of the major reasons one uses a database system is
to provide data independence.  Data independence is so fundamental that we
will not discuss it further here.  Suffice it to say that the DBSS must
provide a high degree of data independence, allowing applications programs
to function without detailed knowledge of the structure or contents of the
database they are accessing, and allowing databases to change
significantly without affecting the programs which access them.
.nh 3
Host Database Interface

The host database interface (HDBI) makes it possible for the DBSS to
interface to a host database system.  The ability to interface to a host
database system is not a primary requirement for the DBSS, but is a highly
desirable one, for many of the same reasons that direct access to archival
data is important.  The problems of accessing a HDB and of accessing an
archive maintained in non-DBSS format are similar, and might perhaps be
addressed by a single interface.
.nh
Conceptual Design

In this section we develop the design of the various subsystems comprising
the DBSS at the conceptual level, without bothering with the details of
specific language bindings or with the details of implementation.  We
start by defining some important terms and then describe the system
architecture.  Lastly, we describe each of the major subsystems in turn,
starting at the highest level and working down.
.nh 2
Terminology

The DBSS is an implementation of a \fBrelational database\fR.  A
relational database views data as a collection of \fBtables\fR.  Each
table has a fixed set of named columns and may contain any number of rows
of data.  The rows of a table are often referred to as \fBrecords\fR.  A
record consists of a set of named \fBfields\fR.  The fields of a record
are the columns of the table containing the record.  We shall use this
informal terminology when discussing the contents of a physical database.
When discussing the \fIstructure\fR of a database we shall use the formal
relational terms relation, tuple, attribute, and so on.  The
correspondence between the formal relational terms and their informal
equivalents is given in the table below.
.ks
.nf
	\fBformal relational term\fR	\fBinformal equivalents\fR

	relation			table
	tuple				record, row
	attribute			field, column
	primary key			unique identifier
	domain				pool of legal values
.fi
.ke

A \fBrelation\fR is a set of like tuples.  A \fBtuple\fR is a set of
\fBattributes\fR, each of which is defined upon a specific domain.  A
\fBdomain\fR is an abstract type which defines the legal values an
attribute may take on (e.g., "posint" or "color").  The tuples of a
relation must be unique within the containing relation.  The \fBprimary
key\fR is a subset of the attributes of a relation which is sufficient to
uniquely identify any tuple in the relation (often a single attribute
serves as the primary key).
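As a simple illustration, consider the following small relation (the
relation and its contents are of course hypothetical):
.ks
.nf
	relation "stars":

	starid	name	ra		dec		mag
	------	------	--------	---------	-----
	S0001	Sirius	 6:45:09	-16:42:58	-1.46
	S0002	Vega	18:36:56	+38:47:01	 0.03
.fi
.ke

Here \fIstarid\fR, \fIname\fR, \fIra\fR, \fIdec\fR, and \fImag\fR are the
attributes, each row of values is a tuple, and \fIstarid\fR by itself
suffices as the primary key.  The domain of \fImag\fR might be defined as
the real numbers in some plausible range, and that of \fIname\fR as
character strings.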
The relational data model was chosen for the DBSS because it is the
simplest conceptual data model which meets our requirements.  Other
possibilities considered were the \fBhierarchical\fR model, in which data
is organized in a tree structure, and the \fBnetwork\fR model, in which
data is organized in a potentially recursive graph structure.  Virtually
all new database systems implemented since the mid-seventies have been
based on the relational model, and most database research today is in
support of the relational model (the remainder goes to the new
fifth-generation technology, not to the old data models).

The term "relational" in "relational database" comes from the
\fBrelational algebra\fR, a branch of mathematics based on set theory
which defines a fundamental and mathematically complete set of operations
upon relations (tables).  The relational algebra is fundamental to the
DBMS query language (section 4.3), but can be safely ignored in the rest
of the DBSS.  The reader is referred to any introductory database text for
a discussion of the relational algebra and other database technotrivia.
The classic introductory database text is \fI"An Introduction to Database
Systems"\fR, Volume 1 (Fourth Edition, 1986) by C. J. Date.
.nh 2
System Architecture

The system architecture of the DBSS is depicted in Figure 2.  The parts of
the figure above the "DBKI" have already been discussed in section 2.2.
The remainder of the figure is what has been referred to previously as the
DB kernel.

The primary function of DBIO is record access (retrieval, update,
insertion, and deletion) based on evaluation of a \fBselect\fR statement
input as a string.  DBIO can also process symbolic definitions of
relations and other database objects, so that new tables may be created.
DBIO does not implement any relational operators more complex than select;
the more complex relational operations are left to the DBMS query
language, to minimize the size and complexity of DBIO.

The basic concept underlying the design of the lower level portions of the
DBSS is that the DB kernel provides the \fBaccess method\fR for
efficiently accessing records in mass storage, while DBIO takes care of
all higher level functions.  In particular, DBIO implements all functions
required to access the contents of a record, while the DB kernel is
responsible for storage allocation and for the maintenance and use of
indexes, but has no knowledge of the actual contents of a record (the HDBI
is an exception to this rule, as we shall see later).

The database kernel interface (DBKI) provides a layer of indirection
between DBIO and the underlying database kernel (DBK).  The DBKI can
support a number of different kernels, much the way FIO can support a
number of different device drivers.  The DBKI also provides network access
to a remote database, using the existing IRAF kernel interface (KI) to
communicate with a DBKI on the remote node.  Two standard database kernels
are provided.  The primary DBK (at the right in the figure) maintains and
accesses DBSS binary datasets; this is the most efficient kernel, and
probably the only kernel which will fully implement the semantic actions
of the DBKI.
The second DBK (at the left in the figure) supports the host database
interface (HDBI) and is used to access archival data, any foreign image
formats, and the host database system (HDB), if any.  Specialized HDBI
drivers are required to access foreign image formats or to interface to an
HDB.
.ks
.nf
        DBMS          IMAGES(etc)                        (CL)
           \              /
            \            /
     ---------\----------/---------------------------------
               \        /
                \     IMIO
                 \    /  \
                  \  /    \
                  DBIO     FIO                           (VOS)
                    |
                  DBKI
                    |
         +----------+----------+
         |          |          |
        DBK        DBK        (KI)
         |          |          |
       HDBI         |          |
         |          |          |
    +----+----+     |          |
    |         |     |          |
    ----------------------------------------------------
    |         |     |          |
[archive]   [HDB] [dataset]    |          (host system)
                               |
                           - (LAN) -
                               |
    ----------------------------------------------------
                               |          (Kernel-Server)
                             DBKI                        (VOS)
                               |
                           +---+---+
                           |       |
                          DBK     DBK
.fi
.ce
Figure 2.  \fBDatabase Subsystem Architecture\fR
.ke
.nh 2
The DBMS Package
.nh 3
Overview

The user interfaces with a database in either of two ways.  The first way
is via the tasks in an applications package, which perform highly
specialized operations upon objects stored in the database, e.g., to
reduce a certain kind of data.  The second way is via the database
management package (DBMS), which gives the user direct access to any
dataset (but not to large pixel arrays stored outside the DBSS).  The DBMS
provides an assortment of general purpose operators which may be used
regardless of the type of data stored in the database, and regardless of
the applications program which originally created the structures stored in
the database.

The DBMS package consists of an assortment of simple procedural operators
(conventional CL callable, parameter driven tasks), a screen editor for
tables, and the query language, a large program which talks directly to
the terminal and which has its own special syntax.  Lastly, there is a
subpackage containing tasks useful only for datasets maintained by the
primary DBK, i.e., a package of relatively low level tasks for things like
crash recovery and examining the contents of physical datasets.
.nh 3
Procedural Interface

The DBMS procedural interface provides a number of the most commonly
performed database operations in the form of CL callable tasks, allowing
these simple operations to be performed without the overhead involved in
entering the query language.  Extensive database manipulations are best
performed from within the query language, but if the primary concern of
the user is data reduction in some package other than DBMS, the procedural
operators will be more convenient and less obtrusive.
.nh 4
General Operators

DBMS tasks are required to implement the following general database
management operations.  Detailed specifications for the actual tasks are
given later.
.ls
.ls \fBchdb\fR newdb
Change the default database.  To minimize typing, the DBSS provides a
"default database" paradigm analogous to the default directory of FIO.
Note that there need be no obvious connection between database objects and
files, since multiple tables may be stored in a single physical file, and
the physical database may reside on an optical disk, or worse, may be an
HDB.  Therefore the FIO "directory" cannot be used to examine the contents
of a database.  The default database may be set independently of the
current directory.
.le
.ls \fBpcatalog\fR [database]
Print the catalog of the named database.  The catalog is a system table
containing one entry for every table in the database; it is analogous to a
FIO directory.  Since the catalog is a table it can be examined like any
other table, but a special task is provided since the print catalog
operation is so common.  If no argument is given the catalog of the
default database is printed.
.le
.ls \fBptable\fR spe
Print the contents of the specified relation in list form on the standard
output.  The operand \fIspe\fR is a general select expression defining a
new table as a projection of some subset of the records in a set of one or
more named tables.  The simplest select expression is the name of a single
table, in which case all fields of all records in the table will be
printed.  More generally, one might print all fields of a single table,
selected fields of a single table (projection), all fields of selected
records of a single table (selection), or selected fields of selected
records from one or more tables (selection plus projection).
.le
.ls \fBrcopy\fR spe output_table
Copy (insert) the records specified by the general select expression
\fIspe\fR into the named \fIoutput_table\fR.  If the named output table
does not exist a new one will be created.  If the attributes of the output
table are different from those of the input table, the proper action of
this operator is not obvious and has not yet been defined.
.le
.ls \fBrmove\fR spe output_table
Move (insert) the relation specified by the general select expression
\fIspe\fR into the named \fIoutput_table\fR.  If the named output table
does not exist a new one will be created.  The original records are
deleted.  This operator is used to generate the union of two or more
tables.
.le
.ls \fBrdelete\fR spe
Delete the records specified by the general select expression \fIspe\fR.
Note that this operator deletes records from tables, not the tables
themselves.
.le
.ls \fBmkdb\fR newdb [ddl_file]
Create a new, empty database \fInewdb\fR.  If a data definition file
\fIddl_file\fR is named, it will be scanned and any domain, relation, etc.
definitions therein entered into the new database.
.le
.ls \fBmktable\fR table relation
Create a new, empty table \fItable\fR of type \fIrelation\fR.  The
parameter \fIrelation\fR may be the name of a DDL file, the name of an
existing base table, or any general record select/project expression.
.le
.ls \fBmkview\fR table relation
Create a new virtual table (view) defined in terms of one or more existing
base tables by the operand \fIrelation\fR, which is the same as for the
task \fImktable\fR.  Operationally, \fBmkview\fR is much like \fBrcopy\fR,
except that it is considerably faster and the new table does not
physically store any data.  The new view-table behaves like any other
table in most operations (except some types of updates).  Note that the
new table may reference tuples in several different base tables.  A
view-table may subsequently be converted into a base table with
\fBrcopy\fR.  Views are discussed in more detail in section 4.5.
.le
.ls \fBmkindex\fR table fields
Make a new index on the named base table over the listed fields.
.le
.ls \fBrmtable\fR table
Drop (delete, remove) the named base table (or view) and any indexes
defined on the table.
.le
.ls \fBrmindex\fR table fields
Drop (delete, remove) the index defined over the listed fields on the
named base table.
.le
.ls \fBrmdb\fR [database]
Destroy the named database.  Unless explicitly overridden, \fBrmdb\fR will
refuse to delete a database until all tables therein have been dropped.
.le
.le

Several terms were introduced in the discussion above which have not yet
been defined.  A \fBbase table\fR is a physical table (an instance of a
defined relation), unlike a \fBview\fR, which is a virtual table defined
via selection and projection over one or more base tables or other views.
Both types of objects behave equivalently in most operations.  A \fBdata
definition language\fR (DDL) is a language syntax used to define database
objects.
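As a concrete illustration, a short session using these operators might
look something like the following.  The database and table names are
hypothetical, and the quoted select expressions are only suggestive, since
the actual record selection syntax is defined by DBIO and is discussed in
a later section.
.ks
.nf
	cl> chdb ccd			# make "ccd" the default database
	cl> pcatalog			# list the tables in the database
	cl> ptable "nite1"		# print all records of table "nite1"
	cl> rcopy "nite1 where filter = 'V'" vband
	cl> mkindex vband "exptime"	# index the new table on a field
	cl> rmtable scratch		# drop a table and its indexes
.fi
.ke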
.nh 4
Forms Based Data Entry and Retrieval

Many of the records typically stored in a database are too large to be
printed in list format on a single line.  Some form of multiline output is
necessary; this multiline representation is called a \fBform\fR.  The full
terminal screen is used to display a form, e.g., with the fields labeled
in reverse video and the field values in normal video.  Records are viewed
one at a time.

Data entry via a form is an interactive process similar to editing a file
with a screen editor.  The form is displayed, possibly with default values
for the fields, and the user types in new values for the fields.  Editor
commands are provided for positioning the cursor to the field to be edited
and for editing within a field.  The DBSS verifies each value as it is
entered, using the range information supplied with the domain definition
for that field.  Additional checks may be made before the new record is
inserted into the output table, e.g., the DBSS may verify that values have
been entered for all fields which do not permit null values.
.ls
.ls \fBetable\fR spe
Call up the forms editor to edit a set of records.  The operand \fIspe\fR
may be any general select expression.
.le
.ls \fBpform\fR spe
Print a set of records on the standard output, using the forms generator
to generate a nice self documenting format.
.le
.le

The \fBforms editor\fR (etable) may be used to display or edit existing
records as well as to enter new ones.  It is desirable for the forms
editor to be able to move backward as well as forward in a table, as well
as to move randomly to a record satisfying a predicate, i.e., to search
through the table for a record.  This makes the forms editor a powerful
tool for browsing through a database.  If the predicate for a search is
specified by entering values or boolean expressions into the fields
contributing to the predicate, then we have a query-by-form utility, which
has been reported in the literature to be very popular with users (since
one does not have to remember a syntax, and typing is minimized).

A variation on the forms editor is \fBpform\fR, used to output records in
"forms" format.  This will be most useful for large records, or for cases
where one is more interested in studying individual records than in
comparing different records.  The alternative to forms output is list or
tabular format output.  This form of output is more concise and can be
used as input to the \fBlists\fR operators, but may be harder to read and
may overflow the output line.  List format output is discussed further in
the next section.

By default, the format of a form is determined automatically by a
\fBforms generator\fR, using information given in the DDL when the
database was created.  The domain definition capability of the DDL
includes provisions for specifying the default output format for a field
as well as the field label.  In most cases this will be sufficient
information for the forms generator to generate an esthetically acceptable
form.  If desired, the user or programmer can modify this form or create a
new form from scratch, and the forms generator will use the customized
form rather than create one of its own.  The CL \fBeparam\fR parameter
file editor is an example of a simple forms editor.  The main differences
between \fBeparam\fR and \fBetable\fR are the forms generator and the
browsing capability.
.nh 4
List Interface

The \fBlist\fR is one of the standard IRAF data structures.
A list is an ascii table wherein the standard record delimiter is the
newline and the standard field delimiter is whitespace.  Comment lines and
blank lines are ignored within lists; double comment lines ("## ...") may
optionally be used to label the columns of a list.  By default, non-DBMS
lists are free format; strings must be quoted if they contain one of the
field delimiter characters.  The field and record delimiter characters may
be changed if necessary, e.g., to permit multiline records.  Fixed format
lists are available as an option and are often required to interface to
external (non-IRAF) programs.

The primary advantages of the list or tabular format for printed tables
are the following.
.ls
.ls [1]
The list or tabular format is the most concise form of printed output.
The eye can rapidly scan up and down a column to compare the values of the
same field in a set of records.
.le
.ls [2]
DBMS list output may be used as input to the tasks in the \fBlists\fR,
\fBplot\fR, and other packages.  Using the pipe syntax, tasks which
communicate via lists may be connected together to perform arbitrarily
complex operations.
.le
.ls [3]
List format output is the de facto standard format for the interchange of
tabular data (e.g., DBSS tables) amongst different computers and programs.
A list (usually the fixed format variety) may be written onto a cardimage
tape for export, and conversely, a list read from a cardimage tape may be
used to enter a table into a DBSS database.
.le
.le

The most common use for list format output will probably be to print
tables.  When a table is too wide to fit on a line, the user will learn to
use \fBprojection\fR to print only the fields of interest.  The default
format for DBMS lists will be fixed format, using the format information
provided in the DDL specification to set the default output format.  Fixed
format is best for DBMS lists since it forces the field values to line up
in nice orderly columns, which are easier for a human to read (fixed
format is easier and more efficient for a computer to read as well, if not
to write).  The type of format used will be recorded in the list header,
and a \fBlist interface\fR will be provided so that all list processing
programs can access lists equivalently regardless of their format.

As mentioned above, the list interface can be used to import and export
tables.  In particular, an astronomical catalog distributed on card image
tape can be read directly into a DBSS table, once a format descriptor has
been prepared and the DDL for the new table has been written and used to
create an empty table ready to receive the data.  After only a few minutes
of setup a user can have a catalog entered into the database and be
getting final results using the query language interface!
.ls
.ls \fBrtable\fR listfile output_table
The list \fIlistfile\fR is scanned, inserting successive records from the
list into the named output table.  A new output table is created if one
does not already exist.  The format of the list is taken from the list
header if there is one; otherwise the format specification is provided by
the user in a separate file.
.le
.ls \fBptable\fR spe
Print the contents of the relation \fIspe\fR in list form on the standard
output.  The operand \fIspe\fR may be any general select/project
expression.
.le
.le

The \fBptable\fR operator (introduced in section 4.3.2.1) is used to
generate list output.  The inverse operation is provided by \fBrtable\fR.
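For example, \fBptable\fR might produce a list like the following (the
table and its fields are hypothetical; the column labels are the double
comment lines described above):
.ks
.nf
	cl> ptable "objects"
	## obj          x         y        z
	ngc1234      101.32    204.71    8.25
	ngc1365       88.04    310.26    9.10
	ngc4472      412.50     96.88    7.73
.fi
.ke

Such a list may be piped to a \fBlists\fR or \fBplot\fR task, redirected
into a text file for export, or read back into a table with \fBrtable\fR.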
.nh 4
FITS Table Interface

The FITS table format is a standard format for the transport of tabular
data.  The idea is very similar to the cardimage format discussed in the
last section, except that the FITS table standard includes a table header
used to define the format of the encoded table, hence the user does not
have to prepare a format descriptor to read a FITS table.  The FITS reader
and writer programs are part of the \fBdataio\fR package.
.nh 4
Graphics Interface

All of the \fBplot\fR package graphics facilities are available for
plotting DBMS data via the \fBlist\fR interface discussed in section
4.3.2.3.  List format output may also be used to generate output to drive
external (non-IRAF) graphics packages.  Plotting facilities are also
available via a direct interface within the query language; this latter
interface is the most efficient and will be the most suitable for most
graphics applications.  See section 2.3 for additional comments on the
graphics interface.
.nh 3
Command Language Interface

All of the DBMS tasks are CL callable and hence part of the command
language interface to the DBSS.  For example, a CL script task may
implement arbitrary relational operators using \fBptable\fR to copy a
table into a list, \fBfscan\fR and \fBprint\fR to read the list and format
the modified list, and finally \fBrtable\fR to insert the output list into
a table.  The query language may also be called from within a CL script to
process commands passed on the command line, via the standard input, or
via a temporary file.  Additional operators are required for randomly
accessing records without the use of a list; suitable operators are shown
below.
.ls
.ls \fBdbgets\fR record fields
The named fields of the indicated record are returned as a free format
string suitable for decoding into individual fields with \fBfscan\fR.
.le
.ls \fBdbputs\fR record fields values
The named fields of the indicated record are set to the values given in
the free format value string.
.le
.le

More sophisticated table and record access facilities are conceivable, but
cannot profitably be implemented until an enhanced CL becomes available.
.nh 3
Record Selection Syntax

As we have seen, many of the DBMS operators employ a general record
selection syntax to specify the set of records to be operated upon.  The
selection syntax will include a list of tables, and optionally a predicate
(boolean expression) to be evaluated for each record in the listed tables
to determine if the record is to be included in the final selection set.
In the simplest case a single table is named with no predicate, in which
case the selection set consists of all records in the named table.
Parsing and evaluation of the record selection expression is performed
entirely by the DBIO interface, hence we defer detailed discussion of
selection syntax to the sections describing DBIO.
.nh 3
Query Language

In most database systems the \fBquery language\fR is the primary user
interface, both for the end-user interactively entering ad hoc queries,
and for the programmer entering queries via the host language interface.
The major reasons for this are outlined below.
.ls
.ls [1]
A query language interface is much more powerful than a "task" or
subroutine based interface such as that described in section 4.3.2.  A
query language can evaluate queries much more complex than the simple
"select" operation implemented by DBIO and made available to the user in
tasks such as \fBptable\fR and \fBrcopy\fR.
.le .ls [2] A query language is much more efficient than a task interface for repeated queries. Information about a database may be cached between queries and files may remain open between queries. Complex queries may be executed as a series of simpler queries, caching the intermediate results in memory. Graphs may be generated directly from the data without encoding, writing, reading, decoding, and deleting an intermediate list. .le .ls [3] A query language can perform many functions via a single interface, reducing the amount of code to be written and supported, as well as simplifying the user interface. For example, a query language can be used to globally update (edit) tables, as well as to evaluate queries on the database. Lacking a query language, such an editing operation would have to be implemented with a separate task which would no doubt have its own special syntax for the user to remember (e.g., the \fBhedit\fR task in the \fBimages\fR package). .le .le Unlike most commercial database systems, the DBSS is not built around the query language. The heart of the IRAF DBSS is the DBIO interface, which is little more than a glorified record access interface. The query language is a high level applications task built upon DBIO, GIO, and the other interfaces constituting the IRAF VOS. This permits us to delay implementation of the query language until after the DBSS is in use and our primary requirements have been met, and then implement the query language as an experimental prototype. Like all data analysis software, the query language is not required to meet our primary requirements (data acquisition and reduction); rather, it is needed to do interesting things with our data once it has been reduced. .nh 4 Query Language Functions The query language is a prominent part of the user interface and is most often used interactively by the user, but may also be called noninteractively from within CL scripts and by SPP programs. The major functions performed by the query language are as follows. .ls .ls [1] The database management operations, i.e., create/destroy database, create/drop table or index, sort table, alter table (add new attribute), and so on. .le .ls [2] The relational operations, i.e., select, project, join, and divide (the latter is rarely implemented). These are the operations most used to evaluate queries on the database. .le .ls [3] The traditional set operations, i.e., union, intersection, difference, and cartesian product. .le .ls [4] The editing operations, i.e., selective record update and delete. .le .ls [5] Operations on the columns of tables. Compute the sum, average, minimum, maximum, etc. of the values in a column of a table. These operations are also required for queries. .le .ls [6] Tabular and graphical output. The result of any query may be printed or plotted in a variety of ways, without need to repeat the query. .le .le The most important function performed by the query language is of course the interactive evaluation of queries, i.e., questions about the data in the database. It is beyond the scope of this document to try to give the reader a detailed understanding of how a query language is used to evaluate queries. .nh 4 Language Syntax The great flexibility of a query language derives from the fact that it is syntax rather than parameter driven. The syntax of the DBMS query language has not yet been defined.
In choosing a language syntax there are several possible courses of action: [1] implement a standard syntax, [2] extend a standard syntax, or [3] develop a new syntax, e.g., as a variation on some existing syntax. The problem with rigorously implementing a standard syntax is that all query languages currently in wide use were developed for commercial applications, e.g., for banking, inventory, accounting, customer mailing lists, etc. Experimental query languages are currently under development for CAD applications, analysis of Landsat imagery, and other applications similar to ours, but these are all research projects at the present time. The basic characteristics desirable in a query language intended for scientific data reduction and analysis seem little different from those provided by a query language intended for commercial applications, hence the most practical approach is probably to start with some existing query language syntax and modify or extend it as necessary for our type of data. There is no standard query language for relational databases. The closest thing to a standard is SQL, a language originally developed by IBM for System-R (one of the first relational database systems, actually an experimental prototype), and still in use in the latest IBM product, DB2. This language has since been used in many relational products by many companies. SQL is the latest in a series of relational query languages from IBM; earlier languages include SQUARE and SEQUEL. The second most widely used relational query language appears to be QUEL, the query language used in both educational and commercial INGRES. Both SQL and QUEL are examples of the so-called "calculus" query languages. The other major type of query language is the "algebraic" query language (excluding the forms and menu based query languages, which are not syntax driven). Examples of algebraic languages are ISBL (PRTV, Todd 1976), TABLET (U. Mass.), ASTRID (Gray 1979), and ML (Li 1984). These algebraic languages have all been implemented and used, but nowhere near as widely as SQL and QUEL. It is interesting to note that ASTRID and ML were developed by researchers active in the area of logic languages. In particular, the ML (Mathematics-Like) query language was implemented in Prolog and some of the character of Prolog shows through in the syntax of the language. There is a close connection between the relational algebra and the predicate calculus (upon which the logic languages are based) which is currently being actively explored. One of the most promising areas of application for the logic languages (upon which the so-called "fifth generation" technology is based) is in database applications, and in query languages in particular. There appears to be no compelling reason for the current dominance of the calculus type query language, other than the fact that it is what IBM decided to use in System-R. Anything that can be done in a calculus language can also be done in an algebraic language and vice versa. The primary difference between the two types of language is that the calculus languages want the user to express a complex query as a single large statement, whereas the algebraic languages encourage the user to execute a complex query as a series of simpler queries, storing the intermediate results as snapshots or views (either language can be used either way, but the orientation of the two languages is as stated).
For simple queries there is little difference between the two types of language, although the calculus languages are perhaps more readable (more English-like) while the algebraic languages are more concise and have a more mathematical character. The orientation of the calculus languages towards doing everything in a single statement provides more scope for optimization than if the equivalent query is executed as a series of simpler queries; this is often cited as one of the major advantages of the calculus languages. The procedural nature of the algebraic languages does not permit the type of global optimizations employed in the calculus languages, but this approach is perhaps more user-friendly since the individual steps are easy to understand, and one gets to examine the intermediate results to figure out what to do next. Since a complex query is executed incrementally, intermediate results can be recomputed without starting the entire query over from scratch. It is possible that, taking user error and lack of forethought into account, the less efficient algebraic languages might end up using less computer time than the super-efficient calculus languages for comparable queries. A further advantage of the algebraic language in a scientific environment is that there is more of a distinction between executing a query and printing the results of the query than in a calculus language. The intermediate results of a complex query in an algebraic language are named relations (snapshots or views); an extra print command must be entered to examine the intermediate result. This is an advantage if the query language provides a variety of ways to examine the result of a query, e.g., as a printed table or as some type of plot. .nh 4 Sample Queries At this point several examples of actual queries, however simple they may be, should help us to visualize what a query language is like. Several examples of typical scientific queries were given (in English) in section 3.2.1. For the convenience of the reader these are duplicated here, followed by actual examples in the query languages SQL, QUEL, ASTRID, and ML. It should be noted that these are all examples of very simple queries and they do little to demonstrate the power of a fully relational query language. .ls .ls [1] Find all objects of type "pqr" for which X is in the range A to B and Z is less than 10. .le .ls [2] Compute the mean and standard deviation of attribute X for all objects in the set [1]. .le .ls [3] Compute and plot (X-Y) for all objects in set [1]. .le .ls [4] Plot a circle of size (log2(Z-3.2) * 100) at the position (X,Y) of all objects in set [1]. .le .ls [5] Print the values of the attributes OBJ, X, Y, and Z of all objects of type "pqr" for which X is in the range A to B and Z is greater than 30. .le .le It should not be difficult for the imaginative reader to make up similar queries for a particular astronomical catalog or data archive. For example (I can't resist), "find all objects for which B-V exceeds X", "find all recorded observations of object X", "find all observing runs on telescope X in which astronomer Y participated during the years 1975 to 1985", "compute the number of cloudy nights in August during the years 1985 to 1990", and so on. The possibilities are endless. Query [5] is an example of a simple select/project query. This query is shown below in the different query languages. Note that whitespace may be redistributed in each query as desired; in particular, the entire query may be entered on a single line.
Keywords are shown in upper case and data names or values in lower case. The object "table" is the table from which records are to be selected, "pqr" is the desired value of the field "type" of table "table", and "x", "y", and "z" are numeric fields of the table.
.ks
.nf
SQL:

    SELECT obj, x, y, z
    FROM table
    WHERE type = 'pqr' AND x >= 10 AND x <= 20 AND z > 30;
.fi
.ke
.ks
.nf
QUEL:

    RANGE OF t IS table
    RETRIEVE (t.obj, t.x, t.y, t.z)
    WHERE t.type = 'pqr' AND t.x >= 10 AND t.x <= 20 AND t.z > 30
.fi
.ke
.ks
.nf
ASTRID (mnemonic form):

    table SELECTED_ON [ type = 'pqr' AND x >= 10 AND x <= 20
        AND z > 30 ] PROJECTED_TO obj, x, y, z
.fi
.ke
.ks
.nf
ASTRID (mathematical form):

    table ;[ type = 'pqr' AND x >= 10 AND x <= 20 AND z > 30 ]
        % obj, x, y, z
.fi
.ke
.ks
.nf
ASTRID (alternate query showing use of intermediates):

    a := table ;[ type = 'pqr' AND z > 30 ]
    b := a ;[ x >= 10 AND x <= 20 ]
    b % obj,x,y,z
.fi
.ke
.ks
.nf
ML (Li/Prolog):

    table : type=pqr, x >= 10, x <= 20, z > 30 [obj,x,y,z]
.fi
.ke
Note that in ASTRID and ML selection and projection are implemented as operators or qualifiers modifying the relation on the left. To print all fields of all records of a table one need only enter the name of the table. The logic language nature of such queries is evident if one thinks of the query as a predicate or true/false assertion. Given such an assertion (query), the query processor tries to prove the assertion true by finding all tuples satisfying the predicate, using the set of rules given (the database). For simple queries such as these it makes little difference what query language is used; many users would probably prefer the SQL or QUEL syntax for these simple queries because of the English like syntax. To seriously evaluate the differences between the different languages more complex queries must be tried, but such an exercise is beyond the scope of the present document. As a final example we present, without supporting explanation, an example of a more complex query in SQL (from Date, 1986). This example is based upon a "suppliers-parts-projects" database, consisting of four tables: suppliers (S), parts (P), projects (J), and number of parts supplied to a specified project by a specified supplier (SPJ), with fields 'supplier number' (S#), 'part number' (P#), and 'project number' (J#). The names SPJX and SPJY are aliases for SPJ. This example is rather contrived and the data is not interesting, but it should serve to illustrate the use of SQL for complex queries.
.ks
.nf
Query:  Get part numbers for parts supplied to all projects
        in London.

    SELECT DISTINCT p#
    FROM spj spjx
    WHERE NOT EXISTS
        ( SELECT *
          FROM j
          WHERE city = 'london' AND NOT EXISTS
              ( SELECT *
                FROM spj spjy
                WHERE spjy.p# = spjx.p# AND
                      spjy.j# = j.j# ));
.fi
.ke
The nesting shown in this example is characteristic of the calculus languages when used to evaluate complex queries. Each SELECT implicitly returns an intermediate relation used as input to the next higher level subquery. .nh 3 DB Kernel Operators All DBMS operators described up to this point have been general purpose operators with no knowledge of the form in which data is stored internally. Additional operators are required in support of the standard IRAF DB kernels. These will be implemented as CL callable tasks in a subpackage off DBMS. .nh 4 Dataset Copy and Load Since our intention is to store the database in a machine independent format, special operators are not required to back up, reload, or copy dataset files.
The binary file copy facilities provided by IRAF or the host system may be used to back up, reload, or copy dataset files. .nh 4 Rebuild Dataset Over a period of time a dataset which is subjected to heavy updating may become disordered internally, reducing the efficiency of most record access operations. A utility task is required to efficiently rebuild such datasets. The same result can probably be achieved by an \fIrcopy\fR operation but a lower level operator may be more efficient. .nh 4 Mount Foreign Dataset Before a foreign dataset (archive or local format imagefile) can be accessed it must be \fImounted\fR, i.e., the DBSS must be informed of the existence and type of the dataset. The details of the mount operation are kernel dependent; ideally the mount operation will consist of little more than examining the structure of the foreign dataset and making appropriate entries in the system catalog. .nh 4 Crash Recovery A utility is required for recovering datasets which have been corrupted as a result of a hardware or software failure. There should be sufficient redundancy in the internal data structures of a dataset to permit automated recovery. The recover operation is similar to a rebuild so perhaps the same task can be used for both operations. .nh 2 The IMIO Interface .nh 3 Overview The Image I/O (IMIO) interface is an existing subroutine interface used to maintain and access bulk data arrays (images). The IMIO interface is built upon the DBIO interface, using DBIO to maintain and access the image headers and sometimes to access the stored data (the pixels) as well. For reasons of efficiency IMIO directly accesses the bulk data array when large images are involved. Most of the material presented in this section on the image header is new. The pixel access facilities provided by the existing IMIO interface will remain essentially unchanged, but the image header facilities provided by the current interface are quite limited and badly need to be extended. The existing header facilities provide support for the major physical image attributes (dimensionality, length of each axis, pixel datatype, etc.) plus a limited facility for storing user defined attributes. The main changes in the new interface will be excellent support for history records, world coordinates, histograms, a bad pixel list, and image masks. In addition the new interface will provide improved support for user defined attributes, and greatly improved efficiency when accessing large groups of images. The storage structures will be more localized, hopefully causing less confusion for the user. In this section we first discuss the components of an image, concentrating primarily on the different parts of the image header, which is quite a complex structure. We then discuss briefly the (mostly existing) facilities for header and pixel access. Lastly we discuss the storage structures normally used to maintain images in mass storage. .nh 3 Logical Schema Images are stored as records in one or more tables in a database. More precisely, the main part of an image header is a record (row) in some table in a database. In general some of the other tables in a database will contain auxiliary information describing the image. Some of these auxiliary tables are maintained by IMIO and will be discussed in this section. Other tables will be created by the applications programs used to reduce the image data.
As far as the DBSS is concerned, the pixel segment of an image is a pretty minor item, a single array type attribute in the image header. Since the size of this array can vary enormously from one image to the next some strategic questions arise concerning where to store the data. In general, small pixel segments will be stored directly in the image header, while large pixel segments will be stored in a separate file from that used to store the header records. The major components of an image (as far as IMIO is concerned) are summarized below. More detailed information on each component is given in the following sections. .ls .ls Standard Header Fields An image header is a record in a relation initially of type "image". The standard header fields include all attributes necessary to describe the physical characteristics of the image, i.e., all attributes necessary to access the pixels. .le .ls History History records for all images in a database are stored in a separate history relation in time sequence. .le .ls World Coordinates An image may have any number of world coordinate systems associated with it. These are stored in a separate world coordinate system relation. .le .ls Histogram An image may have any number of histograms associated with it. Histograms for all images in a database are stored in a separate histogram relation in time sequence. .le .ls Pixel Segment The pixel segment is stored in the image header, at least from the point of view of the logical schema. .le .ls Bad Pixel List The bad pixel list, a variable length integer array, is required to physically describe the image hence is stored in the image header. .le .ls Region Mask An image may have any number of region masks associated with it. Region masks for all images in a database are stored in a separate mask relation. A given region mask may be associated with any number of different images. .le .le In summary, the \fBimage header\fR contains the standard header fields, the pixels, the bad pixel list, and any user defined fields the user wishes to store directly in the header. All other information describing an image is stored in external non-image relations, of which there may be any number. Note that the auxiliary tables (world coordinates, histograms, etc.) are not considered part of the image header. .nh 4 Standard Header Fields The standard header fields are those fields required to describe the physical attributes of the image, plus those fields required to physically access the image pixels. The standard header fields are summarized below. These fields necessarily reflect the current capabilities of IMIO. Since the DBSS provides data independence, however, new fields may be added in the future to support future versions of IMIO without rendering old images unreadable. .ls .ls 12 image An integer value automatically assigned by IMIO when the image is created which uniquely identifies the image within the containing table. This field is used as the primary key in \fIimage\fR type relations. .le .ls naxis Number of axes, i.e., the dimensionality of the image. .le .ls naxis[1-4] A group of 4 attributes, i.e., \fInaxis1\fR through \fInaxis4\fR, each specifying the length of the associated image axis in pixels. Axis 1 is an image line, 2 is a column, 3 is a band, and so on. If \fInaxis\fR is greater than four additional axis length attributes are required. If \fInaxis\fR is less than four the extra fields are set to one. 
Distinct attributes are used rather than an array so that the image dimensions will appear in printed output, to simplify the use of the dimension attributes in queries, and to make the image header more FITS-like. .le .ls linelen The physical length of axis one (a line of the image) in pixels. Image lines are often aligned on disk block boundaries (stored in an integral number of disk blocks) for greater i/o efficiency. If \fIlinelen\fR is the same as \fInaxis1\fR the image is said to be stored in compressed format. .le .ls pixtype A string valued attribute identifying the datatype of the pixels as stored on disk. The possible values of this attribute are discussed in detail below. .le .ls bitpix The number of bits per pixel. .le .ls pixels The pixel segment. .le .ls nbadpix The number of bad pixels in the image. .le .ls badpix The bad pixel list. This is effectively a boolean image stored in compressed form as a variable length integer array. The bad pixel list is maintained by the pixel list package, a subpackage of IMIO, also used to maintain region masks. .le .ls datamin The minimum pixel value. This field is automatically invalidated (set to a value greater than \fIdatamax\fR) whenever the image is modified, unless explicitly updated by the caller. .le .ls datamax The maximum pixel value. This field is automatically invalidated (set to a value less than \fIdatamin\fR) whenever the image is modified, unless explicitly updated by the caller. .le .ls title The image title, a one line character string identifying the image, for annotating plots and other forms of output. .le .le The possible values of the \fIpixtype\fR field are shown below. The format of the value string is "type.host", where \fItype\fR is the logical datatype and \fIhost\fR is the host machine encoding used to represent that datatype.
.ks
.nf
TYPE            DESCRIPTION                     MAPS TO

byte.m          unsigned byte  ( 8 bits)        short.spp
ushort.m        unsigned word  (16 bits)        long.spp
short.m         short integer, signed           short.spp
long.m          long integer, signed            long.spp
real.m          single precision floating       real.spp
double.m        double precision floating       double.spp
complex.m       (real,real)                     complex.spp
.fi
.ke
Note that the first character of each keyword is sufficient to uniquely identify the datatype. The ".m" suffix identifies the "machine" to which the datatype refers. When new images are written \fIm\fR will usually be the name of the host machine. When images written on a different machine are read on the local host there is no guarantee that the i/o system will recognize the formats for the named machine, but at least the format will be uniquely defined. Some possible values for \fIm\fR are shown below.
.ks
.nf
dbk     DBK (database kernel) mip-format
mip     machine independent (MII integer, IEEE floating)
sun     SUN formats (same as mip?)
vax     DEC Vax data formats
mvs     DG MV-series data formats
.fi
.ke
The DBK format is used when the pixels are stored directly in the image header, since only the DBK binary formats are supported in DBK binary datafiles. The standard i/o system will support at least the MIP, DBK, SUN (=MIP), and VAX formats. If the storage format is not the host system format, conversion to and from the corresponding SPP (host) format will occur at the level of the FIO interface to avoid an N-squared type conversion matrix in IMIO, i.e., IMIO will see only the SPP datatypes.
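As an illustration of the "type.host" convention, the following minimal sketch in Python (illustrative only; the function and table names are invented for the example) decomposes a \fIpixtype\fR value string into its logical datatype and host format components, exploiting the fact that the first character uniquely identifies the datatype.
.ks
.nf
# Logical datatypes, keyed by their unique first character.
TYPES = {"b": "byte", "u": "ushort", "s": "short", "l": "long",
         "r": "real", "d": "double", "c": "complex"}

def parse_pixtype(value):
    # Split a value string such as "short.vax" into the logical
    # datatype and the host machine format.
    type_, _, machine = value.partition(".")
    return TYPES[type_[0]], machine

print(parse_pixtype("short.vax"))       # ('short', 'vax')
print(parse_pixtype("r.mip"))           # ('real', 'mip')
.fi
.ke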
Examples of possible \fIpixtype\fR values are "short.vax", i.e., a 16 bit signed twos-complement byte-swapped integer format, and "real.mip", the 32 bit IEEE single precision floating point format. .nh 4 History Text The intent of the \fIhistory\fR relation is to record all events which modify the image data in a dataset, i.e., all operations which create, delete, or modify images. The attributes of the history relation are shown below. Records are added to the history table in time sequence. Each record logically contains one line of history text. .ls 4 .ls 12 time The date and time of the event. The value of this field is automatically set by IMIO when the history record is inserted. .le .ls parent The name of the parent image in the case of an image creation event, or the name of the affected image in the case of an image modification event affecting a single image. .le .ls child The name of the child or newly created image in the case of an image creation event. This field is not used if only a single image is involved in an event. .le .ls event The history text, i.e., a one line description of the event. The suggested format is a task or procedure call naming the task or procedure which modified the image and listing its arguments. .le .le
.ks
.nf
Example:

TIME            PARENT          CHILD           EVENT

Sep 23 20:24    nite1[12]       --              imshift (1.5, -3.4)
Sep 23 20:30    nite1[10]       nite1[15]
Sep 23 20:30    nite1[11]       nite1[15]
Sep 23 20:30    nite1[15]       --              nite1[10] - nite1[11]
.fi
.ke
The principal reason for collecting all history text together in a single relation rather than storing it scattered about in string attributes in the image headers is to permit use of the DBMS facilities to pose queries on the history of the dataset. Secondary reasons are the completeness of the history record thus provided for the dataset as a whole, and increased efficiency, both in the amount of storage required and in the time required to record an event (in particular, the time required to create a new image). Note also that the history relation may be used to record events affecting dataset objects other than images. The history of any particular image is easily recovered by printing the values of the \fIevent\fR field of all records which name the image in the \fIparent\fR or \fIchild\fR field. The parents or children of any image are easily traced using the information in the history relation. The history of the dataset as a whole is given by printing all history records in time sequence. History information is not lost when intermediate images are deleted unless deletes are explicitly performed upon the history relation. .nh 4 World Coordinates In general, an image may simultaneously have any number of world coordinate systems (WCS) associated with it. It would be quite awkward to try to store an arbitrary number of WCS descriptors in the image header, so a separate WCS relation is used instead. If world coordinates are not used no overhead is incurred. Maintenance of the WCS descriptor, transformation of the WCS itself (e.g., when an image changes spatially), and coordinate transformations using the WCS are all managed by a dedicated package, also called WCS. The WCS package is a general purpose package usable not only in IMIO but also in GIO and other places. IMIO will be responsible for copying the WCS records for an image when a new image is created, as well as for correcting the WCS for the effects of subsampling, coordinate flip, etc. when a section of an image is mapped.
A general solution to the WCS problem requires that the WCS package support both linear and nonlinear coordinate systems. The problem is further complicated by the variable number of dimensions in an image. In general the number of possible types of nonlinear coordinate systems is unlimited. Our solution to this difficult problem is as follows. .ls 4 .ls o Each image axis is associated with a one or two dimensional mapping function. .le .ls o Each mapping function consists of a general linear transformation followed by a general nonlinear transformation. Either transformation may be the identity (i.e., may be omitted) if desired. .le .ls o The linear transformation for an axis consists of some combination of a shift, scale change, rotation, and axis flip. .le .ls o The nonlinear transformation for an axis consists of a numerical approximation to the underlying nonlinear analytic function. A one dimensional function is approximated by a curve x=f(a) and a two dimensional function is approximated by a surface x=f(a,b), where X, A, and B may be any of the image axes. A choice of approximating functions is provided, e.g., Chebyshev or Legendre polynomial, piecewise cubic spline, or piecewise linear. .le .ls o The polynomial functions will often provide the simplest solution for well behaved coordinate transformations. The piecewise functions (spline and linear) may be used to model any slowly varying analytic function represented in cartesian coordinates. The piecewise functions \fIinterpolate\fR the original analytic function on a regular grid, approximating the function between grid points with a first or third order polynomial. The approximation may be made arbitrarily good by sampling on a finer grid, trading table space for increased precision. .le .ls o For many nonlinear functions, especially those defined in terms of the transcendental functions, the fitted curve or surface will be quicker to evaluate than the original function, i.e., the approximation will be more efficient (evaluation of a bicubic spline is not cheap, however, requiring computation of a linear combination of sixteen coefficients for each output point). .le .ls o The nonlinear transformation will define the mapping from pixel coordinates to world coordinates. The inverse transformation will be computed by numerical inversion (iterative search). This technique may be too inefficient for some applications. .le .le For example, the WCS for a three dimensional image might consist of a bivariate Nth order Chebyshev polynomial mapping X and Y to RA and DEC via a gnomonic projection, plus a univariate piecewise linear function mapping each discrete image band (Z) to a wavelength value. If the image were subsequently shifted, rotated, magnified, block averaged, etc., or sampled via an image section, a linear term would be added to the WCS record of each axis affected by the transformation. A WCS is represented by a \fIset\fR of records in the WCS relation. One record is required for each axis mapped by the transformation. The attributes of the WCS relation are described below. The records forming a given WCS all share the same value of the \fIwcs\fR field. .ls .ls 12 wcs The world coordinate system number, a unique integer code assigned by the WCS package when the WCS is added to the database. .le .ls image The name of the image with which the WCS is associated. If a WCS is to be associated with more than one image retrieval must be via the \fIwcs\fR number rather than the \fIimage\fR name field.
.le .ls type A keyword supplied by the application identifying the type of coordinate system defined by the WCS. This attribute is used in combination with the \fIimage\fR attribute for keyword based retrieval in cases where an image may have multiple world coordinate systems. .le .ls axis The image axis mapped by the transformation stored in this record. The X axis is number 1, Y is number 2, and so on. .le .ls axin1 The first input axis (independent variable in the transformation). .le .ls axin2 The second input axis, set to zero in the case of a univariate transformation. .le .ls axout The number of the input axis (1 or 2) to be used for world coordinate output, in the case where there is only the linear term but there are two input axes (in which case the linear term produces a pair of world coordinate values). .le .ls linflg A flag indicating whether the linear term is present in the transformation. .le .ls nlnflg A flag indicating whether the nonlinear term is present in the transformation. .le .ls p1,p2 Linear transformation: origin in pixel space for input axes 1, 2. .le .ls w1,w2 Linear transformation: origin in world space for input axes 1, 2. .le .ls s1,s2 Linear transformation: scale factor DW/DP for input axes 1, 2. .le .ls rot Linear transformation: rotation angle in degrees counterclockwise from the X axis. .le .ls cvdat The curve or surface descriptor for the nonlinear term. The internal format of this descriptor is controlled by the relevant math package. This is a variable length array of type real. .le .ls label Axis label for plots. .le .ls format Tick label format for plots, e.g., "0.2h" specifies HMS format in a variable field width with two decimal places in the seconds field. .le .le As noted earlier, the full transformation for an axis involves a linear transformation followed by a nonlinear transformation. The linear term is defined in terms of the WCS attributes \fIp1, p2\fR, etc. as shown below. The variables X and Y are the input values of the axes \fIaxin1\fR and \fIaxin2\fR, which need not correspond to the X and Y axes of the image.
.ks
.nf
x' = (x - p1) * s1
y' = (y - p2) * s2

x" = x' * cos(rot) + y' * sin(rot)
y" = y' * cos(rot) - x' * sin(rot)

u  = x" + w1
v  = y" + w2
.fi
.ke
The output variables U and V are then used as input to the nonlinear mapping, producing the world coordinate value W for the specified image axis \fIaxis\fR as output.
.ks
.nf
w = eval (cvdat, u, v)
.fi
.ke
The mappings for the special cases [1] no linear transformation, [2] no nonlinear transformation, and [3] univariate rather than bivariate transformation, are easily derived from the full transformation shown above. Note that if there is no nonlinear term the linear term produces world coordinates as output, otherwise the intermediate values (U,V) are in pixel coordinates. Note also that if there is no nonlinear term but there are two input axes (as in the case of a rotation), attribute \fIaxout\fR must be set to indicate whether U or V is to be returned as the output world coordinate. .nh 4 Image Histogram Histogram records are stored in a separate histogram relation outside the image header. An image may have any number of histograms associated with it, each defined for a different section of the image. A given image section may have multiple associated histogram records differing in time, number of sampling bins, etc., although normally recomputation of the histogram for a given section will result in a record update rather than an insertion.
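As an illustration of the histogram record contents described below, the following is a minimal sketch in Python (illustrative only; the computation is actually performed by an IMIO subpackage) of how the bins of a histogram might be accumulated from the pixel values of an image section, given the limits \fIz1\fR and \fIz2\fR and the number of bins.
.ks
.nf
def compute_histogram(pixels, z1, z2, nbins):
    # Accumulate a histogram of the pixel values in the range
    # z1 to z2 inclusive (assumes z2 > z1), using nbins equal
    # size bins, returning the number of pixels used as well
    # as the bins themselves.
    bins = [0] * nbins
    scale = (nbins - 1) / (z2 - z1)
    npix = 0
    for z in pixels:
        if z1 <= z <= z2:
            bins[int((z - z1) * scale)] += 1
            npix += 1
    return npix, bins
.fi
.ke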
A subpackage within IMIO is responsible for the computation of histogram records. Histogram records are not propagated when an image is copied. Modifications to an image made subsequent to computation of a histogram record may invalidate or obsolete the histogram. .ls 4 .ls 12 image The name of the image or image section to which the histogram record applies. .le .ls time The date and time when the histogram was computed. .le .ls z1 The pixel value associated with the first bin of the histogram. .le .ls z2 The pixel value associated with the last bin of the histogram. .le .ls npix The total number of pixels used to compute the histogram. .le .ls nbins The number of bins in the histogram. .le .ls bins The histogram itself, i.e., an array giving the number of pixels in each intensity range. .le .le The histogram limits Z1 and Z2 will normally correspond to the minimum and maximum pixel values in the image section to which the histogram applies. .nh 4 Bad Pixel List The bad pixel list records the positions of all bad pixels in an image. A "bad" pixel is a pixel which has an invalid value and which therefore should not be used for image analysis. As far as IMIO is concerned a pixel is either good or bad; if an application wishes to assign a fractional weight to individual pixels then a second weight image must be associated with the data image by the applications program. Images tend to have few or no bad pixels. When bad pixels are present they are often grouped into bad regions. This makes it possible to use data compression techniques to efficiently represent the set of bad pixels, which is conceptually a simple boolean mask image. The bad pixel list is represented in the image header as a variable length integer array (the runtime structure is slightly more complex). This integer array consists of a set of lists. Each list in the set lists the bad pixels in a particular image line. Each linelist consists of a record length field and a line number field, followed by the bad pixel list for that line. The bad pixel list is a series of either column numbers or ranges of column numbers. Single columns are represented in the list as positive integers; ranges are indicated by a negative second value.
.ks
.nf
15 2 512 512
6 23 4 8 15 -18 44
4 72 23 -29 35
.fi
.ke
An example of a bad pixel list describing a total of 15 bad pixels is shown above. The first line is the pixel list header which records the total list length (15 ints), the number of dimensions (2), and the sizes of each dimension (512, 512). There follows a set of variable length linelist records. Two such lists are shown in the example, one for line 23 and one for line 72. On line 23 columns 4, 8, 15 through 18, and 44 are all bad. Note that each linelist contains only a line number since the list is two dimensional; in general an N dimensional image requires N-1 subscripts after the record length field, starting with the line number and proceeding to higher dimensions to the right. Even though IMIO provides a bad pixel list capability, many applications will not want to bother to check for bad pixels. In general, pointwise image operators which produce a new image as output will not need to check for bad pixels. Non-pointwise image operators, e.g., filtering operators, may or may not wish to check for bad pixels (in principle they should use kernel collapse to ignore bad pixels). Analysis programs, i.e., programs which produce database records as output rather than create new images, will usually check for and ignore bad pixels.
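The encoding shown above is easily decoded. The following minimal sketch in Python (illustrative only; the actual operations are provided by the pixel list package) expands an encoded two dimensional bad pixel list into explicit (line, column) pairs, assuming, as in the example above, that each length field counts the integers which follow it.
.ks
.nf
def expand_badpix(pl):
    # pl[0] is the total list length, pl[1] the number of
    # dimensions, followed by the axis lengths and then a
    # series of variable length linelist records.
    ndim = pl[1]
    i = 2 + ndim                        # skip the list header
    badpix = []
    while i < len(pl):
        reclen, line = pl[i], pl[i+1]
        cols = pl[i+2 : i+1+reclen]     # columns for this line
        j = 0
        while j < len(cols):
            if j+1 < len(cols) and cols[j+1] < 0:   # a range
                for c in range(cols[j], -cols[j+1] + 1):
                    badpix.append((line, c))
                j += 2
            else:                                   # one column
                badpix.append((line, cols[j]))
                j += 1
        i += 1 + reclen                 # next linelist record
    return badpix

pl = [15, 2, 512, 512, 6, 23, 4, 8, 15, -18, 44, 4, 72, 23, -29, 35]
assert len(expand_badpix(pl)) == 15
.fi
.ke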
To avoid machine traps when running the pointwise image operators, all bad pixels must have reasonable values, even if these values have to be set artificially when the data is archived. IMAGES SHOULD NOT BE ARCHIVED WITH MAGIC IN-PLACE VALUES FOR THE BAD PIXELS (as in FITS) since this forces the system to conditionally test the value of every pixel when the image is read, an unnecessary operation which is quite expensive for large images. The simplicity of the reserved value scheme does not warrant such an expense. Note that the reverse operation, i.e., flagging the bad pixels by setting them to a magic value, can be carried out very efficiently by the reader program given a bad pixel list. For maximum efficiency those operators which have to deal with bad pixels may provide two separate data paths internally, one for data which contains no bad pixels and one for data containing some bad pixels. The path to be taken would be chosen dynamically as each image line is input, using the bad pixel list to determine which lines contain bad pixels. Alternatively a program may elect to have the bad pixels flagged upon input by assignment of a magic value. The two-path approach is the most desirable one for simple operators. The magic value approach is often simplest for the more complex applications where duplicating the code to provide two data paths would be costly and the operation is already so expensive that the conditional test is not important. All operations and queries on bad pixel lists are via a general pixel list package which is used by IMIO for the bad pixel list but which may be used for any other type of pixel list as well. The pixel list package provides operators for creating new lists, adding and deleting pixels and ranges of pixels from a list, merging lists, and so on. .nh 4 Region Mask A region mask is a pixel list which defines some subset of the pixels in an image. Region masks are used to define the region or regions of an image to be operated upon. Region masks are stored in a separate mask relation. A mask is a type of pixel list and the standard pixel list package is used to maintain and access the mask. Any number of different region masks may be associated with an image, and a given region mask may be used in operations upon any number of different images. .ls 4 .ls 12 mask The mask number, a unique integer code assigned by the pixel list package when the mask is added to the database. .le .ls image The image or image section associated with the mask, if any. .le .ls type The logical type of the mask, a keyword supplied by the applications program when the mask is created. .le .ls naxis The number of axes in the mask image. .le .ls naxis[1-4] The length of each image axis in pixels. If \fInaxis\fR is greater than 4 additional axis length attributes must be provided. .le .ls npix The total number of pixels in the subset defined by the mask. .le .ls pixels The mask itself, a variable length integer array. .le .le Examples of the use of region masks include specifying the regions to be used in a surface fit to a two dimensional image, or specifying the regions to be used to correlate two or more images for image registration. A variety of utility tasks will be provided in the \fIimages\fR package for creating mask images, interactively and otherwise. For example, it will be possible to display an image and use the image cursor to mark the regions interactively. 
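To suggest how an applications program might use a region mask, the following minimal sketch in Python (illustrative only; all names are invented for the example) iterates over the pixels selected by a mask which has already been expanded into per-line column ranges, e.g., to accumulate the data points for a surface fit.
.ks
.nf
def masked_pixels(image, mask_ranges):
    # Yield (line, column, value) for each pixel selected by
    # the mask.  'image' is indexed as image[line][column] and
    # 'mask_ranges' maps a line number to a list of inclusive
    # (first, last) column ranges.
    for line, ranges in mask_ranges.items():
        for first, last in ranges:
            for col in range(first, last + 1):
                yield line, col, image[line][col]

image = [[0, 1, 2, 3], [4, 5, 6, 7]]
mask_ranges = {1: [(1, 2)]}
print(list(masked_pixels(image, mask_ranges)))
# [(1, 1, 5), (1, 2, 6)]
.fi
.ke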
.nh 3 Group Data The group data format associates a set of keyword = value type \fBgroup header\fR parameters with a group of images. All of the images in a group should have the same size, number of dimensions, and datatype; this is required for images to be in group format even though it is not physically required by the database system. All of the images in a group share the parameters in the group header. In addition, each image in a group has its own private set of parameters (attributes), stored in the image header for that image. The images forming a group are stored in the database as a named base table of type \fIimage\fR. The name of the base table must be the same as the name of the group. Each group is stored in a separate table. The group headers for all groups in the database are stored in a separate \fIgroups\fR table. The attributes of the \fIgroups\fR relation are described below. .ls 4 .ls 12 group The name of the group (\fIimage\fR table) to which this record belongs. .le .ls keyword The name of the group parameter represented by the current record. The keyword name should be FITS compatible, i.e., the name must not exceed eight characters in length. .le .ls value The value of the group parameter represented by the current record, encoded FITS style as a character string not to exceed 20 characters in length. .le .ls comment An optional comment string, not to exceed 49 characters in length. .le .le Group format is provided primarily for the STScI/SDAS applications, which require data to be in group format. The format is however useful for any application which must associate an arbitrary set of \fIglobal\fR parameters with a group of images. Note that the member images in a group may be accessed independently like any other IRAF image since each image has a standard image header. The primary physical attributes will be identical in all images in the group, but these attributes must still be present in each image header. For the SDAS group format the \fInaxis\fR, \fInaxisN\fR, and \fIbitpix\fR parameters are duplicated in the group header. .nh 3 Image I/O In this section we describe the facilities available for accessing image headers and image data. The discussion will be limited to those aspects of IMIO relevant to a discussion of the DBSS. The image i/o (IMIO) interface and the image database interface (IDBI) are existing interfaces which are more properly described in detail elsewhere. .nh 4 Image Templates Most IRAF image operators are set up to operate on a group of images rather than a single image. Membership in such a group is determined at runtime by a so-called \fIimage template\fR which may select any subset of the images in the database, i.e., any subset of images from any subset of \fIimage\fR type base tables. This type of group should not be confused with the \fIgroup format\fR discussed in the last section. The image template is normally entered by the user on the command line and is dynamically converted into a list of images by expansion of the template against the current contents of the database. Given an image template the IRAF applications program calls an IMIO routine to "open" the template. Successive calls to a get image name routine are made to operate upon the individual images in the group. When all images have been processed the template is closed.
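Schematically, the calling sequence just described looks as follows (a minimal sketch in Python; the class and method names are invented to suggest the pattern and are not the actual IMIO calling sequence).
.ks
.nf
class ImageTemplate:
    # Schematic stand-in for the IMIO template routines: the
    # template is expanded against the database when opened,
    # and the image names are then returned one at a time.
    def __init__(self, names):
        self._names = iter(names)   # the expanded template
    def getim(self):
        return next(self._names, None)
    def close(self):
        pass

imt = ImageTemplate(["nite1[10]", "nite1[11]", "nite1[12]"])
while (image := imt.getim()) is not None:
    print("processing", image)      # operate upon each image
imt.close()
.fi
.ke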
The images in a group defined by an image template must exist by definition when the template is expanded, hence the named images must either be input images, or the operation must update or delete them. If an output image is to be produced for each input image the user must supply the name of the table into which the new images are to be inserted. This is exactly the same type of operation performed by the DBMS operators, and in fact most image operators are relational operators, i.e., they take a relation as input and produce a new relation as output. Note that the user is required to supply only the name of the output table, not the names of the individual images. The output table may be one of the input tables if desired. An image template is syntactically equivalent to a DBIO record selection expression with one exception: each image name may optionally be modified by appending an \fIimage section\fR to specify the subset of the pixels in the image to be operated upon. An example of an image section string is "[*,100]"; this references column 100 of the associated image. The image section syntax is discussed in detail in the \fICL User's Guide\fR. Since the image template syntax is nearly identical to the general DBIO record selection syntax the reader is referred to the discussion of the latter syntax presented in section 4.5.6 for further details. The new DBIO syntax is largely upwards compatible with the image template syntax currently in use. .nh 4 Image Pixel Access IMIO provides quite sophisticated pixel access facilities, a detailed discussion of which is beyond the scope of the present document. Complete data independence is provided, i.e., the applications program in general need not know the actual dimensionality, size, datatype, or storage mode of the image, what format the image is stored in, or even what device or machine the image resides on. This is not to say that the application is forbidden from knowing these things, since more efficient i/o is possible if there is a match between the logical and physical views of the data. IMIO pixel access is implemented via the FIO interface. The DBSS is charged with management of the pixel storage file (if any) and with setting up the FIO interface so that IMIO can access the pixels. Both buffered and virtual memory mapped access are supported; which is actually used is transparent to the user. The types of i/o operations provided are "get", "put", and "update". The objects upon which i/o may be performed are image lines, image columns, N-dimensional subrasters, and pixel lists. New in the DBIO based version of IMIO are update mode and column and pixel list i/o, plus direct access via virtual memory mapping using the static file driver. .nh 4 Image Database Interface (IDBI) The image database interface is a simple keyword based interface to the (non array valued) fields of the standard image header. The IDBI isolates the image oriented applications program from the method used to store the header, i.e., programs which access the header via the IDBI don't care whether the header is implemented upon DBIO or some other record i/o interface. In particular, the IDBI is an existing interface which is \fInot\fR currently implemented upon DBIO, but which will be converted to use DBIO when it becomes available. Programs which currently use the IDBI should require few if any changes when DBIO is installed.
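The flavor of such a keyword based interface can be suggested with a minimal sketch in Python (illustrative only; the names are invented and do not depict the actual IDBI calling sequence). The essential point is that the caller names header fields by keyword and never sees how the header record is stored.
.ks
.nf
class ImageHeader:
    # Keyword based access to the non array valued fields of an
    # image header; the underlying storage method (DBIO or some
    # other record i/o interface) is hidden behind get/put calls.
    def __init__(self, record):
        self._record = record       # however the header is stored
    def geti(self, keyword):
        return int(self._record[keyword])
    def puti(self, keyword, value):
        self._record[keyword] = int(value)

hdr = ImageHeader({"naxis": 2, "naxis1": 512, "naxis2": 512})
print(hdr.geti("naxis1"))           # 512
.fi
.ke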
The philosophy of isolating the applications program using IMIO from the underlying interfaces is followed in all the subpackages forming the IMIO interface. Additional IMIO subpackages are provided for appending history records, creating and reading histograms, and so on. .nh 3 Summary of IMIO Data Structures As we have seen, an image is represented as a record in a table in some database. The image record consists of a set of standard fields, a set of user defined fields, and the pixel segment, or at least sufficient information to locate and access the pixel segment if it is stored externally. An image database may contain a number of other tables; these are summarized below.
.ks
.nf
Image storage           (a set of tables named by the user)

groups                  Header records for group format data
histograms              Histograms of images or image sections
history                 Image history records
masks                   Region masks
wcs                     World coordinate systems
.fi
.ke
Any number of additional application specific tables may be present in an actual database. The names of the application and user defined tables must not conflict with the reserved table names shown above (or with the names of the DBIO system tables discussed in the next section). The pixel segment of an image and possibly the image header may be stored in a non-DBSS format accessed via the HDBI. All the other tables are stored in the standard DBSS format. .nh 2 The DBIO Interface .nh 3 Overview The database i/o (DBIO) interface is the interface by which all compiled programs directly or indirectly access data maintained by the DBSS. DBIO is primarily a high level record manager interface. DBIO defines the logical structure of a database and directly implements most of the operations possible upon the objects in a database. The major functions of DBIO are to translate a record select/project expression into a series of physical record accesses, and to provide the applications program with access to the contents of the specified records. DBIO hides the physical structure and contents of the stored records from the applications program; providing data independence is one of the major concerns of DBIO. DBIO is not directly concerned with the physical storage of tables and records in mass storage, nor with the methods used to physically access such objects. The latter operations, i.e., the \fIaccess method\fR, are provided by a database kernel (DBK). We first review the philosophy underlying the design of DBIO, and discuss how DBIO differs from most commercial database systems. Next we describe the logical structure of a database and introduce the objects making up a database. The method used to define an actual database is described, followed by a description of the methods used to access the contents of a database. Lastly we describe the mapping of a DBIO database into physical files. .nh 3 Comparison of DBIO and Commercial Databases The design of the DBIO interface is based on a thorough study of existing database systems (most especially System-R, DB2 and INGRES). It was clear from the beginning that these systems were not ideally suited to our application, even if the proprietary and portability issues were ignored. Eventually the differences between these commercial database systems and the system we need became clear. The differences are due to a change in focus and emphasis as much as to the obvious differences between scientific and commercial applications, and are summarized below.
.ls 4 .ls o The commercial systems are not sufficiently flexible in the types of data that can be stored. In particular these systems do not in general support variable length arrays of arbitrary datatype; most do not support even static arrays. Only a few systems allow new attributes to be added to existing tables. Most systems talk about domains but few implement them. We need both array storage and the ability to dynamically add new attributes, and it appears that domains will be quite useful as well. .le .ls o Most commercial systems emphasize the query language, which forms the basis for the host language interface as well as the user interface. The query language is the focus of these systems. In our case the DBSS is embedded within IRAF as one of many subsystems. While we do need query language facilities at the user level, we do not need such sophisticated facilities at the DBIO level and would rather do without the attendant complexity and overhead. .le .ls o Commercial database systems are designed for use in a multiuser transaction processing environment. Many users may simultaneously be performing update and retrieval operations upon a single centralized database. The financial success of the company may well depend upon the integrity of the database. Downtime can be very expensive. In contrast we anticipate having many independent databases. These will be of two kinds: public and private. The public databases will virtually always be accessed read only and the entire database can be locked for exclusive access if it should ever need updating. Only the private databases are subject to heavy updating; concurrent access is required for background jobs but the granularity of locking can be fairly coarse. If a database should become corrupted it can be fixed at leisure or even regenerated from scratch without causing great hardship. Concurrency, integrity, and recovery are therefore less important for our applications than in a commercial environment. .le .ls o Most commercial database systems (with the exception of the UNIX based INGRES) are quite machine, device, and host system dependent. In our case portability of both the software and the data is a primary concern. The requirement that we be able to archive data in a machine independent format and read it on a variety of machines seems to be an unusual one. .le .le In summary, we need a simple interface which provides flexibility in the way in which data can be stored, and which supports complex, dynamic data structures containing variable length arrays of any datatype and size. The commercial database systems do not provide enough flexibility in the types of data structures they can support, nor do they provide enough flexibility in storage formats. On the other hand, the commercial systems provide a more sophisticated host language interface than we need. DBIO should therefore emphasize flexible data structures but avoid a complex syntax and all the problems that come with it. Concurrency and integrity are important but are not the major concerns they would be in a commercial system. .nh 3 Query Language Interface We noted in the last section that DBIO should be a simple record manager type interface rather than an embedded query language type interface. This approach should yield the simplest interface meeting our primary requirements. Nonetheless a host language interface to the query language is possible and can be added in the future without compromising the present DBIO interface design.
The query language will be implemented as a conventional CL callable task in the DBMS package. Command input to the query language will be entered interactively via the terminal (the usual case), or noninteractively via a string type command line argument or via a file. Any compiled program can send commands to the query language (or to any CL task) using the CLIO \fBclcmd\fR procedure. Hence a crude but usable HLI query language interface will exist as soon as a query language becomes available. A true high level embedded query language interface could be built using the same interface internally, but this should be left to some future compiled version of SPP rather than attempted with the current preprocessor. We have no immediate plans to build such an embedded query language interface but there is nothing in the current design to hinder such a project should it someday prove worthwhile. .nh 3 Logical Schema In this section we present the logical schema of a DBIO database. A DBIO database consists of a set of \fBsystem tables\fR and a set of \fBuser tables\fR. The system tables define the structure of the database and its contents; the user tables contain user data. All tables are instances of named \fBrelations\fR or \fBviews\fR. Relations and views are ordered collections of \fBattributes\fR or \fBgroups\fR of attributes. Each attribute is defined upon some particular \fBdomain\fR. The structure of the objects in a database is defined at runtime by processing a specification written in the \fBdata definition language\fR. .nh 4 Databases A DBIO database is a collection of named tables. All databases include a standard set of \fBsystem tables\fR defining the structure and contents of the database. Any number of user or application defined tables may also be present in the database. The most important system table is the database \fIcatalog\fR, which includes a record describing each user or system table in the database. Conceptually a database is similar to a directory containing files. The catalog corresponds to the directory and the tables correspond to the files. A database is however a different type of object; there need be no obvious connection between the objects in a database and the physical directories and files used to store a database, e.g., several tables might be stored in one file, one table might be stored in many files, the tables might be stored on a special device and not in files at all, and so on. In general the mapping of tables into physical objects is hidden from the user and is not important. The only exception to this is the association of a database with a specific FIO directory. The mapping between databases and directories is one to one, i.e., a directory may contain only one database, and a database is contained in a single directory. An entire database can be physically moved, copied, backed up, or restored by merely performing a binary copy of the contents of the directory. DBIO dynamically generates all file names relative to the database directory, hence moving a database to a different directory is harmless. To hide the database directory from the user DBIO supports the concept of a \fBcurrent database\fR in much the same way that FIO supports the concept of a current directory. Tables are normally referenced by name, e.g., "ptable masks", without explicitly naming the database (i.e., directory) in which the table resides.
The current database is maintained independently of the current
directory, allowing the user to change directories without affecting the
current database. This is particularly useful when accessing public
databases (maintained in a write protected directory) or when accessing
databases which reside on a remote node. To list the contents of the
current database the user must type "pcat" rather than "dir". The current
database defaults to the current directory until the user explicitly sets
the current database with the \fBchdb\fR command.

Databases are referred to by the filename of the database directory. The
IRAF system will provide a "master catalog" of public databases,
consisting of little more than a set of CL environment definitions
assigning logical names to the database directories. Whenever possible
the logical name should be used rather than the pathname, hiding the
physical location of the database.
.nh 4
System Tables

The structure and contents of a DBIO database are described by the same
table mechanism used to maintain user data. DBIO automatically maintains
the system tables, which are normally protected from writing by the user
(the system tables can be manually updated like any other table in a
desperate situation). Since the system tables are ordinary tables, they
can be inspected, queried, etc., using the same utilities used to access
the user data tables. The system tables are summarized below.
.ls 4
.ls 12 syscat
The database catalog. Contains an entry (record) for every table or view
in the database.
.le
.ls sysatt
The attribute list table. Contains an entry for every attribute in every
table in the database.
.le
.ls sysddt
The domain descriptor table. Contains an entry for every defined domain
in the database. Any number of attributes may share the same domain.
.le
.ls sysidt
The index descriptor table. Contains an entry for every primary or
secondary index in the database.
.le
.le
The system tables are visible to the user, i.e., they appear in the
database catalog. Like the user tables, the system tables are themselves
described by entries in the database catalog, attribute list table, and
domain descriptor table.
.nh 4
The System Catalog

The \fBsystem catalog\fR is effectively a "table of contents" for the
database. The fields of the catalog relation \fBsyscat\fR are as follows.
.ls 4
.ls 12 table
The name of the user or system table described by the current record.
Table names may contain any combination of the alphanumeric characters,
underscore, or period and must not exceed 32 characters in length.
.le
.ls relid
The table identifier. A unique integer code by which the table is
referred to internally.
.le
.ls type
Identifies the type of table, e.g., base table or view.
.le
.ls ncols
The number of columns (attributes) in the table.
.le
.ls nrows
The number of rows (records, tuples) in the table.
.le
.ls rsize
The size of a record in bytes, not including array storage.
.le
.ls tsize
An estimate of the total number of bytes of storage currently in use by
the table, including array storage.
.le
.ls ctime
The date and time when the table was created.
.le
.ls mtime
The date and time when the table was last modified.
.le
.ls flags
A small integer containing flag bits used internally by DBIO. These
include the protection bits for the table. Initially only write
protection and delete protection will be supported (for everyone).
Additional protections are of course provided by the file system.
A flag bit is also used to indicate that the table has one or more
indexes, to avoid an unnecessary search of the \fBsysidt\fR table when
accessing an unindexed table.
.le
.le
Only a subset of these fields will be of interest to the user in ordinary
catalog listings. The \fBpcatalog\fR task will by default print only the
most interesting fields. Any of the other DBMS output tasks may be used
to inspect the catalog in detail.
.nh 4
Relations

A \fBrelation\fR is an ordered set of named attributes, each of which is
defined upon some specific domain. A \fBbase table\fR is a named instance
of some relation. A base table is a real object like a file; a base table
appears in the catalog and consumes storage on disk. The term "table" is
more general, and is normally used to refer to any object which can be
accessed like a base table. A DBIO relation is defined by a set of
records describing the attributes of the relation. The attribute lists of
all relations are stored in the \fBsysatt\fR table, described in the next
section.
.nh 4
Attributes

An \fBattribute\fR of a relation is a datum which describes some aspect
of the object described by the relation. Each attribute is defined by a
record in the \fBsysatt\fR table, the fields of which are described
below. The attribute list table, while visible to the user if they wish
to examine the structure of the database in detail, is primarily an
internal table used by DBIO to define the structure of a record.
.ls 4
.ls 12 name
The name of the attribute described by the current record. Attribute
names may contain any combination of the alphanumeric characters or
underscore and must not exceed 16 characters in length.
.le
.ls attid
The attribute identifier. A unique integer code by which the attribute is
referred to internally. The \fIattid\fR is unique within the relation to
which the attribute belongs, and defines the ordering of attributes
within the relation.
.le
.ls relid
The relation identifier of the table to which this attribute belongs.
.le
.ls domid
The domain identifier of the domain to which this attribute belongs.
.le
.ls dtype
A single character identifying the atomic datatype of this attribute.
Note that domain information is not used for most runtime record
accesses.
.le
.ls prec
The precision of the atomic datatype of this attribute, i.e., the number
of bytes of storage per element.
.le
.ls count
The number of elements of type \fIdtype\fR in the attribute. A count of
one denotes a scalar, zero a variable length array, and N greater than
one a static array of N elements.
.le
.ls offset
The offset of the field in bytes from the start of the record.
.le
.ls width
The width of the field in bytes. All fields occupy a fixed amount of
space in a record. In the case of variable length arrays fields
\fBoffset\fR and \fBwidth\fR refer to the array descriptor.
.le
.le
In summary, the attribute list defines the physical structure of a record
as stored in mass storage. DBIO is responsible for encoding and decoding
records as well as for all access to the fields of records. A record is
encoded as a byte stream in a machine independent format. The physical
representation of a record is discussed further in a later section
describing the DBIO storage structures.
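In compiled form one might picture the attribute list entry for a single
attribute as something like the following C structure. The structure
shown is purely illustrative (the actual storage structures are described
in the later section just mentioned); it merely restates the fields of
\fBsysatt\fR:
.ks
.nf
/* Illustrative only; not the actual DBIO descriptor. */
struct attribute {
        char    name[17];       /* attribute name, <= 16 chars          */
        int     attid;          /* ordinal position within relation     */
        int     relid;          /* relation to which attribute belongs  */
        int     domid;          /* domain upon which it is defined      */
        char    dtype;          /* atomic datatype code                 */
        int     prec;           /* bytes of storage per element         */
        int     count;          /* 1=scalar, 0=variable, N=static array */
        int     offset;         /* byte offset of field within record   */
        int     width;          /* bytes of fixed storage in record     */
};
.fi
.ke
Given such a descriptor a scalar field is extracted by decoding
\fIwidth\fR bytes starting \fIoffset\fR bytes from the start of the
encoded record; for a variable length array the same bytes yield the
array descriptor rather than the data itself.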
.nh 4
Domains

A domain is a restricted implementation of an abstract datatype. Simple
examples are the atomic datatypes char, integer, real, etc.; no doubt
these will be the most commonly used domains. A more interesting example
is the \fItime\fR domain. Times are stored in DBIO as attributes defined
upon the \fItime\fR domain. The atomic datatype of a time attribute is a
four byte integer; the value is the long integer value returned by the
IRAF system procedure \fBclktime\fR. Integer time values are convenient
for time domain arithmetic, but are not good for printed output. The
definition of the \fItime\fR domain therefore includes a specification
for the output format which will cause time attributes to be printed as a
formatted date/time string.

Domains are used to verify input and to format output, hence there is no
domain related overhead during record retrieval. The only exception to
this rule occurs when returning the value of an uninitialized attribute,
in which case the default value must be fetched from the domain
descriptor.

Domains may be defined either globally for the entire database or locally
for a specific table. Attributes in any table may be defined upon a
global domain. The system table \fBsysddt\fR defines all global and local
domains. The attributes of this table are outlined below.
.ls 4
.ls 12 name
The name of the domain described by the current record. Domain names may
contain any combination of the alphanumeric characters or underscore and
must not exceed 16 characters in length.
.le
.ls domid
The domain identifier. A unique integer code by which the domain is
referred to internally. The \fIdomid\fR is unique within the table for
which the domain is defined.
.le
.ls relid
The relation identifier of the table to which this domain belongs. This
is set to zero if the domain is defined globally.
.le
.ls grpid
The group identifier of the group to which this domain belongs. This is
set to zero if the domain does not belong to a special group. A negative
value indicates that the named domain is itself a group (groups are
discussed in the next section).
.le
.ls dtype
A single character identifying the atomic datatype upon which the domain
is defined.
.le
.ls prec
The precision of the atomic datatype of this domain, i.e., the number of
bytes of storage per element.
.le
.ls defval
The default value for attributes defined upon this domain (a byte string
of length \fIprec\fR bytes). If no default value is specified DBIO will
assume that null values are not permitted for attributes defined upon
this domain.
.le
.ls minval
The minimum value permitted. This attribute is used only for integer or
real valued domains.
.le
.ls maxval
The maximum value permitted. This attribute is used only for integer or
real valued domains.
.le
.ls enumval
If the domain is string valued with a fixed number of permissible value
strings, the legal values may be enumerated in this string valued field.
.le
.ls units
The units label for attributes defined upon this domain.
.le
.ls format
The default output format for printed output. All SPP formats are
supported (e.g., including HMS, HM, octal, etc.) plus some special DBMS
formats, e.g., the time format.
.le
.ls width
The field width in characters for printed output.
.le
.le
Note that the \fIunits\fR and \fIformat\fR fields and the four "*val"
fields are stored as variable length character arrays, hence there is no
fixed limit on the sizes of these strings. Use of a variable length field
also minimizes storage requirements and makes it easy to test for an
uninitialized value. Only fixed length string fields and scalar valued
numeric fields may be used in indexes and selection predicates, however.
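As an illustration of how these fields might be filled in, a user domain
declaration in the data definition language might look something like the
following. Both the syntax and the domain shown are imaginary (the DDL
syntax has not yet been defined; see the DDL section below):
.ks
.nf
domain exptime {
        dtype=r, prec=4
        defval=INDEFR, minval=0.0
        units="seconds", format="%8.2f", width=8
}
.fi
.ke
An attribute defined upon such a domain would default to INDEFR when
uninitialized, would be rejected if assigned a negative value, and would
be printed with the format and field width shown.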
A number of global domains are predefined by DBIO. These are summarized
in the table below.
.ks
.nf
NAME    DTYPE   PREC    DEFVAL

byte      u       1     0
char      c      arb    nullstr
short     i       2     INDEFS
int       i       4     INDEFI
long      i       4     INDEFL
real      r       4     INDEFR
double    r       8     INDEFD
time      i       4     0
.fi
.ke
The predefined global domains, as well as all user defined domains, are
defined in terms of the four DBK variable precision atomic datatypes.
These are the following:
.ks
.nf
NAME    DTYPE   PREC    DESCRIPTION

char      c     >=1     character
uint      u     1-4     unsigned integer
int       i     1-4     signed integer
real      r     2-8     floating point
.fi
.ke
DBIO stores records with the field values encoded in the machine
independent variable precision DBK data format. The precision of an
atomic datatype is specified by an integer N, the number of bytes of
storage to be reserved for the value. The permissible precisions for each
DBK datatype are shown in the preceding table. The actual encoding used
is designed to simplify the semantics of the DBK and is not any standard
format. The DBK binary encoding will be described in a later section.
.nh 4
Groups

A \fBgroup\fR is a logical grouping of several related attributes. A
group is much like a relation except that a group is a type of domain and
may be used as such to define the attributes of relations. Since groups
are similar to relations, groups are defined in the \fBsysatt\fR table
(groups do not, however, appear in the system catalog). Each member of a
group is an attribute defined upon some domain; nesting of groups is
permitted.

Groups are expanded when a relation is defined, hence the runtime system
need not be aware of groups. Expansion of a group produces a set of
ordinary attributes wherein each attribute name consists of the group
name glued to the member name with a period, e.g., the resolved
attributes "cv.ncoeff" and "cv.type" are the result of expansion of a
two-member group attribute named "cv". The main purposes of the group
construct are to simplify data definition and to give the forms generator
additional information for structuring formatted output.

Groups provide a simple capability for structuring data within a table.
Whenever the same grouping of attributes occurs in several tables the
group mechanism should be used to ensure that all instances of the group
are defined equivalently.
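A group and its use might be declared in the data definition language
along the following lines. Once again the syntax and the relation shown
are imaginary; only the expansion rule being illustrated is part of the
design:
.ks
.nf
group curve {
        i2      ncoeff
        c15     type
}
relation spectra {
        c64     image
        curve   cv
}
.fi
.ke
Expansion of the group attribute \fIcv\fR yields the ordinary attributes
"cv.ncoeff" and "cv.type" in relation \fIspectra\fR; any other relation
declaring an attribute upon the same group is guaranteed to define the
member attributes equivalently.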
.nh 4
Views

A \fBview\fR is a virtual table defined in terms of one or more base
tables or other views via a record select/project expression. Views
provide different ways of looking at the same data; the view mechanism
can be very useful when working with large, complex base tables (it saves
typing). Views allow the user to focus on just the data that interests
them and ignore the rest. The view mechanism also significantly increases
the amount of data independence provided by DBIO, since a base table can
be made to look different to different applications programs without
physically modifying the table or producing several copies of the same
table. This capability can be invaluable when the tables involved are
very large or cannot be modified for some reason.

A view provides a "window" into one or more base tables. The window is
dynamic in the sense that changes to the underlying base tables are
immediately visible through the window. This is because a view does not
contain any data itself, but is rather a \fIdefinition\fR via record
selection and projection of a new table in terms of existing tables.

For example, consider the following imaginary select/project expression
(SPE):
.ks
.nf
        data1 [x >= 10 and x <= 20] % obj, x, y
.fi
.ke
This defines a new table with attributes \fIobj\fR, \fIx\fR, and \fIy\fR
consisting of all records of table \fIdata1\fR for which \fIx\fR is in
the range 10 to 20. We could use the SPE shown to copy the named fields
of the selected records to produce a new base table, e.g., \fId1x\fR. The
view mechanism allows us to define table \fId1x\fR as a view-table,
storing only the SPE shown. When the view-table \fId1x\fR is subsequently
queried DBIO will \fImerge\fR the SPE supplied in the new query with that
stored in the view, returning only records which satisfy both selection
expressions. This works because the output of an SPE is a table and can
therefore be used as input to another SPE, i.e., two or more selection
expressions can be combined to form a more complex expression.

A view appears to the user (or to a program) as a table, behaving
equivalently to a base table in most operations. View-tables appear in
the catalog and can be created and deleted much like ordinary tables.
.nh 4
Null Values

Null valued attributes are possible in any database system; they are
guaranteed to occur when the system permits new attributes to be
dynamically added to existing, nonempty base tables. DBIO deals with null
values by the default value mechanism mentioned earlier in the discussion
of domains. When the value of an uninitialized attribute is referenced
DBIO automatically supplies the user specified default value of the
attribute. The defaulting mechanism supports three cases; these are
summarized below.
.ls 4
.ls o
If null values are not permitted for the referenced attribute DBIO will
return an error condition. This case is indicated by the absence of a
default value.
.le
.ls o
Indefinite (or any special value) may be returned as the default value if
desired, allowing the calling program to test for a null value.
.le
.ls o
A valid default value may be returned, with no checking for null values
occurring in the calling program.
.le
.le
Testing for null values in predicates is possible only if the default
value is something recognizable like INDEF, and is handled by the
conventional equality operator. Indefinites are propagated in expressions
by the usual rules, i.e., the result of any arithmetic expression
containing an indefinite is indefinite, order comparison where an operand
is indefinite is illegal, and equality or inequality comparison is legal
and is well defined.
.nh 3
Data Definition Language

The data definition language (DDL) is used to define the objects in a
database, e.g., during table creation. The function of the DBIO table
creation procedure is to add tuples to the system tables to define a new
table and all attributes, groups, and domains used in the table. The data
definition tuples can come from either of two sources: [1] they can be
copied in compiled form from an existing table, or [2] they can be
generated by compilation of a DDL source specification.

In appearance DDL looks much like a series of structure declarations such
as one finds in most modern compiled languages. DDL text may be entered
either via a string buffer in the argument list (no file access required)
or via a text file named in the argument list to the table creation
procedure. The DDL syntax has not yet been defined.
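The two modes of DDL input might appear as follows from a compiled
program. The procedure name and calling sequence shown are wholly
invented for this illustration (the actual table creation procedure has
not yet been specified), and the DDL text shown is likewise imaginary:
.ks
.nf
/* Hypothetical calling sequence -- illustration only. */
extern  int     db_createtable (int db, char *ddltext, char *ddlfile);

void
make_phot_table (int db)
{
        /* [1] DDL passed as an in-core string; no file access
         * required.
         */
        db_createtable (db, "relation phot { real mag }", 0);

        /* [2] Equivalently, DDL read from a text file named in the
         * argument list.
         */
        db_createtable (db, 0, "phot.ddl");
}
.fi
.ke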
An example of what a DDL declaration for the IMIO \fImasks\fR relation
might look like is shown below. The syntax shown is a generalization of
the SPP+ syntax for a structure declaration with a touch of the CL thrown
in. If a relation is defined only in terms of the predefined domains or
atomic datatypes and has no primary key, etc., then the declaration would
look very much like an SPP+ (or C) structure declaration.
.ks
.nf
relation masks {
        u2      mask    { width=6 }
        c64     image   { defval="", format="%20.20s", width=21 }
        c15     type    { defval="generic" }
        byte    naxis
        long    naxis1, naxis2, naxis3, naxis4
        long    npix
        i2      pixels[]
} where {
        key = mask+image+type
        comment = "image region masks"
}
.fi
.ke
The declaration shown identifies the primary key for the relation and
gives a comment describing the relation, then declares the attributes of
the relation. In this example all domains are either local and are
declared implicitly, or they are global and are predefined. For example,
DBIO will automatically create a domain named "type" belonging to the
relation "masks" for the attribute named "type". DBIO is assumed to
provide default values for the attributes of each domain (e.g., "format",
"width", etc.) not specified explicitly in the declaration.

It should be possible to keep the DDL syntax simple enough that an LALR
parser does not have to be used, reducing text memory requirements and
the time required to process the DDL, and improving error diagnostics.
.nh 3
Record Select/Project Expressions

Most programs using DBIO will be relational operators, taking a table as
input, performing some operation or transformation upon the table, and
either updating the table or producing a new table as output. DBIO record
select/project expressions (SPE) are used to define the input table. By
using an SPE one can define the input table to be any subset of the
fields (projection) of any subset of the records (selection) of any set
of base tables or views (set union).

The general form of a select/project expression is shown below. The
syntax is patterned after the algebraic languages and even happens to be
upward compatible with the existing IMIO image template syntax.
.ks
.nf
        tables [pred] [upred] % fields

where
    tables      Is a comma delimited list of tables.
    ,           Is the set union operator (in the tables and fields
                lists).
    [           Is the selection operator.
    pred        Is a predicate, i.e., a boolean condition. The simplest
                predicate is a constant or list of constants, specifying
                a set of possible values for the primary key.
    upred       Is a user predicate, passed back to the calling program
                appended to the record name but not used by DBIO. This
                feature is used to implement image sections.
    %           Is the projection operator.
    fields      Is a comma delimited list of \fIexpressions\fR defined
                upon the attributes of the input relation, defining the
                attributes of the output relation.
.fi
.ke
All components of an SPE are optional except \fItables\fR; the simplest
SPE is the name of a single table. Some simple examples follow.
.nh 4
Examples

Print all fields of table \fInite1\fR. The table \fInite1\fR is an image
table containing several images with primary keys 1, 2, 3, and so on.

        cl> ptable nite1

Print selected fields of table \fInite1\fR.

        cl> ptable nite1%image,title

Plot line 200 of image 2 in table \fInite1\fR.

        cl> graph nite1[2][*,200]

Print image statistics on the indicated images in table \fInite1\fR. The
example shows a predicate specifying images 1, 3, and 5 through 12, not
an image section.

        cl> imstat nite1[1,3,5:12]

Print the names and number of bad pixels in tables \fInite1\fR and
\fIm87\fR for all images that have any bad pixels.
cl> ptable "nite1,m87 [nbadpix > 0] % image, nbadpix" The tables in an SPE may be general select/project expressions, not just the names of base tables or views as in the examples. In other words, SPEs may be nested, using parenthesis around the inner SPE if necessary to indicate the order of evaluation. As noted earlier in the discussion of views, the ability of SPEs to nest is used to implement views. Nesting may also be used to perform selection or projection upon the individual input tables. For example, the SPE used in the following command specifies the union of selected records from tables \fInite1\fR and \fInite2\fR. cl> imstat nite1[1,8,21:23],nite2[9] .nh 3 Operators