author | Joseph Hunkeler <jhunkeler@gmail.com> | 2015-07-08 20:46:52 -0400
committer | Joseph Hunkeler <jhunkeler@gmail.com> | 2015-07-08 20:46:52 -0400
commit | fa080de7afc95aa1c19a6e6fc0e0708ced2eadc4 (patch)
tree | bdda434976bc09c864f2e4fa6f16ba1952b1e555 /sys/dbio/new/dbio.hlp
download | iraf-linux-fa080de7afc95aa1c19a6e6fc0e0708ced2eadc4.tar.gz
Initial commit
Diffstat (limited to 'sys/dbio/new/dbio.hlp')
-rw-r--r-- | sys/dbio/new/dbio.hlp | 3202 |
1 files changed, 3202 insertions, 0 deletions
diff --git a/sys/dbio/new/dbio.hlp b/sys/dbio/new/dbio.hlp
new file mode 100644
index 00000000..d5d9c77f
--- /dev/null
+++ b/sys/dbio/new/dbio.hlp
@@ -0,0 +1,3202 @@
+.help dbss Sep85 "Design of the IRAF Database Subsystem" +.ce +\fBDesign of the IRAF Database Subsystem\fR +.ce +Doug Tody +.ce +September 1985 +.sp 2 + +.nh +Preface + + The primary purpose of this document is to define the interfaces comprising +the IRAF database i/o subsystem to the point where they can be built rapidly +and efficiently, with confidence that major changes will not be required after +implementation begins. The document also serves to inform all interested +parties of what is planned while there is still time to change the design. +A change which can easily be made to the design prior to implementation may +become prohibitively expensive as implementation proceeds. After implementation +is completed and the new subsystem has been in use for several months the basic +interfaces will be frozen and the opportunity for change will have passed. + +The description of the database subsystem presented in this document should +be considered to be no more than a close approximation to the system which +will actually be built. The specifications of the interface can be expected +to change in detail as the implementation proceeds. Any code which is written +according to the interface specifications presented in this document may have +to be modified slightly before system testing with the final interfaces can +proceed. + +.nh 2 +Scope of this Document + + The scope of this document is the conceptual design and specification of +all IRAF packages and i/o interfaces directly involved with either user or +program access to binary data maintained in mass storage. Versions of some +of the interfaces described are already in use; when this is the case it will +be noted in the text. + +This document is neither a user's guide nor a reference manual. 
The reader +is assumed to be familiar with both database technology and with the IRAF +system. In particular, the reader should be familiar with the concept of the +IRAF VOS (virtual operating system), with the features of the IMIO (image i/o), +FIO (file i/o), and OS (host system interface) interfaces, as well as with the +architecture of the network interface. + +.nh 2 +Relationship to Previous Documents + + This document supersedes the document "IRAF Database I/O", November 1984. +Most of the concepts presented in that document are still valid but have been +expanded upon greatly in the present document. The scope of the original +document was limited to the DBIO interface alone, whereas the scope of the +present document has been expanded to encompass all subsystems or packages +directly involved with binary data access. This expansion in the scope of +the project was necessary to meet our primary goal of completing and freezing +the program interface, of which DBIO is only a small part. Furthermore, it +is difficult to have confidence in the design of a single subsystem without +working out the details of all closely related or dependent subsystems. + +In addition to expanding the scope of the database design project to cover +more interfaces, the requirements which the database subsystem must meet have +been expanded since the original conceptual design was done. In particular +it has become clear that data format conversions are prohibitively expensive +for our increasingly large datasets. Conversions such as those between FITS +and internal format (for an image), or between FITS table and internal format +(for a database) are too expensive to be performed routinely. Data which is +archived in a machine independent format should not have to be reformatted +to be accessed by the online system. The archival format may vary from site +to site and it should be possible to read the different formats without +reformatting the data. 
Large datasets should not have to be reformatted to +be moved between machines with incompatible binary data formats. + +A change such as this in the requirements for an interface can have a major +impact on the design of the final interface. It is essential that all such +requirements be identified and dealt with in the design before implementation +begins. + +.nh +Introduction + + In this section we introduce the database subsystem and summarize the +reasons why we need such a system. We then introduce the major components +of the database subsystem and briefly mention some related subsystems. + +.nh 2 +The Database Subsystem + + The database subsystem (DBSS) is conceived as a single comprehensive system +to be used to manage and access all binary (non textfile) data accessed by IRAF +programs. Simple applications are perhaps most easily and flexibly dealt with +using text files for the storage of data, descriptors, and control information. +As the amount of data to be processed grows or as the data structures to be +accessed grow in complexity, however, the text file approach becomes seriously +inefficient and cumbersome. Converting the text files to binary files makes +processing more efficient but does little to address the problems of complex +data structures. Efficient access to complex data structures requires complex +and expensive software. Developing such software specially for each and every +application is prohibitively expensive in a large system; hence the need for +a general purpose database system becomes clear. + +Use of a single central database system has significant additional advantages. +A standard user interface can be used to examine, edit, list, copy, etc., all +data maintained under the database system. 
Many technical problems may be +addressed in a general purpose system that would be too expensive to address +in a particular application, e.g., the problems of storing variable size data +elements, of dynamically and randomly updating a dataset, of byte packing to +conserve storage, of maintaining indexes so that a record may be found +efficiently in a large dataset, of providing data independence so that storage +formats may be changed without need to change the program accessing the data, +and of transport of binary datasets between incompatible machines. All of +these are examples of problems which are \fInot\fR adequately addressed by the +current IRAF i/o interfaces nor by the applications programs which use them. + +.nh 2 +Major Subsystems + + The major subsystems comprising the IRAF DBSS are depicted in Figure 1. +At the highest level are the CL (command language) packages, each of which +consists of a set of user callable tasks. The IMAGES package (consisting +of general image processing operators) is shown for completeness but since +there are many such packages in the system they are not considered part of +the DBSS and will not be discussed further here. +The DBMS (database management) package is the user interface to the DBSS, +and some day will possibly be the largest part of the DBSS in terms of number +of lines of code. + +In the center of the figure we see the VOS (virtual operating system) packages +IMIO, DBIO and FIO. FIO (file i/o) is the standard IRAF file interface and +will not be discussed further here. IMIO (image i/o) and DBIO (database i/o) +are the two major i/o interfaces in the DBSS and are the topic of much of the +rest of this document. IMIO and DBIO are the two parts of the DBSS of interest +to applications programmers; these interfaces are implemented as libraries of +subroutines to be called directly by the applications program. IMIO and FIO +are existing interfaces. + +At the bottom of the figure is the DB Kernel. 
The DB Kernel is the component +of the DBSS which physically accesses the data in mass storage (via FIO). +The DB Kernel is called only by DBIO and hence is invisible to both the user +and the applications programmer. There is a lot more to the DB Kernel than +is evident from the figure, and indeed the DB Kernel will be the subject of +another figure when we discuss the system architecture in section 4.2. + + +.ks +.nf + DBMS IMAGES(etc) (CL) + \ / + \ / --------- + \ / + \ IMIO + \ / \ + \ / \ + \/ \ (VOS) + DBIO FIO + | + | + | --------- + | + | + (DB Kernel) (VOS or Host System) + +.fi +.ce +Figure 1. Major Components of the Database Subsystem +.ke + + +With the exception of certain optional subsystems to be outlined later, +the entire DBSS is machine independent and portable. The IRAF system may +be ported to a new machine without any knowledge whatsoever of the +architecture or functioning of the DBSS. + +.nh 2 +Related Subsystems + + Several additional IRAF subsystems or packages are of interest from the +standpoint of the DBSS. These are the PLOT package, the graphics interface +GIO, and the LISTS package. + +The PLOT package is a CL level package consisting of general plotting +utilities. In general PLOT tasks can accept input in a number of standard +formats, e.g., \fBlist\fR (text file) format and \fBimagefile\fR format. +The DBSS will provide an additional standard format which should perhaps be +directly accessible by the PLOT tasks. Even if this is not done a very +general plotting capability will automatically be provided by "piping" the +list format output of a DBMS task to a PLOT task. Additional graphics +capabilities will be provided as built in functions in the DBMS +\fBquery language\fR, which will access GIO directly to make plots. +The query language graphics facilities will be faster and more convenient +to use but less extensive and less sophisticated than those provided by PLOT. 
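The layering in Figure 1 can be sketched in modern Python for illustration: applications and the DBMS call DBIO, DBIO alone calls the DB Kernel, and the kernel reaches mass storage only through FIO. All class and method names here are invented stand-ins; the real interfaces are SPP subroutine libraries, not classes.

```python
# Hypothetical sketch of the DBSS layering in Figure 1.
# Every name below is invented for illustration only.

class FIO:
    """Stand-in for the IRAF file i/o interface (mass storage)."""
    def __init__(self):
        self.storage = {}          # fake "disk": filename -> data
    def write(self, fname, data):
        self.storage[fname] = data
    def read(self, fname):
        return self.storage[fname]

class DBKernel:
    """Physically accesses data in mass storage, but only via FIO."""
    def __init__(self, fio):
        self.fio = fio
    def put_record(self, table, key, record):
        self.fio.write(f"{table}:{key}", record)
    def get_record(self, table, key):
        return self.fio.read(f"{table}:{key}")

class DBIO:
    """Program-level interface; the only caller of the DB Kernel,
    so the kernel stays invisible to users and applications."""
    def __init__(self, kernel):
        self.kernel = kernel
    def store(self, table, key, record):
        self.kernel.put_record(table, key, record)
    def fetch(self, table, key):
        return self.kernel.get_record(table, key)

# An applications program (or the DBMS user interface) talks to DBIO,
# never to the kernel or FIO directly.
db = DBIO(DBKernel(FIO()))
db.store("nite1", 10, {"exptime": 300})
```

The point of the sketch is the dependency direction: replacing the kernel (e.g., with a host database interface) would not be visible above the DBIO layer.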
+ +The LISTS package is interesting because the facilities provided and operations +performed resemble those provided by the DBMS package in many respects. +The principal difference between the two packages is that the LISTS package +operates on arbitrary text files whereas the DBMS package operates only +upon DBIO format binary files. The textual output of \fIany\fR IRAF or +non-IRAF program may serve as input to a LISTS operator, as may any ordinary +text file, e.g., the source files for a program or package. A typical LISTS +database is a directory full of source files or documentation; LISTS can also +operate on tables of numbers but the former application is perhaps more +common. Using LISTS it is possible to conveniently and rapidly perform +operations (evaluate queries) which would be cumbersome or impossible to +perform with a conventional database system such as DBMS. On the other hand, +the LISTS operators would be hopelessly inefficient for the types of +applications for which DBMS is designed. + +.nh +Requirements + + Requirements define the problem to be solved by a software system. +There are two types of requirements, non-functional requirements, i.e., +restrictions or constraints, and functional requirements, i.e., the functions +which the system must perform. Since nearly all IRAF science software will +be heavily dependent on the DBSS, the requirements for this subsystem are as +strict as those for any subsystem in IRAF. + +.nh 2 +General Requirements + + The general requirements which the DBSS must satisfy primarily take the +form of constraints or restrictions. These requirements are common to +all mainline IRAF system software. Note that these requirements are \fInot\fR +automatically enforced for all system software. If a particular subsystem is +a prototype or optional (not required for the normal functioning of IRAF) then +these requirements can be relaxed. 
In particular, certain parts of the DBSS +(e.g., the host database interface) are optional and are not subject +to the same constraints as the mainline software. The primary functional +requirements discussed in section 3.2, however, must be met by software which +satisfies all of the general requirements discussed here. + +.nh 3 +Portability + + All software in the DBMS, IMIO, and DBIO interfaces and in the DB kernel +must be fully portable under IRAF. To meet this requirement the software +must be written in the IRAF SPP language using only the facilities provided +by the IRAF VOS. In particular, this rules out complicated record locking +schemes in the DB kernel, as well as any type of centralized database server +which relies on process control, IPC, or signal handling facilities not +provided by the IRAF VOS. For most processes the requirement is even more +strict, i.e., ordinary IRAF processes are not permitted to rely upon the VOS +process control or IPC facilities for their normal functioning (the IPC +connection to the CL is an exception since it is not required to run an +IRAF process standalone). + +.nh 3 +Efficiency + + The database interface must be efficient, particularly when used for +image access and intermodule communication. There are as many ways to +measure the efficiency of an interface as there are applications for the +interface, and we cannot address them all here. The dimensions of the +efficiency matrix we are concerned with here are the cpu time consumed +during execution, the clock time consumed during execution, e.g., the number +of file opens and disk seeks required, and the disk space consumed for +table storage. Where necessary, efficient cpu utilization will be achieved +at the expense of memory requirements for code and buffers. 
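The "clock time from a cold start" dimension of this efficiency matrix can be made concrete with a small timing sketch, in modern Python purely for illustration. The 512-byte header size and the file layout are invented assumptions, not part of the actual DBSS design.

```python
# Illustrative cold-start timing: open a dataset with no files already
# open, read its header, and report elapsed wall-clock time.
# Header size (512 bytes) and file layout are assumptions for the sketch.
import os
import tempfile
import time

def cold_start_access(path, header_size=512):
    """Open a file, read a fixed-size header, return (elapsed, header)."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:           # one file open, one seek/read
        header = f.read(header_size)
    return time.perf_counter() - t0, header

# Create a throwaway "dataset" and measure one cold access.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 2048)
os.close(fd)
elapsed, header = cold_start_access(path)
os.remove(path)
```

Comparing such a measurement against the old interface's two-file layout (header file plus pixel file, hence two opens) is the kind of test the requirement implies.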
+ +A simple and well defined efficiency requirement is that the cpu and clock +time required to access the pixels of an image stored in the database from +a "cold start" (no open files) must not noticeably exceed that required +by the old IMIO interface. The efficiency of the new interface for the +case when many images are to be accessed is expected to be a major improvement +over that provided by the old IMIO interface, since the old interface +stores each image in two separate files, whereas the new interface will +be capable of storing the entire contents of many (small) images in a single +file. The amount of disk space required for image header storage is also +expected to decrease by a large factor when multiple images are stored +in a single physical file. + +.nh 3 +Code Size + + We have already established that a process must directly access the +database in mass storage to meet our portability and efficiency requirements. +This type of access requires that the necessary IMIO, DBIO and DB Kernel +routines be linked into each process requiring database access. Minimizing +the amount of text space used by the database code is desirable to minimize +disk and memory requirements and process spawn time, but is not critical +since memory is cheap and plentiful and is likely to become even cheaper +and more plentiful in the future. Furthermore, the multitask nature of +IRAF processes allows the text segment used by the database code to be shared +by many tasks, saving both disk and memory. + +The main problem remaining today with large text segments seems to be the +process spawn time; loading the text segment by demand paging in a virtual +memory environment can be quite slow. The fault here seems to lie more with +the operating system than with IRAF, and probably the solution will require +tuning either the IRAF system interface or the operating system itself. 
+ +Taking all these factors into account it would seem that typical memory +requirements for the executable database code (not including data buffers) +in the range 50 to 100 Kb would be acceptable, with 50 Kb being a reasonable +goal. This would make the database interface the largest i/o interface in +IRAF but that seems inevitable considering the complexity of the problem to +be solved. + +.nh 3 +Use of Proprietary Software + + A mainline IRAF interface, i.e., any interface required for the normal +operation of the system, must belong to IRAF and must be distributed with +the IRAF system at no additional charge and with no licensing restrictions. +The source code must be part of the system and is subject to strict +configuration control by the IRAF group, i.e., the IRAF group is responsible +for the software and must control it. This rules out the use of a commercial +database system for any essential part of the DBSS, but does not rule out +IRAF access to a commercial database system provided such access is optional, +i.e., not required for the operation of the standard applications packages. +The host database interface provided by the DB kernel is an example of such +an interface. + +.nh 2 +Special Requirements + + In this section we present the functional requirements of the DBSS. +The major applications for which the DBSS is intended are described and +the desirable characteristics of the DBSS for each application are outlined. +The major applications thus far identified are catalog storage, image storage, +intermodule communication, and data archiving. + +.nh 3 +Catalog Storage + + The catalog storage application is probably the closest thing in IRAF to a +conventional database application. A catalog is a set of records, each of +which describes a single object. Each record consists of a set of fields +of various datatypes describing the attributes of the object. 
A record is +produced by numerical analysis of the object represented as a region of a +digital array. All records have the same structure, i.e., set of fields; +often the records are all the same size (but not necessarily). A large catalog +might contain several hundred thousand records. Examples of such catalogs are +the SAO star catalog, the IRAS point source catalog, and the catalogs produced +by analysis programs such as FOCAS (a faint object detection and classification +program) and RICHFLD (a digital stellar photometry program). Many similar +examples can be identified. + +Generation of such a catalog by an analysis program is typically a cpu bound +batch operation requiring many hours of computer time for a large catalog. +Once the catalog has been generated there are typically numerous questions of +scientific interest which can be answered using the data in the catalog. +It is highly desirable that this phase of the analysis be interactive and +spontaneous, as one question will often lead to another in an unpredictable +fashion. A general purpose analysis capability is required which will permit +the scientist to pose arbitrary queries of arbitrary complexity, to be answered +by the system in a few seconds (or minutes for large problems), with the answer +taking the form of a number or name, set or table of numbers or names, plot, +subcatalog, etc. + +Examples of such queries are given below. Clearly, the set of all possible +queries of this type is infinite, even assuming a limited number of operators +operating on a single catalog. The set of potentially interesting queries +is equally large. +.ls 4 +.ls [1] +Find all objects of type "pqr" for which X is in the range A to B and +Z is less than 10. +.le +.ls [2] +Compute the mean and standard deviation of attribute X for all objects +in the set [1]. +.le +.ls [3] +Compute and plot (X-Y) for all objects in set [1]. 
+.le +.ls [4] +Plot a circle of size (log2(Z-3.2) * 100) at the position (X,Y) of all objects +in set [1]. +.le +.ls [5] +Print the values of the attributes OBJ, X, Y, and Z of all objects for which +X is in the range A to B and Y is greater than 30. +.le +.le + + +In the past queries such as these have all too often been answered by writing +a program to answer each query, or worse, by wading through a listing of the +program output and manually computing the result or manually plotting points +on a graph. + +Given the preceding description of the catalog storage application, we can +make the following observations about the application of the DBSS to catalog +storage. +.ls +.ls o +A catalog is typically written once and then read many times. +.le +.ls o +Both public and private catalogs are common. +.le +.ls o +Catalog records are infrequently updated or are not updated at all once the +original entry has been made in the catalog. +.le +.ls o +Catalog records are rarely if ever deleted. +.le +.ls o +Catalogs can be very large, making efficient storage structures important +in order to minimize disk storage requirements. +.le +.ls o +Since catalogs can be very large, indexing facilities are required for +efficient record retrieval and for the efficient evaluation of queries. +.le +.ls o +A general purpose interactive query capability is required for the user to +effectively make use of the data in a catalog. +.le +.le + + +In DBSS terminology a user catalog will often be referred to as a \fBtable\fR +to avoid confusion with the use of the DBSS term \fBcatalog\fR which refers +to the system table which lists the contents of a database. + +.nh 3 +Image Storage + + A primary requirement for the DBSS, if not \fIthe\fR primary requirement, +is that the DBSS be suitable for the storage of bulk data or \fBimages\fR. +An image consists of two parts: an \fIimage header\fR describing the image, +and a multidimensional array of \fBpixels\fR. 
The pixel array is sometimes +small and sometimes very large indeed. For efficiency and other reasons the +actual pixel array is not required to be stored in the database. Even if the +pixels are stored directly in the database they are not expected to be used +in queries. + +We can make the following observations about the use of the DBSS for image +storage. The reader concerned about how all this might map into the storage +structures provided by a relational database should assume that the image +header is stored as a single large, variable size record (tuple), whereas +a group of images is stored as one or more tables (relations). If the images +are large assume the pixels are to be stored outside the DBSS in a file, storing +only the name of the file in the header record. +.ls +.ls o +Images tend to be grouped into sets that have some logical meaning to the user, +e.g., "nite1", "nite2", "raw", "reduced", etc. Each group typically contains +dozens or hundreds of images (enough to require use of an index for efficient +retrieval). +.le +.ls o +Within a group the individual images are often referred to by a unique ordinal +number which is automatically assigned by some program (e.g., "nite1.10", +"nite1.11", etc). +.le +.ls o +Image databases tend to be private databases, created and accessed by a +single user. +.le +.ls o +The size of the pixel segment of an image varies enormously, e.g., from +1 kilobyte to 8 megabytes, even 40 megabytes in some cases. +.le +.ls o +Small pixel segments are most efficiently stored directly in the image header +to minimize the number of file opens and disk seeks required to access the +pixels once the header has been accessed (as well as to minimize file clutter). +.le +.ls o +Large pixel segments are most efficiently stored separately from the image +headers to increase clustering and speed sequential searches of a group of +headers. 
+.le +.ls o +It is occasionally desirable to store either the image header or the pixel +segment on a special, non file-structured device. +.le +.ls o +The image header logically consists of a closed set of standard attributes +common to all images, plus an open set of attributes peculiar to the data +or to the type of analysis being performed on the data. +.le +.ls o +The operations performed on images are often functions which produce a +modified version of the input image(s) as a new output image. It is desirable +for most header information to be automatically preserved in such a mapping. +For this to happen automatically without the DBSS requiring knowledge of +the contents of a header, it is necessary that the header be a single object +to the DBSS, i.e., a single record in some table, rather than a set of +related records in several tables. +.le +.ls o +Since the image header needs to be maintained as a single record and since +the header may contain an unpredictable number of application or data specific +attributes, image headers can be quite large. +.le +.ls o +Not all image header attributes are simple scalar values or even fixed size +arrays. Variable size attributes, i.e., arrays, are common in image headers. +Examples of such attributes are the bad pixel list, history text, and world +coordinate system (more on this in a later section). +.le +.ls o +Image header attributes often form logical groupings, e.g., several logically +related attributes may be required to define the bad pixel list or the world +coordinate system. +.le +.ls o +The image header structure is often dynamically updated and may change in +size when updated. +.le +.ls o +It is often necessary to add new attributes to an existing image header. +.le +.ls o +Images are often selectively deleted. Any subordinate files logically +associated with the image should be automatically deleted when the image +header is deleted. 
If this is not possible under the DBSS then the DBSS +should forbid deletion of the image header unless special action is taken +to remove delete protection. +.le +.ls o +For historical or other reasons, a given site will often maintain images +in several different and completely incompatible formats. It is desirable +for the DBSS to be capable of directly accessing images maintained in a foreign +format without a format conversion, even if only limited (e.g., read only) +access is possible. +.le +.le + + +In summary, images are characterized by a header with a highly variable set +of fields, some of which may vary in size during the lifetime of the image. +New fields may be added to the image header at any time. Array valued fields +are common and fields tend to form logical groupings. The image header is +best maintained as a single structure under the DBSS. Image headers can be +quite large. The pixel segment of an image can be extremely large and may +be best maintained outside the DBSS. Since many image archives already exist, +each with its own unique format, it is desirable for the DBSS to be capable +of accessing multiple storage formats. + +Storage of the pixel segment or any other portion of an image in a separate +file outside the DBSS causes problems which must be dealt with at some level +in the system, if not by the DBSS. In particular, problems occur if the user +tries to backup, restore, copy, rename, or delete any portion of an image using +a host system utility. These problems are minimized if all logically related +data is kept in a single data directory, allowing the database as a whole to +be moved or backed up with host system utilities. All pathnames should be +defined relative to the data directory to permit relocation of the database +to a different directory. 
Ideally all binary datafiles in the database should +be maintained in a machine independent format to permit movement of the +database between different machines without reformatting the entire database. + +.nh 3 +Intermodule Communication + + A large applications package consists of many separate tasks or programs. +These tasks are best defined and understood in terms of their operation on a +central package database. For example, one task might fit some function to +an image, leaving a record describing the fit in the database. A second task +might take this record as input and use it to control a transformation on +the original image. Additional operators implementing a range of algorithms +or optimized for a discrete set of cases are easily added, each relying upon +the central database for intermodule communication. + +This application of the DBSS is a fairly conventional database application +except that array valued attributes and logical groupings of attributes are +common. For example, assume that a polynomial has been fitted to a data +vector and we wish to record the fit in the database. A typical set of +attributes describing a polynomial fit are shown below. + + +.ks +.nf + image_name char*30 # name of source image + nfeatures int # number of features fitted + features.x real*4[*] # x values of the features + features.y real*4[*] # y values of the features + curve.type char*10 # curve type + curve.ncoeff int # number of coefficients + curve.coeff real*4[*] # coefficients +.fi +.ke + + +The data structure shown records the positions (X) and world coordinates (Y) +of the data features to which the curve was fitted, plus the coefficients of +the fitted curve itself. There is no way of predicting the number of features +hence the X and Y arrays are variable length. 
Since the fitted curve might +be a spline or some other piecewise function rather than a simple polynomial, +there is likewise no reasonable way to place an upper limit on the amount of +storage required to store the fitted curve. This type of record is common in +scientific applications. + +We can now make the following observations regarding the use of the DBSS for +intermodule communication. +.ls +.ls o +The number of fields in a record tends to be small, but array valued fields +of variable size are common hence the physical size of a record may be large. +.le +.ls o +A large table might contain several hundred records in typical applications, +requiring the use of an index for efficient retrieval. +.le +.ls o +Record access is usually random rather than sequential. +.le +.ls o +Random record updates will be rare in some applications, but common in others. +.le +.ls o +Records will often change in size when updated. +.le +.ls o +Selective record deletion is rare, occurring mostly during cleanup following +an error. +.le +.ls o +New fields are rarely, if ever, added to existing records. The record structure +is usually determined by the programmer rather than by the user and tends to +be well defined. +.le +.ls o +This type of database is typically a private database created and used by a +single user to process a specific dataset with a specific applications package. +.le +.le + + +Application specific information may sometimes be stored directly in the header +of the image being analyzed, but more often will be stored in one or more +separate tables, recording the name of the image analyzed in the new record +as a backpointer, as in the example. Hence a typical scientific database +might consist of several tables containing the input images, several tables +containing intermodule records of various types, and one or more tables +containing either reduced images or catalog records, depending on whether a +reduction or analysis operation was performed. 
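The polynomial-fit record shown above can be rendered as a modern data structure to make the logical groupings and variable-length fields explicit. This is purely illustrative, in Python rather than SPP: the dotted names (features.x, curve.type) become nested records, and the actual DBIO records are binary structures, not objects.

```python
# Sketch of the polynomial-fit intermodule record from the text.
# Field names follow the example; the layout is an illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Features:
    x: List[float] = field(default_factory=list)  # x values of the features
    y: List[float] = field(default_factory=list)  # y values of the features

@dataclass
class Curve:
    type: str = ""                  # curve type (polynomial, spline, ...)
    ncoeff: int = 0                 # number of coefficients
    coeff: List[float] = field(default_factory=list)  # variable length

@dataclass
class FitRecord:
    image_name: str                 # backpointer to the source image
    nfeatures: int                  # number of features fitted
    features: Features              # logical grouping: feature positions
    curve: Curve                    # logical grouping: the fitted curve

# Hypothetical record for an image "nite1.10" with two fitted features.
rec = FitRecord("nite1.10", 2,
                Features(x=[10.5, 88.2], y=[4000.1, 4250.7]),
                Curve(type="poly", ncoeff=3, coeff=[1.0, 0.5, -0.01]))
```

Note how nothing in the record fixes the lengths of features.x, features.y, or curve.coeff in advance, which is exactly the property that motivates variable size array-valued attributes in the DBSS.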
+ +.nh 3 +Data Archiving + + Data archiving refers to the long term storage of raw or reduced data. +Data archiving is important for the following reasons. +.ls +.ls o +Archiving is currently necessary just to \fItransport\fR data from the +telescope to the site where reduction and analysis takes place. +.le +.ls o +Permanently archiving the raw (or pre-reduced) data is necessary in case +an error in the reduction process is later discovered, making it necessary +for the observer to repeat the reductions. +.le +.ls o +Archiving of the reduced data is desirable to save computer and human time +in case the analysis phase has to be repeated, or in case additional analysis +is later discovered to be necessary. +.le +.ls o +Archived data could conceivably be of considerable value to future researchers +who, given access to such data, might not have to make observations of their +own, or who might be able to use the archived data to augment or plan their +own observations. +.le +.ls o +Archived data could be invaluable for future projects studying the variability +of an object or objects over a period of years. +.le +.le + + +Ideally data should be archived as it is taken at the telescope, possibly +performing some simple pipeline reductions before archiving takes place. +Subsequent reduction and analysis using the archived data should be possible +without the format conversion (e.g., FITS to IRAF) currently required. +This conversion wastes cpu time and disk space as well as user time. +The problem is already serious and is expected to grow by an order of +magnitude in the next several years as digital detectors grow in size and +are used more frequently. + +Archival data consists of the digital data itself (the pixels) plus information +describing the object, the observer, how the data was taken, when and where +the data was taken, and so on. This is just the type of information assumed +to be present in an IRAF image. 
In addition one would expect the archive to +contain one or more \fBmaster catalogs\fR containing exhaustive information +describing the observations but no data. + +Since a permanent digital data archive can be expected to be around for many +years and to be read on many types of machines, data images should be archived +in a machine independent format; this format would almost certainly be FITS. +It is also desirable, though not essential, that the master catalogs be +readable on a variety of machines and hence be maintained and distributed in +a machine independent format. The ideal storage medium for archiving and +transporting large amounts of digital data appears to be the optical disk. + +Archival data and catalog access via the DBSS differs from conventional image +and catalog access only in the storage format, which is assumed to be machine +independent, and in the storage medium, which is assumed to be an archival +medium such as the optical disk. Direct access to a database on optical +disk requires that the DBSS be able to read the machine independent format +directly. + +To achieve acceptable performance for direct access it is necessary that +the storage medium be randomly accessible (unlike, say, a magnetic or optical +tape) and that the hardware seek time and transfer rate be comparable to those +provided by magnetic disk technology. Note that current optical disk readers +often do not have fast seek times, and that those that do have fast seek times +generally have a lower storage density than sequential devices due to the gaps +between sectors. Even if a device is not fast enough to be used directly it +is still possible to eliminate the expensive format conversion and do only a +disk to disk copy, accessing the machine independent format on magnetic disk. + +There is no requirement that the IRAF DBSS be used to support data archiving, +but the DBSS \fIis\fR required to be able to access the data in an archive. 
+
Accessing the master catalogs as well seems reasonable since such a catalog
+is no different from those described in sections 3.2.1 and 3.2.3; IRAF will
+have the capability to maintain, access, and query such a catalog without
+developing any additional software.
+
+The main obstacle likely to limit the success of data archiving may well be
+the difficulty involved in gaining access to the archive. If the master
+catalogs were maintained on magnetic disk but released periodically in
+optical disk format for astronomers to refer to at their home institutions,
+access would be much easier (and probably more frequent) than if all the
+astronomers in the country were required to access a single distant computer
+via modem. Telephone access by sites not on the continent would probably
+be too expensive or problematic to be feasible.
+
+.nh 2
+Other Requirements
+
+    In earlier sections we have discussed the principal constraints and
+primary requirements for the DBSS. Several other requirements or
+non-requirements deserve mention.
+
+.nh 3
+Concurrency
+
+    All of the applications identified thus far require either read-only access
+to a public database or read-write access to a private database.
+The DBSS is therefore not required to support simultaneous updating by many
+users of a single centralized database, with all the overhead and complication
+associated with record locking, deadlock avoidance and detection, and so on.
+The only exception occurs when a single user has several concurrent processes
+requiring simultaneous update access to the user's private database. It appears
+that this case can be addressed adequately by distributing the database in
+several datasets and using host system file locking to lock the datasets,
+a technique discussed further in a later section.
+
+.nh 3
+Recovery
+
+    If a database update is aborted for some reason a dataset can be corrupted,
+possibly preventing further access to the dataset. 
The DBSS should of course
+protect datasets from corruption in normal circumstances, but it is always
+possible for a hardware or software error (e.g., disk overflow or reboot) to
+cause a dataset to be corrupted. Some mechanism is required for recovering a
+database that has been corrupted. The minimum requirement is that the DBSS,
+when asked to access a corrupted dataset, detect that the dataset has been
+corrupted and abort, after which the user runs a recovery task to rebuild the
+dataset minus the corrupted records.
+
+.nh 3
+Data Independence
+
+    Data independence is a fundamental property inherent in virtually all
+database systems. One of the major reasons one uses a database system is to
+provide data independence. Data independence is so fundamental that we will
+not discuss it further here. Suffice it to say that the DBSS must provide
+a high degree of data independence, allowing applications programs to function
+without detailed knowledge of the structure or contents of the database they
+are accessing, and allowing databases to change significantly without
+affecting the programs which access them.
+
+.nh 3
+Host Database Interface
+
+    The host database interface (HDBI) makes it possible for the DBSS to
+interface to a host database system. The ability to interface to a host
+database system is not a primary requirement for the DBSS but is a highly
+desirable one for many of the same reasons that direct access to archival data
+is important. The problems of accessing an HDB and of accessing an archive
+maintained in non-DBSS format are similar and might perhaps be addressed
+by a single interface.
+
+.nh
+Conceptual Design
+
+    In this section we develop the design of the various subsystems comprising
+the DBSS at the conceptual level, without bothering with the details of specific
+language bindings or with the details of implementation. We start by defining
+some important terms and then describe the system architecture. 
Lastly we
+describe each of the major subsystems in turn, starting at the highest level
+and working down.
+
+.nh 2
+Terminology
+
+    The DBSS is an implementation of a \fBrelational database\fR. A relational
+database views data as a collection of \fBtables\fR. Each table has a fixed
+set of named columns and may contain any number of rows of data. The rows
+of a table are often referred to as \fBrecords\fR. A record consists of a set
+of named \fBfields\fR. The fields of a record are the columns of the table
+containing the record.
+
+We shall use this informal terminology when discussing the contents of a
+physical database. When discussing the \fIstructure\fR of a database we shall
+use the formal relational terms relation, tuple, attribute, and so on.
+The correspondence between the formal relational terms and their informal
+equivalents is given in the table below.
+
+
+.ks
+.nf
+	\fBformal relational term\fR	\fBinformal equivalents\fR
+
+	relation			table
+	tuple				record, row
+	attribute			field, column
+	primary key			unique identifier
+	domain				pool of legal values
+.fi
+.ke
+
+
+A \fBrelation\fR is a set of like tuples. A \fBtuple\fR is a set of
+\fBattributes\fR, each of which is defined upon a specific domain.
+A \fBdomain\fR is an abstract type which defines the legal values an
+attribute may take on (e.g., "posint" or "color"). The tuples of a relation
+must be unique within the containing relation. The \fBprimary key\fR is
+a subset of the attributes of a relation which is sufficient to uniquely
+identify any tuple in the relation (often a single attribute serves as
+the primary key).
+
+The relational data model was chosen for the DBSS because it is the simplest
+conceptual data model which meets our requirements. Other possibilities
+considered were the \fBhierarchical\fR model, in which data is organized in
+a tree structure, and the \fBnetwork\fR model, in which data is organized in
+a potentially recursive graph structure. 
Virtually all new database systems +implemented since the mid-seventies have been based on the relational model +and most database research today is in support of the relational model (the +remainder goes to the new fifth-generation technology, not to the old data +models). + +The term "relational" in "relational database" comes from the \fBrelational +algebra\fR, a branch of mathematics based on set theory which defines a +fundamental and mathematically complete set of operations upon relations +(tables). The relational algebra is fundamental to the DBMS query language +(section 4.3) but can be safely ignored in the rest of the DBSS. The reader +is referred to any introductory database text for a discussion of the relational +algebra and other database technotrivia. The classic introductory database +text is \fI"An Introduction to Database Systems"\fR, Volume 1 (Fourth Edition, +1986) by C. J. Date. + +.nh 2 +System Architecture + + The system architecture of the DBSS is depicted in Figure 2. The parts +of the figure above the "DBKI" have already been discussed in section 2.2. +The remainder of the figure is what has been referred to previously as the +DB kernel. + +The primary function of DBIO is record access (retrieval, update, insertion, +and deletion) based on evaluation of a \fBselect\fR statement input as a string. +DBIO can also process symbolic definitions of relations and other database +objects so that new tables may be created. DBIO does not implement any +relational operators more complex than select; the more complex relational +operations are left to the DBMS query language to minimize the size and +complexity of DBIO. + +The basic concept underlying the design of the lower level portions of the DBSS +is that the DB kernel provides the \fBaccess method\fR for efficiently accessing +records in mass storage, while DBIO takes care of all higher level functions. 
+In particular, DBIO implements all functions required to access the contents +of a record, while the DB kernel is responsible for storage allocation and for +the maintenance and use of indexes, but has no knowledge of the actual contents +of a record (the HDBI is an exception to this rule as we shall see later). + +The database kernel interface (DBKI) provides a layer of indirection between +DBIO and the underlying database kernel (DBK). The DBKI can support a number +of different kernels, much the way FIO can support a number of different device +drivers. The DBKI also provides network access to a remote database, using +the existing IRAF kernel interface (KI) to communicate with a DBKI on the +remote node. Two standard database kernels are provided. + +The primary DBK (at the right in the figure) maintains and accesses DBSS +binary datasets; this is the most efficient kernel and probably the only +kernel which will fully implement the semantic actions of the DBKI. +The second DBK (at the left in the figure) supports the host database +interface (HDBI) and is used to access archival data, any foreign image +formats, and the host database system (HDB), if any. Specialized HDBI +drivers are required to access foreign image formats or to interface to +an HDB. + + +.ks +.nf + DBMS IMAGES(etc) (CL) + \ / + \ / --------- + \ / + \ IMIO + \ / \ + \ / \ + \/ \ + DBIO FIO (VOS) + | + | + | + DBKI + | + +------+------+-------+ + | | | + DBK DBK (KI) + | | | + | | | + HDBI | | + | | | + +----+----+ | | --------- + | | | | + | | | | + [archive] [HDB] [dataset] | + | + | (host system) + - + (LAN) + - + | + | --------- + | + (Kernel-Server) + | + | + DBKI (VOS) + | + +---+---+ + | | + DBK DBK + + +.fi +.ce +Figure 2. \fBDatabase Subsystem Architecture\fR +.ke + + +.nh 2 +The DBMS Package +.nh 3 +Overview + + The user interfaces with a database in either of two ways. 
The first way +is via the tasks in an applications package, which perform highly specialized +operations upon objects stored in the database, e.g., to reduce a certain kind +of data. The second way is via the database management package (DBMS), which +gives the user direct access to any dataset (but not to large pixel arrays +stored outside the DBSS). The DBMS provides an assortment of general purpose +operators which may be used regardless of the type of data stored in the +database and regardless of the applications program which originally created +the structures stored in the database. + +The DBMS package consists of an assortment of simple procedural operators +(conventional CL callable parameter driven tasks), a screen editor for tables, +and the query language, a large program which talks directly to the terminal +and which has its own special syntax. Lastly there is a subpackage containing +tasks useful only for datasets maintained by the primary DBK, i.e., a package +of relatively low level tasks for things like crash recovery and examining +the contents of physical datasets. + +.nh 3 +Procedural Interface + + The DBMS procedural interface provides a number of the most commonly +performed database operations in the form of CL callable tasks, allowing +these simple operations to be performed without the overhead involved in +entering the query language. Extensive database manipulations are best +performed from within the query language, but if the primary concern of +the user is data reduction in some package other than DBMS the procedural +operators will be more convenient and less obtrusive. + +.nh 4 +General Operators + + DBMS tasks are required to implement the following general database +management operations. Detailed specifications for the actual tasks are +given later. +.ls +.ls \fBchdb\fR newdb +Change the default database. To minimize typing the DBSS provides a +"default database" paradigm analogous to the default directory of FIO. 
+Note that there need be no obvious connection between database objects +and files since multiple tables may be stored in a single physical file, +and the physical database may reside on an optical disk or worse may be +an HDB. Therefore the FIO "directory" cannot be used to examine the +contents of a database. The default database may be set independently +of the current directory. +.le +.ls \fBpcatalog\fR [database] +Print the catalog of the named database. The catalog is a system table +containing one entry for every table in the database; it is analogous +to a FIO directory. Since the catalog is a table it can be examined like +any other table, but a special task is provided since the print catalog +operation is so common. If no argument is given the catalog of the default +database is printed. +.le +.ls \fBptable\fR spe +Print the contents of the specified relation in list form on the standard +output. The operand \fIspe\fR is a general select expression defining +a new table as a projection of some subset of the records in a set of one or +more named tables. The simplest select expression is the name of a single +table, in which case all fields of all records in the table will be printed. +More generally, one might print all fields of a single table, selected fields +of a single table (projection), all fields of selected records of a single +table (selection), or selected fields of selected records from one or more +tables (selection plus projection). +.le +.ls \fBrcopy\fR spe output_table +Copy (insert) the records specified by the general select expression +\fIspe\fR into the named \fIoutput_table\fR. If the named output table +does not exist a new one will be created. If the attributes of the output +table are different than those of the input table the proper action of +this operator is not obvious and has not yet been defined. 
+.le +.ls \fBrmove\fR spe output_table +Move (insert) the relation specified by the general select expression +\fIspe\fR into the named \fIoutput_table\fR. If the named output table +does not exist a new one will be created. The original records are deleted. +This operator is used to generate the union of two or more tables. +.le +.ls \fBrdelete\fR spe +Delete the records specified by the general select expression \fIspe\fR. +Note that this operator deletes records from tables, not the tables themselves. +.le +.ls \fBmkdb\fR newdb [ddl_file] +Create a new, empty database \fInewdb\fR. If a data definition file +\fIddl_file\fR is named it will be scanned and any domain, relation, etc. +definitions therein entered into the new database. +.le +.ls \fBmktable\fR table relation +Create a new, empty table \fItable\fR of type \fIrelation\fR. The parameter +\fIrelation\fR may be the name of a DDL file, the name of an existing base +table, or any general record select/project expression. +.le +.ls \fBmkview\fR table relation +Create a new virtual table (view) defined in terms of one or more existing +base tables by the operand \fIrelation\fR, which is the same as for the +task \fImktable\fR. Operationally, \fBmkview\fR is much like \fBrcopy\fR, +except that it is considerably faster and the new table does not physically +store any data. The new view-table behaves like any other table in most +operations (except some types of updates). Note that the new table may +reference tuples in several different base tables. A view-table may +subsequently be converted into a base table with \fBrcopy\fR. Views are +discussed in more detail in section 4.5. +.le +.ls \fBmkindex\fR table fields +Make a new index on the named base table over the listed fields. +.le +.ls \fBrmtable\fR table +Drop (delete, remove) the named base table (or view) and any indexes defined +on the table. 
+.le +.ls \fBrmindex\fR table fields +Drop (delete, remove) the index defined over the listed fields on the named +base table. +.le +.ls \fBrmdb\fR [database] +Destroy the named database. Unless explicitly overridden \fBrmdb\fR will +refuse to delete a database until all tables therein have been dropped. +.le +.le + + +Several terms were introduced in the discussion above which have not yet been +defined. A \fBbase table\fR is a physical table (instance of a defined +relation), unlike a \fBview\fR which is a virtual table defined via selection +and projection over one or more base tables or other views. Both types of +objects behave equivalently in most operations. +A \fBdata definition language\fR (DDL) is a language syntax used to define +database objects. + +.nh 4 +Forms Based Data Entry and Retrieval + + Many of the records typically stored in a database are too large to be +printed in list format on a single line. Some form of multiline output is +necessary; this multiline representation is called a \fBform\fR. The full +terminal screen is used to display a form, e.g. with the fields labeled +in reverse video and the field values in normal video. Records are viewed +one at a time. + +Data entry via a form is an interactive process similar to editing a file with +a screen editor. The form is displayed, possibly with default values for the +fields, and the user types in new values for the fields. Editor commands are +provided for positioning the cursor to the field to be edited and for editing +within a field. The DBSS verifies each value as it is entered using the range +information supplied with the domain definition for that field. +Additional checks may be made before the new record is inserted into the +output table, e.g., the DBSS may verify that values have been entered for +all fields which do not permit null values. +.ls +.ls \fBetable\fR spe +Call up the forms editor to edit a set of records. The operand \fIspe\fR +may be any general select expression. 
+
.le
+.ls \fBpform\fR spe
+Print a set of records on the standard output, using the forms generator to
+generate a nice self-documenting format.
+.le
+.le
+
+
+The \fBforms editor\fR (etable) may be used to display or edit existing records
+as well as to enter new ones. It is desirable for the forms editor to be able
+to move backward as well as forward in a table, and to move randomly
+to a record satisfying a predicate, i.e., search through the table for a
+record. This makes the forms editor a powerful tool for browsing through a
+database. If the predicate for a search is specified by entering values or
+boolean expressions into the fields contributing to the predicate then we have
+a query-by-form utility, which has been reported in the literature to be very
+popular with users (since one does not have to remember a syntax and typing
+is minimized).
+
+A variation on the forms editor is \fBpform\fR, used to output records in
+"forms" format. This will be most useful for large records or for cases where
+one is more interested in studying individual records than in comparing
+different records. The alternative to forms output is list or tabular format
+output. This form of output is more concise and can be used as input to the
+\fBlists\fR operators, but may be harder to read and may overflow the output
+line. List format output is discussed further in the next section.
+
+By default the format of a form is determined automatically by a
+\fBforms generator\fR using information given in the DDL when the database
+was created. The domain definition capability of the DDL includes provisions
+for specifying the default output format for a field as well as the field label.
+In most cases this will be sufficient information for the forms generator to
+generate an esthetically acceptable form. 
If desired the user or programmer can
+modify this form or create a new form from scratch, and the forms generator
+will use the customized form rather than create one of its own.
+
+The CL \fBeparam\fR parameter file editor is an example of a simple forms
+editor. The main differences between \fBeparam\fR and \fBetable\fR are the
+forms generator and the browsing capability.
+
+.nh 4
+List Interface
+
+    The \fBlist\fR is one of the standard IRAF data structures. A list is
+an ascii table wherein the standard record delimiter is the newline and the
+standard field delimiter is whitespace. Comment lines and blank lines are
+ignored within lists; double comment lines ("## ...") may optionally be used
+to label the columns of a list. By default, non-DBMS lists are free format;
+strings must be quoted if they contain one of the field delimiter characters.
+The field and record delimiter characters may be changed if necessary, e.g.,
+to permit multiline records. Fixed format lists are available as an option
+and are often required to interface to external (non-IRAF) programs.
+
+The primary advantages of the list or tabular format for printed tables are
+the following.
+.ls
+.ls [1]
+The list or tabular format is the most concise form of printed output.
+The eye can rapidly scan up and down a column to compare the values of
+the same field in a set of records.
+.le
+.ls [2]
+DBMS list output may be used as input to the tasks in the \fBlists\fR,
+\fBplot\fR, and other packages. Using the pipe syntax, tasks which
+communicate via lists may be connected together to perform arbitrarily
+complex operations.
+.le
+.ls [3]
+List format output is the de facto standard format for the interchange of
+tabular data (e.g., DBSS tables) amongst different computers and programs.
+A list (usually the fixed format variety) may be written onto a cardimage
+tape for export, and conversely, a list read from a cardimage tape may be
+used to enter a table into a DBSS database. 
+.le +.le + + +The most common use for list format output will probably be to print tables. +When a table is too wide to fit on a line the user will learn to use +\fBprojection\fR to print only the fields of interest. The default format +for DBMS lists will be fixed format, using the format information provided +in the DDL specification to set the default output format. Fixed format +is best for DBMS lists since it forces the field values to line up in nice +orderly columns, which are easier for a human to read (fixed format is easier +and more efficient for a computer to read as well, if not to write). +The type of format used will be recorded in the list header and a +\fBlist interface\fR will be provided so that all list processing programs +can access lists equivalently regardless of their format. + +As mentioned above, the list interface can be used to import and export tables. +In particular, an astronomical catalog distributed on card image tape can be +read directly into a DBSS table once a format descriptor has been prepared +and the DDL for the new table has been written and used to create an empty +table ready to receive the data. After only a few minutes of setup a user can +have a catalog entered into the database and be getting final results using +the query language interface! +.ls +.ls \fBrtable\fR listfile output_table +The list \fIlistfile\fR is scanned, inserting successive records from the +list into the named output table. A new output table is created if one does +not already exist. The format of the list is taken from the list header +if there is one, otherwise the format specification is provided by the user +in a separate file. +.le +.ls \fBptable\fR spe +Print the contents of the relation \fIspe\fR in list form on the standard +output. The operand \fIspe\fR may be any general select/project expression. +.le +.le + + +The \fBptable\fR operator (introduced in section 4.3.2.1) is used to generate +list output. 
The inverse operation is provided by \fBrtable\fR. + +.nh 4 +FITS Table Interface + + The FITS table format is a standard format for the transport of tabular +data. The idea is very similar to the cardimage format discussed in the last +section except that the FITS table standard includes a table header used to +define the format of the encoded table, hence the user does not have to +prepare a format descriptor to read a FITS table. The FITS reader and writer +programs are part of the \fBdataio\fR package. + +.nh 4 +Graphics Interface + + All of the \fBplot\fR package graphics facilities are available for plotting +DBMS data via the \fBlist\fR interface discussed in section 4.3.2.3. List +format output may also be used to generate output to drive external (non-IRAF) +graphics packages. Plotting facilities are also available via a direct +interface within the query language; this latter interface is the most efficient +and will be the most suitable for most graphics applications. See section +2.3 for additional comments on the graphics interface. + +.nh 3 +Command Language Interface + + All of the DBMS tasks are CL callable and hence part of the command language +interface to the DBSS. For example, a CL script task may implement arbitrary +relational operators using \fBptable\fR to copy a table into a list, \fBfscan\fR +and \fBprint\fR to read the list and format the modified list, and finally +\fBrtable\fR to insert the output list into a table. The query language may +also be called from within a CL script to process commands passed on the +command line, via the standard input, or via a temporary file. + +Additional operators are required for randomly accessing records without the +use of a list; suitable operators are shown below. +.ls +.ls \fBdbgets\fR record fields +The named fields of the indicated record are returned as a free format string +suitable for decoding into individual fields with \fBfscan\fR. 
+
.le
+.ls \fBdbputs\fR record fields values
+The named fields of the indicated record are set to the values given in the
+free format value string.
+.le
+.le
+
+
+More sophisticated table and record access facilities are conceivable but
+cannot profitably be implemented until an enhanced CL becomes available.
+
+.nh 3
+Record Selection Syntax
+
+    As we have seen, many of the DBMS operators employ a general record
+selection syntax to specify the set of records to be operated upon.
+The selection syntax will include a list of tables and optionally a
+predicate (boolean expression) to be evaluated for each record in the
+listed tables to determine if the record is to be included in the final
+selection set. In the simplest case a single table is named with no
+predicate, in which case the selection set consists of all records in the
+named table. Parsing and evaluation of the record selection expression
+is performed entirely by the DBIO interface, hence we defer detailed
+discussion of selection syntax to the sections describing DBIO.
+
+.nh 3
+Query Language
+
+    In most database systems the \fBquery language\fR is the primary user
+interface, both for the end-user interactively entering ad hoc queries, and for
+the programmer entering queries via the host language interface. The major
+reasons for this are outlined below.
+.ls
+.ls [1]
+A query language interface is much more powerful than a "task" or subroutine
+based interface such as that described in section 4.3.2. A query language
+can evaluate queries much more complex than the simple "select" operation
+implemented by DBIO and made available to the user in tasks such as
+\fBptable\fR and \fBrcopy\fR.
+.le
+.ls [2]
+A query language is much more efficient than a task interface for repeated
+queries. Information about a database may be cached between queries and
+files may remain open between queries. Complex queries may be executed as
+a series of simpler queries, caching the intermediate results in memory. 
+
Graphs may be generated directly from the data without encoding, writing,
+reading, decoding, and deleting an intermediate list.
+.le
+.ls [3]
+A query language can perform many functions via a single interface, reducing
+the amount of code to be written and supported, as well as simplifying the
+user interface. For example, a query language can be used to globally
+update (edit) tables, as well as to evaluate queries on the database.
+Lacking a query language, such an editing operation would have to be
+implemented with a separate task which would no doubt have its own special
+syntax for the user to remember (e.g., the \fBhedit\fR task in the \fBimages\fR
+package).
+.le
+.le
+
+
+Unlike most commercial database systems, the DBSS is not built around the
+query language. The heart of the IRAF DBSS is the DBIO interface, which is
+little more than a glorified record access interface. The query language
+is a high level applications task built upon DBIO, GIO, and the other interfaces
+constituting the IRAF VOS. This permits us to delay implementation of the
+query language until after the DBSS is in use and our primary requirements have
+been met, and then implement the query language as an experimental prototype.
+Like all data analysis software, the query language is not required to meet
+our primary requirements (data acquisition and reduction); rather, it is needed
+to do interesting things with our data once it has been reduced.
+
+.nh 4
+Query Language Functions
+
+    The query language is a prominent part of the user interface and is
+often used interactively directly by the user, but may also be called
+noninteractively from within CL scripts and by SPP programs. The major
+functions performed by the query language are as follows.
+.ls
+.ls [1]
+The database management operations, i.e., create/destroy database,
+create/drop table or index, sort table, alter table (add new attribute),
+and so on. 
+
.le
+.ls [2]
+The relational operations, i.e., select, project, join, and divide
+(the latter is rarely implemented). These are the operations most used
+to evaluate queries on the database.
+.le
+.ls [3]
+The traditional set operations, i.e., union, intersection, difference,
+and Cartesian product.
+.le
+.ls [4]
+The editing operations, i.e., selective record update and delete.
+.le
+.ls [5]
+Operations on the columns of tables. Compute the sum, average, minimum,
+maximum, etc. of the values in a column of a table. These operations
+are also required for queries.
+.le
+.ls [6]
+Tabular and graphical output. The result of any query may be printed or
+plotted in a variety of ways, without need to repeat the query.
+.le
+.le
+
+
+The most important function performed by the query language is of course the
+interactive evaluation of queries, i.e., questions about the data in the
+database. It is beyond the scope of this document to try to give the reader
+a detailed understanding of how a query language is used to evaluate queries.
+
+.nh 4
+Language Syntax
+
+    The great flexibility of a query language derives from the fact that it is
+syntax rather than parameter driven. The syntax of the DBMS query language
+has not yet been defined. In choosing a language syntax there are several
+possible courses of action: [1] implement a standard syntax, [2] extend a
+standard syntax, or [3] develop a new syntax, e.g., as a variation on some
+existing syntax.
+
+The problem with rigorously implementing a standard syntax is that all query
+languages currently in wide use were developed for commercial applications,
+e.g., for banking, inventory, accounting, customer mailing lists, etc.
+Experimental query languages are currently under development for CAD
+applications, analysis of Landsat imagery, and other applications similar
+to ours, but these are all research projects at the present time. 
+The basic characteristics desirable in a query language intended for scientific +data reduction and analysis seem little different than those provided by a query +language intended for commercial applications, hence the most practical +approach is probably to start with some existing query language syntax and +modify or extend it as necessary for our type of data. + +There is no standard query language for relational databases. +The closest thing to a standard is SQL, a language originally developed by +IBM for System-R (one of the first relational database systems, actually an +experimental prototype), and still in use in the latest IBM product, DB2. +This language has since been used in many relational products by many companies. +SQL is the latest in a series of relational query languages from IBM; earlier +languages include SQUARE and SEQUEL. The second most widely used relational +query language appears to be QUEL, the query language used in both educational +and commercial INGRES. + +Both SQL and QUEL are examples of the so-called "calculus" query languages. +The other major type of query language is the "algebraic" query language +(excluding the forms and menu based query languages which are not syntax +driven). Examples of algebraic languages are ISBL (PRTV, Todd 1976), +TABLET (U. Mass.), ASTRID (Gray 1979), and ML (Li 1984). +These algebraic languages have all been implemented and used, but nowhere +near as widely as SQL and QUEL. + +It is interesting to note that ASTRID and ML were developed by researchers +active in the area of logic languages. In particular, the ML (Mathematics-Like) +query language was implemented in Prolog and some of the character of Prolog +shows through in the syntax of the language. There is a close connection +between the relational algebra and the predicate calculus (upon which the +logic languages are based) which is currently being actively explored. 
+One of the most promising areas of application for the logic languages
+(upon which the so-called "fifth generation" technology is based) is in
+database applications and query languages in particular.
+
+There appears to be no compelling reason for the current dominance of the
+calculus type query language, other than the fact that it is what IBM decided
+to use in System-R. Anything that can be done in a calculus language can
+also be done in an algebraic language and vice versa.
+
+The primary difference between the two languages is that the calculus
+languages want the user to express a complex query as a single large
+statement, whereas the algebraic languages encourage the user to execute a
+complex query as a series of simpler queries, storing the intermediate
+results as snapshots or views (either language can be used either way, but
+the orientation of the two languages is as stated). For simple queries there
+is little difference between the two languages, although the calculus
+languages are perhaps more readable (more English-like) while the algebraic
+languages are more concise and have a more mathematical character.
+
+The orientation of the calculus languages towards doing everything in a single
+statement provides more scope for optimization than if the equivalent query is
+executed as a series of simpler queries; this is often cited as one of the
+major advantages of the calculus languages. The procedural nature of the
+algebraic languages does not permit the type of global optimizations employed
+in the calculus languages, but this approach is perhaps more user-friendly
+since the individual steps are easy to understand, and one gets to examine
+the intermediate results to figure out what to do next. Since a complex query
+is executed incrementally, intermediate results can be recomputed without
+starting over from scratch.
It is possible that, taking user error and lack
+of forethought into account, the less efficient algebraic languages might end
+up using less computer time than the super-efficient calculus languages for
+comparable queries.
+
+A further advantage of the algebraic language in a scientific environment is
+that there is more of a distinction between executing a query and printing
+the results of the query than in a calculus language. The intermediate results
+of a complex query in an algebraic language are named relations (snapshots
+or views); an extra print command must be entered to examine the intermediate
+result. This is an advantage if the query language provides a variety of ways
+to examine the result of a query, e.g., as a printed table or as some type
+of plot.
+
+.nh 4
+Sample Queries
+
+    At this point several examples of actual queries, however simple they may
+be, should help us to visualize what a query language is like. Several
+examples of typical scientific queries were given (in English) in section 3.2.1.
+For the convenience of the reader these are duplicated here, followed by actual
+examples in the query languages SQL, QUEL, ASTRID, and ML. It should be noted
+that these are all very simple queries which do little to demonstrate the
+power of a fully relational query language.
+.ls
+.ls [1]
+Find all objects of type "pqr" for which X is in the range A to B and
+Z is less than 10.
+.le
+.ls [2]
+Compute the mean and standard deviation of attribute X for all objects
+in the set [1].
+.le
+.ls [3]
+Compute and plot (X-Y) for all objects in set [1].
+.le
+.ls [4]
+Plot a circle of size (log2(Z-3.2) * 100) at the position (X,Y) of all objects
+in set [1].
+.le
+.ls [5]
+Print the values of the attributes OBJ, X, Y, and Z of all objects of type
+"pqr" for which X is in the range A to B and Z is greater than 30.
+.le
+.le
+
+
+It should not be difficult for the imaginative reader to make up similar
+queries for a particular astronomical catalog or data archive.
+For example (I can't resist), "find all objects for which B-V exceeds X",
+"find all recorded observations of object X", "find all observing runs on
+telescope X in which astronomer Y participated during the years 1975 to
+1985", "compute the number of cloudy nights in August during the years
+1985 to 1990", and so on. The possibilities are endless.
+
+Query [5] is an example of a simple select/project query. This query is
+shown below in the different query languages. Note that whitespace may be
+redistributed in each query as desired; in particular, the entire query may
+be entered on a single line. Keywords are shown in upper case
+and data names or values in lower case. The object "table" is the table
+from which records are to be selected, "pqr" is the desired value of the
+field "type" of table "table", and "x", "y", and "z" are numeric fields of
+the table.
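The SQL dialect shown first below has survived essentially unchanged into modern database engines, so query [5] can be checked directly. The following sketch uses Python's sqlite3 module with hypothetical data values; note that "table" must be written as a quoted identifier, since it is a reserved word in SQL.

```python
import sqlite3

# Hypothetical data; the schema (obj, x, y, z, type) follows the text above.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "table" (obj TEXT, x REAL, y REAL, z REAL, type TEXT)')
rows = [
    ("n1001", 12.0, 5.0, 31.0, "pqr"),   # satisfies the query
    ("n1002", 15.0, 8.0, 29.0, "pqr"),   # fails: z not greater than 30
    ("n1003", 25.0, 2.0, 40.0, "pqr"),   # fails: x out of range 10..20
    ("n1004", 11.0, 7.0, 35.0, "abc"),   # fails: wrong object type
]
con.executemany('INSERT INTO "table" VALUES (?, ?, ?, ?, ?)', rows)

# Query [5]: select/project on type, x range, and z threshold.
result = con.execute(
    'SELECT obj, x, y, z FROM "table" '
    "WHERE type = 'pqr' AND x >= 10 AND x <= 20 AND z > 30"
).fetchall()
print(result)   # [('n1001', 12.0, 5.0, 31.0)]
```

Only the first row satisfies all three conditions; the others each violate exactly one, which makes the select/project behavior easy to verify by inspection.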
+
+
+.ks
+.nf
+SQL:
+
+	SELECT		obj, x, y, z
+	FROM		table
+	WHERE		type = 'pqr'
+	    AND		x >= 10
+	    AND		x <= 20
+	    AND		z > 30;
+.fi
+.ke
+
+
+.ks
+.nf
+QUEL:
+
+	RANGE OF t IS table
+	RETRIEVE (t.obj, t.x, t.y, t.z)
+	    WHERE t.type = 'pqr'
+	    AND t.x >= 10
+	    AND t.x <= 20
+	    AND t.z > 30
+.fi
+.ke
+
+
+.ks
+.nf
+ASTRID (mnemonic form):
+
+	table
+	SELECTED_ON [
+	    type = 'pqr'
+	    AND x >= 10
+	    AND x <= 20
+	    AND z > 30
+	] PROJECTED_TO
+	    obj, x, y, z
+.fi
+.ke
+
+
+.ks
+.nf
+ASTRID (mathematical form):
+
+	table ;[ type = 'pqr' AND x >= 10 AND x <= 20 AND z > 30 ] %
+	    obj, x, y, z
+.fi
+.ke
+
+
+.ks
+.nf
+ASTRID (alternate query showing use of intermediates):
+
+	a := table ;[ type = 'pqr' AND z > 30 ]
+	b := a ;[ x >= 10 AND x <= 20 ]
+	b % obj,x,y,z
+.fi
+.ke
+
+
+.ks
+.nf
+ML (Li/Prolog):
+
+	table : type=pqr, x >= 10, x <= 20, z > 30 [obj,x,y,z]
+.fi
+.ke
+
+
+Note that in ASTRID and ML selection and projection are implemented as
+operators or qualifiers modifying the relation on the left. To print all
+fields of all records of a table one need only enter the name of the table.
+The logic language nature of such queries is evident if one thinks of the
+query as a predicate or true/false assertion. Given such an assertion (query),
+the query processor tries to prove the assertion true by finding all tuples
+satisfying the predicate, using the set of rules given (the database).
+
+For simple queries such as these it makes little difference what query language
+is used; many users would probably prefer the SQL or QUEL syntax for these
+simple queries because of the English-like syntax. To seriously evaluate the
+differences between the different languages more complex queries must be tried,
+but such an exercise is beyond the scope of the present document.
+
+As a final example we present, without supporting explanation, a more
+complex query in SQL (from Date, 1986).
This example is based
+upon a "suppliers-parts-projects" database, consisting of four tables:
+suppliers (S), parts (P), projects (J), and number of parts supplied to
+a specified project by a specified supplier (SPJ), with fields 'supplier
+number' (S#), 'part number' (P#) and 'project number' (J#). The names
+SPJX and SPJY are aliases for SPJ. This example is rather contrived and
+the data is not interesting, but it should serve to illustrate the use of
+SQL for complex queries.
+
+
+.ks
+.nf
+Query:	Get part numbers for parts supplied to all projects in London.
+
+	SELECT DISTINCT p#
+	FROM spj spjx
+	WHERE NOT EXISTS
+	    ( SELECT *
+	      FROM j
+	      WHERE city = 'london'
+	      AND NOT EXISTS
+		( SELECT *
+		  FROM spj spjy
+		  WHERE spjy.p# = spjx.p#
+		  AND spjy.j# = j.j# ));
+.fi
+.ke
+
+
+The nesting shown in this example is characteristic of the calculus languages
+when used to evaluate complex queries. Each SELECT implicitly returns an
+intermediate relation used as input to the next higher level subquery.
+
+.nh 3
+DB Kernel Operators
+
+    All DBMS operators described up to this point have been general purpose
+operators with no knowledge of the form in which data is stored internally.
+Additional operators are required in support of the standard IRAF DB kernels.
+These will be implemented as CL callable tasks in a subpackage off DBMS.
+
+.nh 4
+Dataset Copy and Load
+
+    Since our intention is to store the database in a machine independent
+format, special operators are not required to backup, reload, or copy dataset
+files. The binary file copy facilities provided by IRAF or the host system
+may be used for these operations.
+
+.nh 4
+Rebuild Dataset
+
+    Over a period of time a dataset which is subjected to heavy updating
+may become disordered internally, reducing the efficiency of most record
+access operations. A utility task is required to efficiently rebuild such
+datasets.
The same result can probably be achieved by an \fIrcopy\fR +operation but a lower level operator may be more efficient. + +.nh 4 +Mount Foreign Dataset + + Before a foreign dataset (archive or local format imagefile) can be +accessed it must be \fImounted\fR, i.e., the DBSS must be informed of the +existence and type of the dataset. The details of the mount operation are +kernel dependent; ideally the mount operation will consist of little more +than examining the structure of the foreign dataset and making appropriate +entries in the system catalog. + +.nh 4 +Crash Recovery + + A utility is required for recovering datasets which have been corrupted +as a result of a hardware or software failure. There should be sufficient +redundancy in the internal data structures of a dataset to permit automated +recovery. The recover operation is similar to a rebuild so perhaps the +same task can be used for both operations. + +.nh 2 +The IMIO Interface +.nh 3 +Overview + + The Image I/O (IMIO) interface is an existing subroutine interface used +to maintain and access bulk data arrays (images). The IMIO interface is built +upon the DBIO interface, using DBIO to maintain and access the image headers +and sometimes to access the stored data (the pixels) as well. For reasons of +efficiency IMIO directly accesses the bulk data array when large images are +involved. + +Most of the material presented in this section on the image header is new. +The pixel access facilities provided by the existing IMIO interface will +remain essentially unchanged, but the image header facilities provided by +the current interface are quite limited and badly need to be extended. +The existing header facilities provide support for the major physical image +attributes (dimensionality, length of each axis, pixel datatype, etc.) plus +a limited facility for storing user defined attributes. 
The main changes +in the new interface will be excellent support for history records, world +coordinates, histograms, a bad pixel list, and image masks. In addition +the new interface will provide improved support for user defined attributes, +and greatly improved efficiency when accessing large groups of images. +The storage structures will be more localized, hopefully causing less +confusion for the user. + +In this section we first discuss the components of an image, concentrating +primarily on the different parts of the image header, which is quite a +complex structure. We then discuss briefly the (mostly existing) facilities +for header and pixel access. Lastly we discuss the storage structures +normally used to maintain images in mass storage. + +.nh 3 +Logical Schema + + Images are stored as records in one or more tables in a database. More +precisely, the main part of an image header is a record (row) in some table +in a database. In general some of the other tables in a database will contain +auxiliary information describing the image. Some of these auxiliary tables +are maintained by IMIO and will be discussed in this section. Other tables +will be created by the applications programs used to reduce the image data. + +As far as the DBSS is concerned, the pixel segment of an image is a pretty +minor item, a single array type attribute in the image header. Since the +size of this array can vary enormously from one image to the next some +strategic questions arise concerning where to store the data. In general, +small pixel segments will be stored directly in the image header, while large +pixel segments will be stored in a separate file from that used to store +the header records. + +The major components of an image (as far as IMIO is concerned) are summarized +below. More detailed information on each component is given in the following +sections. +.ls +.ls Standard Header Fields +An image header is a record in a relation initially of type "image". 
+The standard header fields include all attributes necessary to describe +the physical characteristics of the image, i.e., all attributes necessary +to access the pixels. +.le +.ls History +History records for all images in a database are stored in a separate history +relation in time sequence. +.le +.ls World Coordinates +An image may have any number of world coordinate systems associated with it. +These are stored in a separate world coordinate system relation. +.le +.ls Histogram +An image may have any number of histograms associated with it. +Histograms for all images in a database are stored in a separate histogram +relation in time sequence. +.le +.ls Pixel Segment +The pixel segment is stored in the image header, at least from the point of +view of the logical schema. +.le +.ls Bad Pixel List +The bad pixel list, a variable length integer array, is required to physically +describe the image hence is stored in the image header. +.le +.ls Region Mask +An image may have any number of region masks associated with it. Region masks +for all images in a database are stored in a separate mask relation. A given +region mask may be associated with any number of different images. +.le +.le + + +In summary, the \fBimage header\fR contains the standard header fields, +the pixels, the bad pixel list, and any user defined fields the user wishes +to store directly in the header. All other information describing an image +is stored in external non-image relations, of which there may be any number. +Note that the auxiliary tables (world coordinates, histograms, etc.) are not +considered part of the image header. + +.nh 4 +Standard Header Fields + + The standard header fields are those fields required to describe the +physical attributes of the image, plus those fields required to physically +access the image pixels. The standard header fields are summarized below. +These fields necessarily reflect the current capabilities of IMIO. 
Since +the DBSS provides data independence, however, new fields may be added in +the future to support future versions of IMIO without rendering old images +unreadable. +.ls +.ls 12 image +An integer value automatically assigned by IMIO when the image is created +which uniquely identifies the image within the containing table. This field +is used as the primary key in \fIimage\fR type relations. +.le +.ls naxis +Number of axes, i.e., the dimensionality of the image. +.le +.ls naxis[1-4] +A group of 4 attributes, i.e., \fInaxis1\fR through \fInaxis4\fR, +each specifying the length of the associated image axis in pixels. +Axis 1 is an image line, 2 is a column, 3 is a band, and so on. +If \fInaxis\fR is greater than four additional axis length attributes +are required. If \fInaxis\fR is less than four the extra fields are +set to one. Distinct attributes are used rather than an array so that +the image dimensions will appear in printed output, to simplify the use +of the dimension attributes in queries, and to make the image header +more FITS-like. +.le +.ls linelen +The physical length of axis one (a line of the image) in pixels. Image lines +are often aligned on disk block boundaries (stored in an integral number of +disk blocks) for greater i/o efficiency. If \fIlinelen\fR is the same as +\fInaxis1\fR the image is said to be stored in compressed format. +.le +.ls pixtype +A string valued attribute identifying the datatype of the pixels as stored +on disk. The possible values of this attribute are discussed in detail below. +.le +.ls bitpix +The number of bits per pixel. +.le +.ls pixels +The pixel segment. +.le +.ls nbadpix +The number of bad pixels in the image. +.le +.ls badpix +The bad pixel list. This is effectively a boolean image stored in compressed +form as a variable length integer array. The bad pixel list is maintained by +the pixel list package, a subpackage of IMIO, also used to maintain region +masks. +.le +.ls datamin +The minimum pixel value. 
This field is automatically invalidated (set to a +value greater than \fIdatamax\fR) whenever the image is modified, unless +explicitly updated by the caller. +.le +.ls datamax +The maximum pixel value. This field is automatically invalidated (set to a +value less than \fIdatamin\fR) whenever the image is modified, unless +explicitly updated by the caller. +.le +.ls title +The image title, a one line character string identifying the image, +for annotating plots and other forms of output. +.le +.le + + +The possible values of the \fIpixtype\fR field are shown below. The format +of the value string is "type.host", where \fItype\fR is the logical datatype +and \fIhost\fR is the host machine encoding used to represent that datatype. + + +.ks +.nf + TYPE DESCRIPTION MAPS TO + + byte.m unsigned byte ( 8 bits) short.spp + ushort.m unsigned word (16 bits) long.spp + + short.m short integer, signed short.spp + long.m long integer, signed long.spp + real.m single precision floating real.spp + double.m double precision floating double.spp + complex.m (real,real) complex.spp +.fi +.ke + + +Note that the first character of each keyword is sufficient to uniquely +identify the datatype. The ".m" suffix identifies the "machine" to which +the datatype refers. When new images are written \fIm\fR will usually be +the name of the host machine. When images written on a different machine +are read on the local host there is no guarantee that the i/o system will +recognize the formats for the named machine, but at least the format will +be uniquely defined. Some possible values for \fIm\fR are shown below. + + +.ks +.nf + dbk DBK (database kernel) mip-format + mip machine independent (MII integer, + IEEE floating) + sun SUN formats (same as mip?) + vax DEC Vax data formats + mvs DG MV-series data formats +.fi +.ke + + +The DBK format is used when the pixels are stored directly in the image header, +since only the DBK binary formats are supported in DBK binary datafiles. 
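The "type.host" convention for \fIpixtype\fR values is simple enough to sketch in a few lines. The function below is purely illustrative (its name and error behavior are assumptions, not part of the design); the type mapping is taken from the TYPE/MAPS TO table above.

```python
# Mapping from the logical datatype field of a pixtype value to the
# SPP datatype it maps to, as given in the table above.
SPP_TYPE = {
    "byte":    "short",    # unsigned byte  ( 8 bits)
    "ushort":  "long",     # unsigned word  (16 bits)
    "short":   "short",    # short integer, signed
    "long":    "long",     # long integer, signed
    "real":    "real",     # single precision floating
    "double":  "double",   # double precision floating
    "complex": "complex",  # (real,real)
}

def parse_pixtype(pixtype):
    """Split a pixtype value such as 'short.vax' into its logical
    datatype and host format, and return the SPP type it maps to."""
    dtype, host = pixtype.split(".")
    return dtype, host, SPP_TYPE[dtype]

print(parse_pixtype("short.vax"))  # ('short', 'vax', 'short')
print(parse_pixtype("byte.mip"))   # ('byte', 'mip', 'short')
```

Since the stored format and the in-memory SPP type are carried separately, the conversion layer can key on the host suffix alone, which is the basis of the FIO-level conversion scheme described next.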
+The standard i/o system will support at least the MIP, DBK, SUN (=MIP),
+and VAX formats. If the storage format is not the host system format,
+conversion to and from the corresponding SPP (host) format will occur at the
+level of the FIO interface to avoid an N-squared type conversion matrix in
+IMIO, i.e., IMIO will see only the SPP datatypes.
+
+Examples of possible \fIpixtype\fR values are "short.vax", i.e., a 16 bit signed
+twos-complement byte-swapped integer format, and "real.mip", the 32 bit IEEE
+single precision floating point format.
+
+.nh 4
+History Text
+
+    The intent of the \fIhistory\fR relation is to record all events which
+modify the image data in a dataset, i.e., all operations which create, delete,
+or modify images. The attributes of the history relation are shown below.
+Records are added to the history table in time sequence. Each record logically
+contains one line of history text.
+.ls 4
+.ls 12 time
+The date and time of the event. The value of this field is automatically
+set by IMIO when the history record is inserted.
+.le
+.ls parent
+The name of the parent image in the case of an image creation event,
+or the name of the affected image in the case of an image modification
+event affecting a single image.
+.le
+.ls child
+The name of the child or newly created image in the case of an image creation
+event. This field is not used if only a single image is involved in an event.
+.le
+.ls event
+The history text, i.e., a one line description of the event. The suggested
+format is a task or procedure call naming the task or procedure which modified
+the image and listing its arguments.
+.le
+.le
+
+
+.ks
+.nf
+Example:
+
+	TIME		PARENT		CHILD		EVENT
+
+	Sep 23 20:24	nite1[12]	--		imshift (1.5, -3.4)
+	Sep 23 20:30	nite1[10]	nite1[15]
+	Sep 23 20:30	nite1[11]	nite1[15]
+	Sep 23 20:30	nite1[15]	--		nite1[10] - nite1[11]
+.fi
+.ke
+
+
+The principal reason for collecting all history text together in a single
+relation rather than storing it scattered about in string attributes in the
+image headers is to permit use of the DBMS facilities to pose queries on the
+history of the dataset. Secondary reasons are the completeness of the history
+record thus provided for the dataset as a whole, and increased efficiency,
+both in the amount of storage required and in the time required to record an
+event (in particular, the time required to create a new image). Note also that
+the history relation may be used to record events affecting dataset objects
+other than images.
+
+The history of any particular image is easily recovered by printing the values
+of the \fIevent\fR field of all records in which the image appears in the
+\fIparent\fR or \fIchild\fR field. The parents or children of any image are
+easily traced using the information in the history relation. The history of
+the dataset as a whole is given by printing all history records in time
+sequence. History information is not lost when intermediate images are deleted
+unless deletes are explicitly performed upon the history relation.
+
+.nh 4
+World Coordinates
+
+    In general, an image may simultaneously have any number of world coordinate
+systems (WCS) associated with it. It would be quite awkward to try to store an
+arbitrary number of WCS descriptors in the image header, so a separate WCS
+relation is used instead. If world coordinates are not used no overhead is
+incurred.
+
+Maintenance of the WCS descriptor, transformation of the WCS itself (e.g.,
+when an image changes spatially), and coordinate transformations using the WCS
+are all managed by a dedicated package, also called WCS.
The WCS package +is a general purpose package usable not only in IMIO but also in GIO and +other places. IMIO will be responsible for copying the WCS records for an +image when a new image is created, as well as for correcting the WCS for the +effects of subsampling, coordinate flip, etc. when a section of an image is +mapped. + +A general solution to the WCS problem requires that the WCS package support +both linear and nonlinear coordinate systems. The problem is further +complicated by the variable number of dimensions in an image. In general +the number of possible types of nonlinear coordinate systems is unlimited. +Our solution to this difficult problem is as follows. +.ls 4 +.ls o +Each image axis is associated with a one or two dimensional mapping function. +.le +.ls o +Each mapping function consists of a general linear transformation followed +by a general nonlinear transformation. Either transformation may be unitary +(may be omitted) if desired. +.le +.ls o +The linear transformation for an axis consists of some combination of a shift, +scale change, rotation, and axis flip. +.le +.ls o +The nonlinear transformation for an axis consists of a numerical approximation +to the underlying nonlinear analytic function. A one dimensional function is +approximated by a curve x=f(a) and a two dimensional function is approximated +by a surface x=f(a,b), where X, A, and B may be any of the image axes. +A choice of approximating functions is provided, e.g., chebyshev or legendre +polynomial, piecewise cubic spline, or piecewise linear. +.le +.ls o +The polynomial functions will often provide the simplest solution for well +behaved coordinate transformations. The piecewise functions (spline and linear) +may be used to model any slowly varying analytic function represented in +cartesian coordinates. 
The piecewise functions \fIinterpolate\fR the original +analytic function on a regular grid, approximating the function between grid +points with a first or third order polynomial. The approximation may be made +arbitrarily good by sampling on a finer grid, trading table space for increased +precision. +.le +.ls o +For many nonlinear functions, especially those defined in terms of the +transcendental functions, the fitted curve or surface will be quicker to +evaluate than the original function, i.e., the approximation will be more +efficient (evaluation of a bicubic spline is not cheap, however, requiring +computation of a linear combination of sixteen coefficients for each output +point). +.le +.ls o +The nonlinear transformation will define the mapping from pixel coordinates +to world coordinates. The inverse transformation will be computed by numerical +inversion (iterative search). This technique may be too inefficient for some +applications. +.le +.le + + +For example, the WCS for a three dimensional image might consist of a bivariate +Nth order chebyshev polynomial mapping X and Y to RA and DEC via gnomic +projection, plus a univariate piecewise linear function mapping each discrete +image band (Z) to a wavelength value. If the image were subsequently shifted, +rotated, magnified, block averaged, etc., or sampled via an image section, +a linear term would be added to the WCS record of each axis affected by the +transformation. + +A WCS is represented by a \fIset\fR of records in the WCS relation. One record +is required for each axis mapped by the transformation. The attributes of the +WCS relation are described below. The records forming a given WCS all share +the same value of the \fIwcs\fR field. +.ls +.ls 12 wcs +The world coordinate system number, a unique integer code assigned by the WCS +package when the WCS is added to the database. +.le +.ls image +The name of the image with which the WCS is associated. 
+If a WCS is to be associated with more +than one image retrieval must be via the \fIwcs\fR number rather than the +\fIimage\fR name field. +.le +.ls type +A keyword supplied by the application identifying the type of coordinate +system defined by the WCS. This attribute is used in combination with the +\fIimage\fR attribute for keyword based retrieval in cases where an image +may have multiple world coordinate systems. +.le +.ls axis +The image axis mapped by the transformation stored in this record. The X +axis is number 1, Y is number 2, and so on. +.le +.ls axin1 +The first input axis (independent variable in the transformation). +.le +.ls axin2 +The second input axis, set to zero in the case of a univariate transformation. +.le +.ls axout +The number of the input axis (1 or 2) to be used for world coordinate output, +in the case where there is only the linear term but there are two input axes +(in which case the linear term produces a pair of world coordinate values). +.le +.ls linflg +A flag indicating whether the linear term is present in the transformation. +.le +.ls nlnflg +A flag indicating whether the nonlinear term is present in the transformation. +.le +.ls p1,p2 +Linear transformation: origin in pixel space for input axes 1, 2. +.le +.ls w1,w2 +Linear transformation: origin in world space for input axes 1, 2. +.le +.ls s1,s2 +Linear transformation: Scale factor DW/DP for input axes 1, 2. +.le +.ls rot +Linear transformation: Rotation angle in degrees counterclockwise from the +X axis. +.le +.ls cvdat +The curve or surface descriptor for the nonlinear term. The internal format +of this descriptor is controlled by the relevant math package. +This is a variable length array of type real. +.le +.ls label +Axis label for plots. +.le +.ls format +Tick label format for plots, e.g., "0.2h" specifies HMS format in a variable +field width with two decimal places in the seconds field. 
+.le +.le + + +As noted earlier, the full transformation for an axis involves a linear +transformation followed by a nonlinear transformation. The linear term +is defined in terms of the WCS attributes \fIp1, p2\fR, etc. as shown below. +The variables X and Y are the input values of the axes \fIaxin1\fR and +\fIaxin2\fR, which need not correspond to the X and Y axes of the image. + + +.ks +.nf + x' = (x - p1) * s1 + y' = (y - p2) * s2 + + x" = x' * cos(rot) + y' * sin(rot) + y" = y' * cos(rot) - x' * sin(rot) + + u = x" + w1 + v = y" + w2 +.fi +.ke + + +The output variables U and V are then used as input to the nonlinear mapping, +producing the world coordinate value W for the specified image axis \fIaxis\fR +as output. + + w = eval (cvdat, u, v) + +The mappings for the special cases [1] no linear transformation, +[2] no nonlinear transformation, and [3] univariate rather than bivariate +transformation, are easily derived from the full transformation shown above. +Note that if there is no nonlinear term the linear term produces world +coordinates as output, otherwise the intermediate values (U,V) are in +pixel coordinates. Note also that if there is no nonlinear term but there +are two input axes (as in the case of a rotation), attribute \fIaxout\fR +must be set to indicate whether U or V is to be returned as the output world +coordinate. + +.nh 4 +Image Histogram + + Histogram records are stored in a separate histogram relation outside +the image header. An image may have any number of histograms associated +with it, each defined for a different section of the image. A given image +section may have multiple associated histogram records differing in time, +number of sampling bins, etc., although normally recomputation of the +histogram for a given section will result in a record update rather than an +insertion. A subpackage within IMIO is responsible for the computation of +histogram records. Histogram records are not propagated when an image is +copied. 
Modifications to an image made subsequent to computation of a +histogram record may invalidate or obsolete the histogram. +.ls 4 +.ls 12 image +The name of the image or image section to which the histogram record +applies. +.le +.ls time +The date and time when the histogram was computed. +.le +.ls z1 +The pixel value associated with the first bin of the histogram. +.le +.ls z2 +The pixel value associated with the last bin of the histogram. +.le +.ls npix +The total number of pixels used to compute the histogram. +.le +.ls nbins +The number of bins in the histogram. +.le +.ls bins +The histogram itself, i.e., an array giving the number of pixels in each +intensity range. +.le +.le + + +The histogram limits Z1 and Z2 will normally correspond to the minimum and +maximum pixel values in the image section to which the histogram applies. + +.nh 4 +Bad Pixel List + + The bad pixel list records the positions of all bad pixels in an image. +A "bad" pixel is a pixel which has an invalid value and which therefore should +not be used for image analysis. As far as IMIO is concerned a pixel is either +good or bad; if an application wishes to assign a fractional weight to +individual pixels then a second weight image must be associated with the +data image by the applications program. + +Images tend to have few or no bad pixels. When bad pixels are present they +are often grouped into bad regions. This makes it possible to use data +compression techniques to efficiently represent the set of bad pixels, +which is conceptually a simple boolean mask image. + +The bad pixel list is represented in the image header as a variable length +integer array (the runtime structure is slightly more complex). +This integer array consists of a set of lists. Each list in the set lists +the bad pixels in a particular image line. Each linelist consists of a record +length field and a line number field, followed by the bad pixel list for that +line. 
The bad pixel list is a series of either column numbers or ranges of
column numbers.  Single columns are represented in the list as positive
integers; ranges are indicated by a negative second value.


.ks
.nf
        15 2 512 512
        6 23 4 8 15 -18 44
        4 72 23 -29 35
.fi
.ke


An example of a bad pixel list describing a total of 15 bad pixels is shown
above.  The first line is the pixel list header which records the total list
length (15 ints), the number of dimensions (2), and the sizes of each dimension
(512, 512).  There follows a set of variable length line list records.
Two such lists are shown in the example, one for line 23 and one for line 72.
On line 23 columns 4, 8, 15 through 18, and 44 are all bad.  Note that each
linelist contains only a line number since the list is two dimensional;
in general an N dimensional image requires N-1 subscripts after the record
length field, starting with the line number and proceeding to higher dimensions
to the right.

Even though IMIO provides a bad pixel list capability, many applications will
not want to bother to check for bad pixels.  In general, pointwise image
operators which produce a new image as output will not need to check for bad
pixels.  Non-pointwise image operators, e.g., filtering operators, may or may
not wish to check for bad pixels (in principle they should use kernel collapse
to ignore bad pixels).  Analysis programs, i.e., programs which produce
database records as output rather than create new images, will usually check
for and ignore bad pixels.

To avoid machine traps when running the pointwise image operators, all bad
pixels must have reasonable values, even if these values have to be set
artificially when the data is archived.
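As a concreteness check, the line list encoding described above can be
sketched in a few lines of code.  The fragment below is illustrative only
(the DBSS is written in SPP, not Python); the convention that the record
length field counts the line number plus the column entries is inferred
from the example shown earlier in this section.

```python
def decode_badpix(ints):
    """Decode an IMIO-style bad pixel line list into {line: [columns]}.

    Layout (2-D case): a header [length, ndim, size1, ..., sizeN]
    followed by variable length line records [reclen, line, entries...],
    where a negative entry closes a range opened by the previous
    positive entry.
    """
    ndim = ints[1]
    pos = 2 + ndim                      # skip length, ndim, axis sizes
    bad = {}
    while pos < len(ints):
        reclen = ints[pos]              # counts line number + entries
        line = ints[pos + 1]
        entries = ints[pos + 2 : pos + 1 + reclen]
        cols = []
        i = 0
        while i < len(entries):
            start = entries[i]
            if i + 1 < len(entries) and entries[i + 1] < 0:
                # range: positive start, negative end
                cols.extend(range(start, -entries[i + 1] + 1))
                i += 2
            else:
                cols.append(start)      # single bad column
                i += 1
        bad[line] = cols
        pos += 1 + reclen
    return bad
```

Applied to the example list above, the decoder reports 7 bad pixels on
line 23 and 8 on line 72, for the stated total of 15.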
IMAGES SHOULD NOT BE ARCHIVED WITH +MAGIC IN-PLACE VALUES FOR THE BAD PIXELS (as in FITS) since this forces the +system to conditionally test the value of every pixel when the image is read, +an unnecessary operation which is quite expensive for large images. +The simplicity of the reserved value scheme does not warrant such an expense. +Note that the reverse operation, i.e., flagging the bad pixels by setting +them to a magic value, can be carried out very efficiently by the reader +program given a bad pixel list. + +For maximum efficiency those operators which have to deal with bad pixels may +provide two separate data paths internally, one for data which contains no +bad pixels and one for data containing some bad pixels. The path to be taken +would be chosen dynamically as each image line is input, using the bad pixel +list to determine which lines contain bad pixels. Alternatively a program +may elect to have the bad pixels flagged upon input by assignment of a magic +value. The two-path approach is the most desirable one for simple operators. +The magic value approach is often simplest for the more complex applications +where duplicating the code to provide two data paths would be costly and the +operation is already so expensive that the conditional test is not important. + +All operations and queries on bad pixel lists are via a general pixel list +package which is used by IMIO for the bad pixel list but which may be used +for any other type of pixel list as well. The pixel list package provides +operators for creating new lists, adding and deleting pixels and ranges of +pixels from a list, merging lists, and so on. + +.nh 4 +Region Mask + + A region mask is a pixel list which defines some subset of the pixels in +an image. Region masks are used to define the region or regions of an image +to be operated upon. Region masks are stored in a separate mask relation. 
+A mask is a type of pixel list and the standard pixel list package is used +to maintain and access the mask. Any number of different region masks may be +associated with an image, and a given region mask may be used in operations +upon any number of different images. +.ls 4 +.ls 12 mask +The mask number, a unique integer code assigned by the pixel list package +when the mask is added to the database. +.le +.ls image +The image or image section associated with the mask, if any. +.le +.ls type +The logical type of the mask, a keyword supplied by the applications program +when the mask is created. +.le +.ls naxis +The number of axes in the mask image. +.le +.ls naxis[1-4] +The length of each image axis in pixels. If \fInaxis\fR is greater than 4 +additional axis length attributes must be provided. +.le +.ls npix +The total number of pixels in the subset defined by the mask. +.le +.ls pixels +The mask itself, a variable length integer array. +.le +.le + + +Examples of the use of region masks include specifying the regions to be +used in a surface fit to a two dimensional image, or specifying the regions +to be used to correlate two or more images for image registration. +A variety of utility tasks will be provided in the \fIimages\fR package for +creating mask images, interactively and otherwise. For example, it will +be possible to display an image and use the image cursor to mark the regions +interactively. + +.nh 3 +Group Data + + The group data format associates a set of keyword = value type +\fBgroup header\fR parameters with a group of images. All of the images in +a group should have the same size, number of dimensions, and datatype; +this is required for images to be in group format even though it is not +physically required by the database system. All of the images in a group +share the parameters in the group header. In addition, each image in a +group has its own private set of parameters (attributes), stored in the +image header for that image. 
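The shared versus private parameter scheme can be modeled informally as a
two level lookup.  The fragment below is an illustrative sketch, not IRAF
code; the keyword names are invented and the lookup precedence shown
(image header consulted before the group header) is an assumption.

```python
# Model of group format: all member images share the parameters in the
# group header, while each image also carries its own private parameters
# in its image header.  Keywords and precedence here are hypothetical.

group_header = {"OBSERVER": "'Smith'", "BITPIX": "16"}
image_headers = {
    "im1": {"EXPTIME": "30.0"},
    "im2": {"EXPTIME": "60.0"},
}

def get_param(image, keyword):
    """Return a keyword value for a member image of the group."""
    private = image_headers[image]
    if keyword in private:
        return private[keyword]         # private image header parameter
    return group_header[keyword]        # shared group header parameter
```

A query for EXPTIME is satisfied from the private image header, while a
query for OBSERVER falls through to the shared group header.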
+ +The images forming a group are stored in the database as a named base table +of type \fIimage\fR. The name of the base table must be the same as the name +of the group. Each group is stored in a separate table. The group headers +for all groups in the database are stored in a separate \fIgroups\fR table. +The attributes of the \fIgroups\fR relation are described below. +.ls 4 +.ls 12 group +The name of the group (\fIimage\fR table) to which this record belongs. +.le +.ls keyword +The name of the group parameter represented by the current record. +The keyword name should be FITS compatible, i.e., the name must not exceed +eight characters in length. +.le +.ls value +The value of the group parameter represented by the current record, encoded +FITS style as a character string not to exceed 20 characters in length. +.le +.ls comment +An optional comment string, not to exceed 49 characters in length. +.le +.le + + +Group format is provided primarily for the STScI/SDAS applications, which +require data to be in group format. The format is however useful for any +application which must associate an arbitrary set of \fIglobal\fR parameters +with a group of images. Note that the member images in a group may be +accessed independently like any other IRAF image since each image has a +standard image header. The primary physical attributes will be identical +in all images in the group, but these attributes must still be present in +each image header. For the SDAS group format the \fInaxis\fR, \fInaxisN\fR, +and \fIbitpix\fR parameters are duplicated in the group header. + +.nh 3 +Image I/O + + In this section we describe the facilities available for accessing +image headers and image data. The discussion will be limited to those +aspects of IMIO relevant to a discussion of the DBSS. The image i/o (IMIO) +interface and the image database interface (IDBI) are existing interfaces +which are more properly described in detail elsewhere. 
.nh 4
Image Templates

    Most IRAF image operators are set up to operate on a group of images,
rather than a single image.  Membership in such a group is determined at
runtime by a so-called \fIimage template\fR which may select any subset
of the images in the database, i.e., any subset of images from any subset
of \fIimage\fR type base tables.  This type of group should not be confused
with the \fIgroup format\fR discussed in the last section.  The image template
is normally entered by the user on the command line and is dynamically
converted into a list of images by expansion of the template on the current
contents of the database.

Given an image template the IRAF applications program calls an IMIO routine
to "open" the template.  Successive calls to a get image name routine are made
to operate upon the individual images in the group.  When all images have been
processed the template is closed.

The images in a group defined by an image template must exist by definition
when the template is expanded, hence the named images must either be input
images or the operation must update or delete the named images.  If an
output image is to be produced for each input image the user must supply the
name of the table into which the new images are to be inserted.  This is
exactly the same type of operation performed by the DBMS operators, and in
fact most image operators are relational operators, i.e., they take a
relation as input and produce a new relation as output.  Note that the user
is required to supply only the name of the output table, not the names of
the individual images.  The output table may be one of the input tables if
desired.

An image template is syntactically equivalent to a DBIO record selection
expression with one exception: each image name may optionally be modified
by appending an \fIimage section\fR to specify the subset of the pixels in
the image to be operated upon.
An example of an image section string is +"[*,100]"; this references column 100 of the associated image. The image +section syntax is discussed in detail in the \fICL User's Guide\fR. + +Since the image template syntax is nearly identical to the general DBIO record +selection syntax the reader is referred to the discussion of the latter syntax +presented in section 4.5.6 for further details. The new DBIO syntax is largely +upwards compatible with the image template syntax currently in use. + +.nh 4 +Image Pixel Access + + IMIO provides quite sophisticated pixel access facilities which it is +beyond the scope of the present document to discuss in detail. Complete +data independence is provided, i.e., the applications program in general +need not know the actual dimensionality, size, datatype, or storage mode +of the image, what format the image is stored in, or even what device or +machine the image resides on. This is not to say that the application is +forbidden from knowing these things, since more efficient i/o is possible +if there is a match between the logical and physical views of the data. + +Pixel access via IMIO is via the FIO interface. The DBSS is charged with +management of the pixel storage file (if any) and with setting up the +FIO interface so that IMIO can access the pixels. Both buffered and virtual +memory mapped access is supported; which is actually used is transparent to +the user. The types of i/o operations provided are "get", "put", and "update". +The objects upon which i/o may be performed are image lines, image columns, +N-dimensional subrasters, and pixel lists. + +New in the DBIO based version of IMIO are update mode and column and pixel +list i/o, plus direct access via virtual memory mapping using the static file +driver. + +.nh 4 +Image Database Interface (IDBI) + + The image database interface is a simple keyword based interface to the +(non array valued) fields of the standard image header. 
The IDBI isolates +the image oriented applications program from the method used to store the +header, i.e., programs which access the header via the IDBI don't care whether +the header is implemented upon DBIO or some other record i/o interface. +In particular, the IDBI is an existing interface which is \fInot\fR currently +implemented upon DBIO, but which will be converted to use DBIO when it becomes +available. Programs which currently use the IDBI should require few if any +changes when DBIO is installed. + +The philosophy of isolating the applications program using IMIO from the +underlying interfaces is followed in all the subpackages forming the IMIO +interface. Additional IMIO subpackages are provided for appending history +records, creating and reading histograms, and so on. + +.nh 3 +Summary of IMIO Data Structures + + As we have seen, an image is represented as a record in a table in some +database. The image record consists of a set of standard fields, a set of +user defined fields, and the pixel segment, or at least sufficient information +to locate and access the pixel segment if it is stored externally. +An image database may contain a number of other tables; these are summarized +below. + + +.ks +.nf + <images> Image storage (a set of tables named by the user) + groups Header records for group format data + histograms Histograms of images or image sections + history Image history records + masks Region masks + wcs World coordinate systems +.fi +.ke + + +Any number of additional application specific tables may be present in an +actual database. The names of the application and user defined tables must +not conflict with the reserved table names shown above (or with the names of +the DBIO system tables discussed in the next section). The pixel segment of +an image and possibly the image header may be stored in a non-DBSS format +accessed via the HDBI. All the other tables are stored in the standard DBSS +format. 
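The reserved name constraint amounts to a simple membership test when a
new table is created.  The fragment below is a sketch only;
\fIcheck_table_name\fR is a hypothetical routine, and in practice the
DBIO system table names (discussed in the next section) would be added
to the reserved set as well.

```python
# Reserved table names from the summary above.  A user or application
# defined table must not shadow any of these.
RESERVED = {"groups", "histograms", "history", "masks", "wcs"}

def check_table_name(name):
    """Reject a user-defined table name that collides with a reserved one."""
    if name.lower() in RESERVED:
        raise ValueError("reserved table name: " + name)
    return name
```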
.nh 2
The DBIO Interface
.nh 3
Overview

    The database i/o (DBIO) interface is the interface by which all compiled
programs directly or indirectly access data maintained by the DBSS.  DBIO is
primarily a high level record manager interface.  DBIO defines the logical
structure of a database and directly implements most of the operations
possible upon the objects in a database.

The major functions of DBIO are to translate a record select/project expression
into a series of physical record accesses, and to provide the applications
program with access to the contents of the specified records.  DBIO hides
the physical structure and contents of the stored records from the applications
program; providing data independence is one of the major concerns of DBIO.
DBIO is not directly concerned with the physical storage of tables and records
in mass storage, nor with the methods used to physically access such objects.
The latter operations, i.e., the \fIaccess method\fR, are provided by a database
kernel (DBK).

We first review the philosophy underlying the design of DBIO, and discuss
how DBIO differs from most commercial database systems.  Next we describe
the logical structure of a database and introduce the objects making up a
database.  The method used to define an actual database is described,
followed by a description of the methods used to access the contents of a
database.  Lastly we describe the mapping of a DBIO database into physical
files.

.nh 3
Comparison of DBIO and Commercial Databases

    The design of the DBIO interface is based on a thorough study of existing
database systems (most especially System-R, DB2 and INGRES).  It was clear from
the beginning that these systems were not ideally suited to our application,
even if the proprietary and portability issues were ignored.  Eventually the
differences between these commercial database systems and the system we need
became clear.
The differences are due to a change in focus and emphasis as
much as to the obvious differences between scientific and commercial
applications, and are summarized below.
.ls 4
.ls o
The commercial systems are not sufficiently flexible in the types of data that
can be stored.  In particular these systems do not in general support variable
length arrays of arbitrary datatype; most do not support even static arrays.
Only a few systems allow new attributes to be added to existing tables.
Most systems talk about domains but few implement them.  We need both array
storage and the ability to dynamically add new attributes, and it appears that
domains will be quite useful as well.
.le
.ls o
Most commercial systems emphasize the query language, which forms the basis
for the host language interface as well as the user interface.  The query
language is the focus of these systems.  In our case the DBSS is embedded
within IRAF as one of many subsystems.  While we do need query language
facilities at the user level, we do not need such sophisticated facilities
at the DBIO level and would rather do without the attendant complexity and
overhead.
.le
.ls o
Commercial database systems are designed for use in a multiuser transaction
processing environment.  Many users may simultaneously be performing update
and retrieval operations upon a single centralized database.  The financial
success of the company may well depend upon the integrity of the database.
Downtime can be very expensive.

In contrast we anticipate having many independent databases.  These will be
of two kinds: public and private.  The public databases will virtually always be
accessed read only and the entire database can be locked for exclusive access
if it should ever need updating.  Only the private databases are subject to
heavy updating; concurrent access is required for background jobs but the
granularity of locking can be fairly coarse.
If a database should become +corrupted it can be fixed at leisure or even regenerated from scratch without +causing great hardship. Concurrency, integrity, and recovery are therefore +less important for our applications than in a commercial environment. +.le +.ls o +Most commercial database systems (with the exception of the UNIX based INGRES) +are quite machine, device, and host system dependent. In our case portability +of both the software and the data is a primary concern. The requirement that +we be able to archive data in a machine independent format and read it on a +variety of machines seems to be an unusual one. +.le +.le + + +In summary, we need a simple interface which provides flexibility in the way +in which data can be stored, and which supports complex, dynamic data structures +containing variable length arrays of any datatype and size. The commercial +database systems do not provide enough flexibility in the types of data +structures they can support, nor do they provide enough flexibility in storage +formats. On the other hand, the commercial systems provide a more sophisticated +host language interface than we need. DBIO should therefore emphasize flexible +data structures but avoid a complex syntax and all the problems that come with +such. Concurrency and integrity are important but are not the major concerns +they would be in a commercial system. + +.nh 3 +Query Language Interface + + We noted in the last section that DBIO should be a simple record manager +type interface rather than an embedded query language type interface. This +approach should yield the simplest interface meeting our primary requirements. +Nonetheless a host language interface to the query language is possible and +can be added in the future without compromising the present DBIO interface +design. + +The query language will be implemented as a conventional CL callable task in +the DBMS package. 
Command input to the query language will be entered interactively
via the terminal (the usual case), or noninteractively via a string type
command line argument or via a file.  Any compiled program can send commands
to the query language (or to any CL task) using the CLIO \fBclcmd\fR procedure.
Hence a crude but usable HLI query language interface will exist as soon as
a query language becomes available.  A true high level embedded query language
interface could be built using the same interface internally, but this should
be left to some future compiled version of SPP rather than attempted with the
current preprocessor.  We have no immediate plans to build such an embedded
query language interface but there is nothing in the current design to hinder
such a project should it someday prove worthwhile.

.nh 3
Logical Schema

    In this section we present the logical schema of a DBIO database.
A DBIO database consists of a set of \fBsystem tables\fR and a set of
\fBuser tables\fR.  The system tables define the structure of the database
and its contents; the user tables contain user data.  All tables are instances
of named \fBrelations\fR or \fBviews\fR.  Relations and views are ordered
collections of \fBattributes\fR or \fBgroups\fR of attributes.  Each attribute
is defined upon some particular \fBdomain\fR.  The structure of the objects
in a database is defined at runtime by processing a specification written in
the \fBdata definition language\fR.

.nh 4
Databases

    A DBIO database is a collection of named tables.  All databases include
a standard set of \fBsystem tables\fR defining the structure and contents
of the database.  Any number of user or application defined tables may also
be present in the database.  The most important system table is the database
\fIcatalog\fR which includes a record describing each user or system table
in the database.

Conceptually a database is similar to a directory containing files.
The catalog +corresponds to the directory and the tables correspond to the files. +A database is however a different type of object; there need be no obvious +connection between the objects in a database and the physical directories and +files used to store a database, e.g., several tables might be stored in one +file, one table might be stored in many files, the tables might be stored on +a special device and not in files at all, and so on. + +In general the mapping of tables into physical objects is hidden from the user +and is not important. The only exception to this is the association of a +database with a specific FIO directory. The mapping between databases and +directories is one to one, i.e., a directory may contain only one database, +and a database is contained in a single directory. An entire database can +be physically moved, copied, backed up, or restored by merely performing a +binary copy of the contents of the directory. DBIO dynamically generates all +file names relative to the database directory, hence moving a database to +a different directory is harmless. + +To hide the database directory from the user DBIO supports the concept of a +\fBcurrent database\fR in much the way that FIO supports the concept of a +current directory. Tables are normally referenced by name, e.g., "ptable masks" +without explicitly naming the database (i.e., directory) in which the table +resides. The current database is maintained independently of the current +directory, allowing the user to change directories without affecting the +current database. This is particularly useful when accessing public databases +(maintained in a write protected directory) or when accessing databases which +reside on a remote node. To list the contents of the current database the +user must type "pcat" rather than "dir". The current database defaults to +the current directory until the user explicitly sets the current database +with the \fBchdb\fR command. 
+ +Databases are referred to by the filename of the database directory. +The IRAF system will provide a "master catalog" of public databases, +consisting of little more than a set of CL environment definitions assigning +logical names to the database directories. Whenever possible logical names +should be used rather than pathnames to hide the pathname of the database. + +.nh 4 +System Tables + + The structure and contents of a DBIO database are described by the same +table mechanism used to maintain user data. DBIO automatically maintains +the system tables, which are normally protected from writing by the user +(the system tables can be manually updated like any other table in a desperate +situation). Since the system tables are ordinary tables, they can be +inspected, queried, etc., using the same utilities used to access the user +data tables. The system tables are summarized below. +.ls 4 +.ls 12 syscat +The database catalog. +Contains an entry (record) for every table or view in the database. +.le +.ls sysatt +The attribute list table. +Contains an entry for every attribute in every table in the database. +.le +.ls sysddt +The domain descriptor table. +Contains an entry for every defined domain in the database. Any number of +attributes may share the same domain. +.le +.ls sysidt +The index descriptor table. +Contains an entry for every primary or secondary index in the database. +.le +.le + + +The system tables are visible to the user, i.e., they appear in the database +catalog. Like the user tables, the system tables are themselves described by +entries in the database catalog, attribute list table, and domain descriptor +table. + +.nh 4 +The System Catalog + + The \fBsystem catalog\fR is effectively a "table of contents" for the +database. The fields of the catalog relation \fBsyscat\fR are as follows. +.ls 4 +.ls 12 table +The name of the user or system table described by the current record. 
Table names may contain any combination of the alphanumeric characters,
underscore, or period and must not exceed 32 characters in length.
.le
.ls relid
The table identifier.  A unique integer code by which the table is referred
to internally.
.le
.ls type
Identifies the type of table, e.g., base table or view.
.le
.ls ncols
The number of columns (attributes) in the table.
.le
.ls nrows
The number of rows (records, tuples) in the table.
.le
.ls rsize
The size of a record in bytes, not including array storage.
.le
.ls tsize
An estimate of the total number of bytes of storage currently in use by the
table, including array storage.
.le
.ls ctime
The date and time when the table was created.
.le
.ls mtime
The date and time when the table was last modified.
.le
.ls flags
A small integer containing flag bits used internally by DBIO.
These include the protection bits for the table.  Initially only write
protection and delete protection will be supported (for everyone).
Additional protections are of course provided by the file system.
A flag bit is also used to indicate that the table has one or more
indexes, to avoid an unnecessary search of the \fBsysidt\fR table when
accessing an unindexed table.
.le
.le


Only a subset of these fields will be of interest to the user in ordinary
catalog listings.  The \fBpcatalog\fR task will by default print only the
most interesting fields.  Any of the other DBMS output tasks may be used
to inspect the catalog in detail.

.nh 4
Relations

    A \fBrelation\fR is an ordered set of named attributes, each of which is
defined upon some specific domain.  A \fBbase table\fR is a named instance
of some relation.  A base table is a real object like a file; a base table
appears in the catalog and consumes storage on disk.  The term "table" is
more general, and is normally used to refer to any object which can be
accessed like a base table.
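Access to the catalog can be modeled as a simple lookup on the table name
field.  The fragment below is purely illustrative, not IRAF code; the
sample records and the flag bit assignment are hypothetical.

```python
# Sketch of a system catalog lookup.  Field names follow the syscat
# relation described above; the flag bit value is hypothetical.
HAS_INDEX = 0o4          # hypothetical bit: table has one or more indexes

syscat = [
    {"table": "syscat", "relid": 1, "type": "base", "nrows": 7,  "flags": 0},
    {"table": "masks",  "relid": 6, "type": "base", "nrows": 42, "flags": HAS_INDEX},
]

def lookup_table(name):
    """Return the catalog record for a named table, or None."""
    for rec in syscat:
        if rec["table"] == name:
            return rec
    return None

def has_index(rec):
    # consult the index descriptor table only if the flag bit is set
    return bool(rec["flags"] & HAS_INDEX)
```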
+ +A DBIO relation is defined by a set of records describing the attributes +of the relation. The attribute lists of all relations are stored in the +\fBsysatt\fR table, described in the next section. + +.nh 4 +Attributes + + An \fBattribute\fR of a relation is a datum which describes some aspect +of the object described by the relation. Each attribute is defined by a +record in the \fBsysatt\fR table, the fields of which are described below. +The attribute descriptor table, while visible to the user if they wish to +examine the structure of the database in detail, is primarily an internal +table used by DBIO to define the structure of a record. +.ls 4 +.ls 12 name +The name of the attribute described by the current record. +Attribute names may contain any combination of the alphanumeric characters +or underscore and must not exceed 16 characters in length. +.le +.ls attid +The attribute identifier. A unique integer code by which the attribute is +referred to internally. The \fIattid\fR is unique within the relation to +which the attribute belongs, and defines the ordering of attributes within +the relation. +.le +.ls relid +The relation identifier of the table to which this attribute belongs. +.le +.ls domid +The domain identifier of the domain to which this attribute belongs. +.le +.ls dtype +A single character identifying the atomic datatype of this attribute. +Note that domain information is not used for most runtime record accesses. +.le +.ls prec +The precision of the atomic datatype of this attribute, i.e., the number +of bytes of storage per element. +.le +.ls count +The number of elements of type \fIdtype\fR in the attribute. If this value +is one the attribute is a scalar. Zero implies a variable length array +and N denotes a static array of N elements. +.le +.ls offset +The offset of the field in bytes from the start of the record. +.le +.ls width +The width of the field in bytes. All fields occupy a fixed amount of space +in a record. 
In the case of variable length arrays fields \fBoffset\fR and +\fBwidth\fR refer to the array descriptor. +.le +.le + + +In summary, the attribute list defines the physical structure of a record +as stored in mass storage. DBIO is responsible for encoding and decoding +records as well as for all access to the fields of records. A record is +encoded as a byte stream in a machine independent format. The physical +representation of a record is discussed further in a later section describing +the DBIO storage structures. + +.nh 4 +Domains + + A domain is a restricted implementation of an abstract datatype. +Simple examples are the atomic datatypes char, integer, real, etc.; no doubt +these will be the most commonly used domains. A more interesting example is +the \fItime\fR domain. Times are stored in DBIO as attributes defined upon +the \fItime\fR domain. The atomic datatype of a time attribute is a four byte +integer; the value is the long integer value returned by the IRAF system +procedure \fBclktime\fR. Integer time values are convenient for time domain +arithmetic, but are not good for printed output. The definition of the +\fItime\fR domain therefore includes a specification for the output format +which will cause time attributes to be printed as a formatted date/time string. + +Domains are used to verify input and to format output, hence there is no +domain related overhead during record retrieval. The only exception to +this rule occurs when returning the value of an uninitialized attribute, +in which case the default value must be fetched from the domain descriptor. + +Domains may be defined either globally for the entire database or locally for +a specific table. Attributes in any table may be defined upon a global domain. +The system table \fBsysddt\fR defines all global and local domains. +The attributes of this table are outlined below. +.ls 4 +.ls 12 name +The name of the domain described by the current record. 
+Domain names may contain any combination of the alphanumeric characters +or underscore and must not exceed 16 characters in length. +.le +.ls domid +The domain identifier. A unique integer code by which the domain is referred +to internally. The \fIdomid\fR is unique within the table for which the domain +is defined. +.le +.ls relid +The relation identifier of the table to which this domain belongs. +This is set to zero if the domain is defined globally. +.le +.ls grpid +The group identifier of the group to which this domain belongs. +This is set to zero if the domain does not belong to a special group. +A negative value indicates that the named domain is itself a group +(groups are discussed in the next section). +.le +.ls dtype +A single character identifying the atomic datatype upon which the domain +is defined. +.le +.ls prec +The precision of the atomic datatype of this domain, i.e., the number +of bytes of storage per element. +.le +.ls defval +The default value for attributes defined upon this domain (a byte string of +length \fIprec\fR bytes). If no default value is specified DBIO will assume +that null values are not permitted for attributes defined upon this domain. +.le +.ls minval +The minimum value permitted. This attribute is used only for integer or real +valued domains. +.le +.ls maxval +The maximum value permitted. This attribute is used only for integer or real +valued domains. +.le +.ls enumval +If the domain is string valued with a fixed number of permissible value strings, +the legal values may be enumerated in this string valued field. +.le +.ls units +The units label for attributes defined upon this domain. +.le +.ls format +The default output format for printed output. All SPP formats are supported +(e.g., including HMS, HM, octal, etc.) plus some special DBMS formats, e.g., +the time format. +.le +.ls width +The field width in characters for printed output. 
.le
.le


Note that the \fIunits\fR and \fIformat\fR fields and the four "*val" fields
are stored as variable length character arrays, hence there is no fixed limit
on the sizes of these strings. Use of a variable length field also minimizes
storage requirements and makes it easy to test for an uninitialized value.
Only fixed length string fields and scalar valued numeric fields may be used
in indexes and selection predicates, however.

A number of global domains are predefined by DBIO. These are summarized
in the table below.


.ks
.nf
	NAME	DTYPE	PREC	DEFVAL

	byte	  u	  1	0
	char	  c	 arb	nullstr
	short	  i	  2	INDEFS
	int	  i	  4	INDEFI
	long	  i	  4	INDEFL
	real	  r	  4	INDEFR
	double	  r	  8	INDEFD
	time	  i	  4	0
.fi
.ke


The predefined global domains, as well as all user defined domains, are defined
in terms of the four DBK variable precision atomic datatypes. These are the
following:


.ks
.nf
	NAME	DTYPE	PREC	DESCRIPTION

	char	  c	>=1	character
	uint	  u	1-4	unsigned integer
	int	  i	1-4	signed integer
	real	  r	2-8	floating point
.fi
.ke


DBIO stores records with the field values encoded in the machine independent,
variable precision DBK data format. The precision of an atomic datatype is
specified by an integer N, the number of bytes of storage to be reserved for
the value. The permissible precisions for each DBK datatype are shown in
the preceding table. The actual encoding used is designed to simplify the
semantics of the DBK and is not any standard format. The DBK binary encoding
will be described in a later section.

.nh 4
Groups

    A \fBgroup\fR is a logical grouping of several related attributes.
A group is much like a relation, except that a group is a type of domain
and may be used as such to define the attributes of relations. Since groups
are similar to relations, groups are defined in the \fBsysatt\fR table
(groups do not, however, appear in the system catalog).
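To make the group construct concrete, the sketch below (hypothetical Python,
not DBIO code; the function and the sample group are illustrative only) shows
how a group of member attributes can be expanded into ordinary dotted
attribute names when a relation is defined upon it:

```python
# Hypothetical sketch (plain Python, not DBIO code) of group expansion.
# A group is a named set of member attributes; when a relation is defined
# upon a group, each member becomes an ordinary attribute whose name is
# the group name joined to the member name with a period.

def expand_group(group_name, members):
    """Expand a group into ordinary dotted attributes.
    A member whose value is itself a dict is a nested group and
    is expanded recursively."""
    attributes = {}
    for member, domain in members.items():
        if isinstance(domain, dict):                    # nested group
            for name, dom in expand_group(member, domain).items():
                attributes[group_name + "." + name] = dom
        else:
            attributes[group_name + "." + member] = domain
    return attributes

# A two-member group "cv", with illustrative domains.
print(expand_group("cv", {"ncoeff": "int", "type": "int"}))
# -> {'cv.ncoeff': 'int', 'cv.type': 'int'}
```

Once expanded, only the dotted attributes remain, so the runtime system need
not know that a group was ever involved.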
Each member of a
group is an attribute defined upon some domain; nesting of groups is permitted.

Groups are expanded when a relation is defined, hence the runtime system
need not be aware of groups. Expansion of a group produces a set of ordinary
attributes wherein each attribute name consists of the group name joined
to the member name with a period, e.g., the resolved attributes "cv.ncoeff"
and "cv.type" are the result of expansion of a two-member group attribute
named "cv".

The main purposes of the group construct are to simplify data definition and
to give the forms generator additional information for structuring formatted
output. Groups provide a simple capability for structuring data within a table.
Whenever the same grouping of attributes occurs in several tables, the group
mechanism should be used to ensure that all instances of the group are
defined equivalently.

.nh 4
Views

    A \fBview\fR is a virtual table defined in terms of one or more base
tables or other views via a record select/project expression. Views provide
different ways of looking at the same data; the view mechanism can be very
useful when working with large, complex base tables (it saves typing).
Views allow the user to focus on just the data of interest and ignore
the rest. The view mechanism also significantly increases the amount of data
independence provided by DBIO, since a base table can be made to look
different to different application programs without physically modifying
the table or producing several copies of the same table. This capability can
be invaluable when the tables involved are very large or cannot be modified
for some reason.

A view provides a "window" into one or more base tables. The window is
dynamic in the sense that changes to the underlying base tables are
immediately visible through the window.
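The dynamic-window behavior can be sketched in a few lines (hypothetical
Python, not DBIO code; the table, records, and predicates are illustrative
only): the view holds no records of its own, only the base table and a stored
selection predicate, and each query re-reads the base table with the stored
and supplied predicates combined.

```python
# Hypothetical sketch (plain Python, not DBIO code) of a view as a stored
# selection. Because every query re-reads the base table, changes to the
# base are immediately visible through the view.

def make_view(base_table, stored_pred):
    """Return a query function closing over the base table and predicate."""
    def query(query_pred=lambda rec: True):
        # Merge the stored and supplied predicates with logical AND;
        # a record is returned only if it satisfies both.
        return [rec for rec in base_table
                if stored_pred(rec) and query_pred(rec)]
    return query

data1 = [{"obj": "m31", "x": 5}, {"obj": "m51", "x": 15}]
d1x = make_view(data1, lambda rec: 10 <= rec["x"] <= 20)

print(d1x())                                  # records visible in the view
data1.append({"obj": "m87", "x": 12})         # update the base table
print(d1x(lambda rec: rec["obj"] == "m87"))   # new record already visible
```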
This is because a view does not contain any data
itself, but is rather a \fIdefinition\fR, via record selection and projection,
of a new table in terms of existing tables. For example, consider the
following imaginary select/project expression (SPE):

	data1 [x >= 10 and x <= 20] % obj, x, y

This defines a new table with attributes \fIobj\fR, \fIx\fR, and \fIy\fR,
consisting of all records of table \fIdata1\fR for which \fIx\fR is in the
range 10 to 20. We could use the SPE shown to copy the named fields of the
selected records to produce a new base table, e.g., \fId1x\fR.
The view mechanism allows us to define table \fId1x\fR as a view-table,
storing only the SPE shown. When the view-table \fId1x\fR is subsequently
queried, DBIO will \fImerge\fR the SPE supplied in the new query with that
stored in the view, returning only records which satisfy both selection
expressions. This works because the output of an SPE is a table and can
therefore be used as input to another SPE, i.e., two or more selection
expressions can be combined to form a more complex expression.

A view appears to the user (or to a program) as a table, behaving equivalently
to a base table in most operations. View-tables appear in the catalog and
can be created and deleted much like ordinary tables.

.nh 4
Null Values

    Null valued attributes are possible in any database system; they are
guaranteed to occur when the system permits new attributes to be dynamically
added to existing, nonempty base tables. DBIO deals with null values by
the default value mechanism mentioned earlier in the discussion of domains.
When the value of an uninitialized attribute is referenced, DBIO automatically
supplies the user specified default value of the attribute. The defaulting
mechanism supports three cases; these are summarized below.
.ls 4
.ls o
If null values are not permitted for the referenced attribute, DBIO will
return an error condition.
This case is indicated by the absence of a
default value.
.le
.ls o
Indefinite (or any special value) may be returned as the default value if
desired, allowing the calling program to test for a null value.
.le
.ls o
A valid default value may be returned, with no checking for null values
occurring in the calling program.
.le
.le


Testing for null values in predicates is possible only if the default value
is something recognizable like INDEF, and is handled by the conventional
equality operator. Indefinites are propagated in expressions by the usual
rules, i.e., the result of any arithmetic expression containing an indefinite
is indefinite, order comparison where an operand is indefinite is illegal,
and equality or inequality comparison is legal and well defined.

.nh 3
Data Definition Language

    The data definition language (DDL) is used to define the objects in a
database, e.g., during table creation. The function of the DBIO table
creation procedure is to add tuples to the system tables to define a new
table and all attributes, groups, and domains used in the table. The data
definition tuples can come from either of two sources: [1] they can be
copied in compiled form from an existing table, or [2] they can be
generated by compilation of a DDL source specification.

In appearance, DDL looks much like a series of structure declarations such
as one finds in most modern compiled languages. DDL text may be entered
either via a string buffer in the argument list (no file access required)
or via a text file named in the argument list to the table creation procedure.

The DDL syntax has not yet been defined. An example of what a DDL declaration
for the IMIO \fImasks\fR relation might look like is shown below. The syntax
shown is a generalization of the SPP+ syntax for a structure declaration, with
a touch of the CL thrown in.
If a relation is defined only in terms of the
predefined domains or atomic datatypes and has no primary key, etc., then the
declaration would look very much like an SPP+ (or C) structure declaration.


.ks
.nf
	relation masks {
		u2	mask	{ width=6 }
		c64	image	{ defval="", format="%20.20s", width=21 }
		c15	type	{ defval="generic" }
		byte	naxis
		long	naxis1, naxis2, naxis3, naxis4
		long	npix
		i2	pixels[]
	} where {
		key = mask+image+type
		comment = "image region masks"
	}
.fi
.ke


The declaration shown declares the attributes of the relation, then identifies
the primary key for the relation and gives a comment describing the relation.
In this example all domains are either local and declared implicitly, or
global and predefined. For example, DBIO will automatically create a domain
named "type", belonging to the relation "masks", for the attribute named
"type". DBIO is assumed to provide default values for the attributes of each
domain (e.g., "format", "width", etc.) not specified explicitly in the
declaration. It should be possible to keep the DDL syntax simple enough that
an LALR parser does not have to be used, reducing text memory requirements
and the time required to process the DDL, and improving error diagnostics.

.nh 3
Record Select/Project Expressions

    Most programs using DBIO will be relational operators, taking a table
as input, performing some operation or transformation upon the table, and
either updating the table or producing a new table as output. DBIO record
select/project expressions (SPE) are used to define the input table.
By using an SPE one can define the input table to be any subset of the
fields (projection) of any subset of the records (selection) of any set of
base tables or views (set union).

The general form of a select/project expression is shown below.
The syntax
is patterned after the algebraic languages, and even happens to be upward
compatible with the existing IMIO image template syntax.


.ks
.nf
	tables [pred] [upred] % fields

where

	tables	Is a comma delimited list of tables.

	,	Is the set union operator (in the tables and
		fields lists).

	[	Is the selection operator.

	pred	Is a predicate, i.e., a boolean condition.
		The simplest predicate is a constant or
		list of constants, specifying a set of
		possible values for the primary key.

	upred	Is a user predicate, passed back to the
		calling program appended to the record
		name but not used by DBIO. This feature
		is used to implement image sections.

	%	Is the projection operator.

	fields	Is a comma delimited list of \fIexpressions\fR
		defined upon the attributes of the input
		relation, defining the attributes of the
		output relation.
.fi
.ke


All components of an SPE are optional except \fItables\fR; the simplest
SPE is the name of a single table. Some simple examples follow.

.nh 4
Examples

    Print all fields of table \fInite1\fR. The table \fInite1\fR is an image
table containing several images with primary keys 1, 2, 3, and so on.

	cl> ptable nite1

Print selected fields of table \fInite1\fR.

	cl> ptable nite1%image,title

Plot line 200 of image 2 in table \fInite1\fR.

	cl> graph nite1[2][*,200]

Print image statistics on the indicated images in table \fInite1\fR.
The example shows a predicate specifying images 1, 3, and 5 through 12,
not an image section.

	cl> imstat nite1[1,3,5:12]

Print the names and number of bad pixels in tables \fInite1\fR and \fIm87\fR,
for all images that have any bad pixels.

	cl> ptable "nite1,m87 [nbadpix > 0] % image, nbadpix"


The tables in an SPE may be general select/project expressions, not just the
names of base tables or views as in the examples.
In other words, SPEs
may be nested, using parentheses around the inner SPE if necessary to indicate
the order of evaluation. As noted earlier in the discussion of views,
the ability of SPEs to nest is used to implement views. Nesting may also
be used to perform selection or projection upon the individual input tables.
For example, the SPE used in the following command specifies the union of
selected records from tables \fInite1\fR and \fInite2\fR.

	cl> imstat nite1[1,8,21:23],nite2[9]

.nh 3
Operators