author     Joseph Hunkeler <jhunkeler@gmail.com>    2015-07-08 20:46:52 -0400
committer  Joseph Hunkeler <jhunkeler@gmail.com>    2015-07-08 20:46:52 -0400
commit     fa080de7afc95aa1c19a6e6fc0e0708ced2eadc4 (patch)
tree       bdda434976bc09c864f2e4fa6f16ba1952b1e555 /sys/dbio
download   iraf-linux-fa080de7afc95aa1c19a6e6fc0e0708ced2eadc4.tar.gz
Initial commit
Diffstat (limited to 'sys/dbio')
-rw-r--r--  sys/dbio/README              3
-rw-r--r--  sys/dbio/db2.doc           674
-rw-r--r--  sys/dbio/db2.hlp           612
-rw-r--r--  sys/dbio/doc/dbio.hlp      413
-rw-r--r--  sys/dbio/new/coords         73
-rw-r--r--  sys/dbio/new/dbio.con      202
-rw-r--r--  sys/dbio/new/dbio.hlp     3202
-rw-r--r--  sys/dbio/new/dbio.hlp.1    346
-rw-r--r--  sys/dbio/new/dbki.hlp      bin  0 -> 6401 bytes
-rw-r--r--  sys/dbio/new/ddl           125
-rw-r--r--  sys/dbio/new/schema        307
-rw-r--r--  sys/dbio/new/spie.ms        17
12 files changed, 5974 insertions, 0 deletions
diff --git a/sys/dbio/README b/sys/dbio/README new file mode 100644 index 00000000..d5411a74 --- /dev/null +++ b/sys/dbio/README @@ -0,0 +1,3 @@ + +This directory shall contain the sources for the iraf database package. +See the discussion in the crib sheet for more information on DBIO. diff --git a/sys/dbio/db2.doc b/sys/dbio/db2.doc new file mode 100644 index 00000000..66a38c41 --- /dev/null +++ b/sys/dbio/db2.doc @@ -0,0 +1,674 @@ +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + + IRAF DATABASE I/O + Doug Tody + November 1984 + + + + + +1. INTRODUCTION + + The IRAF database i/o package (DBIO) is a library of SPP callable +procedures used to create, modify, and access IRAF database files. All +access to these database files shall be indirectly or directly via the +DBIO interface. DBIO shall be implemented using IRAF file i/o and +memory management facilities, hence the package will be compact and +portable. The separate CL level package DBMS shall be provided for +interactive database access and for procedural access to the database +from within CL scripts. The DBMS tasks will access the database via +DBIO. + +Virtually all runtime IRAF datafiles not maintained in text form shall +be maintained under DBIO, hence it is essential that the interface be +both efficient and compact. In particular, bulk data (images) and +large catalogs shall be maintained under DBIO. The requirement for +flexibility in defining and accessing IRAF image headers necessitates +quite a sophisticated interface. Catalog storage is required primarily +for module intercommunication and output of the results of the larger +IRAF applications packages, but will also be useful for accessing +astronomical catalogs prepared outside IRAF (e.g., the SAO star +catalog). In short, virtually all IRAF applications packages are +expected to make use of DBIO; many will depend upon it heavily. + +The relationship of the DBIO and DBMS packages to each other and to the +related standard IRAF interfaces is shown in Figure 1.1. + + + DBMS + DBIO + FIO + MEMIO + (kernel) + (datafiles) + + (cl) | (vos) | (host) + + + + Fig 1.1 Major Interfaces + + +While images will be maintained under DBIO, access to the pixels will +continue to be provided by the IMIO interface. IMIO is a higher level +interface which will use DBIO to maintain the image header. Pixel + + + -1- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + +storage will be either in a separate pixel storage file or in the +database file itself (as a one dimensional array), depending on the +size of the image. A system defined thresold value will determine +which type of storage is used. The relationship of IMIO to DBIO is +shown in Figure 1.2. + + + IMAGES + IMIO + DBIO + FIO + MEMIO + + (cl) | (vos) + + + Fig 1.2 Relationship of Database and Image I/O + + + +2. REQUIREMENTS + + The requirements for the DBIO interface are driven by its intended +usage for image and catalog storage. It is arguable whether the same +interface should be used for both types of data, but development of an +interface such as DBIO with all the associated DBMS utilities is +expensive, hence we would prefer to have to develop only one such +interface. Furthermore, it is desirable for the user to only have to +learn one such interface. The primary functional and performance +requirements which DBIO must meet are the following (in no particular +order). 
+
+
+    [1] DBIO shall provide a high degree of data independence, i.e., a
+        program shall be able to access a data structure maintained
+        under DBIO without detailed knowledge of its contents.
+
+    [2] A DBIO datafile shall be self describing and self contained,
+        i.e., it shall be possible to examine the structure and
+        contents of a DBIO datafile without prior knowledge of its
+        structure or contents.
+
+    [3] DBIO shall be able to deal efficiently with records containing
+        up to N fields and with data groups containing up to M records,
+        where N and M are at least sysgen configurable and are order of
+        magnitude N=10**2 and M=10**6.
+
+    [4] The time required to access an image header under DBIO must be
+        comparable to the time currently required for the equivalent
+        operation under IMIO.
+
+    [5] It shall be possible for an image header maintained under DBIO
+        to contain application or user defined fields in addition to
+        the standard fields required by IMIO.
+
+    [6] It shall be possible to dynamically add new fields to an
+        existing image header (or to any DBIO record).
+
+    [7] It shall be possible to group similar records together in the
+        database and to perform global operations upon all or part of
+        the records in a group.
+
+    [8] It shall be possible for a field of a record to be a
+        one-dimensional array of any of the primitive types.
+
+    [9] Variant records (records containing variable size fields) shall
+        be supported, ideally without penalizing efficient access to
+        databases which do not contain such records.
+
+    [A] It shall be possible to copy a record without knowledge of its
+        contents.
+
+    [B] It shall be possible to merge (join) two records containing
+        disjoint sets of fields.
+
+    [C] It shall be possible to update a record in place.
+
+    [D] It shall be possible to simultaneously access (retrieve,
+        update, or insert) multiple records from the same data group.
+
+
+To summarize, the primary requirements are data independence, efficient
+access to both large and small databases, and flexibility in the
+contents of the database.
+
+
+
+3. CONCEPTUAL DESIGN
+
+    The DBIO database facilities shall be based upon the relational
+model. The relational model is preferred due to its simplicity (to the
+user) and due to the demonstrable fact that relational databases can
+efficiently handle large amounts of data. In the relational model the
+database appears to be nothing more than a set of TABLES, with no
+builtin connections between separate tables. The operations defined
+upon these tables are based upon the relational algebra, which is in
+turn based upon set theory. The major advantages claimed for
+relational databases are the simplicity of the concept of a database as
+a collection of tables, and the predictability of the relational
+operators due to their being based on a formal theoretical model.
+
+None of the requirements listed in section 2 state that DBIO must
+implement a relational database. Most of our needs can be met by
+structuring our data according to the relational data model (i.e., as
+tables), and providing a good SELECT operator for retrieving records
+from the database. If a semirelational database is sufficient to meet
+our requirements then most likely that is what will be built (at least
+initially; the relational operators are very attractive for data
+analysis).
DBIO is not expected to be competitive with any commercial +relational database; to try to make it so would probably compromise the +requirement that the interface be compact. On the other hand, the +database requirements of IRAF are similar enough to those addressed by +commercial databases that we would be foolish not to try to make use of +some of the same technology. + + + FORMAL RELATIONAL TERM INFORMAL EQUIVALENTS + + relation table + tuple record, row + attribute field, column + domain datatype + primary key record id + + +A DBIO DATABASE shall consist of one or more RELATIONS (tables). Each +relation shall contain zero or more RECORDS (rows of the table). Each +record shall contain one or more FIELDS (columns of the table). All +records in a relation shall share the same set of fields, but all of +the fields in a record need not have been assigned values. When a new +ATTRIBUTE (column) is added to an existing relation a default valued +field is added to each current and future record in the relation. + +Each attribute is defined upon a particular DOMAIN, e.g., the set of +all nonnegative integer values less than or equal to 100. It shall be +possible to specify minimum and maximum values for integer and real +attributes and to enumerate the permissible values of a string type +attribute. It shall be possible to specify a default value for an +attribute. If no default value is given INDEF is assumed. One +dimensional arrays shall be supported as attribute types; these will be +treated as atomic datatypes by the relational operators. Array valued +attributes shall be either fixed in size (the most efficient form) or +variant. There need be no special character string datatype since one +dimensional arrays of type character are supported. + +Each relation shall be implemented as a separate file. If the relations +comprising a database are stored in a directory then the directory can +be thought of as the database. Public databases will be stored in well +known public (write protected) directories, private databases in user +directories. The logical directory name of each public database will be +the name of the database. Physical storage for a database need not +necessarily be allocated locally, i.e., a database may be centrally +located and remotely accessed if the host computer is part of a local +area network. + +Locking shall be at the level of entire relations rather than at the +record level, at least in the initial implementation. There shall be + + + -4- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + +no support for indices in the initial implementation except possibly +for the primary key. It should be possible to add either or both of +these features to a future implementation without changing the basic +DBIO interface. Modifications to the internal data structures used in +database files will likely be necessary when adding such a major +feature, making a save and restore operation necessary for each +database file to convert it to the new format. The save format chosen +(e.g. FITS table) should be independent of the internal format used at +a particular time on a particular host machine. + +Images shall be stored in the database as individual records. All +image records shall share a common subset of attributes. Related +images (image records) may be grouped together to form relations. The +IRAF image operators shall support operations upon relations (sets of +images) much as the IRAF file operators support operations upon sets of +files. 
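
The domain mechanism described earlier amounts to a small amount of
validation logic applied on assignment. The following SPP fragment is a
minimal sketch of that check for a real valued domain; the procedure
name and calling sequence are hypothetical, since the design does not
fix them, but the INDEF convention and the min/max semantics are as
specified above.

    # Validate an assignment against a real valued domain (a sketch;
    # db_checkr is a hypothetical name).  If no value has been given
    # the domain default (here INDEFR) applies.
    real procedure db_checkr (value, minval, maxval)

    real    value                   # value being assigned
    real    minval, maxval          # domain limits

    begin
            if (IS_INDEFR (value))
                return (INDEFR)     # unassigned; take the default
            if (value < minval || value > maxval)
                call error (1, "assignment out of range for domain")
            return (value)
    end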
+ +A unary image operator shall take as input a relation (set of one or +more images), inserting the processed images into the output relation. +A binary image operator shall take as input either two relations or a +relation and a record, inserting the processed images into the output +relation. In all cases the output relation can be an input relation as +well. The input relation will be defined either by a list or by +selection using a theta-join (operationally similar to a filename +template). + + + +3.1 RELATIONAL OPERATORS + + DBIO shall support two basic types of database operations: +operations upon relations and operations upon records. The basic +relational operators are the following. All of these operators produce +as output a new relation. + + + create + Create a new base relation (physical relation as stored on + disk) by specifying an initial set of attributes and the + (file)name for the new relation. Attributes and domains may be + specified via a data definition file or by reference to an + existing relation. A primary key (limited to a single + attribute) should be identified. The new relation initially + contains no records. + + drop + Delete a (possibly nonempty) base relation and any associated + indices. + + alter + Add a new attribute or attributes to an existing base relation. + Attributes may be specified explicitly or by reference to + another relation. + + + -5- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + + select + Create a new relation by selecting records from one or more + existing base relations. Input consists of an algebraic + expression defining the output relation in terms of the input + relations (usage will be similar to filename templates). The + output relation need not have the same set of attributes as the + input relations. The SELECT operator shall ultimately implement + all the basic operations of the relational algebra, i.e., + select, project, join, and the set operations. At a minimum, + selection and projection are required in the initial + interface. The output of SELECT is not a named relation (base + relation), but is instead intended to be accessed by the record + level operators discussed in the next section. + + edit + Edit a relation. An interactive screen editor is entered + allowing the user to add, delete, or modify tuples (not + required in the initial version of the interface). Field + values are verified upon input. + + sort + Make the storage order of the records in a relation agree with + the order defined by the primary key (the index associated with + the primary key is always sorted but index order need not agree + with storage order). In general, retrieval on a sorted + relation is more efficient than on an unsorted relation. + Sorting also eliminates deadspace left by record deletion or by + updates involving variant records. + + +Additional nonalgebraic operators are required for examining the +structure and contents of relations, returning the number of records or +attributes in a relation, and determining whether a given relation +exists. + +The SELECT operator is the primary user interface to DBIO. Since most +of the relational power of DBIO is bound up in the SELECT operator and +since SELECT will be driven by an algebraic expression (character +string) there is considerable scope for future enhancement of DBIO +without affecting existing code. 
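
For example, the basic scan over the output of SELECT in a compiled
task might read as follows. This is a sketch only: the procedure names
(db_select, db_retrieve, db_close) are placeholders, as the design does
not yet fix the calling sequences, and the expression syntax is likewise
illustrative; only the division of labor between SELECT and the record
level operators of the next section is taken from the design.

    # Scan the records selected by an algebraic expression (sketch).
    procedure t_scan ()

    pointer rs, rp
    pointer db_select()             # hypothetical calling sequences
    int     db_retrieve()

    begin
            # Selection expression; usage is to be similar to
            # filename templates.
            rs = db_select ("spectra[exptime > 100]")

            while (db_retrieve (rs, rp) != EOF) {
                # read or modify a subset of the fields of record rp,
                # updating or inserting it in the output relation
            }

            call db_close (rs)
    end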
+ + + +3.2 RECORD (TUPLE) LEVEL OPERATORS + + While the user should see primarily operations on entire relations, +record level processing is necessary at the program level to permit +data entry and implementation of special operators. The basic record +level operators are the following. + + + + + + + -6- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + + retrieve + Retrieve the next record from the relation defined by SELECT. + While the tuples in a relation theoretically form an unordered + set, tuples will normally be returned in either storage order + or in the sort order of the primary key. Although all fields + of a retrieved record are accessible, an application will + typically have knowledge of only a few fields. + + update + Rewrite the (possibly modified) current record. The updated + record is written back into the base table from which it was + read. Not all records produced by SELECT can be updated. + + insert + Insert a new record into an output relation. The output + relation may be an input relation as well. Records added to an + output relation which is also an input relation do not become + candidates for selection until another SELECT occurs. A + retrieve followed by an insert copies a record without + knowledge of its contents. A retrieve followed by modification + of selected fields followed by an insert copies all unmodified + fields of the record. The attributes of the input and output + relations need not match; unmatched output attributes take on + their default values and unmatched input attributes are + discarded. INSERT returns a pointer to the output record, + allowing insertions of null records to be followed by + initialization of the fields of the new record. + + delete + Delete the current record. + + +Additional operators are required to close or open a relation for record +level access and to count the number of records in a relation. + + + +3.2.1 CONSTRUCTING SPECIAL RELATIONAL OPERATORS + + The record level operations may be combined with SELECT in compiled +programs to implement arbitrary operations upon entire relations. The +basic scenario is as follows: + + + [1] The set of records to be operated upon, defined by the SELECT + operator, is opened as an unordered set (list) of records to be + processed. + + [2] The "next" record in the relation is accessed with RETRIEVE. + + + + + + + -7- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + + [3] The application reads or modifies a subset of the fields of the + record, updating modified records or inserting the record in + the output relation. + + [4] Steps [2] and [3] are repeated until the entire relation has + been processed. + + +Examples of such operators are conversion to and from DBIO and LIST file +formats, column extraction, mimimum or maximum of an attribute (domain +algebra), and all of the DBMS and IMAGES operators. + + + +3.3 FIELD (ATTRIBUTE) LEVEL OPERATORS + + Substantial processing of the contents of a database is possible +without ever accessing the individual fields of a record. If field +level access is required the record must first be retrieved or +inserted. Field level access requires knowledge of the names of the +attributes of the parent relation, but not their exact datatypes. +Automatic type conversion occurs when field values are queried or set. + + + get + Get the value of the named scalar or vector field (typed). + + put + Put the value of the named scalar or vector field (typed). 
+ + read + Read the named fields into an SPP data structure, given the + name, datatype, and length (if vector) of each field in the + output structure. There must be an attribute in the parent + relation for each field in the output structure. + + write + Copy an SPP data structure into the named fields of a record, + given the name, datatype, and length (if vector) of each field + in the input structure. There must be an attribute in the + parent relation for each field in the input structure. + + access + Determine whether a relation has the named attribute. + + + +3.4 STORAGE STRUCTURES + + The DBIO storage structures are the data structures used by DBIO to +maintain relations in physical storage. The primary design goals are +simplicity and efficiency in time and space. Most actual relations are +expected to fall into three classes: + + + -8- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + + [1] Relations containing only a single record, e.g., an image + stored alone in a relation. + + [2] Relations containing several dozen or several hundred records, + e.g., a collection of spectra from an observing run. + + [3] Large relations containing 10**5 or 10**6 records, e.g., the + output of an analysis program or an astronomical catalog. + + +Updates and insertions are generally random access operations; retrieval +based on the values of several attributes requires efficient sequential +access. Efficient random access for relations [2] and [3] requires use +of an index. Efficient sequential access requires that records be +accessible in storage order without reference to the index, i.e., that +records be chained in storage order. Efficient field access where a +record contains several dozen attributes requires something better than +a linear search over the attribute list. + +The use of an index shall be limited initially to a single index for +the primary key. The primary key will be restricted to a single +attribute, with the application defining the attribute to be used (in +practice few attributes are usable as keys). The index will be a +standard B+ tree, with one exception: the root block of the tree will +be maintained in dedicated storage in the datafile. If and only if a +relation grows so large that it overflows the root block will a +separate index file be allocated for the index. This will eliminate +most of the overhead associated with the index for small relations. + +Efficient sequential access will be provided in either of two ways: via +the index in index order or via the records themselves in storage order, +depending on the operation being performed. If an external index is +used the leaves will be chained to permit efficient sequential access +in index order. If the relation also happens to be sorted in index +order then this mode of access will be very efficient. Link +information will also be stored directly in the records to permit +efficient sequential access when it is not necessary or possible to use +the index. + +Assuming that there is at most one index associated with a relation, at +most two files will be required to implement the relation. The relation +itself will have the file extension ".db". The index file, if any, will +have the extension ".dbi". The root name of both files will be the +name of the relation. + +The datafile header structure will probably have to be maintained in +binary if we are to keep the overhead of datafile access to acceptable +levels for small relations. 
Careful design of the basic header +structure should make most future refinements to the header possible +without modification of existing databases. The revision number of +DBIO used to create the datafile will be saved in the header to make at +least detection of obsolete headers possible. + + + + -9- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + +3.4.1 STRUCTURE OF A BINARY RELATION + + Putting all this together we come up with the following structure +for a binary relation: + + + BOF + relation header -+ + magic | + dbio revision number | + creation date | + relation name | + number of attributes |- fixed size header + primary key | + record size | + domain list | + attribute list | + miscellaneous | + string buffer | + root block of index -+ + record 1 + physical record length (offset to next record) + logical record length (implies number of attributes set) + field storage + <gap> + record 2 + ... + record N + EOF + + +Vector valued fields with a fixed upper size will be stored directly in +the record, prefixed by the length of the actual vector (which may vary +from record to record). Storage for variant fields will be allocated +outside the record, placing only a pointer to the data vector and byte +count in the record itself. Variant records are thus reduced to fixed +size records, simplifying record access and making sequential access +more efficient. + +Records will change size only when a new attribute is added to an +existing relation, followed by assignment into a record written when +there were fewer attributes. If the new record will not fit into the +physical slot already allocated, the record is written at EOF and the +original record is deleted. Deletion of a record is achieved by +setting the logical record length to zero. Storage is not reclaimed +until a sort occurs, hence recovery of deleted records is possible. + +To minimize buffer space and memory to memory copies when accessing a +relation it is desirable to work directly out of the FIO buffers. To +make this possible records will not be permitted to straddle logical +block boundaries. A file block will typically contain several records +followed by a gap. The gap may be used to accomodate record expansion +without moving a record to EOF. The size of a file block is fixed when + + + -10- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + +the relation is created. + + + +3.4.2 THE ATTRIBUTE LIST + + Efficient lookup of attribute names suggests maintenance of a hash +table in the datafile header. There will be a fixed upper limit on the +number of attributes permitted in a single relation (but not on the +number of records). Placing an upper limit on the number of attributes +simplifies the software considerably and permits use of a fixed size +header, making it possible to read or update the entire header in one +disk access. There will also be an upper limit on the number of +domains, but the domain list is not searched very often hence a linear +search will do. + +All information about the decomposition of a record into fields, other +than the logical length of vector valued fields, is given by the +attribute list. Records contain only data with no embedded structural +information other than the length of the vector fields. New attributes +are added to a relation by appending to the attribute list. Existing +records are not affected. By comparing the logical length of a record +to the offset for a particular field we can tell whether storage has +been allocated for that field in the record. 
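
This test amounts to a single comparison. As a sketch in SPP (the
procedure name is a placeholder):

    # Has storage been allocated for a field in this record?  A field
    # is present iff its offset, taken from the attribute list, falls
    # within the logical length of the record.
    bool procedure db_hasfield (lrecl, offset)

    int     lrecl           # logical record length
    int     offset          # field offset, from the attribute list

    begin
            return (offset < lrecl)
    end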
+ +Domains are used to limit the range of values a field can take on in an +assignment, and to flag attribute comparisons which are likely to be +erroneous (e.g. order comparison of a pixel coordinate and a +wavelength). The domains "bool", "char", "short", etc. are +predefined. The following information must be stored for each user +defined domain: + + + name may be same as attribute name + datatype bool, char, short, etc. + physical vector length 0=variant, 1=scalar, N=vector + default default value, INDEF if not given + minimum mimimum value (ints and reals) + maximum maximum value (ints and reals) + enumval enumerated values (strings) + + +The following information is required to describe each attribute. The +attribute list is maintained separately from the hash table of attribute +names and can be used to regenerate the hash table of attribute names if +necessary. + + + name no embedded whitespace + domain index into domain table + offset offset in record + + + + + -11- +DBIO (Nov84) Database I/O Design DBIO (Nov84) + + + +All strings will will be stored in a fixed size string buffer in the +header area; it is the index of the string which is stored in the +domain and attribute lists. This eliminates the need to place an upper +limit on the size of domain names and enumerated value lists and makes +it possible for a single attribute name string to be referenced in both +the attribute list and the attribute hash table. + + + +4. SPECIFICATIONS diff --git a/sys/dbio/db2.hlp b/sys/dbio/db2.hlp new file mode 100644 index 00000000..ffe3b74c --- /dev/null +++ b/sys/dbio/db2.hlp @@ -0,0 +1,612 @@ +.help dbio Nov84 "Database I/O Design" +.ce +\fBIRAF Database I/O\fR +.ce +Doug Tody +.ce +November 1984 +.sp 3 +.nh +Introduction + + The IRAF database i/o package (DBIO) is a library of SPP callable +procedures used to create, modify, and access IRAF database files. +All access to these database files shall be indirectly or directly via the +DBIO interface. DBIO shall be implemented using IRAF file i/o and memory +management facilities, hence the package will be compact and portable. +The separate CL level package DBMS shall be provided for interactive database +access and for procedural access to the database from within CL scripts. +The DBMS tasks will access the database via DBIO. + +Virtually all runtime IRAF datafiles not maintained in text form shall be +maintained under DBIO, hence it is essential that the interface be both +efficient and compact. In particular, bulk data (images) and large catalogs +shall be maintained under DBIO. The requirement for flexibility in defining +and accessing IRAF image headers necessitates quite a sophisticated interface. +Catalog storage is required primarily for module intercommunication and +output of the results of the larger IRAF applications packages, but will also +be useful for accessing astronomical catalogs prepared outside IRAF (e.g., +the SAO star catalog). In short, virtually all IRAF applications packages +are expected to make use of DBIO; many will depend upon it heavily. + +The relationship of the DBIO and DBMS packages to each other and to the +related standard IRAF interfaces is shown in Figure 1.1. + + +.ks +.nf + DBMS + DBIO + FIO + MEMIO + (kernel) + (datafiles) + + (cl) | (vos) | (host) + + + +.fi +.ce +Fig 1.1 Major Interfaces +.ke + + +While images will be maintained under DBIO, access to the pixels will +continue to be provided by the IMIO interface. IMIO is a higher level interface +which will use DBIO to maintain the image header. 
Pixel storage will be either +in a separate pixel storage file or in the database file itself (as a one +dimensional array), depending on the size of the image. +A system defined thresold value will determine which type of storage is used. +The relationship of IMIO to DBIO is shown in Figure 1.2. + + +.ks +.nf + IMAGES + IMIO + DBIO + FIO + MEMIO + + (cl) | (vos) + + +.fi +.ce +Fig 1.2 Relationship of Database and Image I/O +.ke + +.nh +Requirements + + The requirements for the DBIO interface are driven by its intended usage +for image and catalog storage. It is arguable whether the same interface +should be used for both types of data, but development of an interface such +as DBIO with all the associated DBMS utilities is expensive, hence we would +prefer to have to develop only one such interface. Furthermore, it is desirable +for the user to only have to learn one such interface. The primary functional +and performance requirements which DBIO must meet are the following (in no +particular order). + +.ls +.ls [1] +DBIO shall provide a high degree of data independence, i.e., a program +shall be able to access a data structure maintained under DBIO without +detailed knowledge of its contents. +.le +.ls [2] +A DBIO datafile shall be self describing and self contained, i.e., it shall +be possible to examine the structure and contents of a DBIO datafile without +prior knowledge of its structure or contents. +.le +.ls [3] +DBIO shall be able to deal efficiently with records containing up to N fields +and with data groups containing up to M records, where N and M are at least +sysgen configurable and are order of magnitude N=10**2 and M=10**6. +.le +.ls [4] +The time required to access an image header under DBIO must be comparable +to the time currently required for the equivalent operation under IMIO. +.le +.ls [5] +It shall be possible for an image header maintained under DBIO to contain +application or user defined fields in addition to the standard fields +required by IMIO. +.le +.ls [6] +It shall be possible to dynamically add new fields to an existing image header +(or to any DBIO record). +.le +.ls [7] +It shall be possible to group similar records together in the database +and to perform global operations upon all or part of the records in a +group. +.le +.ls [8] +It shall be possible for a field of a record to be a one-dimensional array +of any of the primitive types. +.le +.ls [9] +Variant records (records containing variable size fields) shall be supported, +ideally without penalizing efficient access to databases which do not contain +such records. +.le +.ls [A] +It shall be possible to copy a record without knowledge of its contents. +.le +.ls [B] +It shall be possible to merge (join) two records containing disjoint sets of +fields. +.le +.ls [C] +It shall be possible to update a record in place. +.le +.ls [D] +It shall be possible to simultaneously access (retrieve, update, or insert) +multiple records from the same data group. +.le +.le + + +To summarize, the primary requirements are data independence, efficient access +to both large and small databases, and flexibility in the contents of the +database. + +.nh +Conceptual Design + + The DBIO database facilities shall be based upon the relational model. +The relational model is preferred due to its simplicity (to the user) +and due to the demonstrable fact that relational databases can efficiently +handle large amounts of data. 
In the relational model the database appears +to be nothing more than a set of \fBtables\fR, with no builtin connections +between separate tables. The operations defined upon these tables are based +upon the relational algebra, which is in turn based upon set theory. +The major advantages claimed for relational databases are the simplicity +of the concept of a database as a collection of tables, and the predictability +of the relational operators due to their being based on a formal theoretical +model. + +None of the requirements listed in section 2 state that DBIO must implement +a relational database. Most of our needs can be met by structuring our data +according to the relational data model (i.e., as tables), and providing a +good \fBselect\fR operator for retrieving records from the database. If a +semirelational database is sufficient to meet our requirements then most +likely that is what will be built (at least initially; the relational operators +are very attractive for data analysis). DBIO is not expected to be competitive +with any commercial relational database; to try to make it so would probably +compromise the requirement that the interface be compact. +On the other hand, the database requirements of IRAF are similar enough to +those addressed by commercial databases that we would be foolish not to try +to make use of some of the same technology. + + +.ks +.nf + \fBformal relational term\fR \fBinformal equivalents\fR + + relation table + tuple record, row + attribute field, column + domain datatype + primary key record id +.fi +.ke + + +A DBIO \fBdatabase\fR shall consist of one or more \fBrelations\fR (tables). +Each relation shall contain zero or more \fBrecords\fR (rows of the table). +Each record shall contain one or more \fBfields\fR (columns of the table). +All records in a relation shall share the same set of fields, +but all of the fields in a record need not have been assigned values. +When a new \fBattribute\fR (column) is added to an existing relation a default +valued field is added to each current and future record in the relation. + +Each attribute is defined upon a particular \fBdomain\fR, e.g., the set of +all nonnegative integer values less than or equal to 100. It shall be possible +to specify minimum and maximum values for integer and real attributes +and to enumerate the permissible values of a string type attribute. +It shall be possible to specify a default value for an attribute. +If no default value is given INDEF is assumed. +One dimensional arrays shall be supported as attribute types; these will be +treated as atomic datatypes by the relational operators. Array valued +attributes shall be either fixed in size (the most efficient form) or variant. +There need be no special character string datatype since one dimensional +arrays of type character are supported. + +Each relation shall be implemented as a separate file. If the relations +comprising a database are stored in a directory then the directory can +be thought of as the database. Public databases will be stored in well +known public (write protected) directories, private databases in user +directories. The logical directory name of each public database will be +the name of the database. Physical storage for a database need not necessarily +be allocated locally, i.e., a database may be centrally located and remotely +accessed if the host computer is part of a local area network. 
+ +Locking shall be at the level of entire relations rather than at the record +level, at least in the initial implementation. There shall be no support for +indices in the initial implementation except possibly for the primary key. +It should be possible to add either or both of these features to a future +implementation without changing the basic DBIO interface. Modifications to +the internal data structures used in database files will likely be necessary +when adding such a major feature, making a save and restore operation +necessary for each database file to convert it to the new format. +The save format chosen (e.g. FITS table) should be independent of the +internal format used at a particular time on a particular host machine. + +Images shall be stored in the database as individual records. +All image records shall share a common subset of attributes. +Related images (image records) may be grouped together to form relations. +The IRAF image operators shall support operations upon relations +(sets of images) much as the IRAF file operators support operations upon +sets of files. + +A unary image operator shall take as input a relation (set of one or more +images), inserting the processed images into the output relation. +A binary image operator shall take as input either two relations or a +relation and a record, inserting the processed images into the output +relation. In all cases the output relation can be an input relation as +well. The input relation will be defined either by a list or by selection +using a theta-join (operationally similar to a filename template). + +.nh 2 +Relational Operators + + DBIO shall support two basic types of database operations: operations upon +relations and operations upon records. The basic relational operators +are the following. All of these operators produce as output a new relation. + +.ls +.ls create +Create a new base relation (physical relation as stored on disk) by specifying +an initial set of attributes and the (file)name for the new relation. +Attributes and domains may be specified via a data definition file or by +reference to an existing relation. +A primary key (limited to a single attribute) should be identified. +The new relation initially contains no records. +.le +.ls drop +Delete a (possibly nonempty) base relation and any associated indices. +.le +.ls alter +Add a new attribute or attributes to an existing base relation. +Attributes may be specified explicitly or by reference to another relation. +.le +.ls select +Create a new relation by selecting records from one or more existing base +relations. Input consists of an algebraic expression defining the output +relation in terms of the input relations (usage will be similar to filename +templates). The output relation need not have the same set of attributes as +the input relations. The \fIselect\fR operator shall ultimately implement +all the basic operations of the relational algebra, i.e., select, project, +join, and the set operations. At a minimum, selection and projection are +required in the initial interface. The output of \fBselect\fR is not a +named relation (base relation), but is instead intended to be accessed +by the record level operators discussed in the next section. +.le +.ls edit +Edit a relation. An interactive screen editor is entered allowing the user +to add, delete, or modify tuples (not required in the initial version of +the interface). Field values are verified upon input. 
+.le +.ls sort +Make the storage order of the records in a relation agree with the order +defined by the primary key (the index associated with the primary key is +always sorted but index order need not agree with storage order). +In general, retrieval on a sorted relation is more efficient than on an +unsorted relation. Sorting also eliminates deadspace left by record +deletion or by updates involving variant records. +.le +.le + + +Additional nonalgebraic operators are required for examining the structure +and contents of relations, returning the number of records or attributes in +a relation, and determining whether a given relation exists. + +The \fIselect\fR operator is the primary user interface to DBIO. +Since most of the relational power of DBIO is bound up in the \fIselect\fR +operator and since \fIselect\fR will be driven by an algebraic expression +(character string) there is considerable scope for future enhancement +of DBIO without affecting existing code. + +.nh 2 +Record (Tuple) Level Operators + + While the user should see primarily operations on entire relations, +record level processing is necessary at the program level to permit +data entry and implementation of special operators. The basic record +level operators are the following. + +.ls +.ls retrieve +Retrieve the next record from the relation defined by \fBselect\fR. +While the tuples in a relation theoretically form an unordered set, +tuples will normally be returned in either storage order or in the sort +order of the primary key. Although all fields of a retrieved record are +accessible, an application will typically have knowledge of only a few fields. +.le +.ls update +Rewrite the (possibly modified) current record. The updated record is +written back into the base table from which it was read. Not all records +produced by \fBselect\fR can be updated. +.le +.ls insert +Insert a new record into an output relation. The output relation may be an +input relation as well. Records added to an output relation which is also +an input relation do not become candidates for selection until another +\fBselect\fR occurs. A retrieve followed by an insert copies a record without +knowledge of its contents. A retrieve followed by modification of selected +fields followed by an insert copies all unmodified fields of the record. +The attributes of the input and output relations need not match; unmatched +output attributes take on their default values and unmatched input attributes +are discarded. \fBInsert\fR returns a pointer to the output record, +allowing insertions of null records to be followed by initialization of +the fields of the new record. +.le +.ls delete +Delete the current record. +.le +.le + + +Additional operators are required to close or open a relation for record +level access and to count the number of records in a relation. + +.nh 3 +Constructing Special Relational Operators + + The record level operations may be combined with \fBselect\fR in compiled +programs to implement arbitrary operations upon entire relations. +The basic scenario is as follows: + +.ls +.ls [1] +The set of records to be operated upon, defined by the \fBselect\fR +operator, is opened as an unordered set (list) of records to be processed. +.le +.ls [2] +The "next" record in the relation is accessed with \fBretrieve\fR. +.le +.ls [3] +The application reads or modifies a subset of the fields of the record, +updating modified records or inserting the record in the output relation. 
+.le +.ls [4] +Steps [2] and [3] are repeated until the entire relation has been processed. +.le +.le + + +Examples of such operators are conversion to and from DBIO and LIST file +formats, column extraction, mimimum or maximum of an attribute (domain +algebra), and all of the DBMS and IMAGES operators. + +.nh 2 +Field (Attribute) Level Operators + + Substantial processing of the contents of a database is possible without +ever accessing the individual fields of a record. If field level access is +required the record must first be retrieved or inserted. Field level access +requires knowledge of the names of the attributes of the parent relation, +but not their exact datatypes. Automatic type conversion occurs when field +values are queried or set. + +.ls +.ls get +.sp +Get the value of the named scalar or vector field (typed). +.le +.ls put +.sp +Put the value of the named scalar or vector field (typed). +.le +.ls read +Read the named fields into an SPP data structure, given the name, datatype, +and length (if vector) of each field in the output structure. +There must be an attribute in the parent relation for each field in the +output structure. +.le +.ls write +Copy an SPP data structure into the named fields of a record, given the +name, datatype, and length (if vector) of each field in the input structure. +There must be an attribute in the parent relation for each field in the +input structure. +.le +.ls access +Determine whether a relation has the named attribute. +.le +.le + +.nh 2 +Storage Structures + + The DBIO storage structures are the data structures used by DBIO to +maintain relations in physical storage. The primary design goals are +simplicity and efficiency in time and space. Most actual relations are +expected to fall into three classes: + +.ls +.ls [1] +Relations containing only a single record, e.g., an image stored alone +in a relation. +.le +.ls [2] +Relations containing several dozen or several hundred records, e.g., +a collection of spectra from an observing run. +.le +.ls [3] +Large relations containing 10**5 or 10**6 records, e.g., the output of an +analysis program or an astronomical catalog. +.le +.le + + +Updates and insertions are generally random access operations; retrieval +based on the values of several attributes requires efficient sequential +access. Efficient random access for relations [2] and [3] requires use +of an index. Efficient sequential access requires that records be +accessible in storage order without reference to the index, i.e., that +records be chained in storage order. Efficient field access where a +record contains several dozen attributes requires something better than +a linear search over the attribute list. + +The use of an index shall be limited initially to a single index for +the primary key. The primary key will be restricted to a single attribute, +with the application defining the attribute to be used (in practice few +attributes are usable as keys). +The index will be a standard B+ tree, with one exception: the root block +of the tree will be maintained in dedicated storage in the datafile. +If and only if a relation grows so large that it overflows the root block +will a separate index file be allocated for the index. This will eliminate +most of the overhead associated with the index for small relations. + +Efficient sequential access will be provided in either of two ways: via the +index in index order or via the records themselves in storage order, +depending on the operation being performed. 
If an external index is used +the leaves will be chained to permit efficient sequential access in index +order. If the relation also happens to be sorted in index order then this +mode of access will be very efficient. Link information will also be stored +directly in the records to permit efficient sequential access when it is +not necessary or possible to use the index. + +Assuming that there is at most one index associated with a relation, +at most two files will be required to implement the relation. The relation +itself will have the file extension ".db". The index file, if any, will +have the extension ".dbi". The root name of both files will be the name of +the relation. + +The datafile header structure will probably have to be maintained in binary +if we are to keep the overhead of datafile access to acceptable levels for +small relations. Careful design of the basic header structure should +make most future refinements to the header possible without modification of +existing databases. The revision number of DBIO used to create the datafile +will be saved in the header to make at least detection of obsolete headers +possible. + +.nh 3 +Structure of a Binary Relation + + Putting all this together we come up with the following structure for +a binary relation: + + +.ks +.nf + BOF + relation header -+ + magic | + dbio revision number | + creation date | + relation name | + number of attributes |- fixed size header + primary key | + record size | + domain list | + attribute list | + miscellaneous | + string buffer | + root block of index -+ + record 1 + physical record length (offset to next record) + logical record length (implies number of attributes set) + field storage + <gap> + record 2 + ... + record N + EOF +.fi +.ke + + +Vector valued fields with a fixed upper size will be stored directly in the +record, prefixed by the length of the actual vector (which may vary from +record to record). +Storage for variant fields will be allocated outside the record, placing only +a pointer to the data vector and byte count in the record itself. +Variant records are thus reduced to fixed size records, +simplifying record access and making sequential access more efficient. + +Records will change size only when a new attribute is added to an existing +relation, followed by assignment into a record written when there were +fewer attributes. If the new record will not fit into the physical slot +already allocated, the record is written at EOF and the original record +is deleted. Deletion of a record is achieved by setting the logical record +length to zero. Storage is not reclaimed until a sort occurs, hence +recovery of deleted records is possible. + +To minimize buffer space and memory to memory copies when accessing a +relation it is desirable to work directly out of the FIO buffers. +To make this possible records will not be permitted to straddle logical +block boundaries. A file block will typically contain several records +followed by a gap. The gap may be used to accommodate record expansion +without moving a record to EOF. The size of a file block is fixed when +the relation is created. + +.nh 3 +The Attribute List + + Efficient lookup of attribute names suggests maintenance of a hash table +in the datafile header. There will be a fixed upper limit on the number of +attributes permitted in a single relation (but not on the number of records). 
+Placing an upper limit on the number of attributes simplifies the software +considerably and permits use of a fixed size header, making it possible to +read or update the entire header in one disk access. There will also be an +upper limit on the number of domains, but the domain list is not searched +very often hence a linear search will do. + +All information about the decomposition of a record into fields, other than +the logical length of vector valued fields, is given by the attribute list. +Records contain only data with no embedded structural information other than +the length of the vector fields. New attributes are added to a relation by +appending to the attribute list. Existing records are not affected. +By comparing the logical length of a record to the offset for a particular +field we can tell whether storage has been allocated for that field in the +record. + +Domains are used to limit the range of values a field can take on in an +assignment, and to flag attribute comparisons which are likely to be erroneous +(e.g. order comparison of a pixel coordinate and a wavelength). The domains +"bool", "char", "short", etc. are predefined. The following information +must be stored for each user defined domain: + + +.ks +.nf + name may be same as attribute name + datatype bool, char, short, etc. + physical vector length 0=variant, 1=scalar, N=vector + default default value, INDEF if not given + minimum mimimum value (ints and reals) + maximum maximum value (ints and reals) + enumval enumerated values (strings) +.fi +.ke + + +The following information is required to describe each attribute. +The attribute list is maintained separately from the hash table of attribute +names and can be used to regenerate the hash table of attribute names if +necessary. + + +.ks +.nf + name no embedded whitespace + domain index into domain table + offset offset in record +.fi +.ke + + +All strings will be stored in a fixed size string buffer in the header +area; it is the index of the string which is stored in the domain and +attribute lists. This eliminates the need to place an upper limit on the +size of domain names and enumerated value lists and makes it possible +for a single attribute name string to be referenced in both the attribute +list and the attribute hash table. + +.nh +Specifications diff --git a/sys/dbio/doc/dbio.hlp b/sys/dbio/doc/dbio.hlp new file mode 100644 index 00000000..4f163415 --- /dev/null +++ b/sys/dbio/doc/dbio.hlp @@ -0,0 +1,413 @@ +.help dbio Oct83 "Database I/O Specifications" +.ce +Specifications of the IRAF DBIO Interface +.ce +Doug Tody +.ce +October 1983 +.ce +(revised November 1983) + +.sh +1. Introduction + + The IRAF database i/o interface (DBIO) shall provide a limited but +highly extensible and efficient database capability for IRAF. DBIO datafiles +will be used in IRAF to implement image headers and to store the output +from analysis programs. The simple structure of a DBIO datafile, and the +self describing nature of the datafile, should make it easy to address the +problems of developing a query language, providing a CL interface, and +transporting datafiles between machines. + +.sh +2. Database Structure: the Data Dictionary + + An IRAF datafile, database file, or "data dictionary" is a set of +records, each of which must have a unique name within the dictionary, +but which may be defined in any time order and stored in the datafile +in any sequential order. 
Each record in the data dictionary has the +following external attributes: + +.ls 4 +.ls 12 name +The name of the record: an SPP style identifier, not to exceed 28 +characters in length. The name must be unique within the dictionary. +.le +.ls aliases +A record may be known by several names, i.e., several distinct dictionary +entries may actually point to the same physical record. The concept is +similar to the "link" attribute of the UNIX file system. The number +of aliases or links is immediately available, but determination of the +actual names of all the aliases requires a search of the entire dictionary. +.le +.ls datatype +One of the eight primitive datatypes ("bcsilrdx"), or a user defined, +fixed format structure, made up of primitive-type fields. In the case +of a structure, the structure is defined by a C-style structure declaration +given as a char type record elsewhere in the dictionary. The "datatype" +field of a record is one of the strings "b", "c", "s", etc. for the +primitive types, or the name of the record defining the structure. +.le +.ls value +The value of the dictionary entry is stored in the datafile in binary form +and is allocated a fixed amount of storage per record element. +.le +.ls length +Each record in the dictionary is potentially an array. The length field +gives the number of elements of type "datatype" forming the record. +New elements may be added by writing to "record_name[*]". +.le +.le + + +The values of these attributes are available via ordinary DBIO read +requests (but writing is not allowed). Each record in the dictionary +automatically has the following (user accessible) fields associated with it: + + +.ks +.nf + r_type char[28] ("b", "c",.. or record name) + r_nlinks long (initially 1) + r_len long (initially 1) + r_ctime long time of record creation + r_mtime long time of last modify +.fi +.ke + + +Thus, to determine the number of elements in a record, one would make the +following function call: + + nelements = dbgeti (db, "record_name.r_len") + + +.sh +2.1 Records and Fields + + The most complicated reference to an entry in the data dictionary occurs +when a record is structured and both the record and field of the record are +arrays. In such a case, a reference will have the form: + +.nf + "record[i].field[j]" most complex db reference +.fi + +Such a reference defines a unique physical offset in the datafile. +Any DBIO i/o transfer which does not involve an illegal type conversion +may take place at that offset. Normally, however, if the field is an array, +the entire array will be transferred in a single read or write request. +In that case the datafile offset would be specified as follows: + + "record[i].field" + +.sh +3. Basic I/O Procedures + + The basic i/o procedures are patterned after FIO and CLIO, with the +addition of a string type field ("reference") defining the offset in the +datafile at which the transfer is to take place. Sample reference fields +are given in the previous section. In most cases, the reference field +is merely the name of the record or field to be accessed, i.e., "im.ndim", +"im.pixtype", and so on. The "dbset" and "dbstat" procedures are used +to set or inspect DBIO parameters affecting the operation of DBIO itself, +and do not perform i/o on a datafile. 
+ + +.ks +.nf + db = dbopen (file_name, access_mode) + dbclose (db) + + dbset[ils] (db, parameter, value) + val = dbstat[ils] (db, parameter) + + val = dbget[bcsilrdx] (db, reference) + dbput[bcsilrdx] (db, reference, value) + + dbgstr (db, reference, outstr, maxch) + dbpstr (db, reference, string) + + nelems = dbread[csilrdx] (db, reference, buf, maxelems) + dbwrite[csilrdx] (db, reference, buf, nelems) +.fi +.ke + + +A new, empty database is created by opening with access mode NEW_FILE. +The get and put calls are functionally equivalent to those used by +the CL interface, down to the "." syntax used to reference fields. +The read and write calls are complicated by the need to be ignorant +about the actual datatype of a record. Hence we have added a type +suffix, with the implication that automatic type conversion will take +place if reasonable. This also eliminates the need to convert to and +from chars in the fourth argument, and avoids the need for a 7**2 type +conversion matrix. + + +.sh +4. Other DBIO Procedures + + A number of special purpose routines are provided for adding and +deleting dictionary entries, making links to create aliases, searching +a dictionary of unknown content, and so on. The calls are summarized +below: + + +.ks +.nf + stat = dbnextname (db, previous, outstr, maxch) + y/n = dbaccess (db, record_name, datatypes) + + dbenter (db, record_name, type, nreserve) + dblink (db, alias, existing_record) + dbunlink (db, record_name) +.fi +.ke + + +The semantics of these routines are explained in more detail below: + +.ls 4 +.ls 12 dbnextname +Returns the name of the next dictionary entry. If the value of the "previous" +argument is the null string, the name of the first dictionary entry is returned. +EOF is returned when the dictionary has been exhausted. +.le +.ls dbaccess +Returns YES if the named record exists and has one of the indicated datatypes. +The datatype string may consist of any of the following: (1) one or more +primitive type characters specifying the acceptable types, (2) the name of +a structure definition record, or (3) the null string, in which case only +the existence of the record is tested. +.le +.ls dbenter +Used to make a new entry in the dictionary. The "type" field is the name +of one of the primitive datatypes ("b", "c", etc.), or in the case of a +structure, the name of the record defining the structure. The "nreserve" +field specifies the number of elements of storage to be initially allocated +(more elements can always be added later). If nreserve is zero, no storage +is allocated, and a read error will result if an attempt is made to read +the record before it has been written. Storage allocated by dbenter is +initialized to zero. +.le +.ls dblink +Enter an alias for an existing entry. +.le +.ls dbunlink +Remove an alias from the dictionary. When the last link is gone, +the record is physically deleted and storage may be reclaimed. +.le +.le + + +.sh +5. Database Access from the CL + + The self describing nature of a datafile, as well as its relatively +simple structure, will make development of CL callable database query +utilities easy. It shall be possible to access the contents of a datafile +from a CL script almost as easily as one currently accesses the contents +of a parameter file. The main difference is that a separate process must be +spawned to access the database, but this process may contain any number of +database access primitives, and will sit in the CL process cache if frequently +used. 
The "onexit" call and F_KEEP FIO option in the program interface allow +the query task to keep one or more database files open for quick access, +until the CL disconnects the process. + +The ability to access the contents of a database from a CL script is crucial +if we are to be able to have data independent applications package modules. +The intention is that CL callable applications modules will not be written +for any particular instrument, but will be quite general. At the top level, +however, we would like to have a "canned" program which knows a lot about +an instrument, and which calls the more general package routines, passing +instrument specific parameters. + +This top level routine should be a CL script to provide maximum +flexibility to the scientist using the system at the CL level. Use of a +script is also required if modules from different packages are to be called +from a single high level module (anything else would imply poorly +structured code). +This requires that we be able to store arbitrary information in +image headers, and that this information be available in CL scripts. +DBIO will provide such a capability. + + + In addition to access from CL scripts, we will need interactive access +to datafiles at the CL level. The DBIO interface makes it easy to +provide such an interface. The following functions should be provided: +.ls 4 +.ls o +List the contents of a datafile, much as one would list the contents of +a directory. Thus, there should be a short mode (record name only), and +a long mode (including type, length, nlinks, date of last modify, etc.). +A one name per line mode would be useful for creating lists. Pattern +matching would be useful for selecting subsets. +.le +.ls o +List the contents of a record or list of records. List the elements of +an array, possibly for further processing by the LISTS package. In the +case of a record which is an array of structures, print the values of +selected fields as a table for further processing by the LISTS utilities. +And so on. +.le +.ls o +Edit a record. +.le +.ls o +Delete a record. +.le +.ls o +Copy a record or set of records, possibly between two different datafiles. +.le +.ls o +Copy an array element or range of array elements, possibly between two +different records or two different records in different datafiles. +.le +.ls o +Compress a datafile. DBIO probably will not reclaim storage online. +A separate compress operation will be required to reclaim storage in +heavily edited datafiles, and to consolidate fragmented arrays. +.le +.ls o +And more I'm sure. +.le +.le + +.sh +6. DBIO and Imagefiles + + As noted earlier, DBIO will be used to implement the IRAF image header +structure. An IRAF imagefile is composed of two parts: the image header +structure, and the pixel storage file. Only the name of the pixel storage +file for an image will be kept in the image header; the pixel storage file +is always a separate file, which indeed usually resides on a different +filesystem. The pixel storage file is often far larger than the image +header, though the reverse may be true in the case of small one dimensional +spectra or other small images. The DBIO format image header file is +usually not very large and will normally reside in the user's directory +system. The pixel storage file is created and managed by IMIO transparently +to the user and to DBIO. 
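+
+Since the image header is itself an ordinary DBIO record, IMIO can obtain
+the name of the pixel storage file with an ordinary DBIO string read, as
+sketched below.  The "pixfile" field name is invented for illustration;
+the actual field names are fixed by the standard image header structure
+declaration.
+
+.ks
+.nf
+	# Fetch the name of the pixel storage file from the image header.
+	dbgstr (db, "im.pixfile", pixfile, SZ_FNAME)
+.fi
+.ke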
+
+
+.ks
+.nf
+		applications program
+
+
+
+			IMIO
+
+
+
+			DBIO
+
+
+
+			FIO
+
+
+	Structure of a program which accesses images
+.fi
+.ke
+
+
+It shall be possible for a single datafile to contain any number of
+image header structures.  The standard image header shall be implemented
+as a regular DBIO structured record, defined in a structure declaration
+file in the system library directory "lib$".
+
+.sh
+7. Transportability
+
+    The datafile is an essential part of IRAF, and it is essential that
+we be able to transport datafiles between machines.  The self describing
+nature of datafiles makes this straightforward, provided programmers do
+not store structures in the database in binary.  Binary arrays, however,
+are fine, since they are completely defined.
+
+A datafile must be transformed into a machine independent form for transport
+between machines.  The independence of the records in a datafile, and the simple
+structure of a record, should make transmission of a datafile in tabular
+form (ASCII card image) straightforward.  We shall use the tables extension
+to FITS to transport DBIO datafiles.  A simple unstructured record can
+be represented in the form 'keyword = value' (with some loss of information),
+while a structured record can be represented as a FITS table, given the
+restriction of the fields of a record to the primitive types.
+
+.sh
+8. Implementation Strategies
+
+    Each data dictionary shall consist of a single random access file, the
+"datafile".  The dictionary shall be indexed by a B-tree containing the
+28 character packed name of each record and a 4 byte integer giving the offset
+of either the next block in the B-tree, or of the "inode" structure describing
+the record, for a total of 32 bytes per index entry.  If a record has several
+aliases, each will have a separate entry in the index and all will point to
+the same inode structure.  The size of a B-tree block shall be variable (but
+fixed for a given datafile), and in the case of a typical image header, shall
+be chosen large enough so that the index for the entire image header can be
+contained in a single B-tree block.  The entries within an index block shall
+be maintained in sorted order and entries shall be located by a binary search.
+
+Each physical record or array of records in the datafile shall be described
+by a unique binary inode structure.  The inode structure shall define the
+number of links to the record, the datatype, size, and length of the record,
+the dates of creation and last modify, the offset of the record in the
+datafile (or the offset of the index block in the case of an array of records),
+and so on.  The inode structures shall be stored in the datafile as a
+contiguous array of records; the inode array may be stored at any offset in
+the datafile.  Overflow of the inode array will be handled by moving the
+array to the end of the file and doubling its size.
+
+New records shall be added to the datafile by appending to the end of the file.
+No attempt shall be made to align records on block boundaries within the
+datafile.  When a record is deleted space will not be reclaimed, i.e.,
+deletion will leave an invisible 'hole' in the datafile (a utility will be
+available for compacting fragmented datafiles).  Array structured records
+shall in general be stored noncontiguously in the datafile, though
+DBIO will try to avoid excessive fragmentation.  The locations of the sections
+of a large array of records shall be described by a separately allocated index
+block.
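+
+To make the preceding more concrete, the index entry and inode structures
+might be laid out roughly as follows.  This is only a sketch; the field
+names and ordering are invented for illustration (the r_* fields mirror
+the user accessible fields described earlier) and are not final:
+
+.ks
+.nf
+	# B-tree index entry: 32 bytes total.
+	i_name		char[28]	# packed record name
+	i_offset	long		# next B-tree block or inode
+
+	# Inode: one per physical record or array of records.
+	r_nlinks	long		# number of links (aliases)
+	r_type		char[28]	# datatype or structure name
+	r_len		long		# number of elements
+	r_ctime		long		# time of record creation
+	r_mtime		long		# time of last modify
+	r_offset	long		# offset of record data or index block
+.fi
+.ke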
+ +DBIO will probably make use of the IRAF file i/o (FIO) buffer cache feature to +access the datafile. FIO permits both the number and size of the buffers +used to access a file to be set by the caller at file open time. +Furthermore, the FIO "reopen" call can be used to establish independent +buffer caches for the index and inode blocks and for the data records, +so that heavy data array accesses do not flush out the index blocks, even +though both are stored in the same file. Given the sophisticated buffering +capabilities of FIO, DBIO need only make FIO seek and read/write calls to access +both inode and record data, explicitly buffering only the B-tree index block +currently being searched. + +On a virtual machine a single FIO buffer the size of the entire datafile can +be allocated and mapped onto the file, to take advantage of virtual memory +without compromising transportability. DBIO would still use FIO seek, read, +and write calls to access the file, but no FIO buffer faults would occur +unless the file were extended. The current FIO interface does not provide +this feature but it can easily be added in the future without modification +to the FIO interface, if it is proved that there is anything to be gained. + +By carefully configuring the buffer cache for a file, it should be possible +to keep the B-tree index block and inode array for a moderate size datafile +buffered most of the time, limiting the number of disk accesses required to +access a small record to much less than one on the average, without limiting +the ability of DBIO to access very large dictionaries. For example, given +a dictionary of one million entries and a B-tree block size of 128 entries +(4 KB), only 4 disk accesses would be required to access a primitive record +in the worst case (no buffer hits). Very small datafiles, i.e. most image +headers, would be completely buffered all of the time. + +The B-tree index scheme, while very efficient for random record access, +is also well suited to sequential accesses ("dbnextname()" calls). A +straightforward dictionary copy operation using dbnextname, which steps +through the records of a dictionary in alphabetical order, would +automatically transpose the dictionary into the most efficient form for +future alphabetical or clustered accesses, reclaiming storage and +consolidating fragmented arrays in the process. + +The DBIO package, like FIO and IMIO, will dynamically allocate all buffer +space needed to access a datafile at runtime. The number of datafiles +which can be simultaneously accessed by a single program is limited primarily +by the maximum number of open files permitted a process by the OS. diff --git a/sys/dbio/new/coords b/sys/dbio/new/coords new file mode 100644 index 00000000..803ef3c7 --- /dev/null +++ b/sys/dbio/new/coords @@ -0,0 +1,73 @@ +.nh 4 +World Coordinates + + In general, an image may simultaneously have any number of world coordinate +systems associated with it. It would be quite awkward to try to store an +arbitrary number of WCS descriptors in the image header, so a separate WCS +relation is used instead. If world coordinates are not used no overhead is +incurred. + +Maintenance of the WCS descriptor, transformations of the WCS itself (e.g., +when an image changes spatially), and coordinate transformations using the WCS +are all managed by the WCS package. This will be a general purpose package +usable not only in IMIO but also in GIO and other places. 
IMIO will be
+responsible for copying the WCS records for an image when a new image is
+created, as well as for correcting the WCS for the effects of subsampling,
+etc. when a section of an image is mapped.
+
+The WCS package will include support for both linear and nonlinear coordinate
+systems.  Each WCS is described by a mapping from pixel space to WCS space
+consisting of a general nonlinear transformation followed by a linear
+transformation.  Either or both of the transformations may be unitary if
+desired, e.g., the simple linear transformation is supported as a special case.
+.ls 4
+.ls 12 image
+The name (value of the \fIimage\fR key in the image header) of the image
+for which the WCS is defined.
+.le
+.ls nlnterm
+A flag specifying whether the WCS includes a nonlinear term.
+.le
+.ls invterm
+A flag specifying whether the WCS includes an inverse nonlinear term.
+If a forward nonlinear transformation is defined but no inverse transformation
+is given, coordinate transformations from WCS space to pixel space may be
+inefficient or impossible.
+.le
+.ls linterm
+A flag specifying whether the WCS includes a linear term.
+.le
+.ls fwdtran
+The interpreter program for the forward nonlinear transformation.
+.le
+.ls invtran
+The interpreter program for the inverse nonlinear transformation.
+.le
+.ls lintran
+A floating point array describing the linear transformation.
+.le
+.le
+
+
+Nonlinear transformations are described by small user supplied \fIprograms\fR
+written in a simple RPN language entered as a variable length character string.
+The RPN language will include builtin intrinsic functions for all the standard
+trigonometric and hyperbolic functions, plus builtin functions for the common
+nonlinear transformations.  The advantage of this scheme is that the
+standard transformations are supported very efficiently without sacrificing
+generality.  Even nonstandard nonlinear functions can be computed quite
+efficiently since the runtime overhead of an RPN interpreter can be made quite
+small compared to the time required to evaluate the trigonometric and other
+functions typically used in a nonlinear function.
+
+Implementation of the WCS as a nonlinear function plus a linear function
+makes it trivial for IMIO to automatically update the WCS when a linear
+transformation is applied to the image (the nonlinear term of the WCS will
+not be affected by a linear transformation of the image).
+Implementation of the nonlinear term as a program encoded as a character
+string permits modification of the nonlinear term by \fIconcatenation\fR
+of another nonlinear function, also represented as a character string.
+In other words, the final mapping is given by successive application of
+a series of nonlinear transformations, followed by the linear transformation.
+Hence the WCS may be updated to reflect a subsequent linear or nonlinear
+transformation of the image, regardless of the nature of the original WCS.
diff --git a/sys/dbio/new/dbio.con b/sys/dbio/new/dbio.con
new file mode 100644
index 00000000..9adc7d6c
--- /dev/null
+++ b/sys/dbio/new/dbio.con
@@ -0,0 +1,202 @@
+		     IRAF Database I/O Design
+			   Contents
+
+
+
+1. PREFACE
+
+    1.1 Scope of this Document
+    1.2 Relationship to Previous Documents
+
+
+2. INTRODUCTION
+
+    2.1 The Database Subsystem
+    2.2 Major Subsystems
+    2.3 Related Subsystems
+
+
+3. REQUIREMENTS
+
+    3.1 General Requirements
+        3.1.1 Portability
+        3.1.2 Efficiency
+        3.1.3 Code Size
+        3.1.4 Use of Proprietary Software
+
+    3.2 Special Requirements
+        3.2.1 Catalog Storage
+        3.2.2 Image Storage
+        3.2.3 Intermodule Communication
+        3.2.4 Data Archiving
+
+    3.3 Other Requirements
+        3.3.1 Concurrency
+        3.3.2 Recovery
+        3.3.3 Data Independence
+        3.3.4 Host Database Interface
+
+
+4. CONCEPTUAL DESIGN
+
+    4.1 Terminology
+    4.2 System Architecture
+
+    4.3 The DBMS Package
+        4.3.1 Overview
+        4.3.2 Procedural Interface
+            4.3.2.1 General Operators
+            4.3.2.2 Form Based Data Entry and Retrieval
+            4.3.2.3 List Interface
+            4.3.2.4 FITS Table Interface
+            4.3.2.5 Graphics Interface
+        4.3.3 Command Language Interface
+        4.3.4 Record Selection Syntax
+        4.3.5 Query Language
+            4.3.5.1 Query Language Functions
+            4.3.5.2 Language Syntax
+            4.3.5.3 Sample Queries
+        4.3.6 DB Kernel Operators
+            4.3.6.1 Dataset Copy and Load
+            4.3.6.2 Rebuild Dataset
+            4.3.6.3 Mount Foreign Dataset
+            4.3.6.4 Crash Recovery
+
+    4.4 The IMIO Interface
+        4.4.1 Overview
+        4.4.2 Logical Schema
+            4.4.2.1 Standard Fields
+            4.4.2.2 History Text
+            4.4.2.3 World Coordinates
+            4.4.2.4 Histogram
+            4.4.2.5 Bad Pixel List
+            4.4.2.6 Region Mask
+        4.4.3 Group Data
+        4.4.4 Image I/O
+            4.4.4.1 Image Templates
+            4.4.4.2 Image Pixel Access
+            4.4.4.3 Image Database Interface (IDBI)
+        4.4.5 Summary of IMIO Data Structures
+
+    4.5 The DBIO Interface
+        4.5.1 Overview
+        4.5.2 Comparison of DBIO and Commercial Databases
+        4.5.3 Query Language Interface
+        4.5.4 Logical Schema
+            4.5.4.1 Databases
+            4.5.4.2 System Tables
+            4.5.4.3 The System Catalog
+            4.5.4.4 Relations
+            4.5.4.5 Attributes
+            4.5.4.6 Domains
+            4.5.4.7 Groups
+            4.5.4.8 Views
+            4.5.4.9 Null Values
+        4.5.5 Data Definition Language
+        4.5.6 Record Select/Project Expressions
+            4.5.6.1 Introduction
+            4.5.6.2 Basic Syntax
+            4.5.6.3 Examples
+            4.5.6.4 Evaluation
+        4.5.7 Operators
+            4.5.7.1 General Operators
+            4.5.7.2 Record Level Access
+            4.5.7.3 Field Level Access
+            4.5.7.4 Variable Length Fields
+            4.5.7.5 IMIO Support
+        4.5.8 Constructing Special Relational Operators
+        4.5.9 Storage Structures
+
+    4.6 The DBKI Interface (DB Kernel Interface)
+        4.6.1 Overview
+            4.6.1.1 Default Kernel
+            4.6.1.2 Host Database Interface
+            4.6.1.3 Network Support
+        4.6.2 Logical Schema
+            4.6.2.1 System Tables
+            4.6.2.2 User Tables
+            4.6.2.3 Indexes
+            4.6.2.4 Record Structure
+        4.6.3 Database Management Operators
+            4.6.3.1 Database Creation and Deletion
+            4.6.3.2 Database Access
+            4.6.3.3 Table Creation and Deletion
+            4.6.3.4 Index Creation and Deletion
+        4.6.4 Record Access Methods
+            4.6.4.1 Direct Access via an Index
+            4.6.4.2 Direct Access via the Record ID
+            4.6.4.3 Sequential Access
+        4.6.5 Record Access Operators
+            4.6.5.1 Fetch
+            4.6.5.2 Update
+            4.6.5.3 Insert
+            4.6.5.4 Delete
+            4.6.5.5 Variable Length Fields
+
+    4.7 The DBK (IRAF DB Kernel)
+        4.7.1 Overview
+        4.7.2 Storage Structures
+            4.7.2.1 Database
+            4.7.2.2 System Catalog
+            4.7.2.3 Table Storage
+        4.7.3 The Control Interval
+            4.7.3.1 Introduction
+            4.7.3.2 Shared Intervals
+            4.7.3.3 Private Intervals
+            4.7.3.4 Record Insertion and Update
+            4.7.3.5 Record Deletion
+            4.7.3.6 Adding New Fields
+            4.7.3.7 Array Storage
+            4.7.3.8 Rebuilding a Dataset
+        4.7.4 Indexes
+            4.7.4.1 Nonindexed Tables
+            4.7.4.2 Primary Index
+            4.7.4.3 Secondary Indexes
+            4.7.4.4 Key Compression
+        4.7.5 Host Database Interface (HDBI)
+        4.7.6 Concurrency
+        4.7.7 Backup and Transport
+        4.7.8 Accounting Services
+        4.7.9 Crash Recovery
+
+
+5. SPECIFICATIONS
+
+    5.1 DBMS Package
+        5.1.1 Overview
+        5.1.2 Module Specifications
+
+    5.2 IMIO Interface
+        5.2.1 Overview
+        5.2.2 Examples
+        5.2.3 Module Specifications
+            5.2.3.1 Image Header Access
+            5.2.3.2 History Text
+            5.2.3.3 World Coordinates
+            5.2.3.4 Bad Pixel List
+            5.2.3.5 Region Mask
+            5.2.3.6 Pixel Access
+        5.2.4 Storage Structures
+            5.2.4.1 IRAF Runtime Format
+            5.2.4.2 Archival Format
+            5.2.4.3 Other Formats
+
+    5.3 DBIO (DataBase I/O interface)
+        5.3.1 Overview
+        5.3.2 Examples
+        5.3.3 Module Specifications
+
+    5.4 DBKI (DB Kernel Interface)
+        5.4.1 Overview
+        5.4.2 Module Specifications
+
+    5.5 DBK (IRAF DB Kernel)
+        5.5.1 Overview
+        5.5.2 Storage Structures
+        5.5.3 Error Recovery
+        5.5.4 Concurrency
+
+6. SUMMARY
+
+Glossary
+Index
diff --git a/sys/dbio/new/dbio.hlp b/sys/dbio/new/dbio.hlp
new file mode 100644
index 00000000..d5d9c77f
--- /dev/null
+++ b/sys/dbio/new/dbio.hlp
@@ -0,0 +1,3202 @@
+.help dbss Sep85 "Design of the IRAF Database Subsystem"
+.ce
+\fBDesign of the IRAF Database Subsystem\fR
+.ce
+Doug Tody
+.ce
+September 1985
+.sp 2
+
+.nh
+Preface
+
+    The primary purpose of this document is to define the interfaces comprising
+the IRAF database i/o subsystem to the point where they can be built rapidly
+and efficiently, with confidence that major changes will not be required after
+implementation begins.  The document also serves to inform all interested
+parties of what is planned while there is still time to change the design.
+A change which can easily be made to the design prior to implementation may
+become prohibitively expensive as implementation proceeds.  After implementation
+is completed and the new subsystem has been in use for several months the basic
+interfaces will be frozen and the opportunity for change will have passed.
+
+The description of the database subsystem presented in this document should
+be considered to be no more than a close approximation to the system which
+will actually be built.  The specifications of the interface can be expected
+to change in detail as the implementation proceeds.  Any code which is written
+according to the interface specifications presented in this document may have
+to be modified slightly before system testing with the final interfaces can
+proceed.
+
+.nh 2
+Scope of this Document
+
+    The scope of this document is the conceptual design and specification of
+all IRAF packages and i/o interfaces directly involved with either user or
+program access to binary data maintained in mass storage.  Versions of some
+of the interfaces described are already in use; when this is the case it will
+be noted in the text.
+
+This document is neither a user's guide nor a reference manual.  The reader
+is assumed to be familiar with both database technology and with the IRAF
+system.  In particular, the reader should be familiar with the concept of the
+IRAF VOS (virtual operating system), with the features of the IMIO (image i/o),
+FIO (file i/o), and OS (host system interface) interfaces, as well as with the
+architecture of the network interface.
+
+.nh 2
+Relationship to Previous Documents
+
+    This document supersedes the document "IRAF Database I/O", November 1984.
+Most of the concepts presented in that document are still valid but have been
+expanded upon greatly in the present document.  The scope of the original
+document was limited to the DBIO interface alone, whereas the scope of the
+present document has been expanded to encompass all subsystems or packages
+directly involved with binary data access. 
This expansion in the scope of +the project was necessary to meet our primary goal of completing and freezing +the program interface, of which DBIO is only a small part. Furthermore, it +is difficult to have confidence in the design of a single subsystem without +working out the details of all closely related or dependent subsystems. + +In addition to expanding the scope of the database design project to cover +more interfaces, the requirements which the database subsystem must meet have +been expanded since the original conceptual design was done. In particular +it has become clear that data format conversions are prohibitively expensive +for our increasingly large datasets. Conversions such as those between FITS +and internal format (for an image), or between FITS table and internal format +(for a database) are too expensive to be performed routinely. Data which is +archived in a machine independent format should not have to be reformatted +to be accessed by the online system. The archival format may vary from site +to site and it should be possible to read the different formats without +reformatting the data. Large datasets should not have to be reformatted to +be moved between machines with incompatible binary data formats. + +A change such as this in the requirements for an interface can have a major +impact on the design of the final interface. It is essential that all such +requirements be identified and dealt with in the design before implementation +begins. + +.nh +Introduction + + In this section we introduce the database subsystem and summarize the +reasons why we need such a system. We then introduce the major components +of the database subsystem and briefly mention some related subsystems. + +.nh 2 +The Database Subsystem + + The database subsystem (DBSS) is conceived as a single comprehensive system +to be used to manage and access all binary (non textfile) data accessed by IRAF +programs. Simple applications are perhaps most easily and flexibly dealt with +using text files for the storage of data, descriptors, and control information. +As the amount of data to be processed grows or as the data structures to be +accessed grow in complexity, however, the text file approach becomes seriously +inefficient and cumbersome. Converting the text files to binary files makes +processing more efficient but does little to address the problems of complex +data structures. Efficient access to complex data structures requires complex +and expensive software. Developing such software specially for each and every +application is prohibitively expensive in a large system; hence the need for +a general purpose database system becomes clear. + +Use of a single central database system has significant additional advantages. +A standard user interface can be used to examine, edit, list, copy, etc., all +data maintained under the database system. Many technical problems may be +addressed in a general purpose system that would be too expensive to address +in a particular application, e.g., the problems of storing variable size data +elements, of dynamically and randomly updating a dataset, of byte packing to +conserve storage, of maintaining indexes so that a record may be found +efficiently in a large dataset, of providing data independence so that storage +formats may be changed without need to change the program accessing the data, +and of transport of binary datasets between incompatible machines. 
All of +these are examples of problems which are \fInot\fR adequately addressed by the +current IRAF i/o interfaces nor by the applications programs which use them. + +.nh 2 +Major Subsystems + + The major subsystems comprising the IRAF DBSS are depicted in Figure 1. +At the highest level are the CL (command language) packages, each of which +consists of a set of user callable tasks. The IMAGES package (consisting +of general image processing operators) is shown for completeness but since +there are many such packages in the system they are not considered part of +the DBSS and will not be discussed further here. +The DBMS (database management) package is the user interface to the DBSS, +and some day will possibly be the largest part of the DBSS in terms of number +of lines of code. + +In the center of the figure we see the VOS (virtual operating system) packages +IMIO, DBIO and FIO. FIO (file i/o) is the standard IRAF file interface and +will not be discussed further here. IMIO (image i/o) and DBIO (database i/o) +are the two major i/o interfaces in the DBSS and are the topic of much of the +rest of this document. IMIO and DBIO are the two parts of the DBSS of interest +to applications programmers; these interfaces are implemented as libraries of +subroutines to be called directly by the applications program. IMIO and FIO +are existing interfaces. + +At the bottom of the figure is the DB Kernel. The DB Kernel is the component +of the DBSS which physically accesses the data in mass storage (via FIO). +The DB Kernel is called only by DBIO and hence is invisible to both the user +and the applications programmer. There is a lot more to the DB Kernel than +is evident from the figure, and indeed the DB Kernel will be the subject of +another figure when we discuss the system architecture in section 4.2. + + +.ks +.nf + DBMS IMAGES(etc) (CL) + \ / + \ / --------- + \ / + \ IMIO + \ / \ + \ / \ + \/ \ (VOS) + DBIO FIO + | + | + | --------- + | + | + (DB Kernel) (VOS or Host System) + +.fi +.ce +Figure 1. Major Components of the Database Subsystem +.ke + + +With the exception of certain optional subsystems to be outlined later, +the entire DBSS is machine independent and portable. The IRAF system may +be ported to a new machine without any knowledge whatsoever of the +architecture or functioning of the DBSS. + +.nh 2 +Related Subsystems + + Several additional IRAF subsystems or packages are of interest from the +standpoint of the DBSS. These are the PLOT package, the graphics interface +GIO, and the LISTS package. + +The PLOT package is a CL level package consisting of general plotting +utilities. In general PLOT tasks can accept input in a number of standard +formats, e.g., \fBlist\fR (text file) format and \fBimagefile\fR format. +The DBSS will provide an additional standard format which should perhaps be +directly accessible by the PLOT tasks. Even if this is not done a very +general plotting capability will automatically be provided by "piping" the +list format output of a DBMS task to a PLOT task. Additional graphics +capabilities will be provided as built in functions in the DBMS +\fBquery language\fR, which will access GIO directly to make plots. +The query language graphics facilities will be faster and more convenient +to use but less extensive and less sophisticated than those provided by PLOT. + +The LISTS package is interesting because the facilities provided and operations +performed resemble those provided by the DBMS package in many respects. 
+The principal difference between the two packages is that the LISTS package
+operates on arbitrary text files whereas the DBMS package operates only
+upon DBIO format binary files.  The textual output of \fIany\fR IRAF or
+non-IRAF program may serve as input to a LISTS operator, as may any ordinary
+text file, e.g., the source files for a program or package.  A typical LISTS
+database is a directory full of source files or documentation; LISTS can also
+operate on tables of numbers but the former application is perhaps more
+common.  Using LISTS it is possible to conveniently and rapidly perform
+operations (evaluate queries) which would be cumbersome or impossible to
+perform with a conventional database system such as DBMS.  On the other hand,
+the LISTS operators would be hopelessly inefficient for the types of
+applications for which DBMS is designed.
+
+.nh
+Requirements
+
+    Requirements define the problem to be solved by a software system.
+There are two types of requirements, non-functional requirements, i.e.,
+restrictions or constraints, and functional requirements, i.e., the functions
+which the system must perform.  Since nearly all IRAF science software will
+be heavily dependent on the DBSS, the requirements for this subsystem are as
+strict as those for any subsystem in IRAF.
+
+.nh 2
+General Requirements
+
+    The general requirements which the DBSS must satisfy primarily take the
+form of constraints or restrictions.  These requirements are common to
+all mainline IRAF system software.  Note that these requirements are \fInot\fR
+automatically enforced for all system software.  If a particular subsystem is
+prototype or optional (not required for the normal functioning of IRAF) then
+these requirements can be relaxed.  In particular, certain parts of the DBSS
+(e.g., the host database interface) are optional and are not subject
+to the same constraints as the mainline software.  The primary functional
+requirements discussed in section 3.2, however, must be met by software which
+satisfies all of the general requirements discussed here.
+
+.nh 3
+Portability
+
+    All software in the DBMS, IMIO, and DBIO interfaces and in the DB kernel
+must be fully portable under IRAF.  To meet this requirement the software
+must be written in the IRAF SPP language using only the facilities provided
+by the IRAF VOS.  In particular, this rules out complicated record locking
+schemes in the DB kernel, as well as any type of centralized database server
+which relies on process control, IPC, or signal handling facilities not
+provided by the IRAF VOS.  For most processes the requirement is even more
+strict, i.e., ordinary IRAF processes are not permitted to rely upon the VOS
+process control or IPC facilities for their normal functioning (the IPC
+connection to the CL is an exception since it is not required to run an
+IRAF process standalone).
+
+.nh 3
+Efficiency
+
+    The database interface must be efficient, particularly when used for
+image access and intermodule communication.  There are as many ways to
+measure the efficiency of an interface as there are applications for the
+interface, and we cannot address them all here.  The dimensions of the
+efficiency matrix we are concerned with here are the cpu time consumed
+during execution, the clock time consumed during execution, e.g., the number
+of file opens and disk seeks required, and the disk space consumed for
+table storage.  Where necessary efficient cpu utilization will be achieved
+at the expense of memory requirements for code and buffers. 
+
+A simple and well defined efficiency requirement is that the cpu and clock
+time required to access the pixels of an image stored in the database from
+a "cold start" (no open files) must not noticeably exceed that required
+by the old IMIO interface.  The efficiency of the new interface for the
+case when many images are to be accessed is expected to be a major improvement
+over that provided by the old IMIO interface, since the old interface
+stores each image in two separate files, whereas the new interface will
+be capable of storing the entire contents of many (small) images in a single
+file.  The amount of disk space required for image header storage is also
+expected to decrease by a large factor when multiple images are stored
+in a single physical file.
+
+.nh 3
+Code Size
+
+    We have already established that a process must directly access the
+database in mass storage to meet our portability and efficiency requirements.
+This type of access requires that the necessary IMIO, DBIO and DB Kernel
+routines be linked into each process requiring database access.  Minimizing
+the amount of text space used by the database code is desirable to minimize
+disk and memory requirements and process spawn time, but is not critical
+since memory is cheap and plentiful and is likely to become even cheaper
+and more plentiful in the future.  Furthermore, the multitask nature of
+IRAF processes allows the text segment used by the database code to be shared
+by many tasks, saving both disk and memory.
+
+The main problem remaining today with large text segments seems to be the
+process spawn time; loading the text segment by demand paging in a virtual
+memory environment can be quite slow.  The fault here seems to lie more with
+the operating system than with IRAF, and probably the solution will require
+tuning either the IRAF system interface or the operating system itself.
+
+Taking all these factors into account it would seem that typical memory
+requirements for the executable database code (not including data buffers)
+in the range 50 to 100 Kb would be acceptable, with 50 Kb being a reasonable
+goal.  This would make the database interface the largest i/o interface in
+IRAF but that seems inevitable considering the complexity of the problem to
+be solved.
+
+.nh 3
+Use of Proprietary Software
+
+    A mainline IRAF interface, i.e., any interface required for the normal
+operation of the system, must belong to IRAF and must be distributed with
+the IRAF system at no additional charge and with no licensing restrictions.
+The source code must be part of the system and is subject to strict
+configuration control by the IRAF group, i.e., the IRAF group is responsible
+for the software and must control it.  This rules out the use of a commercial
+database system for any essential part of the DBSS, but does not rule out
+IRAF access to a commercial database system provided such access is optional,
+i.e., not required for the operation of the standard applications packages.
+The host database interface provided by the DB kernel is an example of such
+an interface.
+
+.nh 2
+Special Requirements
+
+    In this section we present the functional requirements of the DBSS.
+The major applications for which the DBSS is intended are described and
+the desirable characteristics of the DBSS for each application are outlined.
+The major applications thus far identified are catalog storage, image storage,
+intermodule communication, and data archiving. 
+
+.nh 3
+Catalog Storage
+
+    The catalog storage application is probably the closest thing in IRAF to a
+conventional database application.  A catalog is a set of records, each of
+which describes a single object.  Each record consists of a set of fields
+of various datatypes describing the attributes of the object.  A record is
+produced by numerical analysis of the object represented as a region of a
+digital array.  All records have the same structure, i.e., set of fields;
+often the records are all the same size (but not necessarily).  A large catalog
+might contain several hundred thousand records.  Examples of such catalogs are
+the SAO star catalog, the IRAS point source catalog, and the catalogs produced
+by analysis programs such as FOCAS (a faint object detection and classification
+program) and RICHFLD (a digital stellar photometry program).  Many similar
+examples can be identified.
+
+Generation of such a catalog by an analysis program is typically a cpu bound
+batch operation requiring many hours of computer time for a large catalog.
+Once the catalog has been generated there are typically numerous questions of
+scientific interest which can be answered using the data in the catalog.
+It is highly desirable that this phase of the analysis be interactive and
+spontaneous, as one question will often lead to another in an unpredictable
+fashion.  A general purpose analysis capability is required which will permit
+the scientist to pose arbitrary queries of arbitrary complexity, to be answered
+by the system in a few seconds (or minutes for large problems), with the answer
+taking the form of a number or name, set or table of numbers or names, plot,
+subcatalog, etc.
+
+Examples of such queries are given below.  Clearly, the set of all possible
+queries of this type is infinite, even assuming a limited number of operators
+operating on a single catalog.  The set of potentially interesting queries
+is equally large.
+.ls 4
+.ls [1]
+Find all objects of type "pqr" for which X is in the range A to B and
+Z is less than 10.
+.le
+.ls [2]
+Compute the mean and standard deviation of attribute X for all objects
+in the set [1].
+.le
+.ls [3]
+Compute and plot (X-Y) for all objects in set [1].
+.le
+.ls [4]
+Plot a circle of size (log2(Z-3.2) * 100) at the position (X,Y) of all objects
+in set [1].
+.le
+.ls [5]
+Print the values of the attributes OBJ, X, Y, and Z of all objects for which
+X is in the range A to B and Y is greater than 30.
+.le
+.le
+
+
+In the past queries such as these have all too often been answered by writing
+a program to answer each query, or worse, by wading through a listing of the
+program output and manually computing the result or manually plotting points
+on a graph.
+
+Given the preceding description of the catalog storage application, we can
+make the following observations about the application of the DBSS to catalog
+storage.
+.ls
+.ls o
+A catalog is typically written once and then read many times.
+.le
+.ls o
+Both public and private catalogs are common.
+.le
+.ls o
+Catalog records are infrequently updated or are not updated at all once the
+original entry has been made in the catalog.
+.le
+.ls o
+Catalog records are rarely if ever deleted.
+.le
+.ls o
+Catalogs can be very large, making efficient storage structures important
+in order to minimize disk storage requirements.
+.le
+.ls o
+Since catalogs can be very large, indexing facilities are required for
+efficient record retrieval and for the efficient evaluation of queries. 
+
+.le
+.ls o
+A general purpose interactive query capability is required for the user to
+effectively make use of the data in a catalog.
+.le
+.le
+
+
+In DBSS terminology a user catalog will often be referred to as a \fBtable\fR
+to avoid confusion with the use of the DBSS term \fBcatalog\fR which refers
+to the system table which lists the contents of a database.
+
+.nh 3
+Image Storage
+
+    A primary requirement for the DBSS, if not \fIthe\fR primary requirement,
+is that the DBSS be suitable for the storage of bulk data or \fBimages\fR.
+An image consists of two parts: an \fIimage header\fR describing the image,
+and a multidimensional array of \fBpixels\fR.  The pixel array is sometimes
+small and sometimes very large indeed.  For efficiency and other reasons the
+actual pixel array is not required to be stored in the database.  Even if the
+pixels are stored directly in the database they are not expected to be used
+in queries.
+
+We can make the following observations about the use of the DBSS for image
+storage.  The reader concerned about how all this might map into the storage
+structures provided by a relational database should assume that the image
+header is stored as a single large, variable size record (tuple), whereas
+a group of images is stored as one or more tables (relations).  If the images
+are large assume the pixels are stored outside the DBSS in a file, storing
+only the name of the file in the header record.
+.ls
+.ls o
+Images tend to be grouped into sets that have some logical meaning to the user,
+e.g., "nite1", "nite2", "raw", "reduced", etc.  Each group typically contains
+dozens or hundreds of images (enough to require use of an index for efficient
+retrieval).
+.le
+.ls o
+Within a group the individual images are often referred to by a unique ordinal
+number which is automatically assigned by some program (e.g., "nite1.10",
+"nite1.11", etc.).
+.le
+.ls o
+Image databases tend to be private databases, created and accessed by a
+single user.
+.le
+.ls o
+The size of the pixel segment of an image varies enormously, e.g., from
+1 kilobyte to 8 megabytes, even 40 megabytes in some cases.
+.le
+.ls o
+Small pixel segments are most efficiently stored directly in the image header
+to minimize the number of file opens and disk seeks required to access the
+pixels once the header has been accessed (as well as to minimize file clutter).
+.le
+.ls o
+Large pixel segments are most efficiently stored separately from the image
+headers to increase clustering and speed sequential searches of a group of
+headers.
+.le
+.ls o
+It is occasionally desirable to store either the image header or the pixel
+segment on a special, non file-structured device.
+.le
+.ls o
+The image header logically consists of a closed set of standard attributes
+common to all images, plus an open set of attributes peculiar to the data
+or to the type of analysis being performed on the data.
+.le
+.ls o
+The operations performed on images are often functions which produce a
+modified version of the input image(s) as a new output image.  It is desirable
+for most header information to be automatically preserved in such a mapping.
+For this to happen automatically without the DBSS requiring knowledge of
+the contents of a header, it is necessary that the header be a single object
+to the DBSS, i.e., a single record in some table, rather than a set of
+related records in several tables. 
+.le
+.ls o
+Since the image header needs to be maintained as a single record and since
+the header may contain an unpredictable number of application or data specific
+attributes, image headers can be quite large.
+.le
+.ls o
+Not all image header attributes are simple scalar values or even fixed size
+arrays.  Variable size attributes, i.e., arrays, are common in image headers.
+Examples of such attributes are the bad pixel list, history text, and world
+coordinate system (more on this in a later section).
+.le
+.ls o
+Image header attributes often form logical groupings, e.g., several logically
+related attributes may be required to define the bad pixel list or the world
+coordinate system.
+.le
+.ls o
+The image header structure is often dynamically updated and may change in
+size when updated.
+.le
+.ls o
+It is often necessary to add new attributes to an existing image header.
+.le
+.ls o
+Images are often selectively deleted.  Any subordinate files logically
+associated with the image should be automatically deleted when the image
+header is deleted.  If this is not possible under the DBSS then the DBSS
+should forbid deletion of the image header unless special action is taken
+to remove delete protection.
+.le
+.ls o
+For historical or other reasons, a given site will often maintain images
+in several different and completely incompatible formats.  It is desirable
+for the DBSS to be capable of directly accessing images maintained in a foreign
+format without a format conversion, even if only limited (e.g., read only)
+access is possible.
+.le
+.le
+
+
+In summary, images are characterized by a header with a highly variable set
+of fields, some of which may vary in size during the lifetime of the image.
+New fields may be added to the image header at any time.  Array valued fields
+are common and fields tend to form logical groupings.  The image header is
+best maintained as a single structure under the DBSS.  Image headers can be
+quite large.  The pixel segment of an image can be extremely large and may
+be best maintained outside the DBSS.  Since many image archives already exist,
+each with its own unique format, it is desirable for the DBSS to be capable
+of accessing multiple storage formats.
+
+Storage of the pixel segment or any other portion of an image in a separate
+file outside the DBSS causes problems which must be dealt with at some level
+in the system, if not by the DBSS.  In particular, problems occur if the user
+tries to back up, restore, copy, rename, or delete any portion of an image using
+a host system utility.  These problems are minimized if all logically related
+data is kept in a single data directory, allowing the database as a whole to
+be moved or backed up with host system utilities.  All pathnames should be
+defined relative to the data directory to permit relocation of the database
+to a different directory.  Ideally all binary datafiles in the database should
+be maintained in a machine independent format to permit movement of the
+database between different machines without reformatting the entire database.
+
+.nh 3
+Intermodule Communication
+
+    A large applications package consists of many separate tasks or programs.
+These tasks are best defined and understood in terms of their operation on a
+central package database.  For example, one task might fit some function to
+an image, leaving a record describing the fit in the database.  A second task
+might take this record as input and use it to control a transformation on
+the original image. 
Additional operators implementing a range of algorithms +or optimized for a discrete set of cases are easily added, each relying upon +the central database for intermodule communication. + +This application of the DBSS is a fairly conventional database application +except that array valued attributes and logical groupings of attributes are +common. For example, assume that a polynomial has been fitted to a data +vector and we wish to record the fit in the database. A typical set of +attributes describing a polynomial fit are shown below. + + +.ks +.nf + image_name char*30 # name of source image + nfeatures int # number of features fitted + features.x real*4[*] # x values of the features + features.y real*4[*] # y values of the features + curve.type char*10 # curve type + curve.ncoeff int # number of coefficients + curve.coeff real*4[*] # coefficients +.fi +.ke + + +The data structure shown records the positions (X) and world coordinates (Y) +of the data features to which the curve was fitted, plus the coefficients of +the fitted curve itself. There is no way of predicting the number of features +hence the X and Y arrays are variable length. Since the fitted curve might +be a spline or some other piecewise function rather than a simple polynomial, +there is likewise no reasonable way to place an upper limit on the amount of +storage required to store the fitted curve. This type of record is common in +scientific applications. + +We can now make the following observations regarding the use of the DBSS for +intermodule communication. +.ls +.ls o +The number of fields in a record tends to be small, but array valued fields +of variable size are common hence the physical size of a record may be large. +.le +.ls o +A large table might contain several hundred records in typical applications, +requiring the use of an index for efficient retrieval. +.le +.ls o +Record access is usually random rather than sequential. +.le +.ls o +Random record updates will be rare in some applications, but common in others. +.le +.ls o +Records will often change in size when updated. +.le +.ls o +Selective record deletion is rare, occurring mostly during cleanup following +an error. +.le +.ls o +New fields are rarely, if ever, added to existing records. The record structure +is usually determined by the programmer rather than by the user and tends to +be well defined. +.le +.ls o +This type of database is typically a private database created and used by a +single user to process a specific dataset with a specific applications package. +.le +.le + + +Application specific information may sometimes be stored directly in the header +of the image being analyzed, but more often will be stored in one or more +separate tables, recording the name of the image analyzed in the new record +as a backpointer, as in the example. Hence a typical scientific database +might consist of several tables containing the input images, several tables +containing intermodule records of various types, and one or more tables +containing either reduced images or catalog records, depending on whether a +reduction or analysis operation was performed. + +.nh 3 +Data Archiving + + Data archiving refers to the long term storage of raw or reduced data. +Data archiving is important for the following reasons. +.ls +.ls o +Archiving is currently necessary just to \fItransport\fR data from the +telescope to the site where reduction and analysis takes place. 
+.le +.ls o +Permanently archiving the raw (or pre-reduced) data is necessary in case +an error in the reduction process is later discovered, making it necessary +for the observer to repeat the reductions. +.le +.ls o +Archiving of the reduced data is desirable to save computer and human time +in case the analysis phase has to be repeated, or in case additional analysis +is later discovered to be necessary. +.le +.ls o +Archived data could conceivably be of considerable value to future researchers +who, given access to such data, might not have to make observations of their +own, or who might be able to use the archived data to augment or plan their +own observations. +.le +.ls o +Archived data could be invaluable for future projects studying the variability +of an object or objects over a period of years. +.le +.le + + +Ideally data should be archived as it is taken at the telescope, possibly +performing some simple pipeline reductions before archiving takes place. +Subsequent reduction and analysis using the archived data should be possible +without the format conversion (e.g., FITS to IRAF) currently required. +This conversion wastes cpu time and disk space as well as user time. +The problem is already serious and is expected to grow by an order of +magnitude in the next several years as digital detectors grow in size and +are used more frequently. + +Archival data consists of the digital data itself (the pixels) plus information +describing the object, the observer, how the data was taken, when and where +the data was taken, and so on. This is just the type of information assumed +to be present in an IRAF image. In addition one would expect the archive to +contain one or more \fBmaster catalogs\fR containing exhaustive information +describing the observations but no data. + +Since a permanent digital data archive can be expected to be around for many +years and to be read on many types of machines, data images should be archived +in a machine independent format; this format would almost certainly be FITS. +It is also desirable, though not essential, that the master catalogs be +readable on a variety of machines and hence be maintained and distributed in +a machine independent format. The ideal storage medium for archiving and +transporting large amounts of digital data appears to be the optical disk. + +Archival data and catalog access via the DBSS differs from conventional image +and catalog access only in the storage format, which is assumed to be machine +independent, and in the storage medium, which is assumed to be an archival +medium such as the optical disk. Direct access to a database on optical +disk requires that the DBSS be able to read the machine independent format +directly. + +To achieve acceptable performance for direct access it is necessary that +the storage medium be randomly accessible (unlike, say, a magnetic or optical +tape) and that the hardware seek time and transfer rate be comparable to those +provided by magnetic disk technology. Note that current optical disk readers +often do not have fast seek times, and that those that do have fast seek times +generally have a lower storage density than sequential devices due to the gaps +between sectors. Even if a device is not fast enough to be used directly it +is still possible to eliminate the expensive format conversion and do only a +disk to disk copy, accessing the machine independent format on magnetic disk. 
+
+There is no requirement that the IRAF DBSS be used to support data archiving,
+but the DBSS \fIis\fR required to be able to access the data in an archive.
+Accessing the master catalogs as well seems reasonable since such a catalog
+is no different than those described in sections 3.2.1 and 3.2.3; IRAF will
+have the capability to maintain, access, and query such a catalog without
+developing any additional software.
+
+The main obstacle likely to limit the success of data archiving may well be
+the difficulty involved in gaining access to the archive.  If the master
+catalogs were maintained on magnetic disk but released periodically in
+optical disk format for astronomers to refer to at their home institutions,
+access would be much easier (and probably more frequent) than if all the
+astronomers in the country were required to access a single distant computer
+via modem.  Telephone access by sites not on the continent would probably
+be too expensive or problematic to be feasible.
+
+.nh 2
+Other Requirements
+
+    In earlier sections we have discussed the principal constraints and
+primary requirements for the DBSS.  Several other requirements or
+non-requirements deserve mention.
+
+.nh 3
+Concurrency
+
+    All of the applications identified thus far require either read-only access
+to a public database or read-write access to a private database.
+The DBSS is therefore not required to support simultaneous updating by many
+users of a single centralized database, with all the overhead and complication
+associated with record locking, deadlock avoidance and detection, and so on.
+The only exception occurs when a single user has several concurrent processes
+requiring simultaneous update access to the user's private database.  It appears
+that this case can be addressed adequately by distributing the database in
+several datasets and using host system file locking to lock the datasets,
+a technique discussed further in a later section.
+
+.nh 3
+Recovery
+
+    If a database update is aborted for some reason a dataset can be corrupted,
+possibly preventing further access to the dataset.  The DBSS should of course
+protect datasets from corruption in normal circumstances, but it is always
+possible for a hardware or software error (e.g., disk overflow or reboot) to
+cause a dataset to be corrupted.  Some mechanism is required for recovering a
+database that has been corrupted.  The minimum requirement is that the DBSS,
+when asked to access a corrupted dataset, detect that the dataset has been
+corrupted and abort, after which the user runs a recovery task to rebuild the
+dataset minus the corrupted records.
+
+.nh 3
+Data Independence
+
+    Data independence is a fundamental property inherent in virtually all
+database systems.  One of the major reasons one uses a database system is to
+provide data independence.  Data independence is so fundamental that we will
+not discuss it further here.  Suffice it to say that the DBSS must provide
+a high degree of data independence, allowing applications programs to function
+without detailed knowledge of the structure or contents of the database they
+are accessing, and allowing databases to change significantly without
+affecting the programs which access them.
+
+.nh 3
+Host Database Interface
+
+    The host database interface (HDBI) makes it possible for the DBSS to
+interface to a host database system. 
The ability to interface to a host
+database system is not a primary requirement for the DBSS but is a highly
+desirable one for many of the same reasons that direct access to archival data
+is important.  The problems of accessing an HDB and of accessing an archive
+maintained in non-DBSS format are similar and might perhaps be addressed
+by a single interface.
+
+.nh
+Conceptual Design
+
+    In this section we develop the design of the various subsystems comprising
+the DBSS at the conceptual level, without bothering with the details of specific
+language bindings or with the details of implementation.  We start by defining
+some important terms and then describe the system architecture.  Lastly we
+describe each of the major subsystems in turn, starting at the highest level
+and working down.
+
+.nh 2
+Terminology
+
+    The DBSS is an implementation of a \fBrelational database\fR.  A relational
+database views data as a collection of \fBtables\fR.  Each table has a fixed
+set of named columns and may contain any number of rows of data.  The rows
+of a table are often referred to as \fBrecords\fR.  A record consists of a set
+of named \fBfields\fR.  The fields of a record are the columns of the table
+containing the record.
+
+We shall use this informal terminology when discussing the contents of a
+physical database.  When discussing the \fIstructure\fR of a database we shall
+use the formal relational terms relation, tuple, attribute, and so on.
+The correspondence between the formal relational terms and their informal
+equivalents is given in the table below.
+
+
+.ks
+.nf
+	\fBformal relational term\fR	\fBinformal equivalents\fR
+
+	relation			table
+	tuple				record, row
+	attribute			field, column
+	primary key			unique identifier
+	domain				pool of legal values
+.fi
+.ke
+
+
+A \fBrelation\fR is a set of like tuples.  A \fBtuple\fR is a set of
+\fBattributes\fR, each of which is defined upon a specific domain.
+A \fBdomain\fR is an abstract type which defines the legal values an
+attribute may take on (e.g., "posint" or "color").  The tuples of a relation
+must be unique within the containing relation.  The \fBprimary key\fR is
+a subset of the attributes of a relation which is sufficient to uniquely
+identify any tuple in the relation (often a single attribute serves as
+the primary key).
+
+The relational data model was chosen for the DBSS because it is the simplest
+conceptual data model which meets our requirements.  Other possibilities
+considered were the \fBhierarchical\fR model, in which data is organized in
+a tree structure, and the \fBnetwork\fR model, in which data is organized in
+a potentially recursive graph structure.  Virtually all new database systems
+implemented since the mid-seventies have been based on the relational model
+and most database research today is in support of the relational model (the
+remainder goes to the new fifth-generation technology, not to the old data
+models).
+
+The term "relational" in "relational database" comes from the \fBrelational
+algebra\fR, a branch of mathematics based on set theory which defines a
+fundamental and mathematically complete set of operations upon relations
+(tables).  The relational algebra is fundamental to the DBMS query language
+(section 4.3) but can be safely ignored in the rest of the DBSS.  The reader
+is referred to any introductory database text for a discussion of the relational
+algebra and other database technotrivia. 
The classic introductory database +text is \fI"An Introduction to Database Systems"\fR, Volume 1 (Fourth Edition, +1986) by C. J. Date. + +.nh 2 +System Architecture + + The system architecture of the DBSS is depicted in Figure 2. The parts +of the figure above the "DBKI" have already been discussed in section 2.2. +The remainder of the figure is what has been referred to previously as the +DB kernel. + +The primary function of DBIO is record access (retrieval, update, insertion, +and deletion) based on evaluation of a \fBselect\fR statement input as a string. +DBIO can also process symbolic definitions of relations and other database +objects so that new tables may be created. DBIO does not implement any +relational operators more complex than select; the more complex relational +operations are left to the DBMS query language to minimize the size and +complexity of DBIO. + +The basic concept underlying the design of the lower level portions of the DBSS +is that the DB kernel provides the \fBaccess method\fR for efficiently accessing +records in mass storage, while DBIO takes care of all higher level functions. +In particular, DBIO implements all functions required to access the contents +of a record, while the DB kernel is responsible for storage allocation and for +the maintenance and use of indexes, but has no knowledge of the actual contents +of a record (the HDBI is an exception to this rule as we shall see later). + +The database kernel interface (DBKI) provides a layer of indirection between +DBIO and the underlying database kernel (DBK). The DBKI can support a number +of different kernels, much the way FIO can support a number of different device +drivers. The DBKI also provides network access to a remote database, using +the existing IRAF kernel interface (KI) to communicate with a DBKI on the +remote node. Two standard database kernels are provided. + +The primary DBK (at the right in the figure) maintains and accesses DBSS +binary datasets; this is the most efficient kernel and probably the only +kernel which will fully implement the semantic actions of the DBKI. +The second DBK (at the left in the figure) supports the host database +interface (HDBI) and is used to access archival data, any foreign image +formats, and the host database system (HDB), if any. Specialized HDBI +drivers are required to access foreign image formats or to interface to +an HDB. + + +.ks +.nf + DBMS IMAGES(etc) (CL) + \ / + \ / --------- + \ / + \ IMIO + \ / \ + \ / \ + \/ \ + DBIO FIO (VOS) + | + | + | + DBKI + | + +------+------+-------+ + | | | + DBK DBK (KI) + | | | + | | | + HDBI | | + | | | + +----+----+ | | --------- + | | | | + | | | | + [archive] [HDB] [dataset] | + | + | (host system) + - + (LAN) + - + | + | --------- + | + (Kernel-Server) + | + | + DBKI (VOS) + | + +---+---+ + | | + DBK DBK + + +.fi +.ce +Figure 2. \fBDatabase Subsystem Architecture\fR +.ke + + +.nh 2 +The DBMS Package +.nh 3 +Overview + + The user interfaces with a database in either of two ways. The first way +is via the tasks in an applications package, which perform highly specialized +operations upon objects stored in the database, e.g., to reduce a certain kind +of data. The second way is via the database management package (DBMS), which +gives the user direct access to any dataset (but not to large pixel arrays +stored outside the DBSS). 
The DBMS provides an assortment of general purpose
+operators which may be used regardless of the type of data stored in the
+database and regardless of the applications program which originally created
+the structures stored in the database.
+
+The DBMS package consists of an assortment of simple procedural operators
+(conventional CL callable parameter driven tasks), a screen editor for tables,
+and the query language, a large program which talks directly to the terminal
+and which has its own special syntax.  Lastly there is a subpackage containing
+tasks useful only for datasets maintained by the primary DBK, i.e., a package
+of relatively low level tasks for things like crash recovery and examining
+the contents of physical datasets.
+
+.nh 3
+Procedural Interface
+
+    The DBMS procedural interface provides a number of the most commonly
+performed database operations in the form of CL callable tasks, allowing
+these simple operations to be performed without the overhead involved in
+entering the query language.  Extensive database manipulations are best
+performed from within the query language, but if the primary concern of
+the user is data reduction in some package other than DBMS the procedural
+operators will be more convenient and less obtrusive.
+
+.nh 4
+General Operators
+
+    DBMS tasks are required to implement the following general database
+management operations.  Detailed specifications for the actual tasks are
+given later.
+.ls
+.ls \fBchdb\fR newdb
+Change the default database.  To minimize typing the DBSS provides a
+"default database" paradigm analogous to the default directory of FIO.
+Note that there need be no obvious connection between database objects
+and files since multiple tables may be stored in a single physical file,
+and the physical database may reside on an optical disk or, worse, may be
+an HDB.  Therefore the FIO "directory" cannot be used to examine the
+contents of a database.  The default database may be set independently
+of the current directory.
+.le
+.ls \fBpcatalog\fR [database]
+Print the catalog of the named database.  The catalog is a system table
+containing one entry for every table in the database; it is analogous
+to a FIO directory.  Since the catalog is a table it can be examined like
+any other table, but a special task is provided since the print catalog
+operation is so common.  If no argument is given the catalog of the default
+database is printed.
+.le
+.ls \fBptable\fR spe
+Print the contents of the specified relation in list form on the standard
+output.  The operand \fIspe\fR is a general select expression defining
+a new table as a projection of some subset of the records in a set of one or
+more named tables.  The simplest select expression is the name of a single
+table, in which case all fields of all records in the table will be printed.
+More generally, one might print all fields of a single table, selected fields
+of a single table (projection), all fields of selected records of a single
+table (selection), or selected fields of selected records from one or more
+tables (selection plus projection).
+.le
+.ls \fBrcopy\fR spe output_table
+Copy (insert) the records specified by the general select expression
+\fIspe\fR into the named \fIoutput_table\fR.  If the named output table
+does not exist a new one will be created.  If the attributes of the output
+table are different from those of the input table the proper action of
+this operator is not obvious and has not yet been defined.
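+For example, assuming the record selection syntax discussed later (the
+syntax shown here is illustrative only and is not a definition of the
+select expression), a typical invocation might be:
+.nf
+
+	cl> rcopy "nite1 where type = 'pqr'" nite2
+
+.fi
+This would insert all records of type "pqr" from table \fInite1\fR into
+table \fInite2\fR.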
+.le +.ls \fBrmove\fR spe output_table +Move (insert) the relation specified by the general select expression +\fIspe\fR into the named \fIoutput_table\fR. If the named output table +does not exist a new one will be created. The original records are deleted. +This operator is used to generate the union of two or more tables. +.le +.ls \fBrdelete\fR spe +Delete the records specified by the general select expression \fIspe\fR. +Note that this operator deletes records from tables, not the tables themselves. +.le +.ls \fBmkdb\fR newdb [ddl_file] +Create a new, empty database \fInewdb\fR. If a data definition file +\fIddl_file\fR is named it will be scanned and any domain, relation, etc. +definitions therein entered into the new database. +.le +.ls \fBmktable\fR table relation +Create a new, empty table \fItable\fR of type \fIrelation\fR. The parameter +\fIrelation\fR may be the name of a DDL file, the name of an existing base +table, or any general record select/project expression. +.le +.ls \fBmkview\fR table relation +Create a new virtual table (view) defined in terms of one or more existing +base tables by the operand \fIrelation\fR, which is the same as for the +task \fImktable\fR. Operationally, \fBmkview\fR is much like \fBrcopy\fR, +except that it is considerably faster and the new table does not physically +store any data. The new view-table behaves like any other table in most +operations (except some types of updates). Note that the new table may +reference tuples in several different base tables. A view-table may +subsequently be converted into a base table with \fBrcopy\fR. Views are +discussed in more detail in section 4.5. +.le +.ls \fBmkindex\fR table fields +Make a new index on the named base table over the listed fields. +.le +.ls \fBrmtable\fR table +Drop (delete, remove) the named base table (or view) and any indexes defined +on the table. +.le +.ls \fBrmindex\fR table fields +Drop (delete, remove) the index defined over the listed fields on the named +base table. +.le +.ls \fBrmdb\fR [database] +Destroy the named database. Unless explicitly overridden \fBrmdb\fR will +refuse to delete a database until all tables therein have been dropped. +.le +.le + + +Several terms were introduced in the discussion above which have not yet been +defined. A \fBbase table\fR is a physical table (instance of a defined +relation), unlike a \fBview\fR which is a virtual table defined via selection +and projection over one or more base tables or other views. Both types of +objects behave equivalently in most operations. +A \fBdata definition language\fR (DDL) is a language syntax used to define +database objects. + +.nh 4 +Forms Based Data Entry and Retrieval + + Many of the records typically stored in a database are too large to be +printed in list format on a single line. Some form of multiline output is +necessary; this multiline representation is called a \fBform\fR. The full +terminal screen is used to display a form, e.g. with the fields labeled +in reverse video and the field values in normal video. Records are viewed +one at a time. + +Data entry via a form is an interactive process similar to editing a file with +a screen editor. The form is displayed, possibly with default values for the +fields, and the user types in new values for the fields. Editor commands are +provided for positioning the cursor to the field to be edited and for editing +within a field. The DBSS verifies each value as it is entered using the range +information supplied with the domain definition for that field. 
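+
+In outline, the range check for a numeric field might resemble the following
+fragment.  C is used here purely for illustration; the names are invented and
+the actual DBSS data structures are not shown in this document.
+.ks
+.nf
+	/* Sketch of a domain range check.  A real domain descriptor
+	 * would also carry the datatype, default output format, field
+	 * label, and so on, as provided for by the DDL.
+	 */
+	struct domain {
+		double	minval;		/* smallest legal value */
+		double	maxval;		/* largest legal value  */
+		int	nullok;		/* are nulls permitted? */
+	};
+
+	/* dm_validate -- return 1 if the given field value is legal
+	 * for the domain, else 0.
+	 */
+	int
+	dm_validate (struct domain *dm, double value, int isnull)
+	{
+		if (isnull)
+			return (dm->nullok);
+		return (value >= dm->minval && value <= dm->maxval);
+	}
+.fi
+.ke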
+
+Additional checks may be made before the new record is inserted into the
+output table, e.g., the DBSS may verify that values have been entered for
+all fields which do not permit null values.
+.ls
+.ls \fBetable\fR spe
+Call up the forms editor to edit a set of records.  The operand \fIspe\fR
+may be any general select expression.
+.le
+.ls \fBpform\fR spe
+Print a set of records on the standard output, using the forms generator to
+generate a nice self documenting format.
+.le
+.le
+
+
+The \fBforms editor\fR (etable) may be used to display or edit existing records
+as well as to enter new ones.  It is desirable for the forms editor to be able
+to move backward as well as forward in a table, and to move randomly to a
+record satisfying a predicate, i.e., to search through the table for a record.
+This makes the forms editor a powerful tool for browsing through a
+database.  If the predicate for a search is specified by entering values or
+boolean expressions into the fields contributing to the predicate then we have
+a query-by-form utility, which has been reported in the literature to be very
+popular with users (since one does not have to remember a syntax and typing
+is minimized).
+
+A variation on the forms editor is \fBpform\fR, used to output records in
+"forms" format.  This will be most useful for large records or for cases where
+one is more interested in studying individual records than in comparing
+different records.  The alternative to forms output is list or tabular format
+output.  This form of output is more concise and can be used as input to the
+\fBlists\fR operators, but may be harder to read and may overflow the output
+line.  List format output is discussed further in the next section.
+
+By default the format of a form is determined automatically by a
+\fBforms generator\fR using information given in the DDL when the database
+was created.  The domain definition capability of the DDL includes provisions
+for specifying the default output format for a field as well as the field label.
+In most cases this will be sufficient information for the forms generator to
+generate an esthetically acceptable form.  If desired the user or programmer can
+modify this form or create a new form from scratch, and the forms generator
+will use the customized form rather than create one of its own.
+
+The CL \fBeparam\fR parameter file editor is an example of a simple forms
+editor.  The main differences between \fBeparam\fR and \fBetable\fR are the
+forms generator and the browsing capability.
+
+.nh 4
+List Interface
+
+    The \fBlist\fR is one of the standard IRAF data structures.  A list is
+an ascii table wherein the standard record delimiter is the newline and the
+standard field delimiter is whitespace.  Comment lines and blank lines are
+ignored within lists; double comment lines ("## ...") may optionally be used
+to label the columns of a list.  By default, non-DBMS lists are free format;
+strings must be quoted if they contain one of the field delimiter characters.
+The field and record delimiter characters may be changed if necessary, e.g.,
+to permit multiline records.  Fixed format lists are available as an option
+and are often required to interface to external (non-IRAF) programs.
+
+The primary advantages of the list or tabular format for printed tables are
+the following.
+.ls
+.ls [1]
+The list or tabular format is the most concise form of printed output.
+The eye can rapidly scan up and down a column to compare the values of
+the same field in a set of records.
+.le
+.ls [2]
+DBMS list output may be used as input to the tasks in the \fBlists\fR,
+\fBplot\fR, and other packages.  Using the pipe syntax, tasks which
+communicate via lists may be connected together to perform arbitrarily
+complex operations.
+.le
+.ls [3]
+List format output is the de facto standard format for the interchange of
+tabular data (e.g., DBSS tables) amongst different computers and programs.
+A list (usually the fixed format variety) may be written onto a cardimage
+tape for export, and conversely, a list read from a cardimage tape may be
+used to enter a table into a DBSS database.
+.le
+.le
+
+
+The most common use for list format output will probably be to print tables.
+When a table is too wide to fit on a line the user will learn to use
+\fBprojection\fR to print only the fields of interest.  The default format
+for DBMS lists will be fixed format, using the format information provided
+in the DDL specification to set the default output format.  Fixed format
+is best for DBMS lists since it forces the field values to line up in nice
+orderly columns, which are easier for a human to read (fixed format is easier
+and more efficient for a computer to read as well, if not to write).
+The type of format used will be recorded in the list header and a
+\fBlist interface\fR will be provided so that all list processing programs
+can access lists equivalently regardless of their format.
+
+As mentioned above, the list interface can be used to import and export tables.
+In particular, an astronomical catalog distributed on card image tape can be
+read directly into a DBSS table once a format descriptor has been prepared
+and the DDL for the new table has been written and used to create an empty
+table ready to receive the data.  After only a few minutes of setup a user can
+have a catalog entered into the database and be getting final results using
+the query language interface!
+.ls
+.ls \fBrtable\fR listfile output_table
+The list \fIlistfile\fR is scanned, inserting successive records from the
+list into the named output table.  A new output table is created if one does
+not already exist.  The format of the list is taken from the list header
+if there is one, otherwise the format specification is provided by the user
+in a separate file.
+.le
+.ls \fBptable\fR spe
+Print the contents of the relation \fIspe\fR in list form on the standard
+output.  The operand \fIspe\fR may be any general select/project expression.
+.le
+.le
+
+
+The \fBptable\fR operator (introduced in section 4.3.2.1) is used to generate
+list output.  The inverse operation is provided by \fBrtable\fR.
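+
+For example, \fBptable\fR output for a small table might look like the
+following (the table, its fields, and the data are of course hypothetical):
+.ks
+.nf
+	cl> ptable "objects where type = 'pqr'"
+	## obj      type        x        y        z
+	   s0012    pqr      11.4     42.0     31.8
+	   s0147    pqr      18.2     17.5     44.1
+	   s0211    pqr      13.0     88.2     30.6
+.fi
+.ke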
+
+.nh 4
+FITS Table Interface
+
+    The FITS table format is a standard format for the transport of tabular
+data.  The idea is very similar to the cardimage format discussed in the last
+section except that the FITS table standard includes a table header used to
+define the format of the encoded table, hence the user does not have to
+prepare a format descriptor to read a FITS table.  The FITS reader and writer
+programs are part of the \fBdataio\fR package.
+
+.nh 4
+Graphics Interface
+
+    All of the \fBplot\fR package graphics facilities are available for plotting
+DBMS data via the \fBlist\fR interface discussed in section 4.3.2.3.  List
+format output may also be used to generate output to drive external (non-IRAF)
+graphics packages.  Plotting facilities are also available via a direct
+interface within the query language; this latter interface is the most efficient
+and will be the most suitable for most graphics applications.  See section
+2.3 for additional comments on the graphics interface.
+
+.nh 3
+Command Language Interface
+
+    All of the DBMS tasks are CL callable and hence part of the command language
+interface to the DBSS.  For example, a CL script task may implement arbitrary
+relational operators using \fBptable\fR to copy a table into a list, \fBfscan\fR
+to read the list and \fBprint\fR to format the modified list, and finally
+\fBrtable\fR to insert the output list into a table.  The query language may
+also be called from within a CL script to process commands passed on the
+command line, via the standard input, or via a temporary file.
+
+Additional operators are required for randomly accessing records without the
+use of a list; suitable operators are shown below.
+.ls
+.ls \fBdbgets\fR record fields
+The named fields of the indicated record are returned as a free format string
+suitable for decoding into individual fields with \fBfscan\fR.
+.le
+.ls \fBdbputs\fR record fields values
+The named fields of the indicated record are set to the values given in the
+free format value string.
+.le
+.le
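+
+For example, a CL script might read and update a single field as follows.
+This is only a sketch: \fBdbgets\fR and \fBdbputs\fR are the proposed
+operators described above, and the record and field names are hypothetical.
+.ks
+.nf
+	# Double the recorded exposure time of image 12 in table nite1.
+	real	exptime
+
+	dbgets ("nite1[12]", "exptime") | scan (exptime)
+	exptime = exptime * 2.0
+	dbputs ("nite1[12]", "exptime", exptime)
+.fi
+.ke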
+
+
+More sophisticated table and record access facilities are conceivable but
+cannot profitably be implemented until an enhanced CL becomes available.
+
+.nh 3
+Record Selection Syntax
+
+    As we have seen, many of the DBMS operators employ a general record
+selection syntax to specify the set of records to be operated upon.
+The selection syntax will include a list of tables and optionally a
+predicate (boolean expression) to be evaluated for each record in the
+listed tables to determine if the record is to be included in the final
+selection set.  In the simplest case a single table is named with no
+predicate, in which case the selection set consists of all records in the
+named table.  Parsing and evaluation of the record selection expression
+is performed entirely by the DBIO interface hence we defer detailed
+discussion of selection syntax to the sections describing DBIO.
+
+.nh 3
+Query Language
+
+    In most database systems the \fBquery language\fR is the primary user
+interface, both for the end-user interactively entering ad hoc queries, and for
+the programmer entering queries via the host language interface.  The major
+reasons for this are outlined below.
+.ls
+.ls [1]
+A query language interface is much more powerful than a "task" or subroutine
+based interface such as that described in section 4.3.2.  A query language
+can evaluate queries much more complex than the simple "select" operation
+implemented by DBIO and made available to the user in tasks such as
+\fBptable\fR and \fBrcopy\fR.
+.le
+.ls [2]
+A query language is much more efficient than a task interface for repeated
+queries.  Information about a database may be cached between queries and
+files may remain open between queries.  Complex queries may be executed as
+a series of simpler queries, cacheing the intermediate results in memory.
+Graphs may be generated directly from the data without encoding, writing,
+reading, decoding, and deleting an intermediate list.
+.le
+.ls [3]
+A query language can perform many functions via a single interface, reducing
+the amount of code to be written and supported, as well as simplifying the
+user interface.  For example, a query language can be used to globally
+update (edit) tables, as well as to evaluate queries on the database.
+Lacking a query language, such an editing operation would have to be
+implemented with a separate task which would no doubt have its own special
+syntax for the user to remember (e.g., the \fBhedit\fR task in the \fBimages\fR
+package).
+.le
+.le
+
+
+Unlike most commercial database systems, the DBSS is not built around the
+query language.  The heart of the IRAF DBSS is the DBIO interface, which is
+little more than a glorified record access interface.  The query language
+is a high level applications task built upon DBIO, GIO, and the other interfaces
+constituting the IRAF VOS.  This permits us to delay implementation of the
+query language until after the DBSS is in use and our primary requirements have
+been met, and then implement the query language as an experimental prototype.
+Like all data analysis software, the query language is not required to meet
+our primary requirements (data acquisition and reduction), rather it is needed
+to do interesting things with our data once it has been reduced.
+
+.nh 4
+Query Language Functions
+
+    The query language is a prominent part of the user interface and is
+often used interactively directly by the user, but may also be called
+noninteractively from within CL scripts and by SPP programs.  The major
+functions performed by the query language are as follows.
+.ls
+.ls [1]
+The database management operations, i.e., create/destroy database,
+create/drop table or index, sort table, alter table (add new attribute),
+and so on.
+.le
+.ls [2]
+The relational operations, i.e., select, project, join, and divide
+(the latter is rarely implemented).  These are the operations most used
+to evaluate queries on the database.
+.le
+.ls [3]
+The traditional set operations, i.e., union, intersection, difference,
+and cartesian product.
+.le
+.ls [4]
+The editing operations, i.e., selective record update and delete.
+.le
+.ls [5]
+Operations on the columns of tables.  Compute the sum, average, minimum,
+maximum, etc. of the values in a column of a table.  These operations
+are also required for queries.
+.le
+.ls [6]
+Tabular and graphical output.  The result of any query may be printed or
+plotted in a variety of ways, without need to repeat the query.
+.le
+.le
+
+
+The most important function performed by the query language is of course the
+interactive evaluation of queries, i.e., questions about the data in the
+database.  It is beyond the scope of this document to try to give the reader
+a detailed understanding of how a query language is used to evaluate queries.
+
+.nh 4
+Language Syntax
+
+    The great flexibility of a query language derives from the fact that it is
+syntax rather than parameter driven.  The syntax of the DBMS query language
+has not yet been defined.  In choosing a language syntax there are several
+possible courses of action: [1] implement a standard syntax, [2] extend a
+standard syntax, or [3] develop a new syntax, e.g., as a variation on some
+existing syntax.
+
+The problem with rigorously implementing a standard syntax is that all query
+languages currently in wide use were developed for commercial applications,
+e.g., for banking, inventory, accounting, customer mailing lists, etc.
+Experimental query languages are currently under development for CAD
+applications, analysis of Landsat imagery, and other applications similar
+to ours, but these are all research projects at the present time.
+
+The basic characteristics desirable in a query language intended for scientific
+data reduction and analysis seem little different from those provided by a query
+language intended for commercial applications, hence the most practical
+approach is probably to start with some existing query language syntax and
+modify or extend it as necessary for our type of data.
+
+There is no standard query language for relational databases.
+The closest thing to a standard is SQL, a language originally developed by
+IBM for System-R (one of the first relational database systems, actually an
+experimental prototype), and still in use in the latest IBM product, DB2.
+This language has since been used in many relational products by many companies.
+SQL is the latest in a series of relational query languages from IBM; earlier
+languages include SQUARE and SEQUEL.  The second most widely used relational
+query language appears to be QUEL, the query language used in both educational
+and commercial INGRES.
+
+Both SQL and QUEL are examples of the so-called "calculus" query languages.
+The other major type of query language is the "algebraic" query language
+(excluding the forms and menu based query languages, which are not syntax
+driven).  Examples of algebraic languages are ISBL (PRTV, Todd 1976),
+TABLET (U. Mass.), ASTRID (Gray 1979), and ML (Li 1984).
+These algebraic languages have all been implemented and used, but nowhere
+near as widely as SQL and QUEL.
+
+It is interesting to note that ASTRID and ML were developed by researchers
+active in the area of logic languages.  In particular, the ML (Mathematics-Like)
+query language was implemented in Prolog and some of the character of Prolog
+shows through in the syntax of the language.  There is a close connection
+between the relational algebra and the predicate calculus (upon which the
+logic languages are based) which is currently being actively explored.
+One of the most promising areas of application for the logic languages
+(upon which the so-called "fifth generation" technology is based) is in
+database applications and query languages in particular.
+
+There appears to be no compelling reason for the current dominance of the
+calculus type query language, other than the fact that that is what IBM decided
+to use in System-R.  Anything that can be done in a calculus language can
+also be done in an algebraic language and vice versa.
+
+The primary difference between the two languages is that the calculus languages
+want the user to express a complex query as a single large statement,
+whereas the algebraic languages encourage the user to execute a complex
+query as a series of simpler queries, storing the intermediate results as
+snapshots or views (either language can be used either way, but the orientation
+of the two languages is as stated).  For simple queries there is little
+difference between the two languages, although the calculus languages are
+perhaps more readable (more English-like) while the algebraic languages are
+more concise and have a more mathematical character.
+
+The orientation of the calculus languages towards doing everything in a single
+statement provides more scope for optimization than if the equivalent query is
+executed as a series of simpler queries; this is often cited as one of the
+major advantages of the calculus languages.
The procedural nature of the
+algebraic languages does not permit the type of global optimizations employed
+in the calculus languages, but this approach is perhaps more user-friendly
+since the individual steps are easy to understand, and one gets to examine
+the intermediate results to figure out what to do next.  Since a complex query
+is executed incrementally, intermediate results can be recomputed without
+starting over from scratch.  It is possible that, taking user error and lack
+of forethought into account, the less efficient algebraic languages might end
+up using less computer time than the super-efficient calculus languages for
+comparable queries.
+
+A further advantage of the algebraic language in a scientific environment is
+that there is more of a distinction between executing a query and printing
+the results of the query than in a calculus language.  The intermediate results
+of a complex query in an algebraic language are named relations (snapshots
+or views); an extra print command must be entered to examine the intermediate
+result.  This is an advantage if the query language provides a variety of ways
+to examine the result of a query, e.g., as a printed table or as some type
+of plot.
+
+.nh 4
+Sample Queries
+
+    At this point several examples of actual queries, however simple they may
+be, should help us to visualize what a query language is like.  Several
+examples of typical scientific queries were given (in English) in section 3.2.1.
+For the convenience of the reader these are duplicated here, followed by actual
+examples in the query languages SQL, QUEL, ASTRID, and ML.  It should be noted
+that these are all examples of very simple queries and these examples do little
+to demonstrate the power of a fully relational query language.
+.ls
+.ls [1]
+Find all objects of type "pqr" for which X is in the range A to B and
+Z is less than 10.
+.le
+.ls [2]
+Compute the mean and standard deviation of attribute X for all objects
+in the set [1].
+.le
+.ls [3]
+Compute and plot (X-Y) for all objects in set [1].
+.le
+.ls [4]
+Plot a circle of size (log2(Z-3.2) * 100) at the position (X,Y) of all objects
+in set [1].
+.le
+.ls [5]
+Print the values of the attributes OBJ, X, Y, and Z of all objects of type
+"pqr" for which X is in the range A to B and Z is greater than 30.
+.le
+.le
+
+
+It should not be difficult for the imaginative reader to make up similar
+queries for a particular astronomical catalog or data archive.
+For example (I can't resist), "find all objects for which B-V exceeds X",
+"find all recorded observations of object X", "find all observing runs on
+telescope X in which astronomer Y participated during the years 1975 to
+1985", "compute the number of cloudy nights in August during the years
+1985 to 1990", and so on.  The possibilities are endless.
+
+Query [5] is an example of a simple select/project query.  This query is
+shown below in the different query languages, with A = 10 and B = 20.
+Note that whitespace may be redistributed in each query as desired;
+in particular, the entire query may be entered on a single line if desired.
+Keywords are shown in upper case and data names or values in lower case.
+The object "table" is the table from which records are to be selected,
+"pqr" is the desired value of the field "type" of table "table", and
+"x", "y", and "z" are numeric fields of the table.
+
+
+.ks
+.nf
+SQL:
+
+	SELECT obj, x, y, z
+	FROM table
+	WHERE type = 'pqr'
+	AND x >= 10
+	AND x <= 20
+	AND z > 30;
+.fi
+.ke
+
+
+.ks
+.nf
+QUEL:
+
+	RANGE OF t IS table
+	RETRIEVE (t.obj, t.x, t.y, t.z)
+	WHERE t.type = 'pqr'
+	AND t.x >= 10
+	AND t.x <= 20
+	AND t.z > 30
+.fi
+.ke
+
+
+.ks
+.nf
+ASTRID (mnemonic form):
+
+	table
+	SELECTED_ON [
+	    type = 'pqr'
+	    AND x >= 10
+	    AND x <= 20
+	    AND z > 30
+	] PROJECTED_TO
+	obj, x, y, z
+.fi
+.ke
+
+
+.ks
+.nf
+ASTRID (mathematical form):
+
+	table ;[ type = 'pqr' AND x >= 10 AND x <= 20 AND z > 30 ] %
+	obj, x, y, z
+.fi
+.ke
+
+
+.ks
+.nf
+ASTRID (alternate query showing use of intermediates):
+
+	a := table ;[ type = 'pqr' AND z > 30 ]
+	b := a ;[ x >= 10 AND x <= 20 ]
+	b % obj,x,y,z
+.fi
+.ke
+
+
+.ks
+.nf
+ML (Li/Prolog):
+
+	table : type=pqr, x >= 10, x <= 20, z > 30 [obj,x,y,z]
+.fi
+.ke
+
+
+Note that in ASTRID and ML selection and projection are implemented as
+operators or qualifiers modifying the relation on the left.  To print all
+fields of all records of a table one need only enter the name of the table.
+The logic language nature of such queries is evident if one thinks of the
+query as a predicate or true/false assertion.  Given such an assertion (query),
+the query processor tries to prove the assertion true by finding all tuples
+satisfying the predicate, using the set of rules given (the database).
+
+For simple queries such as these it makes little difference what query language
+is used; many users would probably prefer the SQL or QUEL syntax for these
+simple queries because of the English-like syntax.  To seriously evaluate the
+differences between the different languages more complex queries must be tried,
+but such an exercise is beyond the scope of the present document.
+
+As a final example we present, without supporting explanation, an example
+of a more complex query in SQL (from Date, 1986).  This example is based
+upon a "suppliers-parts-projects" database, consisting of four tables:
+suppliers (S), parts (P), projects (J), and number of parts supplied to
+a specified project by a specified supplier (SPJ), with fields 'supplier
+number' (S#), 'part number' (P#) and 'project number' (J#).  The names
+SPJX and SPJY are aliases for SPJ.  This example is rather contrived and
+the data is not interesting, but it should serve to illustrate the use of
+SQL for complex queries.
+
+
+.ks
+.nf
+Query:  Get part numbers for parts supplied to all projects in London.
+
+	SELECT DISTINCT p#
+	FROM spj spjx
+	WHERE NOT EXISTS
+	    ( SELECT *
+	      FROM j
+	      WHERE city = 'london'
+	      AND NOT EXISTS
+	          ( SELECT *
+	            FROM spj spjy
+	            WHERE spjy.p# = spjx.p#
+	            AND spjy.j# = j.j# ));
+.fi
+.ke
+
+
+The nesting shown in this example is characteristic of the calculus languages
+when used to evaluate complex queries.  Each SELECT implicitly returns an
+intermediate relation used as input to the next higher level subquery.
+
+.nh 3
+DB Kernel Operators
+
+    All DBMS operators described up to this point have been general purpose
+operators with no knowledge of the form in which data is stored internally.
+Additional operators are required in support of the standard IRAF DB kernels.
+These will be implemented as CL callable tasks in a subpackage off DBMS.
+
+.nh 4
+Dataset Copy and Load
+
+    Since our intention is to store the database in a machine independent
+format, special operators are not required to back up, reload, or copy dataset
+files.
The ordinary binary file copy
+facilities provided by IRAF or the host system suffice for these operations.
+
+.nh 4
+Rebuild Dataset
+
+    Over a period of time a dataset which is subjected to heavy updating
+may become disordered internally, reducing the efficiency of most record
+access operations.  A utility task is required to efficiently rebuild such
+datasets.  The same result can probably be achieved by an \fIrcopy\fR
+operation but a lower level operator may be more efficient.
+
+.nh 4
+Mount Foreign Dataset
+
+    Before a foreign dataset (archive or local format imagefile) can be
+accessed it must be \fImounted\fR, i.e., the DBSS must be informed of the
+existence and type of the dataset.  The details of the mount operation are
+kernel dependent; ideally the mount operation will consist of little more
+than examining the structure of the foreign dataset and making appropriate
+entries in the system catalog.
+
+.nh 4
+Crash Recovery
+
+    A utility is required for recovering datasets which have been corrupted
+as a result of a hardware or software failure.  There should be sufficient
+redundancy in the internal data structures of a dataset to permit automated
+recovery.  The recover operation is similar to a rebuild so perhaps the
+same task can be used for both operations.
+
+.nh 2
+The IMIO Interface
+.nh 3
+Overview
+
+    The Image I/O (IMIO) interface is an existing subroutine interface used
+to maintain and access bulk data arrays (images).  The IMIO interface is built
+upon the DBIO interface, using DBIO to maintain and access the image headers
+and sometimes to access the stored data (the pixels) as well.  For reasons of
+efficiency IMIO directly accesses the bulk data array when large images are
+involved.
+
+Most of the material presented in this section on the image header is new.
+The pixel access facilities provided by the existing IMIO interface will
+remain essentially unchanged, but the image header facilities provided by
+the current interface are quite limited and badly need to be extended.
+The existing header facilities provide support for the major physical image
+attributes (dimensionality, length of each axis, pixel datatype, etc.) plus
+a limited facility for storing user defined attributes.  The main changes
+in the new interface will be excellent support for history records, world
+coordinates, histograms, a bad pixel list, and image masks.  In addition
+the new interface will provide improved support for user defined attributes,
+and greatly improved efficiency when accessing large groups of images.
+The storage structures will be more localized, hopefully causing less
+confusion for the user.
+
+In this section we first discuss the components of an image, concentrating
+primarily on the different parts of the image header, which is quite a
+complex structure.  We then discuss briefly the (mostly existing) facilities
+for header and pixel access.  Lastly we discuss the storage structures
+normally used to maintain images in mass storage.
+
+.nh 3
+Logical Schema
+
+    Images are stored as records in one or more tables in a database.  More
+precisely, the main part of an image header is a record (row) in some table
+in a database.  In general some of the other tables in a database will contain
+auxiliary information describing the image.  Some of these auxiliary tables
+are maintained by IMIO and will be discussed in this section.  Other tables
+will be created by the applications programs used to reduce the image data.
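+
+To make the schema more concrete, the fragment below suggests how the
+relations involved might be declared.  The syntax is purely illustrative
+(the actual DDL syntax is not defined in this document), and only a few of
+the attributes discussed in the following sections are shown.
+.ks
+.nf
+	relation image {
+	    image	int		# primary key
+	    naxis	posint		# number of axes
+	    naxis1	posint		# length of axis 1
+	    title	char(79)	# image title
+	    pixels	real(*)		# pixel segment (varying length)
+	}
+
+	relation history { time, parent, child, event }
+	relation wcs     { wcs, image, type, axis, ... }
+.fi
+.ke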
+ +As far as the DBSS is concerned, the pixel segment of an image is a pretty +minor item, a single array type attribute in the image header. Since the +size of this array can vary enormously from one image to the next some +strategic questions arise concerning where to store the data. In general, +small pixel segments will be stored directly in the image header, while large +pixel segments will be stored in a separate file from that used to store +the header records. + +The major components of an image (as far as IMIO is concerned) are summarized +below. More detailed information on each component is given in the following +sections. +.ls +.ls Standard Header Fields +An image header is a record in a relation initially of type "image". +The standard header fields include all attributes necessary to describe +the physical characteristics of the image, i.e., all attributes necessary +to access the pixels. +.le +.ls History +History records for all images in a database are stored in a separate history +relation in time sequence. +.le +.ls World Coordinates +An image may have any number of world coordinate systems associated with it. +These are stored in a separate world coordinate system relation. +.le +.ls Histogram +An image may have any number of histograms associated with it. +Histograms for all images in a database are stored in a separate histogram +relation in time sequence. +.le +.ls Pixel Segment +The pixel segment is stored in the image header, at least from the point of +view of the logical schema. +.le +.ls Bad Pixel List +The bad pixel list, a variable length integer array, is required to physically +describe the image hence is stored in the image header. +.le +.ls Region Mask +An image may have any number of region masks associated with it. Region masks +for all images in a database are stored in a separate mask relation. A given +region mask may be associated with any number of different images. +.le +.le + + +In summary, the \fBimage header\fR contains the standard header fields, +the pixels, the bad pixel list, and any user defined fields the user wishes +to store directly in the header. All other information describing an image +is stored in external non-image relations, of which there may be any number. +Note that the auxiliary tables (world coordinates, histograms, etc.) are not +considered part of the image header. + +.nh 4 +Standard Header Fields + + The standard header fields are those fields required to describe the +physical attributes of the image, plus those fields required to physically +access the image pixels. The standard header fields are summarized below. +These fields necessarily reflect the current capabilities of IMIO. Since +the DBSS provides data independence, however, new fields may be added in +the future to support future versions of IMIO without rendering old images +unreadable. +.ls +.ls 12 image +An integer value automatically assigned by IMIO when the image is created +which uniquely identifies the image within the containing table. This field +is used as the primary key in \fIimage\fR type relations. +.le +.ls naxis +Number of axes, i.e., the dimensionality of the image. +.le +.ls naxis[1-4] +A group of 4 attributes, i.e., \fInaxis1\fR through \fInaxis4\fR, +each specifying the length of the associated image axis in pixels. +Axis 1 is an image line, 2 is a column, 3 is a band, and so on. +If \fInaxis\fR is greater than four additional axis length attributes +are required. If \fInaxis\fR is less than four the extra fields are +set to one. 
Distinct attributes are used rather than an array so that
+the image dimensions will appear in printed output, to simplify the use
+of the dimension attributes in queries, and to make the image header
+more FITS-like.
+.le
+.ls linelen
+The physical length of axis one (a line of the image) in pixels.  Image lines
+are often aligned on disk block boundaries (stored in an integral number of
+disk blocks) for greater i/o efficiency.  If \fIlinelen\fR is the same as
+\fInaxis1\fR the image is said to be stored in compressed format.
+.le
+.ls pixtype
+A string valued attribute identifying the datatype of the pixels as stored
+on disk.  The possible values of this attribute are discussed in detail below.
+.le
+.ls bitpix
+The number of bits per pixel.
+.le
+.ls pixels
+The pixel segment.
+.le
+.ls nbadpix
+The number of bad pixels in the image.
+.le
+.ls badpix
+The bad pixel list.  This is effectively a boolean image stored in compressed
+form as a variable length integer array.  The bad pixel list is maintained by
+the pixel list package, a subpackage of IMIO, also used to maintain region
+masks.
+.le
+.ls datamin
+The minimum pixel value.  This field is automatically invalidated (set to a
+value greater than \fIdatamax\fR) whenever the image is modified, unless
+explicitly updated by the caller.
+.le
+.ls datamax
+The maximum pixel value.  This field is automatically invalidated (set to a
+value less than \fIdatamin\fR) whenever the image is modified, unless
+explicitly updated by the caller.
+.le
+.ls title
+The image title, a one line character string identifying the image,
+for annotating plots and other forms of output.
+.le
+.le
+
+
+The possible values of the \fIpixtype\fR field are shown below.  The format
+of the value string is "type.host", where \fItype\fR is the logical datatype
+and \fIhost\fR is the host machine encoding used to represent that datatype.
+
+
+.ks
+.nf
+	TYPE		DESCRIPTION			MAPS TO
+
+	byte.m		unsigned byte ( 8 bits)		short.spp
+	ushort.m	unsigned word (16 bits)		long.spp
+
+	short.m		short integer, signed		short.spp
+	long.m		long integer, signed		long.spp
+	real.m		single precision floating	real.spp
+	double.m	double precision floating	double.spp
+	complex.m	(real,real)			complex.spp
+.fi
+.ke
+
+
+Note that the first character of each keyword is sufficient to uniquely
+identify the datatype.  The ".m" suffix identifies the "machine" to which
+the datatype refers.  When new images are written \fIm\fR will usually be
+the name of the host machine.  When images written on a different machine
+are read on the local host there is no guarantee that the i/o system will
+recognize the formats for the named machine, but at least the format will
+be uniquely defined.  Some possible values for \fIm\fR are shown below.
+
+
+.ks
+.nf
+	dbk	DBK (database kernel) mip-format
+	mip	machine independent (MII integer,
+		IEEE floating)
+	sun	SUN formats (same as mip?)
+	vax	DEC Vax data formats
+	mvs	DG MV-series data formats
+.fi
+.ke
+
+
+The DBK format is used when the pixels are stored directly in the image header,
+since only the DBK binary formats are supported in DBK binary datafiles.
+The standard i/o system will support at least the MIP, DBK, SUN (=MIP),
+and VAX formats.  If the storage format is not the host system format,
+conversion to and from the corresponding SPP (host) format will occur at the
+level of the FIO interface to avoid an N-squared type conversion matrix in
+IMIO, i.e., IMIO will see only the SPP datatypes.
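+
+Since the pixtype value is a simple two-part string, decomposing it is
+trivial.  The following fragment shows one way to do it in C; this is an
+illustrative sketch, not actual IMIO code.
+.ks
+.nf
+	#include <string.h>
+
+	/* parse_pixtype -- split a pixtype value such as "real.vax"
+	 * into its logical type ("real") and host format ("vax")
+	 * parts.  Returns 0 if ok, -1 if the string is malformed.
+	 */
+	int
+	parse_pixtype (const char *pixtype, char *type, char *host)
+	{
+		const char *dot = strchr (pixtype, '.');
+
+		if (dot == NULL)
+			return (-1);	/* not of the form "type.host" */
+		strncpy (type, pixtype, dot - pixtype);
+		type[dot - pixtype] = '\0';
+		strcpy (host, dot + 1);
+		return (0);
+	}
+.fi
+.ke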
+
+Examples of possible \fIpixtype\fR values are "short.vax", i.e., a 16 bit signed
+twos-complement byte-swapped integer format, and "real.mip", the 32 bit IEEE
+single precision floating point format.
+
+.nh 4
+History Text
+
+    The intent of the \fIhistory\fR relation is to record all events which
+modify the image data in a dataset, i.e., all operations which create, delete,
+or modify images.  The attributes of the history relation are shown below.
+Records are added to the history table in time sequence.  Each record logically
+contains one line of history text.
+.ls 4
+.ls 12 time
+The date and time of the event.  The value of this field is automatically
+set by IMIO when the history record is inserted.
+.le
+.ls parent
+The name of the parent image in the case of an image creation event,
+or the name of the affected image in the case of an image modification
+event affecting a single image.
+.le
+.ls child
+The name of the child or newly created image in the case of an image creation
+event.  This field is not used if only a single image is involved in an event.
+.le
+.ls event
+The history text, i.e., a one line description of the event.  The suggested
+format is a task or procedure call naming the task or procedure which modified
+the image and listing its arguments.
+.le
+.le
+
+
+.ks
+.nf
+Example:
+
+	TIME		PARENT		CHILD		EVENT
+
+	Sep 23 20:24	nite1[12]	--		imshift (1.5, -3.4)
+	Sep 23 20:30	nite1[10]	nite1[15]
+	Sep 23 20:30	nite1[11]	nite1[15]
+	Sep 23 20:30	nite1[15]	--		nite1[10] - nite1[11]
+.fi
+.ke
+
+
+The principal reason for collecting all history text together in a single
+relation rather than storing it scattered about in string attributes in the
+image headers is to permit use of the DBMS facilities to pose queries on the
+history of the dataset.  Secondary reasons are the completeness of the history
+record thus provided for the dataset as a whole, and increased efficiency,
+both in the amount of storage required and in the time required to record an
+event (in particular, the time required to create a new image).  Note also that
+the history relation may be used to record events affecting dataset objects
+other than images.
+
+The history of any particular image is easily recovered by printing the values
+of the \fIevent\fR field of all records which name the image in the
+\fIparent\fR or \fIchild\fR fields.  The parents or children of any image are
+easily traced using the information in the history relation.  The history of
+the dataset as a whole is given by printing all history records in time
+sequence.  History information is not lost when intermediate images are
+deleted unless deletes are explicitly performed upon the history relation.
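+
+For example, the history of image nite1[15] might be retrieved with a query
+along the following lines (the select expression syntax shown is illustrative
+only):
+.ks
+.nf
+	cl> ptable "history where parent = 'nite1[15]' or child = 'nite1[15]'"
+.fi
+.ke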
.nh 4
+World Coordinates
+
+    In general, an image may simultaneously have any number of world coordinate
+systems (WCS) associated with it.  It would be quite awkward to try to store an
+arbitrary number of WCS descriptors in the image header, so a separate WCS
+relation is used instead.  If world coordinates are not used no overhead is
+incurred.
+
+Maintenance of the WCS descriptor, transformation of the WCS itself (e.g.,
+when an image changes spatially), and coordinate transformations using the WCS
+are all managed by a dedicated package, also called WCS.  The WCS package
+is a general purpose package usable not only in IMIO but also in GIO and
+other places.  IMIO will be responsible for copying the WCS records for an
+image when a new image is created, as well as for correcting the WCS for the
+effects of subsampling, coordinate flip, etc. when a section of an image is
+mapped.
+
+A general solution to the WCS problem requires that the WCS package support
+both linear and nonlinear coordinate systems.  The problem is further
+complicated by the variable number of dimensions in an image.  In general
+the number of possible types of nonlinear coordinate systems is unlimited.
+Our solution to this difficult problem is as follows.
+.ls 4
+.ls o
+Each image axis is associated with a one or two dimensional mapping function.
+.le
+.ls o
+Each mapping function consists of a general linear transformation followed
+by a general nonlinear transformation.  Either transformation may be the
+identity (i.e., may be omitted) if desired.
+.le
+.ls o
+The linear transformation for an axis consists of some combination of a shift,
+scale change, rotation, and axis flip.
+.le
+.ls o
+The nonlinear transformation for an axis consists of a numerical approximation
+to the underlying nonlinear analytic function.  A one dimensional function is
+approximated by a curve x=f(a) and a two dimensional function is approximated
+by a surface x=f(a,b), where X, A, and B may be any of the image axes.
+A choice of approximating functions is provided, e.g., Chebyshev or Legendre
+polynomial, piecewise cubic spline, or piecewise linear.
+.le
+.ls o
+The polynomial functions will often provide the simplest solution for well
+behaved coordinate transformations.  The piecewise functions (spline and linear)
+may be used to model any slowly varying analytic function represented in
+cartesian coordinates.  The piecewise functions \fIinterpolate\fR the original
+analytic function on a regular grid, approximating the function between grid
+points with a first or third order polynomial.  The approximation may be made
+arbitrarily good by sampling on a finer grid, trading table space for increased
+precision.
+.le
+.ls o
+For many nonlinear functions, especially those defined in terms of the
+transcendental functions, the fitted curve or surface will be quicker to
+evaluate than the original function, i.e., the approximation will be more
+efficient (evaluation of a bicubic spline is not cheap, however, requiring
+computation of a linear combination of sixteen coefficients for each output
+point).
+.le
+.ls o
+The nonlinear transformation will define the mapping from pixel coordinates
+to world coordinates.  The inverse transformation will be computed by numerical
+inversion (iterative search).  This technique may be too inefficient for some
+applications.
+.le
+.le
+
+
+For example, the WCS for a three dimensional image might consist of a bivariate
+Nth order Chebyshev polynomial mapping X and Y to RA and DEC via a gnomonic
+projection, plus a univariate piecewise linear function mapping each discrete
+image band (Z) to a wavelength value.  If the image were subsequently shifted,
+rotated, magnified, block averaged, etc., or sampled via an image section,
+a linear term would be added to the WCS record of each axis affected by the
+transformation.
+
+A WCS is represented by a \fIset\fR of records in the WCS relation.  One record
+is required for each axis mapped by the transformation.  The attributes of the
+WCS relation are described below.  The records forming a given WCS all share
+the same value of the \fIwcs\fR field.
+.ls
+.ls 12 wcs
+The world coordinate system number, a unique integer code assigned by the WCS
+package when the WCS is added to the database.
+.le
+.ls image
+The name of the image with which the WCS is associated.
+
+If a WCS is to be associated with more than one image, retrieval must be via
+the \fIwcs\fR number rather than the \fIimage\fR name field.
+.le
+.ls type
+A keyword supplied by the application identifying the type of coordinate
+system defined by the WCS.  This attribute is used in combination with the
+\fIimage\fR attribute for keyword based retrieval in cases where an image
+may have multiple world coordinate systems.
+.le
+.ls axis
+The image axis mapped by the transformation stored in this record.  The X
+axis is number 1, Y is number 2, and so on.
+.le
+.ls axin1
+The first input axis (independent variable in the transformation).
+.le
+.ls axin2
+The second input axis, set to zero in the case of a univariate transformation.
+.le
+.ls axout
+The number of the input axis (1 or 2) to be used for world coordinate output,
+in the case where there is only the linear term but there are two input axes
+(in which case the linear term produces a pair of world coordinate values).
+.le
+.ls linflg
+A flag indicating whether the linear term is present in the transformation.
+.le
+.ls nlnflg
+A flag indicating whether the nonlinear term is present in the transformation.
+.le
+.ls p1,p2
+Linear transformation: origin in pixel space for input axes 1, 2.
+.le
+.ls w1,w2
+Linear transformation: origin in world space for input axes 1, 2.
+.le
+.ls s1,s2
+Linear transformation: scale factor DW/DP for input axes 1, 2.
+.le
+.ls rot
+Linear transformation: rotation angle in degrees counterclockwise from the
+X axis.
+.le
+.ls cvdat
+The curve or surface descriptor for the nonlinear term.  The internal format
+of this descriptor is controlled by the relevant math package.
+This is a variable length array of type real.
+.le
+.ls label
+Axis label for plots.
+.le
+.ls format
+Tick label format for plots, e.g., "0.2h" specifies HMS format in a variable
+field width with two decimal places in the seconds field.
+.le
+.le
+
+
+As noted earlier, the full transformation for an axis involves a linear
+transformation followed by a nonlinear transformation.  The linear term
+is defined in terms of the WCS attributes \fIp1, p2\fR, etc. as shown below.
+The variables X and Y are the input values of the axes \fIaxin1\fR and
+\fIaxin2\fR, which need not correspond to the X and Y axes of the image.
+
+
+.ks
+.nf
+	x' = (x - p1) * s1
+	y' = (y - p2) * s2
+
+	x" = x' * cos(rot) + y' * sin(rot)
+	y" = y' * cos(rot) - x' * sin(rot)
+
+	u  = x" + w1
+	v  = y" + w2
+.fi
+.ke
+
+
+The output variables U and V are then used as input to the nonlinear mapping,
+producing the world coordinate value W for the specified image axis \fIaxis\fR
+as output.
+
+	w = eval (cvdat, u, v)
+
+The mappings for the special cases [1] no linear transformation,
+[2] no nonlinear transformation, and [3] univariate rather than bivariate
+transformation, are easily derived from the full transformation shown above.
+Note that if there is no nonlinear term the linear term produces world
+coordinates as output, otherwise the intermediate values (U,V) are in
+pixel coordinates.  Note also that if there is no nonlinear term but there
+are two input axes (as in the case of a rotation), attribute \fIaxout\fR
+must be set to indicate whether U or V is to be returned as the output world
+coordinate.
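+
+A minimal sketch of the forward linear term in C follows.  The routine name
+and calling sequence are invented for illustration; the actual WCS package
+interface is not defined in this document.
+.ks
+.nf
+	#include <math.h>
+
+	/* wcs_lineval -- evaluate the linear term of a WCS axis
+	 * mapping, transforming the input coordinates (x,y) into the
+	 * intermediate coordinates (u,v) using the attributes p1,p2,
+	 * w1,w2, s1,s2, and rot of the WCS record, as defined above.
+	 */
+	void
+	wcs_lineval (double x, double y,
+	    double p1, double p2, double w1, double w2,
+	    double s1, double s2, double rot, double *u, double *v)
+	{
+		double	xp = (x - p1) * s1;
+		double	yp = (y - p2) * s2;
+		double	r  = rot * M_PI / 180.0;  /* degrees to radians */
+
+		*u = xp * cos(r) + yp * sin(r) + w1;
+		*v = yp * cos(r) - xp * sin(r) + w2;
+	}
+.fi
+.ke
+
+If the nonlinear term is present the outputs (u,v) are pixel coordinates to
+be passed to the curve or surface evaluator; otherwise they are already world
+coordinates, and \fIaxout\fR selects which of the two is returned.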
.nh 4
+Image Histogram
+
+    Histogram records are stored in a separate histogram relation outside
+the image header.  An image may have any number of histograms associated
+with it, each defined for a different section of the image.  A given image
+section may have multiple associated histogram records differing in time,
+number of sampling bins, etc., although normally recomputation of the
+histogram for a given section will result in a record update rather than an
+insertion.  A subpackage within IMIO is responsible for the computation of
+histogram records.  Histogram records are not propagated when an image is
+copied.  Modifications to an image made subsequent to computation of a
+histogram record may invalidate or obsolete the histogram.
+.ls 4
+.ls 12 image
+The name of the image or image section to which the histogram record
+applies.
+.le
+.ls time
+The date and time when the histogram was computed.
+.le
+.ls z1
+The pixel value associated with the first bin of the histogram.
+.le
+.ls z2
+The pixel value associated with the last bin of the histogram.
+.le
+.ls npix
+The total number of pixels used to compute the histogram.
+.le
+.ls nbins
+The number of bins in the histogram.
+.le
+.ls bins
+The histogram itself, i.e., an array giving the number of pixels in each
+intensity range.
+.le
+.le
+
+
+The histogram limits Z1 and Z2 will normally correspond to the minimum and
+maximum pixel values in the image section to which the histogram applies.
+
+.nh 4
+Bad Pixel List
+
+    The bad pixel list records the positions of all bad pixels in an image.
+A "bad" pixel is a pixel which has an invalid value and which therefore should
+not be used for image analysis.  As far as IMIO is concerned a pixel is either
+good or bad; if an application wishes to assign a fractional weight to
+individual pixels then a second weight image must be associated with the
+data image by the applications program.
+
+Images tend to have few or no bad pixels.  When bad pixels are present they
+are often grouped into bad regions.  This makes it possible to use data
+compression techniques to efficiently represent the set of bad pixels,
+which is conceptually a simple boolean mask image.
+
+The bad pixel list is represented in the image header as a variable length
+integer array (the runtime structure is slightly more complex).
+This integer array consists of a set of lists.  Each list in the set lists
+the bad pixels in a particular image line.  Each linelist consists of a record
+length field and a line number field, followed by the bad pixel list for that
+line.  The bad pixel list is a series of either column numbers or ranges of
+column numbers.  Single columns are represented in the list as positive
+integers; ranges are indicated by a negative second value.
+
+
+.ks
+.nf
+	15  2  512 512
+	 6  23  4  8  15 -18  44
+	 4  72  23 -29  35
+.fi
+.ke
+
+
+An example of a bad pixel list describing a total of 15 bad pixels is shown
+above.  The first line is the pixel list header which records the total list
+length (15 ints), the number of dimensions (2), and the sizes of each dimension
+(512, 512).  There follow a set of variable length line list records; like the
+total list length, each record length field counts the integers which follow
+it.  Two such lists are shown in the example, one for line 23 and one for
+line 72.  On line 23 columns 4, 8, 15 through 18, and 44 are all bad.  Note
+that each linelist contains only a line number since the list is two
+dimensional; in general an N dimensional image requires N-1 subscripts after
+the record length field, starting with the line number and proceeding to
+higher dimensions to the right.
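+
+The following fragment sketches how the two dimensional encoding might be
+traversed.  C is used for illustration and the names are invented; the actual
+pixel list package is described below.
+.ks
+.nf
+	#include <stdio.h>
+
+	/* bp_print -- print the bad columns encoded in a two
+	 * dimensional bad pixel list, stored in the format described
+	 * above.  Each length field counts the ints which follow it.
+	 */
+	void
+	bp_print (int *list)
+	{
+		int	len  = list[0];		/* ints following     */
+		int	ndim = list[1];		/* must be 2 here     */
+		int	*ip  = &list[ndim + 2];	/* first line list    */
+		int	*top = &list[len + 1];	/* end of the array   */
+
+		while (ip < top) {
+			int	reclen = ip[0], line = ip[1], i;
+			for (i = 2;  i <= reclen;  i++)
+				if (i < reclen && ip[i+1] < 0) {
+					printf ("line %d, columns %d-%d\n",
+					    line, ip[i], -ip[i+1]);
+					i++;	/* skip range endpoint */
+				} else
+					printf ("line %d, column %d\n",
+					    line, ip[i]);
+			ip += reclen + 1;
+		}
+	}
+.fi
+.ke
+
+Applied to the example above, this would report columns 4, 8, 15-18, and 44
+of line 23 and columns 23-29 and 35 of line 72 as bad.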
+In general, pointwise image operators which produce a new image as output
+will not need to check for bad pixels.  Non-pointwise image operators, e.g.,
+filtering operators, may or may not wish to check for bad pixels (in principle
+they should use kernel collapse to ignore bad pixels).  Analysis programs,
+i.e., programs which produce database records as output rather than create
+new images, will usually check for and ignore bad pixels.
+
+To avoid machine traps when running the pointwise image operators, all bad
+pixels must have reasonable values, even if these values have to be set
+artificially when the data is archived.  IMAGES SHOULD NOT BE ARCHIVED WITH
+MAGIC IN-PLACE VALUES FOR THE BAD PIXELS (as in FITS) since this forces the
+system to conditionally test the value of every pixel when the image is read,
+an unnecessary operation which is quite expensive for large images.
+The simplicity of the reserved value scheme does not warrant such an expense.
+Note that the reverse operation, i.e., flagging the bad pixels by setting
+them to a magic value, can be carried out very efficiently by the reader
+program given a bad pixel list.
+
+For maximum efficiency those operators which have to deal with bad pixels may
+provide two separate data paths internally, one for data which contains no
+bad pixels and one for data containing some bad pixels.  The path to be taken
+would be chosen dynamically as each image line is input, using the bad pixel
+list to determine which lines contain bad pixels.  Alternatively a program
+may elect to have the bad pixels flagged upon input by assignment of a magic
+value.  The two-path approach is the most desirable one for simple operators.
+The magic value approach is often simplest for the more complex applications
+where duplicating the code to provide two data paths would be costly and the
+operation is already so expensive that the conditional test is not important.
+
+All operations and queries on bad pixel lists are via a general pixel list
+package which is used by IMIO for the bad pixel list but which may be used
+for any other type of pixel list as well.  The pixel list package provides
+operators for creating new lists, adding and deleting pixels and ranges of
+pixels from a list, merging lists, and so on.
+
+.nh 4
+Region Mask
+
+    A region mask is a pixel list which defines some subset of the pixels in
+an image.  Region masks are used to define the region or regions of an image
+to be operated upon.  Region masks are stored in a separate mask relation.
+A mask is a type of pixel list and the standard pixel list package is used
+to maintain and access the mask.  Any number of different region masks may be
+associated with an image, and a given region mask may be used in operations
+upon any number of different images.
+.ls 4
+.ls 12 mask
+The mask number, a unique integer code assigned by the pixel list package
+when the mask is added to the database.
+.le
+.ls image
+The image or image section associated with the mask, if any.
+.le
+.ls type
+The logical type of the mask, a keyword supplied by the applications program
+when the mask is created.
+.le
+.ls naxis
+The number of axes in the mask image.
+.le
+.ls naxis[1-4]
+The length of each image axis in pixels.  If \fInaxis\fR is greater than 4
+additional axis length attributes must be provided.
+.le
+.ls npix
+The total number of pixels in the subset defined by the mask.
+.le
+.ls pixels
+The mask itself, a variable length integer array.
+.le
+.le
+
+
+Examples of the use of region masks include specifying the regions to be
+used in a surface fit to a two dimensional image, or specifying the regions
+to be used to correlate two or more images for image registration.
+A variety of utility tasks will be provided in the \fIimages\fR package for
+creating mask images, interactively and otherwise.  For example, it will
+be possible to display an image and use the image cursor to mark the regions
+interactively.
+
+.nh 3
+Group Data
+
+    The group data format associates a set of keyword = value type
+\fBgroup header\fR parameters with a group of images.  All of the images in
+a group should have the same size, number of dimensions, and datatype;
+this is required for images to be in group format even though it is not
+physically required by the database system.  All of the images in a group
+share the parameters in the group header.  In addition, each image in a
+group has its own private set of parameters (attributes), stored in the
+image header for that image.
+
+The images forming a group are stored in the database as a named base table
+of type \fIimage\fR.  The name of the base table must be the same as the name
+of the group.  Each group is stored in a separate table.  The group headers
+for all groups in the database are stored in a separate \fIgroups\fR table.
+The attributes of the \fIgroups\fR relation are described below.
+.ls 4
+.ls 12 group
+The name of the group (\fIimage\fR table) to which this record belongs.
+.le
+.ls keyword
+The name of the group parameter represented by the current record.
+The keyword name should be FITS compatible, i.e., the name must not exceed
+eight characters in length.
+.le
+.ls value
+The value of the group parameter represented by the current record, encoded
+FITS style as a character string not to exceed 20 characters in length.
+.le
+.ls comment
+An optional comment string, not to exceed 49 characters in length.
+.le
+.le
+
+
+Group format is provided primarily for the STScI/SDAS applications, which
+require data to be in group format.  The format is however useful for any
+application which must associate an arbitrary set of \fIglobal\fR parameters
+with a group of images.  Note that the member images in a group may be
+accessed independently like any other IRAF image since each image has a
+standard image header.  The primary physical attributes will be identical
+in all images in the group, but these attributes must still be present in
+each image header.  For the SDAS group format the \fInaxis\fR, \fInaxisN\fR,
+and \fIbitpix\fR parameters are duplicated in the group header.
+
+.nh 3
+Image I/O
+
+    In this section we describe the facilities available for accessing
+image headers and image data.  The discussion will be limited to those
+aspects of IMIO relevant to a discussion of the DBSS.  The image i/o (IMIO)
+interface and the image database interface (IDBI) are existing interfaces
+which are more properly described in detail elsewhere.
+
+.nh 4
+Image Templates
+
+    Most IRAF image operators are set up to operate on a group of images,
+rather than a single image.  Membership in such a group is determined at
+runtime by a so-called \fIimage template\fR which may select any subset
+of the images in the database, i.e., any subset of images from any subset
+of \fIimage\fR type base tables.  This type of group should not be confused
+with the \fIgroup format\fR discussed in the last section.
+The image template is normally entered by the user on the command line and
+is dynamically converted into a list of images by expansion of the template
+on the current contents of the database.
+
+Given an image template the IRAF applications program calls an IMIO routine
+to "open" the template.  Successive calls to a get image name routine are made
+to operate upon the individual images in the group.  When all images have been
+processed the template is closed.
+
+The images in a group defined by an image template must exist by definition
+when the template is expanded, hence the named images must either be input
+images or the operation must update or delete the named images.  If an
+output image is to be produced for each input image the user must supply the
+name of the table into which the new images are to be inserted.  This is
+exactly the same type of operation performed by the DBMS operators, and in
+fact most image operators are relational operators, i.e., they take a
+relation as input and produce a new relation as output.  Note that the user
+is required to supply only the name of the output table, not the names of
+the individual images.  The output table may be one of the input tables if
+desired.
+
+An image template is syntactically equivalent to a DBIO record selection
+expression with one exception: each image name may optionally be modified
+by appending an \fIimage section\fR to specify the subset of the pixels in
+the image to be operated upon.  An example of an image section string is
+"[*,100]"; this references line 100 of the associated image.  The image
+section syntax is discussed in detail in the \fICL User's Guide\fR.
+
+Since the image template syntax is nearly identical to the general DBIO record
+selection syntax the reader is referred to the discussion of the latter syntax
+presented in section 4.5.6 for further details.  The new DBIO syntax is largely
+upwards compatible with the image template syntax currently in use.
+
+.nh 4
+Image Pixel Access
+
+    IMIO provides quite sophisticated pixel access facilities which it is
+beyond the scope of the present document to discuss in detail.  Complete
+data independence is provided, i.e., the applications program in general
+need not know the actual dimensionality, size, datatype, or storage mode
+of the image, what format the image is stored in, or even what device or
+machine the image resides on.  This is not to say that the application is
+forbidden from knowing these things, since more efficient i/o is possible
+if there is a match between the logical and physical views of the data.
+
+Pixel access via IMIO is via the FIO interface.  The DBSS is charged with
+management of the pixel storage file (if any) and with setting up the
+FIO interface so that IMIO can access the pixels.  Both buffered and virtual
+memory mapped access are supported; which is actually used is transparent to
+the user.  The types of i/o operations provided are "get", "put", and "update".
+The objects upon which i/o may be performed are image lines, image columns,
+N-dimensional subrasters, and pixel lists.
+
+New in the DBIO based version of IMIO are update mode and column and pixel
+list i/o, plus direct access via virtual memory mapping using the static file
+driver.
+
+.nh 4
+Image Database Interface (IDBI)
+
+    The image database interface is a simple keyword based interface to the
+(non array valued) fields of the standard image header.
+The IDBI isolates the image oriented applications program from the method
+used to store the header, i.e., programs which access the header via the
+IDBI don't care whether the header is implemented upon DBIO or some other
+record i/o interface.
+In particular, the IDBI is an existing interface which is \fInot\fR currently
+implemented upon DBIO, but which will be converted to use DBIO when it becomes
+available.  Programs which currently use the IDBI should require few if any
+changes when DBIO is installed.
+
+The philosophy of isolating the applications program using IMIO from the
+underlying interfaces is followed in all the subpackages forming the IMIO
+interface.  Additional IMIO subpackages are provided for appending history
+records, creating and reading histograms, and so on.
+
+.nh 3
+Summary of IMIO Data Structures
+
+    As we have seen, an image is represented as a record in a table in some
+database.  The image record consists of a set of standard fields, a set of
+user defined fields, and the pixel segment, or at least sufficient information
+to locate and access the pixel segment if it is stored externally.
+An image database may contain a number of other tables; these are summarized
+below.
+
+
+.ks
+.nf
+    <images>      Image storage (a set of tables named by the user)
+    groups        Header records for group format data
+    histograms    Histograms of images or image sections
+    history       Image history records
+    masks         Region masks
+    wcs           World coordinate systems
+.fi
+.ke
+
+
+Any number of additional application specific tables may be present in an
+actual database.  The names of the application and user defined tables must
+not conflict with the reserved table names shown above (or with the names of
+the DBIO system tables discussed in the next section).  The pixel segment of
+an image and possibly the image header may be stored in a non-DBSS format
+accessed via the HDBI.  All the other tables are stored in the standard DBSS
+format.
+
+.nh 2
+The DBIO Interface
+.nh 3
+Overview
+
+    The database i/o (DBIO) interface is the interface by which all compiled
+programs directly or indirectly access data maintained by the DBSS.  DBIO is
+primarily a high level record manager interface.  DBIO defines the logical
+structure of a database and directly implements most of the operations
+possible upon the objects in a database.
+
+The major functions of DBIO are to translate a record select/project expression
+into a series of physical record accesses, and to provide the applications
+program with access to the contents of the specified records.  DBIO hides
+the physical structure and contents of the stored records from the applications
+program; providing data independence is one of the major concerns of DBIO.
+DBIO is not directly concerned with the physical storage of tables and records
+in mass storage, nor with the methods used to physically access such objects.
+The latter operations, i.e., the \fIaccess method\fR, are provided by a database
+kernel (DBK).
+
+We first review the philosophy underlying the design of DBIO, and discuss
+how DBIO differs from most commercial database systems.  Next we describe
+the logical structure of a database and introduce the objects making up a
+database.  The method used to define an actual database is described,
+followed by a description of the methods used to access the contents of a
+database.  Lastly we describe the mapping of a DBIO database into physical
+files.
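+
+To make the division of labor between DBIO and the kernel concrete, the
+following sketch (in C for concreteness; the real interfaces are written in
+SPP) shows one way the kernel boundary might be expressed.  The structure
+and member names are hypothetical; the actual DBK interface is specified
+separately.
+
+.ks
+.nf
+    /* Hypothetical access method table through which DBIO might call
+     * the database kernel (DBK).  Names and signatures are illustrative
+     * only.
+     */
+    typedef struct {
+        int (*open)   (const char *table);     /* open a stored table  */
+        int (*close)  (int td);                /* close the table      */
+        int (*getrec) (int td, long recno,     /* read a stored record */
+                       void *buf);
+        int (*putrec) (int td, long recno,     /* rewrite a record     */
+                       const void *buf);
+    } DBKernel;
+.fi
+.ke
+
+In such a scheme DBIO would reduce a record select/project expression to a
+sequence of \fIgetrec\fR calls upon the tables named in the expression,
+decoding each stored record as described in the sections which follow.
+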
+
+.nh 3
+Comparison of DBIO and Commercial Databases
+
+    The design of the DBIO interface is based on a thorough study of existing
+database systems (most especially System-R, DB2, and INGRES).  It was clear
+from the beginning that these systems were not ideally suited to our
+application, even if the proprietary and portability issues were ignored.
+Eventually the differences between these commercial database systems and the
+system we need became clear.  The differences are due to a change in focus
+and emphasis as much as to the obvious differences between scientific and
+commercial applications, and are summarized below.
+.ls 4
+.ls o
+The commercial systems are not sufficiently flexible in the types of data that
+can be stored.  In particular these systems do not in general support variable
+length arrays of arbitrary datatype; most do not support even static arrays.
+Only a few systems allow new attributes to be added to existing tables.
+Most systems talk about domains but few implement them.  We need both array
+storage and the ability to dynamically add new attributes, and it appears that
+domains will be quite useful as well.
+.le
+.ls o
+Most commercial systems emphasize the query language, which forms the basis
+for the host language interface as well as the user interface.  The query
+language is the focus of these systems.  In our case the DBSS is embedded
+within IRAF as one of many subsystems.  While we do need query language
+facilities at the user level, we do not need such sophisticated facilities
+at the DBIO level and would rather do without the attendant complexity and
+overhead.
+.le
+.ls o
+Commercial database systems are designed for use in a multiuser transaction
+processing environment.  Many users may simultaneously be performing update
+and retrieval operations upon a single centralized database.  The financial
+success of the company may well depend upon the integrity of the database.
+Downtime can be very expensive.
+
+In contrast we anticipate having many independent databases.  These will be
+of two kinds: public and private.  The public databases will virtually always
+be accessed read only and the entire database can be locked for exclusive
+access if it should ever need updating.  Only the private databases are
+subject to heavy updating; concurrent access is required for background jobs
+but the granularity of locking can be fairly coarse.  If a database should
+become corrupted it can be fixed at leisure or even regenerated from scratch
+without causing great hardship.  Concurrency, integrity, and recovery are
+therefore less important for our applications than in a commercial
+environment.
+.le
+.ls o
+Most commercial database systems (with the exception of the UNIX based INGRES)
+are quite machine, device, and host system dependent.  In our case portability
+of both the software and the data is a primary concern.  The requirement that
+we be able to archive data in a machine independent format and read it on a
+variety of machines seems to be an unusual one.
+.le
+.le
+
+
+In summary, we need a simple interface which provides flexibility in the way
+in which data can be stored, and which supports complex, dynamic data
+structures containing variable length arrays of any datatype and size.
+The commercial database systems do not provide enough flexibility in the
+types of data structures they can support, nor do they provide enough
+flexibility in storage formats.
+On the other hand, the commercial systems provide a more sophisticated
+host language interface than we need.  DBIO should therefore emphasize flexible
+data structures but avoid a complex syntax and all the problems that come with
+such.  Concurrency and integrity are important but are not the major concerns
+they would be in a commercial system.
+
+.nh 3
+Query Language Interface
+
+    We noted in the last section that DBIO should be a simple record manager
+type interface rather than an embedded query language type interface.  This
+approach should yield the simplest interface meeting our primary requirements.
+Nonetheless a host language interface to the query language is possible and
+can be added in the future without compromising the present DBIO interface
+design.
+
+The query language will be implemented as a conventional CL callable task in
+the DBMS package.  Command input to the query language will be entered
+interactively via the terminal (the usual case), or noninteractively via a
+string type command line argument or via a file.  Any compiled program can
+send commands to the query language (or to any CL task) using the CLIO
+\fBclcmd\fR procedure.
+Hence a crude but usable HLI query language interface will exist as soon as
+a query language becomes available.  A true high level embedded query language
+interface could be built using the same interface internally, but this should
+be left to some future compiled version of SPP rather than attempted with the
+current preprocessor.  We have no immediate plans to build such an embedded
+query language interface but there is nothing in the current design to hinder
+such a project should it someday prove worthwhile.
+
+.nh 3
+Logical Schema
+
+    In this section we present the logical schema of a DBIO database.
+A DBIO database consists of a set of \fBsystem tables\fR and a set of
+\fBuser tables\fR.  The system tables define the structure of the database
+and its contents; the user tables contain user data.  All tables are instances
+of named \fBrelations\fR or \fBviews\fR.  Relations and views are ordered
+collections of \fBattributes\fR or \fBgroups\fR of attributes.  Each attribute
+is defined upon some particular \fBdomain\fR.  The structure of the objects
+in a database is defined at runtime by processing a specification written in
+the \fBdata definition language\fR.
+
+.nh 4
+Databases
+
+    A DBIO database is a collection of named tables.  All databases include
+a standard set of \fBsystem tables\fR defining the structure and contents
+of the database.  Any number of user or application defined tables may also
+be present in the database.  The most important system table is the database
+\fIcatalog\fR which includes a record describing each user or system table
+in the database.
+
+Conceptually a database is similar to a directory containing files.  The
+catalog corresponds to the directory and the tables correspond to the files.
+A database is however a different type of object; there need be no obvious
+connection between the objects in a database and the physical directories and
+files used to store a database, e.g., several tables might be stored in one
+file, one table might be stored in many files, the tables might be stored on
+a special device and not in files at all, and so on.
+
+In general the mapping of tables into physical objects is hidden from the user
+and is not important.  The only exception to this is the association of a
+database with a specific FIO directory.
+The mapping between databases and directories is one to one, i.e., a
+directory may contain only one database, and a database is contained in a
+single directory.  An entire database can be physically moved, copied,
+backed up, or restored by merely performing a binary copy of the contents
+of the directory.  DBIO dynamically generates all file names relative to the
+database directory, hence moving a database to a different directory is
+harmless.
+
+To hide the database directory from the user DBIO supports the concept of a
+\fBcurrent database\fR in much the same way that FIO supports the concept of
+a current directory.  Tables are normally referenced by name, e.g.,
+"ptable masks", without explicitly naming the database (i.e., directory) in
+which the table resides.  The current database is maintained independently
+of the current directory, allowing the user to change directories without
+affecting the current database.  This is particularly useful when accessing
+public databases (maintained in a write protected directory) or when
+accessing databases which reside on a remote node.  To list the contents of
+the current database the user must type "pcat" rather than "dir".  The
+current database defaults to the current directory until the user explicitly
+sets the current database with the \fBchdb\fR command.
+
+Databases are referred to by the filename of the database directory.
+The IRAF system will provide a "master catalog" of public databases,
+consisting of little more than a set of CL environment definitions assigning
+logical names to the database directories.  Whenever possible logical names
+should be used rather than pathnames to hide the pathname of the database.
+
+.nh 4
+System Tables
+
+    The structure and contents of a DBIO database are described by the same
+table mechanism used to maintain user data.  DBIO automatically maintains
+the system tables, which are normally protected from writing by the user
+(the system tables can be manually updated like any other table in a desperate
+situation).  Since the system tables are ordinary tables, they can be
+inspected, queried, etc., using the same utilities used to access the user
+data tables.  The system tables are summarized below.
+.ls 4
+.ls 12 syscat
+The database catalog.
+Contains an entry (record) for every table or view in the database.
+.le
+.ls sysatt
+The attribute list table.
+Contains an entry for every attribute in every table in the database.
+.le
+.ls sysddt
+The domain descriptor table.
+Contains an entry for every defined domain in the database.  Any number of
+attributes may share the same domain.
+.le
+.ls sysidt
+The index descriptor table.
+Contains an entry for every primary or secondary index in the database.
+.le
+.le
+
+
+The system tables are visible to the user, i.e., they appear in the database
+catalog.  Like the user tables, the system tables are themselves described by
+entries in the database catalog, attribute list table, and domain descriptor
+table.
+
+.nh 4
+The System Catalog
+
+    The \fBsystem catalog\fR is effectively a "table of contents" for the
+database.  The fields of the catalog relation \fBsyscat\fR are as follows.
+.ls 4
+.ls 12 table
+The name of the user or system table described by the current record.
+Table names may contain any combination of the alphanumeric characters,
+underscore, or period and must not exceed 32 characters in length.
+.le
+.ls relid
+The table identifier.  A unique integer code by which the table is referred
+to internally.
+.le
+.ls type
+Identifies the type of table, e.g., base table or view.
+.le
+.ls ncols
+The number of columns (attributes) in the table.
+.le
+.ls nrows
+The number of rows (records, tuples) in the table.
+.le
+.ls rsize
+The size of a record in bytes, not including array storage.
+.le
+.ls tsize
+An estimate of the total number of bytes of storage currently in use by the
+table, including array storage.
+.le
+.ls ctime
+The date and time when the table was created.
+.le
+.ls mtime
+The date and time when the table was last modified.
+.le
+.ls flags
+A small integer containing flag bits used internally by DBIO.
+These include the protection bits for the table.  Initially only write
+protection and delete protection will be supported (for everyone).
+Additional protections are of course provided by the file system.
+A flag bit is also used to indicate that the table has one or more
+indexes, to avoid an unnecessary search of the \fBsysidt\fR table when
+accessing an unindexed table.
+.le
+.le
+
+
+Only a subset of these fields will be of interest to the user in ordinary
+catalog listings.  The \fBpcatalog\fR task will by default print only the
+most interesting fields.  Any of the other DBMS output tasks may be used
+to inspect the catalog in detail.
+
+.nh 4
+Relations
+
+    A \fBrelation\fR is an ordered set of named attributes, each of which is
+defined upon some specific domain.  A \fBbase table\fR is a named instance
+of some relation.  A base table is a real object like a file; a base table
+appears in the catalog and consumes storage on disk.  The term "table" is
+more general, and is normally used to refer to any object which can be
+accessed like a base table.
+
+A DBIO relation is defined by a set of records describing the attributes
+of the relation.  The attribute lists of all relations are stored in the
+\fBsysatt\fR table, described in the next section.
+
+.nh 4
+Attributes
+
+    An \fBattribute\fR of a relation is a datum which describes some aspect
+of the object described by the relation.  Each attribute is defined by a
+record in the \fBsysatt\fR table, the fields of which are described below.
+The attribute descriptor table, while visible to the user if they wish to
+examine the structure of the database in detail, is primarily an internal
+table used by DBIO to define the structure of a record.
+.ls 4
+.ls 12 name
+The name of the attribute described by the current record.
+Attribute names may contain any combination of the alphanumeric characters
+or underscore and must not exceed 16 characters in length.
+.le
+.ls attid
+The attribute identifier.  A unique integer code by which the attribute is
+referred to internally.  The \fIattid\fR is unique within the relation to
+which the attribute belongs, and defines the ordering of attributes within
+the relation.
+.le
+.ls relid
+The relation identifier of the table to which this attribute belongs.
+.le
+.ls domid
+The domain identifier of the domain to which this attribute belongs.
+.le
+.ls dtype
+A single character identifying the atomic datatype of this attribute.
+Note that domain information is not used for most runtime record accesses.
+.le
+.ls prec
+The precision of the atomic datatype of this attribute, i.e., the number
+of bytes of storage per element.
+.le
+.ls count
+The number of elements of type \fIdtype\fR in the attribute.  If this value
+is one the attribute is a scalar.  Zero implies a variable length array
+and N denotes a static array of N elements.
+.le +.ls offset +The offset of the field in bytes from the start of the record. +.le +.ls width +The width of the field in bytes. All fields occupy a fixed amount of space +in a record. In the case of variable length arrays fields \fBoffset\fR and +\fBwidth\fR refer to the array descriptor. +.le +.le + + +In summary, the attribute list defines the physical structure of a record +as stored in mass storage. DBIO is responsible for encoding and decoding +records as well as for all access to the fields of records. A record is +encoded as a byte stream in a machine independent format. The physical +representation of a record is discussed further in a later section describing +the DBIO storage structures. + +.nh 4 +Domains + + A domain is a restricted implementation of an abstract datatype. +Simple examples are the atomic datatypes char, integer, real, etc.; no doubt +these will be the most commonly used domains. A more interesting example is +the \fItime\fR domain. Times are stored in DBIO as attributes defined upon +the \fItime\fR domain. The atomic datatype of a time attribute is a four byte +integer; the value is the long integer value returned by the IRAF system +procedure \fBclktime\fR. Integer time values are convenient for time domain +arithmetic, but are not good for printed output. The definition of the +\fItime\fR domain therefore includes a specification for the output format +which will cause time attributes to be printed as a formatted date/time string. + +Domains are used to verify input and to format output, hence there is no +domain related overhead during record retrieval. The only exception to +this rule occurs when returning the value of an uninitialized attribute, +in which case the default value must be fetched from the domain descriptor. + +Domains may be defined either globally for the entire database or locally for +a specific table. Attributes in any table may be defined upon a global domain. +The system table \fBsysddt\fR defines all global and local domains. +The attributes of this table are outlined below. +.ls 4 +.ls 12 name +The name of the domain described by the current record. +Domain names may contain any combination of the alphanumeric characters +or underscore and must not exceed 16 characters in length. +.le +.ls domid +The domain identifier. A unique integer code by which the domain is referred +to internally. The \fIdomid\fR is unique within the table for which the domain +is defined. +.le +.ls relid +The relation identifier of the table to which this domain belongs. +This is set to zero if the domain is defined globally. +.le +.ls grpid +The group identifier of the group to which this domain belongs. +This is set to zero if the domain does not belong to a special group. +A negative value indicates that the named domain is itself a group +(groups are discussed in the next section). +.le +.ls dtype +A single character identifying the atomic datatype upon which the domain +is defined. +.le +.ls prec +The precision of the atomic datatype of this domain, i.e., the number +of bytes of storage per element. +.le +.ls defval +The default value for attributes defined upon this domain (a byte string of +length \fIprec\fR bytes). If no default value is specified DBIO will assume +that null values are not permitted for attributes defined upon this domain. +.le +.ls minval +The minimum value permitted. This attribute is used only for integer or real +valued domains. +.le +.ls maxval +The maximum value permitted. 
This attribute is used only for integer or real +valued domains. +.le +.ls enumval +If the domain is string valued with a fixed number of permissible value strings, +the legal values may be enumerated in this string valued field. +.le +.ls units +The units label for attributes defined upon this domain. +.le +.ls format +The default output format for printed output. All SPP formats are supported +(e.g., including HMS, HM, octal, etc.) plus some special DBMS formats, e.g., +the time format. +.le +.ls width +The field width in characters for printed output. +.le +.le + + +Note that the \fIunits\fR and \fIformat\fR fields and the four "*val" fields +are stored as variable length character arrays, hence there is no fixed limit +on the sizes of these strings. Use of a variable length field also minimizes +storage requirements and makes it easy to test for an uninitialized value. +Only fixed length string fields and scalar valued numeric fields may be used +in indexes and selection predicates, however. + +A number of global domains are predefined by DBIO. These are summarized +in the table below. + + +.ks +.nf + NAME DTYPE PREC DEFVAL + + byte u 1 0 + char c arb nullstr + short i 2 INDEFS + int i 4 INDEFI + long i 4 INDEFL + real r 4 INDEFR + double r 8 INDEFD + time i 4 0 +.fi +.ke + + +The predefined global domains, as well as all user defined domains, are defined +in terms of the four DBK variable precision atomic datatypes. These are the +following: + + +.ks +.nf + NAME DTYPE PREC DESCRIPTION + + char c >=1 character + uint u 1-4 unsigned integer + int i 1-4 signed integer + real r 2-8 floating point +.fi +.ke + + +DBIO stores records with the field values encoded in the machine independent +variable precision DBK data format. The precision of an atomic datatype is +specified by an integer N, the number of bytes of storage to be reserved for +the value. The permissible precisions for each DBK datatype are shown in +the preceding table. The actual encoding used is designed to simplify the +semantics of the DBK and is not any standard format. The DBK binary encoding +will be described in a later section. + +.nh 4 +Groups + + A \fBgroup\fR is a logical grouping of several related attributes. +A group is much like a relation except that a group is a type of domain +and may be used as such to define the attributes of relations. Since groups +are similar to relations groups are defined in the \fBsysatt\fR table +(groups do not however appear in the system catalog). Each member of a +group is an attribute defined upon some domain; nesting of groups is permitted. + +Groups are expanded when a relation is defined, hence the runtime system +need not be aware of groups. Expansion of a group produces a set of ordinary +attributes wherein each attribute name consists of the group name glued +to the member name with a period, e.g., the resolved attributes "cv.ncoeff" +and "cv.type" are the result of expansion of a two-member group attribute +named "cv". + +The main purposes of the group construct are to simplify data definition and +to give the forms generator additional information for structuring formatted +output. Groups provide a simple capability for structuring data within a table. +Whenever the same grouping of attributes occurs in several tables the group +mechanism should be used to ensure that all instances of the group are +defined equivalently. + +.nh 4 +Views + + A \fBview\fR is a virtual table defined in terms of one or more base +tables or other views via a record select/project expression. 
Views provide
+different ways of looking at the same data; the view mechanism can be very
+useful when working with large, complex base tables (it saves typing).
+Views allow the user to focus on just the data that interests them and ignore
+the rest.  The view mechanism also significantly increases the amount of data
+independence provided by DBIO, since a base table can be made to look
+different to different applications programs without physically modifying
+the table or producing several copies of the same table.  This capability can
+be invaluable when the tables involved are very large or cannot be modified
+for some reason.
+
+A view provides a "window" into one or more base tables.  The window is
+dynamic in the sense that changes to the underlying base tables are immediately
+visible through the window.  This is because a view does not contain any data
+itself, but is rather a \fIdefinition\fR via record selection and projection
+of a new table in terms of existing tables.  For example, consider the
+following imaginary select/project expression (SPE):
+
+    data1 [x >= 10 and x <= 20] % obj, x, y
+
+This defines a new table with attributes \fIobj\fR, \fIx\fR, and \fIy\fR
+consisting of all records of table \fIdata1\fR for which X is in the range
+10 to 20.  We could use the SPE shown to copy the named fields of the
+selected records to produce a new base table, e.g. \fId1x\fR.
+The view mechanism allows us to define table \fId1x\fR as a view-table,
+storing only the SPE shown.  When the view-table \fId1x\fR is subsequently
+queried DBIO will \fImerge\fR the SPE supplied in the new query with that
+stored in the view, returning only records which satisfy both selection
+expressions.  This works because the output of an SPE is a table and can
+therefore be used as input to another SPE, i.e., two or more selection
+expressions can be combined to form a more complex expression.  For example,
+the query \fId1x [y > 5]\fR would be evaluated as if it had been written
+\fIdata1 [x >= 10 and x <= 20 and y > 5] % obj, x, y\fR.
+
+A view appears to the user (or to a program) as a table, behaving equivalently
+to a base table in most operations.  View-tables appear in the catalog and
+can be created and deleted much like ordinary tables.
+
+.nh 4
+Null Values
+
+    Null valued attributes are possible in any database system; they are
+guaranteed to occur when the system permits new attributes to be dynamically
+added to existing, nonempty base tables.  DBIO deals with null values by
+the default value mechanism mentioned earlier in the discussion of domains.
+When the value of an uninitialized attribute is referenced DBIO automatically
+supplies the user specified default value of the attribute.  The defaulting
+mechanism supports three cases; these are summarized below.
+.ls 4
+.ls o
+If null values are not permitted for the referenced attribute DBIO will
+return an error condition.  This case is indicated by the absence of a
+default value.
+.le
+.ls o
+Indefinite (or any special value) may be returned as the default value if
+desired, allowing the calling program to test for a null value.
+.le
+.ls o
+A valid default value may be returned, with no checking for null values
+occurring in the calling program.
+.le
+.le
+
+
+Testing for null values in predicates is possible only if the default value
+is something recognizable like INDEF, and is handled by the conventional
+equality operator.
Indefinites are propagated in expressions by the usual +rules, i.e., the result of any arithmetic expression containing an indefinite +is indefinite, order comparison where an operand is indefinite is illegal, +and equality or inequality comparison is legal and is well defined. + +.nh 3 +Data Definition Language + + The data definition language (DDL) is used to define the objects in a +database, e.g., during table creation. The function of the DBIO table +creation procedure is to add tuples to the system tables to define a new +table and all attributes, groups, and domains used in the table. The data +definition tuples can come from either of two sources: [1] they can be +copied in compiled form from an existing table, or [2] they can be +generated by compilation of a DDL source specification. + +In appearance DDL looks much like a series of structure declarations such +as one finds in most modern compiled languages. DDL text may be entered +either via a string buffer in the argument list (no file access required) +or via a text file named in the argument list to the table creation procedure. + +The DDL syntax has not yet been defined. An example of what a DDL declaration +for the IMIO \fImasks\fR relation might look like is shown below. The syntax +shown is a generalization of the SPP+ syntax for a structure declaration with +a touch of the CL thrown in. If a relation is defined only in terms of the +predefined domains or atomic datatypes and has no primary key, etc., then the +declaration would look very much like an SPP+ (or C) structure declaration. + + +.ks +.nf + relation masks { + u2 mask { width=6 } + c64 image { defval="", format="%20.20s", width=21 } + c15 type { defval="generic" } + byte naxis + long naxis1, naxis2, naxis3, naxis4 + long npix + i2 pixels[] + } where { + key = mask+image+type + comment = "image region masks" + } +.fi +.ke + + +The declaration shown identifies the primary key for the relation and gives +a comment describing the relation, then declares the attributes of the +relation. In this example all domains are either local and are declared +implicitly, or they are global and are predefined. For example, DBIO will +automatically create a domain named "type" belonging to the relation "masks" +for the attribute named "type". DBIO is assumed to provide default values +for the attributes of each domain (e.g., "format", "width", etc.) not +specified explicitly in the declaration. It should be possible to keep +the DDL syntax simple enough that a LALR parser does not have to be used, +reducing text memory requirements and the time required to process the DDL, +and improving error diagnostics. + +.nh 3 +Record Select/Project Expressions + + Most programs using DBIO will be relational operators, taking a table +as input, performing some operation or transformation upon the table, and +either updating the table or producing a new table as output. DBIO record +select/project expressions (SPE) are used to define the input table. +By using an SPE one can define the input table to be any subset of the +fields (projection) of any subset of the records (selection) of any set of +base tables or views (set union). + +The general form of a select/project expression is shown below. The syntax +is patterned after the algebraic languages and even happens to be upward +compatible with the existing IMIO image template syntax. + + +.ks +.nf + tables [pred] [upred] % fields + +where + + tables Is a comma delimited list of tables. 
+
+    ,       Is the set union operator (in the tables and
+            fields lists).
+
+    [       Is the selection operator.
+
+    pred    Is a predicate, i.e., a boolean condition.
+            The simplest predicate is a constant or
+            list of constants, specifying a set of
+            possible values for the primary key.
+
+    upred   Is a user predicate, passed back to the
+            calling program appended to the record
+            name but not used by DBIO.  This feature
+            is used to implement image sections.
+
+    %       Is the projection operator.
+
+    fields  Is a comma delimited list of \fIexpressions\fR
+            defined upon the attributes of the input
+            relation, defining the attributes of the
+            output relation.
+.fi
+.ke
+
+
+All components of an SPE are optional except \fItables\fR; the simplest
+SPE is the name of a single table.  Some simple examples follow.
+
+.nh 4
+Examples
+
+    Print all fields of table \fInite1\fR.  The table \fInite1\fR is an image
+table containing several images with primary keys 1, 2, 3, and so on.
+
+    cl> ptable nite1
+
+Print selected fields of table \fInite1\fR.
+
+    cl> ptable nite1%image,title
+
+Plot line 200 of image 2 in table \fInite1\fR.
+
+    cl> graph nite1[2][*,200]
+
+Print image statistics on the indicated images in table \fInite1\fR.
+The example shows a predicate specifying images 1, 3, and 5 through 12,
+not an image section.
+
+    cl> imstat nite1[1,3,5:12]
+
+Print the names and number of bad pixels in tables \fInite1\fR and \fIm87\fR
+for all images that have any bad pixels.
+
+    cl> ptable "nite1,m87 [nbadpix > 0] % image, nbadpix"
+
+
+The tables in an SPE may be general select/project expressions, not just the
+names of base tables or views as in the examples.  In other words, SPEs
+may be nested, using parentheses around the inner SPE if necessary to indicate
+the order of evaluation.  As noted earlier in the discussion of views,
+the ability of SPEs to nest is used to implement views.  Nesting may also
+be used to perform selection or projection upon the individual input tables.
+For example, the SPE used in the following command specifies the union of
+selected records from tables \fInite1\fR and \fInite2\fR.
+
+    cl> imstat nite1[1,8,21:23],nite2[9]
+
+.nh 3
+Operators
diff --git a/sys/dbio/new/dbio.hlp.1 b/sys/dbio/new/dbio.hlp.1
new file mode 100644
index 00000000..202b4488
--- /dev/null
+++ b/sys/dbio/new/dbio.hlp.1
@@ -0,0 +1,346 @@
+.help dbio Jul85 "Database I/O Design"
+.ce
+\fBIRAF Database I/O\fR
+.ce
+Conceptual Design
+.ce
+Doug Tody
+.ce
+July 1985
+.sp 3
+.nh
+Introduction
+    The DBIO (database i/o) interface is a library of SPP callable procedures
+used to access data structures maintained in mass storage.  While DBIO is at
+the heart of the IRAF database subsystem, it is only a part of that subsystem.
+Other major components of the database subsystem include the IMIO interface
+(image i/o), a higher level interface used to access bulk data maintained
+in part under DBIO, and the DBMS package (data base management system), a CL
+level package providing the user with direct access to any database maintained
+under DBIO.  Additional structure is found beneath DBIO; this is for the most
+part invisible to both the programmer and the user but is of fundamental
+importance to the design, as we shall see later.
+.ks
+.nf
+        DBMS                            (cl)
+          \                        ---------
+           \            IMIO
+            \           /  \
+             \         /    \
+              \/            \           (vos)
+             DBIO           FIO
+               |
+               |                   ---------
+               |
+          (DB kernel)                   (vos or host)
+.fi
+.ce
+Figure 1.  Major Interfaces
+.ke
+
+.nh
+Requirements
+    The requirements for the DBIO interface are driven by its intended usage
+for image and catalog storage.
+It is arguable whether the same interface should be used for both types of
+data, but development of an interface such as DBIO with all the associated
+DBMS utilities is expensive, hence we would prefer to have to develop only
+one such interface.  Furthermore, it is desirable for the user to only have
+to learn one such interface.  The primary functional and performance
+requirements which DBIO must meet are the following (in no particular order).
+.ls
+.ls [1]
+DBIO shall provide a high degree of data independence, i.e., a program
+shall be able to access a data structure maintained under DBIO without
+detailed knowledge of its contents.
+.le
+.ls [2]
+A DBIO datafile shall be self describing and self contained, i.e., it shall
+be possible to examine the structure and contents of a DBIO datafile without
+prior knowledge of its structure or contents.
+.le
+.ls [3]
+DBIO shall be able to deal efficiently with records containing up to N fields
+and with data groups containing up to M records, where N and M are at least
+sysgen configurable and are order of magnitude N=10**2 and M=10**6.
+.le
+.ls [4]
+The time required to access an image header under DBIO must be comparable
+to the time currently required for the equivalent operation under IMIO.
+.le
+.ls [5]
+It shall be possible for an image header maintained under DBIO to contain
+application or user defined fields in addition to the standard fields
+required by IMIO.
+.le
+.ls [6]
+It shall be possible to dynamically add new fields to an existing image header
+(or to any DBIO record).
+.le
+.ls [7]
+It shall be possible to group similar records together in the database
+and to perform global operations upon all or part of the records in a
+group.
+.le
+.ls [8]
+It shall be possible for a field of a record to be a one-dimensional array
+of any of the primitive types.
+.le
+.ls [9]
+Variant records (records containing variable size fields) shall be supported,
+ideally without penalizing efficient access to databases which do not contain
+such records.
+.le
+.ls [A]
+It shall be possible to copy a record without knowledge of its contents.
+.le
+.ls [B]
+It shall be possible to merge (join) two records containing disjoint sets of
+fields.
+.le
+.ls [C]
+It shall be possible to update a record in place.
+.le
+.ls [D]
+It shall be possible to simultaneously access (retrieve, update, or insert)
+multiple records from the same data group.
+.le
+.le
+To summarize, the primary requirements are data independence, efficient access
+to both large and small databases, and flexibility in the contents of the
+database.
+.nh
+Conceptual Design
+
+    The DBIO database facilities shall be based upon the relational model.
+The relational model is preferred due to its simplicity (to the user)
+and due to the demonstrable fact that relational databases can efficiently
+handle large amounts of data.  In the relational model the database appears
+to be nothing more than a set of \fBtables\fR, with no builtin connections
+between separate tables.  The operations defined upon these tables are based
+upon the relational algebra, which is in turn based upon set theory.
+The major advantages claimed for relational databases are the simplicity
+of the concept of a database as a collection of tables, and the predictability
+of the relational operators due to their being based on a formal theoretical
+model.
+None of the requirements listed in section 2 state that DBIO must implement
+a relational database.
Most of our needs can be met by structuring our data +according to the relational data model (i.e., as tables), and providing a +good \fBselect\fR operator for retrieving records from the database. If a +semirelational database is sufficient to meet our requirements then most +likely that is what will be built (at least initially; the relational operators +are very attractive for data analysis). DBIO is not expected to be competitive +with any commercial relational database; to try to make it so would probably +compromise the requirement that the interface be compact. +On the other hand, the database requirements of IRAF are similar enough to +those addressed by commercial databases that we would be foolish not to try +to make use of some of the same technology. +.ks +.nf + \fBformal relational term\fR \fBinformal equivalents\fR + relation table + tuple record, row + attribute field, column + domain datatype + primary key record id +.fi +.ke +A DBIO \fBdatabase\fR shall consist of one or more \fBrelations\fR (tables). +Each relation shall contain zero or more \fBrecords\fR (rows of the table). +Each record shall contain one or more \fBfields\fR (columns of the table). +All records in a relation shall share the same set of fields, +but all of the fields in a record need not have been assigned values. +When a new \fBattribute\fR (column) is added to an existing relation a default +valued field is added to each current and future record in the relation. +Each attribute is defined upon a particular \fBdomain\fR, e.g., the set of +all nonnegative integer values less than or equal to 100. It shall be possible +to specify minimum and maximum values for integer and real attributes +and to enumerate the permissible values of a string type attribute. +It shall be possible to specify a default value for an attribute. +If no default value is given INDEF is assumed. +One dimensional arrays shall be supported as attribute types; these will be +treated as atomic datatypes by the relational operators. Array valued +attributes shall be either fixed in size (the most efficient form) or variant. +There need be no special character string datatype since one dimensional +arrays of type character are supported. +Each relation shall be implemented as a separate file. If the relations +comprising a database are stored in a directory then the directory can +be thought of as the database. Public databases will be stored in well +known public (write protected) directories, private databases in user +directories. The logical directory name of each public database will be +the name of the database. Physical storage for a database need not necessarily +be allocated locally, i.e., a database may be centrally located and remotely +accessed if the host computer is part of a local area network. +Locking shall be at the level of entire relations rather than at the record +level, at least in the initial implementation. There shall be no support for +indices in the initial implementation except possibly for the primary key. +It should be possible to add either or both of these features to a future +implementation without changing the basic DBIO interface. Modifications to +the internal data structures used in database files will likely be necessary +when adding such a major feature, making a save and restore operation +necessary for each database file to convert it to the new format. +The save format chosen (e.g. FITS table) should be independent of the +internal format used at a particular time on a particular host machine. 
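+As an aside, the following minimal sketch suggests how the domain constraints
+described above (default, minimum, maximum, and enumerated values) might be
+used to verify input.  It is written in C for concreteness (DBIO itself is
+an SPP interface), and the structure layout and function names are purely
+illustrative rather than the actual DBIO descriptor format.
+.ks
+.nf
+    #include <string.h>
+
+    /* Illustrative domain descriptor; not the actual DBIO format. */
+    typedef struct {
+        char   dtype;           /* atomic type: 'c', 'i', 'r', ...  */
+        int    has_range;       /* nonzero if minval/maxval are set */
+        double minval, maxval;  /* legal range for numeric domains  */
+        int    nenum;           /* number of enumerated values      */
+        char **enumval;         /* legal strings, if string valued  */
+    } Domain;
+
+    /* Return 1 if a numeric value is legal for the domain, else 0. */
+    int
+    dom_checknum (const Domain *d, double value)
+    {
+        return (!d->has_range ||
+            (value >= d->minval && value <= d->maxval));
+    }
+
+    /* Return 1 if a string value is legal for the domain, else 0. */
+    int
+    dom_checkstr (const Domain *d, const char *value)
+    {
+        int i;
+
+        if (d->nenum == 0)
+            return (1);          /* unrestricted string domain */
+        for (i = 0;  i < d->nenum;  i++)
+            if (strcmp (value, d->enumval[i]) == 0)
+                return (1);
+        return (0);
+    }
+.fi
+.ke
+Verification of this kind need occur only when field values are input,
+hence it adds nothing to the cost of retrieval.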
+Images shall be stored in the database as individual records. +All image records shall share a common subset of attributes. +Related images (image records) may be grouped together to form relations. +The IRAF image operators shall support operations upon relations +(sets of images) much as the IRAF file operators support operations upon +sets of files. +A unary image operator shall take as input a relation (set of one or more +images), inserting the processed images into the output relation. +A binary image operator shall take as input either two relations or a +relation and a record, inserting the processed images into the output +relation. In all cases the output relation can be an input relation as +well. The input relation will be defined either by a list or by selection +using a theta-join (operationally similar to a filename template). +.nh 2 +Relational Operators + DBIO shall support two basic types of database operations: operations upon +relations and operations upon records. The basic relational operators +are the following. All of these operators produce as output a new relation. +.ls +.ls create +Create a new base relation (physical relation as stored on disk) by specifying +an initial set of attributes and the (file)name for the new relation. +Attributes and domains may be specified via a data definition file or by +reference to an existing relation. +A primary key (limited to a single attribute) should be identified. +The new relation initially contains no records. +.le +.ls drop +Delete a (possibly nonempty) base relation and any associated indices. +.le +.ls alter +Add a new attribute or attributes to an existing base relation. +Attributes may be specified explicitly or by reference to another relation. +.le +.ls select +Create a new relation by selecting records from one or more existing base +relations. Input consists of an algebraic expression defining the output +relation in terms of the input relations (usage will be similar to filename +templates). The output relation need not have the same set of attributes as +the input relations. The \fIselect\fR operator shall ultimately implement +all the basic operations of the relational algebra, i.e., select, project, +join, and the set operations. At a minimum, selection and projection are +required in the initial interface. The output of \fBselect\fR is not a +named relation (base relation), but is instead intended to be accessed +by the record level operators discussed in the next section. +.le +.ls edit +Edit a relation. An interactive screen editor is entered allowing the user +to add, delete, or modify tuples (not required in the initial version of +the interface). Field values are verified upon input. +.le +.ls sort +Make the storage order of the records in a relation agree with the order +defined by the primary key (the index associated with the primary key is +always sorted but index order need not agree with storage order). +In general, retrieval on a sorted relation is more efficient than on an +unsorted relation. Sorting also eliminates deadspace left by record +deletion or by updates involving variant records. +.le +.le +Additional nonalgebraic operators are required for examining the structure +and contents of relations, returning the number of records or attributes in +a relation, and determining whether a given relation exists. +The \fIselect\fR operator is the primary user interface to DBIO. 
+Since most of the relational power of DBIO is bound up in the \fIselect\fR
+operator and since \fIselect\fR will be driven by an algebraic expression
+(character string) there is considerable scope for future enhancement
+of DBIO without affecting existing code.
+.nh 2
+Record (Tuple) Level Operators
+
+    While the user should see primarily operations on entire relations,
+record level processing is necessary at the program level to permit
+data entry and implementation of special operators.  The basic record
+level operators are the following.
+.ls
+.ls retrieve
+Retrieve the next record from the relation defined by \fBselect\fR.
+While the tuples in a relation theoretically form an unordered set,
+tuples will normally be returned in either storage order or in the sort
+order of the primary key.  Although all fields of a retrieved record are
+accessible, an application will typically have knowledge of only a few fields.
+.le
+.ls update
+Rewrite the (possibly modified) current record.  The updated record is
+written back into the base table from which it was read.  Not all records
+produced by \fBselect\fR can be updated.
+.le
+.ls insert
+Insert a new record into an output relation.  The output relation may be an
+input relation as well.  Records added to an output relation which is also
+an input relation do not become candidates for selection until another
+\fBselect\fR occurs.  A retrieve followed by an insert copies a record without
+knowledge of its contents.  A retrieve followed by modification of selected
+fields followed by an insert copies all unmodified fields of the record.
+The attributes of the input and output relations need not match; unmatched
+output attributes take on their default values and unmatched input attributes
+are discarded.  \fBInsert\fR returns a pointer to the output record,
+allowing insertions of null records to be followed by initialization of
+the fields of the new record.
+.le
+.ls delete
+Delete the current record.
+.le
+.le
+Additional operators are required to close or open a relation for record
+level access and to count the number of records in a relation.
+.nh 3
+Constructing Special Relational Operators
+
+    The record level operations may be combined with \fBselect\fR in compiled
+programs to implement arbitrary operations upon entire relations.
+The basic scenario is as follows:
+.ls
+.ls [1]
+The set of records to be operated upon, defined by the \fBselect\fR
+operator, is opened as an unordered set (list) of records to be processed.
+.le
+.ls [2]
+The "next" record in the relation is accessed with \fBretrieve\fR.
+.le
+.ls [3]
+The application reads or modifies a subset of the fields of the record,
+updating modified records or inserting the record in the output relation.
+.le
+.ls [4]
+Steps [2] and [3] are repeated until the entire relation has been processed.
+.le
+.le
+Examples of such operators are conversion to and from DBIO and LIST file
+formats, column extraction, minimum or maximum of an attribute (domain
+algebra), and all of the DBMS and IMAGES operators.  A sketch of such an
+operator is given at the end of this section.
+.nh 2
+Field (Attribute) Level Operators
+
+    Substantial processing of the contents of a database is possible without
+ever accessing the individual fields of a record.  If field level access is
+required the record must first be retrieved or inserted.  Field level access
+requires knowledge of the names of the attributes of the parent relation,
+but not their exact datatypes.  Automatic type conversion occurs when field
+values are queried or set.
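+As a concrete illustration of the scenario outlined above, the following
+sketch (in C for concreteness; actual code would be written in SPP)
+implements a trivial special operator which scales one field of every
+selected record in place.  All of the db_* names and calling sequences,
+and the "flux" field, are hypothetical stand-ins for the record level
+operators described above and for field accessors in the spirit of the
+\fBget\fR and \fBput\fR operators listed below.
+.ks
+.nf
+    #include <stdio.h>          /* for EOF */
+
+    typedef struct dbselect DBS;    /* opaque selection descriptor */
+
+    extern DBS   *db_open (const char *spe);    /* open a selection  */
+    extern int    db_retrieve (DBS *s);         /* next record, EOF  */
+    extern void   db_update (DBS *s);           /* rewrite current   */
+    extern double db_getd (DBS *s, const char *field);
+    extern void   db_putd (DBS *s, const char *field, double value);
+    extern void   db_close (DBS *s);
+
+    /* Scale the "flux" field of every record selected by the given
+     * select/project expression, updating each record in place.
+     */
+    void
+    scale_flux (const char *spe, double scale)
+    {
+        DBS *s = db_open (spe);               /* [1] open selection  */
+
+        while (db_retrieve (s) != EOF) {      /* [2] next record     */
+            db_putd (s, "flux", db_getd (s, "flux") * scale);
+            db_update (s);                    /* [3] rewrite record  */
+        }                                     /* [4] repeat          */
+
+        db_close (s);
+    }
+.fi
+.ke
+An operator producing a new relation rather than updating in place would
+substitute an insert upon the output relation for the update call.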
+.ls
+.ls get
+.sp
+Get the value of the named scalar or vector field (typed).
+.le
+.ls put
+.sp
+Put the value of the named scalar or vector field (typed).
+.le
+.ls read
+Read the named fields into an SPP data structure, given the name, datatype,
+and length (if vector) of each field in the output structure.
+There must be an attribute in the parent relation for each field in the
+output structure.
+.le
+.ls write
+Copy an SPP data structure into the named fields of a record, given the
+name, datatype, and length (if vector) of each field in the input structure.
+There must be an attribute in the parent relation for each field in the
+input structure.
+.le
+.ls access
+Determine whether a relation has the named attribute.
+.le
+.le
diff --git a/sys/dbio/new/dbki.hlp b/sys/dbio/new/dbki.hlp
new file mode 100644
index 00000000..a825f6ef
--- /dev/null
+++ b/sys/dbio/new/dbki.hlp
Binary files differ
diff --git a/sys/dbio/new/ddl b/sys/dbio/new/ddl
new file mode 100644
index 00000000..8c1256b7
--- /dev/null
+++ b/sys/dbio/new/ddl
@@ -0,0 +1,125 @@
+1. Data Definition Language
+
+    Used to define relations and domains.
+    Table driven.
+
+
+1.1 Domains
+
+    Domains are used to save storage, format output, and verify input, as well
+as to document the structure of a database. DBIO does not use domain
+information to verify the legality of predicates.
+
+
+	attributes of a domain:
+
+	    name		domain name
+	    type		atomic type
+	    default		default value (none, indef, actual)
+	    minimum		minimum value permitted
+	    maximum		maximum value permitted
+	    enumval		list of legal values
+	    units		units label
+	    format		default output format
+
+
+	predefined (atomic) domains:
+
+	    bool
+	    byte*N
+	    char*N
+	    int*N
+	    real*N
+
+The precision of an atomic domain is specified by N, the number of bytes of
+storage to be reserved for the value. N may be any integer value greater
+than or equal to 1 for byte, char, and int, or 2 for real. The byte
+datatype is an unsigned (positive) integer. The floating point datatype
+has a one byte (8 bit) base 2 exponent. For example, char*1 is a signed
+byte, byte*2 is an unsigned 16 bit integer, and real*2 is a 16 bit floating
+point number.
+
+
+1.2 Groups
+
+    A group is an aggregate of two or more domains or other groups. Groups
+as well as domains may be used to define the attributes of a relation.
+Repeating groups, i.e., arrays of groups, are not allowed (a finite number
+of named instances of a group may, however, be declared within a single
+relation).
+
+
+	attributes of a group:
+
+	    name		group name as used in relation declarations
+	    nelements		number of elements (attributes) in group
+	    elements		set of elements (see below)
+
+
+	attributes of each group element:
+
+	    name		attribute name
+	    domain		domain on which attribute is defined
+	    naxis		number of axes if array valued
+	    naxisN		length of each axis if array valued
+	    label		column label for output tables
+
+
+1.3 Relations
+
+    A relation declaration consists of a list of the attributes forming the
+relation. An attribute is a named instance of an atomic domain, user defined
+domain, or group. Any group, including nested groups, may be decomposed
+into a set of named instances of domains, each of which is defined upon an
+atomic datatype, hence a relation declaration is decomposable into a linear
+list of atomic fields. The relation is the logical unit of storage in a
+database. A base table is a named instance of some relation.
+
+
+	attributes of a relation:
+
+	    name		name of the relation
+	    nattributes		number of attributes
+	    atr_list		list of attributes (see below)
+	    primary_key
+	    title
+
+
+	attributes of each attribute of a relation:
+
+	    name		attribute name
+	    domain		domain on which attribute is defined
+	    naxis		number of axes if array valued
+	    naxisN		length of each axis if array valued
+	    label		column label for output tables
+
+
+The atomic attributes of a relation may be either scalar or array valued.
+The array valued attributes may be either static (the amount of storage is
+set in the relation declaration) or dynamic (a variable amount of storage
+is allocated at runtime). Array valued attributes may not be used as
+predicates in queries.
+
+
+1.4 Views
+
+    A view is a logical relation defined upon one or more base tables, i.e.,
+instances of named relations. The role views perform in a database is similar
+to that performed by base tables, but views do not in themselves occupy any
+storage. The purpose of a view is to permit the appearance of the database
+to be changed to suit the needs of a variety of applications, without having
+to physically change the database itself. As a trivial example, a view may
+be used to provide aliases for the names of the attributes of a relation.
+
+
+	attributes of a view:
+
+	    name		name of the view
+	    nattributes		number of attributes
+	    atr_list		list of attributes (see below)
+
+
+	attributes of each attribute of a view:
+
+	    name		attribute name
+	    mapping		name of the table and attribute to which this
+				view attribute is mapped
diff --git a/sys/dbio/new/schema b/sys/dbio/new/schema
new file mode 100644
index 00000000..ef99ac1b
--- /dev/null
+++ b/sys/dbio/new/schema
@@ -0,0 +1,307 @@
+1. Database Schema
+
+    A logical database consists of a standard set of system tables describing
+the database, plus any number of user data tables. The system tables are the
+following:
+
+
+	syscat		System catalog. Lists all base tables, views, groups,
+			and relations in the database. The names of all tables,
+			relations, views, and groups must be distinct. Note
+			that the catalog does not list the attributes composing
+			a particular base table, relation, view, or group.
+
+	REL_atl		Attribute list table. Descriptor table for the table,
+			relation, view, or group REL. Lists the attributes
+			comprising REL. One such table is required for each
+			relation, view, or group defined in the database.
+
+	sysddt		Domain descriptor table. Describes all user defined
+			domains used in the database. Note that the scope of
+			a domain definition is the entire database, not one
+			relation.
+
+	sysidt		Index descriptor table. Lists all of the indexes in
+			the database.
+
+	sysadt		Alias descriptor table. Defines aliases for the names
+			of tables or attributes.
+
+
+In addition to the standard tables, a table is required for each relation,
+view, or group listing the attributes (fields) comprising the relation, view,
+or group. A base table which is an instance of a named relation is described
+by the table defining the relation. If a given base table has been altered
+since its creation, e.g., by the addition of new attributes, then a separate
+table is required listing the attributes of the altered base table. In effect,
+a new relation type is automatically defined by the database system listing the
+attributes of the altered base table.
+
+Like the user tables, the system tables are themselves described by attribute
+list tables stored in the database.
+The database system need only know the
+structure of an attribute list table to decipher the structure of the rest of
+the database. A single access method can be used to access all database
+structures (excluding the indexes, which are probably not stored as tables).
+
+
+2. Storage Structures
+
+    A database is maintained in a single random access binary file. This one
+file contains all user tables and indexes and all system tables. A single
+file is used to minimize the number of file opens and disk accesses required
+to access a record from a "cold start", i.e., after process startup. Use of
+a single file also simplifies bookkeeping for the user, minimizes directory
+clutter, and aids in database backup and transport. For clarity we shall
+refer to this database file as a "datafile". A datafile is a DBIO format
+binary file with the extension ".db".
+
+What the user perceives as a database is one or more datafiles plus any
+logically associated non-database files. While database tasks may
+simultaneously access several databases, access will be much more efficient
+when multiple records are accessed in a single datafile than when a single
+record is accessed in multiple datafiles.
+
+
+2.1 Database Design
+
+    When designing a database the user or applications programmer must consider
+the following issues:
+
+	[1] The logical structure of the database must be defined, i.e., the
+	    organization of the data into tables. While in many cases this is
+	    trivial, e.g., when there is only one type of table, in general this
+	    area of database design is nontrivial and will require the services
+	    of a database expert familiar with the relational algebra,
+	    normalization, the entity/relationship model, etc.
+
+	[2] The clustering of tables into datafiles must be defined. Related
+	    tables which are fairly static should normally be placed in the same
+	    datafile. Tables which change a lot or which may be used for a short
+	    time and then deleted may be best placed in separate datafiles.
+	    If the database is to be accessed simultaneously by multiple processes,
+	    e.g., when running background jobs, then it may be necessary to place
+	    the input tables in read only datafiles and the output tables in
+	    separate private access datafiles to permit concurrent access (DBIO
+	    does not support record level locking).
+
+	[3] The type and number of indexes required for each table must be defined.
+	    Most tables will require some sort of index for efficient retrieval.
+	    Maintenance of an index slows insertion, hence output tables may be
+	    better off without an index; indexes can be added later when the time
+	    comes to read the table. The type of index (linear, hash, or B-tree)
+	    must be defined, and the keys used in the index must be listed.
+
+	[4] Large text or binary files which are logically associated with the
+	    database may be implemented as physically separate, non-database files,
+	    saving only the name of the file in the database, or as variable length
+	    attributes, storing the data in the database itself. Large files may
+	    be more efficiently accessed when stored outside the database, while
+	    small files consume less storage and are more efficiently accessed when
+	    stored in a datafile. Storing a file outside the database complicates
+	    database management and transport.
+
+
+3. DBIO
+
+    DBIO is the host language interface to the database system. The interface
+is procedural rather than query oriented; the query facilities
+provided by DBIO are limited to select/project.
+DBIO is designed to be fast and
+compact and hence is little more than an access method. A process typically
+has direct access to a database via a high bandwidth binary file i/o interface.
+
+Although we will not discuss it further here, we note that a compiled
+application which requires query level access to a database can send queries
+to the DBMS query language via the CL, using CLCMD (the query language resides
+in a separate process). This is much the same technique as is used in
+commercial database packages. A formal DBIO query language interface will be
+defined when the query language is itself defined.
+
+
+3.1 Database Management Functions
+
+    DBIO provides a range of functions for database management, i.e., operations
+on the database as a whole, as opposed to the access functions used for
+retrieval, update, insertion, etc. The database management functions are
+summarized below.
+
+
+	open database
+	close database
+	create database		initially empty
+	delete database
+	change database		(change default working database)
+
+	create table		from DDL; from compiled DDT, ALT
+	drop table
+	alter table
+	sort table
+
+	create view
+	drop view
+
+	create index
+	drop index
+
+
+A database must be opened or created before any other operations can be
+performed on the database (excluding delete). Several databases may be
+open simultaneously. New tables are created by any of several methods,
+i.e., from a written specification in the Data Definition Language (DDL),
+by inheriting the attributes of an existing table, or by successive alter
+table operations, adding a new attribute to the table definition in each call.
+
+
+3.2 Data Access Functions
+
+    A program accesses the database record by record via a "cursor". A cursor
+is a pointer into a virtual table defined by evaluating a select/project
+statement upon a database. This virtual table, or "selection set", consists of
+a set of record ids referencing actual records in one or more base tables.
+The individual records are not physically accessed by DBIO until a fetch,
+update, insert, or delete operation is performed by the applications program
+upon the record currently pointed to by the cursor.
+
+
+3.2.1 Record Level Access Functions
+
+    The record access functions allow a program to read and write entire records
+in one operation. For the sake of data independence the program must first
+define the exact format of the logical record to be read or written; this
+format may differ from the physical record format in the number, order, and
+datatype of the fields to be accessed. The names of the fields in the logical
+record must however match those in the physical record (unless aliased),
+and not all datatype conversions are legal.
+
+
+	open cursor
+	close cursor
+	length cursor
+	next cursor element
+
+	fetch record
+	update record
+	insert record
+	delete record
+
+	get/put scalar field	(typed)
+	get/put vector field	(typed)
+
+
+Logical records are passed between DBIO and the calling program in the form
+of a binary data structure via a pointer to the structure. Storage for the
+structure is allocated by the calling program. Only fixed size fields may be
+passed in this manner; variable size fields are represented in the static
+structure by an integer count of the current number of elements in the field.
+A separate call is required to read or write the contents of a variable length
+field. A sketch of the record level calling sequence is given below.
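+
+The sketch is schematic only. The C binding and all names shown
+(db_opencursor, db_fetch, db_insert, db_closecursor, the structure "star",
+and the tables "stars" and "brightstars") are invented for the illustration;
+the actual SPP calling sequences remain to be defined, and the call which
+registers the logical record format with DBIO is omitted.
+
+	/* A minimal sketch, assuming hypothetical bindings. */
+	struct star {
+		double	ra, dec;	/* scalar fields */
+		float	mag;
+		int	nvec;		/* count for a variable size field */
+	};
+
+	extern	int  db_opencursor (char *selectexpr);
+	extern	int  db_fetch (int cursor, struct star *rec);	/* 0 at EOF */
+	extern	void db_insert (int cursor, char *outtable, struct star *rec);
+	extern	void db_closecursor (int cursor);
+
+	void
+	copy_bright_stars (void)
+	{
+		struct	star rec;
+		int	cur;
+
+		/* The selection set defines the records to be processed;
+		 * the expression syntax shown is notional.
+		 */
+		cur = db_opencursor ("stars where mag < 12");
+
+		/* Fetch each record, modify a field of the local copy,
+		 * and insert the result into the output table.
+		 */
+		while (db_fetch (cur, &rec)) {
+			rec.mag = rec.mag - 0.5;
+			db_insert (cur, "brightstars", &rec);
+		}
+		db_closecursor (cur);
+	}
+
+Fields of the physical record which are not named in the logical record are
+carried along in the DBIO holding area, as described in the operator
+descriptions below.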
+ +The dynamically allocated binary structure format is flexible and efficient +and will be the most suitable format for most applications. A character string +format is also supported wherein the successive fields are encoded into +successive ranges of columns. This format is useful for data entry and +forms generation, as well as for communication with foreign languages (e.g., +Fortran) which do not provide the data structuring facilities necessary for +binary record transmission. + +The functions of the individual record level access operators are discussed +in more detail below. + + + fetch Read the physical record currently pointed to by the cursor + into an internal holding area in DBIO. Return the fields of + the specified logical record to the calling program. If no + logical record was specified the only function is to copy the + physical record into the DBIO holding area. + + modify Update the internal copy of the physical record from the fields + of the logical record passed as an argument, but do not update + the physical input record. + + update Update the internal copy of the physical record from the fields + of the logical record passed as an argument, then update the + physical record in mass storage. Mass storage will be updated + only if the local copy of the record has been modified. + + insert Update the internal copy of the physical record from the fields + of the logical record passed as an argument, then insert the + physical record into the specified output table. The record + currently in the holding area is used regardless of its origin, + hence an explicit fetch is required to copy a record. + + delete The record currently pointed to by the cursor is deleted. + + +For example, to perform a select/project operation on a database one could +open a cursor on the selection set defined by the indicated select/project +statement (passed as a character string), then FETCH and print successive +records until EOF is reached on the cursor. To perform some operation on +the elements of a selection set, producing a new table as output, one might +FETCH each element, use and possibly modify the binary data structure returned +by the FETCH, and then INSERT the modified record into the output table. + +When performing an UPDATE operation on the tuples of a selection set defined +over multiple input tables, the tuples in separate input tables need not all +have the same set of attributes. INSERTion into an output table, however, +requires that the new output tuples be union compatible with the existing +tuples in the output table, or the mismatched attributes in the output tuples +will be either lost or created with null values. If the output table is a new +table the attribute list of the new table may be defined to be either the +union or intersection of the attribute lists of all tables in the selection +set used as input. + + +3.2.2 Field Level Access Functions + + The record level access functions can be cumbersome when only one or two +of the fields in a record are to be accessed. The fields of a record may be +accessed individually by typed GET and PUT procedures (e.g., DBGETI, DBPUTI) +after copying the record in question into the DBIO holding area with FETCH. + + +3.3 DBKI + + The DataBase Kernel Interface (DBKI) is the interface between DBIO and +one or more DataBase Kernels (DBK). The DBKI supports multiple database +kernels, each of which may support multiple storage formats. 
The DBKI does
+not itself provide any database functionality; rather, it provides a level
+of indirection between DBIO and the actual DBK used for a given dataset.
+The syntax and semantics of the procedures forming the DBKI interface are
+those required of a DBK, i.e., there is a one-to-one mapping between DBKI
+procedures and DBK procedures.
+
+A DBIO call to a DBKI procedure will normally be passed on to a DBK procedure
+resident in the same process, providing maximum performance. If the DBK is
+especially large, e.g., when the DBK is a host database system, it may reside
+in a separate process with the DBK procedures in the local process serving
+only as an i/o interface. On a system configured with network support, DBKI
+will also provide the capability to access a DBK resident on a remote node.
+In all cases when a remote DBK is accessed, the interprocess or network
+interface occurs at the level of the DBKI. Placing the interface at the
+DBKI level, rather than at the FIO z-routine level, provides a high bandwidth
+between the DBK and mass storage, greatly increasing performance since only
+selected records need be passed over the network interface.
+
+
+3.4 DBK
+
+    A DBIO database kernel (DBK) provides a "record manager" type interface,
+similar to the popular ISAM and VSAM interfaces developed by IBM (the actual
+access method used is based on the DB2 access method, which is a variation on
+VSAM). The DBK is responsible for the storage and retrieval of records from
+tables, and for the maintenance and use of any indexes maintained upon such
+tables. The DBK is also responsible for arbitrating database access among
+concurrent processes (e.g., record locking, if provided), for error recovery,
+crash recovery, backup, and so on. All data access via DBIO is routed through
+a DBK. In no case does DBIO bypass the DBK to directly access mass storage.
+
+The DBK does not have any knowledge of the contents of a record (an exception
+occurs if the DBK is actually an interface to a host database system).
+To the DBK a record is a byte string. Encoding and decoding of records is
+performed by DBIO. The actual encoding used is machine independent and space
+efficient (byte packed). Numeric fields are encoded in such a way that a
+generic comparison procedure may be used for order comparisons of all fields
+regardless of their datatype. This greatly simplifies both the evaluation of
+predicates (e.g., in a select) and the maintenance of indexes. The use of a
+machine independent encoding provides equivalent database semantics on all
+machines and transparent network access without redundant encode/decode,
+as well as making it trivial to transport databases between machines.
diff --git a/sys/dbio/new/spie.ms b/sys/dbio/new/spie.ms
new file mode 100644
index 00000000..ce380b70
--- /dev/null
+++ b/sys/dbio/new/spie.ms
@@ -0,0 +1,17 @@
+.TL
+The IRAF Data Reduction and Analysis System
+.AU
+Doug Tody
+.AI
+National Optical Astronomy Observatories
+Central Computer Services
+IRAF Group
+.PP
+.ls 2
+The Interactive Reduction and Analysis Facility (IRAF) is a general purpose
+data reduction and analysis system that has been under development by the
+National Optical Astronomy Observatories (NOAO) for the past several years
+and which is now in use within NOAO, at the Space Telescope Science Institute,
+and at several other sites on several different computers and operating systems.
+The philosophy and design goals of the IRAF system are discussed and the
+facilities provided by the current system are summarized.