.help dbio Jul85 "Database I/O Design" .ce \fBIRAF Database I/O\fR .ce Conceptual Design .ce Doug Tody .ce July 1985 .sp 3 .nh Introduction The DBIO (database i/o) interface is a library of SPP callable procedures used to access data structures maintained in mass storage. While DBIO is at the heart of the IRAF database subsystem, it is only a part of that subsystem. Other major components of the database subsystem include the IMIO interface (image i/o), a higher level interface used to access bulk data maintained in part under DBIO, and the DBMS package (data base management system), a CL level package providing the user with direct access to any database maintained under DBIO. Additional structure is found beneath DBIO; this is for the most part invisible to both the programmer and the user but is of fundamental importance to the design, as we shall see later. .ks .nf DBMS (cl) \ --------- \ IMIO \ / \ \ / \ \/ \ (vos) DBIO FIO | | --------- | (DB kernel) (vos or host) .fi .ce Figure 1. Major Interfaces .ke .nh Requirements The requirements for the DBIO interface are driven by its intended usage for image and catalog storage. It is arguable whether the same interface should be used for both types of data, but development of an interface such as DBIO with all the associated DBMS utilities is expensive, hence we would prefer to have to develop only one such interface. Furthermore, it is desirable for the user to only have to learn one such interface. The primary functional and performance requirements which DBIO must meet are the following (in no particular order). .ls .ls [1] DBIO shall provide a high degree of data independence, i.e., a program shall be able to access a data structure maintained under DBIO without detailed knowledge of its contents. .le .ls [2] A DBIO datafile shall be self describing and self contained, i.e., it shall be possible to examine the structure and contents of a DBIO datafile without prior knowledge of its structure or contents. .le .ls [3] DBIO shall be able to deal efficiently with records containing up to N fields and with data groups containing up to M records, where N and M are at least sysgen configurable and are order of magnitude N=10**2 and M=10**6. .le .ls [4] The time required to access an image header under DBIO must be comparable to the time currently required for the equivalent operation under IMIO. .le .ls [5] It shall be possible for an image header maintained under DBIO to contain application or user defined fields in addition to the standard fields required by IMIO. .le .ls [6] It shall be possible to dynamically add new fields to an existing image header (or to any DBIO record). .le .ls [7] It shall be possible to group similar records together in the database and to perform global operations upon all or part of the records in a group. .le .ls [8] It shall be possible for a field of a record to be a one-dimensional array of any of the primitive types. .le .ls [9] Variant records (records containing variable size fields) shall be supported, ideally without penalizing efficient access to databases which do not contain such records. .le .ls [A] It shall be possible to copy a record without knowledge of its contents. .le .ls [B] It shall be possible to merge (join) two records containing disjoint sets of fields. .le .ls [C] It shall be possible to update a record in place. .le .ls [D] It shall be possible to simultaneously access (retrieve, update, or insert) multiple records from the same data group. .le .le To summarize, the primary requirements are data independence, efficient access to both large and small databases, and flexibility in the contents of the database. .nh Conceptual Design The DBIO database faciltities shall be based upon the relational model. The relational model is preferred due to its simplicity (to the user) and due to the demonstrable fact that relational databases can efficiently handle large amounts of data. In the relational model the database appears to be nothing more than a set of \fBtables\fR, with no builtin connections between separate tables. The operations defined upon these tables are based upon the relational algebra, which is in turn based upon set theory. The major advantages claimed for relational databases are the simplicity of the concept of a database as a collection of tables, and the predictability of the relational operators due to their being based on a formal theoretical model. None of the requirements listed in section 2 state that DBIO must implement a relational database. Most of our needs can be met by structuring our data according to the relational data model (i.e., as tables), and providing a good \fBselect\fR operator for retrieving records from the database. If a semirelational database is sufficient to meet our requirements then most likely that is what will be built (at least initially; the relational operators are very attractive for data analysis). DBIO is not expected to be competitive with any commercial relational database; to try to make it so would probably compromise the requirement that the interface be compact. On the other hand, the database requirements of IRAF are similar enough to those addressed by commercial databases that we would be foolish not to try to make use of some of the same technology. .ks .nf \fBformal relational term\fR \fBinformal equivalents\fR relation table tuple record, row attribute field, column domain datatype primary key record id .fi .ke A DBIO \fBdatabase\fR shall consist of one or more \fBrelations\fR (tables). Each relation shall contain zero or more \fBrecords\fR (rows of the table). Each record shall contain one or more \fBfields\fR (columns of the table). All records in a relation shall share the same set of fields, but all of the fields in a record need not have been assigned values. When a new \fBattribute\fR (column) is added to an existing relation a default valued field is added to each current and future record in the relation. Each attribute is defined upon a particular \fBdomain\fR, e.g., the set of all nonnegative integer values less than or equal to 100. It shall be possible to specify minimum and maximum values for integer and real attributes and to enumerate the permissible values of a string type attribute. It shall be possible to specify a default value for an attribute. If no default value is given INDEF is assumed. One dimensional arrays shall be supported as attribute types; these will be treated as atomic datatypes by the relational operators. Array valued attributes shall be either fixed in size (the most efficient form) or variant. There need be no special character string datatype since one dimensional arrays of type character are supported. Each relation shall be implemented as a separate file. If the relations comprising a database are stored in a directory then the directory can be thought of as the database. Public databases will be stored in well known public (write protected) directories, private databases in user directories. The logical directory name of each public database will be the name of the database. Physical storage for a database need not necessarily be allocated locally, i.e., a database may be centrally located and remotely accessed if the host computer is part of a local area network. Locking shall be at the level of entire relations rather than at the record level, at least in the initial implementation. There shall be no support for indices in the initial implementation except possibly for the primary key. It should be possible to add either or both of these features to a future implementation without changing the basic DBIO interface. Modifications to the internal data structures used in database files will likely be necessary when adding such a major feature, making a save and restore operation necessary for each database file to convert it to the new format. The save format chosen (e.g. FITS table) should be independent of the internal format used at a particular time on a particular host machine. Images shall be stored in the database as individual records. All image records shall share a common subset of attributes. Related images (image records) may be grouped together to form relations. The IRAF image operators shall support operations upon relations (sets of images) much as the IRAF file operators support operations upon sets of files. A unary image operator shall take as input a relation (set of one or more images), inserting the processed images into the output relation. A binary image operator shall take as input either two relations or a relation and a record, inserting the processed images into the output relation. In all cases the output relation can be an input relation as well. The input relation will be defined either by a list or by selection using a theta-join (operationally similar to a filename template). .nh 2 Relational Operators DBIO shall support two basic types of database operations: operations upon relations and operations upon records. The basic relational operators are the following. All of these operators produce as output a new relation. .ls .ls create Create a new base relation (physical relation as stored on disk) by specifying an initial set of attributes and the (file)name for the new relation. Attributes and domains may be specified via a data definition file or by reference to an existing relation. A primary key (limited to a single attribute) should be identified. The new relation initially contains no records. .le .ls drop Delete a (possibly nonempty) base relation and any associated indices. .le .ls alter Add a new attribute or attributes to an existing base relation. Attributes may be specified explicitly or by reference to another relation. .le .ls select Create a new relation by selecting records from one or more existing base relations. Input consists of an algebraic expression defining the output relation in terms of the input relations (usage will be similar to filename templates). The output relation need not have the same set of attributes as the input relations. The \fIselect\fR operator shall ultimately implement all the basic operations of the relational algebra, i.e., select, project, join, and the set operations. At a minimum, selection and projection are required in the initial interface. The output of \fBselect\fR is not a named relation (base relation), but is instead intended to be accessed by the record level operators discussed in the next section. .le .ls edit Edit a relation. An interactive screen editor is entered allowing the user to add, delete, or modify tuples (not required in the initial version of the interface). Field values are verified upon input. .le .ls sort Make the storage order of the records in a relation agree with the order defined by the primary key (the index associated with the primary key is always sorted but index order need not agree with storage order). In general, retrieval on a sorted relation is more efficient than on an unsorted relation. Sorting also eliminates deadspace left by record deletion or by updates involving variant records. .le .le Additional nonalgebraic operators are required for examining the structure and contents of relations, returning the number of records or attributes in a relation, and determining whether a given relation exists. The \fIselect\fR operator is the primary user interface to DBIO. Since most of the relational power of DBIO is bound up in the \fIselect\fR operator and since \fIselect\fR will be driven by an algebraic expression (character string) there is considerable scope for future enhancement of DBIO without affecting existing code. .nh 2 Record (Tuple) Level Operators While the user should see primarily operations on entire relations, record level processing is necessary at the program level to permit data entry and implementation of special operators. The basic record level operators are the following. .ls .ls retrieve Retrieve the next record from the relation defined by \fBselect\fR. While the tuples in a relation theoretically form an unordered set, tuples will normally be returned in either storage order or in the sort order of the primary key. Although all fields of a retrieved record are accessible, an application will typically have knowledge of only a few fields. .le .ls update Rewrite the (possibly modified) current record. The updated record is written back into the base table from which it was read. Not all records produced by \fBselect\fR can be updated. .le .ls insert Insert a new record into an output relation. The output relation may be an input relation as well. Records added to an output relation which is also an input relation do not become candidates for selection until another \fBselect\fR occurs. A retrieve followed by an insert copies a record without knowledge of its contents. A retrieve followed by modification of selected fields followed by an insert copies all unmodified fields of the record. The attributes of the input and output relations need not match; unmatched output attributes take on their default values and unmatched input attributes are discarded. \fBInsert\fR returns a pointer to the output record, allowing insertions of null records to be followed by initialization of the fields of the new record. .le .ls delete Delete the current record. .le .le Additional operators are required to close or open a relation for record level access and to count the number of records in a relation. .nh 3 Constructing Special Relational Operators The record level operations may be combined with \fBselect\fR in compiled programs to implement arbitrary operations upon entire relations. The basic scenario is as follows: .ls .ls [1] The set of records to be operated upon, defined by the \fBselect\fR operator, is opened as an unordered set (list) of records to be processed. .le .ls [2] The "next" record in the relation is accessed with \fBretrieve\fR. .le .ls [3] The application reads or modifies a subset of the fields of the record, updating modified records or inserting the record in the output relation. .le .ls [4] Steps [2] and [3] are repeated until the entire relation has been processed. .le .le Examples of such operators are conversion to and from DBIO and LIST file formats, column extraction, mimimum or maximum of an attribute (domain algebra), and all of the DBMS and IMAGES operators. .nh 2 Field (Attribute) Level Operators Substantial processing of the contents of a database is possible without ever accessing the individual fields of a record. If field level access is required the record must first be retrieved or inserted. Field level access requires knowledge of the names of the attributes of the parent relation, but not their exact datatypes. Automatic type conversion occurs when field values are queried or set. .ls .ls get .sp Get the value of the named scalar or vector field (typed). .le .ls put .sp Put the value of the named scalar or vector field (typed). .le .ls read Read the named fields into an SPP data structure, given the name, datatype, and length (if vector) of each field in the output structure. There must be an attribute in the parent relation for each field in the output structure. .le .ls write Copy an SPP data structure into the named fields of a record, given the name, datatype, and length (if vector) of each field in the input structure. There must be an attribute in the parent relation for each field in the input structure. .le .ls access Determine whether a relation has the named attribute. .le .le