From fa080de7afc95aa1c19a6e6fc0e0708ced2eadc4 Mon Sep 17 00:00:00 2001
From: Joseph Hunkeler
Date: Wed, 8 Jul 2015 20:46:52 -0400
Subject: Initial commit

---
 sys/dbio/db2.hlp | 612 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 612 insertions(+)
 create mode 100644 sys/dbio/db2.hlp

diff --git a/sys/dbio/db2.hlp b/sys/dbio/db2.hlp
new file mode 100644
index 00000000..ffe3b74c
--- /dev/null
+++ b/sys/dbio/db2.hlp
@@ -0,0 +1,612 @@
+.help dbio Nov84 "Database I/O Design"
+.ce
+\fBIRAF Database I/O\fR
+.ce
+Doug Tody
+.ce
+November 1984
+.sp 3
+.nh
+Introduction
+
+    The IRAF database i/o package (DBIO) is a library of SPP callable
+procedures used to create, modify, and access IRAF database files.
+All access to these database files shall be either directly or indirectly
+via the DBIO interface.  DBIO shall be implemented using IRAF file i/o and
+memory management facilities, hence the package will be compact and portable.
+The separate CL level package DBMS shall be provided for interactive database
+access and for procedural access to the database from within CL scripts.
+The DBMS tasks will access the database via DBIO.
+
+Virtually all runtime IRAF datafiles not maintained in text form shall be
+maintained under DBIO, hence it is essential that the interface be both
+efficient and compact.  In particular, bulk data (images) and large catalogs
+shall be maintained under DBIO.  The requirement for flexibility in defining
+and accessing IRAF image headers necessitates quite a sophisticated interface.
+Catalog storage is required primarily for module intercommunication and
+output of the results of the larger IRAF applications packages, but will also
+be useful for accessing astronomical catalogs prepared outside IRAF (e.g.,
+the SAO star catalog).  In short, virtually all IRAF applications packages
+are expected to make use of DBIO; many will depend upon it heavily.
+
+The relationship of the DBIO and DBMS packages to each other and to the
+related standard IRAF interfaces is shown in Figure 1.1.
+
+
+.ks
+.nf
+                DBMS
+                DBIO
+                FIO
+                MEMIO
+                (kernel)
+                (datafiles)
+
+        (cl) | (vos) | (host)
+
+
+.fi
+.ce
+Fig 1.1 Major Interfaces
+.ke
+
+
+While images will be maintained under DBIO, access to the pixels will
+continue to be provided by the IMIO interface.  IMIO is a higher level
+interface which will use DBIO to maintain the image header.  Pixel storage
+will be either in a separate pixel storage file or in the database file
+itself (as a one dimensional array), depending on the size of the image.
+A system defined threshold value will determine which type of storage is used.
+The relationship of IMIO to DBIO is shown in Figure 1.2.
+
+
+.ks
+.nf
+                IMAGES
+                IMIO
+                DBIO
+                FIO
+                MEMIO
+
+        (cl) | (vos)
+
+
+.fi
+.ce
+Fig 1.2 Relationship of Database and Image I/O
+.ke
+
+.nh
+Requirements
+
+    The requirements for the DBIO interface are driven by its intended usage
+for image and catalog storage.  It is arguable whether the same interface
+should be used for both types of data, but development of an interface such
+as DBIO with all the associated DBMS utilities is expensive, hence we would
+prefer to develop only one such interface.  Furthermore, it is desirable
+for the user to have to learn only one such interface.  The primary functional
+and performance requirements which DBIO must meet are the following (in no
+particular order).
+ +.ls +.ls [1] +DBIO shall provide a high degree of data independence, i.e., a program +shall be able to access a data structure maintained under DBIO without +detailed knowledge of its contents. +.le +.ls [2] +A DBIO datafile shall be self describing and self contained, i.e., it shall +be possible to examine the structure and contents of a DBIO datafile without +prior knowledge of its structure or contents. +.le +.ls [3] +DBIO shall be able to deal efficiently with records containing up to N fields +and with data groups containing up to M records, where N and M are at least +sysgen configurable and are order of magnitude N=10**2 and M=10**6. +.le +.ls [4] +The time required to access an image header under DBIO must be comparable +to the time currently required for the equivalent operation under IMIO. +.le +.ls [5] +It shall be possible for an image header maintained under DBIO to contain +application or user defined fields in addition to the standard fields +required by IMIO. +.le +.ls [6] +It shall be possible to dynamically add new fields to an existing image header +(or to any DBIO record). +.le +.ls [7] +It shall be possible to group similar records together in the database +and to perform global operations upon all or part of the records in a +group. +.le +.ls [8] +It shall be possible for a field of a record to be a one-dimensional array +of any of the primitive types. +.le +.ls [9] +Variant records (records containing variable size fields) shall be supported, +ideally without penalizing efficient access to databases which do not contain +such records. +.le +.ls [A] +It shall be possible to copy a record without knowledge of its contents. +.le +.ls [B] +It shall be possible to merge (join) two records containing disjoint sets of +fields. +.le +.ls [C] +It shall be possible to update a record in place. +.le +.ls [D] +It shall be possible to simultaneously access (retrieve, update, or insert) +multiple records from the same data group. +.le +.le + + +To summarize, the primary requirements are data independence, efficient access +to both large and small databases, and flexibility in the contents of the +database. + +.nh +Conceptual Design + + The DBIO database facilities shall be based upon the relational model. +The relational model is preferred due to its simplicity (to the user) +and due to the demonstrable fact that relational databases can efficiently +handle large amounts of data. In the relational model the database appears +to be nothing more than a set of \fBtables\fR, with no builtin connections +between separate tables. The operations defined upon these tables are based +upon the relational algebra, which is in turn based upon set theory. +The major advantages claimed for relational databases are the simplicity +of the concept of a database as a collection of tables, and the predictability +of the relational operators due to their being based on a formal theoretical +model. + +None of the requirements listed in section 2 state that DBIO must implement +a relational database. Most of our needs can be met by structuring our data +according to the relational data model (i.e., as tables), and providing a +good \fBselect\fR operator for retrieving records from the database. If a +semirelational database is sufficient to meet our requirements then most +likely that is what will be built (at least initially; the relational operators +are very attractive for data analysis). 
DBIO is not expected to be competitive +with any commercial relational database; to try to make it so would probably +compromise the requirement that the interface be compact. +On the other hand, the database requirements of IRAF are similar enough to +those addressed by commercial databases that we would be foolish not to try +to make use of some of the same technology. + + +.ks +.nf + \fBformal relational term\fR \fBinformal equivalents\fR + + relation table + tuple record, row + attribute field, column + domain datatype + primary key record id +.fi +.ke + + +A DBIO \fBdatabase\fR shall consist of one or more \fBrelations\fR (tables). +Each relation shall contain zero or more \fBrecords\fR (rows of the table). +Each record shall contain one or more \fBfields\fR (columns of the table). +All records in a relation shall share the same set of fields, +but all of the fields in a record need not have been assigned values. +When a new \fBattribute\fR (column) is added to an existing relation a default +valued field is added to each current and future record in the relation. + +Each attribute is defined upon a particular \fBdomain\fR, e.g., the set of +all nonnegative integer values less than or equal to 100. It shall be possible +to specify minimum and maximum values for integer and real attributes +and to enumerate the permissible values of a string type attribute. +It shall be possible to specify a default value for an attribute. +If no default value is given INDEF is assumed. +One dimensional arrays shall be supported as attribute types; these will be +treated as atomic datatypes by the relational operators. Array valued +attributes shall be either fixed in size (the most efficient form) or variant. +There need be no special character string datatype since one dimensional +arrays of type character are supported. + +Each relation shall be implemented as a separate file. If the relations +comprising a database are stored in a directory then the directory can +be thought of as the database. Public databases will be stored in well +known public (write protected) directories, private databases in user +directories. The logical directory name of each public database will be +the name of the database. Physical storage for a database need not necessarily +be allocated locally, i.e., a database may be centrally located and remotely +accessed if the host computer is part of a local area network. + +Locking shall be at the level of entire relations rather than at the record +level, at least in the initial implementation. There shall be no support for +indices in the initial implementation except possibly for the primary key. +It should be possible to add either or both of these features to a future +implementation without changing the basic DBIO interface. Modifications to +the internal data structures used in database files will likely be necessary +when adding such a major feature, making a save and restore operation +necessary for each database file to convert it to the new format. +The save format chosen (e.g. FITS table) should be independent of the +internal format used at a particular time on a particular host machine. + +Images shall be stored in the database as individual records. +All image records shall share a common subset of attributes. +Related images (image records) may be grouped together to form relations. +The IRAF image operators shall support operations upon relations +(sets of images) much as the IRAF file operators support operations upon +sets of files. 
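+
+The data model sketched above is easily made concrete.  The fragment below,
+written in C rather than SPP and purely illustrative, shows one possible
+form for a domain descriptor and the range check applied when a field is
+assigned; the structure and procedure names are hypothetical and are not
+part of the proposed interface.  A string valued domain would carry its
+list of permissible values in place of the numeric range, and an image
+record would define attributes such as the exposure time upon domains of
+this kind.
+
+.nf
+/* Illustrative sketch only; all names are hypothetical and the information
+ * actually stored for a domain is listed under Storage Structures below.
+ */
+#include <stdio.h>
+
+struct domain {
+        const char *name;               /* domain name                     */
+        const char *dtype;              /* "bool", "char", "short", etc.   */
+        int     veclen;                 /* 0=variant, 1=scalar, N=vector   */
+        double  defval;                 /* default value (INDEF if omitted)*/
+        double  minval, maxval;         /* legal range (ints and reals)    */
+};
+
+/* Verify that a value to be assigned to a field lies within the range
+ * permitted by the domain of the corresponding attribute.
+ */
+static int
+in_domain (const struct domain *dm, double value)
+{
+        return (value >= dm->minval && value <= dm->maxval);
+}
+
+int
+main (void)
+{
+        /* e.g., the set of all nonnegative integers <= 100 */
+        struct domain pct = { "percent", "int", 1, 0.0, 0.0, 100.0 };
+
+        puts (in_domain (&pct,  42.0) ? "legal" : "out of domain");
+        puts (in_domain (&pct, 101.0) ? "legal" : "out of domain");
+        return (0);
+}
+.fi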
+ +A unary image operator shall take as input a relation (set of one or more +images), inserting the processed images into the output relation. +A binary image operator shall take as input either two relations or a +relation and a record, inserting the processed images into the output +relation. In all cases the output relation can be an input relation as +well. The input relation will be defined either by a list or by selection +using a theta-join (operationally similar to a filename template). + +.nh 2 +Relational Operators + + DBIO shall support two basic types of database operations: operations upon +relations and operations upon records. The basic relational operators +are the following. All of these operators produce as output a new relation. + +.ls +.ls create +Create a new base relation (physical relation as stored on disk) by specifying +an initial set of attributes and the (file)name for the new relation. +Attributes and domains may be specified via a data definition file or by +reference to an existing relation. +A primary key (limited to a single attribute) should be identified. +The new relation initially contains no records. +.le +.ls drop +Delete a (possibly nonempty) base relation and any associated indices. +.le +.ls alter +Add a new attribute or attributes to an existing base relation. +Attributes may be specified explicitly or by reference to another relation. +.le +.ls select +Create a new relation by selecting records from one or more existing base +relations. Input consists of an algebraic expression defining the output +relation in terms of the input relations (usage will be similar to filename +templates). The output relation need not have the same set of attributes as +the input relations. The \fIselect\fR operator shall ultimately implement +all the basic operations of the relational algebra, i.e., select, project, +join, and the set operations. At a minimum, selection and projection are +required in the initial interface. The output of \fBselect\fR is not a +named relation (base relation), but is instead intended to be accessed +by the record level operators discussed in the next section. +.le +.ls edit +Edit a relation. An interactive screen editor is entered allowing the user +to add, delete, or modify tuples (not required in the initial version of +the interface). Field values are verified upon input. +.le +.ls sort +Make the storage order of the records in a relation agree with the order +defined by the primary key (the index associated with the primary key is +always sorted but index order need not agree with storage order). +In general, retrieval on a sorted relation is more efficient than on an +unsorted relation. Sorting also eliminates deadspace left by record +deletion or by updates involving variant records. +.le +.le + + +Additional nonalgebraic operators are required for examining the structure +and contents of relations, returning the number of records or attributes in +a relation, and determining whether a given relation exists. + +The \fIselect\fR operator is the primary user interface to DBIO. +Since most of the relational power of DBIO is bound up in the \fIselect\fR +operator and since \fIselect\fR will be driven by an algebraic expression +(character string) there is considerable scope for future enhancement +of DBIO without affecting existing code. 
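+
+To fix ideas, the toy program below illustrates the effect of \fIselect\fR
+upon a small relation of image records.  It is written in C and does not use
+the DBIO calling sequences, which are not yet specified; the relation is
+simply modeled as an array of records and the predicate stands in for the
+algebraic selection expression.  The records of the output relation would
+then be accessed with the record level operators described in the next
+section.
+
+.nf
+/* Toy illustration only; this is not DBIO code.  In DBIO the relation is
+ * a disk file and the predicate is given by an algebraic expression.
+ */
+#include <stdio.h>
+
+struct image_rec {                      /* one record (tuple)              */
+        const char *title;              /* attribute: image title          */
+        int     naxis1, naxis2;         /* attributes: image dimensions    */
+        double  exptime;                /* attribute: exposure time        */
+};
+
+/* Copy the records satisfying the predicate into the output relation,
+ * returning the cardinality of the output relation.
+ */
+static int
+rel_select (const struct image_rec *in, int nin, struct image_rec *out,
+    int (*pred) (const struct image_rec *))
+{
+        int     i, nout = 0;
+
+        for (i = 0; i < nin; i++)
+                if ((*pred) (&in[i]))
+                        out[nout++] = in[i];
+        return (nout);
+}
+
+static int
+long_exposure (const struct image_rec *r)
+{
+        return (r->exptime > 100.0);
+}
+
+int
+main (void)
+{
+        struct image_rec base[] = {     /* the base relation               */
+                { "flat1", 512, 512,  10.0 },
+                { "obj1",  512, 512, 300.0 },
+                { "obj2",  800, 800, 600.0 },
+        };
+        struct image_rec out[3];
+        int     i, n;
+
+        n = rel_select (base, 3, out, long_exposure);
+        for (i = 0; i < n; i++)         /* "retrieve" each selected record */
+                puts (out[i].title);    /* prints "obj1" and "obj2"        */
+        return (0);
+}
+.fi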
+
+.nh 2
+Record (Tuple) Level Operators
+
+    While the user should primarily see operations on entire relations,
+record level processing is necessary at the program level to permit
+data entry and implementation of special operators.  The basic record
+level operators are the following.
+
+.ls
+.ls retrieve
+Retrieve the next record from the relation defined by \fBselect\fR.
+While the tuples in a relation theoretically form an unordered set,
+tuples will normally be returned in either storage order or in the sort
+order of the primary key.  Although all fields of a retrieved record are
+accessible, an application will typically have knowledge of only a few fields.
+.le
+.ls update
+Rewrite the (possibly modified) current record.  The updated record is
+written back into the base table from which it was read.  Not all records
+produced by \fBselect\fR can be updated.
+.le
+.ls insert
+Insert a new record into an output relation.  The output relation may be an
+input relation as well.  Records added to an output relation which is also
+an input relation do not become candidates for selection until another
+\fBselect\fR occurs.  A retrieve followed by an insert copies a record without
+knowledge of its contents.  A retrieve followed by modification of selected
+fields followed by an insert copies all unmodified fields of the record.
+The attributes of the input and output relations need not match; unmatched
+output attributes take on their default values and unmatched input attributes
+are discarded.  \fBInsert\fR returns a pointer to the output record,
+allowing insertions of null records to be followed by initialization of
+the fields of the new record.
+.le
+.ls delete
+Delete the current record.
+.le
+.le
+
+
+Additional operators are required to open or close a relation for record
+level access and to count the number of records in a relation.
+
+.nh 3
+Constructing Special Relational Operators
+
+    The record level operations may be combined with \fBselect\fR in compiled
+programs to implement arbitrary operations upon entire relations.
+The basic scenario is as follows:
+
+.ls
+.ls [1]
+The set of records to be operated upon, defined by the \fBselect\fR
+operator, is opened as an unordered set (list) of records to be processed.
+.le
+.ls [2]
+The "next" record in the relation is accessed with \fBretrieve\fR.
+.le
+.ls [3]
+The application reads or modifies a subset of the fields of the record,
+updating modified records or inserting the record in the output relation.
+.le
+.ls [4]
+Steps [2] and [3] are repeated until the entire relation has been processed.
+.le
+.le
+
+
+Examples of such operators are conversion to and from DBIO and LIST file
+formats, column extraction, minimum or maximum of an attribute (domain
+algebra), and all of the DBMS and IMAGES operators.
+
+.nh 2
+Field (Attribute) Level Operators
+
+    Substantial processing of the contents of a database is possible without
+ever accessing the individual fields of a record.  If field level access is
+required the record must first be retrieved or inserted.  Field level access
+requires knowledge of the names of the attributes of the parent relation,
+but not their exact datatypes.  Automatic type conversion occurs when field
+values are queried or set.
+
+.ls
+.ls get
+.sp
+Get the value of the named scalar or vector field (typed).
+.le
+.ls put
+.sp
+Put the value of the named scalar or vector field (typed).
+.le +.ls read +Read the named fields into an SPP data structure, given the name, datatype, +and length (if vector) of each field in the output structure. +There must be an attribute in the parent relation for each field in the +output structure. +.le +.ls write +Copy an SPP data structure into the named fields of a record, given the +name, datatype, and length (if vector) of each field in the input structure. +There must be an attribute in the parent relation for each field in the +input structure. +.le +.ls access +Determine whether a relation has the named attribute. +.le +.le + +.nh 2 +Storage Structures + + The DBIO storage structures are the data structures used by DBIO to +maintain relations in physical storage. The primary design goals are +simplicity and efficiency in time and space. Most actual relations are +expected to fall into three classes: + +.ls +.ls [1] +Relations containing only a single record, e.g., an image stored alone +in a relation. +.le +.ls [2] +Relations containing several dozen or several hundred records, e.g., +a collection of spectra from an observing run. +.le +.ls [3] +Large relations containing 10**5 or 10**6 records, e.g., the output of an +analysis program or an astronomical catalog. +.le +.le + + +Updates and insertions are generally random access operations; retrieval +based on the values of several attributes requires efficient sequential +access. Efficient random access for relations [2] and [3] requires use +of an index. Efficient sequential access requires that records be +accessible in storage order without reference to the index, i.e., that +records be chained in storage order. Efficient field access where a +record contains several dozen attributes requires something better than +a linear search over the attribute list. + +The use of an index shall be limited initially to a single index for +the primary key. The primary key will be restricted to a single attribute, +with the application defining the attribute to be used (in practice few +attributes are usable as keys). +The index will be a standard B+ tree, with one exception: the root block +of the tree will be maintained in dedicated storage in the datafile. +If and only if a relation grows so large that it overflows the root block +will a separate index file be allocated for the index. This will eliminate +most of the overhead associated with the index for small relations. + +Efficient sequential access will be provided in either of two ways: via the +index in index order or via the records themselves in storage order, +depending on the operation being performed. If an external index is used +the leaves will be chained to permit efficient sequential access in index +order. If the relation also happens to be sorted in index order then this +mode of access will be very efficient. Link information will also be stored +directly in the records to permit efficient sequential access when it is +not necessary or possible to use the index. + +Assuming that there is at most one index associated with a relation, +at most two files will be required to implement the relation. The relation +itself will have the file extension ".db". The index file, if any, will +have the extension ".dbi". The root name of both files will be the name of +the relation. + +The datafile header structure will probably have to be maintained in binary +if we are to keep the overhead of datafile access to acceptable levels for +small relations. 
Careful design of the basic header structure should +make most future refinements to the header possible without modification of +existing databases. The revision number of DBIO used to create the datafile +will be saved in the header to make at least detection of obsolete headers +possible. + +.nh 3 +Structure of a Binary Relation + + Putting all this together we come up with the following structure for +a binary relation: + + +.ks +.nf + BOF + relation header -+ + magic | + dbio revision number | + creation date | + relation name | + number of attributes |- fixed size header + primary key | + record size | + domain list | + attribute list | + miscellaneous | + string buffer | + root block of index -+ + record 1 + physical record length (offset to next record) + logical record length (implies number of attributes set) + field storage + + record 2 + ... + record N + EOF +.fi +.ke + + +Vector valued fields with a fixed upper size will be stored directly in the +record, prefixed by the length of the actual vector (which may vary from +record to record). +Storage for variant fields will be allocated outside the record, placing only +a pointer to the data vector and byte count in the record itself. +Variant records are thus reduced to fixed size records, +simplifying record access and making sequential access more efficient. + +Records will change size only when a new attribute is added to an existing +relation, followed by assignment into a record written when there were +fewer attributes. If the new record will not fit into the physical slot +already allocated, the record is written at EOF and the original record +is deleted. Deletion of a record is achieved by setting the logical record +length to zero. Storage is not reclaimed until a sort occurs, hence +recovery of deleted records is possible. + +To minimize buffer space and memory to memory copies when accessing a +relation it is desirable to work directly out of the FIO buffers. +To make this possible records will not be permitted to straddle logical +block boundaries. A file block will typically contain several records +followed by a gap. The gap may be used to accommodate record expansion +without moving a record to EOF. The size of a file block is fixed when +the relation is created. + +.nh 3 +The Attribute List + + Efficient lookup of attribute names suggests maintenance of a hash table +in the datafile header. There will be a fixed upper limit on the number of +attributes permitted in a single relation (but not on the number of records). +Placing an upper limit on the number of attributes simplifies the software +considerably and permits use of a fixed size header, making it possible to +read or update the entire header in one disk access. There will also be an +upper limit on the number of domains, but the domain list is not searched +very often hence a linear search will do. + +All information about the decomposition of a record into fields, other than +the logical length of vector valued fields, is given by the attribute list. +Records contain only data with no embedded structural information other than +the length of the vector fields. New attributes are added to a relation by +appending to the attribute list. Existing records are not affected. +By comparing the logical length of a record to the offset for a particular +field we can tell whether storage has been allocated for that field in the +record. 
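+
+The sketch below, again in C and purely illustrative (the structure and
+names are hypothetical), shows this test.  A field exists in a given record
+only if the logical record length extends past the offset recorded for the
+corresponding attribute; records written before the attribute was added
+fail the test, and the field takes the default value of its domain.
+
+.nf
+#include <stdio.h>
+
+struct attribute {
+        const char *name;               /* attribute name                  */
+        int     domain;                 /* index into the domain table     */
+        int     offset;                 /* offset of the field in a record */
+};
+
+/* A record contains storage for a field only if its logical length
+ * extends past the offset at which the field begins.
+ */
+static int
+field_allocated (const struct attribute *ap, int logical_reclen)
+{
+        return (ap->offset < logical_reclen);
+}
+
+int
+main (void)
+{
+        struct attribute exptime = { "exptime", 4, 48 };
+
+        /* a record written before the attribute was added (length 40)
+         * versus one written afterward (length 64)
+         */
+        puts (field_allocated (&exptime, 40) ? "allocated" : "defaulted");
+        puts (field_allocated (&exptime, 64) ? "allocated" : "defaulted");
+        return (0);
+}
+.fi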
+
+Domains are used to limit the range of values a field can take on in an
+assignment, and to flag attribute comparisons which are likely to be erroneous
+(e.g. order comparison of a pixel coordinate and a wavelength).  The domains
+"bool", "char", "short", etc. are predefined.  The following information
+must be stored for each user defined domain:
+
+
+.ks
+.nf
+    name                      may be same as attribute name
+    datatype                  bool, char, short, etc.
+    physical vector length    0=variant, 1=scalar, N=vector
+    default                   default value, INDEF if not given
+    minimum                   minimum value (ints and reals)
+    maximum                   maximum value (ints and reals)
+    enumval                   enumerated values (strings)
+.fi
+.ke
+
+
+The following information is required to describe each attribute.
+The attribute list is maintained separately from the hash table of attribute
+names and can be used to regenerate the hash table if necessary.
+
+
+.ks
+.nf
+    name                      no embedded whitespace
+    domain                    index into domain table
+    offset                    offset in record
+.fi
+.ke
+
+
+All strings will be stored in a fixed size string buffer in the header
+area; it is the index of the string which is stored in the domain and
+attribute lists.  This eliminates the need to place an upper limit on the
+size of domain names and enumerated value lists and makes it possible
+for a single attribute name string to be referenced in both the attribute
+list and the attribute hash table.
+
+.nh
+Specifications