aboutsummaryrefslogtreecommitdiff
path: root/sys/dbio/db2.hlp
blob: ffe3b74c06d12a71affc158a2e9c64a3a949133e (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
.help dbio Nov84 "Database I/O Design"
.ce
\fBIRAF Database I/O\fR
.ce
Doug Tody
.ce
November 1984
.sp 3
.nh
Introduction

    The IRAF database i/o package (DBIO) is a library of SPP callable
procedures used to create, modify, and access IRAF database files.
All access to these database files shall be indirectly or directly via the
DBIO interface.  DBIO shall be implemented using IRAF file i/o and memory
management facilities, hence the package will be compact and portable.
The separate CL level package DBMS shall be provided for interactive database
access and for procedural access to the database from within CL scripts.
The DBMS tasks will access the database via DBIO.

Virtually all runtime IRAF datafiles not maintained in text form shall be
maintained under DBIO, hence it is essential that the interface be both
efficient and compact.  In particular, bulk data (images) and large catalogs
shall be maintained under DBIO.  The requirement for flexibility in defining
and accessing IRAF image headers necessitates quite a sophisticated interface.
Catalog storage is required primarily for module intercommunication and
output of the results of the larger IRAF applications packages, but will also
be useful for accessing astronomical catalogs prepared outside IRAF (e.g.,
the SAO star catalog).  In short, virtually all IRAF applications packages
are expected to make use of DBIO; many will depend upon it heavily.

The relationship of the DBIO and DBMS packages to each other and to the
related standard IRAF interfaces is shown in Figure 1.1.


.ks
.nf
		DBMS
			DBIO
				FIO
				MEMIO
					(kernel)
						(datafiles)

	     (cl)    |      (vos)     |     (host)



.fi
.ce
Fig 1.1 Major Interfaces
.ke


While images will be maintained under DBIO, access to the pixels will
continue to be provided by the IMIO interface.  IMIO is a higher level interface
which will use DBIO to maintain the image header.  Pixel storage will be either
in a separate pixel storage file or in the database file itself (as a one
dimensional array), depending on the size of the image.
A system defined thresold value will determine which type of storage is used.
The relationship of IMIO to DBIO is shown in Figure 1.2.


.ks
.nf
			IMAGES
				IMIO
					DBIO
					FIO
					MEMIO

			(cl)   |    (vos)


.fi
.ce
Fig 1.2 Relationship of Database and Image I/O
.ke

.nh
Requirements

    The requirements for the DBIO interface are driven by its intended usage
for image and catalog storage.  It is arguable whether the same interface
should be used for both types of data, but development of an interface such
as DBIO with all the associated DBMS utilities is expensive, hence we would
prefer to have to develop only one such interface.  Furthermore, it is desirable
for the user to only have to learn one such interface.  The primary functional
and performance requirements which DBIO must meet are the following (in no
particular order).

.ls
.ls [1]
DBIO shall provide a high degree of data independence, i.e., a program
shall be able to access a data structure maintained under DBIO without
detailed knowledge of its contents.
.le
.ls [2]
A DBIO datafile shall be self describing and self contained, i.e., it shall
be possible to examine the structure and contents of a DBIO datafile without
prior knowledge of its structure or contents.
.le
.ls [3]
DBIO shall be able to deal efficiently with records containing up to N fields
and with data groups containing up to M records, where N and M are at least
sysgen configurable and are order of magnitude N=10**2 and M=10**6.
.le
.ls [4]
The time required to access an image header under DBIO must be comparable
to the time currently required for the equivalent operation under IMIO.
.le
.ls [5]
It shall be possible for an image header maintained under DBIO to contain
application or user defined fields in addition to the standard fields
required by IMIO.
.le
.ls [6]
It shall be possible to dynamically add new fields to an existing image header
(or to any DBIO record).
.le
.ls [7]
It shall be possible to group similar records together in the database
and to perform global operations upon all or part of the records in a
group.
.le
.ls [8]
It shall be possible for a field of a record to be a one-dimensional array
of any of the primitive types.
.le
.ls [9]
Variant records (records containing variable size fields) shall be supported,
ideally without penalizing efficient access to databases which do not contain
such records.
.le
.ls [A]
It shall be possible to copy a record without knowledge of its contents.
.le
.ls [B]
It shall be possible to merge (join) two records containing disjoint sets of
fields.
.le
.ls [C]
It shall be possible to update a record in place.
.le
.ls [D]
It shall be possible to simultaneously access (retrieve, update, or insert)
multiple records from the same data group.
.le
.le


To summarize, the primary requirements are data independence, efficient access
to both large and small databases, and flexibility in the contents of the
database.

.nh
Conceptual Design
    
    The DBIO database facilities shall be based upon the relational model.
The relational model is preferred due to its simplicity (to the user)
and due to the demonstrable fact that relational databases can efficiently
handle large amounts of data.  In the relational model the database appears
to be nothing more than a set of \fBtables\fR, with no builtin connections
between separate tables.  The operations defined upon these tables are based
upon the relational algebra, which is in turn based upon set theory.
The major advantages claimed for relational databases are the simplicity
of the concept of a database as a collection of tables, and the predictability
of the relational operators due to their being based on a formal theoretical
model.

None of the requirements listed in section 2 state that DBIO must implement
a relational database.  Most of our needs can be met by structuring our data
according to the relational data model (i.e., as tables), and providing a
good \fBselect\fR operator for retrieving records from the database.  If a
semirelational database is sufficient to meet our requirements then most
likely that is what will be built (at least initially; the relational operators
are very attractive for data analysis).  DBIO is not expected to be competitive
with any commercial relational database; to try to make it so would probably
compromise the requirement that the interface be compact.
On the other hand, the database requirements of IRAF are similar enough to
those addressed by commercial databases that we would be foolish not to try
to make use of some of the same technology.


.ks
.nf
	\fBformal relational term\fR		    \fBinformal equivalents\fR

		relation			table
		tuple				record, row
		attribute			field, column
		domain				datatype
		primary key			record id
.fi
.ke


A DBIO \fBdatabase\fR shall consist of one or more \fBrelations\fR (tables).
Each relation shall contain zero or more \fBrecords\fR (rows of the table).
Each record shall contain one or more \fBfields\fR (columns of the table).
All records in a relation shall share the same set of fields,
but all of the fields in a record need not have been assigned values.
When a new \fBattribute\fR (column) is added to an existing relation a default
valued field is added to each current and future record in the relation.

Each attribute is defined upon a particular \fBdomain\fR, e.g., the set of
all nonnegative integer values less than or equal to 100.  It shall be possible 
to specify minimum and maximum values for integer and real attributes
and to enumerate the permissible values of a string type attribute.
It shall be possible to specify a default value for an attribute.
If no default value is given INDEF is assumed.
One dimensional arrays shall be supported as attribute types; these will be
treated as atomic datatypes by the relational operators.  Array valued
attributes shall be either fixed in size (the most efficient form) or variant.
There need be no special character string datatype since one dimensional
arrays of type character are supported.

Each relation shall be implemented as a separate file.  If the relations
comprising a database are stored in a directory then the directory can
be thought of as the database.  Public databases will be stored in well
known public (write protected) directories, private databases in user
directories.  The logical directory name of each public database will be
the name of the database.  Physical storage for a database need not necessarily
be allocated locally, i.e., a database may be centrally located and remotely
accessed if the host computer is part of a local area network.

Locking shall be at the level of entire relations rather than at the record
level, at least in the initial implementation.  There shall be no support for
indices in the initial implementation except possibly for the primary key.
It should be possible to add either or both of these features to a future
implementation without changing the basic DBIO interface.  Modifications to
the internal data structures used in database files will likely be necessary
when adding such a major feature, making a save and restore operation
necessary for each database file to convert it to the new format.
The save format chosen (e.g. FITS table) should be independent of the
internal format used at a particular time on a particular host machine.

Images shall be stored in the database as individual records.
All image records shall share a common subset of attributes.  
Related images (image records) may be grouped together to form relations.
The IRAF image operators shall support operations upon relations
(sets of images) much as the IRAF file operators support operations upon
sets of files.

A unary image operator shall take as input a relation (set of one or more
images), inserting the processed images into the output relation.  
A binary image operator shall take as input either two relations or a
relation and a record, inserting the processed images into the output
relation.  In all cases the output relation can be an input relation as
well.  The input relation will be defined either by a list or by selection
using a theta-join (operationally similar to a filename template).

.nh 2
Relational Operators

    DBIO shall support two basic types of database operations: operations upon
relations and operations upon records.  The basic relational operators
are the following.  All of these operators produce as output a new relation.

.ls
.ls create
Create a new base relation (physical relation as stored on disk) by specifying
an initial set of attributes and the (file)name for the new relation.
Attributes and domains may be specified via a data definition file or by
reference to an existing relation.
A primary key (limited to a single attribute) should be identified.
The new relation initially contains no records.
.le
.ls drop
Delete a (possibly nonempty) base relation and any associated indices.
.le
.ls alter 
Add a new attribute or attributes to an existing base relation.
Attributes may be specified explicitly or by reference to another relation.
.le
.ls select
Create a new relation by selecting records from one or more existing base
relations.  Input consists of an algebraic expression defining the output
relation in terms of the input relations (usage will be similar to filename
templates).  The output relation need not have the same set of attributes as
the input relations.  The \fIselect\fR operator shall ultimately implement
all the basic operations of the relational algebra, i.e., select, project,
join, and the set operations.  At a minimum, selection and projection are
required in the initial interface.  The output of \fBselect\fR is not a
named relation (base relation), but is instead intended to be accessed
by the record level operators discussed in the next section.
.le
.ls edit
Edit a relation.  An interactive screen editor is entered allowing the user
to add, delete, or modify tuples (not required in the initial version of
the interface).  Field values are verified upon input.
.le
.ls sort
Make the storage order of the records in a relation agree with the order
defined by the primary key (the index associated with the primary key is
always sorted but index order need not agree with storage order).
In general, retrieval on a sorted relation is more efficient than on an
unsorted relation.  Sorting also eliminates deadspace left by record
deletion or by updates involving variant records.
.le
.le


Additional nonalgebraic operators are required for examining the structure
and contents of relations, returning the number of records or attributes in
a relation, and determining whether a given relation exists.

The \fIselect\fR operator is the primary user interface to DBIO.
Since most of the relational power of DBIO is bound up in the \fIselect\fR
operator and since \fIselect\fR will be driven by an algebraic expression
(character string) there is considerable scope for future enhancement
of DBIO without affecting existing code.

.nh 2
Record (Tuple) Level Operators

    While the user should see primarily operations on entire relations,
record level processing is necessary at the program level to permit
data entry and implementation of special operators.  The basic record
level operators are the following.

.ls
.ls retrieve
Retrieve the next record from the relation defined by \fBselect\fR.
While the tuples in a relation theoretically form an unordered set,
tuples will normally be returned in either storage order or in the sort
order of the primary key.  Although all fields of a retrieved record are
accessible, an application will typically have knowledge of only a few fields.
.le
.ls update
Rewrite the (possibly modified) current record.  The updated record is
written back into the base table from which it was read.  Not all records
produced by \fBselect\fR can be updated.
.le
.ls insert
Insert a new record into an output relation.  The output relation may be an
input relation as well.  Records added to an output relation which is also
an input relation do not become candidates for selection until another
\fBselect\fR occurs.  A retrieve followed by an insert copies a record without
knowledge of its contents.  A retrieve followed by modification of selected
fields followed by an insert copies all unmodified fields of the record.
The attributes of the input and output relations need not match; unmatched
output attributes take on their default values and unmatched input attributes
are discarded.  \fBInsert\fR returns a pointer to the output record,
allowing insertions of null records to be followed by initialization of
the fields of the new record.
.le
.ls delete
Delete the current record.
.le
.le


Additional operators are required to close or open a relation for record
level access and to count the number of records in a relation.

.nh 3
Constructing Special Relational Operators

    The record level operations may be combined with \fBselect\fR in compiled
programs to implement arbitrary operations upon entire relations.
The basic scenario is as follows:

.ls
.ls [1]
The set of records to be operated upon, defined by the \fBselect\fR
operator, is opened as an unordered set (list) of records to be processed.
.le
.ls [2]
The "next" record in the relation is accessed with \fBretrieve\fR.
.le
.ls [3]
The application reads or modifies a subset of the fields of the record,
updating modified records or inserting the record in the output relation.
.le
.ls [4]
Steps [2] and [3] are repeated until the entire relation has been processed.
.le
.le


Examples of such operators are conversion to and from DBIO and LIST file
formats, column extraction, mimimum or maximum of an attribute (domain
algebra), and all of the DBMS and IMAGES operators.

.nh 2
Field (Attribute) Level Operators

    Substantial processing of the contents of a database is possible without
ever accessing the individual fields of a record.  If field level access is
required the record must first be retrieved or inserted.  Field level access
requires knowledge of the names of the attributes of the parent relation,
but not their exact datatypes.  Automatic type conversion occurs when field
values are queried or set.

.ls
.ls get
.sp
Get the value of the named scalar or vector field (typed).
.le
.ls put
.sp
Put the value of the named scalar or vector field (typed).
.le
.ls read
Read the named fields into an SPP data structure, given the name, datatype,
and length (if vector) of each field in the output structure.
There must be an attribute in the parent relation for each field in the
output structure.
.le
.ls write
Copy an SPP data structure into the named fields of a record, given the
name, datatype, and length (if vector) of each field in the input structure.
There must be an attribute in the parent relation for each field in the
input structure.
.le
.ls access
Determine whether a relation has the named attribute.
.le
.le

.nh 2
Storage Structures

    The DBIO storage structures are the data structures used by DBIO to
maintain relations in physical storage.  The primary design goals are
simplicity and efficiency in time and space.  Most actual relations are
expected to fall into three classes:

.ls
.ls [1]
Relations containing only a single record, e.g., an image stored alone
in a relation.
.le
.ls [2]
Relations containing several dozen or several hundred records, e.g.,
a collection of spectra from an observing run.
.le
.ls [3]
Large relations containing 10**5 or 10**6 records, e.g., the output of an
analysis program or an astronomical catalog.
.le
.le


Updates and insertions are generally random access operations; retrieval
based on the values of several attributes requires efficient sequential
access.  Efficient random access for relations [2] and [3] requires use
of an index.  Efficient sequential access requires that records be
accessible in storage order without reference to the index, i.e., that
records be chained in storage order.  Efficient field access where a
record contains several dozen attributes requires something better than
a linear search over the attribute list.

The use of an index shall be limited initially to a single index for
the primary key.  The primary key will be restricted to a single attribute,
with the application defining the attribute to be used (in practice few
attributes are usable as keys).
The index will be a standard B+ tree, with one exception: the root block
of the tree will be maintained in dedicated storage in the datafile.
If and only if a relation grows so large that it overflows the root block
will a separate index file be allocated for the index.  This will eliminate
most of the overhead associated with the index for small relations.

Efficient sequential access will be provided in either of two ways: via the
index in index order or via the records themselves in storage order,
depending on the operation being performed.  If an external index is used
the leaves will be chained to permit efficient sequential access in index
order.  If the relation also happens to be sorted in index order then this
mode of access will be very efficient.  Link information will also be stored
directly in the records to permit efficient sequential access when it is
not necessary or possible to use the index.

Assuming that there is at most one index associated with a relation,
at most two files will be required to implement the relation.  The relation
itself will have the file extension ".db".  The index file, if any, will
have the extension ".dbi".  The root name of both files will be the name of
the relation.

The datafile header structure will probably have to be maintained in binary
if we are to keep the overhead of datafile access to acceptable levels for
small relations.  Careful design of the basic header structure should
make most future refinements to the header possible without modification of
existing databases.  The revision number of DBIO used to create the datafile
will be saved in the header to make at least detection of obsolete headers
possible.

.nh 3
Structure of a Binary Relation

    Putting all this together we come up with the following structure for
a binary relation:


.ks
.nf
	BOF
	relation header				-+
		magic				 |
		dbio revision number		 |
		creation date			 |
		relation name			 |
		number of attributes		 |- fixed size header
		primary key			 |
		record size			 |
		domain list			 |
		attribute list			 |
		miscellaneous			 |
	string buffer				 |
	root block of index			-+
	record 1
		physical record length (offset to next record)
		logical record length (implies number of attributes set)
		field storage
		<gap>
	record 2
		...
	record N
	EOF
.fi
.ke


Vector valued fields with a fixed upper size will be stored directly in the
record, prefixed by the length of the actual vector (which may vary from
record to record).
Storage for variant fields will be allocated outside the record, placing only
a pointer to the data vector and byte count in the record itself.
Variant records are thus reduced to fixed size records,
simplifying record access and making sequential access more efficient.

Records will change size only when a new attribute is added to an existing
relation, followed by assignment into a record written when there were
fewer attributes.  If the new record will not fit into the physical slot
already allocated, the record is written at EOF and the original record
is deleted.  Deletion of a record is achieved by setting the logical record
length to zero.  Storage is not reclaimed until a sort occurs, hence
recovery of deleted records is possible.

To minimize buffer space and memory to memory copies when accessing a
relation it is desirable to work directly out of the FIO buffers.
To make this possible records will not be permitted to straddle logical
block boundaries.  A file block will typically contain several records
followed by a gap.  The gap may be used to accommodate record expansion
without moving a record to EOF.  The size of a file block is fixed when
the relation is created.

.nh 3
The Attribute List

    Efficient lookup of attribute names suggests maintenance of a hash table
in the datafile header.  There will be a fixed upper limit on the number of
attributes permitted in a single relation (but not on the number of records).
Placing an upper limit on the number of attributes simplifies the software
considerably and permits use of a fixed size header, making it possible to
read or update the entire header in one disk access.  There will also be an
upper limit on the number of domains, but the domain list is not searched
very often hence a linear search will do.

All information about the decomposition of a record into fields, other than
the logical length of vector valued fields, is given by the attribute list.
Records contain only data with no embedded structural information other than
the length of the vector fields.  New attributes are added to a relation by
appending to the attribute list.  Existing records are not affected.
By comparing the logical length of a record to the offset for a particular
field we can tell whether storage has been allocated for that field in the
record.

Domains are used to limit the range of values a field can take on in an
assignment, and to flag attribute comparisons which are likely to be erroneous
(e.g. order comparison of a pixel coordinate and a wavelength).  The domains
"bool", "char", "short", etc. are predefined.  The following information
must be stored for each user defined domain:


.ks
.nf
	name			may be same as attribute name
	datatype		bool, char, short, etc.
	physical vector length	0=variant, 1=scalar, N=vector
	default			default value, INDEF if not given
	minimum			mimimum value (ints and reals)
	maximum			maximum value (ints and reals)
	enumval			enumerated values (strings)
.fi
.ke


The following information is required to describe each attribute.
The attribute list is maintained separately from the hash table of attribute
names and can be used to regenerate the hash table of attribute names if
necessary.


.ks
.nf
	name			no embedded whitespace
	domain			index into domain table
	offset			offset in record
.fi
.ke


All strings will be stored in a fixed size string buffer in the header
area; it is the index of the string which is stored in the domain and
attribute lists.  This eliminates the need to place an upper limit on the
size of domain names and enumerated value lists and makes it possible
for a single attribute name string to be referenced in both the attribute
list and the attribute hash table.

.nh
Specifications