SOC3
Home Up SOC1 SOC2 SOC3 SOC4 SOC5

 

SCIENCE OF COMPUTING

FILE STRUCTURES

Files are an artifact that results from the shortcomings of primary memory.
If memory were large enough, permanent (non-volatile),and safe enough, there would be no need for secondary storage.
This page is about external storage structures.
 

 Sequential Files

Files are comprised of: Text or “Records.”
Records contain data, such as Business information:
Name, Address, Employee Id., SSN, etc.
Data goes in “fields.”
Fields (generally) have:
Names,
Sizes,
Data types.
Files may be organized in a number of ways:
Sequential (and Text )
Direct (Random Access) or Sequential
Direct only


Rudiments of Sequential Files


The sequential organization may be only conceptual in nature. The file may be stored in one form and be presented to the user on another (I.e. as a sequential system).
If the storage device is a tape system, the file is normally stored in sequential order because of the sequential nature of the tape itself.
If the device is a disk system, the records of the file might be disbursed over the disk to take advantage of available sectors.
In this case we could link the records with a pointer system where the pointers would represent locations on the disk rather than in main memory.
In reality, however, most operating systems eliminate the need for such pointers by maintaining a record of the sectors in which the file is stored and the order in which these sectors should be read when retrieving the file.
    example: FAT (file allocation table)
Regardless of the storage system used, the end of a sequential file usually is indicated by an EOF (end-of-file) mark.
In the linked structure, this mark might be similar to the NIL pointer in a linked list.
Another approach is to store a special record, called a sentinel, as the last record of the file to mark the end.
To avoid confusion, the fields in such a record must contain values that will never occur as data in the application.
Still other implementations record the length of the file at its beginning and then use this information to detect the end of the file.
DOS stores the size of the file in the name of the file.
Sequential files are limited to being viewed in a simple sequential order. The only way to retrieve records is to start at the beginning of the file and extract them in the order provided.
This sequential nature has some undesirable aspects.
Before processing the payroll (using the employee file), update the employee records with “time worked” during the current pay period.
Note the time worked by a particular employee,
Search the file to find the corresponding employee record,
Update the record.
Select the next time sheet & search for the corresponding record. The process is simplified if the order in which the time sheets are selected agrees with the order of the employee records in the file.
After updating one employee’s record, it is not necessary to return to the beginning of the file to initiate the search for the next record to be updated.
Instead, the search continues from the current file position.
For this reason, sequential files are normally stored in alphabetical or numerical order according to the contents of a selected field, known as the key field, such as last name, employee ID, or Social Security Number. If the time sheets are arranged according to this same key field, updating the payroll file reduces the process to merely updating the records one after the other as they appear in the sequential file. A lot of sorting takes place in relation to processing sequential files. To update a sequential file:
The new information (such as the collection of time sheets) is first recorded in the form of a sequential file known as a transaction file.
The transaction file is sorted to match the order of the file to be updated.
Finally, the records in the original file are updated in the order in which they appear.


Programming Concerns

High level languages tend to express file manipulation through procedures that are:
Defined as a part of the formal language itself or
Provided as language extensions in an adjoining library.
The parameters of these procedures identify:
The target file and
The area of main memory that is to receive or supply the data in the record being manipulated.
In Pascal:
read (MailList, MailRecord)
 and
write (MailList, MailRecord)
 are used to retrieve and deposit information in a record called MailRecord to/from to a sequential file identified as MailList.
Along with the file identifier within the parameter list, we find the name MailRecord (probably a “heterogeneous array”) that is used within the program to identify the block of data being transferred.
In C, we use: fscanf(MailList, '%s%s', FirstName, LastName);  and fprintf(MailList,'%s%s ', FirstName, LastName);
The first of these statements reads two strings of characters from the file identified as MailList, the first being assigned to the array named FirstName, the second to the array named LastName.
The second statement writes the strings identified as FirstName and LastName, each followed by a blank for separation purposes, in the file identified by MailList.
Here fscanf and fprintf are functions found in C's standard library.  These functions apply what is called formatted I/O. meaning that the I/O statement includes a description of the data's format (organization) within the file.
In many cases the program statements not only transfer bit patterns between main memory and mass storage, they also convert between coding systems. For example:
Reading is a variable of type integer that is assigned the value 35.
The bit pattern in main memory would be the two's complement representation of 35.
The file stores record values that are of type character.
The bit pattern on disk for this field in the record would be  F3F5.
The Pascal and C++ statements write(Temperatures, Reading); and Temperatures << Reading;           cause the ASCII representations of the characters 3 and 5 to be placed in the file
       Temperatures.
These statements include no explicit information as to the location in the file of the record being manipulated.
Because the file is sequential, there is no choice to be made.
The record being read is the one immediately following the current position in the file or the record being written is placed immediately following the current position.
Most high-level language systems provide features that assist with the problem of detecting the EOF mark.
In Pascal, for example, one can test for the EOF mark on the file identified as EmplData with the expression eof(EmplData).
If EOF mark has been reached, below evaluates to true, else false.
                          while (not eof(EmplData)) do
                             (read the next record and process the appropriate check);

Text Files

Restricting the size of the logical records in a sequential file to a single byte, we obtain another file type known as a text file.
Commonly used for storing documents consisting of text.
Each logical record consisting of a single symbol or control code (such as a carriage return, line feed, or font indicator).
When reading a text file, one receives the characters of the document in the sequential order in which they appear when the document is displayed in printed form.
An alternative to viewing a text file as a stream of individual bytes is to view the file as a sequence of lines separated by end-of-line markers.

Manipulating Text Files

For storage purposes:
A text file is broken into multiple-byte units that form physical records sized to be compatible with the mass storage system in use.
The manipulation of physical records is normally handled by the underlying software, usually the operating system.
When reading from or writing to a text file, the user has the image of a file consisting of a sequence of individual bytes or lines as desired.
When the first byte or line is requested from the file, the underlying software retrieves one or more entire physical records from the mass storage device and holds these records in a buffer in main memory. From this buffer, the request for a portion of the file as well as future requests are honored. As the underlying software passes the contents of the buffer to the user, additional physical records are retrieved into the buffer until the end of the file is reached. As a user writes to a text file:
The underlying software collects the individual bytes in a buffer until a complete physical record accumulates, or
until the user indicates that the end of the file has been reached.
It is only at these times that a physical record is transferred to mass storage.
In some cases the underlying software disguises the strict sequential nature of a text file.
A word processor, for example, might store documents as text files but not limit the processing of the file strictly to its sequential form.
Instead, it:
Reads several physical records into main memory and displays this text on the monitor as it would appear in the printed form of the document.
Allows the user to move back and forth within this block of text while making changes.
Retrieves more physical records from mass storage as the user moves farther down the text.
Displays the new text on the screen,
Re-divides and replaces previously accessed portions of the document on to the mass storage device.
In this way the word processor provides random access to that portion of the file that is currently held in main memory.


Programming Concerns

Text files are among the most common file structures supported by high-level programming languages. As with sequential files, this support takes the form of prewritten procedures that may be included as a part of the formal definition of the language or as offered in libraries as extensions to the language.
Pascal is an example of the first case.
It includes the routines read and write for accessing text files.
                    read(OldManuscript, Symbol);
 retrieves one byte from the text file identified as OldManuscript and assigns that byte to the variable Symbol.
Similarly, the statement:                     write(NewManuscript, Symbol);
 places the byte currently assigned to Symbol at the current location in the file NewManuscript.

Indexed Files

To use the file in an interactive query system, the records in the file will be requested in an arbitrary order.
Use a workstation to display employee records and allow I/O for time in service, available skills, past promotions, and so on.
Storing the employee data as a sequential file would result in  lengthy delays between a request for a record and the displaying of that record.
Quick access to the desired data is required.
One approach is based on the same idea used in a textbook in which an index allows a topic to be located more directly than through a sequential search of the entire book.
This index concept is easily applied to file storage, resulting in an indexed file.

Index Fundamentals

An index for a file consists of:
A listing of the key field values occurring in the file.
The location in mass storage of the corresponding record.
It may be necessary to access a file by more than one key field. Since a file can be kept in only one physical order at a time, it is frequently necessary to use a multiple index system.
One approach is an inverted file,
One key field designated as the primary key
The other keys are designated as secondary keys.
Multiple indexed files introduce their own problems:
As records are inserted and deleted, all indexes must be updated.
Indexed files are used in the context of file directory systems maintained by operating systems. (i.e., FAT)
These directory systems are essentially indexes in which entries represent files rather than records.
In a sense, the operating system considers the entire collection of files stored on a disk as one large file.
This directory (index) is stored on the disk and used by the operating system to find files as they are requested.
Users of personal computers can usually hear the directory (index) being accessed when they request a listing of the files contained on a floppy disk.
Moreover, if the directory is not yet in main memory, they may detect the two-step access process (read the directory from the disk, then read the desired data) when requesting access to a particular file.
Since the index must be moved to main memory to be searched, it must remain small enough to fit within a reasonable memory area.
This requirement can produce problems if the number of records in the file becomes large.
One technique used to overcome such a growth problem is to use the index to find an approximate, rather than the precise, location of the desired record.
Organizing the file in a sorted sequential order
Chop it into short, multi-record segments.
Each segment (consisting of several records) is then represented in the index by a single entry, which is normally the last key field value in the segment.
The result is a partial index containing only a few of the key field values appearing in the file.

INDEX ORGANIZATION

As an example of a partial-index structure:
The key field entry in each record in the index file is greater than or equal to the largest record value in the segment to which it points.
The retrieval of a record from this system consists of finding the first entry in the index that is equal to or greater than the desired entry and then searching the corresponding sequential segment for the target record. Second approach:
Divide the index into pieces.
Use an index file to locate the index.
The index system becomes a tree.
If the sequential segments of the index entries become too large, construct a separate index for each of them.
The high level index would no longer direct the search to a file segment, but rather to the correct segment index, which would then be used to find the location of the record in question.
This method is typically used in the UNIX file system.
Each directory (index) is hereogeneous, in that some entries refer to individual files and others refer to subdirectories (sub-indexes).

Programming Concerns

High-level programming languages do not generally contain commands specifically for manipulating files by means of an index.
Database systems usually provide abstract tools that relieve programmers from the task of maintaining their own indexed file systems.
However, many programming languages offer the building blocks required to construct an indexed file system.
In a C program, the function fsetpos can be used to establish a current position within a file:
                   fsetpos(Personnel, &Position);
         requests the current position in the file Personnel be placed in the variable            Position.
If we followed this statement with
                    fscanf(Personnel, '%s', Name);
          we would retrieve the name of that location in the file. By maintaining an index consisting of the various key field values and their associated positions in the file, a C programmer can construct an indexed file. To obtain a particular record, the program can first find the appropriate position in the index and then use fsetpos and fscanf to obtain the record. To assist in obtaining the locations to be entered in the index, the standard C library contains a function, fgetpos, that identifies the current position in a file.  The statement:                                                   fgetpos(Personnel, &Position);
         assigns the current position in the file (the location at which the next read or write operation
         will be performed) to the variable Position.
Thus fgetpos provides the information required to build the index as the file is being created.

Hashed Files

Thus far, two general approaches to file organization have been shown.
Sequential access, is represented by:
Sequential files
Text files.
Insists that the file be processed according to a particular serial order.
Direct access (sometimes called random access) allows individual records to be retrieved without interrogating other records in the file.
Indexed files provide an example of this paradigm.
By properly traversing the index, one can obtain sequential access to an indexed file.
An index system is a way of obtaining both sequential and direct access to a file (at the expense of index maintenance).
Hashed files represent a file storage system that provides direct access only.
They avoid the overhead encountered with indexed maintenance.
The idea behind a hashed file is to compute the location of a record in mass storage by applying an algorithm (known as the hash algorithm) to the value of the key field in question.
The result is a system that, given the desired key field, can determine the location of a record quickly without the use of auxiliary tables that must be maintained by indexed files.


A Particular Hashing Technique


Applying the hashing concept to our employee file:
Divide the mass storage area allotted to the file into several sections called buckets.
The number of buckets used is a design decision that will be presented later.  For this example, 40 buckets.
Define the key field in each record.
Assume that records in the file will always be requested in terms of the employee identification number,
Convert the key field value into a numeric value.
The employee identification number may contain non-numeric values I.e. the form 25X3Z or J2-X35.
Regardless, any information stored in the machine is represented in terms of a string of 0’s and 1’s and can always be interpreted as a binary number whether or not that was the original intention when the information was coded.
Using the numeric interpretation of the key field:
Divide the key field value by the number of buckets. This gives:
An integer value, called the quotient (throw it away, not important).
An integer value called the remainder.
The remainder is always in the range from 0 to 39.
Use the remainder as the “bucket address” in which to place the record. The record is related to exactly one of the 40 buckets.
Example: When divided by 40, the key field values of 14, 54, and 94 each produce a remainder of 14. Thus these records are stored in bucket #14.
To store a record using this system:
Consider each record individually,
Convert its key field value to an integer,
Apply the hash algorithm to identify a bucket in mass storage,
Divide the key field value by the number of buckets,
Use the remainder to determine the bucket in which to store the record.
Store the record in that bucket.
To retrieve a record with a certain key field value:
Transform this value to a bucket number as before,
Retrieve the records in that bucket,
Search the retrieved records for the one in question.


Distribution Problems


Thus far we have overlooked a few problems inherent in a hashed file. Once we have chosen the hash algorithm, we have no more control over the distribution of records in mass storage.
Using the divide-by-40 algorithm previously presented, if the numeric interpretation of the key field values tends to be multiples of 40, a disproportionate number of the records are placed in the bucket assigned to the remainder zero.
Unless the buckets are extremely large, a system must be provided for handling buckets that overflow.
It is advantageous to select a hash algorithm that evenly distributes the records among the buckets provided in mass storage.
The selection process is complicated by the fact that we normally do not know in advance exactly what the key field values will be. Because of employee turnover, the employee identification numbers used today will not be those used tomorrow.
The choice of a hash algorithm must be based on a combination of one's artistic abilities, statistical analysis, and rules of thumb.
If a dividend and a divisor both have a common factor, (e.g. 40) the factor is present in the remainder.
In turn, the remainders produced by the division process tend to be multiples of this common factor, while other values are ignored.
This was proposed in the possibility of the key fields being multiples of 40, which consistently produced remainders of zero.
A similar problem exists if the key fields are multiples of 5.
Since 40 is a multiple of 5, the factor of 5 appears in the remainder of our division process, and the records in the file cluster in those buckets associated with the remainders 0, 5, 10, 15, 20, 25, 30, and 35.
Similar situations occur in the case of key field values that are multiples of 2, 4, 8, 10, and 20, because they are all also factors of 40.
This observation suggests a partial solution:
Clustering due to common factors can be minimized by selecting the number of buckets to have as few factors as possible. Therefore, a prime number should be used as the number of buckets.
The chance of clustering in the employee file example can be greatly reduced by dividing mass storage into 41 buckets.
Another solution to the clustering problem is suggested by selecting a hash algorithm based on principles other than division.
The mid-square method:
Multiply the key field value by itself
Select the middle digits from the product to represent the bucket number.
The extraction method:
Select the digits appearing in certain positions within the key field and construct the bucket number by combining these selected digits using some predetermined process.
In any case, the performance of several hash algorithms is normally tested on sample records before settling on a final choice. Unfortunately, regardless of the hash algorithm we ultimately use, clustering of records will most likely occur Is a file is modified over a period of time. We can gain an understanding of how quickly this might occur by considering  what happens as we initially insert records into the modified 41-bucket employee file:
Assume:
We have found a hash algorithm that arbitrarily distributes records among the buckets,
The file is empty,
The records are to be inserted one at a time.
When we insert the first record, that record must go into an empty bucket.
When the first record is inserted, it must go into an empty bucket. When the next record is inserted, only 40 of the 41 buckets are empty.
The probability that the second record will be placed in an empty bucket is only 40/41.
If the second record is placed in an empty bucket, the third finds only 39 empty buckets.
The probability of its being placed in one of them is 39/41.
Continuing this process, we find that if the first seven records are placed in empty buckets, the eighth record then has a 34/41 probability of being placed in one of the remaining empty buckets. The probability of the first eight records being placed in empty buckets may be computed by:                                 (41/41)(40/ 41)(39/41)(38 /41) ... (34/41) =.482
 The probability is that the result is less than one-half. Thus clustering probably begins with only eight records stored among 41 buckets.


Handling Section Overflow

The high probability of collisions indicates that some plan must be established for handling the problems associated with collisions.
The fact that certain buckets might fill up and that records that should go into these buckets must then be placed elsewhere Must be anticipated
A typical technique for handling this case is to reserve an additional area of mass storage to hold overflow records.
Then, if a bucket fills up, records that normally are added to it are placed in the overflow area and linked to the appropriate bucket through an organization analogous to a linked list.
When trying to retrieve a record from such a file:
Extract the proper bucket using the hash algorithm.
If the desired record was not found there, one would search the overflow records linked to that bucket.
If a lot of overflowing takes place, the efficiency of searching the file can drop significantly.
The design of a hashed file therefore requires a careful analysis involving the choice of the hash algorithm, the number and size of the buckets in mass storage, and the size and structure of the overflow area.

Programming Concerns

Few, if any, high-level procedural programming languages in use today offer direct implementations of hashed files.
This is due to the application-dependent issues (hash algorithm, number of buckets, size of buckets) involved in the design of these files as well as the fact that, as in the case of indexed files, modern database management systems often relieve a programmer from the chore of implementing a hashed file.
When a customized hashed file is required, the same building blocks described in the previous section for implementing indexed files can be used.
In the C programming language, for example, one needs merely to maintain a record of the positions at which the various buckets are stored and then use the fsetpos function to locate the bucket indicated by the hash function.
Although we have introduced hashing techniques in their traditional file context, hashing is not restricted to systems involving mass storage.
Today hashing is often used to distribute data within areas of main memory.
For example, a large one-dimensional array can be established in which each component has the capacity to hold several data entries.
These components play the role of buckets, and a hash algorithm is used to identify the components in which each data entry belongs.
This expansion in the application of hashing techniques from mass storage systems into main memory systems is exemplary of the continuing evolution in data processing that is a problem for those who attempt to classify topics in the field.  Indeed, hashing, which traditionally has been a topic of file structures, may soon be more accurately classified within the subject of data structures.

The Role of the Operating System

We have seen that associated with each file structure is a variety of details relating to the retrieval or insertion of records.
The conclusion of each structure indicated that such details are often of no concern when accessing the file from within a high-level  programming language environment.
These environments tend to provide prewritten routines for manipulating files.  These routines, in turn, communicate with the operating system to perform their assigned tasks.
Thus much of the obligation for file manipulation ultimately falls on the operating system.
To fulfill its responsibilities, the operating system must have access to information about the file being manipulated. For example, it must know:
The structure of the file,
Which item within a record is the key field (if applicable),
Whether the file is to be saved after the program using it is finished.
Furthermore, some items of information must be remembered by the operating system between the retrieval of one record and the next. Depending on the type of file being manipulated, this information may include:
The current position in the file,
Which physical record is currently in a buffer in main memory,
Whether any abnormal conditions occurred during the previous access.
In an indexed file, was the requested record actually found?
To manage this information, the operating system maintains a table, often called a file descriptor or file control block, for each file being processed. All the information relating to the processing of a single file is kept there in an organized manner and made available to the various routines in the operating system as is needed.
Thus, if a program involves the processing of three files, the operating system must construct three file descriptors to assist in the file management.
In a high-level programming language, the construction of a file descriptor normally is initiated by a prewritten routine named open. A typical statement in FORTRAN has a form similar to OPEN(UNIT = 10, FILE 'EmplFile', STATUS = OLD, ACCESS = SEQUENTIAL)

 which requests the operating system to construct a file descriptor for the file named Emp1FiIe.  The parameters indicate:

The file is referred to later in the program as unit number 10
                    (UNIT = 10)
The name of the file is EmplFile
                    (File = 'EmplFile'),
The operating system should find this file already in mass storage
                    (STATUS = OLD),
The structure of the file is sequential
                   (ACCESS = SEQUENTIAL).
In Turbo Pascal (a popular dialect of Pascal provided by Borland International), file descriptors can be created by means of the predefined procedures called assign and reset. For example, the statements
               assign(DocFile,' document.txt');
                reset(DocFile);
                 cause the operating system to construct a file descriptor for the file named document.txt
                and tell the translator that this file is referred to throughout the program by the name
                DocFile.
In C the equivalent process would be requested by the statement
                       DocFile = fopen('document.txt', 'r');
           where the r stands for read, meaning that the file is to be read from rather than written to.
Note: the examples have shown that a file can be referenced later in a program by an identifier other than the file's proper name;
Future references to the file EmplFile in the FORTRAN example are in terms of the unit number 10,
In the Pascal and C examples the document .txt file is referred to as DocFile.
The distinction between a file's external name and the term used to reference it within a program reflect the distinction between the syntax rules of operating systems and programming languages.
A name used to identify a file in the context of an operating system may not be a syntactically valid identifier in the programming language being used.
Thus a means of name conversion is required.
Once established, this distinction between internal and external file identification also provides flexibility because a procedure designed to manipulate a file by means of an internal identifier can be used as a generic routine to process different files.
All one must do is open the desired file using the proper internal identifier and then apply the procedure.
In an object-oriented programming language, files are treated as objects.
Opening a file is done in the context of establishing the object that will play the role of the file.
In C++ the file named document.txt can be opened by the statement:                         fstream DocFile('document.txt', ios::out);
         which creates an object called DocFile having the characteristics of an fstream (a
          predefined class for manipulating text files.)
More precisely, this statement not only opens the file but also bundles the file with the routines (found in fstream), creating an object known as DocFile with the ability to respond to messages.
The notation ios::outsends the new object the message that the file is to be used as an output file.
Later in the program, data can be written to the file by sending the appropriate message to the object DocFile.  For example,
                        DocFile.put('K');
           tells the object DocFile to put the character K in its file.
Having been directed to construct a file descriptor, the operating system must also be told when it is no longer needed.
After a file has been processed, many programming languages require the use of a routine named close.
Basically this routine informs the operating system that the memory space used for the file descriptor can be used for something else.
However, in some settings, the statement initiates more than this simple release of memory space.
For instance, in the case of a text file that has been created by the program, the close routine causes the operating system to transfer the last physical record to mass storage.
In any case, the syntax of the close statement is rarely anything more than a simple instruction, such as
                 CLOSE (UNIT = 10)
 which, in FORTRAN, means the file identified as file number 10 can no longer be used in the program
Or if it is used again, it must be reopened);
The equivalent statement in Pascal is
                 close(DocFile);
 and in C it is
                 fclose(DocFile); Closing a file in an object-oriented environment is done by sending the appropriate object a message instructing the object to close its file. For example, in C++ the statement
                 DocFile.close;
 sends the object called DocFile the message to close its file.
 
 
The next page (SOC4) deals with Database Structures.

SOC1 SOC2 SOC3 SOC4 SOC5


Back Home Up Next