The four basic types of variables are regular numeric variables, numeric variables with special missing values, multiple-response questions, and single-response questions. Each variable type was associated with a different template for code creation, a specific form of variable names, and certain projected difficulties in recoding. For instance, a multiple-response variable was read in with a simple list of single-punch variables, was labeled simply with letter suffixes, and was not recoded. In contrast, an aggregated single-response variable was read in with a list of single-punch intermediate variables, was labeled with intermediate number suffixes, and was recoded into a final variable for the entire question. Retrieving the variable type directly from the documentation provided a guide to the particular piece of program code that was necessary for inputting and recoding the variable.
Regular numeric variables. Regular numeric variables range from 0 to 9. The values are computed by filling in each digit with the numbers in the appropriate columns. For example, Roper Report 9309, Question 68Y, asked "Do you think government lotteries produce an unwholesome gambling spirit in this country? Yes=1, No=2"
Numeric variables with special missing values. These variables are identical to regular numeric variables, except that there can be one or two special missing values (such as "don't know"), recorded at punch 11 and/or punch 12. For example, Roper Report 9309, Question 2X, asked, "Do you feel things in this country are generally going in the right direction today, or do you feel that things have pretty seriously gotten off on the wrong track? Right track=1, Wrong track=2, Don't know=Y"
Multiple-response questions. Multiple-response questions generally ask respondents to choose more than one option from a list of items, as in Roper Report 9309, Question 8W illustrated in table 1. If an item was checked, it was coded as 1 in the recoded data set; if it was not checked, it was coded as 0. Therefore, for any such question, there would be multiple binary variables corresponding to all of the possible responses (including special missing values), so that each possible response became a variable in a final recoded data set.
Single-response questions. Single-response questions allow one, and only one, response per item. They frequently include a special missing value at punch number 12, with several answer options between punch numbers 1 and 9. These questions require appropriate recoding for the final variable. For example, Roper 9309, Question 7Y, asked, "And thinking about crime in the United States, what one type of crime do you feel presents the biggest threat for you and your family today?" An example of multiple-column storage of a single-response question appears in table 2.
Table 2. Example of multiple-column storage of single-response question
6. Read in Data
The first step in the SAS programming process was reading in the data using one of the three SAS column binary informats: PUNCH.d, CBw.d, and ROWw.d. The informat PUNCH.d reads a specific bit (with a value of either 0 or 1) in the original data set. The d value indicates which row in a column of data is to be read. The informat CBw.d, on the other hand, looks to see which one of the 10 numeric bits (0 through 9) has been punched, and returns that number as a value for the column. ROWw.d begins reading at a user-specified bit, looks to see which one of a user-specified number of bits (after the first) has been punched, and returns a number equal to the relative location of the punched bit. For instance, a ROW5.5 informat would start reading at PUNCH.5 and continue for four more bits through PUNCH.9; if bit 8 was punched, then the ROW5.5 informat would return a 4. The PUNCH.d informat was the most appropriate for this project. For clarification of its use, refer to the sample SAS program in Appendix 2. Depending on the final form of the question, the pattern of punches in a column usually had to be logically recoded later in the program from an intermediate variable to a final variable that matched the response options in the original documentation.
The Statistical Package for the Social Sciences (SPSS) is also able to read data, including multiple-response question responses, from a column binary data file by using the keywords MODE=MULTI-PUNCH on the FILE HANDLE command. An example of an SPSS program that reads a single variable from Roper Report 9309 data with SPSS is:
In this example, a variable called FEDDEP is read in column 24. It has a possible Y coded as "don't know", requiring that SPSS read this variable as a character string (see also SPSS 1988, 84-86).
7. Identify Migration Formats
We selected the following formats to test the migration process:
-11-expertise, the transportability of output formats, storage requirements (size of output data sets), and long-term archival implications. While both SAS and SPSS software are able to read column binary data, the staff members working on the project had more experience with SAS, so we chose to work with that statistical package. With the exception of spread ASCII data files, each format we selected for testing contained completely recoded and renamed variables. Storage requirements were compared on the different platforms for many of the different types of SAS data files, as shown in Appendix 3.
8. Recode Data Files
Create intermediate variables and code missing values. After reading in the data with the SAS informat statements, the second step in the SAS programming process was to produce a set of if-then statements for recoding the individual punch data (the intermediate variables) into final variables. Each intermediate variable needed to have a column location, row location, variable name, and SAS informat instructions specified in the input statement, as seen in the example shown in table 3. The only things that changed while inputting this variable Q14 were the suffix and punch location for each intermediate variable. This redundancy increased when there were more intermediate variables, and especially with aggregated single-response variables. The logical statements used for recoding the intermediate variables also contained many repetitions of the same if-then commands.
Table 3. Example of variable location, name, and SAS informat statement
Since the creation of these translation programs in SAS involved a large amount of repetitive typing, we created templates for both the input and recoding statements. Templates were created for single-response
-12-variables (simple and aggregated), multiple-response variables, and numeric variables with special missing codes. Each template included generic characters for column locations, variable names, and SAS informats, as well as pre-formed logical recoding statements. Furthermore, the templates contained repeated lines of code for different numbers of punches or aggregate variables so that it would not be necessary to enter the redundant information for the different variables.
These templates considerably sped up translation code creation. Once the type of the current variable had been determined, the appropriate input and recode statements were copied from the template file, pasted into the program code, and then the key characters (such as column location and variable name) were replaced using a find/replace function in the text editor. For example, to create the code for variable Q14, a piece of the template would have been copied from the template file and pasted into the program, as illustrated in table 4. The aggregated single-response variable template included
-13-input and recode statements for sub-questions from A through Z; only A and B are shown in table 4.
Table 4. Examples of SAS input statements and recoding statements
Then the string Q000 would be replaced with Q14, and COL1 and COL2 would be replaced with 30 and 31, respectively, resulting in the final code for inputting and recoding the variables Q14A and Q14B.
Create macro programs to recode data files. Given that we could define fairly well the different types of variables in a data set, and that we could create input and recode template statements for these different variable types, it was tempting to think that we would be able to write programs to automate the whole procedure. That is, we might imagine a program that could perform all of the operations described above with minimal effort from the human operators. Although such automation may be possible in the future, the irregularity of the data files presented a major obstacle. Not only would an automatic translation program have to deal with some of the complexities of, for example, split samples with different variable types in the same column, but it would also have to handle the different types of errors that occur in the original data files.
For example, one fairly straightforward solution (though quite a lot of work) would have been to create a program that received some sort of variable list from a human operator, set up logical conditions to create code around split samples, and then write out a SAS program to translate and recode the data file. More sophisticated error-checking would have been necessary, however, to handle simple events such as a variable being incorrectly marked as a single-response type in the documentation, but actually having multiple responses coded in the data. Because the recoded data files needed to be of archival quality, this error-checking would have had to be quite rigorous. Furthermore, judgment calls would have sometimes arisen when dealing with irregularities (that is, should the variable be recoded differently or should the irregularity be ignored), so that leaving the decision solely to a computer program was not advisable. In short, creating an automatic translation program to recode the data would have involved several compromises that we would not recommend.
Debug programs and check data. Several types of errors occurred in the SAS programming: typographic errors, invalid informats, and unexpected changes in variable types (where the questionnaire did not match the data). The key indicators for these problems were INVALID DATA warnings that appeared in the SAS log.
Once the programs ran without INVALID DATA messages, the accuracy of the translation needed to be checked. Complete frequencies for all of the variables in the final data set had to be compared to the frequencies in the xray.
For example, if variable Q9 (deck 1, column 30) had the following frequencies (in this case, PUNCH.12 was recoded to the missing
value of 9):
then the xray for deck 1, column 30 should look like this:
It was necessary to check all of the variables in the final data set, because one badly coded variable in the conversion job could compromise the archival integrity of the entire data set. Unfortunately, a random subset of variables may miss the errors. If there was a mismatch between the xray and the frequencies, the difference needed to be tracked down and the SAS job edited. This frequency check was also the last chance to resolve ambiguities between the type of variable listed in the documentation and that shown in the xray. One final point to note during this check was the maximum number of possible digits in a recoded variable. If the data set was to be output in ASCII format, then the number of digits in the special missing value should be equal to the number of digits in the regular values. That is, if a variable had values 1 through 3, then the special missing value should be 9; if it had values 1 through 12, then the special missing should be 99; and so on.
Save recoded data. The final step in the reading and recoding process was to write to disk the resultant SAS data sets. For recoded SAS to SAS export files, we saved the files as SAS system files and export files with all intermediate and recoded variables, and as SAS system files and export files with just the recoded variables (see Appendix 3).
For recoded SAS to ASCII, once the SAS data set had been completely debugged and double-checked, PUT statements could be written to create an ASCII version of the recoded
-15-data (see Appendix 2). The simplest way to write these PUT statements was to copy the INPUT statements from the original job (or from the recoding statements), strip off the column binary informat information, and manually type in the new column locations for each variable. These PUT statements had two advantages: unlike automatic translation programs, they created ASCII data files with minimum wasted space and the PUT statement itself could be distributed as a data dictionary.
For the first data set translated, an ASCII version of the recoded SAS data set was created using an automatic translation program called DBMSCOPY, version 5.10. However, this ASCII data set was not satisfactory. Although the translation program automatically created a data dictionary (based on the variable names in the SAS data set), the ASCII data set was not compact. The translation program seemed to be incapable of concatenating one variable's width of columns against the previous variable's columns, so that there would be no wasted space within the flat data set. Instead, the program inserted multiple spaces between each variable, which allowed a differing number of characters within each variable, but also inserted approximately 10 ASCII space characters between each variable. Because of the enormously increased storage requirements this insertion causes, this approach to creating ASCII data files was abandoned and program code for creating compact ASCII data files was written in SAS.
9. Create Spread ASCII Data Files
The original column binary data files did not necessarily have to be recoded during the translation process. Another possible method of translation was simply to convert, or spread, the column binary into ASCII data. Spread ASCII data files keep the binary structure of the column binary data, but encode each 0 or 1 as an ASCII character. For example, the following column in the original data (each 0 or 1 is one bit)
would become 000010000000 in the spread version. While this example uses a single-response variable, multiple-response questions may have more than one bit with a value of 1 in the column. To create a spread version of the data, each bit (or punch) in the column binary file could be read (via SAS in our case) as an individual variable; each variable was then written to a new ASCII data set. Because the punches in the column binary data were rectangular, and because the variables being written out did not have to be meaningfully recoded, the SAS code itself was largely a simple iteration, over columns and rows, of INPUT and PUT statements.
The iterative nature of the SAS program suggested that a macro program could be written to automate the creation of SAS code. A simple C program was written based on this iteration over columns
-16-and rows. After prompting the user for unique information the name of the SAS program file to be created, the name of the data set to be created, the name and path of the column binary file, and the number of decks and record length (LRECL) of the original data the C program simply looped over columns and rows. The program passed through the loop twice. In the first pass, the program created an INPUT statement that would read in each punch (looping over columns and rows) as a new variable. The second pass created a PUT statement in which each new variable would be written to an ASCII file. The C program was not itself reading or writing the data files; instead, it created many repetitive lines of SAS code to read and write the data files (see Appendix 4).
Each card of column binary data was translated to a line of ASCII data. Since each card contained 960 data points (80 columns by 12 rows), each line of ASCII data contained 960 characters. The spread ASCII data set expanded about 600 percent from the original size. It took about 15 minutes to create the spread ASCII data on the IBM PC network, including all steps from running the C program to writing out the ASCII data.
We investigated the formats currently distributed by the Roper Center and the Institute for Research in the Social Sciences at the University of North Carolina at Chapel Hill (IRSS) to determine their utility over time, and to evaluate them in light of our findings. Both distributors were producing what we call the hybrid spread ASCII data files from the original column binary files. These files are produced by horizontally concatenating the data into a new format. The first 80 columns of the data set contained the 80 columns of the original file; however, any binary encoding of variables with multiple-punch coding was converted to ASCII. The entire file was then converted to spread ASCII and the resultant data were appended to the end of the record. Users could access the non-binary data in the original column locations; if they needed to access data that were originally punched in the binary mode, they could read those data from the spread ASCII portion of the record. Data dictionaries were produced that mapped the location of variables from both the original 80 columns and the spread ASCII portion of the record.
We obtained sample programs from IRSS to help us evaluate these hybrid spread file formats. We also acquired sample data files from IRSS and the Roper Center, with their enhanced data dictionaries, to evaluate their utility. We then adapted a SAS program from IRSS to produce the data map showing the new column locations of the data items. This data map allowed for quick translation from the column-by-row information necessary to read column binary data, to the column-only information necessary to read the spread ASCII data. A note at the top of the data map showed the order of punches in the spread ASCII data. For example, for a response located at column 56, row 1 in the original data set, the data map showed the same response's location as column 661 in the ASCII data set; a response at column 56, row 5 would be located at column 665; and a response at column 56, row Y(12) would be located at column 672 (see Appendix 5).
The Documentation Track
Several options were available for digitizing the documentation, including image scanning, OCR, combinations of image and text (PDF format), and text encoding. The initial part of the project included identifying formats to test and set up the scanning workstation. Complete documentation for each file was scanned, including the questionnaires, xrays, frequency counts, data maps, and other metadata whenever available.
Software and Equipment
The scanner used was a Hewlett-Packard ScanJet 4c with a document feeder and SCSI connection to an IBM 750-P90 PC. All files were stored on an external SCSI Iomega Jaz drive. For a major scanning project, a fast machine with 32 MB of memory was required. Also, a PCI bus SCSI card speeded up transfer rates from the scanner to the computer. An automatic document feeder reduced the labor by automating the page turning. Such labor-saving devices are cost-effective because scanning operations tend to absorb a lot of resources and to constrain work on other major tasks while the scanning is in progress.
Scanners differ in speed, and, for a given scanner, speed varies with the desired scanning resolution. Speed could be an important factor in making a purchasing decision for a major project, as it can have a considerable impact on labor costs. For the HP 4c, the time it takes to scan a page varied with the desired resolution as indicated below.
TextBridge Pro Optical Character Recognition
We first scanned the documentation for one Roper Report using the OCR software TextBridge Pro. We reviewed alternative OCR software products and, finding no significant benefits to using one over another, chose a package with which the staff had experience. Initial evaluation of the OCR output showed that there were significant numbers of errors in the resultant ASCII text. We tested various resolutions and documented time taken for setup and for scanning, optimum settings, range of file sizes, quality of proofing summaries, and procedures to follow.
The questionnaires we scanned with the TextBridge Pro software had an unacceptable rate of character recognition, including incorrect location information necessary for manipulating the accompanying data files. Handwritten notes were completely lost and the editing costs of reviewing the output and changing all errors
-18-would have been prohibitive. This format did not present us with an adequate archival solution to preserving the textual material, so no further documentation was scanned using this process. TextBridge Pro does not work well with poor originals, determining optimum scanning settings was very time-consuming (and sometimes impossible), compression formats did not give good results, and raw ASCII format required time-consuming reformatting (see Appendix 6 for a photocopy of a page from Roper Report 9309, Question 10, and for the TextBridge Pro sample output of the same question).
Document condition. When used with printed clean originals, OCR is very accurate even when the font size is small, and can replicate the formatting of the original document. For example, TextBridge Pro is capable of producing low-resolution images and exporting both images and text to a word processing document that retains the format of the original printed page. However, there are far more problems with this process when the quality of the original is anything less than perfect, as in our case. TextBridge has a particular problem with italics and underlining, even with good quality originals.
TextBridge Pro also does not work well with columns and has some difficulty recognizing tables and columns in originals, let alone poor photocopies. This problem gets much worse when some of the entries in some of the table cells are blank because the columns get shifted; cleaning up the resulting output files becomes a major undertaking.
Scanner settings. TextBridge Pro allows considerable control over the way in which the document is scanned in as an image. For some settings, this task can be delegated to TextBridge Pro by setting options to "automatic"; in other words, TextBridge Pro tries to figure out what works as it scans the page. But TextBridge Pro does not always make these determinations successfully. Nor are photocopies ideal for scanning, particularly if not all of the characters are completely clear and if not all of the pages are of the same lightness/darkness. Successful recognition requires changing the settings periodically to account for the varying quality of the photocopies. Two outcomes are possible if the settings are not optimal: in some cases, the program is unsuccessful in recognizing text that is legible in the original; in others, it gives the frustratingly cryptic error, "page too complex for selected mode."
We finally became convinced that there was no simple system for setting these options. In the worst case, it was a frustrating process of trial and error. Too dark a setting meant that the program tried to decipher each small dot on the page as though it were part of a character. As a result, it was not possible to use 600x600 resolution in cases where originals were speckled. Likewise, selecting the "text only" option for the original page format forced the program to try to convert everything on the page into text, including imperfections in the image or dark binding patches. On the other hand, sometimes the auto brightness setting scans were so light that no text was recognized
-19-on the page. In some cases, we spent hours trying to correct the settings manually.
Proofing text. TextBridge Pro provides an optional feature that facilitates the correction of recognition errors. When text is scanned, it is possible to save "proofing text." This information is used by special modules that are installed into WordPerfect or Word and are implemented as macros. If the relevant module is installed and the document opened in the word processor, all words are highlighted that TextBridge Pro was unsure it recognized. The color of the highlighted word indicates the confidence TextBridge Pro has in its accuracy.
TextBridge Pro can be taught to recognize particular fonts with a higher degree of accuracy through a period of training during which it asks the user to help it recognize ambiguous characters. We did not use this feature extensively. Our limited experience showed it does increase the likelihood of successful recognition if originals were poor but not truly awful. The effort is only worthwhile if the same types of fonts will be encountered often, which was not the case with the Roper Reports documentation.
Final format. The available formats into which the output text can be saved depends on the options selected. There are a very large number of possible word processor, text, and spreadsheet formats. However, if "save proofing text" is selected, then the file can be saved in only Word or WordPerfect format. Similarly, if "reconstitute text" is selected, only file formats that support fairly complex formatting are available. When formatting text with the "reconstitute text" option, TextBridge Pro will use some of the new "style" features of WordPerfect or Word in the new document. This can make subsequent editing cumbersome. Though the styles themselves can be edited, an alternative is to save in a file format that supports the text features that are being "reconstituted" but does not support "styles." In this way, the formatting will appear as regular tabs, font changes, and so forth, that can be directly edited. The editing of ASCII text to recreate the format of the original is a major undertaking and the time required to reformat each document is extensive. We found that getting pagination to match the original is particularly difficult.
The scanned images produced by TextBridge Pro can be stored in CCITT-3, a compression standard, for later processing, but the results from the subsequent processing of these images were not as good as those obtained from processing the images directly from the scanner. We decided that using these compression formats would not give usable results.
PDF Files from Adobe Capture
The next step in the documentation portion of the project was to produce documents in the portable document format (PDF) used by Adobe Acrobat, a widely accepted de facto standard for encoding electronic documents. The viewing software provided by Adobe allows for reading and searching the text, viewing the structure of a document, and printing high-quality hard copy. PDF documents provided
-20-clear, accurate reproductions of the questionnaires. The Adobe Capture software produced an interim ASCII text file that could be edited to improve text searching. An example of a viewing screen may be found at the end of Appendix 6.
Basic structure of Acrobat files. Adobe Acrobat files (distinguished by the PDF suffix on the file names) can contain both text and image information from the original document. There are different types of PDF files containing different kinds of information: normal PDF, image-only PDF, and image+text PDF.
Normal PDF files, by default, display the text information derived from the OCR process. Where the text information is unknown (when there is a nontext picture on the page or there were difficulties in the OCR process), a normal PDF file will insert the original image into that space on the display page. Image-only PDF files are, in effect, paginated pictures of the original pages. Like tagged image file format (TIFF) files, the text in these images is not searchable. Like image-only files, image+text PDF files show the image of the original pages, but also contain searchable text.
The image+text files were chosen as the most appropriate for this project. The user would see a faithful reproduction of the original documentation (complete with handwritten notes) with the PDF browser, but could also search for specific text within the document. If text in the search function looked suspicious, a user could view the original image. In comparison, files produced by OCR programs contain only the text information, with no way to double-check the text against the image of the original.
Adobe Acrobat Capture procedures. When scanning the document with the Capture software, a set of pages was scanned in sequence. Each page was stored as a separate TIFF file, with the filenames numbered sequentially. For example, a 40-page document produced 40 TIFF files, named page01, page02, through page40. These original image files in TIFF format were stored separately from the final reformatted PDF documents, providing a set of image files for digital storage.
When translating the images into editable text, the TIFF files were concatenated into a single document during the OCR scanning process. Acrobat Capture analyzed the page layout and grouped text into regions. It then identified characters and grouped them into words. The words were looked up in the Acrobat dictionary (which can be customized) and spelling suspects noted. Fonts were analyzed and font suspects identified. The interim text layer of the final PDF file contains no image data and can be edited with the Capture Reviewer so that the text matches the original document as closely as possible. During the OCR process, each word was assigned a confidence rating, representing the software's estimate of its OCR accuracy.
For the searchable text of the final PDF file to be as accurate as possible, many OCR errors were corrected by editing the interim files. Fortunately, the program used to edit interim files, Acrobat Capture Reviewer, would highlight words whose confidence levels fell below a certain threshold, or that were not included in a
-21-dictionary file. The majority of unrecognized words could be easily spotted and corrected to match the original document. Although the Reviewer software allows one to change fonts and other formatting options, the only editing necessary for the project was in the content of the words used for searching purposes (words used to locate terms in question text and variable coding). Since the user would see only the image reproduction of the original, the underlying ASCII text need to be reformatted as a visual reproduction of the original. Once the document was edited, it was then saved in the image+text PDF format.
Time and storage requirements of PDF files. The total time required to process a 39-page document was approximately four hours, from scanning to saving the final PDF. The scanning itself took 30 minutes; the OCR process took 20 minutes; and editing the resulting file took 3 hours 15 minutes. The storage space required for this document is shown in table 5.
Table 5. Example of storage space required for Acrobat PDF files
Documentation for the other nine Roper Report data files was also scanned and edited. Some data files with split samples had dual documentation. The time taken for the scanning process using a document feeder ranged from 5 to 30 minutes. The OCR software took between 15 and 35 minutes to process each document. The time taken to edit each document varied widely, from one to eight-and-a-half hours. The time it took to complete a single document depended largely on the quality of the original. Features such as background shadows, crooked lines of text, compressed fonts, jagged edges on letters, and handwriting increased both the time it took the OCR software to process the page, and, more importantly, the time it took to edit or insert accurate, searchable text. In some cases, blocks of text were so unrecognizable that whole questions needed to be typed in as hidden search text. Additional time was required for error-checking.
Problems encountered in text recognition and editing of PDF files. Although Adobe Acrobat Preview highlights most words with low confidence levels, some forms of errors are not so easily detected during the editing phase.
In each of these cases, an attentive editor must catch the OCR error and make appropriate changes to the text to verify accurate search and retrieval. We did not edit enough documents to estimate the average time needed for cleaning a complete document. Future projects will need to budget extensive editing costs.
HTML and SGML/XML Marked-up Files
Conversion of scanned text to hypertext markup language (HTML) format would provide a more readily accessible browsing format. However, the text of each document would need to be fully edited and formatted. As indicated above, the ASCII output from the OCR
-23-technology we used could not provide us with text clean enough to use in HTML. Moving text into documents adhering to the standard general markup language/extensible markup language (SGML/XML) is the most labor-intensive but also the most dynamic alternative for text applications. SGML/XML tagging allows customized and robust access to specific pieces of the documentation (such as question text and variable location information). SGML/XML Document Type Definitions maintain the integrity of document content and structure and also define the relationships among the elements. The emerging social science documentation standard for both formatting and content, the Data Documentation Initiative (DDI) Document Type Definition (DTD), provides standard elements developed specifically for social science numeric resources. The standard adheres to our requirement that text be stored in a system-independent, nonproprietary format. Furthermore, this standard, developed by representatives from the international social science research community, is intended to fill the need for a structured codebook standard that will serve as an interchange format and permit the development of new Web applications. However, this format requires that the text be fully edited and the components of the documentation tagged, and funding for this work was not included in the budget for this project.
Findings and Recommendations
Upon completing the steps defined in both the data and documentation tracks, we carefully examined the processes of migrating hardware- and software-dependent formats to independent formats. We also evaluated the formats in relation to their utility and ease of use over time. In both the data migration and the documentation migration processes, it was imperative that the original content be preserved. Migration that included recoding the content of the data files (changing the character or numeric equivalents in a data file) proved to be labor-intensive and error-prone, and produced unacceptable changes to the original content of the data. Editing the text output from the scanning process proved to be the same: error-prone, time-consuming, and incomplete. Therefore, recoding as a part of migration is not recommended. However, simply copying a file in its original format from medium to medium (refreshing) is not enough.
Software-dependent data file formats, such as the original column binary files examined in the project, cannot be read without specific software routines. If standard software packages do not offer those specific routines in the future, translation programs that emulate the software's reading of the column binary format could provide a solution. However, these emulation programs will themselves require migration strategies over time. We offer another alternative for the column binary format: convert the data out of the column binary format into ASCII without changing the coded values of the files. The spread ASCII format meets the criterion of software independence while simultaneously preserving the original content of the data set. It does, however, require a file-by-file migration strategy that would be time-consuming for a large collection of files.
Finding a parallel solution for the documentation files is not possible at this time. We can not accurately generate character-by-character equivalents of the paper records. We can, however, scan the paper into digital representations that could be used in future character recognition technologies. The Adobe PDF image+text format does provide an interim solution by producing digital versions of the image and limited ASCII representation of the text. However, the types of documentation files produced by the Adobe process are software dependent. If software packages move away from the format used to store the image+text files, translators will be necessary to search, print, and display the files. We therefore recommend archiving the images of the printed pages in both nonproprietary TIFF format and PDF image+text format.
We asked faculty and graduate students to make an informal review of the findings and sample output from the project. All our evaluators had previous experience using data files from the Yale Roper Collection. Regarding the data conversions, they expressed relief at not having to use column binary input statements to read the data files. They had no difficulty in using the chart that mapped column binary to ASCII in order to locate variables in the spread ASCII
-25-version of the data files, once it was explained that each variable mapped directly to a 12-column equivalent and instructions were given for finding single- and multiple-punch locations.
As for the documentation track, faculty and student reviewers found viewing and browsing PDF format files acceptable. Since most users had accessed PDF files on the Web, they seemed comfortable moving from the Internet browser to the Adobe Reader to locate question text and variable location information. This may not be the case with inexperienced users. These users were eager to have more questionnaire texts available for browsing and searching. The lack of a large sample collection of questionnaires did not allow evaluation of question text searching on a large scale.
Findings about Data Conversion
Column binary into SAS and SPSS. As long as software packages can read the SAS and SPSS export formats, recoding the column binary format into SAS and SPSS export files is an attractive option. These file formats can be used easily, are transportable to multiple operating systems and equipment configurations, and can be transformed into other software-specific formats. They do, however, have a number of drawbacks.
First, the original data file must be recoded, a process that is lengthy and potentially error-prone and one that places great reliance on the person doing the translation. If that person does not adequately check for errors, annotate the documentation for irregular variables, or properly recode the original patterns of punches, the translated data set becomes inconsistent with the original. Also, some irregularities in the original data set, which may be meaningful in the analysis of the data, can become lost when the data set is cleaned up. For instance, the original documentation might indicate that a question has four possible responses: PUNCH.1 through PUNCH.3 for a rating scale, and PUNCH.12 for a "don't know" response. On examination of the xray, though, it is discovered that about 200 people had their responses coded as PUNCH.11, not PUNCH.12. A decision must be made: will those PUNCH.11 observations be given the same special missing value code as that for PUNCH.12? This solution will put the responses in the data set into accordance with the documentation by assuming that the strange punches were simply due to errors in data entry. On the other hand, the strange punches could have been intentionally entered to mark out those observations for special reasons that are not listed in the documentation. In this case, it would be better to give the PUNCH.11 observations a special missing value different from that for PUNCH.12. Once the recoding is done, future researchers will be unable to re-create the original data set with its irregularities.
Second, this process of reading, recoding, and cleaning data files to produce SAS (or SPSS) system files and export files is very time-consuming. For example, it took 20 hours of work to write, debug, and double-check a SAS program to recode the data set for Roper
-26-Report 9209, which includes a split sample and a variety of variable types. Assuming a wage of $15 an hour for an experienced SAS programmer, we would expect a cost of approximately $600 per data set for a complete job of data recoding. This estimate, of course, does not include the cost of consultations with data archivists about the recoding of particular variables or the cost of rewriting documentation to reflect the new format of the data set.
Third, data files stored in SAS and SPSS (and other statistical software) formats require proprietary software to read the information. Although there has been an increase in programs that can read and transfer data files from one program, and one version of a program, to another, there is no guarantee that programs for specific versions of software will be available in the future. U.S. Census data from the 1970s were produced in compressed format (called Dualabs) that relied on custom programs and can no longer be read on most of today's computing platforms. Such system- and software-dependent formats require expensive migration strategies to move them to future computing technologies.
Spread ASCII format. If an archival standard is defined as a non-column binary, nonproprietary format that faithfully reproduces the content of the original files, only the spread ASCII format meets these conditions. This spread format, however, is at least 600 percent larger than the original file and requires converting the original column binary structure. It also requires producing additional documentation, since each punch listed in the original documentation must be assigned a new column location in the spread ASCII data. We recommend that each file be converted to a standard spread ASCII format so that a single conversion map may be used for all the data files (see Appendix 5 for an example from this project). Producing such an ASCII data set has several advantages. Information is not lost from the original data set, because the pattern of 0s and 1s remains conceptually the same across data files. It is unnecessary for a data translator to interpose herself between the original data and the final user.
However, the spread ASCII format is not perfect. The storage requirements for spread ASCII data are on a par with the size requirements for a SAS data set containing both recoded and intermediate variables. For the overhead in storage cost, the SAS data set at least provides internally referenced variable names and meaningful variable values. The size of the spread ASCII data becomes even more apparent when it is contrasted with the size of a recoded ASCII data set. In a recoded data set, each unused bit can be left out of the final data; particular bits in a column need not even be input if they contain no values for the variable, and columns without variables may also be skipped. In contrast, in the spread ASCII data, each bit is input and translated to a character, whether it is used or not.
The spread ASCII format requires that users must know how the variables, particularly the multiple-response variables, relate to the 12-column equivalent. A spread ASCII data set is not as easily used as a fully recoded one; after all, recoding the 0s and 1s into usable
-27-variable values will fall to the end user with ASCII data, just as it currently does with column binary data. Finally, each punch listed in the original documentation must be assigned a new column location in the spread ASCII data. Users must refer to an additional piece of documentation, the ASCII data map, to locate data of interest, and this extra step inevitably creates some initial confusion.
Hybrid spread ASCII. The hybrid spread ASCII format, distributed by the Roper Center and the Institute for Research in Social Science at University of North Carolina, offers another alternative. The original data are stored in column binary format in the first horizontal layer of the file. This format preserves the original structure of the data file in the first part of a new record. A second horizontal layer of converted data, in spread ASCII format, is added to the new record. The primary advantage of this hybrid spread ASCII data file format is that users can access the nonbinary portions of the original data file in their original column locations as indicated on the questionnaires. However, users have to know whether the question and variables of interest were coded in the binary format in order to determine whether to read the first part of the record or the second. The ASCII codes in the first horizontal layer are readable, but any binary coding in that first horizontal layer are not, since the binary coding is converted to ASCII to avoid problems while reading the data with statistical software. If users want to read data that were coded in binary in the original file, they can read the spread ASCII version of multiple-punched equivalents in the second horizontal layer without having to learn the column binary input statements to read the data.
We chose to produce converted ASCII files that did not have this two-part structure. To use the spread ASCII data files we constructed, users first determine the original column location of a particular variable from the questionnaire, and then use a simple data map to locate the column location in the new spread ASCII file (see Appendix 5).
Original column binary format. The column binary format itself turned out to be more attractive as a long-term archival standard than we had anticipated. It conserves space, it preserves the original coding of data and matches the column location information in the documentation, it can be transferred among computers in standard binary, and it can be read by standard statistical packages on all of the platforms we used in testing. Unfortunately, it is difficult to locate and decipher information about how to read column binary data with SAS and SPSS, as the latest manuals (for PC versions of the software) no longer contain supporting information about this format. This lack of documentation support indicates the possibility that the input formats will not be offered in subsequent software versions.
On the other hand, as long as the format exists, there seems to be some level of commitment to support it. As stated in the SAS Language: Reference text, "because multipunched decks and card-image data sets remain in existence the SAS System provides informats for reading column-binary data." (SAS 1990, 38-9). If the column binary format is refreshed onto new media and preserved only in its
-28-original form, we recommend that sample programs for reading the data with standard statistical packages, or a stand-alone translation program, be included in the collection of supporting files. But even with these supplemental programs, accessing and converting the files will continue to present significant challenges to researchers. Given these considerations, we do not recommend this format over the spread ASCII alternative.
Findings about Documentation Conversion
The OCR output files from TextBridge Pro did not provide us with an adequate means for archiving the textual material. The questionnaires we scanned had an unacceptable rate of character recognition, including incorrect location information necessary for manipulating the accompanying data files. Handwritten notes were completely lost. Although the format allowed for searching of a particular text that was successfully recognized, the amount of editing required to produce a legible version of the original, review the output, and correct all errors was found to be prohibitive. Subsequent viewing was poor without the formatting capabilities of proprietary word processing software.
The PDF format provided solutions to some of the documentation distribution and preservation problems we faced, but it did not meet all of our needs. For one thing, the format does not go far enough in providing internal structure for the manipulation, output, and analysis of the metadata. Like a tagged MARC record in an online public access catalog, full-text documentation for numeric data requires specific content tagging to allow search, retrieval, manipulation, and reformatting of individual sections of the information. Another drawback is that PDF files are produced and stored in a format that may be difficult to read and search in the future. The PDF format, although a published standard, depends on proprietary software that may not be available in future computing environments. (A similar problem can be seen with dBase data files that are rapidly becoming outmoded, causing major problems with large collections of CD-ROM products distributed by the U.S. government.) We see increasing numbers of PDF documents distributed on the Internet and the format will be used by ICPSR for the distribution of machine-readable documentation to its member institutions. So, given the large number of PDF files in distribution, software for conversion will most likely be developed over time. However, the current popularity of the PDF format does not guarantee that software to read it will continue to be available throughout the future of technological evolution.
The TIFF graphic image file format is useful for viewing and for distribution on the Web and as an intermediate archival format, allowing storage of files until they can be processed in the future using more advanced OCR technology. Even though this format does not allow text searching, tagging/mark up, or editing, it moves the endangered material into digital format. ICPSR has decided to retain
-29-such digital images of its printed documentation collection for reprocessing as OCR technology evolves.
It is our recommendation that both PDF/Adobe Capture edited output and scanned image files be produced and archived. The PDF/Adobe image+text files allow searching and viewing of text in original formatting. The scanned image files can be archived for future character recognition and enhancement. We also want to emphasize the importance of the long-term development of tagged documentation using the DDI format. This is by far the most desirable format, albeit one that is difficult to produce from printed documentation, given the inadequacy of OCR technology and the costs of subsequent editing. We urge future producers and distributors of numeric data to help develop and adhere to standard system- independent documentation.
Recommendations to Data Producers
Design affects maintenance costs and long-term preservation. Producers of statistical data files need to be cognizant of preservation strategies and the importance of system-independent formats that can be migrated through generations of media and technological applications. In this project, we had significant problems with the column binary format of the data. Had long-term maintenance plans been considered, and the costs of migration been taken into consideration, the creators of the data format might not have chosen the column binary format. At the time, however, it was the most compact format to use and standard software could then be adapted to the format.
One thing is very clear: data producers would be advised and should be persuaded to take long-term maintenance and preservation considerations into account as they create data files and as they design value-added systems. Our experience shows that the most simple format is the best long-term format: the flat ASCII data file. We would urge producers to provide column-delimited ASCII files, accompanied by complete machine-readable documentation in nonproprietary and nonplatform-specific formats. Programming statements (also known as control cards) for SAS and SPSS are also highly recommended so that users can convert the raw data into system files. The control card files can be modified for later versions of statistical software and used for other programming applications and indexing.
In accordance with emerging standards for resource discovery, data files should contain a standard electronic header or be accompanied by machine-readable metadata identification information. This information should include complete citations to all the parts of a particular study (data files, documentation, control card files, and so forth) and serve as a study-level record of contents and structure.
Metadata standards. Not only must standards be considered for the structure of the numeric data files, but the metadata information describing the data must also conform to content and format standards for current use and for long-term preservation applications.
-30-Content standards require common elements, or a common set of "containers" that hold specific types of information. Standards for coding variables should be followed so that linkages and cross-study analysis are enhanced, and common thesauri for searching and mining data collections also need to be produced. As for format standards, metadata should be produced in system-independent formats that provide standard structures for the common elements and coding schemes. These format standards should provide consistent tagging of elements that can be mapped to resource discovery and viewing software and to statistical analysis and database systems.
For most social science data files, machine-readable documentation should be supplied in ASCII format that conforms to standard guidelines for both content and format. With some very large surveys using complex survey instruments, files in this ASCII format may be so big that complex structures in nonproprietary format need to be developed to reduce storage requirements. If paper documentation is distributed, it should be produced using high-quality duplication techniques with simple fonts, no underlining, no handwritten notations, and plenty of white space.
Of particular interest is the Data Documentation Initiative (DDI) DTD (Document Type Definition in XML), which is a developing standard for documentation describing numeric databases. It provides both content and format standards for creating digitized documentation in XML. Data producers in the United States, Canada, and Europe will be testing the DTD as part of their documentation production process. Data archives will be converting their digitized documentation into this format, and some will be scanning paper documentation and tagging the content with the DDI.
Complex statistical systems. We must also be concerned about the long-term preservation plans for complex systems. As we see more efforts to integrate data and documentation within linked systems, we see a growing tension between access and preservation. Complex database management systems such as ORACLE or SQL present us with more complex questions: What parts of the system need to be preserved to save the content of the information as well as its integrity? Will snapshots of the system provide future users with enough information to simulate access as we see it today? Is the content usable outside the context of the system? These are the database preservation challenges of the future.
ASCII -- American Standard Code for Information Interchange. A character encoding scheme used by many computers. The ASCII standard uses 7 of the 8 bits that make up a byte to define the codes for 128 characters. Example: in ASCII, the number seven is a treated as a character and is encoded as: 00010111. Because a byte can have a total of 256 possible values, there are an additional 128 possible characters that can be encoded into a byte, but there is no formal ASCII standard for those additional 128 characters. Most IBM-compatible personal computers do use an IBM extended character set that includes international characters, line and box drawing characters, Greek letters, and mathematical symbols.
C -- A programming language.
CBw.d -- Instructions that the SAS System uses to read standard numeric values from column-binary files, translating the data into standard binary format. The w value specifies the width of the variable, usually 8, but has a range between 1 and 32. The d value specifies the number of digits to the right of the decimal point in the numeric value.
card -- Also known as deck, a physical record of data. A survey may have multiple cards for each respondent, all cards together comprising a logical record. Based on the IBM punch cards of 80-column length.
case -- The unit of analysis in a particular data file. Can be an individual respondent to a questionnaire, a customer, or an industry. In the Roper Reports, each case is an interview respondent.
codebook -- Description of the organization and content of a data file. Contains the code ranges and the code meanings needed to interpret the data file.
column binary -- A code originally used with punched cards in which successive bits are represented by the presence or absence of punches in contiguous positions in columns. Using this method, responses to more than one question can be stored in a single column.
data dictionary -- A file, part of a file, or part of a printed codebook containing information about a data file, including the name of the element, its format, location, and size.
Data Documentation Initiative (DDI) -- An international committee sponsored by ICPSR that is developing a new metadata standard for social science documentation. This standard, developed by representatives from the international social science research community, is intended to fill the need for a structured codebook standard that will serve as an interchange format and permit the development of new Web applications. The Document Type Definition (DTD) for the DDI
-32-standard is written in XML (Extensible Markup Language) and is available at http://www.icpsr.umich.edu/DDI/.
deck -- Also known as card, a logical record of data. A survey may have multiple cards for each respondent, all cards together comprising a logical record. Based on the IBM punch cards of 80-column length.
documentation -- Information that accompanies a data file, describing the condition of the data, the creation of the file, the location and size of variables in the file, and the values (or codes) of the variables.
export file -- A file produced by a software package that is designed to be read on another computer, often with a different operating system, running a version of the same software package.
HTML -- HyperText Markup Language
ICPSR -- Inter-university Consortium for Political and Social Research
informat -- The instructions that specify how SAS reads the numbers and characters in a data file.
intermediate variable -- A variable used when recoding data to input information from individual punches in multipunch data. Sets of intermediate variables are then recoded to produce final variables.
logical record -- A complete unit of data for a particular unit of analysis, in this project a single respondent. Multiple physical records, called cards or decks, may make up a logical record.
missing value -- A value code that indicates no data are present for a variable for a particular case. To be distinguished from non-response values (respondent refused to answer or was not asked the question) and from invalid responses (the response did not have a valid value code equivalent). Non-responses and invalid responses may or may not have value categories provided in the questionnaire and may be treated differently from true missing data during analysis.
multipunched -- A way of recording data, originally used with punched cards, in which successive bits are represented by the presence or absence of punches in contiguous positions in columns. Using this method, responses to more than one question can be stored in a single column.
OCR -- Optical character recognition
PDF -- Portable Document Format, a published standard format developed by Adobe Systems, accessed with proprietary software.
punch card -- A paper medium used for recording computer-readable data. The card is punched by a special machine called a keypunch that works like a typewriter, except that it punches holes in cards instead of typing characters on paper. The punch cards are then processed with a card reader that transfers the punched information to a computer-readable digital format.
PUNCH.d -- Instructions that the SAS system uses to read standard numeric values from column binary files. The d value specifies which row in a card column to read. Valid values for the d value are 1 through 12.
questionnaire -- The set of questions asked in a survey. In the Yale Roper Collection, the questionnaire, with columns and codes written next to the question, may substitute for a codebook.
recode -- Changing the value code of a variable from one value to another. For example, changing 0 and 1 values in column binary data files to value ranges of 0 through 12. Also known as data transformation.
respondent -- In survey research, the person responding to the survey questions.
ROWw.d -- Instructions that the SAS system uses to read a column-binary field down a card column. The w value specifies the row where the field begins, with a range between 1 and 12. The d value specifies the length in rows of the field. Valid values for d are 1 through 25, with the default value of 1. The informat assigns the relative position of the punch in the field range to a numeric variable.
SAS -- Set of proprietary computer programs used for analysis of social science statistical data. (No longer an acronym; originally stood for Statistical Analysis System.)
SGML -- Standard General Mark-Up Language
SPSS -- Statistical Package for the Social Sciences. Set of proprietary computer programs used for analysis of social science statistical data.
single-punch -- A single response coded in a column.
split sample -- A method of data collection in which one group of respondents is queried with one form of a questionnaire and the second group is queried with a different form of the questionnaire.
spread -- Recoding multiple responses that have been coded in a single column of a record to a separate column for each response.
system file -- A data file or collection of data files specifically formatted for a particular software package; may not be readable by other software packages.
TIFF -- Tagged Image File Format
values -- The numeric or character equivalents for a particular variable in a data file.
variable -- An item in a data file to which a value has been assigned. A data file contains the values of certain variables measured for a set of cases. In the Roper Report data files, variables are responses to questions or parts of questions from each person interviewed.
XML -- Extensible Markup Language (XML) is a data format for structured document interchange on the Web.
xray -- A form of output that is organized by card, column, and row; each bit has its own unique location within this framework. The total number of punched bits across all observations is recorded for each location in the data set. This sum often provides a response frequency for individual response options.
Sources of Glossary Terms
Appendix 1: Roper Report documentation page 3W:
(Sept. 1993 Roper Report)
|Storage Requirements (bytes)||Percentage of Original|
|All platforms||Original (column-binary)||2858400|
|IBM PC datasets||SAS dataset (full set of variables)||19464984||681|
|SAS XPORT dataset (full set of variables)||19144720||670|
|SAS dataset with 4 byte integers (full set of variables)||9978112||349|
|SAS dataset with 3 byte integers (full set of variables)||7528192||263|
|SAS dataset (partial set of variables)||8879360||311|
|SAS XPORT dataset (partial set of variables)||8843760||309|
|SAS dataset with 4 byte integers (partial set of variables)||4571392||160|
|SAS dataset with 3 byte integers (partial set of variables)||3471616||121|
|Mainframe datasets||SAS dataset (full set of variables)||23347200||817|
|SAS XPORT dataset (full set of variables)||19144720||670|
|SAS dataset (partial set of variables)||11059200||387|
|SAS XPORT dataset (partial set of variables)||8843760||309|
Part one: The SAS Macro
* ALWAYS INDICATE FULL PATH & DATASET NAME OF *;
* COLUMN-BINARY DATA (DSN=) AND THE FULL PATH *;
* & DATASET NAME OF THE SPREAD DATA (OUT=) *;
* IN THE MACRO CALL *;
Part two: Read in column binary data, convert to ASCII; create data map showing new column locations
OPTION ERRORS = 0; /* CHANGES INVALID DATA MESSAGES TO WARNINGS */
OPTIONS NONUMBER NODATE;
INFILE "&DSN" lrecl=160 recfm=f;
FILE "&OUT" lrecl=960 ;
length a $1;
DO i=1 TO 80;
INPUT @i A $CB1. @i row1 punch.1
@i row2 punch.2
@i row3 punch.3
@i row4 punch.4
@i row5 punch.5
@i row6 punch.6
@i row7 punch.7
@i row8 punch.8
@i row9 punch.9
@i row10 punch.0
@i row11 punch.11
@i row12 punch.12 @;
PUT @(1+(12*(i-1))) (row1-row12) (1.) @;
title DATA MAP FOR COLUMN-BINARY SPREAD DATA;
title3 NOTE: Rows 1-9,0,X,Y correspond to columns 1-12.;
put; put; put;
do i = 1 to 40;
col1 = 1 + (12*(i-1));
col2 = 12 + (12*(i-1));
col3 = 1 + (12*(40+i-1));
col4 = 12 + (12*(40+i-1));
cola = i;
colb = 40 + i;
put @1 "Column " cola 2. " maps to " col1 4. " through " col2 4. @;
put @41 "Column " colb 2. " maps to " col3 4. " through " col4 4. @;
NOTE: Rows 1-9,0,X,Y correspond to columns 1-12.
Column 1 maps to 1 through 12
Column 2 maps to 13 through 24
Column 3 maps to 25 through 36
Column 4 maps to 37 through 48
Column 5 maps to 49 through 60
Column 6 maps to 61 through 72
Column 7 maps to 73 through 84
Column 8 maps to 85 through 96
Column 9 maps to 97 through 108
Column 10 maps to 109 through 120
Column 11 maps to 121 through 132
Column 12 maps to 133 through 144
Column 13 maps to 145 through 156
Column 14 maps to 157 through 168
Column 15 maps to 169 through 180
Column 16 maps to 181 through 192
Column 17 maps to 193 through 204
Column 18 maps to 205 through 216
Column 19 maps to 217 through 228
Column 20 maps to 229 through 240
Column 21 maps to 241 through 252
Column 22 maps to 253 through 264
Column 23 maps to 265 through 276
Column 24 maps to 277 through 288
Column 25 maps to 289 through 300
Column 26 maps to 301 through 312
Column 27 maps to 313 through 324
Column 28 maps to 325 through 336
Column 29 maps to 337 through 348
Column 30 maps to 349 through 360
Column 31 maps to 361 through 372
Column 32 maps to 373 through 384
Column 33 maps to 385 through 396
Column 34 maps to 397 through 408
Column 35 maps to 409 through 420
Column 36 maps to 421 through 432
Column 37 maps to 433 through 444
Column 38 maps to 445 through 456
Column 39 maps to 457 through 468
Column 40 maps to 469 through 480
Column 41 maps to 481 through 492
Column 42 maps to 493 through 504
Column 43 maps to 505 through 516
Column 44 maps to 517 through 528
Column 45 maps to 529 through 540
Column 46 maps to 541 through 552
Column 47 maps to 553 through 564
Column 48 maps to 565 through 576
Column 49 maps to 577 through 588
Column 50 maps to 589 through 600
Column 51 maps to 601 through 612
Column 52 maps to 613 through 624
Column 53 maps to 625 through 636
Column 54 maps to 637 through 648
Column 55 maps to 649 through 660
Column 56 maps to 661 through 672
Column 57 maps to 673 through 684
Column 58 maps to 685 through 696
Column 59 maps to 697 through 708
Column 60 maps to 709 through 720
Column 61 maps to 721 through 732
Column 62 maps to 733 through 744
Column 63 maps to 745 through 756
Column 64 maps to 757 through 768
Column 65 maps to 769 through 780
Column 66 maps to 781 through 792
Column 67 maps to 793 through 804
Column 68 maps to 805 through 816
Column 69 maps to 817 through 828
Column 70 maps to 829 through 840
Column 71 maps to 841 through 852
Column 72 maps to 853 through 864
Column 73 maps to 865 through 876
Column 74 maps to 877 through 888
Column 75 maps to 889 through 900
Column 76 maps to 901 through 912
Column 77 maps to 913 through 924
Column 78 maps to 925 through 936
Column 79 maps to 937 through 948
Column 80 maps to 949 through 960
Copyright © 2006 Digital Library Federation. All rights reserved.