1
Economic Growth Center
Digital Library
a project funded by the Andrew W. Mellon Foundation
  • Ann Green, Steve Citron-Pousty, and Marcia Ford
  • Social Science Research Services
  • Yale ITS Academic Media & Technology


  • Sandy Peterson and Julie Linden
  • Yale Social Science Libraries and Information Services


  • Christopher Udry
  • Professor, Yale Dept of Economics
  • Director, Economic Growth Center and Dept of Economics
2
Goals of the grant:
Improving access to statistical resources not born digital
  • Build a prototype archive of statistics not born digital.
  • Implement standard digitization practices and emerging metadata standards for statistical tables.
  • Document the costs and processes of creating a statistical digital library from print.
  • Build the collection based upon long range digital life cycle requirements.
  • Present the prototype digital library to the scholarly community for evaluation.


3
Research questions
  • What effect does online access to the statistical information have on scholarly use of the materials?
  • Are common digitization practices and standards suited to statistically-intensive documents?
  • What are the long term preservation requirements of the EGCDL?
  • What are the costs of producing high quality statistical tables with OCR and editing?
  • How scalable is this process, for what kinds of collections, and for what purposes?
4
Selecting the EGC
  • Why the Economic Growth Center collection?
  • extraordinary wealth of statistical material in printed form
  • condition of paper is poor; preservation concerns
  • a range of access problems imposed not only by non-digital physical formats but also by inadequate descriptions of the contents of the publications


  • Why Mexico?
  • Faculty connections, library strengths, image quality and completeness of collection
  • Publisher:  Instituto Nacional de Estadística, Geografía e Informática (Mexico, INEGI) Anuarios Estatales
5
Digitization process
  • Vendor review:
    • Part one:  request for proposals (13 vendors)
    • Part two:  production of samples and extended proposals (6 vendors)
    • Part three:  final review and budget evaluation
    • Part four:  contract negotiation and processing
  • Prepared and shipped 221 volumes in Fall ’03
  • All materials received back at Yale Feb ’04
  • Quality assessment period complete March ’04
6
Deliverables received:  images and PDFs
  • 300 dpi TIFFs of each page of each volume, including cover and back of volume
  • (103,115 TIFF files; 460 gigabytes)
    • Color TIFFs for color pages, black and white TIFFs for pages with only black and white
  • PDF image + text files of each chapter of each volume (5,607 files)
    • Separate files for each subject chapter, front matter, indices, and back matter
7
Deliverables received: statistical tables in Excel
  • 16,488 Excel tables of demographic and economic data for 1994, 1996, 1998, and 2000
  • Selected OCR’d statistical tables converted to Excel format
  • DSI operators reviewed each table twice
  • Custom tagging done by DSI improved our ability to extract columns and rows into an online database and to build XML metadata
8
Quality assessment of PDFs
  • Overall assessment was very good; minor problems on a very small number of pages
  • Sampled 5% of the PDFs
    • subjective review of image:
      • Tilting, cut off text, illegible characters, bleed through, color evaluation, noise
      • Vendor did not produce PDFs exactly to our specification; we had to reformat 216 files
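
A minimal sketch of how a 5% review sample could be drawn. The slides do not say how the sample was chosen, so the simple random sample, the fixed seed, and the file names below are all assumptions made for illustration.

```python
import math
import random

def sample_for_review(paths, fraction=0.05, seed=0):
    """Return a reproducible random sample of roughly `fraction` of the paths."""
    paths = list(paths)
    if not paths:
        return []
    # Round up so that even a small batch gets at least one file reviewed.
    k = max(1, math.ceil(len(paths) * fraction))
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return sorted(rng.sample(paths, k))

# Hypothetical file names standing in for the 5,607 delivered chapter PDFs.
pdfs = [f"chapter_{i:04d}.pdf" for i in range(5607)]
review_set = sample_for_review(pdfs)
print(len(review_set))  # 281 files to inspect by eye
```

Fixing the seed keeps the sample repeatable, which helps if a second reviewer needs to look at the same pages.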
9
Quality assessment of Excel tables
  • Visual review
  • Checksums:
    • Wrote script to find tables with ‘Total’ columns
    • Wrote Excel macro to compare column sums with Total values
    • Excellent transfer of numbers from print
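
The checksum idea can be sketched in Python rather than as an Excel macro. This is an illustration only: it assumes a table whose 'Total' column should equal the sum of the other numeric cells in the same row, and the header, rows, and function name are invented for the example.

```python
def find_total_mismatches(header, rows, total_label="Total", tolerance=0):
    """Compare each row's 'Total' cell with the sum of its other numeric cells.

    Returns a list of (row_number, stated_total, computed_sum) for rows that
    disagree by more than `tolerance`.
    """
    total_idx = header.index(total_label)
    mismatches = []
    for row_number, row in enumerate(rows, start=1):
        # Skip the Total cell itself and any text cells such as row labels.
        parts = [value for i, value in enumerate(row)
                 if i != total_idx and isinstance(value, (int, float))]
        computed = sum(parts)
        if abs(computed - row[total_idx]) > tolerance:
            mismatches.append((row_number, row[total_idx], computed))
    return mismatches

# Made-up table: the second row simulates an OCR error (130 read as 103).
header = ["Municipio", "Total", "Hombres", "Mujeres"]
rows = [
    ["A", 100, 48, 52],
    ["B", 250, 120, 103],
]
print(find_total_mismatches(header, rows))  # [(2, 250, 223)]
```

A row that fails the check is only a candidate for review: the error may be in the print original rather than in the OCR output.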
10
Automated Dublin Core metadata production for PDF and Excel files
  • Defined metadata format for PDF and Excel documents
    • Dublin Core subset (title, date, format, identifier, source, coverage, subject)
  • Wrote scripts to produce individual metadata records for each PDF chapter and each Excel file
    • Matched chapter numbers with chapter titles: created standard matching tables
    • Generated subject term for each chapter: created standard tables of chapter titles to topic list
    • Generated series title list from library online catalog
    • Other text generated from file name and directory information
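
The record-generation step might look roughly like the sketch below. The file-naming pattern (`state_year_chNN`), the lookup tables, the `record` wrapper element, and the series title are hypothetical stand-ins; only the Dublin Core element names and namespace come from the standard.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical matching tables; the project's real tables are not shown here.
CHAPTER_TITLES = {"03": "Población"}
CHAPTER_SUBJECTS = {"Población": "Demography"}
MIME_TYPES = {".pdf": "application/pdf", ".xls": "application/vnd.ms-excel"}

def dublin_core_record(path, series_title):
    """Build one Dublin Core record from a file path and lookup tables."""
    p = PurePosixPath(path)
    state, year, chapter = p.stem.split("_")  # assumed pattern: state_year_chNN
    chapter_title = CHAPTER_TITLES[chapter[2:]]
    fields = [
        ("title", f"{series_title}: {chapter_title}"),
        ("date", year),
        ("format", MIME_TYPES[p.suffix.lower()]),
        ("identifier", p.name),
        ("source", series_title),
        ("coverage", state.replace("-", " ").title()),
        ("subject", CHAPTER_SUBJECTS[chapter_title]),
    ]
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record

record = dublin_core_record("egcdl/aguascalientes_1994_ch03.pdf",
                            "Anuario estatal (example series title)")
print(ET.tostring(record, encoding="unicode"))
```

Keeping the chapter and subject mappings in tables means a correction is made once and every record can then be regenerated.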
11
 
12
First generation interface:
Select, view, download
ssrs.yale.edu/egcdl
  • Features:
  • Select files by year, state, topic, and/or type of file (PDF or Excel)
  • Reconstruct the full volume
  • Based upon Dublin Core metadata
13
 
14

Next generation interface:
NESSTAR
  • Features:
    • Search across tables
    • View tables, select columns and rows
    • Create graphs and charts
    • Download and extract
  • Based upon the Data Documentation Initiative (DDI) metadata standard for statistical tables
  • Uses statistical software similar to SPSS
  • Nesstar is in use by data archives in Europe, World Bank, Health Canada; in review by ICPSR and The Roper Center
15
Metadata production and data publishing:  Nesstar process
  • Script creates DDI XML file from Dublin Core record and CSV file from Excel table
  • Staff publish each pair of DDI XML file and CSV file using Nesstar ‘cube builder’ software
  • Some editing needed at this stage:
    • Add ‘measure’ (what the table is measuring—persons, events, pesos, etc)
    • Add column header name
  • Publish metadata and table data to Nesstar server
16
 
17
 
18
 
19
 
20
Evaluation of Nesstar interface
  • No major advantage over Excel in terms of viewing or manipulating data
  • Metadata and table can’t be viewed together
  • De-contextualization of tables (search results don’t indicate what volumes the tables belong to)
  • Lack of flexibility in customization
21
 
22
Evaluation of Nesstar-produced metadata
  • Labor-intensive to create and edit; no batch processing
  • Have to interpret table elements (add column header and measure labels)
  • Some data lost (textual codes for missing data; footnotes within cells)
  • Some deviations from DDI specification


23
Automated DDI production:
script-based process
  • Scripts pulled elements from Excel files and Dublin Core records
  • No manual editing necessary
  • DDI records are valid
  • Metadata describes table “as is” without our interpretation imposed on it
  • Challenges with marking up hierarchical tables in DDI
24
Search/browse UI for PDFs and Excel
  • Lucene index for keyword searching
    • Includes text from table titles, column and row labels, and footnotes
    • Search terms work with or without accent marks; Lucene returns the appropriate result set
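
The accent-insensitive matching can be illustrated with a small Python sketch. This is not Lucene's API or the project's code; it only shows the general technique of folding accents on both the indexed text and the query before comparing. The sample labels are invented.

```python
import unicodedata

def fold_accents(text):
    """Lower-case the text and strip combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()

def search(labels, query):
    """Return labels containing the query, ignoring case and accents."""
    needle = fold_accents(query)
    return [label for label in labels if needle in fold_accents(label)]

# Invented table titles for illustration.
labels = ["Población total", "Educación", "Producción agrícola"]
print(search(labels, "poblacion"))  # ['Población total']
print(search(labels, "Población"))  # ['Población total']
```

This simple folding also maps ñ to n, so a production index for Spanish text may want to treat that letter separately.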


25
 
26
 
27
What we are learning and documenting
  • Costs and processes of digitizing paper and building a statistical digital library:
    • Scanning requirements (TIFFs, PDF/A, etc.)
    • OCR of Spanish text
    • OCR of numbers into spreadsheets (zoned scanning)
    • Quality assessment
    • Is it less expensive to key in the tables? What is the ‘tipping point’?

28
Cost comparison of keying in the data
  • Sent sample tables to two vendors and asked for cost estimates for keying
    • Varied widely – different pricing structures (per 1000 characters vs. per Kb)
    • Both require shipping volumes (or photocopies) overseas
29
Costs of digitizing Nigeria data
  • Many volumes have poor quality paper and print
    • mimeographed copies of typewritten pages
    • skewing, bleed through, strikeovers, text cut off
  • Goals of project
    • Try to find “tipping point” for digitizing collection vs. digitizing selection of tables
    • Test whether lessons learned from relatively clean tables apply to lower-quality print materials
    • Determine how these materials would fit into EGCDL UI
  • Cost estimates
    • Considering TIFF and PDF for all volumes, flagging good quality volumes for automated conversion to Excel
30
 
31
What we are learning and documenting
  • Digital Life Cycle considerations:
    • Formats and metadata standards
    • Long term home for digital assets and the Rescue Repository
    • Exactly what assets do we preserve?
    • Questions about versioning
    • XML to Excel as potential preservation format

32
What we are learning and documenting
  • Metadata production: extensive use of automated processes
    • Dublin Core for file-level metadata
    • DDI for table-level metadata


  • User interface development and Spanish character challenges
    • Nesstar implementation for aggregate data
    • Developed own UI for more flexibility, ability to include PDFs and Excel tables
33
What we are learning and documenting
    • Scholarly use of the EGCDL:  Finding and accessing tables
    • Great advantage in locating data content by the full text of the tables; value in online access to individual tables.
    • Integrate searching and access into existing resources
    • Are investments in online analysis and visualization worth it?  Are Excel tables what faculty and students want?
34
What we are learning and documenting
    • Production of more tables:
    • Do faculty value a collections-based digitization effort, or is there more value in an on-demand service?
    • What other countries in the EGC print collection are of interest?
    • Cost comparisons of building full volume equivalents vs selected tables.
    • The process can be leveraged to facilitate production for other research and learning projects.
35
Contact information
  • ann.green@yale.edu


  • julie.linden@yale.edu


  • sandra.k.peterson@yale.edu


  • Project web site:
  • ssrs.yale.edu/egcdl