
DIGITAL LIBRARY FEDERATION
WORKSHOP ON SOCIAL SCIENCE DATA ARCHIVES

Report of the meeting held January 27-28, 1999
Princeton University

On January 27-28, 1999, the Digital Library Federation (DLF) hosted its first workshop on the state of the art of digital libraries in the social sciences. The purpose of the workshop was to explore current problems and emerging solutions in three areas:

  1. facilities for users to discover and retrieve relevant and related data sets;
  2. means for users to interpret and evaluate the comparability of data sets; and
  3. tools for and methods of data extraction for analysis.

The DLF convened experts in the development of social science digital libraries and managers of social science data archives to reach common understandings about the problems and potential solutions in the three focal areas, and to establish a collaborative environment for advancing digital data libraries. Experts defined the state of the art, and participants joined in identifying criteria for best practices, with the objective of determining work that the DLF could undertake to advance the field.

Welcome and Introductions

The workshop opened with welcoming statements from Kevin Barry, head of the Social Science Reference Center, Princeton University; Karin Trainer, university librarian, Princeton University; and Donald Waters, director of the Digital Library Federation.

Mr. Barry opened the workshop with a description of Princeton's Social Science Reference Center services. In 1981, the social science reference function was transferred to the Social Science Reference Center, located in the Library, a move that improved services. The Center serves undergraduates, graduate students, and faculty, particularly those in the social sciences, and helps students with their senior theses, an increasing number of which are quantitative. The Center brings together human expertise, myriad data and information resources, and services such as statistical analysis consulting.

Barry welcomed the assembled group and expressed his hopes and expectation that the meeting would foster collaboration among the participants and improve services to users, including helping them to become more self-sufficient through the use of better online tools.

Welcoming the workshop participants to Princeton University, Karin Trainer encouraged collaboration to improve services and to leverage the investments and expertise of individual institutions. She acknowledged the DLF's contributions to the state of our collective knowledge about digital libraries, to the development of best practices, and to the goals of improved services to users.

Don Waters thanked Princeton University for hosting the meeting and reinforced the collective belief that significant gains can be made through collaborative action. He recognized and thanked planning committee members Kevin Barry, Ann Green (Yale), Lauris Olson (University of Pennsylvania), and Rebecca Graham (DLF).

Waters explained that DLF initiatives focus on the need to reduce barriers to cooperation. The initiatives are catalytic, not operational. They aim to foster organizational and project-based efforts toward further progress. The DLF has focused on architecture and infrastructure (for example, intellectual property and access management, naming, rights and permission protocols, metadata, and linking). He described specific projects such as the DLF/NISO workshop on naming, a workshop cosponsored with NISO to identify metadata for images, studies in digital preservation, and initiatives in scholarly communication (including working with faculty in art history, and divinity and theological institutions).

The Social Science Data Initiative goes back to the founding of the DLF. Growing out of an organizational initiative among Berkeley, Harvard, Stanford, UCSD, and Oregon State, it has expanded to include the assembled institutions (for a list of participants, see http://www.diglib.org/collections/ssda/ssdaparticipants.htm). Whereas earlier efforts attempted to develop a comprehensive plan for data and services, the current effort will focus on identifying a set of steps that can provide improved services in the short term and build a foundation for the future. Sid Verba of Harvard, who serves on the Board of the Council on Library and Information Resources (CLIR), has been a particularly strong advocate of developing online social science data libraries.

Waters described the goals of the workshop. It is urgent to develop improved social science data services for research, public service, and teaching. He hoped that the workshop would be intimate and interactive, and that it would aim its deliberations at the managerial level to inform future action. The workshop is organized into three sessions:

  • discovery and retrieval;
  • comparability and interpretation;
  • data extraction.

Although other areas are also important for social science data services, they are secondary to the topics above.

Waters challenged the group to think about what might characterize excellent services. What might be done together or separately to advance the state of the art of digital libraries in the social sciences? What resources are needed? He emphasized the ultimate goals of fostering communication among scholars and data literacy for students.

Part 1: Discovery and retrieval

From Catalog to Portal: Integrating Software, Metadata, Resources, and Services

Richard Rockwell, executive director, Inter-university Consortium for Political and Social Research (ICPSR)

Rockwell described the transition in ICPSR's use of metadata. ICPSR is creating a new approach: a portal that provides an integrated set of services to a tightly focused community. The ICPSR's portal provides quantitative social science data resources for research and instruction (it does not provide text or images). It supports retrieval and use of data, provides training and user support, and allows discovery. Through its services, one can progress from data, through information, to knowledge. User support, a very important component of the ICPSR portal, is rarely prominent in other portals.

Today's vastly improved computing environment can deliver data instantaneously with documentation and services, and thus allows the portal to integrate software, metadata, resources, and services. Rockwell compared the current state of affairs with that of the recent past, when reels of magnetic tape had to be loaded into mainframe computers for use, and metadata were found in printed volumes.

The ICPSR is interested in all social science data resources, whether they are ICPSR data themselves, resources from trusted data archives, or data from original data producers. ICPSR will make all types of data available through its portal.

Rockwell provided a tutorial on the subject of data archives. Social science data archives include data from surveys, censuses, administrative records, direct observation, diaries, and the like. The data consist of logical records containing, for each unit of observation, the measurements of the variables (e.g., for persons: sex, political party, occupation; for organizations: number of employees, annual sales).

Data are unusable without documentation. To progress from raw data to information, the researcher must use a statistical analysis program (or spreadsheet) to produce analytical results, then use the data documentation (codebook) to interpret the results.

The codebook is metadata, which includes multiple levels of information about a data set. At the highest level is archive-level information, indicating the data gatherer and data set creator, the price, conditions of use, copyright, and other agreements. Next are the study-level metadata, indicating what kind of data are in the set, such as survey, census, administrative records, and years covered. Third are the variable/measurement metadata, indicating the variables, sample design, units of measurement, and how the measurements were made and recorded. Thus, the codebook provides the essential metadata that make the data usable. Traditionally, the codebook has been a print product accompanying the data file.
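
To make the three levels concrete, the sketch below represents a codebook as a small Python structure and uses its variable-level metadata to interpret a raw record. Every name, label, and code here is hypothetical; actual codebooks are far richer.

    # A hypothetical codebook with the three metadata levels described above.
    codebook = {
        "archive_level": {
            "distributor": "Example Data Archive",
            "conditions_of_use": "research and teaching only",
        },
        "study_level": {
            "title": "Example Opinion Survey, 1998",
            "kind_of_data": "survey",
            "time_period": "1998",
        },
        "variable_level": {
            "SEX": {"label": "Sex of respondent",
                    "codes": {"1": "male", "2": "female", "9": "missing"}},
            "PARTY": {"label": "Political party identification",
                      "codes": {"1": "Democrat", "2": "Republican", "9": "missing"}},
        },
    }

    def interpret(record):
        """Translate a raw coded record into labeled values via the codebook."""
        variables = codebook["variable_level"]
        return {name: variables[name]["codes"].get(code, "undocumented code")
                for name, code in record.items()}

    # A raw logical record is unusable by itself ...
    raw_record = {"SEX": "2", "PARTY": "1"}
    # ... but read against the codebook it becomes information.
    print(interpret(raw_record))  # {'SEX': 'female', 'PARTY': 'Democrat'}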

Rockwell suggested that as social science data become more available over the network, and as library users begin to expect services in support of their use of data, librarians will need to move into substantively new areas of responsibility and gain new competencies. Helping students and faculty find information in printed sources is qualitatively different from helping them to use data. For example, one does not need to instruct users in reading or help them understand the structure of a book. In contrast, the social science data librarian must be prepared to help users understand a stratified random sample, how to weight various types of data, and how to choose tools of analysis.

Although the codebook serves as the principal metadata for social science data archives, some information is captured in higher-level metadata, such as a catalog. To produce a seamless system for data exploitation, finding aids must operate on both the catalog and the codebooks.

Originally, the ICPSR produced an annual printed catalog, which ultimately reached 915 pages. Later, the catalog was converted into a SPIRES electronic database, and, finally, in the early 1990s moved first to Gopher, then to the Web. But in all of these cases, the catalog included only study-level metadata, not variable-level information.

The OSIRIS statistical package provided structured markup of codebooks, and ICPSR was able to convert this markup into a "Variables" database. But OSIRIS was too restrictive and simplistic. It was not a standard in the library or IT communities, or among data producers outside the Institute for Social Research (ISR), and thus its markup could not be readily exchanged or archived.

Today, the community is working to establish best practices for digital codebooks. This effort is called the Data Documentation Initiative (DDI). It is intended to be more flexible than the OSIRIS approach, operate in a native Web environment using standard Web browsers, incorporate both study-level and variable-level metadata in one document database, and use the XML dialect of SGML. Final revisions are now being made to DDI release 1.0.

From the beginning, in 1993, the organization and funding of DDI have been international and have involved both data producers and data archivists. The initiative was funded initially by ICPSR, then by a National Science Foundation (NSF) grant in 1996. NESSTAR, a European Community-funded project, has been using the DDI prototype standard. In February 1999, ICPSR will issue an RFP for beta testers of the DDI. The developers would like to test as many different types of codebooks as possible, and to enlist as broad participation as possible. To view the DDI, go to http://www.icpsr.umich.edu/DDI/codebook.html.

The DDI includes the following kinds of study-level data:

  • Agency or principal investigator(s);
  • Persons or organizations responsible for data collection;
  • Weighting;
  • Date and geographic location of data collection and the time period covered;
  • Index or table of contents;
  • Technical information on files;
  • Official title of the data collection;
  • Project description;
  • Response rate;
  • Unit(s) of analysis/observation;
  • Flowchart of the data collection instrument;
  • Restrictions on use of the data;
  • Funding sources and related acknowledgments, where applicable;
  • Sample and sampling procedures;
  • Data source(s);
  • Data collection instruments;
  • List of abbreviations and other conventions;
  • Bibliographic citations.

It includes the following kinds of variable-level information:

  • Precise wording of the question or the exact meaning of the datum;
  • Missing data codes;
  • Exact meaning of codes;
  • The item or questionnaire number (e.g., Question 3a);
  • Imputation and editing information;
  • Unweighted frequency distributions or summary statistics for the item.
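
As a purely illustrative sketch, the fragment below shows how study-level and variable-level metadata might sit together in a single DDI-style XML document and be read with Python's standard library. The element names are loosely modeled on the DDI draft rather than copied from it; the URL above gives the authoritative specification.

    import xml.etree.ElementTree as ET

    # Illustrative only: element names approximate the DDI draft.
    doc = """
    <codeBook>
      <stdyDscr>
        <titl>Example Opinion Survey, 1998</titl>
        <respRate>72 percent</respRate>
      </stdyDscr>
      <dataDscr>
        <var name="PARTY">
          <qstn>Generally speaking, which party do you identify with?</qstn>
          <catgry code="1">Democrat</catgry>
          <catgry code="2">Republican</catgry>
          <catgry code="9">Missing</catgry>
        </var>
      </dataDscr>
    </codeBook>
    """

    root = ET.fromstring(doc)
    # Study-level and variable-level metadata live in one document.
    print(root.findtext("stdyDscr/titl"))
    for var in root.iter("var"):
        print(var.get("name"), "-", var.findtext("qstn"))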

Rockwell addressed the question: Why create yet another document type definition (DTD)? He explained that the Text Encoding Initiative (TEI) created a DTD to be used for texts, primarily humanities texts, that bears no cognitive relationship to quantitative data.

Similarly, the Encoded Archival Description (EAD) is a standard for finding aids that describe objects such as textual materials, visual materials, recordings, and artifacts.

The DDI developers have conformed whenever possible to the Dublin Core for study-level metadata. But the DDI has been developed specifically to provide access to quantitative data and to conform to the cognitive structure of those data. The DDI responds to broader needs than the internal requirements of ICPSR. Although the Web held promise for improving access to quantitative data, by the mid-1990s it was becoming an unorganized collection of information of highly variable and often dubious quality, quite transient in nature.

It was clear that for information to be easily found, navigated, and used, there was a need for cognitive structure, which was lacking in HTML-based Web documents. It had also become obvious that free-text searching was a poor way to retrieve information, since it had low levels of precision and specificity.

The 1996 proposal to NSF argued that a logical document structure was indispensable to data discovery and retrieval, a point on which there is widespread agreement today.

The DDI still faces several problems. To ensure success, mechanisms for long-term governance and maintenance of the DDI must be developed, data producers must be encouraged to adopt it, and automated or semi-automated markup programs need to be developed.

A continuing challenge will be to keep the DDI focused on quantitative data sets. Each domain has its own cognitive structure, and developers of the DDI believe that it would be a mistake to attempt to extend it to additional domains having different cognitive structures.

Rockwell explained that the DDI promises to enable quantitative data users to search data resources worldwide in a single probe and to access both study-level and variable-level information within a single system. With DDI, language will not matter, at least in the context of European languages, if NESSTAR/ICPSR succeeds. Through the DDI, the user can immediately conduct an analysis of the data. These analyses can be elementary or exploratory, analytical extracts can be drawn, customized documentation can be generated, and a variety of views of the data can be created. In the longer term, expert systems might be developed to aid researchers in the choice of data. Lowering the barriers to access and use of quantitative data can expand undergraduate use.
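
A minimal sketch of this "single probe" idea, assuming the codebooks of many archives have already been parsed into a common structure (the structure shown is invented for illustration):

    # Hypothetical codebooks from different archives, reduced to one structure.
    codebooks = [
        {"study": "Example Opinion Survey, 1998",
         "variables": {"PARTY": "Political party identification"}},
        {"study": "Example Labor Force Survey, 1997",
         "variables": {"UNION": "Union membership", "OCC": "Occupation"}},
    ]

    def single_probe(term):
        """One search across study-level and variable-level metadata at once."""
        term = term.lower()
        for cb in codebooks:
            if term in cb["study"].lower():
                yield cb["study"], "study-level match"
            for name, label in cb["variables"].items():
                if term in label.lower():
                    yield cb["study"], f"variable {name}: {label}"

    for study, hit in single_probe("party"):
        print(study, "->", hit)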

Rockwell described how the Web has transformed the nature of the data archive. No longer are complete data archives maintained at a central location, or in several locations. Today, some are directly distributed and housed by data producers; others are housed at a variety of institutions in a virtual, or distributed, archive. The distributed system of data archiving and access poses the risk of a fragmented information system. But the solution is not necessarily to import all data sets into centralized data archives. The alternative is to provide expertly reviewed linkages to data resources outside of the major data archives. Thus, the meaning of the term data archive changes from a data collection to a data service.

According to Rockwell, professional archiving is very important, but archives need to be information providers, not just data providers. The creation of a system of distributed archives, following a common set of best practices, will lead to greater precision and specificity in finding and using data, and will allow the potential for analyses across data sets.

Major issues remain in creating a virtual international data library. National differences in access to data create barriers across boundaries. The information costs may inhibit widespread access and use. Systems must be established to ensure confidentiality and privacy protection. Data security and authenticity must be assured.

We have learned several lessons from our experience with networked quantitative data. Users expect openness and transparency in the systems and have low tolerance for systems that are difficult to use. Data are still hard to find; we must empower users to discover data resources, wherever they are, whatever they contain. There is a very high potential for erroneous use of data by naïve users.

We now know that the union catalog for social science data archives will be distributed, virtual, and dynamic. Data archives will shift from primarily enabling retrieval of data to also enabling retrieval of information (i.e., to interactive analyses).

New standards will make distributed systems transparent to the user. There is a great need for and receptivity to the idea of developing common metadata standards, but there is less receptivity to adopting those standards.

Rockwell suggested that the community must resist the urge to revise standards continually to meet some additional community's needs, to extend standards beyond their original scope, and to make all standards dependent upon each other. But it must make interoperability its goal.

The pilot project has demonstrated that far too little money is being invested in preparation of metadata, and that in the long run, far too much will be invested unless the process is automated or semi-automated.

The overall conclusion of the DDI project is that logically structured metadata based upon expert domain-specific knowledge are indispensable to accurate and efficient discovery, retrieval, and use of Web documents, including social science data.

Rockwell believes that the DLF can help the community make the DDI a best practice by persuading publishers, including data producers, to do markup. Moreover, he believes that the DLF can facilitate quantitative research in the social sciences by educating librarians about social science data and metadata, and by encouraging them to gain new skills.

Integrating Access to Distributed Networked Resources

Daniel Greenstein, Arts and Humanities Data Service, King's College, London

The Arts and Humanities Data Service (AHDS) is funded by the Joint Information Systems Committee (JISC) and emphasizes data archiving. It manages a heterogeneous collection that is distributed, interdisciplinary, and variously formatted and cataloged at differing "collection" levels. The AHDS wants to establish a single point of discovery, access, and retrieval for its own resources and for those from third-party providers.

Greenstein described how different disciplines have different data needs and perspectives on information. The question for AHDS is how to present heterogeneous collections in a unified way. How can one facilitate the resource discovery process so that users can search across domains in which information providers and domain specialists use very different standards? Interoperability is a goal of AHDS, despite an environment that is cross-disciplinary, has heterogeneous collections and formats, and has varying levels of collection or record.

The following are examples of what is included in AHDS:

  • The Archaeology Data Service (York, GIS etc., NGDF, FDI OLIB/SQL)
  • The History Data Service (Essex, databases and statistical sets, DDI, CHESHIRE/SGML)
  • The Oxford Text Archive (Oxford, e-texts and linguistic corpora, TEI, PAT/SGML)
  • The Performing Arts Data Service (Glasgow, film and sound, proprietary Hyperwave/OODB)
  • The Visual Arts Data Service (Surrey, images, VRA, CIMI/SSL)
  • A large number of possible third-party configurations

In deciding how to present diverse collections in a unified way, it is first useful to review users' requirements. Users of AHDS want integrated access to information resources, irrespective of place, format, and curatorial tradition. They need rich search and retrieval capabilities, and seamless interfaces between discovery, delivery, and use. They also want the ability to configure the network environment to their personal needs.

What are some of the opportunities to accomplish this? Cataloging standards have emerged from a variety of domains, including libraries, data archives, museums and heritage organizations, and the geospatial disciplines. In addition, the momentum behind the Dublin Core offers discovery-metadata standards with the potential to improve cross-domain search, retrieval, and use.
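
One common approach to such cross-domain discovery is a crosswalk: records described in MARC, EAD, TEI Header, or DDI terms are mapped onto Dublin Core elements so that a single query can run across domains. The sketch below uses assumed field names for illustration; these are not the mappings AHDS adopted.

    # Hypothetical crosswalks from domain-specific fields to Dublin Core.
    CROSSWALKS = {
        "MARC": {"245": "title", "100": "creator", "260c": "date"},
        "DDI": {"titl": "title", "AuthEnty": "creator", "timePrd": "coverage"},
    }

    def to_dublin_core(scheme, record):
        """Reduce a domain-specific record to a Dublin Core view for discovery."""
        mapping = CROSSWALKS[scheme]
        return {mapping[field]: value for field, value in record.items()
                if field in mapping}

    marc = {"245": "Example Opinion Survey, 1998", "100": "Example, A."}
    ddi = {"titl": "Example Labor Force Survey", "timePrd": "1997"}

    # Records from different domains now share one discovery vocabulary.
    print(to_dublin_core("MARC", marc))
    print(to_dublin_core("DDI", ddi))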

The Z39.50 network application protocol promises easy exchange of information between systems, and there is robust organizational momentum behind interoperability.

Testbeds include OCLC, CIMI, RLG, NordInfo, and UKOLN.

Variations among users and data are common problems and not unique to AHDS. AHDS is interested in performing research and development on these ubiquitous issues of discovery and retrieval and architecture. It is working on two fronts: developing cataloging and resource discovery metadata, and developing appropriate information architectures.

Domain-specific metadata that can provide rich descriptive information already exist, such as ISAD(G) or EAD, MARC, NGDF (FGDC), CIMI/CIDOC/SPECTRUM, DDI, and TEI Header.

How can one make the various types of metadata interoperable? Perhaps Dublin Core Metadata, representing fundamental information, can be the initial source for resource discovery. But it would be necessary to achieve consensus on this point. Therefore, AHDS sponsored six workshops to answer four questions:

  • What are the major cataloging standards within your domain?
  • In light of those standards and your knowledge of user behavior, what are your core resource discovery requirements? (These are first-order requirements, not subsequent filters and the like.)
  • How do your core resource discovery requirements map onto the Dublin Core?
  • Evaluate the Dublin Core in light of your requirements.

Greenstein then explained the AHDS architecture and demonstrated various resources including the Archaeology Data Service, which uses the Cheshire system, and the AHDS gateway, which provides access to all data services, including the History Data Service, the Performing Arts Data Service, and the Visual Arts Data Service.

Having realized many of its original goals in the current AHDS system, AHDS is now considering second-order issues. For example, there is a need to ensure that the system is scalable; that it can handle more information providers, data resources, and users; and that it can continue to provide appropriate user support.

To ensure continuing responsiveness to users' needs, AHDS is analyzing users' behavior. It is exploring the possibilities of assisted searching, working on Gateway-to-Gateway interoperability, and developing domain-specific Dublin Core implementation guidelines written from a cross-domain perspective. Finally, AHDS is developing Z39.50 Implementation Guidelines (ZIG).

Reviewing what has been accomplished, AHDS has identified some key hurdles. First, it has been difficult to involve users in a meaningful way. Second, it has been difficult to focus serious efforts on cross-domain issues. A continuing challenge is to build consensus around communities and then communities around consensus. It has been challenging to develop realistic expectations of emerging technologies and their suppliers. And, finally, the costs of facilitating collaboration and consensus are always higher than expected.

AHDS is continuing to work with trusted information providers on collaborative filtering, cognitive structures, and "meta standards." A clear lesson has been that guidelines for implementation of standards are required to ensure consistency equivalent to a standard.

Libby Stevenson, standing in for Ann Green, facilitated a group discussion on criteria for best practices and ways to advance the state of the art in discovery and retrieval.

Library and data center collaborations have been strengthened in recent years, facilitated by the Web. Data integration work is being done through gateways such as CESSDA and ICPSR. Although item-level searches work well, there are still considerable challenges in cross-item searching.

A number of questions, issues, and concerns arose during the discussion. There is a clear need for cross-codebook searching, and this will be possible only if there is standardized coding. One needs to be realistic about what gateways can achieve in this formative period, before clear best practices have been established.

  • How much further integration of multiple, distributed, catalogs will be possible, and will this be possible in real time?
  • Can the codebook feed a search?
  • Can one perform an intermediate analysis and use the results to refine a search?
  • What do we know about the state of the art in discovery and retrieval?
  • What might we still need to learn and from whom?
  • What are the implications of what we do know?
  • What do we need to know about users and intermediation?
  • Where are areas of synergy and collaboration?
  • Can we identify paths and collaborations?
  • What is our motivation to get started? To make progress?

  • How do we involve the producers (including faculty and researchers)?
  • We need better documentation about data sets. How can we track updates and achieve version control?
  • How can we facilitate producers' work?
  • How can we deal with varying levels of quality?
  • Who are the other stakeholders? (e.g., IASSIST)
  • Who are the funding bodies and what strategic leverage might they offer over producers?
  • What is our public accountability?
  • How can we ensure reusability of data (archiving)?
  • How do we share expertise? Communicate intergenerationally about data? Develop a common language?
  • How can we help to facilitate new forms of scholarly communication?

The ICPSR Data Preparation Manual is intended to be a resource for best practices. Can it be both a "club" (making its use a condition of receiving extramural funds), and a carrot (an inducement because it makes data more accessible)? Should funders of sponsored research require that data sets resulting from that research be submitted in a standard format with a DDI codebook?

We need to build expertise in version control (should we use the digital object identifier, or DOI?) and, at a minimum, tag erroneous versions of data. It was suggested that indiscriminately eradicating erroneous data would be unwise because there may be references or citations to those data.
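
A minimal sketch of this tag-don't-delete approach to versioning, using made-up identifiers in a DOI-like style; no actual DOI registry or API is implied.

    # Hypothetical version registry: erroneous versions are flagged,
    # never erased, because published work may cite them.
    registry = {
        "10.99999/example-survey.v1": {"status": "superseded",
                                       "note": "coding error in one variable",
                                       "successor": "10.99999/example-survey.v2"},
        "10.99999/example-survey.v2": {"status": "current",
                                       "note": "", "successor": None},
    }

    def resolve(identifier):
        """Return a version entry, warning (not failing) on superseded data."""
        entry = registry[identifier]
        if entry["status"] != "current":
            print(f"Warning: {identifier} is {entry['status']} "
                  f"({entry['note']}); current version: {entry['successor']}")
        return entry

    resolve("10.99999/example-survey.v1")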

How do we deal with the problems of legacy data, such as pre-machine-readable codebooks on deteriorating paper?

What about collection development? How many copies of a data set does the country need? How many versions? Where should they be stored? Local replication is no longer necessary or desirable in all cases. But do local service needs require multiple archives?

How should we define the roles and responsibilities of archives, information providers, and local services? What distinctions need to be made between archives and access repositories?

What is cataloging? Of objects? Of works? Does ownership matter?

The user needs more information about a data set than a highly abstracted MARC record. In the online environment, the record can be a direct link to the information itself rather than merely a surrogate for the object. What is the relationship of traditional catalog entries to owned vs. remote data? To web tools and search engines? To portals? The traditional catalog function of collocation is not well served by the multitudinous discovery tools and repositories that exist.

Lunch

During lunch, Henry Farber, Hughes-Rogers Professor of Economics at Princeton, spoke on the user's perspective.

Dr. Farber described himself as a labor economist who views the world through data and analysis. According to Farber, economics is a systematic way of interpreting data. At Princeton, there are 150 seniors per year in the Economics Department, and at least half write a data-based thesis.

Because data accessibility is the key to effective research, the researcher needs a readily available catalog of data sets. It is not necessary for each institution to duplicate data sets, although heavily used sets probably need to be housed onsite. Codebooks are essential if the data sets are to be usable. To improve access, a consortium should share the work of scanning the data and maintaining a number of paper documentation centers.

Reflecting on his own research, Farber said that he wants his data online and instantly available, although he concedes that perhaps little-used data could be in less accessible storage. In his view, "tapes are history." Like other sophisticated users, he wants raw, not processed, data. A problem with common extraction tools is that they process the data. He can use Unix servers, command-line access and manipulation, and FTP. He would like easy extraction and retrieval tools that present flat ASCII files, and systems that don't force people to use SAS or SPSS. The community needs to address the problem of data sets that are available only through proprietary extraction tools. There is a need for better authorization of users, particularly for remote data.
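
A sketch of the kind of extraction Farber describes: pulling selected columns out of a fixed-width ASCII file and writing a flat file, without recoding or otherwise processing the values. The column layout and file names are hypothetical.

    # Hypothetical fixed-width layout: AGE in columns 0-1, WAGE in columns 2-7.
    LAYOUT = {"AGE": (0, 2), "WAGE": (2, 8)}

    def extract(infile, outfile, variables):
        """Copy selected raw columns to a flat ASCII file; no recoding,
        no proprietary format, no processing of the values."""
        with open(infile) as src, open(outfile, "w") as dst:
            for line in src:
                dst.write(" ".join(line[LAYOUT[v][0]:LAYOUT[v][1]]
                                   for v in variables) + "\n")

    # extract("survey.dat", "subset.dat", ["AGE", "WAGE"])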

Although data can be anywhere, Farber believes that services must be local. For example, students need training to use data effectively. They must learn how to collect data, keep a lab notebook, define the protocols of their research, name all variables, and name all values. They must ensure that results are reproducible (this may rule out on-the-fly analysis). The results must be reproducible from the existing data set and for a new sample.

Farber ended by suggesting that the scholarly reward structure be adjusted to include deposit of primary source data.

Part 2: Data comparability and interpretation

Turning Numbers Into Information from the Products of the American Community Survey

Cynthia Taeuber and Elaine Quesinberry, American Community Survey, U.S. Census Bureau

The Census Bureau is conducting the American Community Survey (ACS). By 2003, it will be conducted in every county and will sample three million addresses per year. By 2010, it will replace the decennial census long form and will provide annual data down to the census tract level. The survey will simplify the decennial census and improve data quality by using professional interviewers.

The American Community Survey is being produced on CD-ROM and is described as a "one-stop" shopping center for American demographic information. It includes a tutorial and a guided tour. The data sets include hot links to definitions. The data analyses include direct estimates, deviations, and confidence intervals. A complementary Web site has research results and case studies. The Current Population Survey, which provides monthly information about the labor force and employment, will not be replaced; the ACS will incorporate local and state data from it.

The ACS is distributed in a bundled proprietary product, and this raises several archiving issues. What will be the migration path for these data?

Experience-Education-Interest: A Collaborative Approach to Data Reference and Interpretation

Bobray Bordelon, economics librarian, Princeton University

Bobray Bordelon described the multiple competencies a social science librarian must possess: subject, format, and technological expertise; yet salaries remain low. Since no one librarian can know enough to meet all users' needs, he suggested that subject and data librarians form partnerships. Librarians draw on multiple sources, agencies, and databases to answer queries, and must understand the scope, strengths, and limitations of each source available. Moreover, the queries themselves are often not "standard" in the sense that there is a published answer to which the questioner can be referred; there are myriad difficulties in answering real-world questions in a rapidly changing world of heterogeneous data sources. One can build knowledge through productive interactions with colleagues who possess varying skills and knowledge. Nevertheless, there will be frequent referrals from one professional to another, sometimes within the same question. Bordelon asked whether the increased intervention of the librarian in mediation, evaluating the quality of information, advising students, and participating in research is appropriate.

Bordelon suggested some action items:

  • Develop core competencies for data librarians.
  • Develop Web-based discovery tools.
  • Develop data and services that cut across databases and vendors.
  • Develop systems and tools for services.
  • Establish training programs for non-data librarians.

Judith Rowe, senior data services specialist, Princeton University, facilitated a discussion of criteria for best practices and ways to advance the state of the art in data comparability and interpretation.

The group made the following points:

  • The conversion of heavily used print materials to machine readable form would strengthen the historical record.
  • Preservation of older data series and data recovery need to be high priorities.
  • Training of professionals, including data librarians and subject specialists, is much needed, especially training in data comparability and new tools.
  • Training of users is also important.
  • Automating user services and data retrieval will improve their responsiveness.
  • It would be useful to develop model data policies for accession, access, and archiving.
  • Data librarians should coordinate their efforts with vendors, including establishing consortial pricing, and setting service and quality demands.
  • Can we add value to free public data?
  • It would be good to establish a master plan for coordination of local work/projects.
  • The DLF could be a catalyst for training that complements training offered by ICPSR and IASSIST.
  • The DDI will solve some documentation and archiving issues.
  • The DLF's imprimatur on DDI could help it move forward effectively; public pressure needs to be put on the Census to adopt DDI.

Part 3: Extraction

Toward a Virtual Data Center

Gary King, professor, Department of Government, Harvard University, and Harvard-MIT Data Center

Gary King, who teaches quantitative methods at Harvard, directs the Harvard-MIT Data Center, which maintains a Web site that includes codebooks and data sets. At Harvard, the number of people physically visiting the data center has declined, but data uses have increased significantly. Researchers are seeing and working with more data. He suggests that a way be found to give credit to those who have created data. He would like to see multiple data centers cooperate better. The possibilities of sharing data are compromised when sites each want to control their own data, but have no mechanism to integrate these locally controlled holdings with other data centers.

The NSF Digital Libraries Initiative Program Committee has recommended that Harvard receive an NSF-sponsored (with DARPA/NEH/LC/NLM/NASA) grant that will be used to create the Virtual Data Center (VDC). The software will be packaged on a CD or similar medium so that it can be installed anywhere. The project will create an infrastructure for common interfaces and services across data centers; these centers will be able to serve local holdings and to share data and services seamlessly.

The initial VDC features comprise four categories: data preparation, data access, user interface, and interoperability. Any large-scale production system that operates in an open environment has to come to terms with these features. Yet many features, such as naming, property rights, and payment, raise research problems that are as yet unsolved. Indeed, a full solution to many of these problems can come about only when communities, as a whole, adopt standard approaches. We do not expect to solve these alone, but we intend to create an interim solution for social science data that incorporates insights from previous digital library research to explore how these problems can be approached in a real production system. This interim solution will be one of the first to address a number of digital library issues in a production environment, and so might be used as a production framework for more complete solutions as technologies for naming, metadata, payment, and other services develop. We also expect to produce a framework in which we can develop services for other types of digital objects, such as journal articles, and which will allow us to launch major user studies.
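
A sketch of the "common interfaces and services" idea, under assumed method names (search, retrieve) that are invented for illustration; the actual VDC design may differ. Each center implements the same minimal contract, so a gateway can fan one request out across all member centers.

    from abc import ABC, abstractmethod

    class DataCenter(ABC):
        """A hypothetical minimal contract each member center implements."""

        @abstractmethod
        def search(self, query: str) -> list:
            """Return identifiers of studies matching the query."""

        @abstractmethod
        def retrieve(self, study_id: str) -> bytes:
            """Return the data set (or an extract) for a named study."""

    class LocalCenter(DataCenter):
        def __init__(self, holdings):
            self.holdings = holdings  # {study_id: (title, data)}

        def search(self, query):
            return [sid for sid, (title, _) in self.holdings.items()
                    if query.lower() in title.lower()]

        def retrieve(self, study_id):
            return self.holdings[study_id][1]

    def federated_search(centers, query):
        """The gateway's view: one query fanned out across all centers."""
        return {name: center.search(query) for name, center in centers.items()}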

The long-term funding model remains a challenge. The basic code will be open-source, free, and non-commercial. "Snap-in" modules will allow easy modification and updating of the code. The organizational structure and architecture are designed to require minimal labor at each site. But there still needs to be a source of funds to maintain the system and to provide preservation and persistence services. Should it be established on a consortium model? Should commercial data providers be allowed, or even encouraged, to sell on the system? Should it accept advertising?

In addition to the funding model, there are a number of other institutional issues: the VDC will provide the technology to locate and share data, but what institutions are necessary to assure the persistence of this data? What will be the conventions for assigning unique names to data sets? What are intellectual property rights over data, and how can they be protected? What types of schemes to charge for data and services should be supported? What are good ways to credit researchers for providing data? They envision a community-based, bottom-up approach to these issues.

There are also questions about how to connect data to text. Initially, the VDC will work with dissertations to explore these questions, in collaboration with University Microfilms, Inc.

King invited members of DLF institutions to collaborate in developing the VDC.

Taste Before you Chew: Allowing Users to Browse Data before Downloading

Tom Piazza, Environmental Resources Librarian, Computer-assisted Survey Methods Program of the University of California, Berkeley (http://csa.berkeley.edu:7502/digital/)

Piazza used analogies of libraries and bookstores to explore the range of access to data sets that is possible.

Comparisons of Data Archives with Traditional Libraries and Bookstores

Traditional Library : Bookstore : Data Archive

Closed stacks : Shrink wrap : Download only
Open stacks : Stand and browse : Codebook online
Carrels : Sit and read : Online analysis
Take out a book : Buy a book : Download a data set (or a subset)

He then proceeded to give a demonstration of the Survey Documentation and Analysis System. He browsed a codebook, ran a crosstab, and retrieved a subset of a data set on the fly. The data sets include a title page, general introduction, study description, variables, and appendices.
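
Piazza's demonstration used the Survey Documentation and Analysis system itself; as a rough modern analogue, the sketch below performs the same kinds of on-the-fly operations (crosstab, then subset) with pandas, an assumption of this illustration rather than the software demonstrated.

    import pandas as pd

    # A hypothetical survey extract.
    df = pd.DataFrame({
        "SEX": ["male", "female", "female", "male", "female"],
        "PARTY": ["Dem", "Rep", "Dem", "Dem", "Rep"],
        "AGE": [34, 51, 29, 62, 45],
    })

    # Run a crosstab on the fly ...
    print(pd.crosstab(df["SEX"], df["PARTY"]))

    # ... then retrieve a subset of the data set rather than the whole file.
    subset = df.loc[df["AGE"] >= 40, ["SEX", "PARTY"]]
    subset.to_csv("subset.csv", index=False)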

He described some problems with data archives, again by analogy to libraries and bookstores.

Some Problems Faced by Data Archives Compared with Libraries and Bookstores

Traditional Library : Bookstore : Data Archive

Client can read : Client can read : Client may not have software for analysis or know what to do with a file
Financed from general funds (not users) : Heavy competition from chains : No one's child; competes with others (especially for value-added services)

He listed a variety of next steps:

More options for accessing data subsets are needed, including Stata data definitions and SPSS portable files. Some users need additional analysis capabilities, such as correlation matrices, multiple regression, and the ability to list the contents of selected data records. It would be helpful to allow students and researchers to create their own variables to combine with archive variables; both recoded variables and computed variables would be useful, as sketched below. The Computer-assisted Survey Methods Program seeks additional ideas from its users, including members of the DLF.
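
For instance, a recoded variable and a computed variable combined with archive variables, followed by one of the requested analysis options, a correlation matrix; this sketch makes the same pandas assumption as the example above, with hypothetical values.

    import numpy as np
    import pandas as pd

    # Hypothetical archive variables.
    df = pd.DataFrame({"AGE": [34, 51, 29, 62, 45],
                       "WAGE": [18.0, 31.5, 15.2, 28.0, 22.4]})

    # A user-created recoded variable ...
    df["OVER40"] = (df["AGE"] > 40).astype(int)
    # ... and a computed variable, combined with the archive's variables.
    df["LOG_WAGE"] = np.log(df["WAGE"])

    # A correlation matrix across archive and user-created variables.
    print(df.corr())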

Patrick Yott, director, Geospatial and Statistical Data Center, University of Virginia Library, facilitated a discussion of criteria for best practices and ways to advance the state of the art in extraction.

Questions and issues raised by the group include:

When does a service become a disservice?

What is the role of mediation (if there is one)? Is disintermediation possible? What does mediation mean on the Web?

What are the benefits/strengths of existing data systems? Are these strengths a clue to "best practices"?

What is the purpose of "best practices"; what are we trying to achieve?

Privacy is a concern.

Users, data archivists, and librarians need good documentation to understand data sets.

Functional requirements need to reflect the needs of the discipline(s) as well as those of the information provider.

Are best practices for students different from those of researchers? For example, are researchers more likely than students to want flat files instead of pre-packaged sets?

The group proceeded to explore the possibility of creating a taxonomy of uses and users, proceeding from the novice user, or simple query, to the expert user or difficult query. There are a range of operations that need to be supported, for example:

Get a number → analyze a subset → get and work on an entire (flat) file,

Or there could be a range of users, for example:

Public/ready reference → undergraduates/proto-researchers (the educational function of the archive; helping users gain quantitative literacy) → research users

And there are a variety of services, for example:

Documentation → discovery → exploratory analysis → delivery.

At the level of the simplest query, or look-up, as well as at the level of the sophisticated researcher doing complex regressions, minimal services are needed. But the proto-researcher, who is learning about quantitative research, needs consultation about extraction, analysis, and visualization, and will require substantial levels of service. The group concluded that the state of the art needs to be most advanced for beginning and proto-researchers. They need help from people experienced in the "reference interview," and would benefit from methods assistance.

Summary and next steps

Don Waters, director, Digital Library Federation, summarized the results of the meeting and described next steps.

First, he reiterated that "the Federation is you," meaning that the DLF facilitates collaborative efforts, builds upon them, and relies on the sponsors to set its agenda and direction, and to carry out its work. The Federation recognizes many models of cooperation, and it considers digital libraries very broadly. The DLF is interested in advancing the services of the digital library. A key concern is how to make workable libraries out of information that is "born digital." Social science data are a good example of data born digital. But they also fall into the category of information that, although born digital, may need to be digitized, or reborn, to become useful within a broad digital library framework. Social science data, especially time series and printed codebooks, present challenges of dealing with a body of material of differing types.

The DLF wants to focus on reducing barriers by catalyzing efforts to address problems. What are the strategic efforts that will improve users' experience with digital libraries, particularly social science data libraries? How can systems be constructed so that scholars can depend on an organized body of knowledge, preserved over time? Social science data provide a perspective on scholarly communication complementing scholarly publishing: in many cases, scholars create the data and also publish results from it.

The DLF has an interest in extending bodies of research and teaching resources to new constituencies to strengthen libraries' public service role. Social science data are a prime resource to be exploited more widely when made more easily accessible and usable.

Waters noted five themes from the workshop:

  1. user requirements;
  2. staff development;
  3. mechanics of cooperation;
  4. discovery; and
  5. interfaces.

User requirements: A recurring theme of the workshop was the need to assess users' needs as we design digital social science data libraries. A key question is how do we learn what users' requirements are? To be effective, we must focus on disciplinary and pedagogical needs. Our systems must be designed to meet the needs of specialized scholars and our services tailored to promote the educational mission. As we gain understanding of users' needs, it will be necessary to make these needs known to data providers in order to sustain a common level of services among data resources from various sources.

Staff development: Digital libraries place new demands on librarians. But building digital libraries of quantitative data introduces even more demands, requires the development of new competencies, and increases the dimensions of collaboration required to be effective. Although the DLF is not a training agency, it could work with others and lobby to get programs established. The DLF is particularly interested in training tomorrow's leaders. The Council on Library and Information Resources has established the Frye Leadership Institute to address the changing nature and requirements of leadership in twenty-first-century libraries and information technology. It might be desirable for the Frye Institute to include in its curriculum a segment that addresses data management and archiving issues.

Mechanics of cooperation: The workshop brought out the need for institutions to collaborate more effectively in several areas. Creating distributed virtual data archives requires close coordination on collection development policies and preservation to be technically feasible. Moreover, it implies the creation of very different service models than in the past. How can we develop these service models collaboratively? If patterns of collecting, use, and expense are not similar among institutions (or perhaps even if they are) the collective must establish its own fair economic models. Finally, it will be desirable to act collectively to develop model licenses, to monitor and contain the cost of data, and to avoid the transfer of public data into the private sphere.

Discovery: Discovery proceeds at many different levels, most of which are amenable to cooperative action. These include documentation, cataloging of data sets, digitization and mark-up of codebooks, indexing of resources, and development of gateways. Waters focused on the possibilities for collaborative action in the cataloging of data sets. How can we create information about data sets economically, then distribute it effectively and efficiently? How can we compensate one another for the work that is done? How do we distribute records? How do we distribute cataloging skill? Do we have the resources and infrastructure locally to implement the DDI? Who should do this? There is a key role for the social science community in proposing a focused effort, one that recognizes the heterogeneous environment in which we find ourselves.

Waters then addressed the larger question of discovery. How can we establish robust gateways? These gateways will rely on careful documentation of sources, deep and rich indexing, and services appropriate to the users and resources.

Interfaces: For the distributed archive to work effectively for users, it must present a common interface. Collective effort will be required to achieve this, including understanding of user needs, adoption of best practices, and shared resources. The interfaces of the various repositories must facilitate comparability and interpretation. Moreover, common practices will facilitate preservation of the data. Creation of systems that can support undergraduate education is very strategic; it focuses on an area of increasing attention in higher education, and may reduce aggregate costs in the long run if tackled collaboratively.

The interface question was addressed in several ways in the workshop, and there are several alternatives. One model (perhaps like Tom Piazza's) would be to design highly-organized data sets, interfaces, and tools that could be shared among institutions. In contrast, Gary King's model focuses on a robust collection of data sets in standard formats, within repositories that have pledged to maintain archival responsibilities. In the first model, a key question is how can independent repositories be encouraged to follow common practices at a detailed level? In the second, the question is how to layer coherent services on the distributed archive.

Finally, it is clear that librarians need guides to data resources. Where are data on various topics? Which are best for particular purposes (e.g., novice vs. research use)? What development projects are underway? It might be useful for the DLF to convene key group leaders from various institutions to share information about various projects, and to gain their commitment to maintain such a registry. In addition, preservation of data sets anticipated to be in future demand will be of growing importance. How can we know who has committed to preserving various data sets?

Ultimately, the challenge will be to convert individual projects into operational services for the greater good.

Waters enumerated three areas in which he foresees follow-up action:

  1. Cataloging
  2. Virtual Data Center cooperation
  3. Registry of Projects