DIGITAL LIBRARY FEDERATION
Report of the meeting held January 27-28, 1999
WORKSHOP ON SOCIAL SCIENCE DATA ARCHIVES
On January 27-28, 1999, the Digital Library Federation (DLF)
hosted its first workshop on the state of the art of digital
libraries in the social sciences. The purpose of the workshop was
to explore current problems and emerging solutions in three areas:
- facilities for users to discover and retrieve relevant and
related data sets;
- means for users to interpret and evaluate the comparability
of data sets; and
- tools for and methods of data extraction for analysis.
The DLF convened experts in the development of social science
digital libraries and managers of social science data archives to
reach common understandings about the problems and potential
solutions in the three focal areas, and to establish a
collaborative environment for advancing digital data libraries.
Experts defined the state of the art and participants joined in
identifying criteria for best practices with the objective of
identifying work that the DLF can undertake to advance the state
of the art.
Welcome and Introductions
The workshop opened with welcoming statements from Kevin
Barry, head of the Social Science Reference Center, Princeton
University; Karin Trainer, university librarian, Princeton
University; and Donald Waters, director of the Digital Library
Federation.
Mr. Barry opened the workshop with a description of
Princeton's Social Science Reference Center services. In 1981 the
social science reference function was transferred to the Social
Science Reference Center, located in the Library, a move that
improved services. The Center serves undergraduates, graduate
students, and faculty, particularly those in the social sciences,
and helps students with their senior theses, an increasing number
of which are quantitative. The Center brings together human
expertise, myriad data and information resources, and services
such as statistical analysis consulting.
Barry welcomed the assembled group and expressed his hopes and
expectation that the meeting would foster collaboration among the
participants and improve services to users, including helping
them to become more self-sufficient through the use of better
Welcoming the workshop participants to Princeton University,
Karin Trainer encouraged collaboration to improve services
and to leverage the investments and expertise of individual
institutions. She acknowledged the DLF's contributions to the
state of our collective knowledge about digital libraries, to the
development of best practices, and to the goals of improved
services to users.
Don Waters thanked Princeton University for hosting the
meeting and reinforced the collective belief that significant
gains can be made through collaborative action. He recognized and
thanked planning committee members Kevin Barry, Ann Green (Yale),
Lauris Olson (University of Pennsylvania), and Rebecca Graham
Waters explained that DLF initiatives focus on the need to
reduce barriers to cooperation. The initiatives are catalytic,
not operational. They aim to foster organizational and
project-based efforts toward further progress. The DLF has
focused on architecture and infrastructure (for example,
intellectual property and access management, naming, rights and
permission protocols, metadata, and linking). He described
specific projects such as the DLF/NISO workshop on naming, a
workshop cosponsored with NISO to identify metadata for images,
studies in digital preservation, and initiatives in scholarly
communication (including working with faculty in art history, and
divinity and theological institutions).
The Social Science Data Initiative goes back to the founding
of the DLF. Growing out of an organizational initiative among
Berkeley, Harvard, Stanford, UCSD, and Oregon State, it has
expanded to include the assembled institutions (for a list of
participants, see http://www.diglib.org/collections/ssda/ssdaparticipants.htm).
Whereas earlier efforts attempted to develop a comprehensive plan
for data and services, the current effort will focus on
identifying a set of steps that can provide improved services in
the short term and build a foundation for the future. Sid Verba
of Harvard, who serves on the Board of the Council on Library and
Information Resources (CLIR), has been a particularly strong
advocate of developing online social science data libraries.
Waters described the goals of the workshop. It is urgent to
develop improved social science data services for research,
public service, and teaching. He hoped that the workshop would be
intimate and interactive, and that it aim its deliberations at
the managerial level to inform future action. The workshop is
organized into three sessions:
- discovery and retrieval;
- comparability and interpretation;
- data extraction.
Although other areas are also important for social science
data services, they are secondary to the topics above.
Waters challenged the group to think about what might
characterize excellent services. What might be done together or
separately to advance the state of the art of digital libraries
in the social sciences? What resources are needed? He emphasized
the ultimate goals of fostering communication among scholars and
data literacy for students.
Part 1: Discovery and retrieval
From Catalog to Portal: Integrating Software, Metadata,
Resources, and Services
Richard Rockwell, executive director, Inter-university
Consortium for Political and Social Research (ICPSR)
Rockwell described the transition in ICPSR's use of metadata.
ICPSR is creating a new approach: a portal that provides an
integrated set of services to a tightly focused community. The
ICPSR's portal provides quantitative social science data
resources for research and instruction (it does not provide text
or images). It supports retrieval and use of data, provides
training and user support, and allows discovery. Through its
services, one can progress from data, through information, to
knowledge. User support, which is a very important component in
the ICPSR portal, is not frequently prominent in other
Today's vastly improved computing environment can deliver data
instantaneously with documentation and services, and thus allows
the portal to integrate software, metadata, resources, and
services. Rockwell compared the current state of affairs with
that of the recent past, when reels of magnetic tape had to be
loaded into mainframe computers for use, and metadata were found
in printed volumes.
The ICPSR is interested in all social science data resources,
whether they are ICPSR data themselves, resources from trusted
data archives, or data from original data producers. ICPSR will
make all types of data available through its portal.
Rockwell provided a tutorial on the subject of data archives.
Social science data archives include data from surveys, censuses,
administrative records, direct observation, diaries, and the
like. The data consist of logical records containing the
measurements for each of the units of observation of the
variables (e.g., persons: sex, political party, occupation;
organizations: number of employees, annual sales).
Data are unusable without documentation. To progress from raw
data to information, the researcher must use a statistical
analysis program (or spreadsheet) to produce analytical results,
then use the data documentation (codebook) to interpret the
The codebook is metadata, which includes multiple
levels of information about a data set. At the highest level is
archive-level information, indicating the data gatherer
and data set creator, the price, conditions of use, copyright,
and other agreements. Next are the study-level metadata,
indicating what kind of data are in the set, such as survey,
census, administrative records, and years covered. Third are the
variable/measurement metadata, indicating the variables,
sample design, units of measurement, and how the measurements
were made and recorded. Thus, the codebook provides the essential
metadata that make the data usable. Traditionally, the codebook
has been a print product accompanying the data file.
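The three levels of codebook metadata described above can be pictured as nested records. The following Python sketch is illustrative only: the class names, field names, and sample values are assumptions made for this example, and are not drawn from any ICPSR system or from the DDI.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VariableMetadata:
    """Variable-level metadata: what was measured and how it was coded."""
    name: str
    label: str
    codes: Dict[int, str] = field(default_factory=dict)


@dataclass
class StudyMetadata:
    """Study-level metadata: the kind of data and the years covered."""
    title: str
    kind: str
    years: str
    variables: List[VariableMetadata] = field(default_factory=list)


@dataclass
class ArchiveMetadata:
    """Archive-level metadata: who gathered the data and the terms of use."""
    data_gatherer: str
    conditions_of_use: str
    studies: List[StudyMetadata] = field(default_factory=list)


def decode(study: StudyMetadata, variable: str, code: int) -> str:
    """Translate a raw code in the data file into its documented meaning."""
    for var in study.variables:
        if var.name == variable:
            return var.codes.get(code, "undocumented code")
    raise KeyError(f"no variable named {variable!r} in {study.title!r}")


# Hypothetical example data, invented for this sketch.
party = VariableMetadata(
    "PARTY", "Political party", {1: "Party A", 2: "Party B", 9: "Missing"}
)
study = StudyMetadata("Example Survey", "survey", "1996-1998", [party])
archive = ArchiveMetadata(
    "Example Research Group", "research and teaching only", [study]
)

print(decode(study, "PARTY", 2))
```

Without the variable-level record, the value 2 in the data file carries no meaning, which is the sense in which data are unusable without documentation.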
Rockwell suggested that as social science data become more
available over the network, and as library users begin to expect
services in support of their use of data, librarians will need to
move into substantively new areas of responsibility and gain new
competencies. Helping students and faculty find information in
printed sources is qualitatively different from helping them to
use data. For example, one does not need to instruct users in
reading or help them understand the structure of a book. In
contrast, the social science data librarian must be prepared to
help users understand a stratified random sample, how to weight
various types of data, and how to choose tools of analysis.
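As one concrete instance of the weighting problem mentioned above, consider a stratified sample in which the strata are sampled at different rates. The sketch below assumes simple random sampling within each stratum and weights each observation by the stratum's population size divided by its sample size. The function name and all of the numbers are invented for illustration.

```python
def stratified_mean(strata):
    """Estimate a population mean from a stratified sample.

    `strata` is a list of (population_size, sample_values) pairs.
    Each sampled unit stands for population_size / len(sample_values)
    units in the population.
    """
    weighted_total = 0.0
    weight_sum = 0.0
    for population_size, sample in strata:
        weight = population_size / len(sample)
        weighted_total += weight * sum(sample)
        weight_sum += weight * len(sample)
    return weighted_total / weight_sum


# Hypothetical strata: a large stratum sampled lightly and a small
# stratum sampled heavily.
strata = [
    (900, [10, 20]),
    (100, [100, 100, 100, 100]),
]

unweighted = sum(sum(s) for _, s in strata) / sum(len(s) for _, s in strata)
weighted = stratified_mean(strata)
print(unweighted, weighted)
```

In this made-up example the unweighted mean of the six observations is about 71.7, while the weighted estimate is 23.5, because the small stratum was sampled far more heavily than the large one.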
Although the codebook serves as the principal metadata for
social science data archives, some information is captured in
higher-level metadata, such as a catalog. To produce a seamless
system for data exploitation, finding aids must operate on both
the catalog and the codebooks.
Originally, the ICPSR produced an annual printed catalog,
which ultimately reached 915 pages. Later, the catalog was
converted into a SPIRES electronic database and, in the early
1990s, moved first to Gopher and then to the Web. But in all of
these cases, the catalog included only study-level metadata,
not variable-level information.
The OSIRIS statistical package provided structured markup of
codebooks, and ICPSR was able to convert this markup into a
"Variables" database. But OSIRIS was too restrictive and
simplistic. It was not a standard in libraries or IT communities,
or with data producers outside the Institute for Social Research
(ISR), and thus could not be readily exchanged or archived.
Today, the community is working to establish best practices
for digital codebooks. This effort is called the Data
Documentation Initiative (DDI). It is intended to be more
flexible than the OSIRIS approach, operate in a native Web
environment using standard Web browsers, incorporate both
study-level and variable-level metadata in one document database,
and use the XML dialect of SGML. Final revisions are now being
made to DDI release 1.0.
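To make the idea of holding study-level and variable-level metadata in one XML document more concrete, here is a minimal Python sketch. The element and attribute names (codebook, study, variable, question) are invented for this example and should not be read as the DDI DTD; the sample content is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical codebook document; the tag names are assumptions,
# not the element names defined by the DDI.
CODEBOOK_XML = """
<codebook>
  <study>
    <title>Hypothetical Election Survey</title>
    <kind>survey</kind>
  </study>
  <variables>
    <variable name="V1">
      <question>Which political party do you usually support?</question>
    </variable>
    <variable name="V2">
      <question>What is your current occupation?</question>
    </variable>
  </variables>
</codebook>
"""


def find_variables(xml_text, term):
    """Return (study title, variable name, question) for matching questions."""
    root = ET.fromstring(xml_text)
    title = root.findtext("study/title")
    hits = []
    for var in root.iter("variable"):
        question = var.findtext("question", default="")
        if term.lower() in question.lower():
            hits.append((title, var.get("name"), question))
    return hits


for hit in find_variables(CODEBOOK_XML, "party"):
    print(hit)
```

Because the markup separates study-level from variable-level elements, a single query can search question wording and still report which study each variable belongs to.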
From the beginning, in 1993, the organization and funding of
DDI have been international and have involved both data producers
and data archivists. The initiative was funded initially by
ICPSR, then by a National Science Foundation (NSF) grant in 1996.
NESSTAR, a European Community-funded project, has been using the
DDI prototype standard. In February 1999, ICPSR will issue an RFP
for beta testers of the DDI. The developers would like to test as
many different types of codebooks as possible, and to enlist as
broad participation as possible. To view the DDI, go to http://www.icpsr.umich.edu/DDI/codebook.html.
The DDI includes the following kinds of study-level data:
- Agency or principal investigator(s);
- Persons or organizations responsible for data
- Date and geographic location of data collection and the time
- Index or table of contents;
- Technical information on files;
- Official title of the data collection;
- Project description;
- Response rate;
- Unit(s) of analysis/observation;
- Flowchart of the data collection instrument;
- Restrictions on use of the data;
- Funding sources and related acknowledgments, where
- Sample and sampling procedures;
- Data source(s);
- Data collection instruments;
- List of abbreviations and other conventions;
- Bibliographic citations.
It includes the following kinds of variable-level
- Precise wording of the question or the exact meaning of the
- Missing data codes;
- Exact meaning of codes;
- The item or questionnaire number (e.g., Question 3a);
- Imputation and editing information;
- Unweighted frequency distributions or summary statistics for
Rockwell addressed the question: Why create yet another
document type definition (DTD)? He explained that the Text
Encoding Initiative (TEI) created a DTD to be used for texts,
primarily humanities texts, that bears no cognitive relationship
to quantitative data.
Similarly, the encoded archival description (EAD) is a finding
aid for objects such as textual materials, visual materials,
recordings, and artifacts.
The DDI developers have conformed whenever possible to the
Dublin Core for study-level metadata. But the DDI has been
developed specifically to provide access to quantitative data and
to conform to the cognitive structure of those data. The DDI
responds to broader needs than the internal requirements of
ICPSR. Although the Web held promise for improving access to
quantitative data, by the mid-1990s it was becoming an
unorganized collection of information of highly variable and
often dubious quality, quite transient in nature.
It was clear that for information to be easily found,
navigated, and used, there was a need for cognitive structure,
which was lacking in HTML-based Web documents. It had also become
obvious that free-text searching was a poor way to retrieve
information, since it had low levels of precision and recall.
The 1996 proposal to NSF argued that a logical document
structure was indispensable to data discovery and
retrieval, a point on which there is widespread agreement
The DDI still faces several problems. To ensure success,
mechanisms for long-term governance and maintenance of the DDI
must be developed, data producers must be encouraged to adopt it,
and automated or semi-automated markup programs need to be
A continuing challenge will be to keep the DDI focused on
quantitative data sets. Each domain has its own cognitive
structure, and developers of the DDI believe that it would be a
mistake to attempt to extend it to additional domains having
different cognitive structures.
Rockwell explained that the DDI promises to enable
quantitative data users to search data resources worldwide in a
single probe and to access both study-level and variable-level
information within a single system. With DDI, language will not
matter, at least in the context of European languages, if
NESSTAR/ICPSR succeeds. Through the DDI, the user can immediately
conduct an analysis of the data. These analyses can be elementary
or exploratory; analytical extracts can be drawn; customized
documentation can be generated; and a variety of views of the
data can be created. In the longer term, expert systems might be
developed to aid researchers in the choice of data. Lowering the
barriers to access and use of quantitative data can expand
Rockwell described how the Web has transformed the nature of
the data archive. No longer are complete data archives maintained
at a central location, or in several locations. Today, some are
directly distributed and housed by data producers; others are
housed at a variety of institutions in a virtual, or distributed,
archive. The distributed system of data archiving and access
poses the risk of a fragmented information system. But the
solution is not necessarily to import all data sets into
centralized data archives. The alternative is to provide expertly
reviewed linkages to data resources outside of the major data
archives. Thus, the meaning of the term data archive
changes from a data collection to a data
According to Rockwell, professional archiving is very
important, but archives need to be information providers,
not just data providers. The creation of a system of
distributed archives, following a common set of best practices,
will lead to greater precision and specificity in finding and
using data, and will allow the potential for analyses across data
Major issues remain in creating a virtual international data
library. National differences in access to data create barriers
across boundaries. The information costs may inhibit widespread
access and use. Systems must be established to ensure
confidentiality and privacy protection. Data security and
authenticity must be assured.
We have learned several lessons from our experience with
networked quantitative data. Users expect openness and
transparency in the systems and have low tolerance for systems
that are difficult to use. Data are still hard to find; we must
empower users to discover data resources, wherever they are,
whatever they contain. There is a very high potential for
erroneous use of data by naïve users.
We now know that the union catalog for social science data
archives will be distributed, virtual, and dynamic. Data archives
will shift from primarily enabling retrieval of data to also
enabling retrieval of information (i.e., to interactive
New standards will make distributed systems transparent to the
user. There is a great need for and receptivity to the idea of
developing common metadata standards, but there is less
receptivity to adopting those standards.
Rockwell suggested that the community must resist the urge to
revise standards continually to meet some additional community's
needs, to extend standards beyond their original scope, and to
make all standards dependent upon each other. But it must make
interoperability its goal.
The pilot project has demonstrated that far too little money
is being invested in preparation of metadata, and that in the
long run, far too much will be invested unless the process is
automated or semi-automated.
The overall conclusion of the DDI project is that logically
structured metadata based upon expert domain-specific knowledge
is indispensable to accurate and efficient discovery, retrieval,
and use of Web documents, including social science data.
Rockwell believes that the DLF can help the community make the
DDI a best practice by persuading publishers, including data
producers, to do markup. Moreover, he believes that the DLF can
facilitate quantitative research in the social sciences by
educating librarians about social science data and metadata, and
by encouraging them to gain new skills.
Integrating Access to Distributed Networked
Daniel Greenstein, Arts and Humanities Data Service,
King's College, London
The Arts and Humanities Data Service (AHDS) is funded by the
Joint Information Systems Committee (JISC) and emphasizes data
archiving. It manages a heterogeneous collection that is
distributed, interdisciplinary, and variously formatted and
cataloged at differing "collection" levels. The AHDS wants to
establish a single point of discovery, access, and retrieval for
its own resources and for those from third-party providers.
Greenstein described how different disciplines have different
data needs and perspectives on information. The question for AHDS
is how to present heterogeneous collections in a unified way. How
can one facilitate the resource discovery process so that users
can search across domains in which information providers and
domain specialists use very different standards? Interoperability
is a goal of AHDS, despite an environment that is cross
disciplinary, has heterogeneous collections and heterogeneous
formats, and has varying levels of collection or record.
The following are examples of what is included in AHDS:
- The Archaeology Data Service (York, GIS etc., NGDF, FDI
- The History Data Service (Essex, databases and statistical
sets, DDI, CHESHIRE/SGML)
- The Oxford Text Archive (Oxford, e-texts and linguistic
corpora, TEI, PAT/SGML)
- The Performing Arts Data Service (Glasgow, film and sound,
- The Visual Arts Data Service (Surrey, images, VRA,
- Large number of possible third-party configurations
In deciding how to present diverse collections in a unified
way, it is first useful to review users' requirements. Users of
AHDS want integrated access to information resources,
irrespective of place, format, and curatorial tradition. They
need rich search and retrieval capabilities, and seamless
interfaces between discovery, delivery, and use. They also want
the ability to configure the network environment for themselves.
What are some of the opportunities to accomplish this?
Cataloging standards have emerged from a variety of domains,
including libraries, data archives, museums and heritage
organizations, and the geospatial disciplines. In addition, the
momentum behind the Dublin Core raises the possibility of a
resource discovery metadata standard that has the potential to
improve cross-domain search, retrieval, and use.
The Z39.50 network application protocol promises easy exchange
of information between systems, and there is robust
organizational momentum behind interoperability.
Testbeds include OCLC, CIMI, RLG, NordInfo, and UKOLN.
Variations among users and data are common problems and not
unique to AHDS. AHDS is interested in performing research and
development on these ubiquitous issues of discovery and retrieval
and architecture. It is working on two fronts: developing
cataloging and resource discovery metadata, and developing
appropriate information architectures.
Domain-specific metadata that can provide rich descriptive
information already exist, such as ISAD(G) or EAD, MARC, NGDF
(FGDC), CIMI/CIDOC/SPECTRUM, DDI, and TEI Header.
How can one make the various types of metadata interoperable?
Perhaps Dublin Core Metadata, representing fundamental
information, can be the initial source for resource discovery.
But it would be necessary to achieve consensus on this point.
Therefore, AHDS sponsored six workshops to answer four
- What are the major cataloging standards within your
- In light of those standards and your knowledge of user
behavior, what are your core resource discovery requirements?
(These are first-order requirements, not subsequent filters and
- How do your core resource discovery requirements map onto the
- Evaluate the Dublin Core in light of your requirements.
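The mapping question posed to the workshops can be pictured as a crosswalk: each domain keeps its own rich description, and only a few core fields are mapped onto shared elements for discovery. The Python sketch below assumes three Dublin Core elements (title, creator, date) as the shared layer. The domain scheme names, their field names, and the sample records are hypothetical and do not reproduce any AHDS mapping.

```python
# Hypothetical crosswalks from domain-specific field names to shared
# discovery elements. Fields without a mapping stay in the domain record.
CROSSWALKS = {
    "codebook": {
        "study_title": "title",
        "investigator": "creator",
        "years_covered": "date",
    },
    "text_header": {
        "work_title": "title",
        "author": "creator",
    },
}


def to_common(scheme, record):
    """Project a domain record onto the shared discovery elements."""
    mapping = CROSSWALKS[scheme]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def search(collection, element, term):
    """Search every record, whatever its scheme, on one shared element."""
    hits = []
    for scheme, record in collection:
        common = to_common(scheme, record)
        if term.lower() in common.get(element, "").lower():
            hits.append(common)
    return hits


# Invented sample records from two different descriptive traditions.
COLLECTION = [
    ("codebook", {"study_title": "Census of Example County",
                  "investigator": "A. Researcher",
                  "years_covered": "1851",
                  "response_rate": "0.9"}),
    ("text_header", {"work_title": "Letters from Example County",
                     "author": "B. Writer"}),
    ("text_header", {"work_title": "Unrelated Poems",
                     "author": "C. Poet"}),
]

for hit in search(COLLECTION, "title", "example county"):
    print(hit["title"])
```

The design choice illustrated here is that the shared layer serves first-order discovery only: domain-specific detail such as the response rate is not mapped and remains available for later filtering within the originating service.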
Greenstein then explained the AHDS architecture and
demonstrated various resources including the Archaeology Data
Service, which uses the Cheshire system, and the AHDS gateway,
which provides access to all data services, including the History
Data Service, the Performing Arts Data Service, and the Visual
Arts Data Service.
Having realized many of its original goals in the current AHDS
system, AHDS is now considering second-order issues. For example,
there is a need to ensure that the system is scalable; that it
can handle more information providers, data resources, and users;
and that it can continue to provide appropriate user support.
To ensure continuing responsiveness to users' needs, AHDS is
analyzing users' behavior. It is exploring the possibilities of
assisted searching, working on Gateway-to-Gateway
interoperability, and developing domain-specific Dublin Core
implementation guidelines written from a cross-domain
perspective. Finally, AHDS is developing Z39.50 Implementation
Reviewing what has been accomplished, AHDS has identified some
key hurdles. First, it has been difficult to involve users in a
meaningful way. Second, it has been difficult to focus serious
efforts on cross-domain issues. A continuing challenge is to
build consensus around communities and then communities around
consensus. It has been challenging to develop realistic
expectations of emerging technologies and their suppliers. And,
finally, the costs of facilitating collaboration and consensus
are always higher than expected.
AHDS is continuing to work with trusted information providers
on collaborative filtering, cognitive structures, and "meta
standards." A clear lesson has been that guidelines for
implementing standards are required to achieve the consistency
that a standard is meant to provide.
Libby Stevenson, standing in for Ann Green, facilitated
a group discussion on criteria for best practices and ways to
advance the state of the art in discovery and retrieval.
Library and data center collaborations have been strengthened
in recent years, facilitated by the Web. Data integration work is
being done through gateways such as CESSDA and ICPSR. Although
item-level searches work well, there are still considerable
challenges in cross-item searching.
A number of questions, issues, and concerns arose during the
discussion. There is a clear need for cross-codebook searching,
and this will be possible only if there is standardized coding.
One needs to be realistic about what gateways can achieve in this
formative period, before clear best practices have been
- How much further integration of multiple, distributed,
catalogs will be possible, and will this be possible in real
- Can the codebook feed a search?
- Can one perform an intermediate analysis and use the results
to refine a search?
- What do we know about the state of the art in discovery and
- What might we still need to learn and from whom?
- What are the implications of what we do know?
- What do we need to know about users and intermediation?
- Where are areas of synergy and collaboration?
- Can we identify paths and collaborations?
- What is our motivation?
  - To get started?
  - To make progress?
- How do we involve the producers (including faculty and
- We need better documentation about data sets. How can we
track updates and achieve version control?
- How can we facilitate producers' work?
- How can we deal with varying levels of quality?
- Who are the other stakeholders? (e.g., IASSIST)
- Who are the funding bodies and what strategic leverage might
they offer over producers?
- What is our public accountability?
- How can we ensure reusability of data (archiving)?
- How do we share expertise? Communicate intergenerationally
about data? Develop a common language?
- How can we help to facilitate new forms of scholarly
The ICPSR Data Preparation Manual is intended to be a
resource for best practices. Can it be both a "club" (making its
use a condition of receiving extramural funds), and a carrot (an
inducement because it makes data more accessible)? Should funders
of sponsored research require that data sets resulting from that
research be submitted in a standard format with a DDI
We need to build expertise in version control (should we use
the Digital Object Identifier, or DOI?) and, at a minimum, tag
erroneous versions of data. It was suggested that eradicating the
erroneous data indiscriminately would be unwise because there may
be references or citations to those data.
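The suggestion to tag erroneous versions rather than delete them can be sketched as a small registry in which a flagged version stays resolvable and points to its replacement. This Python sketch is only one possible reading of that suggestion: the class names and the identifier strings are hypothetical, and the identifiers do not follow DOI syntax.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DatasetVersion:
    identifier: str
    erroneous: bool = False
    superseded_by: Optional[str] = None


class VersionRegistry:
    """Keeps every released version so that old citations still resolve."""

    def __init__(self) -> None:
        self._versions: Dict[str, DatasetVersion] = {}

    def register(self, identifier: str) -> None:
        self._versions[identifier] = DatasetVersion(identifier)

    def flag_erroneous(self, identifier: str, replacement: str) -> None:
        """Tag a version as erroneous and point to its replacement."""
        if replacement not in self._versions:
            raise KeyError(f"unknown replacement {replacement!r}")
        old = self._versions[identifier]
        old.erroneous = True
        old.superseded_by = replacement

    def resolve(self, identifier: str) -> DatasetVersion:
        """Look up a cited version; flagged versions are still returned."""
        return self._versions[identifier]


# Hypothetical identifiers, invented for this sketch.
registry = VersionRegistry()
registry.register("example-study/v1")
registry.register("example-study/v2")
registry.flag_erroneous("example-study/v1", "example-study/v2")

cited = registry.resolve("example-study/v1")
print(cited.erroneous, cited.superseded_by)
```

A reader following a citation to the first version still finds a record, along with a warning and a pointer to the corrected release.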
How do we deal with the problems of legacy data, such as
pre-machine-readable codebooks on deteriorating paper?
What about collection development? How many copies of a data
set does the country need? How many versions? Where should they
be stored? Local replication is no longer necessary or desirable
in all cases. But do local service needs require multiple
How should we define the roles and responsibilities of
archives, information providers, and local services? What
distinctions need to be made between archives and access
What is cataloging? Of objects? Of works? Does ownership
The user needs more information about a data set than a highly
abstracted MARC record provides. In the online environment, the record can
be a direct link to the information itself rather than merely a
surrogate for the object. What is the relationship of traditional
catalog entries to owned vs. remote data? To web tools and search
engines? To portals? The traditional catalog function of
collocation is not well served by the multitudinous discovery
tools and repositories that exist.
During lunch, Henry Farber, Hughes-Rogers Professor of
Economics at Princeton, spoke on the user's perspective.
Dr. Farber described himself as a labor economist who views
the world through data and analysis. According to Farber,
economics is a systematic way of interpreting data. At Princeton,
there are 150 seniors per year in the Economics Department, and
at least half write a data-based thesis.
Because data accessibility is the key to effective research,
the researcher needs a readily available catalog of data sets. It
is not necessary for each institution to duplicate data sets,
although heavily used sets probably need to be housed onsite.
Codebooks are essential if the data sets are to be usable. To
improve access, a consortium should share the work of scanning
the data and maintaining a number of paper documentation
Reflecting on his own research, Farber said that he wants his
data online and instantly available, although he concedes that
perhaps little-used data could be in less accessible storage. In
his view, "tapes are history." Like other sophisticated users, he
wants raw, not processed, data. A problem with common extraction
tools is that they process the data. He can use Unix servers,
command line access and manipulation, and FTP. He would like easy
extraction and retrieval tools that present flat ASCII files. He
would like to see systems that don't force people to use SAS or
SPSS. The community needs to address problems of data sets that
are available only through proprietary extraction tools. There is
a need for better authorization of users, particularly for remote
Although data can be anywhere, Farber believes that services
must be local. For example, students need training to use data
effectively. They must learn how to collect data, keep a lab
notebook, define the protocols of their research, name all
variables, and name all values. They must ensure that results are
reproducible (this may rule out on-the-fly analysis). The results
must be reproducible from the existing data set and for a new
Farber ended by suggesting that the scholarly reward structure
be adjusted to include deposit of primary source data.
Part 2: Data comparability and interpretation
Turning Numbers Into Information from the Products of
the American Community Survey
Cynthia Taeuber and Elaine Quesinberry, American
Community Survey, U.S. Census Bureau
The Census Bureau is conducting the American Community Survey
(ACS). By 2003, it will be conducted in every county and will sample
three million addresses per year. By 2010, it will replace the
decennial census long form, and will provide annual data to the
census tract level. The survey will simplify the decennial census
and improve data by using professional interviewers.
The American Community Survey is being produced on CD-ROM, and
is described as a "one-stop" shopping center for American
demographic information. It includes a tutorial and a guided
tour. The data sets include hot links to definitions. The data
analyses include direct estimate, deviation, and confidence
intervals. A complementary Web site has research results and case
studies. The Current Population Survey, which provides monthly
information about the labor force and employment, will not be
replaced. The ACS will incorporate local and state data from the
Current Population Survey.
The ACS is distributed in a bundled proprietary product, and
this raises several archiving issues. What will be the migration
path for these data?
Experience-Education-Interest: A Collaborative Approach
to Data Reference and Interpretation
Bobray Bordelon, economics librarian, Princeton
Bobray Bordelon described the multiple competencies that a
social science librarian must possess (subject, format, and
technological), although salaries remain low. He suggested that,
since no one librarian can know enough to meet all of the users'
needs, subject and data librarians should form partnerships. He
described librarians as using multiple sources from multiple
agencies, and multiple databases, to answer queries. They must
understand the scope, strengths, and limitations of each of the
sources available. Moreover, the queries themselves are often not
"standard" in the sense that there is a published answer to which
the questioner can be referred. There are myriad difficulties in
answering real-world questions in a rapidly changing world of
heterogeneous data sources. One can build knowledge through
productive interactions with colleagues who possess varying
skills and knowledge. Nevertheless, there will be frequent
referrals from one professional to another, sometimes within the
same question. Bordelon asked whether the increased intervention
of the librarian in mediation, evaluating the quality of
information, advising students, and participating in research is
Bordelon suggested some action items:
- Develop core competencies for data librarians.
- Develop Web-based discovery tools.
- Develop data and services that cut across databases and
- Develop systems and tools for services.
- Establish training programs for non-data librarians.
Judith Rowe, senior data services specialist, Princeton
University, facilitated a discussion of criteria for best
practices and ways to advance the state of the art in data
comparability and interpretation.
The group made the following points:
- The conversion of heavily used print materials to machine
readable form would strengthen the historical record.
- Preservation of older data series and data recovery need to
be high priorities.
- Training of professionals, including data librarians and
subject specialists, is much needed, especially training in data
comparability and new tools.
- Training of users is also important.
- Automating user services and data retrieval will improve
- It would be useful to develop model data policies for
accession, access, and archiving.
- Data librarians should coordinate their efforts with vendors,
including establishing consortial pricing, and setting service
and quality demands.
- Can we add value to free public data?
- It would be good to establish a master plan for coordination
of local work/projects.
- The DLF could be a catalyst for training that complements
training offered by ICPSR and IASSIST.
- The DDI will solve some documentation and archiving
- The DLF's imprimatur on DDI could help it move forward
effectively; public pressure needs to be put on the Census to
Part 3: Extraction
Toward a Virtual Data Center
Gary King, professor, Department of Government, Harvard
University, and Harvard-MIT Data Center
Gary King, who teaches quantitative methods at Harvard,
directs the Harvard-MIT Data Center, which maintains a Web site
that includes codebooks and data sets. At Harvard, the number of
people physically visiting the data center has declined, but data
uses have increased significantly. Researchers are seeing and
working with more data. He suggests that a way be found to give
credit to those who have created data. He would like to see
multiple data centers cooperate better. The possibilities of
sharing data are compromised when sites each want to control
their own data, but have no mechanism to integrate these locally
controlled holdings with other data centers.
The NSF Digital Libraries Initiative Program Committee has
recommended that Harvard receive an NSF-sponsored (with
DARPA/NEH/LC/NLM/NASA) grant that will be used to create the
Virtual Data Center (VDC). The software will be packaged on a
CD or other medium that will allow it to be installed anywhere. The
project will create an infrastructure for common interfaces and
services across data centers; these centers will be able to serve
local holdings and to seamlessly share data and services.
The initial VDC features comprise four categories: data
preparation, data access, user interface, and interoperability.
Any large-scale production system that operates in an open
environment has to come to terms with these features. Yet many
features, such as naming, property rights, and payment, raise
research problems that are as yet unsolved. Indeed, a full
solution to many of these problems can come about only when
communities, as a whole, adopt standard approaches. We do not
expect to solve these alone, but we intend to create an interim
solution for social science data that incorporates insights from
previous digital library research to explore how these problems
can be approached in a real production system. This interim
solution will be one of the first to address a number of digital
library issues in a production environment, and so might be used
as a production framework for more complete solutions as
technologies for naming, metadata, payment and other services
develop. We also expect to produce a framework in which we can
develop services for other types of digital objects, such as
journal articles, and which will allow us to launch major
The long-term funding model remains a challenge. The basic
code will be open-source, free, and non-commercial. "Snap-in"
modules will allow easy modification and updating of the code.
The organizational structure and architecture are designed to
foster minimum labor at each site. But there still needs to be a
source of funds to maintain the system and to provide
preservation and persistence services. Should it be established
on a consortium model? Should commercial data providers be
allowed, or even encouraged, to sell on the system? Should it
In addition to the funding model, there are a number of other
institutional issues: the VDC will provide the technology to
locate and share data, but what institutions are necessary to
assure the persistence of this data? What will be the conventions
for assigning unique names to data sets? What are intellectual
property rights over data, and how can they be protected? What
types of schemes to charge for data and services should be
supported? What are good ways to credit researchers for providing
data? They envision a community-based, bottom-up approach to
There are also questions about how to connect data to text.
Initially, the VDC will work with dissertations to explore these
questions, in collaboration with University Microfilms, Inc.
King invited members of DLF institutions to collaborate in
developing the VDC.
Taste Before you Chew: Allowing Users to Browse Data
Tom Piazza, Environmental Resources Librarian,
Computer-assisted Survey Methods Program of the University of
California, Berkeley (http://csa.berkeley.edu:7502/digital/)
Piazza used analogies of libraries and bookstores to explore
the range of access to data sets that is possible.
Comparisons of Data Archives with Traditional Libraries and
Traditional Library : Bookstore : Data Archive
Closed stacks : Shrink wrap : Download only
Open stacks : Stand and browse : Codebook online
Carrels : Sit and read : Online Analysis
Take out a book : Buy a book : Download a data set (or a
He then proceeded to give a demonstration of the Survey
Documentation and Analysis System. He browsed a codebook, ran a
crosstab, and retrieved a subset of a data set on the fly. The
data sets include a title page, general introduction, study
description, variables, and appendices.
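The two operations demonstrated, running a crosstab and retrieving a subset on the fly, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Survey Documentation and Analysis System itself. The records, variable names, and helper functions below are all invented for the example.

```python
from collections import Counter

# Hypothetical survey records; variable names and values are invented.
RECORDS = [
    {"region": "East", "employed": "yes", "age": 34},
    {"region": "East", "employed": "no", "age": 52},
    {"region": "West", "employed": "yes", "age": 41},
    {"region": "West", "employed": "yes", "age": 29},
    {"region": "East", "employed": "yes", "age": 63},
]


def crosstab(records, row_var, col_var):
    """Count records for each (row value, column value) pair."""
    return Counter((r[row_var], r[col_var]) for r in records)


def subset(records, variables, keep=lambda r: True):
    """Return only the chosen variables for records passing the filter."""
    return [{v: r[v] for v in variables} for r in records if keep(r)]


def format_table(counts):
    """Render the counts as a tab-separated table with sorted labels."""
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    lines = ["\t".join([""] + cols)]
    for row in rows:
        # A Counter returns 0 for combinations that never occur.
        lines.append("\t".join([row] + [str(counts[(row, c)]) for c in cols]))
    return "\n".join(lines)


if __name__ == "__main__":
    table = crosstab(RECORDS, "region", "employed")
    print(format_table(table))
    print(subset(RECORDS, ["region", "age"], keep=lambda r: r["age"] < 50))
```

Running the analysis where the data live means the user does not need a local statistical package, and the subset step returns only the variables and cases actually wanted.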
He described some problems with data archives, again by
analogy to libraries and bookstores.
Some Problems Faced by Data Archives Compared with
Libraries and Bookstores
Traditional Library : Bookstore : Data Archive
Client can read : Client can read : Client may not have software for
  analysis or know what to do with a file
Financed from general funds (not users) : Heavy competition from
  chains : No one's child; competes with other (especially for
  value-added services)
He listed a variety of next steps:
More options for accessing data subsets are needed, including
Stata data definitions and SPSS portable files. Some users need
additional analysis programs such as correlation matrices,
multiple regression, and the ability to list contents of selected
data records. It would be helpful to allow students and
researchers to create their own variables, to combine with
archive variables. Both recoded variables and computed variables
would be helpful. The Computer-assisted Survey Methods Program
seeks additional ideas from its users, including members of the
Patrick Yott, director, Geospatial and Statistical Data
Center, University of Virginia Library, facilitated a discussion
of criteria for best practices and ways to advance the state of
the art in extraction.
Questions and issues raised by the group include:
- When does a service become a disservice?
- What is the role of mediation (if there is one)? Is
disintermediation possible? What does mediation mean on the
- What are the benefits/strengths of existing data systems? Are
these strengths a clue to "best practices"?
- What is the purpose of "best practices"; what are we trying to
- Privacy is a concern.
- Users, data archivists, and librarians need good documentation
to understand data sets.
- Functional requirements need to reflect the needs of the
discipline(s) as well as those of the information provider.
- Are best practices for students different from those of
researchers? For example, are researchers more likely than
students to want flat files instead of pre-packaged sets?
The group proceeded to explore the possibility of creating a
taxonomy of uses and users, proceeding from the novice user, or
simple query, to the expert user or difficult query. There is a
range of operations that need to be supported, for example:
Get a number -> analyze a subset -> get and work on an entire
(flat) file.
Or there could be a range of users, for example:
Public/ready reference -> undergraduates/proto-researchers (the
educational function of the archive; helping users gain
quantitative literacy) -> research users.
And there is a variety of services, for example:
Documentation -> discovery -> exploratory analysis -> delivery.
At the level of the simplest query, or
look-up, as well as at the level of the sophisticated researcher
doing complex regressions, minimal services are needed. But the
proto-researcher, who is learning about quantitative research,
needs consultation about extraction, analysis, and visualization,
and will require substantial levels of services. The group
concluded that the state of the art needs to be most advanced for
beginning and proto-researchers. They need help from people
experienced in the "reference interview," and would benefit from
Summary and next steps
Don Waters, director, Digital
Library Federation, summarized the results of the meeting and
described next steps.
First, he reiterated that "the
Federation is you," meaning that the DLF facilitates
collaborative efforts, builds upon them, and relies on the
sponsors to set its agenda and direction, and to carry out its
work. The Federation recognizes many models of cooperation, and
it considers digital libraries very broadly. The DLF is
interested in advancing the services of the digital library. A
key concern is how to make workable libraries out of information
that is "born digital." Social science data are a good example of
data born digital. But they also fall into the category of
information that, although born digital, may need to be
digitized, or reborn, to become useful within a broad digital
library framework. Social science data, especially time series
and printed codebooks, present challenges of dealing with a body
of material of differing types.
The DLF wants to focus on reducing
barriers by catalyzing efforts to address problems. What are the
strategic efforts that will improve users' experience with
digital libraries, particularly social science data libraries?
How can systems be constructed so that scholars can depend on an
organized body of knowledge, preserved over time? Social science
data provide a perspective on scholarly communication
complementing scholarly publishing: in many cases, scholars
create the data and also publish results from it.
The DLF has an interest in extending
bodies of research and teaching resources to new constituencies
to strengthen libraries' public service role. Social science data
are a prime resource to be exploited more widely when made more
easily accessible and usable.
Waters noted five themes from the
- user requirements;
- staff development;
- mechanics of cooperation;
- discovery; and
- interfaces.
User requirements: A recurring
theme of the workshop was the need to assess users' needs as we
design digital social science data libraries. A key question is
how do we learn what users' requirements are? To be effective, we
must focus on disciplinary and pedagogical needs. Our systems
must be designed to meet the needs of specialized scholars and
our services tailored to promote the educational mission. As we
gain understanding of users' needs, it will be necessary to make
these needs known to data providers in order to sustain a common
level of services among data resources from various
Staff development: Digital
libraries place new demands on librarians. But building digital
libraries of quantitative data introduces even more demands,
requires development of new competencies, and increases the
dimensions of collaboration required to be effective. Although
the DLF is not a training agency, it could work with others and lobby
to get programs established. The DLF is particularly interested
in training tomorrow's leaders. The Council on Library and
Information Resources has established the Frye Leadership
Institute to address the changing nature and requirements of
leadership in twenty-first century libraries and information
technology. It might be desirable for the Frye Institute to
include in its curriculum a segment that addresses data
management and archiving issues.
Mechanics of cooperation: The
workshop brought out the need for institutions to collaborate
more effectively in several areas. Creating distributed virtual
data archives requires close coordination on collection
development policies and preservation to be technically feasible.
Moreover, it implies the creation of very different service
models than in the past. How can we develop these service models
collaboratively? If patterns of collecting, use, and expense are
not similar among institutions (or perhaps even if they are) the
collective must establish its own fair economic models. Finally,
it will be desirable to act collectively to develop model
licenses, to monitor and contain the cost of data, and to avoid
the transfer of public data into the private sphere.
Discovery: Discovery proceeds at
many different levels, most of which are amenable to cooperative
action. These include documentation, cataloging of data sets,
digitization and mark-up of codebooks, indexing of resources, and
development of gateways. Waters focused on the possibilities for
collaborative action in the cataloging of data sets. How can we
create information about data sets economically, then distribute
it effectively and efficiently? How can we compensate one another
for the work that is done? How do we distribute records? How do
we distribute cataloging skill? Do we have the resources and
infrastructure locally to implement the DDI? Who should do this?
There is a key role for the social science community in proposing
a focused effort, one that recognizes the heterogeneous
environment in which we find ourselves.
Waters then addressed the larger for
discovery. How can we establish robust gateways? These gateways
will rely on careful documentation of sources, deep and rich
indexing, and services appropriate to the users and
Interfaces: For the distributed
archive to work effectively for users, it must present a common
interface. Collective effort will be required to achieve this,
including understanding of user needs, adoption of best
practices, and shared resources. The interfaces of the various
repositories must facilitate comparability and interpretation.
Moreover, common practices will facilitate preservation of the
data. Creating systems that can support undergraduate
education is strategic: it focuses on an area of increasing
attention in higher education, and may reduce aggregate costs in
the long run if tackled collaboratively.
The interface question was addressed in
several ways in the workshop, and there are several alternatives.
One model (perhaps like Tom Piazza's) would be to design
highly-organized data sets, interfaces, and tools that could be
shared among institutions. In contrast, Gary King's model focuses
on a robust collection of data sets in standard formats, within
repositories that have pledged to maintain archival
responsibilities. In the first model, a key question is how can
independent repositories be encouraged to follow common practices
at a detailed level? In the second, the question is how to layer
coherent services on the distributed archive.
Finally, it is clear that librarians
need guides to data resources. Where are data on various topics?
Which are best for particular purposes (e.g., novice vs. research
use)? What development projects are underway? It might be useful
for the DLF to convene key group leaders from various
institutions to share information about various projects, and to
gain their commitment to maintain such a registry. In addition,
preservation of data sets anticipated to be in future demand will
be of growing importance. How can we know who has committed to
preserving various data sets?
Ultimately, the challenge will be to
convert individual projects into operational services for the
Waters enumerated three areas in which
he foresees follow-up action:
For further information please consult the
- Virtual Data Center
- Registry of Projects