Supporting Access to Diverse and Distributed Finding Aids: A Final Report to the Digital Library Federation on the Distributed Finding Aid Server Project

Project genesis and background
Overview of project
Distributed searching vs. searching locally replicated databases
Common Access Points
Display and navigation of results
Display and navigation of finding aids
Conclusions and directions for further research
Notes
Attendees at April meeting
References

Project genesis and background

The DFAS project sought to establish an architecture and build a prototype for distributed finding aid searching at a number of DLF institutions. Distributed access to finding aids created and housed at a number of different institutions is an important component in building a viable digital library.

The project began with the recognition that finding aids encoded in SGML are proving to be a significant part of the metadata strategy in the emerging digital library. The Encoded Archival Description (EAD) DTD used for the SGML markup of finding aids allows for considerable diversity of practice and does not prescribe a particular approach to encoding a finding aid. The diversity of SGML markup of finding aids has at least two dimensions: the organization of the markup, and the detail in encoding. In the first case, different institutions may choose different approaches to encoding the intellectual structure of their documents (the order and type of tags applied to the text), leading to sometimes disparate results when rendered online, and to potential problems identifying the location of elements for indexing. In the second case, institutions may choose different practices regarding how much markup to use (for example of names, dates, places, etc.) in their documents, leading to differences of search results.

The RLG Archival Collection Guides search system takes a hybrid approach to this problem. Markup is not prescribed (although RLG does publish guidelines that strongly suggest the use of some tags), finding aids are periodically gathered from local institutions using a web spider, and full SGML display from the local institution is a selectable option. Even so, this system still exhibits the primary problems of trying to homogenize search and display of finding aids across institutions: display of finding aids from some institutions omits some sections that were marked up in unexpected ways; and indexing works inconsistently across the collection, leading to unexpected search results. Another significant feature of the RLG implementation is the lack of ability to retrieve portions of a finding aid. Especially as these collections grow richer, the possibility of encountering large finding aids (e.g., several hundred pages) will make the prospect of transferring the entire finding aid as a result unattractive. Moreover, this sort of imprecise feedback to the user undermines part of the value of having used structured encoding. Finally, the RLG model both removes local control of display and behavior and removes some of the impetus for the community of stakeholders to arrive at a consensus about practice and standards.

The DFAS participants believe that many larger institutions will demand the greater flexibility of defining their own markup practices using the EAD (which we know to be desirable in the heterogeneous world of archival finding aids), and will want to provide a search interface to their finding aid collections that best reflects that practice and the collections represented. In order to test possibilities for accommodating this diversity of markup practice and the ability of institutions to house their own finding aids in their own digital library systems, while still allowing searching across institutional collections, we have built a prototype instance of a Distributed Finding Aids Server (DFAS).

Overview of project

In the original project proposal we asserted that at the end of the project we hoped to have learned

Whether or not distributed searching is a reasonable model for cross-collection searching of SGML-encoded finding aids;
What indexes are minimally useful in a cross-collection search of encoded finding aids;
What are useful approaches to handling indexing across diversely encoded finding aids to ensure reasonable search results;
What are useful approaches to managing intermediate result sets from cross-collection searches using the distributed model;
What are useful approaches to managing extremely large finding aids by presenting them as navigable structures, and what are the display screens needed to make this useful;
How to optimize display across heterogeneously encoded finding aids to ensure consistent results.

As part of our fourth goal, we committed to exploring questions like: what should be the behavior and appearance of such a cross-collection search? Should the user be presented with a WebZ-style set of results (e.g., six hits in collection A; twenty hits in collection B; eight hits in collection C)? When should the system begin to display actual results grouped by institution? Should the system ever try to integrate results from disparate collections into one merged list?

In the past eighteen months, in order to meet these goals, project staff at the University of Michigan Library’s Digital Library Production Service have

written the middleware to facilitate search and display of the finding aids
collected 330 finding aids from the participant institutions to populate the system
visited each of the participant institutions to install and configure the system software
coordinated the selection of a set of trial common access points
designed and refined the display of search results

The project culminated with a meeting of representatives of the participating institutions in April of 1999 (see list of participants). At that meeting, the participants revisited issues of distributed searching, resolved several outstanding questions regarding the use of Common Access Points, evaluated the methods in place for display of results and made recommendations for the final report, as well as suggested directions for further research.

Distributed searching vs. searching locally replicated databases

In our discussions throughout the project, we have leaned toward the view that the network transport issue of cross collection search is tertiary to the cross collection query expression process and the results behavior. Although we recognized the importance of studying the transportation problems raised by searching across geographically distributed collections, our belief was that we first need to devise methods for integrative searching of the collections. If we are unable to do so successfully, the transportation issues will be moot. Although this assumption was correct and necessary in order to proceed with the system development, it underestimated, as is discussed below, the importance of distribution as part of query expression and results behavior.

Initially creating local copies of the databases did not in any way preclude or exclude geographical distribution of the collections and true broadcast searching. The mechanisms used to get the databases to cooperate at a short distance should work at a long distance, but initial problems and impediments were easier to spot and resolve with locally stored copies. In addition, while developers were learning how the system should and did work, we believed local copies would be a convenience.

We proposed that each institution have the same kind of base middleware, installed and configured by the DLPS programming staff, and each institution host its own (crawler produced) copies of the collections. Moreover, each institution has control over its own search interface and display engine. The institutions were welcome to make use of the DLPS search and display prototypes, but we assumed that they would want to do some customization to meet local needs. Thus, each institution holds a current version of the institution's own database of finding aids, and a relatively recent, crawled version of the other institutions' databases, a complete vocabulary map, the general DFAS middleware, and the local customizations. Each institution also has in place local policies as to whether the local or remote databases will be searched, dependent on factors such as response time, time of day, or current demand on local resources. Local policies also govern how often the databases of other participants are replicated.
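
The local policy decision described above might be sketched as follows. This is only an illustration, not the DFAS middleware: the `SearchPolicy` fields, the threshold values, the peak-hours window, and the order in which the factors are checked are all assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class SearchPolicy:
    """Hypothetical local policy; every field and default is an assumption."""
    max_remote_latency_s: float = 2.0  # slower remote responses fall back to the local copy
    max_local_load: float = 0.8        # above this load, spare local resources
    peak_start: time = time(9, 0)      # assumed peak window at the remote site
    peak_end: time = time(17, 0)


def choose_target(policy, remote_latency_s, now, local_load):
    """Return "remote" (search the originating institution) or "local" (search the crawled copy)."""
    if local_load > policy.max_local_load:
        return "remote"   # current demand on local resources is too high
    if remote_latency_s > policy.max_remote_latency_s:
        return "local"    # remote response time is unacceptable
    if policy.peak_start <= now < policy.peak_end:
        return "local"    # time of day: avoid the remote server during its peak hours
    return "remote"       # otherwise prefer the most current data
```

For example, `choose_target(SearchPolicy(), 0.5, time(22, 0), 0.2)` selects the remote database, while the same call with a latency of 5.0 seconds selects the local copy.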

At the DFAS project meeting, a number of concerns were raised about the value of replication as opposed to a distributed search. For example, the participants from Oxford pointed out that the costs of replication (storage space, the cycles needed to re-index) may be nearly as high as, or as high as, the costs of true distributed searching. The most important issue raised, however, was a methodological one: by using replication, we remove local control and the relevance of such features as CAPs (see the Common Access Points section). While the distributed approach uses mechanisms like the CAPs to bridge local encoding practice and generalized methods for retrieval, the use of replication takes the idiosyncrasies of local encoding practices and moves them wholesale into another institution’s server without the mediating interpretation of the originating institution. Importantly, a focus on distributed search, with mechanisms like CAPs, at least theoretically allows other search engines and indexing practices to be included (thus the importance of articulating an API for this layer of the DFAS system). The group felt, overwhelmingly, that distribution and replication were not equivalent, as had been asserted in the proposal, and that further work should focus primarily on distribution.

Nevertheless, replication has significant value where the institution that has created finding aids is not able to mount an effective system for retrieval, or where consortial interests are strong. The group posited several alternatives that might be explored to alleviate some of those costs, such as the creation of national or regional nodes of finding aids which could be indexed and searched more economically than if these activities were conducted at each institution, or the creation of mirror sites. The group discussion also led to the recognition that there is information about the finding aids that needs to be conveyed as well as the finding aids themselves. For example, flagging new and revised finding aids might save some of the cost of replicating the entirety of all the databases at each replication period. Finally, we need a means of communicating to users when replication and re-indexing last took place so that they will know whether there is new material available to their searches.
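
The idea of flagging new and revised finding aids can be illustrated with a small sketch. It assumes, hypothetically, that each institution publishes a manifest mapping finding aid identifiers to ISO-format revision dates; no such manifest exists in the DFAS system as described, and the function name is invented for this example.

```python
def plan_replication(remote_manifest, local_manifest):
    """Compare manifests of {finding_aid_id: "YYYY-MM-DD"} and return (to_fetch, to_delete).

    ISO dates compare correctly as plain strings, so no date parsing is needed.
    """
    to_fetch = sorted(
        fid for fid, revised in remote_manifest.items()
        if local_manifest.get(fid, "") < revised   # new, or revised since the last copy
    )
    to_delete = sorted(fid for fid in local_manifest if fid not in remote_manifest)
    return to_fetch, to_delete
```

Only the identifiers in `to_fetch` would need to be transferred and re-indexed, rather than the entire database at each replication period.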

These questions are of considerable value in the pursuit of access to distributed and diverse finding aids and worthy of further consideration. While our decision to use local copies of the databases was necessary in order to focus on resolving problems of cross-collection search and display, a true distributed search holds much more promise in simultaneously facilitating the diversity of practice at the local level and ensuring that users are able to search the holdings of dozens of institutions from a single interface. Both distributed searching and replication have value for providing access to finding aids at a large number of institutions; finding the best use for each is an important area of future investigation.

Common Access Points

One of the main goals of this project was to begin experimentation with the concept of distributed searching across heterogeneous collections of EAD-encoded finding aids. The idea was to see whether a Z39.50-type approach could be successful with this type of material, as opposed to the union catalog approach. Lacking a Z39.50 profile for finding aids, we began the project by defining a set of "Common Access Points" (CAPs) which would be available at each participating institution for use in the DFAS search interface. These CAPs would represent "indexes" which would be given generic, or "synthetic", names in the system but which could be called whatever seemed appropriate in the local institution’s DFAS interface. These CAPs are essential to creating a generalized layer of access that is not specific to any one institution nor to any one method of encoding.

One project goal has been that each institution could define the EAD elements to be included under the various synthetic names however they saw fit (as is done in Z39.50). Initially it was necessary to prescribe which EAD elements would be included under each synthetic name. This raised the problem of having two sets of indexes in the local interface: one for the locally defined indexes and one for DFAS. In Harvard’s case this led to the existence of two access points named "Name (people and organizations)", each containing a different set of EAD elements. In an effort to identify this difference for users, Harvard changed the name of the DFAS access point. The same was true at Harvard for the "Titles" access point, since the DFAS definition of this index is all types of titles, while Harvard’s is just the titles of works found in the collections. This duality of access points and their definitions makes local interface design and user education difficult.

The trade-off, however, is clear. In both scenarios the underlying finding aids differ from institution to institution in structure, content, and choice of EAD markup. If local institutions can define what they mean by one of the CAPs (by mapping it to a local index) then a DFAS search might retrieve completely different things at two different institutions. This is something which happens often in the Z39.50 world and is considered a significant flaw in that technology. The variability of underlying EAD markup makes the problem far worse for finding aids. But if local institutions are required to create special DFAS access points which are used by the DFAS interface then there is no logical difference from the union catalog case. The best option would seem to be allowing each institution to define its own contribution to each CAP, but to continue the discussion among all participants about the usefulness of each CAP and why they should or should not include various eligible EAD elements. The CAPs and the EAD elements they include could be used as a starting point for that discussion. To begin and stimulate this discussion, as well as to develop a working production system, a white paper was produced describing the issues involved in this effort, and discussing the current use of access points at the five institutions. It was discovered that even among just five institutions using the same OpenText software there was wide variation in definition. After brief discussion, a set of nine CAPs was chosen to begin with; they are listed below.

Synthetic term  Comments                                  Included elements
Names           names of all kinds                        *NAME
Dates           dates                                     UNITDATE, DATE
Titles          titles of all kinds                       *TITLE
Places          place names                               GEOGNAME
Subjects        broad conception of subject information   *NAME, SUBJECT, CONTROLACCESS, OCCUPATION, GENREFORM, PHYSDESC
Repository                                                REPOSITORY
Contents        item level descriptions                   C*, DENTRY
Summary         addressing overall scope of the material  ARCHDESC (see note 1)
Anywhere        not including EAD header                  FINDAID (beta) or EAD (version 1) minus EADHEADER

Comments on the above:

Names

In EAD, generic identifiers which include the term NAME cover general names <NAME>, family names <FAMNAME>, personal names <PERSNAME>, corporate names <CORPNAME>, and places <GEOGNAME>. Thus place names appear in three CAPs: Names, Places, and Subjects, while all types of names show up in the Names CAP and the Subjects CAP.

It is not clear whether users will consider place names to be a type of "name" such that <GEOGNAME> should be included in the Names CAP, nor is it known whether users will distinguish names as such versus names as subjects. In an effort to accommodate all user assumptions we left this CAP very inclusive, but in order to test the utility of a more restrictive CAP the most recent version of the system will define a "Controlled Names" CAP to include the *NAME elements in a CONTROLACCESS element (usually limited to AACR2 forms of names, and only those important to the collection).
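
The overlap among the Names, Places, and Subjects CAPs can be checked mechanically. The sketch below encodes the element patterns from the table as shell-style wildcards. It is an illustration under stated assumptions, not the project's vocabulary map: `C*` is interpreted here as the component elements `C` and `C01` through `C12` (a literal wildcard would also catch CONTROLACCESS and CORPNAME), and Summary and Anywhere are omitted because they are defined structurally rather than by element name.

```python
from fnmatch import fnmatchcase

# Patterns transcribed from the CAP table; the Contents patterns are an assumed reading of "C*".
CAP_PATTERNS = {
    "Names": ["*NAME"],
    "Dates": ["UNITDATE", "DATE"],
    "Titles": ["*TITLE"],
    "Places": ["GEOGNAME"],
    "Subjects": ["*NAME", "SUBJECT", "CONTROLACCESS", "OCCUPATION", "GENREFORM", "PHYSDESC"],
    "Repository": ["REPOSITORY"],
    "Contents": ["C", "C[0-9][0-9]", "DENTRY"],
}


def caps_for_element(tag):
    """Return the CAPs whose patterns match an EAD element name, in table order."""
    tag = tag.upper()
    return [
        cap for cap, patterns in CAP_PATTERNS.items()
        if any(fnmatchcase(tag, pattern) for pattern in patterns)
    ]
```

Calling `caps_for_element("GEOGNAME")` returns `["Names", "Places", "Subjects"]`, the three-way overlap discussed above.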

Subjects

There is little consensus about what kinds of data from the finding aid should be included. For example, the folder titles found in finding aid container lists (tagged as <UNITTITLE>) provide a sort of subject access but were not included here. In fact, almost every element of the finding aid might be considered eligible for this access point. And there is, to complicate matters, a <SUBJECT> element in the EAD for those terms or phrases that the encoder feels to be particularly important, as a subject, in the finding aid. There seem to be two approaches to this: restrictive and inclusive. In the first, the Subjects CAP would include only the <SUBJECT> elements (similar to the definition of the Places CAP to include only <GEOGNAME> elements). In the second, it would include everything that might be considered a subject by a user. In the former, there is no ambiguity, but if the encoder did not choose to use the <SUBJECT> element, often valuable information might be inaccessible. In the latter, Subjects becomes almost synonymous with the Anywhere CAP, and it would be difficult to educate users as to when to use one over the other. Another school of thought was to eliminate this CAP as being too ambiguous to be useful.

Currently the Subject CAP has been left in place, in an effort to give users as much access as possible. At the April meeting, we agreed to remove the GENREFORM and PHYSDESC elements from the CAP, since these seemed to be clearly outside the scope of traditional subject indexes. In order to test the usefulness of the inclusive versus restrictive Subjects CAP we will create another "Controlled Subjects" CAP which will include all SUBJECT, OCCUPATION, and *NAME elements found in the CONTROLACCESS element.
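
A restrictive "Controlled Subjects" selection of this kind can be sketched with a standard XML parser. The example assumes the finding aid has been converted to well-formed XML with lowercase element names (the project's finding aids were SGML); the sample document and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: one name in running text, several terms under CONTROLACCESS.
SAMPLE = """<ead><archdesc>
  <bioghist><p>Papers of <persname>Jane Doe</persname>.</p></bioghist>
  <controlaccess>
    <persname>Doe, Jane, 1900-1980</persname>
    <subject>Labor unions</subject>
    <genreform>Diaries</genreform>
    <controlaccess><geogname>Detroit (Mich.)</geogname></controlaccess>
  </controlaccess>
</archdesc></ead>"""


def controlled_subject_terms(xml_text):
    """Collect SUBJECT, OCCUPATION, and *NAME terms that are direct children of a CONTROLACCESS."""
    root = ET.fromstring(xml_text)
    terms = []
    for block in root.iter("controlaccess"):   # visits nested blocks as well
        for child in block:                    # direct children only
            tag = child.tag.lower()
            if tag in ("subject", "occupation") or tag.endswith("name"):
                terms.append("".join(child.itertext()).strip())
    return terms
```

On the sample, the name in the biographical note and the GENREFORM term are left out, while the three controlled terms are returned.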

Titles

A Titles CAP should be carefully defined to include only the set of title types that fit together logically. The concept of a title is somewhat complicated in EAD, since there are four distinct types of titles to be considered: the title of the collection (<ARCHDESC><DID><UNITTITLE>), the title of the finding aid itself (<TITLEPROPER>), the title of each folder in the collection (<DSC><C><DID><UNITTITLE>), and titles of works which may be contained in the collection (<TITLE>). Each of these has different uses, and in particular the last title type (work titles contained in the collection) would be considered by most researchers to be very different from the first three types. UNITTITLE is now excluded from this CAP, which will restrict it to titles found in the collection rather than folder titles which are largely invented by the collection processor.

Dates

Dates are indisputably valuable in searching finding aids, but until we have consistent methods for expressing or normalizing dates that value will be undermined by our practices. "Dates" is another CAP that could be eliminated to avoid confusion, or more tightly constrained to include one single (required) date in the underlying finding aids (for example, the collection dates which are found in the <ARCHDESC><DID><UNITDATE> element). "Dates" is a notoriously difficult access point to define in databases which contain "natural" language documents, as opposed to databases with tightly constrained schemas which control when dates are used and how they are represented. In finding aids dates may (or may not) appear in many forms (e.g., January 1, 1900; 1/1/00; Jan 1 ‘00, 1900-01, ‘00-’01, Jan - Dec 1900, etc.). Most SGML-based systems do not even attempt to normalize dates for indexing unless they appear in a very particular way (e.g., 1900-1910 might be indexed as a range, with each intervening year indexed separately). Furthermore, many finding aids do not contain useful dates in all the places where they might occur, or they contain the dates but they were not encoded as such in EAD. All of this leads to confusion by users who try to limit their searches to a particular date, year, or year range. Using all the dates found in the finding aid and tagged as such optimizes access by users to this information, but at the cost of confusion as to what is included and when, and it may produce misleading results if the system fails to recognize a useful date and reports no hits to the user.
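
The range-expansion behavior mentioned above (indexing each intervening year of 1900-1910) might look like the following sketch. It recognizes only four-digit years and four-digit year ranges; two-digit forms such as 1/1/00 or ‘00-’01 are deliberately left unrecognized, which is exactly the kind of silent miss the paragraph warns about. The function name and patterns are assumptions made for this illustration.

```python
import re

YEAR_RANGE = re.compile(r"\b(\d{4})\s*-\s*(\d{4})\b")
YEAR = re.compile(r"\b(\d{4})\b")


def index_years(text):
    """Return the sorted list of years under which a date string would be indexed."""
    years = set()
    for start, end in YEAR_RANGE.findall(text):
        first, last = int(start), int(end)
        if first <= last:
            years.update(range(first, last + 1))      # every intervening year
    years.update(int(y) for y in YEAR.findall(text))  # standalone four-digit years
    return sorted(years)
```

Here `index_years("1900-1910")` yields the eleven years from 1900 through 1910, `index_years("Jan - Dec 1900")` yields only 1900, and `index_years("1/1/00")` yields nothing at all.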

Contents and Summary

These are both CAPs which include structural components of the finding aids. In the first case it covers the section of the finding aid called the "container list" or "inventory", and in the second case it covers the descriptive frontmatter of the finding aid, which puts the collection in a context and provides background information on it. Whether users will understand these structural distinctions, and how well particular finding aids conform to this structure, remains to be seen. These CAPs may be eliminated after further study.

Display and navigation of results

If the CAPs facilitate successful search and retrieval of diversely encoded finding aids, the question of how to display the results of this search still remains. We would like to institute a three-tiered display of results that is based upon a set of configurable policies. Navigation of the results will be enabled by a frames-based HTML display. The three tiers will be:

Summary level: The summary level will give a high-level hits summary indicating both total number of matches in total number of finding aids and matches grouped by institution. The user will have the option of viewing all the matching finding aids from a single institution or moving into the results list level for a view of all finding aids. The summary screen will appear in the left-hand navigational frame and will remain constant throughout the user’s interaction with this set of search results (i.e., until "new search" is selected).
Results list level: The results list will be a list of matches either clustered by institution or merged and ordered alphabetically. The ordering principle of the results list will be decided upon by each participating institution and configured in its middleware. Selecting a match from the results list will take the user to the record level results.
Record level: Record level retrieval returns a "hits in context" view of the finding aid. At this point the user can choose to go directly to the pertinent part of the finding aid, retrieve an outline view of the entire finding aid, or view the full text of the finding aid.
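
The summary counts and the two ordering principles for the results list can be sketched briefly. Hits are represented here as (institution, title) pairs, which is an assumption made for this illustration, as are the function names and the `mode` values.

```python
from collections import Counter


def summarize(hits):
    """Count matching finding aids per institution for the summary level."""
    return Counter(institution for institution, _title in hits)


def order_results(hits, mode="clustered"):
    """Order (institution, title) pairs for the results list level."""
    if mode == "merged":
        # One alphabetical list across all institutions.
        return sorted(hits, key=lambda hit: hit[1].casefold())
    if mode == "clustered":
        # Grouped by institution, alphabetical within each group.
        return sorted(hits, key=lambda hit: (hit[0].casefold(), hit[1].casefold()))
    raise ValueError(f"unknown ordering mode: {mode!r}")
```

Each institution would set `mode` once in its local configuration, in keeping with the principle that the ordering is a local decision.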

This record level "hits in context" view is central to DFAS's commitment to enabling user retrieval of part of a finding aid. At the participant meeting it became clear that this was an area of investigation that we had not staked out thoroughly enough. We believe that this display should be embedded in a structural view of the finding aid and recognize that the system needs a way of advocating this. One of the concluding pieces of work on the system has been the configuration of the middleware so as to make this view a default that has to be over-ridden rather than an option that can be locally activated.

Key to this approach is the notion of setting policies to indicate which level of display of results is appropriate. If there is only one match on a search, we may well want to display that finding aid directly. If there are several matches but all from one institution, we will want the system to immediately provide a results list level display in the right-hand, content-bearing frame. Rather than impose arbitrary threshold numbers that may or may not make sense given variable local user needs, we would like to make the numbers configurable at the local level. For example, one institution might choose to move from results list level to summary level when the number of hits exceeds 25, while another might not make that switch until there are more than 100 hits. All institutions will have these multiple views available to them; how to employ them will be a local decision. Similarly, if an institution would like to avoid dependence on frames-based display, the system can be configured locally to choose among the display levels depending on the number of matches to a query.
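
A locally configurable display policy along these lines could be expressed as a single function. The rules follow the paragraph above; the level names, the default threshold of 25, and the precedence of the single-institution rule over the threshold are assumptions of this sketch.

```python
def choose_display_level(hits_by_institution, summary_threshold=25):
    """Pick the initial display level from per-institution match counts.

    hits_by_institution maps an institution name to its number of matching finding aids.
    summary_threshold is the locally configured switch point (25 at one site, 100 at another).
    """
    total = sum(hits_by_institution.values())
    if total == 0:
        return "no_results"
    if total == 1:
        return "record"            # a single match: show that finding aid directly
    with_hits = [name for name, count in hits_by_institution.items() if count > 0]
    if len(with_hits) == 1:
        return "results_list"      # several matches, all from one institution
    return "summary" if total > summary_threshold else "results_list"
```

With the WebZ-style example of six, twenty, and eight hits in three collections, a threshold of 25 opens at the summary level, while a threshold of 100 opens directly at the results list.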

In discussions at the project meeting and in some local discussions at the University of Michigan with archivists, the point has been repeatedly raised that the institution may well not be the primary point of intellectual identification for the researcher -- instead it may be the archive, particularly when an institution hosts multiple archives. EAD markup supports identification of archives, but in practice this is not widely or consistently used. We need to consider whether there is anything we can do to provide for representation by archive rather than institution and what kind of context we can provide for users to make informed choices about which institutions to search.

Display and navigation of finding aids

The display of the finding aids themselves has not been a high priority for the DFAS project. We believe that rendering in HTML will disappear as an implementation problem. Moreover, our experience has led us to believe that, though not without faults, our current methods are good enough and that problems of search and navigation are much more compelling. That said, we have considered it important to accommodate a variety of methods and local practices for display within the system.

The majority of finding aids in the DFAS system now make use of on-the-fly conversion from SGML to HTML. This display is predicated upon the belief that an effective system makes available multiple views. We also believe that it is central to both intellectual and system efficiency for users to have the ability to navigate to and retrieve a selected portion of a finding aid. This is particularly important in the case of very large finding aids; for example, one of the finding aids in the test system is almost two megabytes in size, the Bentley Historical Library's Milliken finding aid is over 500 pages in print, and Oxford reports that some of their finding aids also run several hundred pages.

The key word or "hits" in context display supports retrieval of the relevant portion of the finding aid and navigation to other levels of the finding aid (the user can go up or down levels in the hierarchy). Moreover, if the user has decided through searching or browsing that an entire finding aid is useful, that user has a choice of views. An outline view displays an outline of the finding aid and the intellectual organization of the contents list. The outline view is the most suitable for browsing through a finding aid. Each of the sections of the finding aid can be unfolded as desired. So, for example, a Contents List organizes the records into Subgroups, Series and Subseries (as appropriate for a particular collection), each of which can be expanded down to the level at which the container element (box, drawer, reel etc.) appears in the finding aid.

Finally, for those users who do wish to view the entire contents of a finding aid, a full text view is available. This view retrieves the full text of the finding aid as a single document. It may contain features such as: a more formally formatted title page; the "Collection Scope and Content Note" displaying both the collection level scope and content note and compiling the series scope and content notes (which are physically located in the contents list) into a single narrative text; the Series scope and content notes also displayed in the contents list section; display of the entire list of "Controlled Access Terms." All of this can and will depend on local decisions about the full text display. At present in the DFAS system, there are no internal navigation tools in the full text view. The user must move through the finding aid with the scroll bar or page-up/page-down keys; the full text view might be improved with some internal navigation such as a linked table of contents.

The DFAS system also allows for handcrafted HTML rendering of the finding aids, a method that Columbia has chosen to employ and that has the advantage of making possible a much more handsome display. This does, however, lose the ability to display the search results in context and thus allow a user to move immediately to the relevant portion of a finding aid. This is a sacrifice of functionality, but allows for a high degree of control over the display.

Finally, the DFAS system supports the display of finding aids using XSL stylesheets. At present, only Microsoft Internet Explorer 5.0 supports this type of display; we continue to expect, however, that XML-capable browsers will soon be widely available. Making use of XSL stylesheets will solve many of the problems and decisions associated with display of EAD-encoded finding aids. This will also allow a high degree of flexibility. Using stylesheets, the system can provide a default style. The local institution can override this with an archive- or institution-specific stylesheet, and users can rebel and apply their own stylesheets, overriding all the others. Moreover, as a test implementation at Cornell demonstrates, XSL stylesheets can be effectively combined with the use of HTML to facilitate navigation.
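
The override order for stylesheets amounts to a simple precedence chain. In the sketch below, the file names are invented and the ranking of an archive-specific stylesheet above an institution-wide one is an assumption; the text above states only that local stylesheets override the system default and that a user's stylesheet overrides all others.

```python
def resolve_stylesheet(default, institution=None, archive=None, user=None):
    """Return the stylesheet to apply: user first, then archive, then institution, then default."""
    for candidate in (user, archive, institution):
        if candidate:
            return candidate
    return default
```

For instance, `resolve_stylesheet("dfas-default.xsl", institution="local.xsl")` returns `"local.xsl"`, and adding `user="mine.xsl"` makes the user's choice win.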

Conclusions and directions for further research

In sum, the project has established that it is possible to implement a distributed search of diversely encoded finding aids and to obtain meaningful results. It has explored alternative methods of searching multiple databases (remote searching and replication) of finding aids and has laid out advantages and disadvantages of both methods. The system supports a flexible and multi-faceted display of results and of the finding aids themselves.

The system consists of two discernible components: 1) a distributed search management layer and 2) local indexing and access middleware. The system designers have come to believe that, as written, the distributed search management layer cannot scale to include a large number of institutions and archives within institutions, as well as a far greater number of finding aids. As currently written (in Perl) the system is weak on elements of interprocess communication that will become increasingly important as the number of sites grows. The local indexing and access middleware is, however, very capable in handling the search and retrieval of a large number of finding aids from any one repository. It is an excellent production tool for digital libraries with collections of finding aids and we hope that it will be used in such environments. Moreover, the current system also provides a testbed for research on user expectations and behavior in interacting with finding aids and with searching across multiple institutions.

We recognize the need for further exploration of the following areas.

Collect user feedback: It is essential to know how the system works with real archivists and researchers. While we were unable to conduct user testing within the project period, Harvard has committed to a period of testing and evaluation in the next six months.
Discuss methods and rules for date normalization, and explore the possibility of using controlled access terms for dates. Searching on dates is central to archival research and needs to be better supported.
Conduct statistical studies on retrieval dependent on different CAPs mappings (that is, which EAD elements should be mapped to which access point).
Develop the middleware in a language better suited to interprocess communication.
Explore the possibility of using nodes to facilitate "semi-distributed" searching.

DLPS will continue to support the distributed search management layer for a small number of institutions and thus the system will be available for use in research and experimentation on issues such as the second and third items above (date normalization and CAPs mappings). DLPS would welcome such research and would be glad to engage in discussions with researchers as to how best we can support that. DLPS will also continue to support the DFAS middleware as a production tool, available through the revamped SGML Server Program.

Notes

1 Greg Kinney, an archivist at the University of Michigan Bentley Historical Library, has noted: "The Summary synthetic term has <ARCHDESC> as the sole included element. In fact, that makes it virtually the same as Anywhere (Anywhere would also include <FRONTMATTER><TITLEPAGE>). If Summary is intended to be the narrative description portion of the finding aid, the included elements should probably be <ARCHDESC> minus <DSC> and <ADD>. Even that formulation poses a problem, since in some practices the series <SCOPECONTENT> notes are physically located inside the <DSC>. Maybe it will be best to positively state that the Summary synthetic term will include the top level <DID>, <ADMININFO>, <BIOGHIST>, <SCOPECONTENT> and <CONTROLACCESS>(?) elements, regardless of whether they are physically inside or outside the contents list."

Attendees at April meeting

Columbia University: David Millman
Harvard University: Mackenzie Smith
Indiana University: Perry Willett
Oxford: Richard Gartner, David Price, Lawrence Mielniczuk
University of Michigan: John Price-Wilkin, Alan Pagliere, Nigel Kerr, Maria Bonn, Greg Kinney (Bentley Historical Library)

References

DFAS Project Site
Project Proposal
Notes on project implementation and theory
White Paper on Common Access Points
Interim Progress Report, November 1998
DLPS White Paper on Display of Results
Bentley Historical Library EAD Finding Aids Project Documentation

last updated July 26, 1999
