Ideas about Metadata Enhancement

Version 1.1, 7 April, 2004

Location of this document: https://diva.cdlib.org/projects/american_west/technology/metadata_enhancement_ideas_v1.html

Author: Mike McKenna

Here's what I've found so far. You can think of the "metadata enhancement" as being an enhancement at different layers of the storage, search, retrieval, and viewing process. I haven't started yet to dive into the products targeted at the ALA .

Take a look at the "Digital Library Project" at UCB for ideas on some emerging technologies. Also check out "Resources on Knowledge Organization and Classification" and "The KMconnection Knowledge Management Product Guide" (about two years old, but great for ideas!).

User Interface

The user interface is not an area of meta data enhancement per se, since no new metadata will probably be generated, but more an area to explore to enable end users to have a more meaningful and productive experience.

Search language - how to get the information

NLP - Natural language processing, allowing the user to use more natural questions, etc. (e.g., AskJeeves). A couple of examples are the Verity search engine and Lextek's Onix.

Navigation and Visualization

Statistical - how many, when, who, where, etc. (e.g., Inxight Table Lens)

This allows you to browse by intensity, relevance, time frame, etc. Some interesting navigation techniques could come out of something like this.
Geospatial - mouse over for region-specific "hits"

Take a look at Distributed Geolibraries, NSDI (National Spatial Data Infrastructure) in the U.S., the Digital Earth project, and the Electronic Cultural Atlas Initiative (ECAI).
I like this idea - the concept, especially for the K-12 crowd, is to be able to mouse over a map and see highlighted "hits" of region specific information or resources. E.g., trace out the route of Lewis and Clark, and be able to switch on different "layers" such as

geopolitical layer
demographic layer

historical layer
+historical maps (e.g., Bancroft Library)

archeological layer

ecology/environment layer
etc.
There's a good discussion at the the Alexandria Digital Library Project at UCSB.
Chronological

This is where you query across time and space, usually using a timeline, a map, or both.

See the Perseus Digital Library at Tufts, and
Timemap, from ECAI
Semantic - topic context in ontology (see "Visualising Information Spaces")

hyperbolic browser (Inxight StarTree)

HighWire at Stanford is using this. It seems to be slow and ackward to use.

topological context (Cyber Geography)
mind map (The Brain)
Tree Map, from HCI Labd at U of Maryland (http://www.cs.umd.edu/hcil/treemap/)
Hierarchical Faceted Metadata, Metadata in Search Interface, or Guided Search (http://bailando.sims.berkeley.edu/flamenco.html)
Other methods (BAILANDO Projects)
Other Ideas

OpenDX is Opensource data visualization software that can handle many of the above tasks, especially gespatial and statistical.
Quite a few more ideas can be found at the Scientific Data Processing and Visualization Software Packages page available from SAL (Scientific Applications on Linux) pages.

Search

In the search layer, we're trying to increase performance, while giving relevant links back the end-user conducting the search. In addition to geographic cataloging listed above, the following topics are of interest:

Summarization and categorization are useful tools to reduce "swamping" caused by mixing full-text search results with straight metadata searches.

Automatic Summarization

These tools scan fulltext, focusing on the relevant key sentences contained within a document to build a summary of the conceptual content of the document. Sample applications are
the Inxight Summarizer
Brevity Document Summarizer
plus other summarization software listed with the Association for Computational Linguistics
and a few more listed with the KMconnection Knowledge Management Product Guide

Automatic categorization

Categorization and taxonomy building help to structure unstructured documents and holdings to make the results more relevant to the context of the end-user's query. A few products of interest here are

Verity's taxonomy engine
SemioTaxonomy by Entrieva
SemioTagger categorization and indexing engine
Inxight ThingFinder automatic entity extraction engine
PhraseRate, from the UC Riverside iVia platform.
And several more examples from the KMconnection
In this area, we should also look at taxonomy or ontology normalization to have consistent search results across all participating institutions and domains.

Relevance Ranking

Relevance ranking gives the end-user and idea of how well the selected resources fit what they are looking for. There are several methods out there now:

Cross-linked - how many other sites reference this one (this is how Google ranks their sites). This can cause popularity to push relevant citations low in the list.

Cross-reference - this is used in research papers and journals to point to papers that are referenced most by other papers - you can potentially miss more recent works with this method.

Word density - how often does the word or phrase appear in a document? This causes swamping with full-text searches.

Phrase importance (Inxight) - look for key context sentences to bring relevance higher.
Keyphrase extraction - PhraseRate, from the UC Irvine iVia platform.

Hierarchical results

This is something Roy likes, and it makes sense. Instead of pointing to individual pages, return top-level results and point to the top of a site. When a user clicks on that site, the list of hits (which can also have sub-levels or trees) is brought up. This fits in well with tree-based navigation and other visualization schemes.

Ingest

This area is a bit harder, since we don't or won't own the participating institutions. This is is where it would be nice if the target resources also stored enhanced summaries, categories, semantics, etc. The best we can do, I assume, is to enhance our own holdings as we progress forwards

Document image analysis - this is OCR of scanned documents. It may work well for typeset documents, but has a ways to go before being useful for handwritten documents. Tied into automatic summarization and categorization, it can make historical documents much more accessible.

Image meta data - This is meta data derived from attributes of a pure image. As far as I can tell, the current state of the technology if pretty disappointing, but may improve significantly enough to be considered for future projects.

Automatic summarization and categorization can be applied to existing TEI and web-crawled documents to enhance existing meta data.

Access Layer

When an application from a participating institution accesses the search interfaces, we need to provide hints as to where and how to get at the enhanced metadata.

Hints page - this can be a "robot" like helps file that tells the entering application where to find an HTTP, Z39.50, or other interface with expanded search capabilities.

Web services model - with this model, we provide a means to let search applications "discover" repositories and standardized search services through UDDI, with machine-readable descriptions of how to access those services.

Next Steps

Review technology areas
Prioritize
Focus of "low hanging fruit" and technologies most relevant to our plate of projects.

---

Michael McKenna
California Digital Library
University of California Office of the President
Voice: (510) 987-9655
Michael.McKenna@ucop.edu
href=http://www.cdlib.org/