Ideas about Metadata Enhancement
Version 1.1, 7 April 2004
Location of this document:
https://diva.cdlib.org/projects/american_west/technology/metadata_enhancement_ideas_v1.html
Author: Mike McKenna

Here's what I've found so far.
You can think of "metadata enhancement" as enhancement at different
layers of the storage, search, retrieval, and viewing process. I haven't
yet started to dive into the products targeted at the ALA.
Take a look at the "Digital Library Project" at UCB for ideas on some
emerging technologies. Also check out "Resources on Knowledge
Organization and Classification" and "The KMconnection Knowledge
Management Product Guide" (about two years old, but great for ideas!).
User Interface
The user interface is not an area of metadata enhancement per se, since
probably no new metadata will be generated, but rather an area to explore
to enable end users to have a more meaningful and productive experience.
Search language - how to get the information
NLP - Natural language processing, allowing the user to pose more natural
questions (e.g., AskJeeves). A couple of examples are the Verity search
engine and Lextek's Onix.
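
To make the idea concrete, here's a minimal sketch of one small piece of
natural-language query handling: stripping question words and stopwords
to get a keyword query for a conventional engine. The stopword list and
function are my own illustration, not how Verity, Onix, or AskJeeves
actually work.

    # Minimal sketch: reduce a natural-language question to a keyword query.
    import re

    STOPWORDS = {
        "what", "who", "where", "when", "how", "did", "do", "does",
        "the", "a", "an", "of", "in", "on", "is", "are", "was", "were", "to",
    }

    def to_keyword_query(question):
        """Turn a natural question into a plain keyword query."""
        tokens = re.findall(r"[a-z0-9']+", question.lower())
        return " ".join(t for t in tokens if t not in STOPWORDS)

    print(to_keyword_query("Where did Lewis and Clark cross the Rockies?"))
    # -> "lewis and clark cross rockies"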
Navigation and Visualization
Statistical - how many, when, who, where, etc. (e.g., Inxight Table
Lens). This allows you to browse by intensity, relevance, time frame,
etc. Some interesting navigation techniques could come out of
something like this.
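
As a rough illustration of the "how many, when, who, where" counts such
a view would browse, here's a sketch over invented metadata records; the
field names are assumptions, not any product's schema.

    # Compute simple facet counts from a set of metadata records.
    from collections import Counter

    records = [
        {"creator": "Watkins, C.", "year": 1867, "region": "California"},
        {"creator": "Jackson, W.", "year": 1871, "region": "Wyoming"},
        {"creator": "Watkins, C.", "year": 1872, "region": "Oregon"},
    ]

    by_creator = Counter(r["creator"] for r in records)           # who
    by_decade = Counter(10 * (r["year"] // 10) for r in records)  # when
    by_region = Counter(r["region"] for r in records)             # where

    print(by_creator.most_common())
    print(sorted(by_decade.items()))
    print(by_region.most_common())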
Geospatial - mouse over for region-specific "hits"
Take a look at Distributed Geolibraries, NSDI (National Spatial Data
Infrastructure) in the U.S., the Digital Earth project, and the
Electronic Cultural Atlas Initiative (ECAI).
I like this idea - the concept, especially for the K-12 crowd, is to be
able to mouse over a map and see highlighted "hits" of region-specific
information or resources. E.g., trace out the route of Lewis and Clark,
and be able to switch on different "layers" of information.
There's a good discussion at the Alexandria Digital Library Project at
UCSB.
Chronological
This is where you query across time and
space, usually using a timeline, a map, or both.
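
Here's a minimal sketch of the underlying query across space and time:
keep only the records whose point falls inside a bounding box and whose
date falls in a range. The records and field names are invented for
illustration, not an ADL or ECAI interface.

    # Filter records by bounding box (west, south, east, north) and years.
    records = [
        {"title": "Fort Clatsop sketch", "lat": 46.1, "lon": -123.9, "year": 1806},
        {"title": "Yosemite view", "lat": 37.8, "lon": -119.5, "year": 1867},
    ]

    def in_bbox(rec, west, south, east, north):
        return west <= rec["lon"] <= east and south <= rec["lat"] <= north

    def hits(records, bbox, year_range):
        lo, hi = year_range
        return [r for r in records
                if in_bbox(r, *bbox) and lo <= r["year"] <= hi]

    # e.g., Lewis and Clark era material in the Pacific Northwest:
    trail = hits(records, bbox=(-125.0, 42.0, -110.0, 49.0),
                 year_range=(1803, 1806))
    print([r["title"] for r in trail])  # -> ['Fort Clatsop sketch']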
Semantic - topic context in ontology (see "Visualising Information
Spaces").
HighWire at Stanford is using this. It seems to be slow and awkward to
use.
Other Ideas
- OpenDX is open-source data visualization software that can handle many
of the above tasks, especially geospatial and statistical.
- Quite a few more ideas can be found on the Scientific Data Processing
and Visualization Software Packages page available from the SAL
(Scientific Applications on Linux) pages.
Search
In the search layer, we're trying to increase performance while giving
relevant links back to the end-user conducting the search. In addition
to the geographic cataloging listed above, the following topics are of
interest:
Summarization and
categorization are useful tools to reduce "swamping" caused by mixing full-text
search results with straight metadata searches.
Automatic Summarization
These tools scan full text, focusing on the relevant key sentences
contained within a document to build a summary of the document's
conceptual content. Sample applications are:
- the Inxight Summarizer
- the Brevity Document Summarizer
- other summarization software listed with the Association for
Computational Linguistics
- a few more listed with the KMconnection Knowledge Management Product
Guide
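
To show the general technique (and only that - this is not Inxight's or
Brevity's actual algorithm), here's a toy extractive summarizer that
scores each sentence by the frequency of its content words and keeps the
top few, in original order.

    # Toy extractive summarization by word-frequency sentence scoring.
    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        words = re.findall(r"[a-z']+", text.lower())
        freq = Counter(w for w in words if len(w) > 3)  # crude stopword filter

        def score(sentence):
            return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

        top = sorted(sentences, key=score, reverse=True)[:n_sentences]
        # Re-emit the chosen sentences in their original order.
        return " ".join(s for s in sentences if s in top)

    print(summarize("Gold was found. The find drew miners. Weather was mild.", 1))
    # -> "The find drew miners."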
Automatic categorization
Categorization and taxonomy building help to structure unstructured
documents and holdings to make the results more relevant to the context
of the end-user's query. A few products of interest here are ...
In this area, we should also look at taxonomy or ontology normalization
to have consistent search results across all participating institutions
and domains.
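
As a sketch of what that normalization might look like in practice,
here's a toy mapping of free-text subject terms from different
institutions onto one controlled vocabulary; the mapping table is
invented for illustration.

    # Normalize varied subject terms onto one shared taxonomy.
    NORMALIZE = {
        "gold rush": "Mining -- California",
        "mining camps": "Mining -- California",
        "railroads": "Transportation -- Railroads",
        "railways": "Transportation -- Railroads",
    }

    def categorize(subject_terms):
        """Return the set of normalized categories for a record's subjects."""
        return {NORMALIZE.get(t.lower().strip(), "Uncategorized")
                for t in subject_terms}

    print(categorize(["Gold Rush", "Railways"]))
    # -> {'Mining -- California', 'Transportation -- Railroads'} (order may vary)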
Relevance Ranking
Relevance ranking gives the end-user an idea of how well the selected
resources fit what they are looking for. There are several methods out
there now:
- Cross-linked - how many other sites reference this one (this is how
Google ranks sites). This can cause popularity to push relevant
citations down the list.
- Cross-reference - this is used
in research papers and journals to point to papers that are referenced
most by other papers - you can potentially miss more recent works with
this method.
- Word density - how often does the word or phrase appear in a
document? This causes swamping with full-text searches (see the
sketch after this list).
- Phrase importance (Inxight) - look for key context
sentences to bring relevance higher.
- Keyphrase extraction - PhraseRate,
from the UC Irvine iVia
platform.
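
Here's a small sketch of the word-density method from the list above:
rank documents by how often the query terms appear, normalized by
document length. The documents are invented, and real engines do much
more than this.

    # Rank documents by query-term density (term matches / document length).
    import re

    def density_score(query, text):
        terms = set(query.lower().split())
        words = re.findall(r"[a-z']+", text.lower())
        if not words:
            return 0.0
        return sum(1 for w in words if w in terms) / len(words)

    docs = {"diary": "gold gold claim river",
            "essay": "a long essay about gold"}
    ranked = sorted(docs, key=lambda d: density_score("gold", docs[d]),
                    reverse=True)
    print(ranked)  # -> ['diary', 'essay']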
Hierarchical results
This is something Roy likes, and it makes sense. Instead
of pointing to individual pages, return top-level results and point to the
top of a site. When a user clicks on that site, the list of hits
(which can also have sub-levels or trees) is brought up. This fits in
well with tree-based navigation and other visualization
schemes.
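
A quick sketch of how that grouping could work: collapse individual page
hits under their host site, show one top-level entry per site with a hit
count, and expand the per-site list only on click. The URLs are invented.

    # Group page-level hits by site for a hierarchical result list.
    from collections import defaultdict
    from urllib.parse import urlsplit

    hits = [
        "http://memory.loc.gov/ammem/award/doc1.html",
        "http://memory.loc.gov/ammem/award/doc2.html",
        "http://content.cdlib.org/ark1",
    ]

    by_site = defaultdict(list)
    for url in hits:
        by_site[urlsplit(url).netloc].append(url)

    # Top-level results: one entry per site, biggest first.
    for site, pages in sorted(by_site.items(), key=lambda kv: -len(kv[1])):
        print("%s (%d hits)" % (site, len(pages)))  # expand 'pages' on click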
Ingest
This area is a bit harder, since we don't or won't own the participating
institutions' holdings. This is where it would be nice if the target
resources also stored enhanced summaries, categories, semantics, etc.
The best we can do, I assume, is to enhance our own holdings as we
progress forward.
Document image
analysis - this is OCR of scanned documents. It may work well for
typeset documents, but has a ways to go before being useful for handwritten
documents. Tied into automatic summarization and categorization, it can
make historical documents much more accessible.
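
A hedged sketch of tying OCR into that pipeline, assuming the
third-party pytesseract package (and a local Tesseract install); as
noted, handwritten material will mostly defeat this today.

    # OCR a scanned page image to plain text for downstream enhancement.
    from PIL import Image
    import pytesseract

    def ocr_text(image_path):
        """Extract plain text from a scanned page image."""
        return pytesseract.image_to_string(Image.open(image_path))

    # The result can then feed the summarizer and categorizer sketched
    # above, e.g. summarize(ocr_text("diary_page_014.tif")).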
Image metadata - This is metadata derived from attributes of a pure
image. As far as I can tell, the current state of the technology is
pretty disappointing, but it may improve significantly enough to be
considered for future projects.
Automatic summarization and categorization can be applied to existing
TEI and web-crawled documents to enhance existing metadata.
Access Layer
When an application from a participating institution accesses the search
interfaces, we need to provide hints as to where and how to get at the
enhanced metadata.
Hints page - this can be a robots.txt-style hints file that tells the
entering application where to find an HTTP, Z39.50, or other interface
with expanded search capabilities (a small sketch follows below).
Web services model - with this model, we provide a means to let
search applications "discover" repositories and standardized search services
through UDDI, with machine-readable descriptions of how to access those
services.
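
To make the hints idea concrete, here's an invented machine-readable
hints document and how a visiting application might read it. The JSON
shape is purely illustrative; a production version might instead be a
UDDI entry or WSDL description as noted above.

    # Parse a (hypothetical) hints document advertising search interfaces.
    import json

    HINTS = """
    {
      "interfaces": [
        {"protocol": "http", "url": "http://example.org/search",
         "features": ["summaries", "categories", "geospatial"]},
        {"protocol": "z39.50", "host": "example.org", "port": 210,
         "database": "holdings"}
      ]
    }
    """

    def pick_interface(hints_json, wanted):
        """Return the first advertised interface supporting a feature."""
        for iface in json.loads(hints_json)["interfaces"]:
            if wanted in iface.get("features", []):
                return iface
        return None

    print(pick_interface(HINTS, "geospatial"))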
Next Steps
- Review technology areas
- Prioritize
- Focus on "low-hanging fruit" and the technologies most relevant to our
plate of projects.
---
Michael McKenna
California Digital Library
University of California Office of the President
Voice: (510) 987-9655
Michael.McKenna@ucop.edu
http://www.cdlib.org/