Austin: home of the first piece of library automation (UT, 1937).
How did we get into this mess?
Generally speaking, automation started in the back room with acquisitions and cataloging. Circulation was a stand-alone system tied to barcodes (another back-office system for inventory), and Cutter’s rule of the catalog was well-satisfied by the card catalog.
As I explain to every new programmer who comes to work in the library, even the MARC record was only invented for back-office use, namely the transfer of records to someone who could then print you some catalog cards.
I am hard-pressed to call what we created with the Endeca software anything other than a second-generation online catalog, since I do not believe that we have moved forward from the first-generation catalogs, which are basically online card catalogs with poor keyword searching and insufficient authority searching thrown on top of them.
Non-library software (Endeca) doesn’t understand MARC records. We export records and use the MARC4J Java API to transform them into flat text file(s) for ingest by Endeca. Since we’re exporting records, we also have to keep these text files up to date with daily changes in Unicorn. We perform nightly updates, merging the text files using Perl. That’s where Endeca’s ProFind software takes over.
It’s responsible for…
-Parsing reformatted NCSU data and creating indices [Data Foundry]
-Creating an HTTP service that responds to queries with result sets [Navigation Engine]
On the front end, NCSU is responsible for…
Building the web application that users see. This application uses the Endeca API to query the back-end HTTP service (the Navigation Engine) and display results. We’ve built a servlet-based Java web application, but Endeca supports others like .NET.
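The real application talks to the engine through Endeca’s Java API. Purely as an illustration of the pattern (the host, port, and query parameter below are hypothetical, not Endeca’s actual interface), a servlet that forwards a query to a back-end HTTP service looks roughly like this:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLEncoder;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SearchServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Forward the user's terms to the navigation engine (hypothetical host/port)
        String q = req.getParameter("q");
        String terms = URLEncoder.encode(q == null ? "" : q, "UTF-8");
        URL engine = new URL("http://nav-engine.example.edu:8000/search?terms=" + terms);
        BufferedReader in = new BufferedReader(new InputStreamReader(engine.openStream(), "UTF-8"));
        resp.setContentType("text/html;charset=UTF-8");
        PrintWriter out = resp.getWriter();
        // Real code would walk the result set and facets; this sketch just relays the response
        for (String line; (line = in.readLine()) != null;) {
            out.println(line);
        }
        in.close();
    }
}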
The Data Foundry and the creation of indices are part of the nightly update, but they occur off-line and don’t affect the functioning of online requests to the NCSU web application (passed to the Navigation Engine). This is always live: we start a new navigation engine instance with the new indices before terminating the old one, so users don’t experience any down-time.
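A minimal sketch of that cut-over, assuming a hypothetical launcher script, port assignment, and health-check URL (the real swap is driven by Endeca’s control scripts, not code like this):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class EngineSwap {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Start a second engine on a new port, pointed at tonight's indices
        new ProcessBuilder("/opt/endeca/bin/start-engine",
                "--port", "8001", "--indices", "/data/indices/tonight").start();
        // Don't cut over until the new instance answers queries
        while (!isUp("http://localhost:8001/ping")) {
            Thread.sleep(1000);
        }
        // ...repoint the web application at port 8001 here...
        // then retire the old instance so users never see an outage
        new ProcessBuilder("/opt/endeca/bin/stop-engine", "--port", "8000").start();
    }

    private static boolean isUp(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            return conn.getResponseCode() == 200;
        } catch (IOException e) {
            return false;
        }
    }
}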
We already mentioned integration a bit, but we just wanted to emphasize some of the many things we had to work out in order to integrate a piece of non-library software with our ILS:
-Translate records stored in MARC21 format with MARC-8 character encoding to records stored in flat text files with UTF-8 character encoding (sketched below).
-Since we’re exporting records from the ILS, we also have to keep Endeca records in sync with ILS data.
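As a rough sketch of that translation step (file names and the pipe-delimited output format are illustrative, not our production code), MARC4J’s MarcStreamReader and AnselToUnicode converter do the heavy lifting:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import org.marc4j.MarcStreamReader;
import org.marc4j.converter.impl.AnselToUnicode;
import org.marc4j.marc.DataField;
import org.marc4j.marc.Record;

public class MarcToFlatText {
    public static void main(String[] args) throws IOException {
        // MARC21 records exported from Unicorn, in MARC-8 encoding
        MarcStreamReader reader = new MarcStreamReader(new FileInputStream("unicorn-export.mrc"));
        AnselToUnicode toUnicode = new AnselToUnicode(); // MARC-8 -> Unicode
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream("records.txt"), "UTF-8"));
        while (reader.hasNext()) {
            Record record = reader.next();
            DataField title = (DataField) record.getVariableField("245");
            String titleA = "";
            if (title != null && title.getSubfield('a') != null) {
                titleA = toUnicode.convert(title.getSubfield('a').getData());
            }
            // One pipe-delimited line per record: control number | title proper
            out.println(record.getControlNumber() + "|" + titleA);
        }
        out.close();
    }
}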
As Andrew reminded us, OPACs are good at known-item
searching. We chose to integrate the Endeca keyword searching capabilities
with Web2 authority searching (which meant figuring out how to present both
types of searches on the search page), in part b/c we didn’t know yet how well
Endeca would handle known-item searching (turns out it’s not too bad) and b/c
we haven’t yet found a good solution for bringing authority linkages into
Endeca. We also decided to retain the Web2 detailed record pages, rather than
building detailed record pages in Endeca, b/c they have lots of their own
functionality that has been built into the OPAC over time. For instance, there
is the capability to browse the shelf around this item and request checked out
materials. The lack of a good, reliable, unique identifier in our MARC records
made this linkage between Endeca and Web2 quite a challenge, which we finally
solved by placing our database keys for each record into the MARC 918 field.
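For illustration, pulling that key back out of a record with MARC4J might look like this (918 is the tag we actually use; subfield 'a' is an assumption for the sketch):

import org.marc4j.marc.DataField;
import org.marc4j.marc.Record;

public class RecordKey {
    // Return the database key stashed in the 918 field, or null if it's missing
    // (subfield 'a' is assumed here for illustration)
    public static String databaseKey(Record record) {
        DataField f918 = (DataField) record.getVariableField("918");
        if (f918 == null || f918.getSubfield('a') == null) {
            return null;
        }
        return f918.getSubfield('a').getData();
    }
}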
On screen, it’s okay to increase the font size in Firefox twice. Check this on screen at DLF before the presentation.
Do a quick demo to show some of the major features: relevance ranking, result-set exploration with facets, faster response time, natural-language searching aids (did-you-mean, spell correction, and auto-stemming), and true browsing. You can see the integration of features on the main search page, with the top keyword search powered by Endeca and the bottom “begins with” authority search powered by Web2.
-Deforestation – availability limit – also good for
Subject: Topic (environmental vs. economic) and maybe LC class
-Art history – region and era dimensions, plus spread of
LCC dimension; 20th century, France, remove 20th century and pick 19th century, most
popular [see management council demo notes]
-Browse whole collection, select DVD, sort by Most Popular
-Auto-correct: rascism (0 results vs. 2318) – also good
for Subject: Topic, ivan illych (0 results vs.
29), pride and prejidice (0 results vs. 156)
-Did you mean: e-bola
Keyword Anywhere is being heavily used,
which we weren’t sure would be the case after seeing some usability testing!
Arranged from top to bottom in the same
order as they are on the page. Availability has the fewest hits, but that’s just b/c there’s only one option to select! It comes out as the third most popular dimension value. LC classification is followed closely by Subject: Topic, the
next 2 on the page. Interesting to see that Author, the last option presented,
is being used more than Subject: Genre, the option presented second (often
above the fold) in the left-hand navigation bar. Library is also more popular
than format so far. We plan to use data like this to think about how to order
the dimensions in the future. For instance, should Subject: Genre be moved
farther down b/c it is less popular? The other dimension available is used to
allow users to browse new books. Although it is the least used dimension (and
so doesn’t show up here), it is the 4th most popular single dimension value. This is
a pretty impressive number, considering the option has been a bit buried under
our Browse tab.
So we mentioned that relevance seems to
be of critical importance to undergraduate users – we need to get good results
on the first page and preferably in the top 5. It’s also critically important
if we want to begin feeding the top X results from a search into other areas –
like our quick site search.
The decrease in no-results searches in Endeca is mostly due to spelling correction and automatic stemming. This shows that
Endeca performs 33% better at getting relevant hits into the top 5 results.
Relevance was judged subjectively.
Endeca offers user-configurable relevance ranking, and we still need to do more
research to figure out if/how we can
improve the algorithms that we have in place now.
To create relevance ranking in Endeca, we, the customer, select from a variety of available modules, ordering them based on their importance in determining relevancy.
Different search indexes can (and do) have different
strategies – Keyword Anywhere and ISBN/ISSN
indexes are searching very different types of data with different goals. B/c Keyword Anywhere searches nearly everything, we’ve spent the most effort considering relevance for this search.
The modules available in Endeca rank results based on phrase matching, field weighting, term frequency, or static ordering by the value of a particular field in the records. Ranking works in a tiebreaker fashion: all results are first ordered according to the first module, then any ties are broken by the second module, and so on down the line (see the comparator sketch after the module list below).
At NCSU, for Keyword Anywhere we emphasize:
•An exact match on the query as entered (no thesaurus, stemming, or spell correction) is most relevant.
•The phrase the user entered is more relevant than the same terms occurring separately.
•Which field contains the search term(s)? We have complete control over ordering the fields. For instance, we emphasize Title matches as more relevant than Author matches, which are more relevant than Table of Contents matches. We still need to study the effects of changing this ordering to improve relevance ranking.
•How many fields in the record contain the terms? The more fields, the more relevant the record.
…and so on – 7 total modules for Keyword Anywhere.
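That tiebreaker behavior amounts to a chained comparator. As a sketch, assuming per-module scores have already been computed (the class and field names are invented for illustration; Endeca does all of this internally):

import java.util.Comparator;
import java.util.List;

class Result {
    int exactMatchScore;  // module 1: exact match on the query as entered
    int phraseScore;      // module 2: phrase match vs. terms occurring separately
    int fieldWeight;      // module 3: which field matched (Title > Author > TOC)
    int fieldCount;       // module 4: how many fields contain the terms
}

public class TiebreakerRanking {
    // Order by module 1; break ties with module 2, then 3, then 4; higher is better
    public static void rank(List<Result> results) {
        results.sort(Comparator
                .comparingInt((Result r) -> r.exactMatchScore)
                .thenComparingInt(r -> r.phraseScore)
                .thenComparingInt(r -> r.fieldWeight)
                .thenComparingInt(r -> r.fieldCount)
                .reversed());
    }
}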
A bona fide complaint that we have heard is that the catalog is still just the catalog. There are no articles, and journal titles are still hard to find. So why bother re-inventing what some are calling an end-of-life product?
I want to talk briefly about two reasons why we
think that new technologies are required
to carry on.
Well, if you
think of the catalog as one piece of the resource
discovery puzzle and think of the non-integrated systems that we have now as a bunch of puzzle pieces (catalog, journal A-Z lists, A&I, full-text, local cgi interfaces, and random web searching), the fact is that it is impossible
to fit these pieces together.
Rebuilding one of these things
without the larger puzzle in mind is like painting
ourselves into a new corner.
Georgia PINES, Greenstone, OCA, DLF-ERMI in a vacuum are not enough.
Sometimes even local developers themselves can lose sight of the forest for the trees.
This is sort of where my dis-integrated library system theory is misunderstood. I do not really believe that total dis-integration is what we are after. I *do* believe that it is necessary to dismantle the systems we have in order to re-integrate our services. Rebuilding systems with new technologies will dismantle the embedded arrogance that stands between patrons and library resources.
This reversal, which I first heard articulated by
Marshall Breeding, involves the switch
from patrons starting locally and
expanding their searches globally to a paradigm of search and retrieval that has users starting broadly
and (currently only with luck) narrowing
their searches to locally or topically
relevant areas. Lorcan Dempsey has called this the problem of the low gravitational pull
of library resources.
If the starting point of choice is going to be
Google, Amazon, or even WorldCat, then we
must make our targets more visible to
those starting points and more enjoyable
to use as landing places.
Our local collections are not silos. I wish that they were because they would be visible and we could discover
them and build bridges between them. What they really are is basements, invisible on the horizon and haphazardly connected by still invisible tunnels and hacks. Collection and service description become more important than front-end interfaces and whiz-bang feature-sets. That local interfaces and features are rich and usable is sauce for the goose.