Austin: home of the first piece of library automation: UT 1937
How did we get into this mess?
Generally speaking, automation started in the back room with acquisitions and cataloging. Circulation was a stand-alone system tied to barcodes (another back-office system for inventory), and Cutter’s rule of the catalog was well satisfied by the card catalog. As I explain to every new programmer who comes to work in the library, even the MARC record was only invented for back-office use, namely the transfer of records to someone who could then print you some cards. I am hard pressed to call what we created with the Endeca software anything other than a 2nd generation online catalog, since I do not believe that we have moved forward from the first generation catalogs that are basically online card catalogs with poor keyword searching and insufficient authority searching thrown on top of them.
Non-library software (Endeca) doesn’t understand MARC records. Export and use the MARC4J Java API to transform to flat text file(s) for ingest by Endeca. Since we’re exporting records, we also have to keep these text files up to date with daily changes in Unicorn. Perform nightly updates – merge text files using Perl. That’s where Endeca’s ProFind software takes over.
It’s responsible for…
Parsing reformatted NCSU data [data foundry] and creating proprietary indices
Creating an HTTP service that responds to queries with result sets. [navigation engine]
On the front end, NCSU is responsible for…
Building the web application that users see – this application uses Endeca API to query the back-end HTTP service (navigation engine) and display results. We’ve built a servlet-based Java web application, but Endeca supports others like .NET
Data Foundry and creation of Indices are part of nightly update, but they occur off-line, and don’t affect functioning of online requests to NCSU Web Application (passed to Navigation Engine)
This is always live. We start a new navigation engine instance with new indices before terminating the old so users don’t experience any down-time
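The start-new-before-stopping-old swap can be sketched as below. This is a minimal illustration only: `EngineInstance` and `Catalog` are hypothetical stand-ins, not Endeca classes, and the real navigation engine is a separate HTTP service rather than a Python object.

```python
class EngineInstance:
    """Hypothetical stand-in for one running navigation engine process."""

    def __init__(self, index_version):
        self.index_version = index_version
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def query(self, terms):
        if not self.running:
            raise RuntimeError("engine is not running")
        return f"results for {terms!r} from index {self.index_version}"


class Catalog:
    """Hypothetical front end that always queries whichever engine is live."""

    def __init__(self, engine):
        self.live = engine

    def swap_in(self, new_engine):
        new_engine.start()                      # 1. bring up the new instance first
        old, self.live = self.live, new_engine  # 2. repoint queries at it
        old.stop()                              # 3. only then retire the old one
        return old
```

Because the old instance is stopped only after queries have been repointed, there is no moment at which `Catalog.live` refers to a stopped engine.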
Already mentioned integration a bit, but we just wanted to emphasize some of the many things we had to work out in order to integrate a piece of non-library software with our ILS. Nightly, we translate records stored in MARC21 format with MARC-8 character encoding to records stored in flat text files with UTF-8 character encoding. Since we’re exporting records from the ILS, we also have to keep Endeca records in sync with ILS data. As Andrew reminded us, OPACs are good at known-item searching. We chose to integrate the Endeca keyword searching capabilities with Web2 authority searching (which meant figuring out how to present both types of searches on the search page), in part b/c we didn’t know yet how well Endeca would handle known-item searching (turns out it’s not too bad) and b/c we haven’t yet found a good solution for bringing authority linkages into Endeca. We also decided to retain the Web2 detailed record pages, rather than building detailed record pages in Endeca, b/c they have lots of their own functionality that has been built into the OPAC over time. For instance, there is the capability to browse the shelf around this item and request checked out materials. The lack of a good, reliable, unique identifier in our MARC records made this linkage between Endeca and Web2 quite a challenge, which we finally solved by placing our database keys for each record into the MARC 918 field.
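To make the nightly sync concrete, here is a rough sketch of merging a day's changes into the full flat-file export. Our actual merge is done in Perl; Python is used here only for illustration. The file layout is an assumption: one record per line, tab-separated, with the database key in the first column, and a hypothetical `DELETE` marker for records removed from the ILS.

```python
def merge_nightly(base_lines, delta_lines, delete_marker="DELETE"):
    """Merge one day's changes into the full export, keyed on the first column.

    Assumed (hypothetical) layout: "<record key>\t<rest of record>".
    A delta line whose remainder equals delete_marker removes that record.
    """
    records = {}
    for line in base_lines:
        key, _, rest = line.rstrip("\n").partition("\t")
        records[key] = rest
    for line in delta_lines:
        key, _, rest = line.rstrip("\n").partition("\t")
        if rest == delete_marker:
            records.pop(key, None)   # record was removed from the ILS
        else:
            records[key] = rest      # new record, or changed record replaces old
    # Sort by key (as a string) so the output is stable from night to night.
    return [f"{key}\t{rest}" for key, rest in sorted(records.items())]
```

Keying on the same database key that we store in the 918 field would let added, changed, and deleted records all be handled in one pass.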
On my screen, okay to increase font size in Firefox twice. Check this on screen at DLF before presentation.
Now we’ll do a quick demo to show some of the major features: relevance ranking, results set exploration with facets, faster response time, natural language searching aids with did you mean, spell correction, and auto-stemming, true browsing of collection
Note that you can see integration of features on main search page with top keyword search powered by Endeca and bottom begins with authority search powered by Web2.
DLF demo outline:
-Deforestation – availability limit – also good for Subject: Topic (environmental vs. economic) and maybe LC class
-Art history – region and era dimensions, plus spread of LCC dimension; 20th century, France, remove 20th century and pick 19th century, most popular [see management council demo notes] **
-Browse whole collection, select DVD, sort by Most Popular **
-Auto-correct: rascism (0 results vs. 2318) – also good for Subject: Topic, ivan illych (0 results vs. 29), pride and prejidice (0 results vs. 156)
-Did you mean: e-bola
Keyword Anywhere is being heavily used, which we weren’t sure would be the case after seeing some usability testing!
Arranged from top to bottom in the same order as they are on the page. Availability has fewest hits, but that’s just b/c there’s only one option to select! It comes out as the third most popular dimension value. LC classification is followed closely by Subject Topic, the next 2 on the page. Interesting to see that Author, the last option presented, is being used more than Subject: Genre, the option presented second (often above the fold) in the left-hand navigation bar. Library is also more popular than format so far. We plan to use data like this to think about how to order the dimensions in the future. For instance, should Subject: Genre be moved farther down b/c it is less popular? The other dimension available is used to allow users to browse new books. Although it is the least used dimension (and so doesn’t show up here), it is the 4th most popular single dimension value. This is a pretty impressive number, considering the option has been a bit buried under our Browse tab.
So we mentioned that relevance seems to be of critical importance to undergraduate users – we need to get good results on the first page and preferably in the top 5. It’s also critically important if we want to begin feeding the top X results from a search into other areas – like our quick site search. Decrease in no-results searches in Endeca mostly due to spelling correction and automatic stemming. This shows that Endeca performs 33% better at getting relevant hits into the top 5 results. Relevance was judged subjectively.
Endeca offers user-configurable relevance ranking, and we still need to do more research to figure out if/how we can improve the algorithms that we have in place now. To create relevance ranking in Endeca, we the customer select from a variety of available modules, ordering them based on their importance in determining relevancy. Different search indexes can (and do) have different strategies – Keyword Anywhere and ISBN/ISSN indexes are searching very different types of data with different goals. B/c Keyword Anywhere searches nearly everything, we’ve spent most effort considering relevance for this search. Some of the modules available in Endeca rank results based on phrase matching, field weighting, frequency of term, or static ordering of value in a particular field in the records. Works in a tiebreaker fashion, where all results are first ordered according to first module, then any ties are broken by the second module, and on down the line.
At NCSU, emphasize
Original query (no thesaurus, stemming, spell correction) is most relevant
Exact phrase user entered more relevant than terms occurring separately in the field
Find most relevant field that contains search term(s). We have complete control over ordering the fields. For instance, we emphasize Title matches as more relevant than Author matches which are more relevant than Table of Contents matches. We still need to study the effects of changing this ordering to improve relevance ranking. How many fields in the record contain the terms? The more fields, the more relevant the record.
And on…7 total modules for Keyword Anywhere
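The tiebreaker ordering can be sketched as a sort on a tuple of per-module scores, where each later module only matters when all earlier ones tie. This is an illustration of the idea, not Endeca's implementation or configuration syntax: the hit fields (`matched_original_query`, `exact_phrase`, `fields`) and the `FIELD_PRIORITY` table are hypothetical names, and only the four modules listed above are modeled, not all seven.

```python
# Lower number = more relevant field (Title > Author > Table of Contents).
FIELD_PRIORITY = {"title": 0, "author": 1, "toc": 2}


def rank_key(hit):
    """Build a sort key; tuples compare element by element, so each
    module only breaks ties left by the modules before it.
    Assumes hit["fields"] lists at least one matching field."""
    return (
        0 if hit["matched_original_query"] else 1,       # 1. original query first
        0 if hit["exact_phrase"] else 1,                 # 2. exact phrase next
        min(FIELD_PRIORITY[f] for f in hit["fields"]),   # 3. best matching field
        -len(hit["fields"]),                             # 4. more matching fields
    )


def rank(hits):
    """Return hits ordered from most to least relevant."""
    return sorted(hits, key=rank_key)
```

Reordering the elements of the tuple corresponds to reordering the modules, which is the kind of change we still need to study.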
One bona fide complaint that we have heard is that the catalog is still just the catalog.  There are no articles, and journal titles are still hard to find.  So why bother re-inventing what some are calling an end-of-life product?
I want to talk briefly about two reasons why we think that new technologies are required to carry on.
Well, if you think of the catalog as one piece of the resource discovery puzzle and think of the non-integrated systems that we have now as a bunch of puzzle pieces (catalog, journal A-Z lists, A&I, full-text, and local cgi interfaces, and random web searching), the fact is that it is impossible to fit these pieces together.  Rebuilding one of these things without the larger puzzle in mind is like painting ourselves into a new corner.
Georgia PINES, Greenstone, OCA, DLF-ERMI in a vacuum are not enough.  Sometimes even local developers themselves can lose sight of the forest for the trees.
This is sort of where my theory of the dis-integrated library system is misunderstood.  I do not really believe that total dis-integration is what we are after.  I *do* believe that it is necessary to dismantle the systems we have in order to re-integrate our service options.
Furthermore, rebuilding systems with new technologies will dismantle the embedded arrogance that stands between patrons and library resources. This reversal, which I first heard articulated by Marshall Breeding, involves the switch from patrons starting locally and expanding their searches globally to a paradigm of search and retrieval that has users starting broadly and (currently only with luck) narrowing their searches to locally or topically relevant areas.  Lorcan Dempsey has called this the problem of the low gravitational pull of library resources.
If the starting point of choice is going to be Google, Amazon, or even WorldCat, then we must make our targets more visible to those starting points and more enjoyable to use as landing places.
Our local collections are not silos.  I wish that they were because they would be visible and we could discover them and build bridges between them.  What they really are is basements, invisible on the horizon and haphazardly connected by still invisible tunnels and hacks.  Collection and service description become more important than front-end interfaces and whiz-bang feature-sets.  That local interfaces and features are rich and usable is sauce for the goose.