TEAM: More than a dozen people at CDL were involved. Peter initiated the project... Peter, do you want to say a few more words about how this project came about?

This project has centered on 5 focal areas for exploration. I’ll talk briefly about what we’ve been doing in each of these areas, leaving room for demo and open discussion at the end. But please don’t hesitate to ask questions throughout...

The first area of exploration was using a text-based indexing and query system, rather than the standard relational database approach. We chose to use a system based on Lucene, which is an open-source text search engine. Essentially, each bibliographic record is treated as a very small document, rather than a collection of fields in a database. One very significant advantage is that this allowed us to apply ranking methods that are typically used in text retrieval systems. This capability is built into Lucene.
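To make the "record as a very small document" idea concrete, here is a minimal sketch of term-weighted ranking over flattened records. It is not Lucene's scoring formula, only a simplified TF-IDF illustration; the record texts, IDs, and function names are assumptions made up for this example.

```python
import math
from collections import Counter

# Hypothetical records, each flattened to one small text document.
RECORDS = {
    "r1": "geology of california",
    "r2": "california history",
    "r3": "assembling california geology field guide geology",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()

def rank(query, records):
    """Return record IDs ordered by a simple TF-IDF score, best first."""
    docs = {rid: Counter(tokenize(text)) for rid, text in records.items()}
    n = len(docs)
    scores = {}
    for rid, tf in docs.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many records contain the term.
            df = sum(1 for d in docs.values() if term in d)
            if df and tf[term]:
                score += tf[term] * math.log(1 + n / df)
        if score > 0:
            scores[rid] = score
    return sorted(scores, key=scores.get, reverse=True)

ranked = rank("california geology", RECORDS)
```

Records that mention rarer query terms more often rise to the top, which is the kind of ordering a relational query would not give us by default.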

To build our testbed, we extracted UCB and UCLA data (9 million records total)

Considered broader/narrower datasets

- (e.g. all UCLA/UCB, circulating only, w/ISBNs only)

- UCLA approximates union catalog mix, avoids merge

- AND most important, we have good circulation data, used in other portions of the project.

We did need to tinker a bit....initially converted to MODS XML and indexed from there, but some basic needs weren’t met...e.g. formats (book, cd).

Now indexing without that step. Also creating display from the indexes, rather than from stored documents.

Have only touched lightly on this aspect of the project

AJAX:

•Using it to make calls to services outside of the core retrieval system

•Don’t have to wait for a response before rendering the page

•Very useful, keeps things speedy and flexible

•If there is time to demonstrate at the end, you’ll see it in action

The idea of faceted browse is to allow the user to continually narrow the focus along several defined axes, each of which is a hierarchy. An item is defined by several facets...e.g. location, material, time period....and they can be browsed as a whole, not separately.

As the user drills down, the display keeps track of what terms have been chosen, and offers further refinement into each facet. The chosen terms can be removed in any order, so navigation within the faceted hierarchy is fluid.
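The drill-down behavior described above can be sketched roughly as follows. This is an illustration only: the facets here are flat rather than hierarchical, and the record fields, values, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical records; facet names and values are illustrative only.
RECORDS = [
    {"title": "A", "location": "UCLA", "material": "book", "period": "1900s"},
    {"title": "B", "location": "UCB", "material": "book", "period": "1800s"},
    {"title": "C", "location": "UCLA", "material": "cd", "period": "1900s"},
]

def narrow(records, chosen):
    """Keep only records that match every chosen facet term."""
    return [r for r in records if all(r.get(f) == v for f, v in chosen.items())]

def facet_counts(records, facets):
    """Count the remaining values in each facet, to offer as refinements."""
    return {f: Counter(r[f] for r in records if f in r) for f in facets}

# The display keeps track of the chosen terms...
chosen = {"location": "UCLA", "material": "book"}
hits = narrow(RECORDS, chosen)
refinements = facet_counts(hits, ["period"])

# ...and any chosen term can be removed, in any order.
del chosen["location"]
wider = narrow(RECORDS, chosen)
```

Because the chosen terms are just a set of constraints, removing one simply widens the result set again; no navigation path has to be retraced.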

There are numerous examples of full or partial implementation of this out in the world...Endeca supports it.

•Our dataset is very large, and the overall “shape” of the metadata is not well suited to browsing.

“Did you mean....”

Work done by Martin Haye.

Modifications:

- adjust to recognize that transpositions and insertions are the most common errors

- applied “double metaphone” algorithm used in aspell (captures similar words based on phonetics)

- boosted based on word frequencies in index
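A rough sketch of two of these ideas, assuming a plain dictionary of index term frequencies: an edit distance that counts an adjacent transposition as a single error, and a frequency tie-break among equally close candidates. This is not Martin's implementation; the phonetic (double metaphone) step is left out, and the terms, counts, and function names are invented for illustration.

```python
def osa_distance(a, b):
    """Edit distance where an adjacent transposition counts as one error
    (optimal string alignment variant of Damerau-Levenshtein)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def suggest(word, term_freqs, max_distance=2):
    """Suggest index terms near `word`: closest first, then most frequent."""
    candidates = []
    for term, freq in term_freqs.items():
        dist = osa_distance(word, term)
        if 0 < dist <= max_distance:
            candidates.append((dist, -freq, term))
    return [term for _, _, term in sorted(candidates)]

# Hypothetical term frequencies standing in for the index.
TERM_FREQS = {"freeman": 120, "freshman": 40, "fireman": 5}
suggestions = suggest("freman", TERM_FREQS)
```

Note that a misspelling that is itself in the index would compete as a candidate like any other term, which is exactly the large-document-set problem.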

So....we get good suggestions for the first two.

The third...will depend on whether I intended to spell “freshman” or “freeman”...ostensibly the latter word is more common in the index.

The fourth...well here’s where the problems with multilingual environments crop up. It is a valid word in several Nordic languages.

And the fifth....this is an issue with very large document sets...there will be misspellings, which are indexed just like any other word.

Statistical NLP approach looking at bi-gram frequency

The current Melvyl system orders records LIFO....more recently catalogued items come to the top. Not terribly useful for very large result sets, which are common.

Interested in ranking by relevance based on the content of records, which has been applied successfully in many other settings. Will it work here, on bibliographic records?

Second, interested in considering whether additional “intentional data”....circulation, holdings...could provide even better results from the perspective of an academic user.

Idea is to boost certain documents to augment Lucene’s capability. Just reordering, not affecting what is retrieved.

Boosts are calculated in advance using summary tables constructed from holdings and circulation data in MySQL. Applied at query time, not index time, so we are able to switch at will and compare.
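A minimal sketch of query-time boosting, assuming the boost is a multiplicative factor looked up per record from a precomputed table. The multiplicative form, the record IDs, and the table layout are assumptions for illustration, not a description of our actual code.

```python
def rerank(hits, boosts, method):
    """Reorder text-search hits using a precomputed boost table.

    hits:   list of (record_id, text_score) from the retrieval engine
    boosts: {method_name: {record_id: factor}}, computed in advance
    method: which boost table to apply; unknown names leave scores as-is
    """
    table = boosts.get(method, {})
    return sorted(hits, key=lambda h: h[1] * table.get(h[0], 1.0), reverse=True)

# Hypothetical hits and boost tables.
HITS = [("r1", 2.0), ("r2", 1.5), ("r3", 1.0)]
BOOSTS = {
    "circulation": {"r3": 3.0},
    "holdings": {"r2": 2.0},
}

by_circulation = [rid for rid, _ in rerank(HITS, BOOSTS, "circulation")]
by_holdings = [rid for rid, _ in rerank(HITS, BOOSTS, "holdings")]
unboosted = [rid for rid, _ in rerank(HITS, BOOSTS, "none")]
```

The same set of records comes back every time; only the order changes, and switching methods is just a matter of picking a different table at query time.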

The UCLA circulation dataset is particularly valuable, as we’ll see later in this discussion

- retained anonymized but persistent patron IDs

Massaging circulation data:

- absorbed much more time than working out the algorithms.

- diffs in data structure, numbering systems between the two data sets

- needed to create linkages to the Melvyl records.

Holdings:

•Considered a set offered by RLG (FRBRized and linked by ISBN)...poor coverage.

•Obtained a set from OCLC linked by OCLC number....much better coverage.

•Weighed the use of WorldCat-wide vs. UC-wide....UC collections very different from WorldCat-wide in terms of what is highly held

Yep, indeed the result sets are reordered. But the key question is: which of these methods seems to work best for our users?

Asking the question of relevance through the lens of academic practice.

Not a TREC-style competition...user focused. Sample size of 10.

Mostly subject searching, but one known-item task.

Independent variable = ordering method...rotated through on each query, assigned in random order for each participant. Happening in the background, users were not aware of the difference from query to query.

Expertise grouped by naive or expert....roughly undergraduate vs. graduate students (all UC Berkeley)

show task, assessment interface examples

Caveat: still in the midst of analysis

Some questions to be settled before we can make strong statements. E.g.:

•How to account for very large and very small result sets?

•Sys_id, used for generating “unranked” sets, may not be completely random.

None of these observations are surprising....

Mention Amazon tests....not available for large swaths of collection. Other sources?

Not really CF. CF approach requires

- qualitative ratings of items by multiple people

- knowledge of the user preferences and how those relate to the preferences of other patrons, in order to make good predictions.

Can’t assume check-out => positive “rating”.

- Availability/proximity

- First choice or just convenient?

- Proved relevant or useful?

- Can’t see in-library use, completely digital materials.

NO RATINGS

INCOMPLETE (at best) picture of patron preferences.

Started to experiment

Item => checked out by four different patrons.

Follow links => set of related items.

How to narrow to usable recommendations? Sets often hundreds of items. User studies: 3-5 recommendations are optimal. How do we prioritize?

Let's look again at the set of items related to this one.

We can still follow the linkages from each patron. This time we’ll limit the sets to items in the same general subject area. And we’ll keep track of which item was checked out by each of the patrons.

The first patron has checked out 6 items.

The second patron has checked out 8 items, 4 new ones and 4 that were checked out by the first patron.

The third patron has checked out 6 items, adding a few new items, but most overlapping with those checked out by the first two.

The fourth patron checked out 4 items, most of which we've already seen at least once in the set of related items.

We can now sort based on frequency: the number of times we found that linkage from the original item to each one in the set. Those linked most often are returned at the top.

Then we need to trim the set to a usable number.

Cut-off isn't always tidy.

Some sets mainly consisting of frequencies of 1, so there’s very little overlap between patrons.

Sub-sort by total number of circulation transactions. (Another approach would be to subsort by a secondary category (e.g. arts); on the list for future tweaks.)
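The steps above (follow patron linkages, limit to the same subject area, sort by linkage frequency, sub-sort by total circulation, trim) can be sketched as follows. The data structures, item and patron IDs, and the default limit are hypothetical; this is an illustration of the approach, not the code we ran.

```python
from collections import Counter

def recommend(item, checkouts, subject_of, total_circ, limit=5):
    """Recommend items linked to `item` through shared patrons.

    checkouts:  {patron_id: set of item_ids checked out by that patron}
    subject_of: {item_id: general subject area}
    total_circ: {item_id: total number of circulation transactions}
    """
    links = Counter()
    for patron_items in checkouts.values():
        if item not in patron_items:
            continue
        for other in patron_items:
            # Limit the set to items in the same general subject area.
            if other != item and subject_of.get(other) == subject_of.get(item):
                links[other] += 1
    # Most frequently linked first; ties broken by total circulation.
    ranked = sorted(links, key=lambda i: (-links[i], -total_circ.get(i, 0), i))
    return ranked[:limit]

# Hypothetical, anonymized checkout data.
CHECKOUTS = {
    "p1": {"x", "a", "b"},
    "p2": {"x", "a", "c"},
    "p3": {"x", "b", "d"},
    "p4": {"y", "a"},
}
SUBJECT_OF = {"x": "geology", "a": "geology", "b": "geology",
              "c": "geology", "d": "art", "y": "geology"}
TOTAL_CIRC = {"a": 10, "b": 30, "c": 7}

recs = recommend("x", CHECKOUTS, SUBJECT_OF, TOTAL_CIRC)
```

Here "a" and "b" are each linked twice, so the sub-sort on total circulation decides which comes first; "d" drops out because it falls in a different subject area.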

So...how well did this experiment work?

There are a few interesting things worth noting here.

Only the fourth item shares subject headings with the original item; only the last 2 have call numbers reasonably close to it. You would never find the first two items, which seem to be very good recommendations, by following subject headings or browsing call numbers.

Again, we see that the very first recommendation shares no author, subject headings, call number...this item wouldn’t easily be found by browsing from “Assembling California”. The middle two may be textbooks; the last is another popular geology text by the same author.

Focusing only on circulation-based recommendations.

Task-based, again using graduate and undergraduate students at UC Berkeley as participants.

Instead of considering patron habits, similar items are found by creating a query using key pieces of the content of the original bibliographic record. Items found with that query are presented as recommendations.
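A minimal sketch of the idea, assuming the "key pieces" are subject headings and author and that candidates are scored by simple term overlap. Which fields are actually used, and how matches are scored, are assumptions here; the catalog entries and headings are made up for illustration.

```python
def build_query(record, fields=("subjects", "author")):
    """Collect query terms from selected fields of a bibliographic record."""
    terms = set()
    for field in fields:
        terms.update(record.get(field, []))
    return terms

def more_like_this(record_id, catalog, limit=5):
    """Return IDs of records sharing the most query terms with the original."""
    query = build_query(catalog[record_id])
    scored = []
    for rid, rec in catalog.items():
        if rid == record_id:
            continue
        overlap = len(query & build_query(rec))
        if overlap:
            scored.append((-overlap, rid))
    return [rid for _, rid in sorted(scored)[:limit]]

# Hypothetical catalog entries.
CATALOG = {
    "orig": {"author": ["Author A"], "subjects": ["Geology", "California"]},
    "k1": {"author": ["Author A"], "subjects": ["Geology", "California"]},
    "k2": {"author": ["Author B"], "subjects": ["Geology"]},
    "k3": {"author": ["Author C"], "subjects": ["Photography"]},
}

similar = more_like_this("orig", CATALOG)
```

Since the query is built from the record's own metadata, the results necessarily share headings with the original item.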

We’ll look at similarity-based recommendations for the Dorothea Lange item.

This time, the recommended items all share subject headings with the initial item, and call numbers are closer together...as you might expect given the method. They are more biographical in nature; the recommended items are much “more like this...!” Nothing unexpected.

And we’ll look at “Assembling California...” again, we see more shared subject headings, closer call numbers. Notice, however, that unlike the original item, which is written for a popular audience, these items appear to be much more technical in nature.

This method is still under development, will likely change.

Alternative methods: not mutually exclusive, and all may be of interest for different reasons and to support different tasks. We’re considering how all of these methods could intersect in terms of presentation to the user.

The circulation-based method seems to generate recommendations that start to harness the “collective intelligence” of the scholarly community, in ways that the other methods do not. But => less than 25% of the collection (the part that circulates). MUST generate recommendations that delve into the “long tail.” Also, long-term viability of circulation data is unclear. a) availability; b) access increasingly digital. Are there other data sources that can make those interesting connections that the metadata cannot make for us?

Book bags for personal collections?

Questions; integration of these pieces, with each other and within context of CDL’s production environment.