1	The Melvyl Recommender Project. Originally presented as a CNI Project Briefing, 2006 April 3, by Colleen Whitney (California Digital Library and UC Berkeley School of Information) and Peter Brantley (California Digital Library). DLF reprise: Brian Tingle (California Digital Library).
2	Links: http://recommend-dev.cdlib.org/
3	Background: Fundamental changes in user needs and expectations. Library catalogs do not meet these needs. Exploratory project.
4	Outline: Text-based discovery system; User interface strategies; Spelling correction; Enhanced relevance ranking; Recommending
5	Text-Based Discovery: eXtensible Text Framework (XTF), built on Lucene and Saxon. Open source and standards-based (XML, XSLT, Java servlets). Very different from the relational approach. Built-in ranking capability.
6	Testbed: Bibliographic records (MARC export from Melvyl). ~4.2 million UCLA records used in the current prototype. Also experimented with adding UCB records, for a total of 9 million.
7	For Further Exploration: Scalability and performance. Successfully indexed and searched up to 9 million records. How will it do with 35 or 40 million records?
9	AJAX: Asynchronous JavaScript And XML. Used to call additional services from outside the core system. Render the page, then update portions as data arrives. Adds flexibility while maintaining speed.
10	FRBR: Functional Requirements for Bibliographic Records. Work (Hamlet); Expression (in French); Manifestation (Presses universitaires de France, 1987); Item (UCLA’s copy of it). Researching existing implementations. Analyzing how we would apply the concepts, and how we would implement them.
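The four FRBR levels named above nest naturally, so a small data-structure sketch may help make the hierarchy concrete. The class and field names below are illustrative assumptions, not part of the project's implementation (the slides say only that implementation was still being analyzed).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    holder: str  # who holds this single copy (hypothetical field)


@dataclass
class Manifestation:
    publisher: str
    year: int
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count every item (copy) reachable from a work."""
    return sum(len(m.items) for e in work.expressions for m in e.manifestations)


# The example from the slide: Hamlet, in French, PUF 1987, UCLA's copy.
hamlet = Work(
    "Hamlet",
    [Expression("French", [Manifestation("Presses universitaires de France", 1987, [Item("UCLA")])])],
)
```

Grouping search results at the Work level would then amount to displaying one entry per `Work` and expanding its expressions and manifestations on demand.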
12	Faceted Browse: Underlying mechanics in place. For effective browse, will require: substantial metadata enhancement; significant UI design work.
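As a rough illustration of the "underlying mechanics" of faceted browse, the sketch below counts how many records in a result set carry each value of a facet field. The record layout and field names (`language`, `format`) are assumptions made for the example; the slides do not describe how XTF computes facets.

```python
from collections import Counter
from typing import Dict, Iterable, List


def facet_counts(records: Iterable[Dict[str, List[str]]], facet: str) -> Counter:
    """Count how many records carry each value of the given facet field."""
    counts = Counter()
    for record in records:
        # Count each distinct value once per record, even if it repeats.
        counts.update(set(record.get(facet, [])))
    return counts


# Hypothetical result set with two facet fields.
records = [
    {"language": ["English"], "format": ["Book"]},
    {"language": ["French"], "format": ["Book"]},
    {"language": ["English"], "format": ["Map"]},
]
language_facet = facet_counts(records, "language")
```

The counts are only as good as the field values behind them, which is one reason effective browse depends on metadata enhancement.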
14	Spelling Correction: Goal: 90% correct on first try. Dictionary-based (aspell) vs. index-based. Proper nouns. Multilingual environment.
15	Spelling Correction: Chose index-based strategy. “N-gram” speller from Lucene: “primer” => pri prim rime imer mer; form query from n-grams; retain top 100, rank by closeness to original word. Modified in several ways: adjust for transpositions and insertions; use metaphones; boost on word frequencies. Tested successfully on Wikipedia and aspell datasets.
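The steps above can be sketched roughly as follows: break the query word into n-grams, keep the indexed terms that share the most n-grams, then re-rank them by closeness to the original word. This is a minimal sketch under stated assumptions, not the project's modified Lucene speller: the function names are hypothetical, n-gram sizes 3 and 4 are assumed, the index is stood in for by a plain term-to-frequency dictionary, closeness is an edit distance that counts an adjacent transposition as one edit, and the metaphone step is omitted.

```python
from collections import Counter
from typing import Dict, List, Set


def ngrams(word: str, sizes=(3, 4)) -> Set[str]:
    """All character n-grams of the given sizes."""
    return {word[i:i + n] for n in sizes for i in range(len(word) - n + 1)}


def edit_distance(a: str, b: str) -> int:
    """Edit distance where insert, delete, substitute, and adjacent swap each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition, e.g. "il" typed for "li".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(word: str, vocabulary: Dict[str, int], keep: int = 100) -> List[str]:
    """Rank indexed terms as corrections for `word`."""
    query = ngrams(word)
    # Step 1: keep the `keep` terms sharing the most n-grams with the query.
    overlap = Counter({term: len(query & ngrams(term)) for term in vocabulary})
    candidates = [t for t, shared in overlap.most_common(keep) if shared > 0]
    # Step 2: re-rank by closeness; break ties in favor of frequent terms.
    return sorted(candidates, key=lambda t: (edit_distance(word, t), -vocabulary[t]))


# Hypothetical term frequencies standing in for the catalog index.
vocabulary = {"mexico": 50, "mexican": 20, "javascript": 30, "freeman": 10, "freshman": 8}
```

With this toy vocabulary, `suggest("mexaco", vocabulary)` puts "mexico" first. Because candidates come from the index itself, proper nouns and non-English terms present in the catalog can be suggested without a separate dictionary.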
16	Examples “Mexaco” => “Did you mean...Mexico?” “Javasript” => “Did you mean... Javascript” “frehman” => “Did you mean...freeman” “flod” => “Your search for flod in keywords returned 12 result(s).” “Cailfornia” => “Your search for cailfornia in keywords returned 1 result(s).”
17	For Further Exploration: Relative benefits of this approach, given the increase in indexing time (construction of bi-grams). Consider when to intervene: only on 0 results? Multi-word correction.
19	Ranking: Using built-in Lucene capability. “Boosted” with circulation data: ~9 million UCLA circulation transactions, September 1999 – May 2005, from two systems (Taos, Voyager). “Boosted” with holdings data: for 10 UC campuses, provided by OCLC.
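One way to picture a "boost" is as a multiplier applied to the content score, growing slowly with the number of circulation transactions (or holding campuses) for a record. The formula, the `weight` parameter, and the function names below are assumptions for illustration; the slides do not say how the boost was actually computed.

```python
import math
from typing import Dict, List, Tuple


def boosted_score(content_score: float, count: int, weight: float = 0.1) -> float:
    """Scale a content score by a damped function of a usage count."""
    # log1p keeps heavily circulated items from swamping textual relevance.
    return content_score * (1.0 + weight * math.log1p(count))


def rank(hits: Dict[str, float], counts: Dict[str, int], weight: float = 0.1) -> List[Tuple[str, float]]:
    """Order hits by boosted score, highest first."""
    scored = [
        (doc_id, boosted_score(score, counts.get(doc_id, 0), weight))
        for doc_id, score in hits.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical content scores and circulation counts.
hits = {"a": 1.0, "b": 0.9}
circulation = {"b": 100}
ranked = rank(hits, circulation)
```

Here record "b" overtakes "a" despite a slightly lower content score. Setting `weight` to 0 recovers content ranking only, which makes it easy to compare orderings side by side.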
20	Examples
24	Assessment: Small-scale user test in March. Key questions: Which ranking method works best for our academic users? How do academic users evaluate relevance? Is there a difference based on subject matter expertise?
25	Assessment: Task-based, facilitated and observed. Rotated through 4 ordering methods: (1) content ranking only; (2) content ranking boosted by circulation; (3) content ranking boosted by holdings; (4) unranked, sorted by system id. Grouped by naive vs. expert.
28	Preliminary Results: In general, all 3 content ranking methods beat unranked in returning “Very Useful” items. All 3 content ranking methods put more “Very Useful” items in top quartiles. Preferences differed by expertise. No clear-cut advantage to a single ranked method. More queries per task using unranked method.
29	Results: Additional observations: All users place strong emphasis on title and publication date in assessing relevance. Expert users rely heavily on author. Many commented that term highlighting helps them assess matches against the query.
30	For Further Exploration: Content-based ranking appears well worth pursuing, but consider: Adjustments to field weights, given observations? Relative costs of incorporating boosts? Sources of expanded metadata, which helps users assess relevance.
32	Recommending: Exploring multiple methods: Circulation-based; Similarity-based (more like this…); Same author, subject, call number.
33	Circulation-based
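The transcript does not preserve how the circulation-based recommendations were derived. A common approach, shown here purely as an assumed illustration, is co-occurrence counting: items borrowed by the same patron are treated as related, and the most frequently co-borrowed items are recommended. The function names and the (patron, item) input format are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def co_circulation(transactions: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Map each item to a count of other items borrowed by the same patrons.

    `transactions` is an iterable of (patron_id, item_id) pairs.
    """
    by_patron = defaultdict(set)
    for patron, item in transactions:
        by_patron[patron].add(item)
    related = defaultdict(Counter)
    for items in by_patron.values():
        # Every pair of items one patron borrowed counts as one link.
        for a, b in combinations(sorted(items), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def recommend(related: Dict[str, Counter], item: str, limit: int = 5) -> List[str]:
    """Return the items most often co-borrowed with `item`."""
    return [other for other, _ in related.get(item, Counter()).most_common(limit)]


# Hypothetical, anonymized transactions.
transactions = [
    ("p1", "hamlet"), ("p1", "macbeth"),
    ("p2", "hamlet"), ("p2", "macbeth"), ("p2", "othello"),
]
related = co_circulation(transactions)
```

An item that has never circulated gets no recommendations under this scheme, which is one limitation any purely circulation-based method shares.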
43	Examples
45	Does this method work?
46	Similarity-based: Generated from the content of the record. “More like this...”
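A content-based "more like this" can be sketched by turning each record's text into a weighted term vector and comparing vectors. The example below uses TF-IDF weights and cosine similarity as an assumed stand-in; the slides do not say which similarity measure the prototype used, and the function names and sample records are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Build a TF-IDF term vector for each record's text."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    df = Counter(term for words in tokens.values() for term in set(words))
    n = len(docs)
    return {
        doc_id: {t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(words).items()}
        for doc_id, words in tokens.items()
    }


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def more_like_this(doc_id: str, vectors: Dict[str, Dict[str, float]], limit: int = 3) -> List[str]:
    """Return the records most similar to `doc_id`, skipping unrelated ones."""
    scores = [(other, cosine(vectors[doc_id], vec)) for other, vec in vectors.items() if other != doc_id]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, score in scores[:limit] if score > 0]


# Hypothetical record text (title plus subject terms).
docs = {
    "r1": "hamlet prince of denmark shakespeare",
    "r2": "macbeth tragedy shakespeare",
    "r3": "introduction to organic chemistry",
}
vectors = tfidf_vectors(docs)
```

Unlike the circulation-based method, this works for records that have never been borrowed, since it needs only the record's own content.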
47	Examples
49	For Further Exploration: Integrate several methods: author, subject linkages; call number “shelf browse”; “More like this...”; circulation-based recommendations. Limitations of circulation-based method. Identify other data rich in human-generated linkages: citations, reading lists...
50	Timeline: Completing user tests on circulation-based recommendations this month. Wrapping up in June.
51	Many thanks to: Mellon Foundation; RLG; OCLC; UCLA Library; UC Berkeley Library; CDL Team (Peter Brantley, Lynne Cameron, Rebecca Doherty, Randy Lai, Jane Lee, Martin Haye, Erik Hetzner, Kirk Hastings, Patricia Martin, Felicia Poe, Michael Russell, Lisa Schiff, Roy Tennant, Brian Tingle, Steve Toub...)
52	Links: Project Home: http://www.cdlib.org/inside/projects/melvyl_recommender/ Prototype: http://recommend-dev.cdlib.org/melrec/
53	Questions?