1
Dynamic De-duplication of Bibliographic Data for User Services
  • Thorsten Schwander, Herbert Van de Sompel
  • Los Alamos National Laboratory, Research Library
  • Digital Library Research & Prototyping Team
2
Common De-duplication Scenarios
  • Union Catalogues
  • FRBR-izing catalogues
  • MetaSearch engines


3
LANL De-duplication Problem
  • LANL Research Library locally hosts a large data collection
    • A&I databases: ISI Citation Databases, Inspec, BIOSIS, Engineering Index, …
    • Full-text collections: Elsevier, Wiley, APS, IOP, …

  • Duplicates in LANL data collection:
    •  amongst bibliographic records
    •  between bibliographic records and citations
    •  amongst citations

  • De-duplication need:
    • join records from several databases that describe the same work
    • find works that cite a given work

6
Current LANL De-duplication Approach
  • Strategy:
    • Batch processing
    • Bibliographic key matching
    • Complex heuristics

  • Issues:
    • Extensive processing time
    • Scalability problem in light of growing data collection
    • Revision of heuristics requires reprocessing of collection


  • Explore alternative:
    • On-the-fly de-duplication
    • De-duplication approach that is appropriate for citation matching
    • Flexibility regarding revision of matching approach
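
The batch strategy on this slide relies on bibliographic key matching. A minimal sketch of that idea follows, assuming a hypothetical key built from first-author surname, year, volume and start page, and author names written as "Surname, Given". The actual LANL key fields and heuristics are not given in these slides, and the records are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def bib_key(record):
    """Build a hypothetical match key: surname|year|volume|start page."""
    # Assumes "Surname, Given" order, so the first token is the surname.
    surname = normalize(record["first_author"]).split(" ")[0]
    return "|".join(
        [surname, str(record["year"]), str(record["volume"]), str(record["start_page"])]
    )


def group_duplicates(records):
    """Group record identifiers whose keys are exactly equal."""
    groups = defaultdict(list)
    for rec in records:
        groups[bib_key(rec)].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up records from two hypothetical databases.
records = [
    {"id": "db-a:1", "first_author": "Yianilos, Peter N.", "year": 1993,
     "volume": "4", "start_page": "311"},
    {"id": "db-b:7", "first_author": "YIANILOS, P", "year": 1993,
     "volume": "4", "start_page": "311"},
    {"id": "db-b:8", "first_author": "Yianilos, P", "year": 1993,
     "volume": "4", "start_page": "321"},
]
duplicates = group_duplicates(records)
```

Because the comparison is exact, a single differing field (the start page of `db-b:8` above) puts a record in a separate group. Covering such cases requires extra heuristics, and changing them means rebuilding every key.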




7
Netrics Software
  • Netrics in the literature:
    • C. Lee Giles, Steve Lawrence, Kurt D. Bollacker. 1998. CiteSeer: an automatic citation indexing system. International Conference on Digital Libraries. Proceedings of the third ACM conference on Digital libraries, Pittsburgh, Pennsylvania. Pages 89–98. DOI 10.1145/276675.276685.
    • C. Lee Giles, Steve Lawrence, Kurt D. Bollacker. 1999. Autonomous Citation Matching. International Conference on Autonomous Agents. Proceedings of the third annual conference on Autonomous Agents, Seattle, Washington. Pages 392–393. DOI 10.1145/301136.301255.
    • Peter N. Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. Symposium on Discrete Algorithms. Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, Austin, Texas. Pages 311–321.
    • Various papers at http://www.netrics.com/products/prod_papers.shtml





8
Netrics Software
  • Netrics elevator pitch:


    • Netrics technology is a set of scalable linear-time algorithms that model the human notion of similarity in order to match related information. Netrics algorithms compute optimal weighted bipartite matching of letters and polygraphs. This bipartite matching approach captures a more flexible, more "human" notion of similarity than that provided by traditional approaches to inexact matching, such as string edit-distance, dictionary/speller correction, automaton-based methods, fuzzy search and probabilistic algorithms.
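
To make the idea of comparing letters and polygraphs concrete, here is a toy sketch. It is not the Netrics algorithm: instead of an optimal weighted bipartite matching it uses the much simpler Dice coefficient over character bigrams, and the function names are invented for illustration.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of character bigrams of a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_similarity(a, b):
    """Dice coefficient over bigram multisets, in [0, 1].

    Strings shorter than two characters have no bigrams and score 0.0.
    """
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    overlap = sum((ba & bb).values())  # multiset intersection
    return 2.0 * overlap / total


# A misspelled query still scores far closer to the intended word
# than to an unrelated word that shares a suffix.
close = dice_similarity("epistemolology", "epistemology")
far = dice_similarity("epistemolology", "biology")
```

Even this crude measure tolerates a repeated syllable in the query, which is the kind of forgiveness an exact key comparison cannot offer.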



9
Netrics Software
  • Demonstration sites:
    • First Things [search: epistemolology]
    • Prints doc com [search: barnspouse]
    • The Swiss Colony [search: hollowene]
    • Oscar winning actors database






10
Netrics Software
  • Netrics properties:
    • Forgiving with respect to errors in dataset
    • Forgiving with respect to errors in query
    • Compares strings like humans do
    • Response can be optimized for specific datasets: machine-learning module
    • Performance scales well with growing dataset
    • RAM-based index





11
LANL Netrics database setup
19
Netrics LANL queries
  • Query:
    • query key (citation or bibliographic) is sent to the broker as a fielded search
    • broker sends requests to all appropriate processes
    • thesaurus is used for stitle lookups
    • broker collects and merges the responses


  • Response:
    • list of entries, each of the form:
      key || likelihood that key is a hit || identifiers of records from which key was extracted
    • likelihood is derived from the Netrics match score per field and the weight accorded to each field
    • client application decides on the cut-off point between hit and non-hit
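
The query and response flow on this slide can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Hit` type, the field names and weights, the thesaurus entry, the per-field scores (which the real system obtains from Netrics), and the plain functions standing in for the separate search processes.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    key: str            # bibliographic or citation key
    likelihood: float   # combined score in [0, 1]
    record_ids: list    # identifiers of records the key was extracted from


# Hypothetical field weights; they sum to 1 so the likelihood stays in [0, 1].
FIELD_WEIGHTS = {"author": 0.3, "stitle": 0.3, "year": 0.2, "pages": 0.2}

# Hypothetical thesaurus mapping source-title variants to one canonical form.
STITLE_THESAURUS = {"phys rev lett": "physical review letters"}


def expand_query(query):
    """Replace the stitle field by its canonical form when the thesaurus has one."""
    expanded = dict(query)
    if "stitle" in expanded:
        variant = expanded["stitle"].lower()
        expanded["stitle"] = STITLE_THESAURUS.get(variant, variant)
    return expanded


def likelihood(field_scores, weights=FIELD_WEIGHTS):
    """Weighted sum of per-field match scores (each assumed to lie in [0, 1])."""
    return sum(weights[f] * field_scores.get(f, 0.0) for f in weights)


def broker(query, processes):
    """Send the query to every process, then merge the responses by key."""
    fielded = expand_query(query)
    merged = {}
    for search in processes:
        for key, field_scores, record_ids in search(fielded):
            score = likelihood(field_scores)
            hit = merged.setdefault(key, Hit(key, score, []))
            hit.likelihood = max(hit.likelihood, score)
            for rid in record_ids:
                if rid not in hit.record_ids:
                    hit.record_ids.append(rid)
    return sorted(merged.values(), key=lambda h: h.likelihood, reverse=True)


def apply_cutoff(hits, cutoff):
    """The client, not the broker, decides where hits end and non-hits begin."""
    return [h for h in hits if h.likelihood >= cutoff]


# Stub processes returning (key, per-field scores, record identifiers).
def process_a(query):
    return [("key-1", {"author": 1.0, "stitle": 0.9, "year": 1.0, "pages": 1.0}, ["db1:17"])]


def process_b(query):
    return [
        ("key-1", {"author": 1.0, "stitle": 0.8, "year": 1.0, "pages": 1.0}, ["db2:42"]),
        ("key-2", {"author": 0.5, "stitle": 0.2, "year": 0.0, "pages": 0.0}, ["db2:99"]),
    ]


hits = broker({"author": "Yianilos", "stitle": "Phys Rev Lett"}, [process_a, process_b])
accepted = apply_cutoff(hits, 0.8)
```

Keeping the cut-off in the client means different applications can share one broker while choosing their own trade-off between missed matches and false matches.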
21
Optimizing likelihood
  • Optimize likelihood scores as a function of the dataset
  • Machine learning: create a model that assigns weights to the fields of the key
  • Librarians:
    • Were presented with a total of 3,000 pairs of keys
    • Had to decide whether or not both keys of a pair represented the same work
    • Result: clearer cut-off point between matches and non-matches
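
One way to picture the training step: each labelled pair becomes a vector of per-field match scores plus a librarian's yes/no judgement, and a model learns field weights that separate the two classes. The sketch below uses plain logistic regression trained by gradient descent. This is a stand-in chosen for brevity, since the slides do not say which model the Netrics machine-learning module uses, and the field order and tiny training set are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=2000, rate=0.5):
    """Fit logistic-regression field weights (plus a bias) by batch gradient descent."""
    n_fields = len(pairs[0])
    weights = [0.0] * n_fields
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_fields
        grad_b = 0.0
        for scores, label in zip(pairs, labels):
            z = bias + sum(w * s for w, s in zip(weights, scores))
            error = sigmoid(z) - label
            for i, s in enumerate(scores):
                grad_w[i] += error * s
            grad_b += error
        weights = [w - rate * g / len(pairs) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / len(pairs)
    return weights, bias


def predict(weights, bias, scores):
    """Likelihood that both keys of a pair represent the same work."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, scores)))


# Invented per-field scores in the order (author, stitle, year, pages).
pairs = [
    (1.0, 0.9, 1.0, 1.0),  # same work
    (0.9, 0.4, 1.0, 1.0),  # same work, source title abbreviated differently
    (0.8, 0.8, 1.0, 0.9),  # same work
    (1.0, 0.9, 0.0, 0.1),  # same author and source, different work
    (0.3, 1.0, 1.0, 0.0),  # different work
    (0.2, 0.1, 0.0, 0.0),  # different work
]
labels = [1, 1, 1, 0, 0, 0]  # librarian judgements

weights, bias = train(pairs, labels)
scores = [predict(weights, bias, p) for p in pairs]
lowest_match = min(s for s, y in zip(scores, labels) if y == 1)
highest_non_match = max(s for s, y in zip(scores, labels) if y == 0)
```

The gap between `lowest_match` and `highest_non_match` is where a client would place its cut-off. Retraining changes only the weights, so the indexed keys do not need to be reprocessed.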
26
Netrics @ LANL : Conclusions
  • Results look much better than those of the batch de-duplication approach, thanks to Netrics matching combined with training by librarians
  • Can de-duplicate external data against local data
  • On-the-fly de-duplication instead of batch processing
  • The system can be retrained to optimize responses without reprocessing the data (machine-learning module)
  • Modularity of solution accommodates growth of dataset
  • Netrics module can be used by multiple applications, not just one


  • Positive collaboration with Netrics (machine learning, cache)


  • Will be plugged into the new search environment that is being created
  • Will be applied to the full-content collection



27
Contact information
  • LANL:
    • Thorsten Schwander <schwander@lanl.gov>
    • Herbert Van de Sompel <herbertv@lanl.gov>


  • Netrics:
    • Stefanos Damianakis <snd@netrics.com>
    • Anthony Faulise <af@netrics.com>