Subject:  NetFirst Deduping 10/3/96 meeting notes
From:     Gary Perlman
To:       Dave Duckworth
Date:     October 7, 1996

Who:      Dave Duckworth
          Ed O'Neill
          Jim Vickroy
          Luke Wagner
          Gary Perlman

Dave,

I think you would need about a month FTE to use the Vector Space
approach, but it could be useful for many aspects of NetFirst,
including new search methods for end users.  See Salton's book:
	Automatic Text Processing:
	The Transformation, Analysis, and Retrieval
	of Information by Computer
	Gerard Salton
	Addison-Wesley
	1989
	ISBN 0-201-12227-8; QA 76.9 T48 S25 1989

For the issue of checking a URL against an old stored version,
I think the "diff" method below holds some promise,
and its labor requirements are on the order of hours.

Gary

Two types of similarity checking needed:
 1. existing URL, check against new version
	looking for differences to see if description should be updated
 2. potential URL, check against database
	looking for URLs that are already in database
	 * similar URL (trailing slash, optional port, IP number)
	 * mirror site
A possible third type is
 3. scan the NetFirst database for candidate duplicates
	combinatorial explosion issues, although
	Luke Wanger knows about an O(n) solution

Vector Space Approach
	= content-based representation of entries
	= entries represented by vector of weights of discriminating terms
		- TFij = Term Freq of term Tj in doc Di
		- DFj = Doc Freq of term in N docs
		- IDF = Inverse Document Freq = logN/DFj (0 .. LogN)
		- TF*IDF = TFij*LogN/DFj
	= various measures of "distance" cosine most common
	= pioneered by Gerard Salton of Cornell,
		Mike McGill (lives in Worthington Hills,
		works in Ann Arbor) may have ideas
	= 1992 code available at ftp://ftp.cs.cornell.edu/pub/smart/
	= contact person Chris Buckley chrisb@cs.cornell.edu
  Appropriateness to checking:
	1. should be useful as it would look for different terms
	2. would get a ranked list of matches for new page
		by searching in the vector space for "close" entries.
		If no "duplicate" was detected, the close matches
		might be useful for indexing because the close
		items would have similar index categories.
	3. probably too expensive, unless the space could be
		partitioned so each item was compared to
		only a few close items, not the entire set.
  Feasibility:
	Implementation is probably a costly proposition.
	The Cornell SMART code is notoriously hard to install,
	and the code would still have to be adapted.
	I have asked a friend at Cornell if there is better code.
	The effort would be on the order of 1-2 months FTE.

Diff File Approach
	= ad hoc textual approach checks number of added+deleted lines
	= run diff on stored page and current URL
	= compute number of lines changed (wc -l of output of diff)
	= if many lines added, deleted, or both (i.e., changed),
		the page is a candidate for A&I update
	= criterion could be based only on the number of diff lines,
		or on some function of the number added vs. deleted,
		the logic being that 10 added lines is more important
		than 5 lines that changed.
  Appropriateness to checking:
	1. could be a useful additional piece of information,
		certainly more information that an absolute diff
	2. would not be useful to compare potential URLs to database
	3. would not be useful, as this type of deduping requires 2.
  Feasibility:
	In the most simple form, the check would only take
	a few minutes to implement is a shell script.
	In more complex forms, all that would be needed
	would be to count the number of ">" and "<" lines,
	or keep track of the line numbers that actually change.
	These are all doable with simple scripts.