Subject: NetFirst Deduping 10/3/96 meeting notes From: Gary Perlman To: Dave Duckworth Date: October 7, 1996 Who: Dave Duckworth Ed O'Neill Jim Vickroy Luke Wagner Gary Perlman Dave, I think you would need about a month FTE to use the Vector Space approach, but it could be useful for many aspects of NetFirst, including new search methods for end users. See Salton's book: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer Gerard Salton Addison-Wesley 1989 ISBN 0-201-12227-8; QA 76.9 T48 S25 1989 For the issue of checking a URL against an old stored version, I think the "diff" method below holds some promise, and its labor requirements are on the order of hours. Gary Two types of similarity checking needed: 1. existing URL, check against new version looking for differences to see if description should be updated 2. potential URL, check against database looking for URLs that are already in database * similar URL (trailing slash, optional port, IP number) * mirror site A possible third type is 3. scan the NetFirst database for candidate duplicates combinatorial explosion issues, although Luke Wanger knows about an O(n) solution Vector Space Approach = content-based representation of entries = entries represented by vector of weights of discriminating terms - TFij = Term Freq of term Tj in doc Di - DFj = Doc Freq of term in N docs - IDF = Inverse Document Freq = logN/DFj (0 .. LogN) - TF*IDF = TFij*LogN/DFj = various measures of "distance" cosine most common = pioneered by Gerard Salton of Cornell, Mike McGill (lives in Worthington Hills, works in Ann Arbor) may have ideas = 1992 code available at ftp://ftp.cs.cornell.edu/pub/smart/ = contact person Chris Buckley chrisb@cs.cornell.edu Appropriateness to checking: 1. should be useful as it would look for different terms 2. would get a ranked list of matches for new page by searching in the vector space for "close" entries. If no "duplicate" was detected, the close matches might be useful for indexing because the close items would have similar index categories. 3. probably too expensive, unless the space could be partitioned so each item was compared to only a few close items, not the entire set. Feasibility: Implementation is probably a costly proposition. The Cornell SMART code is notoriously hard to install, and the code would still have to be adapted. I have asked a friend at Cornell if there is better code. The effort would be on the order of 1-2 months FTE. Diff File Approach = ad hoc textual approach checks number of added+deleted lines = run diff on stored page and current URL = compute number of lines changed (wc -l of output of diff) = if many lines added, deleted, or both (i.e., changed), the page is a candidate for A&I update = criterion could be based only on the number of diff lines, or on some function of the number added vs. deleted, the logic being that 10 added lines is more important than 5 lines that changed. Appropriateness to checking: 1. could be a useful additional piece of information, certainly more information that an absolute diff 2. would not be useful to compare potential URLs to database 3. would not be useful, as this type of deduping requires 2. Feasibility: In the most simple form, the check would only take a few minutes to implement is a shell script. In more complex forms, all that would be needed would be to count the number of ">" and "<" lines, or keep track of the line numbers that actually change. These are all doable with simple scripts.