2008-04-10 marks the 10th anniversary of the creation of the search service at the HCI Bibliography. Before the search service, the HCI Bibliography was a database that could be downloaded for local use. Peter Foltz, while a graduate student at the University of Colorado at Boulder, provided a search service for the first 3000 records using latent semantic indexing (LSI) technology from Bellcore. Tom Tullis at Fidelity provided a front end using Visual Basic. A slow email-based search service was provided at the HCI Bibliography, but in April 1998, a web-based search went into service. Even today, the database is still provided by others, such as the Collection of Computer Science Bibliographies.
For the first eight years, the search service used the Glimpse search engine. Glimpse was freely available and ran on the hcibib.org server (hosted by ACM SIGCHI). Under the name "gs", Glimpse search was quick, but it featured an obscure search syntax (AND=semicolon, OR=comma, no NOT operator, grouping with curly braces), and it had some problems handling even moderately complex queries. John Pane, then a graduate student at CMU, helped design a matrix form that used the richness of the syntax while hiding it from users. Over the years, requests for some search features dribbled in: paginate the results, sort the results, to name a couple. My answer to these requests, displayed in the hcibib.org FAQ, started with "Thank you for volunteering to work on the hcibib.org search service." There simply wasn't time to add these features, even though I did not doubt their utility.
During the summer of 2006, the hcibib.org server was upgraded, but in the process, the Glimpse software disappeared. After a different version of Glimpse was restored, the search service did not work properly, and it was discontinued on 2006-10-19. I wondered if anyone would care enough to ask to have the search service back. Sure enough, a few people did, some of whom were prominent researchers in the field. Unknown to them, after the demise of Glimpse during the server upgrade, I had started to cobble together a Perl-based search engine in my spare time, and by the end of November 2006, I was ready to release the new search engine, "bs", for "Bibliography Search".
The new search engine had several features people had requested: pagination, sorting, and more record formats (including export to EndNote and RefWorks). The search syntax was more conventional (&=AND, |=OR, -=NOT, *=wildcard, grouping with parentheses), not that I expected people to use it directly. More importantly, bs worked more reliably. A syntax was added for authors to allow searches for variations in names (e.g., carroll_j* matches J., John, Jack, or John M. Carroll). Other new features included faceted search results (Subject, Author, Date, and Source) and query analysis (maybe only enjoyed by search engine builders).
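For the curious, here is a minimal sketch of how an author pattern like carroll_j* could be expanded into a regular expression. It assumes author fields are stored as "Lastname, Firstname ..." strings; the function name and field layout are my own illustration, not the actual bs code.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical illustration: turn an author pattern like "carroll_j*"
    # into a case-insensitive regex, assuming "Lastname, Firstname ..." fields.
    sub author_pattern_to_regex {
        my ($pattern) = @_;
        my ($last, $first) = split /_/, $pattern, 2;
        $first = '' unless defined $first;
        $first =~ s/\*/.*/g;                  # '*' matches any trailing characters
        return qr/^\Q$last\E,\s*$first/i;     # e.g., qr/^carroll,\s*j.*/i
    }

    my $re = author_pattern_to_regex('carroll_j*');
    for my $author ('Carroll, John M.', 'Carroll, J.', 'Carroll, Lewis') {
        printf "%-20s %s\n", $author, ($author =~ $re ? 'matches' : 'does not match');
    }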
Although it may have seemed inappropriate, the new search service, like the old, continued to present many options on the search form. Users of the service were doing research on HCI, so they were assumed to be familiar with the web, complex forms, and query formulation.
Although invisible to users, one of the biggest changes was the addition of detailed statistics on feature usage. Statistics had been kept before, but more were added and they were better organized, making them easier to analyse. These new stats quickly verified that most people (over 99%) do not sort results (from the new default of showing the newest publications first), change the pagination (from the default of 25 items), or even view a second page of results (maybe because 25 is enough). People almost never use features like Boolean connectors, parentheses, or wildcards. That's not to say those features are unused: dozens of robots visit hcibib.org daily, checking every link on the many pages of links and ultimately performing hundreds or thousands of searches. Over a period of a few months, statistics were added to reach the current set of variables:
Over the life of Glimpse search, a little over two million searches were handled (2,050,633 to be exact). That's about 250,000 a year, or about 700 per day (more on weekdays than weekends). Because of some server limitations, the old stats for gs never allowed me to measure how many users were doing the searches. It would be optimistic to think the service was being used by 700 people per day or by many thousands of people per year. SIGCHI conferences attracted about 2500 people, and SIGCHI had maybe twice that many members, few of whom would search the HCIBIB more than a few times a year. Still, there were some days when the number of searches was in the tens of thousands, and the obvious conclusion was that some automated system was checking out links on pages like the list of top authors in the bibliography (which has almost 1500 links that perform searches for one author's publications). Any search results had hot links to search for keywords and authors, so any recursive process would find plenty of searches to follow (despite the robots.txt instructions indicating that the link pages should be indexed but not followed, and that the search results should be neither indexed nor followed).
Because robots visited hcibib.org, sometimes performing tens of thousands of searches in a day, I wanted to have data that would not be biased by robots. In particular, I wanted to know what real people were doing, so that the search service could be improved. The first step in reducing the effects of robots was to look at the data for individual IP addresses (i.e., the internet identifiers for computers like 132.174.17.21). How many IP addresses performed searches? How many IP addresses got some zero-hit searches? How many IP addresses got no search results at all?
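Answering those per-IP questions is mostly a matter of aggregating the search log. Here is a minimal sketch, assuming a hypothetical one-line-per-query log of "IP<TAB>hit-count" (not the actual hcibib.org log format):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical sketch: per-IP tallies from a search log with one line
    # per query, "IP<TAB>hit-count". The real hcibib.org log format differs.
    my (%searches, %zero, %nonzero);
    while (<>) {
        chomp;
        my ($ip, $hits) = split /\t/;
        next unless defined $hits;
        $searches{$ip}++;
        $hits == 0 ? $zero{$ip}++ : $nonzero{$ip}++;
    }
    my @never_saw_results = grep { !$nonzero{$_} } keys %zero;
    printf "IP addresses that searched:          %d\n", scalar keys %searches;
    printf "IP addresses with zero-hit searches: %d\n", scalar keys %zero;
    printf "IP addresses that never got a hit:   %d\n", scalar @never_saw_results;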
There are some easy ways to identify a cooperative robot using a service: many robots identify themselves by the user-agent string they send in the HTTP request (e.g., Slurp, this-bot, that-spider, another-crawler, etc.). But there are also less cooperative robots that are harder to identify; they are identified by their behaviors instead. Requests with no user-agent, no locale, and no referring page, arriving at a rate of more than one every few seconds, help identify an IP address or range of IP addresses used for robot activity.
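Those signals combine naturally into a simple scoring heuristic. The sketch below only illustrates that idea; the field names, thresholds, and scoring are my assumptions, not the rules actually used by bs:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical robot test based on the behavioral signals described above.
    # Thresholds and scoring are illustrative assumptions, not the bs rules.
    my %last_seen;    # IP address => epoch time of its previous request

    sub looks_like_robot {
        my ($req) = @_;    # hashref with ip, agent, locale, referer, time
        my $score = 0;
        $score++ if !$req->{agent} || $req->{agent} =~ /bot|spider|crawler|slurp/i;
        $score++ if !$req->{locale};
        $score++ if !$req->{referer};
        my $prev = $last_seen{ $req->{ip} };
        $score++ if defined $prev && $req->{time} - $prev < 3;   # more than one request every few seconds
        $last_seen{ $req->{ip} } = $req->{time};
        return $score >= 3;    # several robot-like signals at once
    }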
Unlike humans, who tend to use search forms, robots tend to perform searches that are hotlinks on a page (e.g., top authors, hot links to authors in a table of contents). With many more robot searches than human searches, rather than try to block robots, it is better to simply analyse robot and human data separately. Trying to block robots can be futile, by the way, because badly behaved robots are good at spoofing IP addresses.
There are many ways to define failure, but for the HCI Bibliography, a search that produces no results (zero hits) is the primary one. Worse than a single query that produces no hits are multiple zero-hit queries, although zero-hit queries in the context of other queries that do match material may not be so bad. The worst failure is a user who never saw any results, especially one who tried more than one search.
For many months in 2007, a search on Google for "bibliography" would show a page that included, *sigh*, the HCI Bibliography. Google would also suggest a link for how to write a bibliography, which would not include a link to hcibib.org. Many of those poorly directed users, probably students, found themselves on hcibib.org, apparently so eager to learn about citation styles that they did not notice the SIGCHI Curriculum Development Group's definition of HCI right next to the search box (I admit I added the definition after the first few bibliography queries). Eventually, I added a special record to the database, dated so it came out first in the default newest-first order, titled "The HCI Bibliography is NOT about Bibliographies". To make it easier to match, the record included 15 misspellings of "bibliography" and about the same number of phrases used to search for the topic. It also included a link to Google where users would find more about creating bibliographies.