Charting human curiosity by archiving queries to search engines

Wouldn't it be interesting to know what other people are curious about? Here is one way of doing it.  

Established 7 September 1998.  Last update 5 February 1999  (Minor cleanup on 4 September 2006) Robin Whittle rw@firstpr.com.au  Back to the main First Principles site

 
An excellent source of information about search engines is Danny Sullivan's site http://searchenginewatch.com

There is a search engine called MetaCrawler  http://www.metacrawler.com which acts as a kind of gateway to several other search engines, including AltaVista, InfoSeek, and Excite (but not SinfoSeek or HotBot).  MetaCrawler has a special feature - MetaSpy, which displays some of the search terms that people enter into MetaCrawler.  There is a filtered (no naughty words) version and an unfiltered version.  Take a look at MetaSpy here:  http://www.metaspy.com/

The unfiltered version uses frames, one of which loads a URL that reloads itself every fifteen seconds.  That URL is: http://www.metaspy.com/spymagic/Spy?filter=false  It shows ten search terms, each of which links directly to the MetaCrawler search engine - so normally you have fifteen seconds to read them and decide whether to click on one to see what it is all about.
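
Since each search term on that page is the text of a link, one way to pull the terms out of a saved page is to collect the text of every link.  The following is only a minimal sketch in Python, not the C program described later on this page.  The sample markup, the example.invalid addresses and the names LinkTextExtractor, extract_terms and SAMPLE_PAGE are all made up for illustration, and the real MetaSpy HTML may well have needed different handling.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect the visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._in_link = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only anchors that carry an href count as search-term links.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._in_link = True
            self._parts = []

    def handle_data(self, data):
        if self._in_link:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse runs of whitespace inside the link text.
            term = " ".join("".join(self._parts).split())
            if term:
                self.terms.append(term)
            self._in_link = False


def extract_terms(html):
    """Return the link texts found in html, in page order."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.terms


# Hypothetical sample page; the real MetaSpy markup is not reproduced here.
SAMPLE_PAGE = """
<html><body>
<a href="http://example.invalid/search?q=red+hat+linux">red hat linux</a><br>
<a href="http://example.invalid/search?q=cheap+flights">cheap   flights</a><br>
</body></html>
"""

if __name__ == "__main__":
    for term in extract_terms(SAMPLE_PAGE):
        print(term)
```

A real page would probably also contain navigation links, so some filtering on the link target would likely be needed before treating every link text as a search term.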

http://searchenginewatch.com/facts/mcterms.html

A listing of sites with search phrase lists:
http://www.dwoz.com/default.asp?Pr=122
Among the sites listed at the above page is one which lists the top 100 most requested search terms.  Its URL?  http://www.searchterms.com of course! 

MetaSpy (and other recycling schemes) provides a unique window into the curiosity of Internet users.  The URL above only shows a very small number of the search terms which MetaCrawler processes, and it seems that those terms can appear several times if you access this URL often enough, or let it reload itself often enough.

I want to archive some of these search terms - as a record of human curiosity.

On my Red Hat Linux machine at my home office I created a script file which is run every hour as a cron job.  It works together with a C program which extracts the search terms from the MetaSpy HTML.  Every hour, it gets 50 pages of MetaSpy, which is 500 search terms - some of which will be repeats of others.  Then it extracts the search terms, adds them to a text file, sorts the text file and removes duplicate lines.  So every hour, my text file grows by about 380 search terms.  Since I am paying AUD$0.19 per megabyte, this activity is costing me 2.5 cents an hour.
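
The "add, sort and remove duplicates" step described above can be sketched as follows.  This is a Python illustration under my own assumptions, not the actual script and C program: the function name merge_terms and the demo file name are hypothetical, and fetching the pages is left out entirely.

```python
import tempfile
from pathlib import Path


def merge_terms(archive_path, new_terms):
    """Add new_terms to the archive file, keeping it sorted and duplicate-free.

    Returns the number of terms that were not already in the archive.
    """
    path = Path(archive_path)
    existing = set()
    if path.exists():
        existing = set(path.read_text(encoding="utf-8").splitlines())
    # Drop surrounding whitespace and skip empty entries.
    cleaned = {term.strip() for term in new_terms if term.strip()}
    merged = existing | cleaned
    # One term per line, sorted, so repeated runs leave a stable file.
    path.write_text("".join(line + "\n" for line in sorted(merged)),
                    encoding="utf-8")
    return len(merged) - len(existing)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "metaspy-demo.txt"   # hypothetical file name
        print(merge_terms(archive, ["weather", "linux", "weather"]))
        print(merge_terms(archive, ["linux", "recipes"]))
        print(archive.read_text(encoding="utf-8"), end="")
```

Because the merge works on a set, running it every hour only grows the file by the terms that have not been seen before, which matches the way 500 fetched terms turn into roughly 380 new lines.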

Here is the result of my first day's trawling of MetaSpy on 7 September 1998 - a 178 kilobyte text file with 10,761 search terms.

txt/metaspy.txt

Take a look!

Here is the result of 15 days: 7 to 22 September.  A 2.2 megabyte file with 123,831 search terms:

txt/metaspy-15-days.txt

Here is the result of a month of milking MetaSpy:

txt/metaspy-7-sept-98-to-4-oct-98.txt

The directory /curiosity/txt/ includes other files, such as a gzipped version of the longer files, and other longer files which I may put there without updating this page, so take a look there.

Here is a short 3k file in which I have listed some of the terms which I find more intriguing:

txt/terms-1.txt

By 5 February 1999, I had collected 19 megabytes of terms, and I decided to turn my system off for a while, since it was costing me money each month.  If you run a Linux/Unix system and would like my scripts and C program so you can milk MetaSpy yourself, let me know.
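
For a rough idea of what "money each month" means, the figures given earlier (AUD$0.19 per megabyte, 2.5 cents an hour, 50 pages an hour) can be worked through as below.  This is only back-of-envelope arithmetic: it assumes the job really ran every hour, a 30 day month, and 1 megabyte = 1000 kilobytes.

```python
PRICE_PER_MB_AUD = 0.19     # traffic price quoted earlier on this page
COST_PER_HOUR_AUD = 0.025   # "2.5 cents an hour"
PAGES_PER_HOUR = 50         # MetaSpy pages fetched by each hourly run

# Traffic implied by the hourly cost.
mb_per_hour = COST_PER_HOUR_AUD / PRICE_PER_MB_AUD

# Average page size, assuming 1 MB = 1000 kB.
kb_per_page = mb_per_hour * 1000 / PAGES_PER_HOUR

# Cost of running every hour for an assumed 30 day month.
cost_per_30_days = COST_PER_HOUR_AUD * 24 * 30

print(f"{mb_per_hour:.3f} MB per hour")
print(f"{kb_per_page:.1f} kB per page")
print(f"AUD${cost_per_30_days:.2f} per 30 days")
```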

The gzipped version of my collection as at 5 February 1999 is this 7.5 megabyte file:

txt/metaspy-7-sept-98-to-6-jan-99.txt.tgz
 

The search terms give some insight into:

  • What people are interested in that they cannot find by going to sites they already know about.
  • How they mis-spell things.
  • What proportion of search terms are related to sex in some way. (Answering this question by looking at the search terms will test your own thinking about what you consider is related to sex!)

There are some potential privacy problems with MetaSpy and my archiving of search terms here:

  • MetaSpy users are not necessarily aware that their search terms will be recycled in this way.
  • The search terms could include some personal names of people who don't want their name made available in such a way.
  • The search terms could be deliberately generated to attract the attention of MetaSpy users for whatever reason.

If you see a search term in the above metaspy.txt file, or any other similar file I put here, which you think is best not available in this form, please let me know.  I have used the robots.txt arrangement to stop search engines indexing files in the sub-directory in which I keep the files of search engine terms.
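
As an illustration of the robots.txt arrangement, the sketch below uses Python's standard urllib.robotparser module to check what a rule of this kind would block.  The rule shown is an assumption based on the /curiosity/txt/ directory mentioned earlier, not a copy of the site's actual robots.txt, and the host name, the ExampleBot user agent and the helper is_allowed are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: the path is taken from the /curiosity/txt/ directory named
# earlier on this page; the site's real robots.txt is not shown here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /curiosity/txt/
"""


def is_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://www.example.com"   # placeholder host
    # A file inside the archive directory should be refused to crawlers ...
    print(is_allowed(ROBOTS_TXT, base + "/curiosity/txt/metaspy.txt"))
    # ... while the page one level up should remain fetchable.
    print(is_allowed(ROBOTS_TXT, base + "/curiosity/"))
```

Note that robots.txt is only a request to well-behaved crawlers.  It does not stop a person, or a crawler that ignores it, from reading the files.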
     

Here are a few interesting search engine links:

http://searchenginewatch.com/ Danny Sullivan's excellent site, with the latest on lots of statistics and commercial attributes of major search engines.  More information and a newsletter are available to subscribers.

http://google.stanford.edu/  (This link points to the archive.org repository of these old pages.) Interesting search engine, with results ranked in terms of a page's importance, which is a function of how many other pages (especially important pages!) link to it.  Also keeps pages on its hard disks, and gives you the option of reading the page from there, rather than the original site.  This could be an interesting means of finding pages which have since changed or been deleted.  Those cached pages have an "http: source" line written into them so that their enclosed graphics etc. come from the original server.  The Google site also has research papers on search engine techniques, and pictures of their impressive servers and disk farms.
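
The ranking idea described for Google, where a page is important if important pages link to it, is circular, so it is usually computed by repeated approximation.  The sketch below is a simplified Python illustration in that spirit, not Google's actual code: the function name rank_pages, the damping value of 0.85, the iteration count and the tiny example graph are all assumptions chosen for the example.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Estimate page importance from a dict mapping page -> pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-page web: a, b and d all link to c; c links to a.
    example = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(rank_pages(example).items()):
        print(page, round(score, 3))
```

In this example, page c ends up with the highest score because three pages link to it, and page a scores well above b and d even though it has only one incoming link, because that link comes from the important page c.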
     

