Issues in Science and Technology Librarianship | Spring 1998 |
---|
DOI:10.5062/F4R78C6M |
URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed. |
The first part of the paper provides an overview of search engine structure. The second part presents the methodology used in evaluating the search engines. The third section explores the results of some of the sample searches, includes a table that compiles the evaluative information, and discusses strategies which may be useful in locating earth sciences information when using Internet search engines.
Search engines have developed into three main categories. The first category contains catalog or directory search engines, arranged by subject or material type. Examples include Yahoo!, a subject-based catalog with a keyword search aid; the Argus Clearinghouse, a set of subject-based search engines; DejaNews, a search engine devoted to Usenet information; and Magellan, a subject-based catalog of reviewed web sites. The second category is keyword or crawler search engines. These are indices of Internet material compiled by robot or spider programs. The programs regularly navigate through data tags, links and the text of web pages for new and updated information. Examples of web crawlers include HotBot, which uses a program that indexes web pages word for word, and Infoseek, which culls information through data tags and links. The third category is multi-threaded search engines, or meta-crawlers, which search multiple search engine databases concurrently and present the combined results. Examples include MetaCrawler, which uses keywords to search six indices concurrently, and Ask Jeeves, which uses natural language queries and an expert system to search five keyword search engines concurrently.
Within the three main categories of search engines are cross-over technologies. For instance, most of the catalog and directory search engines have keyword searchable indices in addition to browsable subject trees, like Yahoo!, Galaxy and the Internet Sleuth. Also many keyword or crawler search engines provide hierarchical subject channels to the material in their databases, like Excite, Lycos and Infoseek.
For the most relevant and precise results searchers should be aware of several important criteria. The "help", "how to search" or "about" links on the search engine homepage should help determine the answers to these questions:
Precision was used to measure the usefulness of the search engines, and is based on the proportion of relevant records among the first 10 to 15 records retrieved. This proportion is expressed as one of three ratings: high, average, and low. Search engines which returned relevant, working links to information related to the sample queries within the first 10 to 15 records were given a high rating. Search engines which returned marginal links (such as information on copper in other countries) were given an average precision rating. Search engines which returned an overwhelming number of inactive and completely unrelated links (such as ENSO as a company name rather than a meteorological phenomenon, or links containing information about Madrid, Spain) in the first 10 to 15 records were given a low precision rating. The precision rating in the tables is an interpretation of the results rather than an actual statistical measure of the ratio. This study is not a statistical evaluation of the precision of search engine results but rather an interpretive exploration of tools and their usefulness in the earth sciences.
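For readers who want to record the underlying ratio themselves, the following is a minimal sketch of the calculation, assuming each of the first records has been judged relevant or not. The function name and the sample judgments are hypothetical, and no numeric cut-offs for "high", "average" or "low" are implied, since the ratings in this study were interpretive.

```python
def precision_at_k(judgments, k=10):
    """Return the fraction of the first k retrieved records judged relevant.

    judgments is a list of booleans in retrieval order (True = relevant).
    """
    top = judgments[:k]
    if not top:
        return 0.0
    return sum(1 for judged_relevant in top if judged_relevant) / len(top)


# Hypothetical judgments for the first ten records of one query.
sample = [True, True, False, True, False, False, True, True, False, True]
print(precision_at_k(sample, k=10))
```

A searcher could keep one such list per engine and per query and compare the resulting fractions side by side.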
The results were considered relevant if the information they provided was unique, provided factual data and could be used in a reference transaction. The quality of the returned links was reviewed for perceived accuracy, authority, coverage, timeliness and uniqueness (Rettig, 1996; Tate & Alexander, 1996). For instance, pages which provided verifiable facts that were directly related to the query received high ratings. Pages that provided marginal, unverifiable and duplicate information or which required the user to do additional searching were given average ratings. Search engines which returned completely unrelated or inactive links were given low ratings. While other studies have gone to great lengths to take into account the bias of the searcher when determining the relevancy of the returned links (Leighton and Srivastava, 1996), this is an interpretive study in a focused subject area, so no efforts were made to compensate for potential bias by the searcher in evaluating the results.
The size of the search engines was also classed into three categories. Big search engines contain over 25 million URLs or web pages, medium sized search engines contain between one million and 25 million URLs or web pages, and small search engines contain fewer than one million URLs or web pages. Some size figures are approximate.
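The three size classes can be written down as a small function. This is only an illustrative sketch: the function name is hypothetical, and the handling of a database of exactly 25 million URLs (treated here as medium) is an assumption drawn from the wording "over 25 million."

```python
def size_class(url_count):
    """Map an approximate count of indexed URLs or pages to a size class."""
    if url_count > 25_000_000:
        return "big"
    if url_count >= 1_000_000:
        return "medium"
    return "small"


# Hypothetical database sizes, for illustration only.
for count in (500_000, 2_000_000, 50_000_000):
    print(count, size_class(count))
```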
Browsing multiple directory or catalog engines can prove frustrating since few of the engines use a controlled vocabulary. The Librarians' Index and the WWW Virtual Library both categorize data via Library of Congress subject classes (though not by LC class number). Yahoo! developers are proud of their intuitive subject classification scheme and Look Smart proclaims a "16,000-plus subject index." Catalog search engines limit their usability for serious research, however, by not providing and using name and subject thesauri. This is especially true for science categories, which can be difficult to find when browsing or keyword searching. Earth science and other science categories are often hidden under headings such as "education", "reference" or "learning." If a useful heading is not uncovered after browsing through one or two layers, query the database with keywords. The two most detailed earth science-related subject trees are found at Yahoo! and the WWW Virtual Library.
Using the search tools on the catalog search engines can result in greater precision than simply browsing subject categories. In addition these search tools provide greater flexibility for searching the contents of the database, which is very handy in the absence of controlled vocabularies or subject thesauri for the catalog databases. The Go2 search engine had the most precise search tool. (The Argus Clearinghouse and InfoMine tools also had highly precise results for the sample queries. However, when using these tools the searcher needs to be aware that their databases are small and that precision is dependent on whether or not the subject of the query is included in the database. For instance the "ENSO" query had much more precise results in the Argus and InfoMine engines than did the queries on the "New Madrid fault zone" and "copper production in Brazil.")
The default Infoseek search for "copper production in Brazil" had low precision while the search on the "New Madrid fault zone" had average precision. (Precision did increase with case sensitive searching, e.g. "New Madrid fault zone" had more precise results than "new madrid fault zone.") The most precise Infoseek search was the keyword search on "ENSO." The use of Infoseek refinement options did improve results in all searches. For example, the "pipe" command, which looks for related records within a larger set of records, led to more relevant material with the query "copper | Brazil" than the standard search did. Following the Infoseek "related topic" links did not locate many additional relevant links.
Excite did better with the phrase and multiple-concept searches than it did with the keyword query. The sample query on "copper production in Brazil" found the Copper Development Association page, entitled "Copper: Market and Data Statistics," as well as press releases and annual reports from companies with copper mines in Brazil. Excite returned several duplicate links within all the sample queries. For instance, in the "ENSO" search the NOAA-CIRES ENSO page appeared under both the http://www.cdc.noaa.gov/enso/index.html URL and the http://www.cdc.noaa.gov/enso/ URL. The "More like this" option in Excite did not retrieve any additional relevant links in any of the sample queries.
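Duplicates of this kind can be collapsed by comparing a normalized form of each URL. The sketch below is illustrative only and does not describe how any of the engines reviewed handle duplicates. It assumes that a path ending in "index.html" or "index.htm" points to the same page as its directory, which is a common server convention but not a guarantee. The function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names are served as the directory's default page.
DEFAULT_PAGES = ("index.html", "index.htm")


def normalize(url):
    """Return a comparison key for a URL, dropping default page names."""
    parts = urlsplit(url)
    path = parts.path
    for page in DEFAULT_PAGES:
        if path.endswith("/" + page):
            path = path[: -len(page)]  # keep the trailing slash
            break
    if not path:
        path = "/"
    # Scheme and host are case-insensitive; the fragment is discarded.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def drop_duplicates(urls):
    """Keep the first occurrence of each URL, judged by its normalized key."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


results = [
    "http://www.cdc.noaa.gov/enso/index.html",
    "http://www.cdc.noaa.gov/enso/",
]
print(drop_duplicates(results))
```

Mirror sites, which live on different hosts, would not be caught by this approach.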
Excite and Infoseek results for the "copper production in Brazil" query were similar, primarily pointing to press releases, annual reports and technical reports for mining and production companies. The most consistent false hits for this query were on copper as a dietary mineral supplement. (Surprisingly "brazil nuts" are a good source for copper.) The most common false hits in the "ENSO" query were to companies named ENSO, while the most common false hits in the "New Madrid fault zone" query were links to sites about unrelated seismic zones and faults.
Northern Light is another useful keyword search engine. Northern Light sorts results into folders by domain and subject. Some of the folders created with the sample search on Brazilian copper production included "Commercial sites," "Mining industry," "Metals Industry," "Coal" and "Toxicology," among others. Folders for the search on "ENSO" included "Personal pages," "Climatology," and "www.coaps.fsu.edu," among others. Northern Light also searched several online "Special Collection" databases which located journal articles. These articles could be purchased from Northern Light for document delivery fees ranging from two to six dollars, depending on the length and source of the article. This hybrid of Internet and literature databases is a trend to watch for on the web.
Planet Search, though it earned an average precision rating, had one of the best displays for search results. The Planet Search results include a bar graph depicting the relevance for each search term in the query. The bar graphs show not only the relevance of each term for each record located, but also the overall results for the entire search. Each record also contains a "Find similar" option, the record's date, and the number of words in the record. Planet Search also allows the searcher to create custom directories for search results and bookmarks. Planet Search had many repeat hits within its results, such as mirror sites for the Southern California Alphabetic Fault Index in the "New Madrid fault zone" results and three links to the ENSO Newsletter Homepage in the "ENSO" results.
Lycos did above average with phrase and multiple concept searches, but precision for the keyword query was low. WebCrawler and Magellan (a catalog/directory-type search engine) had identical results for all three queries. In addition, the WebCrawler and Magellan results were the most imprecise of just about any engine used, regardless of type. For instance, the first site listed in the "New Madrid fault zone" query was for a map of Madrid, Spain while the fifth link returned was for ESPN SportsZone: Soccer. HotBot had average precision with the default "all the words" search, but precision did increase slightly when "the exact phrase" mode was used with phrase and multiple concept queries. What-U-Seek had low precision for phrase and multiple concept searches, but had highly precise results for the keyword search on "ENSO." Alta Vista results were average on default searches, but precision did increase slightly with the use of the "refine" option. With Alta Vista the results for all queries contained duplicate links.
Most multiple-threaded search engines had average results as shown in Table 1. There was no one best multiple-threaded search engine that emerged from the sample queries. Rather some engines did better with keyword searches while others returned more useful results with phrase or multiple-concept queries. For instance, Mamma, Profusion and Metacrawler did better with the phrase query for the "New Madrid fault zone" and the multiple-concept query on "copper production in Brazil." Inference Find and Ask Jeeves had more precise results for the keyword search, "ENSO."
The interfaces for many of the multiple-threaded search engines allow the user to refine or direct the search at the top level. For example, Metacrawler and Savvy Search allow the user to look for "all" or "any" of their search terms as well as "as a phrase." ProFusion offers a default mode, a Boolean mode, or a phrase mode, while Mamma allows the user to search for their terms "as a phrase" or to limit the search for their terms to "document titles" only.
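The difference between the "all," "any" and "as a phrase" modes can be illustrated with a short sketch. This is a simplified model, not the matching logic of any engine named here. The function names and sample texts are hypothetical, and real engines also apply stop words, stemming and ranking.

```python
import re


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def matches(text, query, mode="all"):
    """Report whether text satisfies query under the given search mode."""
    words = tokenize(text)
    terms = tokenize(query)
    if mode == "all":
        return all(term in words for term in terms)
    if mode == "any":
        return any(term in words for term in terms)
    if mode == "phrase":
        n = len(terms)
        # Slide a window of n words across the text, looking for the phrase.
        return any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    raise ValueError("mode must be 'all', 'any' or 'phrase'")


query = "New Madrid fault zone"
false_hit = "A new map of Madrid, Spain, shows the fault zone near the city."
true_hit = "The New Madrid fault zone lies in the central United States."

for mode in ("all", "any", "phrase"):
    print(mode, matches(false_hit, query, mode), matches(true_hit, query, mode))
```

In this model the page about Madrid, Spain satisfies the "all" and "any" modes but not the phrase mode, which is consistent with the higher precision observed for phrase searching on this query.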
A few of the engines like MetaFind and Inference Find cluster the results of the searches by keyword. Other engines, such as Ask Jeeves and Savvy Search, group the results by the tool which returned the link. Most commonly, results are displayed by relevance ranking based on a ratio of where and how often search terms appear.
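A relevance ranking of this general kind can be sketched as follows. The formula is an assumption made for illustration: matches in the title count three times as much as matches in the body, and the total is divided by the length of the record. None of the engines reviewed document their ranking in this form, and the function names and sample records are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def relevance(title, body, query, title_weight=3.0):
    """Score a record by where and how often the query terms appear."""
    terms = set(tokenize(query))
    title_words = tokenize(title)
    body_words = tokenize(body)
    total = len(title_words) + len(body_words)
    if total == 0:
        return 0.0
    title_hits = sum(1 for word in title_words if word in terms)
    body_hits = sum(1 for word in body_words if word in terms)
    # Assumed weighting: a title match counts more than a body match.
    return (title_weight * title_hits + body_hits) / total


def rank(records, query):
    """Return records sorted from most to least relevant."""
    return sorted(
        records,
        key=lambda record: relevance(record["title"], record["body"], query),
        reverse=True,
    )


records = [
    {"title": "Company profile", "body": "ENSO is the name of a company"},
    {"title": "ENSO Newsletter", "body": "Updates on ENSO conditions"},
]
for record in rank(records, "ENSO"):
    print(record["title"])
```

A frequency-based score like this cannot tell a company named ENSO from the meteorological phenomenon, so false hits of that kind still need to be weeded out by the searcher.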
The Internet Sleuth, a catalog or directory-type search engine, can also be used as a multiple-threaded search engine. The Internet Sleuth homepage provides access to 21 subject categories, which can easily be expanded to show sub-categories. The science category has nine sub-categories including one for "Earth Sciences." The "Earth Sciences" sub-category provides search engines for more than eleven earth science resources such as Volcano World and the SPE Technical Papers Index. While "Earth Sciences" in Internet Sleuth does not yield an exhaustive list, the links provide access to some high-quality full-text resources. This access to subject-based search engines is unique. In addition to the subject-based search engines, the Internet Sleuth homepage also provides the opportunity to search the entire web simultaneously from up to six major search engines (Alta Vista, Excite, Infoseek, Lycos, WebCrawler and Yahoo!). Searchers can also view multiple reviewed, news, business & finance, software and Usenet engines.
Ask Jeeves, which uses natural language queries, yielded above average precision with the keyword query on "ENSO" but below average precision with the phrase and multiple-concept queries. Search queries are fed through an expert system which not only suggests alternate strategies to the original search, but also sends the query out to Excite, HotBot, WebCrawler, Alta Vista and Infoseek. The sample query on "ENSO" resulted in the six additional queries in Figure 1. The alternative search strategies returned were quite relevant to the original query, and gave the user the opportunity to focus the search on a particular aspect of the search term. Ask Jeeves also returned ten resources from each of the five search engines that it queried. The Ask Jeeves results from the engines queried were consistent with the results from the individual search engines (see Table 1).
Figure 1. Ask Jeeves Expert System Suggested Alternative Queries
What is the latest news coverage on El Nino?
What is an El Nino?
Where can I find information on the 1997-98 El Nino?
What is the latest news coverage on California storms?
Where can I learn about the meteorology topic El Nino?
Where can I find general scientific information on El Nino?
Highway61 had above average precision for phrase and multiple-concept searches. Highway61 sends queries to six search engines: Yahoo!, Alta Vista, Lycos, WebCrawler, Infoseek and Excite. The number of results displayed is determined by the searcher, who chooses how long the search engine can look for results as well as the number of results to display. The sample query on "copper production in Brazil" found several unique company reports, and also found the most web sites from the .br (Brazil) domain.
When using keyword and multiple-threaded engines, notice what sections of the pages the engine is searching and develop a precise search statement. The volume of information available on the web necessitates the use of "advanced" or "refine" options for more accurate search results. In addition, searchers should keep in mind the advice offered from an Infoseek tip, "Longer queries work better." Use a series of specific and unique terms for more precise search results. This advice holds true for locating earth science information on the web, as well as any subject specific search.
As Leighton and Srivastava stated, "True precision, the ratio of relevant elements returned to the total number of elements returned, is too arduous to calculate, because it would mean examining all the links returned by a service, which may number in the thousands or millions." (1997, {http://www.winona.msus.edu/library/webind2/wi2pt2.htm#EVALCRIT}, p. 3 of 8). Recognizing the limitations of this study, it is hoped that the results can still serve as a guide when using Internet search engines to locate earth science information on the World Wide Web.
Table 1. Search Engines Reviewed
Name and URL | Size | Precision | Notes |
Catalog or Directory-type search engines | |||
All in One {http://www.albany.net/allinone/} | small | average* | Common interface to many smaller search engines which user must search one at a time. Not much science. *Precision varies by tool. |
Argus Clearinghouse {http://www.clearinghouse.net/} | small | high* | Reviewed sites. Science links found under the "Environment" heading and the "Math & Sciences" heading which contains an "Earth Sciences" sub-category. *Only if subject is included in the Clearinghouse. |
C|Net's Search.com http://www.search.com/ | big | average | Site search is powered by Infoseek. Users can choose from 11 search engines when searching the "entire web." There is a "Science" sub-category under the main "Learning" category. Use of "related links" can increase precision. |
EINet Galaxy {http://www.galaxy.com} | small | average | "Geoscience" sub-category is found under the main category "Science". Found zero hits for phrase and multiple concept queries. |
Go2 (formerly the World Wide Web Worm) {http://www.overture.com/} | small | high | 500 categories listed in random order. Provides last crawled date with descriptions. Users can "rate" the located sites. |
G.O.D. (Global Online Directory) http://www.god.co.uk/ | small | low | "Science" sub-category located under the main category "Community and Education". |
HandiLinks {http://www.ahandyguide.com:80/} | small | low | There are no science "Hot Areas", but using alphabetic jump bars locates links for subjects like "geology", "meteorology", etc. |
Hot Lava {http://hotlava.erupt.com/} | small | low* | "Earth Sciences" sub-category found under "Health and Sciences" main category. Very small database. Similar to Yahoo! *Sends keyword queries simultaneously to six search engines, with average precision results. |
InfoMine {http://infomine.ucr.edu/} | small | high | Provides subject, title or keyword access. Descriptions provide links to related sites. "Earth Science" category located in the "Physical Sciences, Engineering, Computing and Math" main category. Searches can be limited to individual categories. |
Internet Sleuth {http://www.isleuth.com/} | small | average* | "Earth Science" category found in the "Science" category which provides links to specialized search engines. *Keyword queries of the Internet Sleuth database with sample queries resulted in zero hits. Specialized search engine precision results varied by tool. |
Librarians' Index to the Internet {http://lii.org/} | small | low | Uses Library of Congress subject classes. No overall earth science category but there are sub-categories for "Earthquakes", and "Environment". The browsable subject list contains "Geology" as a subject heading, but the category only contains three links. |
Look Smart {http://search.looksmart.com/} | small | average | For "Earth and Environment" category look under "Reference and Education" then the "Science and Nature" categories. |
Magellan {http://web.webcrawler.com/} [Ed. note: Magellan is now WebCrawler] | small | average | Subject categories access reviewed sites. Can also do keyword searches of the "entire web." "Science" category contains a "Planet Earth" sub-category for earth science-related links. |
Power Search {http://www.power-search.com/} | big** | average* | Distributed links to over 100 specialized and general search engines. The "Power Search" option inserts the search strategy into the search box for each tool, but searches still need to be completed individually for each tool. *Precision varies by tool. **Tools included are in the big range, but the site itself only links to 100 search tools. |
SciCentral {http://www.scicentral.com/} | small | low | Relatively new. Maintained by professionals in the fields covered. "Earth and Space Science" category contains nine sub-categories. |
WWW Virtual Library {http://vlib.org/} | small | average | Distributed servers. Geosciences housed at University of Calgary, Meteorology housed at Penn State, etc. |
Yahoo! http://www.yahoo.com/ | medium | low* | Comprehensive. "Earth Sciences" sub-category is located in the main "Sciences" category. *Sample queries resulted in zero hits in Yahoo! for phrase and multiple concept queries, and found only two (of 49) relevant links for the keyword query. Queries forwarded to Alta Vista yielded average precision. |
Name and URL | Size | Precision | Notes |
Keyword or Crawler-type Search Engines | |||
AliWeb {http://aliweb.emnet.co.uk/} | small | low | Archie-like, dynamic indices. Current focus is on academic and technical sites. Search interface provides many search refinement options. |
Alta Vista {http://www.altavista.com/} | big | average | Use of "refine" option clusters results by theme which user can then choose or exclude in order to increase precision. Alta Vista subject channels are based upon the Look Smart database. |
Excite http://www.excite.com/ | big | high | "More like this" links useful for locating related sites. No science categories or sub-categories were found in the Excite channels. "Power Search" option increased precision. |
HotBot {http://www.hotbot.com/} | big | average | "Hip Pocket Guide" categories include an "Earth and Environment" sub-category under the main "Reference and Education" category, and a "Science and Nature" sub-category (similar to Look Smart). Searches can be limited by date, geographic location and domain. |
Infoseek {http://www.go.com/} | big | high | "Earth Science" sub-category located under the main "Careers and Education" category, then follow "Fields of Study" to "Science." Results can be refined with new terms. The pipe search looks for narrower terms within a larger concept. |
Lycos http://www.lycos.com/ | big | average | To find "Earth Sciences" in the Lycos subject categories look under the "Education" sub-category, in the main category "Knowledge". Search terms can be limited to titles, URLs and within specified sites. |
Northern Light http://www.northernlight.com/ | big | high* | Access to full-text articles in Special Collections. Description includes creation date. *Use of custom search folders increased precision. |
Planet Search http://www.planetsearch.com/ | big | average* | Many customization options. Graphic display of search term relevance for each link. *"Find Similar" option increased precision. |
Web Crawler {http://www.webcrawler.com/} | medium | average | No science-related channels. Supports natural language queries. Can choose from link only or brief summary display. |
What-U-Seek http://whatuseek.com/ | medium | low* | Fast. "Science and Technology" category contains 50 sub-categories. *Higher precision for keyword searches than for phrase or multiple concept searches. |
Name and URL | Size | Precision | Notes |
Multiple-threaded or Meta-crawler type search engines (number in parentheses is number of search engines searched) | |||
Ask Jeeves {http://www.ask.com/} | big | average | Searches 5 general Internet search engines. Uses natural language queries. Expert system helps guide searchers to related information. Results from concurrently searched engines similar to "refined" results retrieved in searches of the individual engines. |
CUSI - Configurable Unified Search Index {http://cusi.emnet.co.uk/} | medium | average* | Search by type of search engine (category, keyword, Usenet, etc.) through a common interface. Tools are searched one at a time, but users can choose from over 18 different search engines. *Results vary by tool. |
DOGPILE http://www.dogpile.com/ | big | low | Searches 14 Internet search engines as well as 5 Usenet, 2 FTP and 3 news search engines. Similar to MetaFind. Searches are automatically configured with commands such as "+new +madrid +fault +zone." Results are clustered by tool which returned link. Duplicates are not removed. |
Highway 61 {http://www.highway61.com/} | big | average* | Searches 6 Internet search engines. Provides contemplative quotes while waiting for results. *Phrase and multiple-concept queries yielded higher precision. |
Inference Find {http://www.inference.com/infind/} | big | average | Searches 6 Internet search engines. Clusters results by domain and removes duplicates. |
Mamma http://www.mamma.com/ | big | average* | Searches 6 Internet search engines as well as 5 financial and 5 news search engines. Clusters results by which search engine returned the link. *Phrase and multiple-concept queries yielded higher precision than keyword queries. |
MetaCrawler http://www.metacrawler.com/ | big | average* | Searches 6 Internet search engines. "Metaspy" link allows users to see what and how other users are searching. *Phrase and multiple-concept queries yielded higher precision. |
MetaFind {http://search.metafind.com/} | big | average | Searches 6 Internet search engines. Results can be clustered by keyword, domain or alphabetically. Sorting by domain was often the most useful. Similar to Dogpile. No "stop" words; searched all words, including "in," in sample queries. |
Profusion {http://www.profusion.com/} | big | low | Searches 9 Internet search engines. Can choose to limit search to the "3 best" or "3 fastest" search engines available. Offers three search modes: default, phrase or Boolean. Displays results by relevance ranking. |
Savvy Search {http://www.cs.colostate.edu/~dreiling/} | big | average | Searches up to 19 Internet search engines in over 20 languages. Users can integrate results and limit by type of material and domain. Did not remove duplicates. |
(2) The overall ratings in the Tomaiuolo and Packer study were Alta Vista 9.3, Infoseek 8.3 and Lycos 8.1. An analysis of the earth science-related queries found Infoseek to be 9.5, Lycos 8.7 and Alta Vista 8.3.
Ding, Wei and Marchionini, Gary. 1996. A comparative study of web search service performance. In: American Society for Information Science 1996 Annual Conference Proceedings, 33; Global complexity: Information, chaos and control; Baltimore, Maryland, October 21-24, 1996. (Edited by Steve Hardin), pp. 136-142. Information Today, Medford, NJ.
Lebedev, Alexander. 17 May 1997. Best search engines for finding scientific information in the web. [Online]. {http://www.chem.msu.su/eng/comparison.html} [27 November 1997].
Leighton, H. Vernon and Srivastava, J. 16 June 1997. Precision among World Wide Web search services (Search engines): Alta Vista, Excite, Hotbot, Infoseek, Lycos. [Online]. {http://www.winona.msus.edu/library/webind2/webind2.htm}
Rettig, James. 1996. Beyond cool: Analog models for reviewing digital resources. [Online]. {http://www.onlineinc.com/onlinemag/SeptOL/rettig9.html} [30 April 1998].
Singh, Amarendra and Lidsky, David. 1996. "All-out search." PC Magazine 15(21): 213-249.
Tate, Marsha and Alexander, K. 1996. "Teaching critical evaluation skills for World Wide Web resources." Computers in Libraries 16(10): 49-55.
Tomaiuolo, Nicholas G. and Packer, Joan G. 1996. Quantitative analysis of five WWW "search engines." [Online]. {http://neal.ctstateu.edu:2001/htdocs/websearch.html} [1 December 1997].
Webster, Kathleen and Paul, Kathryn. 1996. Beyond surfing: Tools and techniques for searching the web. [Online]. {http://magi.com/~mmelick/it96jan.htm} [26 November 1997].