Thursday, March 6, 2008

How Internet Search Engines Work

Introduction

Internet search engines are special websites designed to help people find information stored on other sites. A top search engine indexes billions of pages and responds to billions of queries every day.

A typical search engine performs three tasks:
  • Crawling the Internet and analyzing important words on web pages
  • Keeping an index of the words it finds and where it finds them
  • Allowing users to look for words or combinations of words found in that index
This article explains how Internet search engines work based on these three tasks: (1) web crawling, (2) indexing and (3) searching.

1. Web Crawling

Before a search engine can tell you where a file or document is, it must first find that file or document. To find information on billions of web pages, a search engine employs special software called spiders. A spider reads a web page’s content and identifies the important information that users are most likely to look for. It typically begins with a popular site and follows every link it finds there to other sites.

The process that a spider uses to visit and analyze web pages is called web crawling. Each search engine uses its own crawling approach to make the spider operate faster. How quickly and efficiently the spider crawls affects the performance and reliability of a search engine.

Once the spiders have completed the task of finding information on web pages, they build a list of words and where they were found and send the list to the search engine’s indexing software.
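The crawling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PAGES dictionary is a made-up, in-memory stand-in for the web (a real spider would fetch pages over HTTP), and the names PageParser and crawl are hypothetical, not taken from any actual search engine.

```python
from collections import deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "web": URL -> HTML. A real spider would download these.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">B</a> search engines crawl pages',
    "http://b.example/": '<a href="http://a.example/">A</a> '
                         '<a href="http://c.example/">C</a> spiders follow links',
    "http://c.example/": "an index maps words to pages",
}


class PageParser(HTMLParser):
    """Collects the outgoing links and the visible words of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def crawl(start_url, pages=PAGES):
    """Breadth-first crawl from start_url; returns a list of (word, url) pairs."""
    seen = {start_url}
    queue = deque([start_url])
    found = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # link points outside our toy web
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()  # flush any text still buffered by the parser
        found.extend((word, url) for word in parser.words)
        for link in parser.links:
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    return found


pairs = crawl("http://a.example/")
```

Starting from a single page, the crawl reaches all three pages by following links, and the resulting list of (word, url) pairs is the kind of output a spider would hand to the indexing software.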

2. Indexing

Similar to an index in a book, a search engine’s index makes the process of information retrieval quick and efficient. Each commercial search engine has a different formula for optimizing the index, called weighting. A search engine extracts information and builds the index based on its own system of weighting. This is one of the reasons that a search for the same word on different search engines produces different lists, with pages presented in different orders. How the indexes are built and used is crucial to the success of a search engine.
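As a rough illustration of an index with weighting, the sketch below maps each word to the pages that contain it and uses the number of occurrences on a page as the weight. That weighting rule is an assumption chosen for simplicity; commercial search engines use their own, far more elaborate formulas. The function names build_index and lookup are hypothetical.

```python
from collections import defaultdict
import re


def build_index(documents):
    """Map each word to {url: weight}; weight is the word's count on that page."""
    index = defaultdict(dict)
    for url, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def lookup(index, word):
    """Return the URLs containing word, heaviest weight first (ties by URL)."""
    postings = index.get(word.lower(), {})
    return sorted(postings, key=lambda url: (-postings[url], url))


docs = {
    "page1": "search engines index the web",
    "page2": "a spider crawls the web, and the web keeps growing",
}
index = build_index(docs)
print(lookup(index, "web"))  # ['page2', 'page1']
```

Because "web" appears twice on page2 and once on page1, page2 is listed first. A different weighting rule would produce a different order, which is why the same query can rank pages differently on different search engines.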

Indexes are updated periodically as new content is crawled. Some search engines also build a dictionary of all the words available for searching, which can help correct mistyped words by showing the corrected spelling in the search results.
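One simple way to use such a dictionary for spelling correction is to pick the known word most similar to what the user typed. The sketch below relies on get_close_matches from Python's standard difflib module; the DICTIONARY contents, the suggest function, and the 0.8 similarity cutoff are assumptions made for illustration, not how any particular search engine does it.

```python
import difflib

# Hypothetical dictionary of searchable words gathered during indexing.
DICTIONARY = {"search", "engine", "spider", "index", "crawl"}


def suggest(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return word if it is known, else the closest known word, else None."""
    word = word.lower()
    if word in dictionary:
        return word
    # sorted() gives get_close_matches a stable list to compare against.
    matches = difflib.get_close_matches(word, sorted(dictionary), n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest("serch"))  # search
```

A word that is not close to anything in the dictionary yields None, in which case the search engine would simply run the query as typed.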

Once the indexing is completed, the results are stored in the search engine’s database in sorted order for users to access. Figure 1 illustrates how a search engine crawls through web pages and builds the index.



Figure 1

3. Searching

When a user enters a query consisting of keywords, the search engine examines its index and provides a listing of the best-matching web pages according to its criteria, usually with a short summary containing the document’s title and sometimes parts of the text. Most search engines support the use of Boolean operators, such as AND, OR and NOT, to further specify the search query. Figure 2 illustrates how a search engine responds to a user’s keyword search.



Figure 2
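The Boolean operators mentioned above map naturally onto set operations over an index. The following sketch assumes a tiny, made-up collection of pages and hypothetical helper names (build_postings, search_and, search_or, search_not); it shows the idea only, not a real query engine.

```python
def build_postings(documents):
    """Map each word to the set of URLs whose text contains it."""
    postings = {}
    for url, text in documents.items():
        for word in set(text.lower().split()):
            postings.setdefault(word, set()).add(url)
    return postings


def search_and(postings, a, b):
    """Pages containing both a and b."""
    return postings.get(a, set()) & postings.get(b, set())


def search_or(postings, a, b):
    """Pages containing a, b, or both."""
    return postings.get(a, set()) | postings.get(b, set())


def search_not(postings, a, b):
    """Pages containing a but not b."""
    return postings.get(a, set()) - postings.get(b, set())


docs = {
    "page1": "web search engine",
    "page2": "web crawler spider",
    "page3": "search index",
}
postings = build_postings(docs)
print(search_and(postings, "web", "search"))  # {'page1'}
print(search_not(postings, "search", "web"))  # {'page3'}
```

AND narrows the result to pages that match every keyword, OR widens it to pages that match any keyword, and NOT excludes pages that contain an unwanted word.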

Some search engines provide an advanced feature called natural language search. Natural language search allows users to ask the search engine questions instead of entering only keywords. The search engine uses programmed logic to determine the keywords in the question and then searches for information that matches them.
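A very simple form of this logic is to drop common filler words from the question and keep the rest as keywords. The sketch below makes that assumption; the STOP_WORDS list and the extract_keywords function are hypothetical, and real natural language search is considerably more sophisticated.

```python
import re

# Hypothetical list of filler words that carry little search value.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "do", "does", "can", "i",
    "how", "what", "where", "when", "why", "who",
    "of", "to", "in", "for", "on",
}


def extract_keywords(question):
    """Lowercase the question, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [word for word in words if word not in STOP_WORDS]


print(extract_keywords("How does a search engine crawl the web?"))
# ['search', 'engine', 'crawl', 'web']
```

The remaining keywords can then be looked up in the index in the same way as an ordinary keyword query.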

The overall performance of a search engine depends on the quality of search results and the response time of a query.

Summary

Internet search engines work by storing information about a large number of web pages that are retrieved from the Internet. The mission of search engines is to provide quality search results efficiently over a rapidly growing World Wide Web. This depends on how well the search engine crawls the web, how efficiently it builds and uses its index, and how accurately it retrieves results.
