Crawler Based Search Engine for Software Professionals
According to the Internet World Stats survey, 1.407 billion people were using the Internet as of March 31, 2008, and the number keeps growing day by day. The World Wide Web (commonly called the Web) is a system of interlinked hypertext documents accessed via the Internet. With a Web browser, a user views Web pages that may contain text, images, videos, and other multimedia, and navigates between them using hyperlinks.
Difference between Web and Internet
It is easy to assume that the World Wide Web and the Internet are the same thing, but they are quite different. The Internet is a collection of interconnected computer networks, linked by copper wires, fiber-optic cables, wireless connections, etc. In contrast, the Web is a collection of interconnected documents and other resources, linked by hyperlinks and URLs. The World Wide Web is one of the services accessible via the Internet, along with various others including e-mail, file sharing and online gaming. However, “the Internet” and “the Web” are commonly used interchangeably in non-technical settings.
1.1 Web Search: origins, today’s usage, problems
In the beginning, there was a complete directory of the whole World Wide Web. Those were the times when one could know every existing server on the web. Later, other web directories appeared, among them Yahoo, AltaVista, Lycos and Ask. These newer web directories kept a hierarchy of web pages based on their topics. Web directories are human-edited, which makes them very hard to maintain when the web is growing so fast. As a result, information retrieval techniques that had been developed for physical collections of documents, such as libraries, were put into practice on the web.
The first web search engines appeared in 1993. They did not keep information about the content of web pages; instead, they indexed only the titles of the pages. In 1994, web search engines started to index the whole content of web pages, so that users could search the content of a page and not only its title.
In 1998, Google appeared and changed everything. Its searches returned better results than those of earlier search engines because it considered the link structure of the web, not only its contents. The algorithm used to analyze the link structure was called PageRank. This algorithm introduced the concept of “citation” into the web: the more citations a web page has, the more important it is; furthermore, the more important the citing page is, the more important the cited page becomes. The information about citations was taken from the links in the web pages.
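The citation idea can be illustrated with a small power-iteration sketch. This is a simplified illustration, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the toy link graph `toy_web` are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}.

    The damping factor and iteration count are assumed values.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is cited by three pages, "d" by none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
```

In this toy graph the most-cited page, "c", receives the highest rank, and "d", which nobody links to, receives the lowest.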
Nowadays, web search engines are widely used, and their usage is still growing. As of November 2008, Google performed 7.23 billion searches.
Web search engines are used today by everyone with access to a computer, and those people have very different interests. Yet search engines always return the same results, regardless of who performed the search. Search results could be improved if more information about the user were considered.
1.2 Aim of the thesis
Web search engines have, broadly speaking, three basic phases: crawling, indexing and searching. The information available about the user’s interests can be considered in one or more of these phases, depending on its nature. Work on search personalization already exists.
To address the problem of search engines ignoring the user and his interests, we have developed a system intended only for software professionals. It searches over a fixed number of seed hosts and generates results using our own algorithm and some prediction.
1.3 Web Search Engine
A web search engine is designed to search for information on the World Wide Web. The search results are usually presented as a list and are commonly called hits. The information may consist of web pages, images and other types of files. Some search engines also mine data available in databases or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or through a mixture of algorithmic and human input.
An Internet search engine is an information retrieval system that helps us find information on the World Wide Web. The World Wide Web is a universe of information accessible over the network, and it facilitates global sharing of information. However, the WWW is effectively an unstructured database that is growing exponentially into an enormous store of information. Searching for information on the web is hence a difficult task, and a tool is needed to manage, filter and retrieve this ocean of information. A search engine serves this purpose.
In the context of the Internet, the term search engine refers to searching the World Wide Web and not other protocols or areas. Furthermore, search engines mine data available in newspapers, large databases, and open directories like DMOZ.org. Because the data collection is automated, they are distinguished from Web directories, which are maintained by people.
A search engine is a program designed to help find files stored on a computer, for example a public server on the World Wide Web, or one’s own computer. The search engine allows one to ask for media content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of files that match those criteria. A search engine often uses a previously built and regularly updated index to look for files after the user has entered search criteria.
The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, the most popular currently being Google (with MSN Search and Yahoo! closely behind). There have been several attempts to create open-source search engines, among which are Htdig, Nutch, Egothor, and OpenFTS.
On the Internet, a search engine is a coordinated set of programs that includes:
- A spider (also called a “crawler” or a “bot”) that goes to every page or representative pages on every Web site that wants to be searchable and reads it, using the hypertext links on each page to discover and read the site’s other pages
- A program that creates a huge index (sometimes called a “catalog”) from the pages that have been read
- A program that receives our search request, compares it to the entries in the index, and returns results to us
An alternative to using a search engine is to explore a structured directory of topics. Yahoo, which also lets us use its search engine, is the most widely used directory on the Web. A number of Web portal sites offer both the search engine and directory approaches to finding information.
Are Search Engines and Directories The Same Thing?
Search engines and Web directories are not the same thing, although the term “search engine” is often used for both. Search engines automatically create web site listings by using spiders that “crawl” web pages, index their information, and follow each site’s links to other pages. Spiders return to already-crawled sites on a fairly regular basis to check for updates or changes, and everything that these spiders find goes into the search engine database. Web directories, on the other hand, are databases of human-compiled results. Web directories are also known as human-powered search engines.
1.4 Different Search Engine Approaches
• Major search engines such as Google, Yahoo (which uses Google), AltaVista, and Lycos index the content of a large portion of the Web and provide results that can run for pages, and consequently overwhelm the user.
• Specialized content search engines are selective about what part of the Web is crawled and indexed. For example, TechTarget sites for products such as the AS/400 (http://www.search400.com) and CRM applications (http://www.searchCRM.com) selectively index only the best sites about these products and provide a shorter but more focused list of results.
• Ask Jeeves (http://www.ask.com) provides a general search of the Web but allows us to enter a search request in natural language, such as “What’s the weather in Seattle today?”
• Special tools and some major Web sites such as Yahoo let us use a number of search engines at the same time and compile results in a single list.
• Individual Web sites, especially larger corporate sites, may use a search engine to index and retrieve the content of just their own site. Some of the major search engine companies license or sell their search engines for use on individual sites.
1.4.1 Where to Search First
The last time we looked, the Open Directory Project listed 370 search engines available for Internet users. There are about ten major search engines, each with its own anchor Web site (although some have an arrangement to use another site’s search engine or license their own search engine for use by other Web sites). Some sites, such as Yahoo, not only search using their own search engine but also give the results from simultaneous searches of other search indexes. Sites that let us search multiple indexes simultaneously include:
• Yahoo (http://www.yahoo.com)
• search.com (http://search.com)
• Easy Searcher (http://www.easysearcher.com)
Yahoo first searches its own hierarchically structured subject directory and gives those entries. Then, it provides a few entries from the AltaVista search engine. It also launches a concurrent search for entries matching our search argument with six or seven other major search engines. We can link to each of them from Yahoo (at the bottom of the search result page) to see what the results were from each of these search engines.
A significant advantage of a Yahoo search is that if we locate an entry in Yahoo, it’s likely to lead to a Web site or to entire categories of sites related to our search argument.
A search.com search queries the Infoseek index first but also searches the other major search engines.
Easy Searcher lets us choose from either the popular search engines or a very comprehensive list of specialized search engines and databases in a number of fields.
Yahoo, search.com, and Easy Searcher all provide help with entering our search phrase. Most Web portal sites offer a quickly-located search entry box that connects us to the major search engines.
1.4.2 How to Search
By “How to Search,” we mean a general approach to searching: what to try first, how many search engines to try, whether to search Usenet newsgroups, and when to quit. It’s difficult to generalize, but this is the general approach we use at whatis.com:
- If we know of a specialized search engine such as SearchNetworking that matches our subject (for example, Networking), we’ll save time by using that search engine. We’ll find some specialized databases accessible from Easy Searcher 2.
- If there isn’t a specialized search engine, try Yahoo. Sometimes we’ll find a matching subject category or two and that’s all we’ll need.
- If Yahoo doesn’t turn up anything, try AltaVista, Google, Hotbot, Lycos, and perhaps other search engines for their results. Depending on how important the search is, we usually don’t need to go below the first 20 entries on each.
- For efficiency, consider using a ferret that will use a number of search engines simultaneously for us.
- At this point, if we haven’t found what we need, consider using the subject directory approach to searching. Look at Yahoo or someone else’s structured organization of subject categories and see if we can narrow down a category our term or phrase is likely to be in. If nothing else, this may give us ideas for new search phrases.
- If we feel it’s necessary, also search the Usenet newsgroups as well as the Web.
- As we continue to search, keep rethinking our search arguments. What new approaches could we use? What are some related subjects to search for that might lead us to the one we really want?
- Finally, consider whether our subject is so new that not much is available on it yet. If so, we may want to go out and check the very latest computer and Internet magazines or locate companies that we think may be involved in research or development related to the subject.
1.5 Historical Search Engine Information
During the early development of the web, there was a list of web servers edited by Tim Berners-Lee and hosted on the CERN web server. One historical snapshot from 1992 has remained. As more web servers went online the central list could not keep up. On the NCSA site new servers were announced under the title “What’s New!”
The very first tool used for searching on the Internet was Archie. The name stands for “archive” without the “v.” It was created in 1990 by Alan Emtage, a student at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites.
The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy’s Universal Gopher Hierarchy Excavation and Display) was a tool for obtaining menu information from specific Gopher servers. While the name of the search engine “Archie” was not a reference to the Archie comic book series, “Veronica” and “Jughead” are characters in the series, thus referencing their predecessor.
In the summer of 1993, no search engine existed yet for the web, though numerous specialized catalogues were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that would periodically mirror these pages and rewrite them into a standard format which formed the basis for W3Catalog, the web’s first primitive search engine, released on September 2, 1993.
In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called ‘Wandex’. The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web’s second search engine Aliweb appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format.
JumpStation (released in December 1993) used a web robot to find web pages and to build its index, and used a web form as the interface to its query program. It was thus the first WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching) as described below. Because of the limited resources available on the platform on which it ran, its indexing and hence searching were limited to the titles and headings found in the web pages the crawler encountered.
One of the first “full text” crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it let users search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one to be widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.
Soon after, many search engines appeared and vied for popularity. These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory rather than on full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search.
In 1996, Netscape was looking to give a single search engine an exclusive deal as its featured search engine. There was so much interest that Netscape instead struck a deal with five of the major search engines, under which, for $5 million per year, each search engine would be in a rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.
Around 2000, the Google search engine rose to prominence. The company achieved better results for many searches with an innovation called PageRank. This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. Google also maintained a minimalist interface to its search engine. In contrast, many of its competitors embedded a search engine in a web portal.
By 2000, Yahoo! was providing search services based on Inktomi’s search engine. Yahoo! acquired Inktomi in 2002 and Overture (which owned AlltheWeb and AltaVista) in 2003. Yahoo! used Google’s search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.
Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In early 1999 the site began to display listings from LookSmart blended with results from Inktomi, except for a short time in 1999 when results from AltaVista were used instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).
Microsoft’s rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by Microsoft Bing technology.
According to Hitbox, Google’s worldwide popularity peaked at 82.7% in December 2008. July 2009 rankings showed Google (78.4%) losing traffic to Baidu (8.87%) and Bing (3.17%). The market shares of Yahoo! Search (7.16%) and AOL (0.6%) were also declining.
In the United States, Google held a 63.2% market share in May 2009, according to Nielsen NetRatings. In the People’s Republic of China, Baidu held a 61.6% market share for web search in July 2009.
1.6 Challenges faced by search engines
· The web is growing much faster than any present-technology search engine can possibly index.
· Many web pages are updated frequently, which forces the search engine to revisit them periodically.
· The queries one can make are currently limited to searching for keywords, which may result in many false positives.
· Dynamically generated sites may be slow or difficult to index, or may produce excessive results from a single site.
· Many dynamically generated sites are not indexable by search engines; this phenomenon is known as the invisible web.
· Some search engines do not order the results by relevance, but rather according to how much money the sites have paid them.
· Some sites use tricks to manipulate the search engine into displaying them as the first result returned for some keywords. This can pollute the search results, with more relevant links being pushed down in the result list.
1.7 Types of Search Engines
In the early 2000s, more than 1,000 different search engines were in existence, although most Webmasters focused their efforts on getting good placement in the leading 10. This, however, was easier said than done. InfoWorld explained that the process was more art than science, requiring continuous adjustments and tweaking, along with regularly submitting pages to different engines for good or excellent results. The reason for this is that every search engine works differently. Not only are there different types of search engines (those that use spiders to obtain results, directory-based engines, and link-based engines), but engines within each category are unique. They each have different rules and procedures companies need to follow in order to register their site with the engine.
The term “search engine” is often used generically to describe crawler-based search engines, human-powered directories, and hybrid search engines. These types of search engines gather their listings in different ways.
1.7.1 Crawler-based search engines
Crawler-based search engines, such as Google, create their listings automatically. They “crawl” or “spider” the web, then people search through what they have found. If web pages are changed, crawler-based search engines eventually find these changes, and that can affect how those pages are listed. Page titles, body copy and other elements all play a role.
The life span of a typical web query normally lasts less than half a second, yet involves a number of different steps that must be completed before results can be delivered to a person seeking information. The following graphic (Figure 1.7.1) illustrates this life span:
1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book: it tells which pages contain the words that match the query.
2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.
3. The search results are returned to the user in a fraction of a second.
Figure 1.7.1: The life span of a typical web query
Steps of crawler-based search engines:
1. Web crawling: Search engines use a special program called a robot or spider, which crawls (travels) the web from one page to another. It visits popular sites and then follows each link available on those sites.
2. Information collection: The spider records all the words on each visited web page and their respective positions. Some search engines ignore common words such as articles (‘a’, ‘an’, ‘the’) and prepositions (‘of’, ‘on’).
3. Build index: After collecting all the data, search engines build an index to store that data so that users can access pages quickly. Different search engines use different approaches to indexing, which is why different search engines give different results for the same query. Important considerations when building an index include the frequency with which a term appears in a web page, the part of the web page where the term appears, and the font size and capitalization of the term. In fact, Google ranks a page higher if more pages vote for it (that is, link to it).
4. Data encoding: Before the indexing information is stored in databases, it is encoded into a reduced size to speed up the response time of the search engine.
5. Store data: The last step is to store this indexing information in databases.
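Steps 2 to 5 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the indexing scheme of any particular search engine: the two sample pages and their URLs are hypothetical, the stop-word list is only the handful of words named above, and JSON plus zlib compression merely stands in for whatever encoding a real engine uses.

```python
import json
import zlib
from collections import defaultdict

# Stop words taken from the examples in the text above.
STOP_WORDS = {"a", "an", "the", "of", "on"}


def build_index(pages):
    """Record, for every word, the pages and positions where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue  # step 2: common words are not recorded
            index[word][url].append(position)
    return index


# Hypothetical crawled pages (URL -> extracted text).
pages = {
    "http://example.com/a": "the crawler visits a page",
    "http://example.com/b": "an index of the page and the crawler",
}

index = build_index(pages)                                      # step 3
encoded = zlib.compress(json.dumps(index).encode("utf-8"))      # step 4
decoded = json.loads(zlib.decompress(encoded).decode("utf-8"))  # read back
```

Here `encoded` is the byte string that would be written to the database in step 5, and `decoded` shows that the index survives the round trip. The number of positions stored for a word on a page also gives its term frequency on that page.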
1.7.2 Human-powered directories
A human-powered directory, such as the Open Directory Project, depends on humans for its listings. (Yahoo!, which used to be a directory, now gets its information from crawlers.) A directory gets its information from submissions, which include a short description of the entire site, or from editors who write one for the sites they review. A search looks for matches only in the descriptions submitted. Changing web pages, therefore, has no effect on how they are listed. Techniques that are useful for improving a listing with a search engine have nothing to do with improving a listing in a directory. The only exception is that a good site, with good content, might be more likely to get reviewed for free than a poor site.
Open Directory is one such directory, and submission depends on a human actually submitting a website. The submitter must provide website information including a proper title and description. Open Directory’s editors may write their own description of our site, or rewrite the information submitted. They have total control over our submission.
When we submit a website to a human-powered directory, we must follow the rules and regulations set forth by that specific directory. While following its directives, we must submit the most appropriate information needed by potential internet users. A good site with good content has a greater chance of being reviewed and accepted by a human-powered directory.
Human Search Method
From Bessed’s perspective, a human-powered search engine finds useful sites, attempts to rank them by usefulness, and attempts to find answers for “long-tail” searches that a directory never would. A human-powered search engine also doesn’t care about hierarchies: there’s no infrastructure that says we have to drill down to Business and Industry_Apparel_Shoes_Crocs in order to find sites that sell Crocs. It simply creates a list of sites where the searcher can find Crocs, which is all the searcher wants. Its goal is also to update searches to weed out dated material that would sit in a directory forever, and it would never charge for inclusion.
1.7.3 Hybrid search engines
Today, it is extremely common for crawler-type and human-powered results to be combined when conducting a search. Usually, a hybrid search engine will favor one type of listing over another. For example, MSN Search is more likely to present human-powered listings from LookSmart. However, it also presents crawler-based results, especially for more obscure queries.
1.7.4 Meta-search engines
A meta-search engine is a search tool that sends user requests to several other search engines and/or databases and aggregates the results into a single list or displays them according to their source. Meta-search engines enable users to enter search criteria once and access several search engines simultaneously. They operate on the premise that the Web is too large for any one search engine to index it all and that more comprehensive search results can be obtained by combining the results from several search engines. This also may save the user from having to use multiple search engines separately. The term “meta-search” is frequently used to classify a set of commercial search engines, but it is also used to describe the paradigm of searching multiple data sources in real time. The National Information Standards Organization (NISO) uses the terms Federated Search and Meta-search interchangeably to describe this web search paradigm.
Figure 1.7.4: Architecture of a Meta search engine
Meta-search engines create what is known as a virtual database. They do not compile a physical database or catalogue of the web. Instead, they take a user’s request, pass it to several other heterogeneous databases and then compile the results in a homogeneous manner based on a specific algorithm.
No two meta-search engines are alike. Some search only the most popular search engines while others also search lesser-known engines, newsgroups, and other databases. They also differ in how the results are presented and the quantity of engines that are used. Some will list results according to search engine or database. Others return results according to relevance, often concealing which search engine returned which results. This benefits the user by eliminating duplicate hits and grouping the most relevant ones at the top of the list.
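The merging step, in which duplicate hits are eliminated and the most relevant ones are grouped at the top, can be sketched as follows. The scoring rule (summing the reciprocal of each result's rank across engines) is an assumption chosen for illustration, not the algorithm of any specific meta-search engine, and the engine names and URLs are hypothetical.

```python
def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one deduplicated list.

    Each URL is scored by the sum of 1/rank over the engines that returned
    it (an assumed rule), so URLs found by many engines near the top win.
    """
    scores = {}
    for urls in results_by_engine.values():
        for rank, url in enumerate(urls, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical ranked results returned by three underlying engines.
results_by_engine = {
    "engine_a": ["http://example.org/x", "http://example.org/y", "http://example.org/z"],
    "engine_b": ["http://example.org/y", "http://example.org/x", "http://example.org/w"],
    "engine_c": ["http://example.org/y"],
}
merged = merge_results(results_by_engine)
```

Each URL appears once in `merged` even though several engines returned it, and the list no longer reveals which engine contributed which hit.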
Search engines frequently differ in how they expect requests to be submitted. For example, some search engines allow the usage of the word “AND” while others require “+” and others require only a space to combine words. The better meta-search engines try to synthesize requests appropriately when submitting them.
Quality of results
Results can vary between meta-search engines based on a large number of variables. Still, even the most basic meta-search engine will allow more of the web to be searched at once than any one stand-alone search engine. On the other hand, the results are said to be less relevant, since a meta-search engine can’t know the internal “alchemy” a search engine does on its result (a meta-search engine does not have any direct access to the search engines’ database).
Meta-search engines are sometimes used in vertical search portals, and to search the deep web.
1.8 How web search engines work
A search engine operates in the following order:
- Web crawling
- Indexing
- Searching
Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link on the site. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. A query can be a single word. The purpose of an index is to allow information to be found as quickly as possible.
Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. The cached page always holds the actual search text, since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of link rot, and Google’s handling of it increases usability by satisfying the principle of least astonishment: the user normally expects the search terms to be on the returned pages. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that is no longer available elsewhere.
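The crawling behaviour described above, following links while honouring robots.txt exclusions, can be sketched with Python's standard library. To keep the example runnable without network access, the site is an in-memory dictionary (`FAKE_SITE`) and the robots.txt rules are supplied as a list of lines; both are hypothetical, as is the user-agent name `demo-bot`. A real crawler would fetch pages over HTTP and add politeness delays, which are omitted here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site used instead of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": '<a href="/docs">Docs</a> <a href="/private/x">X</a>',
    "http://example.com/docs": '<a href="/">Home</a>',
    "http://example.com/private/x": "not for crawlers",
}

# Hypothetical robots.txt content for that site.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])


def crawl(seed, fetch, robots, agent="demo-bot"):
    """Breadth-first crawl from seed, skipping URLs excluded by robots.txt."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl("http://example.com/", FAKE_SITE.get, robots)
```

The crawl reaches the home page and `/docs` but never fetches `/private/x`, because the robots.txt rule excludes that path.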
When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document’s title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. Unfortunately, there are currently no known public search engines that allow documents to be searched by date. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search; the engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, where the search involves statistical analysis of pages containing the words or phrases we search for. Finally, natural language queries allow the user to type a question in the same form one would ask it of a human; ask.com is an example of such a site.
All search engines follow the same basic process when conducting searches, but because search engines differ, the results are bound to differ depending on which engine we use.
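Boolean operators and proximity search both reduce to simple operations on an index that records word positions. The sketch below is illustrative only: the three tiny documents, the function names, and the definition of distance as the difference between word positions are assumptions, not the behaviour of any named search engine.

```python
def build_positions(docs):
    """Map each word to {doc_id: [positions]} (a small positional index)."""
    index = {}
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index


# Hypothetical document collection.
DOCS = {
    "d1": "web search engine",
    "d2": "web crawler",
    "d3": "search the deep web",
}
INDEX = build_positions(DOCS)


def docs_with(term):
    """Set of documents that contain the term."""
    return set(INDEX.get(term.lower(), {}))


def query_and(a, b):
    return docs_with(a) & docs_with(b)


def query_or(a, b):
    return docs_with(a) | docs_with(b)


def query_not(a, b):
    """Documents containing a but NOT b."""
    return docs_with(a) - docs_with(b)


def near(a, b, max_distance):
    """Proximity search: a and b occur within max_distance words of each other."""
    hits = set()
    for doc_id in query_and(a, b):
        positions_a = INDEX[a.lower()][doc_id]
        positions_b = INDEX[b.lower()][doc_id]
        if any(abs(i - j) <= max_distance for i in positions_a for j in positions_b):
            hits.add(doc_id)
    return hits
```

For example, `query_and("web", "search")` matches d1 and d3, while `near("search", "web", 1)` keeps only d1, where the two words are adjacent.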
- The searcher types a query into a search engine.
- Search engine software quickly sorts through literally millions of pages in its database to find matches to this query.
- The search engine’s results are ranked in order of relevancy.
What follows is a basic explanation of how search engines work, covering the following topics:
- Keyword Searching
- Refining Our Search
- Relevancy Ranking
- Meta Tags
Search engines use automated software programs known as spiders or bots to survey the Web and build their databases. Web documents are retrieved by these programs and analyzed. Data collected from each web page are then added to the search engine index. When we enter a query at a search engine site, our input is checked against the search engine’s index of all the web pages it has analyzed. The best URLs are then returned to us as hits, ranked in order with the best results at the top.
Keyword searching is the most common form of text search on the Web. Most search engines do their text query and retrieval using keywords.
What is a keyword, exactly? It can simply be any word on a webpage. For example, we used the word “simply” in the previous sentence, making it one of the keywords for this particular webpage in some search engine’s index. However, since the word “simply” has nothing to do with the subject of this webpage (i.e., how search engines work), it is not a very useful keyword. Useful keywords and key phrases for this page would be “search,” “search engines,” “search engine methods,” “how search engines work,” “ranking,” “relevancy,” “search engine tutorials,” etc. Those keywords would actually tell a user something about the subject and content of this page.
Unless the author of the Web document specifies the keywords for her document (this is possible by using Meta tags), it’s up to the search engine to determine them. Essentially, this means that search engines pull out and index words that appear to be significant. Since search engines are software programs, not rational human beings, they work according to rules established by their creators for what words are usually important in a broad range of documents. The title of a page, for example, usually gives useful information about the subject of the page (if it doesn’t, it should!). Words that are mentioned towards the beginning of a document (think of the “topic sentence” in a high school essay, where we lay out the subject we intend to discuss) are given more weight by most search engines. The same goes for words that are repeated several times throughout the document.
Some search engines index every word on every page. Others index only part of the document.
Full-text indexing systems generally pick up every word in the text except commonly occurring stop words such as “a,” “an,” “the,” “is,” “and,” “or,” and “www.” Some of the search engines discriminate upper case from lower case; others store all words without reference to capitalization.
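The stop-word and capitalization behaviour described above can be sketched in a few lines of Python. This is only an illustration: the function name index_terms and its case_sensitive flag are our own, and the stop-word list is just the set of examples quoted in the text, whereas real engines use longer lists.

```python
import re

# Stop words taken from the examples in the text; real lists are longer.
STOP_WORDS = {"a", "an", "the", "is", "and", "or", "www"}


def index_terms(text, case_sensitive=False):
    """Return the words of `text` that a full-text indexer would keep."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not case_sensitive:
        # Store all words without reference to capitalization.
        words = [w.lower() for w in words]
    # Stop words are dropped regardless of how they are capitalized.
    return [w for w in words if w.lower() not in STOP_WORDS]


sentence = "The Oracle is a database and an oracle is a prophet"
print(index_terms(sentence))
print(index_terms(sentence, case_sensitive=True))
```

With case_sensitive=True the two spellings “Oracle” and “oracle” stay distinct index terms, which is the difference between the two kinds of engines mentioned above.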
The Problem with Keyword Searching
Keyword searches have a tough time distinguishing between words that are spelled the same way, but mean something different (e.g., hard cider, a hard stone, a hard exam, and the hard drive on our computer). This often results in hits that are completely irrelevant to our query. Some search engines also have trouble with so-called stemming: if I enter the word “big,” should they return a hit on the word “bigger”? What about singular and plural words? What about verb tenses that differ from the word we entered by only an “s,” or an “ed”?
Search engines also cannot return hits on keywords that mean the same, but are not actually entered in our query. A query on heart disease would not return a document that used the word “cardiac” instead of “heart.”
Most sites offer two different types of searches–“basic” and “refined” or “advanced.” In a “basic” search, I just enter a keyword without sifting through any pull down menus of additional options. Depending on the engine, though, “basic” searches can be quite complex.
Advanced search refining options differ from one search engine to another, but some of the possibilities include the ability to search on more than one word, to give more weight to one search term than we give to another, and to exclude words that might be likely to muddy the results. We might also be able to search on proper names, on phrases, and on words that are found within a certain proximity to other search terms.
Some search engines also allow us to specify what form we’d like our results to appear in, and whether we wish to restrict our search to certain fields on the internet (e.g., Usenet or the Web) or to specific parts of Web documents (e.g., the title or URL).
Boolean AND means that all the terms we specify must appear in the documents, e.g., “heart” AND “attack.” We might use this to narrow a search and exclude common hits that would be irrelevant to our query.
Boolean OR means that at least one of the terms we specify must appear in the documents, e.g., bronchitis, acute OR chronic. We might use this if we didn’t want to rule out too much.
Boolean NOT means that the term following the operator must not appear in the documents. We might use this if we anticipated results that would be totally off-base, e.g., nirvana AND Buddhism, NOT Cobain.
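One way to picture the three Boolean operators is as set operations on the lists of documents in which each term occurs. The sketch below assumes a tiny hand-made index (the document numbers and the helper names docs, op_and, op_or and op_not are hypothetical); it is not how any particular engine implements its query language.

```python
# Hypothetical toy index: term -> set of ids of documents containing the term.
INDEX = {
    "nirvana": {1, 2, 3},
    "buddhism": {2, 3},
    "cobain": {1, 3},
    "acute": {4},
    "chronic": {5},
}


def docs(term):
    """Documents containing `term` (empty set if the term is not indexed)."""
    return INDEX.get(term.lower(), set())


def op_and(left, right):
    return left & right   # every term must appear


def op_or(left, right):
    return left | right   # at least one term must appear


def op_not(left, right):
    return left - right   # documents in `left` that do not contain the excluded term


# nirvana AND Buddhism, NOT Cobain
hits = op_not(op_and(docs("nirvana"), docs("Buddhism")), docs("Cobain"))
print(sorted(hits))

# acute OR chronic
print(sorted(op_or(docs("acute"), docs("chronic"))))
```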
Not quite Boolean (+ and –): Some search engines use the characters + and – instead of Boolean operators to include and exclude terms.
NEAR means that the terms we enter should be within a certain number of words of each other. FOLLOWED BY means that one term must directly follow the other. ADJ, for adjacent, serves the same function. A search engine that will allow us to search on phrases uses, essentially, the same method (i.e., determining adjacency of keywords).
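Proximity operators need to know where each word occurs, not just whether it occurs. A minimal sketch, assuming whitespace-separated words and our own helper names positions, near and followed_by:

```python
def positions(text):
    """Map each lowercased word to the list of positions where it occurs."""
    table = {}
    for pos, word in enumerate(text.lower().split()):
        table.setdefault(word, []).append(pos)
    return table


def near(text, first, second, max_distance):
    """True if the two words occur within `max_distance` words of each other."""
    table = positions(text)
    return any(abs(p - q) <= max_distance
               for p in table.get(first, [])
               for q in table.get(second, []))


def followed_by(text, first, second):
    """True if `second` occurs directly after `first` (adjacency)."""
    table = positions(text)
    return any(q - p == 1
               for p in table.get(first, [])
               for q in table.get(second, []))


doc = "heart disease raises the risk of a sudden attack"
print(near(doc, "heart", "attack", 3))
print(near(doc, "heart", "attack", 8))
print(followed_by(doc, "heart", "disease"))
```

A phrase query can be built from the same adjacency test by requiring each word of the phrase to directly follow the previous one.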
Phrases: The ability to query on phrases is very important in a search engine. Those that allow it usually require that we enclose the phrase in quotation marks, e.g., “space the final frontier.”
Capitalization: This is essential for searching on proper names of people, companies or products. Unfortunately, many words in English are used both as proper and common nouns–Bill, bill, Gates, gates, Oracle, oracle, Lotus, lotus, Digital, digital–the list is endless.
1.8.3 Relevancy Rankings
Most search engines return results with confidence or relevancy rankings. In other words, they list the hits according to how closely they think the results match the query. However, these lists often leave users shaking their heads in confusion, since, to the user, the results may seem completely irrelevant.
Why does this happen? Basically it’s because search engine technology has not yet reached the point where humans and computers understand each other well enough to communicate clearly.
Most search engines use search term frequency as a primary way of determining whether a document is relevant. If we’re researching diabetes and the word “diabetes” appears multiple times in a Web document, it’s reasonable to assume that the document will contain useful information. Therefore, a document that repeats the word “diabetes” over and over is likely to turn up near the top of our list.
If our keyword is a common one, or if it has multiple other meanings, we could end up with a lot of irrelevant hits. And if our keyword is a subject about which we desire information, we don’t need to see it repeated over and over; it’s the information about that word that we’re interested in, not the word itself.
Some search engines consider both the frequency and the positioning of keywords to determine relevancy, reasoning that if the keywords appear early in the document, or in the headers, this increases the likelihood that the document is on target. For example, one method is to rank hits according to how many times our keywords appear and in which fields they appear (e.g., in headers, titles or plain text). Another method is to determine which documents are most frequently linked to by other documents on the Web. The reasoning here is that if other folks consider certain pages important, we should, too.
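The frequency-and-field method can be illustrated with a small scoring function. The field weights below are invented for the example (the text does not give any), and the names FIELD_WEIGHTS, score and rank are ours; real engines combine many more signals.

```python
# Hypothetical weights: a hit in the title counts more than one in a header,
# which in turn counts more than one in the plain text.
FIELD_WEIGHTS = {"title": 5.0, "headers": 3.0, "body": 1.0}


def score(document, keyword):
    """Weighted count of how often `keyword` appears in each field."""
    keyword = keyword.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = document.get(field, "").lower().split()
        total += weight * words.count(keyword)
    return total


def rank(documents, keyword):
    """Names of documents that match, best score first."""
    scored = [(score(doc, keyword), name) for name, doc in documents.items()]
    scored.sort(key=lambda item: -item[0])
    return [name for points, name in scored if points > 0]


documents = {
    "a": {"title": "Diabetes basics", "body": "diabetes is a chronic condition"},
    "b": {"title": "Health news", "body": "diabetes diabetes diabetes"},
    "c": {"title": "Cooking", "body": "recipes for the weekend"},
}
print(rank(documents, "diabetes"))
```

Document “a” outranks “b” even though “b” repeats the word more often, because one occurrence in the title is weighted more heavily than several in the body.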
If we use the advanced query form on AltaVista, we can assign relevance weights to our query terms before conducting a search. Although this takes some practice, it essentially allows us to have a stronger say in what results we will get back.
As far as the user is concerned, relevancy ranking is critical, and becomes more so as the sheer volume of information on the Web grows. Most of us don’t have the time to sift through scores of hits to determine which hyperlinks we should actually explore. The more clearly relevant the results are, the more we’re likely to value the search engine.
Some search engines are now indexing Web documents by the Meta tags in the documents’ HTML (at the beginning of the document in the so-called “head” tag). What this means is that the Web page author can have some influence over which keywords are used to index the document, and even in the description of the document that appears when it comes up as a search engine hit.
This is obviously very important if we are trying to draw people to our website based on how our site ranks in search engines’ hit lists.
There is no perfect way to ensure that we’ll receive a high ranking. Even if we do get a great ranking, there’s no assurance that we’ll keep it for long. For example, at one period a page from the Spider’s Apprentice was the number-one-ranked result on AltaVista for the phrase “how search engines work.” A few months later, however, it had dropped lower in the listings.
There is a lot of conflicting information out there on meta-tagging. If we’re confused it may be because different search engines look at Meta tags in different ways. Some rely heavily on Meta tags; others don’t use them at all. The general opinion seems to be that Meta tags are less useful than they were a few years ago, largely because of the high rate of spam-indexing (web authors using false and misleading keywords in the Meta tags).
It seems to be generally agreed that the “title” and the “description” Meta tags are important to write effectively, since several major search engines use them in their indices. Use relevant keywords in our title, and vary the titles on the different pages that make up our website, in order to target as many keywords as possible. As for the “description” Meta tag, some search engines will use it as their short summary of our URL, so make sure our description is one that will entice surfers to our site.
Note: The “description” Meta tag is generally held to be the most valuable, and the most likely to be indexed, so pay special attention to this one.
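To make the discussion concrete, the sketch below shows how an indexer might read the title and the “description” and “keywords” Meta tags from a page, using Python’s standard html.parser module. The sample page and the class name HeadInfo are made up for illustration; how a real search engine treats these tags is, as noted above, engine-specific.

```python
from html.parser import HTMLParser


class HeadInfo(HTMLParser):
    """Collect the <title> text and the named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page used only for this example.
PAGE = """<html><head>
<title>How Search Engines Work</title>
<meta name="description" content="A short tutorial on search engine methods.">
<meta name="keywords" content="search engines, ranking, relevancy">
</head><body><h1>Search</h1></body></html>"""

info = HeadInfo()
info.feed(PAGE)
print(info.title)
print(info.meta["description"])
print([k.strip() for k in info.meta["keywords"].split(",")])
```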
In the keyword tag, list a few synonyms for keywords, or foreign translations of keywords (if we anticipate traffic from foreign surfers). Make sure the keywords refer to, or are directly related to, the subject or material on the page. Do NOT use false or misleading keywords in an attempt to gain a higher ranking for our pages.
The “keyword” Meta tag has been abused by some webmasters. For example, a recent ploy has been to put such words as “mp3” into keyword Meta tags, in hopes of luring searchers to one’s website by using popular keywords.
The search engines are aware of such deceptive tactics, and have devised various methods to circumvent them, so be careful. Use keywords that are appropriate to our subject, and make sure they appear in the top paragraphs of actual text on our webpage. Many search engine algorithms score the words that appear towards the top of our document more highly than the words that appear towards the bottom. Words that appear in HTML header tags (H1, H2, H3, etc) are also given more weight by some search engines. It sometimes helps to give our page a file name that makes use of one of our prime keywords, and to include keywords in the “alt” image tags.
One thing we should not do is use some other company’s trademarks in our Meta tags. Some website owners have been sued for trademark violations because they’ve used other company names in the Meta tags. We have, in fact, testified as an expert witness in such cases. We do not want the expense of being sued!
Remember that all the major search engines have slightly different policies. If we’re designing a website and meta-tagging our documents, it is worth taking the time to check what the major search engines say in their help files about how they each use Meta tags. We might want to optimize our Meta tags for the search engines we believe are sending the most traffic to our site.
Search Engine with Web Crawling
In the previous chapter we briefly discussed the vast expansion occurring in the World Wide Web. As the number of web pages around the world increases day by day, the need for search engines has grown with it. In this chapter, we explain the basic components of any basic search engine along with its working. After this, the role of web crawlers, one of the essential components of any search engine, is explained.
2.1 Basic Web Search Engine
The plentiful content of the World-Wide Web is useful to millions. Some simply browse the Web through entry points such as Yahoo, MSN etc. But many information seekers use a search engine to begin their Web activity. In this case, users submit a query, typically a list of keywords, and receive a list of Web pages that may be relevant, typically pages that contain the keywords. By Search Engine in relation to the Web, we are usually referring to the actual search form that searches through databases of HTML documents.
Crawler based search engines use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site’s meta tags and also follow the links that the site connects to, indexing all linked Web sites as well. The crawler returns all that information to a central depository, where the data is indexed. The crawler will periodically return to the sites to check for any information that has changed. The frequency with which this happens is determined by the administrators of the search engine.
2.2 Structure & Working of Search Engine
The basic structure of any crawler based search engine is shown in Figure 2.2. The main steps in any search engine are as follows.
Figure 2.2: Generic Structure of a Search Engine
2.2.1 Gathering also called “Crawling”
Every engine relies on a crawler module to provide the grist for its operation. This operation is performed by special software called “crawlers.” Crawlers are small programs that “browse” the Web on the search engine’s behalf, similarly to how a human user would follow links to reach different pages. The programs are given a starting set of URLs, whose pages they retrieve from the Web. The crawlers extract URLs appearing in the retrieved pages, and give this information to the crawler control module. This module determines what links to visit next, and feeds the links to visit back to the crawlers.
2.2.2 Maintaining Database/Repository
All the data of the search engine is stored in a database, as shown in Figure 2.2. All searching is performed through that database, and it needs to be updated frequently. During and after the crawling process, search engines must store all the new useful pages that they have retrieved from the Web. The page repository (collection) in Figure 2.2 represents this possibly temporary collection. Sometimes search engines maintain a cache of the pages they have visited beyond the time required to build the index. This cache allows them to serve out result pages very quickly, in addition to providing basic search facilities.
Once the pages are stored in the repository, the next job of search engine is to make an index of stored data. The indexer module extracts all the words from each page, and records the URL where each word occurred. The result is a generally very large “lookup table” that can provide all the URLs that point to pages where a given word occurs. The table is of course limited to the pages that were covered in the crawling process. As mentioned earlier, text indexing of the Web poses special difficulties, due to its size, and its rapid rate of change. In addition to these quantitative challenges, the Web calls for some special, less common kinds of indexes. For example, the indexing module may also create a structure index, which reflects the links between pages.
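The two indexes described here, the word “lookup table” and the structure index, can be sketched with ordinary dictionaries. The repository contents and URLs below are hypothetical, and build_indexes is our own name; the point is only to show what each index maps from and to.

```python
from collections import defaultdict


def build_indexes(repository):
    """Build a text index and a structure index from a page repository.

    `repository` maps URL -> (page text, list of outgoing link URLs).
    """
    text_index = defaultdict(set)       # word -> URLs of pages where it occurs
    structure_index = defaultdict(set)  # URL  -> URLs of pages that link to it
    for url, (text, links) in repository.items():
        for word in text.lower().split():
            text_index[word].add(url)
        for target in links:
            structure_index[target].add(url)
    return text_index, structure_index


# Hypothetical repository produced by an earlier crawl.
repository = {
    "http://example.com/": ("search engines crawl the web",
                            ["http://example.com/crawler"]),
    "http://example.com/crawler": ("a crawler follows links on the web",
                                   ["http://example.com/"]),
}
text_index, structure_index = build_indexes(repository)
print(sorted(text_index["web"]))
print(sorted(structure_index["http://example.com/crawler"]))
```

Looking up a word returns every crawled URL where it occurs, and the table is, as the text says, limited to pages covered by the crawl.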
This section deals with user queries. The query engine module is responsible for receiving and fulfilling search requests from users. The engine relies heavily on the indexes, and sometimes on the page repository. Because of the Web’s size, and the fact that users typically only enter one or two keywords, result sets are usually very large.
Since a user query produces a large number of results, it is the job of the search engine to display the most appropriate results to the user. To make searching efficient, the results are ranked. The ranking module therefore has the task of sorting the results such that results near the top are the most likely ones to be what the user is looking for. Once the ranking is done by the ranking component, the final results are displayed to the user. This is how any search engine works.
A spider, also known as a robot or a crawler, is actually just a program that follows, or “crawls”, links throughout the internet, grabbing content from sites and adding it to search engine indexes.
Spiders can only follow links from one page to another and from one site to another. That is the primary reason why links to the site (inbound links) are so important. Links to the website from other websites will give the search engine spiders more “food” to chew on. The more times they find links to the site, the more times they will stop by and visit. Google especially relies on its spiders to create its vast index of listings.
Spiders find Web pages by following links from other Web pages, but we can also submit our Web pages directly to a search engine or directory and request a visit by their spider. In fact, it’s a good idea to manually submit our site to a human-edited directory such as Yahoo, and usually spiders from other search engines (such as Google) will find it and add it to their database. It can be useful to submit our URL straight to the various search engines as well; but spider-based engines will usually pick up our site regardless of whether or not we’ve submitted it to a search engine.
Figure 2.3: “Spiders” take a Web page’s content and create key search words that enable online users to find pages they’re looking for.
2.3.1 A Survey of Web Crawlers
Web crawlers are almost as old as the web itself. The first crawler, Matthew Gray’s Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic. Several papers about web crawling were presented at the first two World Wide Web conferences. However, at the time, the web was three to four orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today’s web. Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. The original Google crawler (developed at Stanford) consisted of five functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes. Each crawler process ran on a different machine, was single-threaded, and used asynchronous I/O to fetch data from up to 300 web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk. The pages were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file. A URL resolver process read the link file, converted the URLs contained therein to absolute form, and saved the absolute URLs to the disk file that was read by the URL server. Typically, three to four crawler machines were used, so the entire system required between four and eight machines. Research on web crawling continues at Stanford even after Google has been transformed into a commercial effort.
The Internet Archive also used multiple machines to crawl the web. Each crawler process was assigned up to 64 sites to crawl, and no site was assigned to more than one crawler. Each single-threaded crawler process read a list of seed URLs for its assigned sites from disk into per-site queues, and then used asynchronous I/O to fetch pages from these queues in parallel. Once a page was downloaded, the crawler extracted the links contained in it. If a link referred to the site of the page it was contained in, it was added to the appropriate site queue; otherwise it was logged to disk. Periodically, a batch process merged these logged “cross-site” URLs into the site-specific seed sets, filtering out duplicates in the process.
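The link-routing rule of the Internet Archive crawler, as described above, can be sketched as follows. This is a simplified illustration, not the Archive’s code: the function names route_links and merge_cross_site are ours, a “site” is approximated by the host part of the URL, and in-memory lists stand in for the on-disk queues and log.

```python
from urllib.parse import urlsplit


def route_links(page_url, links, site_queues, cross_site_log):
    """Queue same-site links; log links that point to other sites."""
    site = urlsplit(page_url).netloc
    for link in links:
        if urlsplit(link).netloc == site:
            site_queues.setdefault(site, []).append(link)
        else:
            cross_site_log.append(link)


def merge_cross_site(cross_site_log, seed_sets):
    """Batch step: fold logged URLs into per-site seed sets, dropping duplicates."""
    for link in cross_site_log:
        seed_sets.setdefault(urlsplit(link).netloc, set()).add(link)
    cross_site_log.clear()


site_queues, cross_site_log, seed_sets = {}, [], {}
route_links(
    "http://a.example/index.html",
    ["http://a.example/about.html", "http://b.example/", "http://b.example/"],
    site_queues,
    cross_site_log,
)
merge_cross_site(cross_site_log, seed_sets)
print(site_queues)
print(seed_sets)
```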
2.3.2 Basic Crawling Terminology
Before we discuss the working of crawlers, it is worth explaining some of the basic terminology related to crawlers. These terms will be used in the forthcoming chapters as well.
2.3.2.1 Seed Page: By crawling, we mean traversing the Web by recursively following links from a starting URL or a set of starting URLs. This starting URL set is the entry point through which any crawler starts its searching procedure. This set of starting URLs is known as the “Seed Page”. The selection of a good seed is the most important factor in any crawling process.
2.3.2.2 Frontier (Processing Queue): The crawling method starts with a given URL (seed), extracting links from it and adding them to a list of un-visited URLs. This list of un-visited links or URLs is known as the “Frontier”. Each time, a URL is picked from the frontier by the Crawler Scheduler. The frontier is implemented using a queue or priority queue data structure. The maintenance of the frontier is also a major function of any crawler.
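Both frontier variants mentioned above can be sketched with Python’s standard deque and heapq modules. The class names FifoFrontier and PriorityFrontier are ours, and the priorities in the example are arbitrary numbers (lower means “visit sooner”); how a real scheduler assigns priorities is outside the scope of this sketch.

```python
from collections import deque
import heapq


class FifoFrontier:
    """Frontier as a plain queue: URLs are visited in the order discovered."""

    def __init__(self, seeds):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self._seen:       # never queue the same URL twice
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        return self._queue.popleft() if self._queue else None


class PriorityFrontier:
    """Frontier as a priority queue: the lowest priority value is visited first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0               # tie-breaker that keeps insertion order

    def add(self, url, priority):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (priority, self._counter, url))
            self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


fifo = FifoFrontier(["http://example.com/"])
fifo.add("http://example.com/a")
fifo.add("http://example.com/")        # already seen, ignored
print(fifo.next_url(), fifo.next_url(), fifo.next_url())

prio = PriorityFrontier()
prio.add("http://example.com/low", 5)
prio.add("http://example.com/high", 1)
print(prio.next_url(), prio.next_url())
```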
2.3.2.3 Parser: Once a page has been fetched, we need to parse its content to extract information that will feed and possibly guide the future path of the crawler. Parsing may imply simple hyperlink/URL extraction or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree. The job of any parser is to parse the fetched web page, extract the list of new URLs from it, and return the new un-visited URLs to the frontier.
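The simple case, hyperlink extraction, can be sketched with the standard html.parser and urllib.parse modules. The names LinkParser and extract_links and the sample page are hypothetical; the sketch resolves relative links against the page URL, drops fragments, and keeps only http and https links, which is one reasonable policy among several.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collect the absolute URLs of the <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


html = ('<a href="page2.html">next</a> <a href="/top.html#intro">top</a> '
        '<a href="mailto:someone@example.com">mail</a> <a href="page2.html">again</a>')
print(extract_links("http://example.com/docs/index.html", html))
```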
2.4 Working of Basic Web Crawler
From the beginning, a key motivation for designing Web crawlers has been to retrieve web pages and add them or their representations to a local repository. Such a repository may then serve particular application needs such as those of a Web search engine. In its simplest form a crawler starts from a seed page and then uses the external links within it to reach other pages. The structure of a basic crawler is shown in Figure 2.4(a). The process repeats with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher level objective is reached. Behind this simple description lies a host of issues related to network connections, and parsing of fetched HTML pages to find new URL links.
Figure 2.4(a): Components of a web-crawler
A common web crawler implements a method composed of the following steps:
· Acquire URL of processed web document from processing queue
· Download web document
· Parse document’s content to extract set of URL links to other resources and update processing queue
· Store web document for further processing
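The four steps above can be put together in a short loop. To keep the sketch runnable without network access, the download and parse steps are replaced by a lookup in a small, made-up in-memory “web” (FAKE_WEB); the function name crawl, the fetch_links parameter and the max_pages limit are likewise our own choices, not part of any particular crawler.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs that the page links to.
# It stands in for the download and parse steps so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl; returns the URLs stored, in visiting order."""
    queue = deque(seeds)                 # the processing queue
    seen = set(seeds)
    repository = []
    while queue and len(repository) < max_pages:
        url = queue.popleft()            # 1. acquire a URL from the queue
        links = fetch_links(url)         # 2. download the document (stubbed)
        if links is None:                #    page could not be fetched
            continue
        for link in links:               # 3. extract links, update the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
        repository.append(url)           # 4. store the document
    return repository


print(crawl(["http://example.com/"], FAKE_WEB.get))
```

In a real crawler, fetch_links would download the page and run a parser over it; passing it in as a parameter keeps the loop itself independent of how pages are retrieved.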
The basic working of a web-crawler can be discussed as follows:
· Select a starting seed URL or URLs
· Add it to the frontier
· Now pick the URL from the frontier
· Fetch the web-page corresponding to that URL
· Parse that web-page to find new URL links