Enhancing the Performance of Web Focused Crawler Using Ontology

Abstract: The enormous growth of the World Wide Web in recent years has made efficient resource discovery important. The Web's rapid growth (it doubles in size approximately every eight months) poses unprecedented scaling challenges for general-purpose crawlers and search engines. Finding useful information on the Web, which has a large and distributed structure, requires efficient search strategies. Ontology plays an important role here by providing a controlled vocabulary of concepts, each with explicitly defined and machine-processable semantics. In this paper, we propose ontology-based, content-focused crawling, a novel approach to intelligent crawling that analyses its crawl boundary to find the links that are likely to be most relevant for the crawl while avoiding irrelevant regions of the Web. Through our focused crawling technique we address the polysemy (a word with multiple meanings) and synonymy (multiple words with the same meaning) problems of semantic nets. Instead of searching the whole Web, our proposed technique searches the ontology built by us, which is updated periodically at very short intervals; instead of displaying information unrelated to the user's need, we display only relevant and related information. Our proposed work gives a twofold benefit: first, only focused results are retrieved, which reduces the number of results extracted; second, because of focused searching, irrelevant results are pruned, which reduces the search time.


I. Introduction
A search engine is an information retrieval system designed to minimize the time required to find information over the vast Web of hyperlinked documents. It provides a user interface that enables users to specify criteria about an item of interest and searches for it in locally maintained databases. The criteria are referred to as a search query. In the case of text search engines, the search query is typically expressed as a set of words that identify the desired concept that one or more documents may contain. Compared to traditional document collections, which reside in physical warehouses such as a college library, the information available on the WWW is distributed over the Internet. In fact, this huge repository is growing rapidly without any geographical constraints.
Therefore, the search engine employs a component called a crawler, which visits Web pages, collects them and categorizes them. The crawler retrieves web pages, commonly for use by a search engine. It traverses the web by downloading documents and following embedded links from page to page. Formally, crawlers may be defined as "Software programs that traverse the World Wide Web information space by following the hypertext links extracted from hypertext documents".
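The traversal described above can be sketched as a breadth-first walk over a link graph. This is only a minimal illustration under stated assumptions: the `WEB` dictionary is a hypothetical stand-in for real page downloading and link extraction, and the page names are invented for the example.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of URLs it links to.
# A real crawler would download each page and parse its hyperlinks instead.
WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
}

def crawl(seed, max_pages=100):
    """Visit pages breadth-first from the seed, following embedded links."""
    visited = []
    seen = {seed}
    frontier = deque([seed])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)               # stands in for downloading the page
        for link in WEB.get(url, []):     # follow its embedded links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

The `seen` set keeps the crawler from revisiting pages when links form cycles, as between `a.html` and `c.html` here.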

II. Focused Crawler
A focused crawler, or topical crawler, is a web crawler that attempts to download only web pages that are relevant to a pre-defined topic or set of topics, that is, pages that are similar to each other. The concepts of topical and focused crawling were first introduced by Chakrabarti et al. [3,9]. The main problem in focused crawling is that, in the context of a Web crawler, we would like to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in the first web crawler of the early days of the Web. The performance of a focused crawler depends mostly on the richness of links in the specific topic being searched, and a focused crawler usually relies on a general Web search engine to provide starting points.
An ontology always includes a vocabulary of representational concept labels to describe a shared domain. These concept labels are usually called terms (lexical references) and are associated with entities (non-lexical referents, the concepts) in the universe of discourse. Formal axioms are also introduced to constrain their interpretation and well-formed use. An ontology is in principle a formalization of a shared understanding of a domain that is agreed upon by a number of agents, as described by Spyns P. et al. [25]. In order for this domain knowledge to be shared amongst agents, they must have  [22]. This includes axioms about properties of objects and how they are related, also called the semantic relationships of the ontology.
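The idea of predicting relevance from anchor text before downloading a page can be sketched as a best-first crawl. This is an illustrative sketch, not Pinkerton's or Chakrabarti's actual method: the `LINKS` graph, the word-overlap score and the threshold of 0.3 are all assumptions made for the example.

```python
import heapq

# Hypothetical link graph: page -> list of (anchor text, target URL).
LINKS = {
    "seed": [("jaguar car reviews", "cars"), ("jaguar habitat", "animals")],
    "cars": [("sports car engines", "engines")],
    "animals": [("big cat diet", "diet")],
    "engines": [],
    "diet": [],
}

def anchor_score(anchor_text, topic_terms):
    """Fraction of anchor words that are topic terms (a crude predictor)."""
    words = anchor_text.lower().split()
    return sum(w in topic_terms for w in words) / len(words) if words else 0.0

def focused_crawl(seed, topic_terms, threshold=0.3):
    """Best-first crawl that follows only links whose anchor text looks on-topic."""
    frontier = [(-1.0, seed)]             # max-heap emulated with negated scores
    seen = {seed}
    order = []
    while frontier:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for anchor, target in LINKS.get(url, []):
            score = anchor_score(anchor, topic_terms)
            if target not in seen and score >= threshold:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return order
```

With the topic terms `{"car", "sports", "engines"}`, the link labelled "jaguar habitat" scores zero and the pages behind it are never downloaded.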

III. Related work
Ontology has been used to improve the effectiveness of focused crawling. The ontology-based crawler of Hiep Phuc Luong et al. [29] and the web crawler of A. Ardo [8] estimate the semantic content of a URL's link in a given set of documents based on a domain-dependent ontology, which in turn strengthens the metric used for prioritizing the URL queue. Links representing concepts in the ontology knowledge path are given higher priority. Hong-Wei Hao et al. [30] consider an ontology-based algorithm for computing page relevance. After preprocessing, entities (words occurring in the ontology) are extracted from the page and counted. The relevance of the page with regard to user-selected entities of interest is then computed using several measures on the ontology graph (e.g. direct match, taxonomic and more complex relationships). The harvest rate is improved compared to the baseline focused crawler (which decides on page relevance by a simple binary keyword match).
Chakrabarti et al. [9] proposed that the next generation of the Semantic Web focus on supporting better cooperation between humans and machines. In this approach, ontology plays an important role as a backbone for providing and accessing knowledge sources. Since manual building of an ontology is costly, time-consuming, error-prone and inflexible to change, it is hoped that an automated process will result in better ontology construction and create ontologies that better match a specific application, as presented by A. Maedche et al. [13]. Ontology learning approaches can be distinguished by the type of input used for learning, e.g., they can learn from text, from a dictionary, from a knowledge base, from semi-structured schemata, or from relational schemata, as described by A. Gómez-Pérez and M. Shamsfard [10,16]. Currently, few projects attempt to support the entire ontology learning process, including automated support for tasks such as retrieving documents, classifying, filtering and extracting relevant information for ontology enrichment. Most existing approaches to ontology learning require a large number of input documents for accurate results, as in B. Omelayenko [15]. With the enormous growth of the Web, it is important to develop document discovery mechanisms based on intelligent techniques such as focused crawling, as in T. Joachims [11], to make this process easier for a new domain. Focused crawlers go a step further than classic crawlers in order to quickly collect Web pages about a particular topic or domain of the Web.
Gómez-Pérez et al. [10] present a good summary of several ontology learning projects that are concerned with knowledge acquisition from a variety of sources such as text documents, dictionaries, knowledge bases, relational schemas and semi-structured data. Many of these existing approaches learn ontologies from text documents, although only a few deal with ontology enrichment from documents collected from the Web. Omelayenko [15] discusses the applicability of machine learning algorithms to learning ontologies from Web documents and also surveys current ontology learning and other closely related approaches.
Similar to our approach, J. M. Park et al. [22] introduce an ontology learning framework for the Semantic Web which proceeds through ontology import, extraction, pruning, refinement, and evaluation, giving ontology engineers a wealth of coordinated tools for ontology modeling. However, they do not mention any automated support for collecting the domain documents from the Web or for automatically identifying the domain-relevant documents needed by the ontology learning process. In another approach similar to ours, S. T. Dumais et al. [17] present an automatic method to enrich very large ontologies, e.g., WordNet, that uses documents retrieved from the Web. However, they do not apply any filtering techniques to verify that the retrieved documents are truly on-topic.
Many ontology learning approaches require a large collection of input documents in order to enrich the existing ontology, as in B. Omelayenko [15]. A common way to get these documents from the Web is to use general-purpose crawlers and search engines, but this approach faces problems with scalability due to the rapid growth of the Web. In contrast, focused crawlers overcome this drawback, i.e., they yield good recall as well as good precision, by restricting themselves to a limited domain [18]. Devashis Hati et al. [34] describe a new hypertext resource discovery system with the purpose of selectively seeking out pages that are relevant to a pre-defined set of topics. Ester et al. [18] also introduce a generic framework for focused crawling consisting of two major components: (i) specification of the user interest and measuring the resulting relevance of a given web page; and (ii) a crawling strategy. In order to improve the accuracy of the learned ontology, the documents retrieved by focused crawlers may need to be automatically filtered using some text classification technique such as Support Vector Machines (SVM), k-Nearest Neighbors, Linear Least-Squares Fit, TF-IDF, etc. A thorough survey and comparison of such methods and their complexity is presented in J. Qin [20], and C. C. Aggarwal et al. [1] conclude that SVM is the most accurate for text classification and is fast to train. M. Ehrig et al. [18] and T. Joachims [11] describe SVM as a machine learning model that finds an optimal hyperplane to separate the data and then classifies data into one of two classes based on the side of the hyperplane on which they are located.
We also adopt the meta-search method proposed by J. Qin [20] in our framework. Other works more closely related to ours mostly adopt a certain semantic model in crawling. S. Chakrabarti et al. [9] use a thesaurus to process predefined documents associated with the specified topic. S. M. Pahlevi et al. [19] combine taxonomy-based search engines and a machine learning technique for adaptive Web search. Ester et al. [18] use a complex ontology and associated instance elements to build the focused crawler. Hong-Wei Hao [33] also defines the topic focus as an ontology, which is used for automated subject classification. J. Graupmann et al. [21] build a search engine which crawls semantic markup in HTML, XML, etc.
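The entity-counting relevance computation described above can be illustrated roughly as follows. This is a sketch under assumptions, not the algorithm of [30]: the toy `PARENT` ontology, the weights (1.0 for a direct match, 0.5 for a parent or child concept) and the normalization by the total entity count are all chosen only for the example.

```python
import re
from collections import Counter

# Hypothetical toy ontology: concept -> parent concept (taxonomic relation).
PARENT = {"sedan": "car", "hatchback": "car", "car": "vehicle", "truck": "vehicle"}

def extract_entities(text):
    """Count the words of the page that occur in the ontology."""
    vocabulary = set(PARENT) | set(PARENT.values())
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in vocabulary)

def relation_weight(entity, interest):
    """Assumed weights: 1.0 for a direct match, 0.5 for a parent/child link."""
    if entity == interest:
        return 1.0
    if PARENT.get(entity) == interest or PARENT.get(interest) == entity:
        return 0.5
    return 0.0

def page_relevance(text, interests):
    """Weighted share of the page's ontology entities related to the interests."""
    counts = extract_entities(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    weighted = sum(
        n * max(relation_weight(e, i) for i in interests)
        for e, n in counts.items()
    )
    return weighted / total
```

A binary keyword match would score a page that mentions only "sedan" as irrelevant to "car"; the taxonomic weight lets such a page contribute partially.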
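The final classification step of an SVM, deciding a class from the side of the hyperplane on which a point lies, can be sketched as below. Training, which finds the optimal hyperplane, is omitted; the weights and bias here are hypothetical hand-picked values, and the two features (counts of on-topic and off-topic terms in a document) are assumptions for illustration.

```python
def classify(weights, bias, features):
    """Return +1 or -1 according to the side of the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

# Hypothetical parameters, not learned from data:
# feature 0 = number of on-topic terms, feature 1 = number of off-topic terms.
WEIGHTS = [1.0, -1.0]
BIAS = 0.0

on_topic_doc = [3, 1]    # mostly on-topic terms -> class +1
off_topic_doc = [1, 3]   # mostly off-topic terms -> class -1
```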

IV. Present Problem
Ontology provides a base framework for knowledge representation, and the methodology of ontology construction is one of the most important research topics in the ontology community. Many methodologies have been proposed, and some of them have been developed alongside the construction of engineering ontologies. However, the previous methodologies are mostly top-down approaches, which do not maximize the benefits of bottom-up approaches.
A few bottom-up approaches exist, but they do not utilize the full resources of knowledge such as engineering documents.
A critical look at the available literature reveals that existing work needs to address the following issue: there is a need for a search engine that covers two major problems of information retrieval, polysemy and synonymy, simultaneously. Polysemy refers to words with multiple meanings, i.e. how the same phonological form (word) has different semantic mappings (meanings). If the two meanings are unrelated, as in the word pen meaning both writing instrument and enclosure, they are considered homonyms. Synonymy refers to multiple words having the same meaning. As the name implies, synonyms are words that mean the same or have similar meanings in context. Synonyms are used not only for variety, but also to express thoughts or ideas in another, often more emphatic, manner.
To make web searching specific and fast, appropriate ontology construction plays the most important role, as the ontology serves as a starting structure for knowledge representation, and the procedure of ontology construction is one of the most critical research topics in ontology processing.
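The two problems can be made concrete with a small lookup structure. This is a minimal sketch: the `SENSES` lexicon and the concept identifiers are hypothetical, and the word "pen" is taken from the example above.

```python
# Hypothetical mini-lexicon: word -> set of concept identifiers (meanings).
SENSES = {
    "pen": {"writing_instrument", "enclosure"},
    "car": {"automobile"},
    "automobile": {"automobile"},
}

def has_multiple_meanings(word):
    """True when one word form maps to more than one concept."""
    return len(SENSES.get(word, ())) > 1

def synonyms(word):
    """Other words that share at least one concept with the given word."""
    senses = SENSES.get(word, set())
    return sorted(w for w, s in SENSES.items() if w != word and s & senses)
```

A keyword search for "pen" alone cannot tell which concept the user means, while a search for "car" misses pages that only say "automobile"; mapping words to concepts exposes both situations.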

V. Proposed Work
As context-based searching is still not prevalent, the main emphasis of our research is on that domain. Many popular search engines display all the information matching the user's query without filtering anything. They also display what the user does not require, and the results of a single search can run into lakhs (hundreds of thousands). Our goal is to make the user's search more concise by displaying only the information that the user requires and discarding all that is irrelevant, so that the results displayed are focused results.
We divide our algorithm into three steps:
Step 1: We construct the ontology from the web repository.
Step 2: We integrate this ontology with the semantic nets so that a focused document group can be created.
Step 3: We accept the keywords to be searched, make the search more concise by pruning the unwanted data, and display the results along with their related context with the help of a topic map that uses the ontology designed by us.
Our proposed work gives a twofold benefit: first, only focused results are retrieved, which reduces the number of results extracted; second, because of focused searching, irrelevant results are pruned, which reduces the search time.

A. Ontology Construction
Since the main objective of our research is to optimize searching, we change the way the user's search keywords are processed. Instead of searching the whole Web, our algorithm searches the ontology built by us, which is updated periodically. So before the actual web searching starts, we need a web repository for developing the ontology (structured knowledge about English words) in parallel.

Figure 2: Ontology construction
For building the ontology we use XML (Extensible Markup Language), a platform-independent, plain-text data description language. We chose XML because it can be easily integrated with any web development language and is very easy to use. To build a dynamic XML file that can be updated automatically, we used the C#.NET language provided by Microsoft.
The advantage of this ontology is that, once built, it can be used by any search engine to improve the quality of its results. Figure 1 shows the ontology commit, the last step of ontology construction.
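An XML ontology of this kind could be generated programmatically along the following lines. The sketch uses Python's standard `xml.etree.ElementTree` module rather than C#.NET, and the element names (`ontology`, `word`, `context`, `related`) are hypothetical; the actual schema is not given in this paper.

```python
import xml.etree.ElementTree as ET

def build_ontology(entries):
    """Build an XML tree: one <word> per keyword, one <context> per meaning."""
    root = ET.Element("ontology")
    for word, contexts in entries.items():
        word_el = ET.SubElement(root, "word", name=word)
        for context, related in contexts.items():
            ctx_el = ET.SubElement(word_el, "context", name=context)
            for topic in related:
                ET.SubElement(ctx_el, "related").text = topic
    return root

# Hypothetical input data: keyword -> context -> related topics.
entries = {
    "pen": {
        "writing instrument": ["ink", "paper"],
        "enclosure": ["livestock", "fence"],
    }
}
ontology = build_ontology(entries)
xml_text = ET.tostring(ontology, encoding="unicode")
```

Regenerating the tree from fresh `entries` and writing it out again is one simple way such a file could be kept up to date periodically.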

B. Building Topic Map
Based on the keyword entered by the user, we create a topic map using the ontology built in step one. For this we again use C#.NET, which retrieves the keyword along with its multiple contexts and related topics. These are then displayed on a web page in graphical form, making it easier for the user to extract what is desired.
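Retrieving a keyword's contexts and related topics can be sketched as a simple lookup that yields the edges of the topic map. The `ONTOLOGY` dictionary below is a hypothetical in-memory stand-in for the XML ontology, and the plain-text rendering only approximates the graphical display.

```python
# Hypothetical ontology fragment: keyword -> context -> related topics.
ONTOLOGY = {
    "pen": {
        "writing instrument": ["ink", "paper"],
        "enclosure": ["livestock", "fence"],
    }
}

def build_topic_map(keyword, ontology=ONTOLOGY):
    """Return (keyword, context, topic) edges for every context of the keyword."""
    key = keyword.lower()
    contexts = ontology.get(key, {})
    return [(key, ctx, topic)
            for ctx, topics in contexts.items()
            for topic in topics]

def render(edges):
    """Plain-text stand-in for the graphical display of the topic map."""
    return "\n".join(f"{k} -> {c} -> {t}" for k, c, t in edges)
```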

C. Pruning the Results
The main process on which our architecture relies to make searching more focused and fast is pruning of the semantic network based on the ranking of contexts given by the user. Based on this ranking, the network is pruned to display a result based on a specific topic map. To do this we need a web repository from which results can be extracted, so we have used the web repository of Google, which is considered the largest and fastest web repository. Using ASP.NET, we have customized the existing search technique to display more focused, i.e. more relevant, results.
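Pruning by user ranking can be sketched as keeping only the best-ranked contexts of a topic map. The data layout (a dictionary from context to related topics), the convention that rank 1 is best, and the `keep` parameter are assumptions for illustration.

```python
def prune(topic_map, ranking, keep=1):
    """Keep only the `keep` best-ranked contexts; unranked contexts sort last."""
    ranked = sorted(topic_map, key=lambda ctx: ranking.get(ctx, float("inf")))
    return {ctx: topic_map[ctx] for ctx in ranked[:keep]}

# Hypothetical topic map for the keyword "pen" and a user-supplied ranking.
topic_map = {
    "writing instrument": ["ink", "paper"],
    "enclosure": ["livestock", "fence"],
}
user_ranking = {"enclosure": 1, "writing instrument": 2}
pruned = prune(topic_map, user_ranking)
```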

D. Proposed Architecture
The figure shows the flow chart of our proposed architecture for focused crawling using ontology. It accepts the user query and, after query preprocessing, retrieves the context from the ontology. If multiple contexts are available, it accepts the desired context from the user and displays the relevant topic map. From the relevant topic map we retrieve the desired ontology, and on the basis of that ontology we retrieve the qualifying URLs from the web page repository. We then rank the URLs by relevance ratio and display the results.
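The flow described above can be sketched end to end as follows. Everything concrete here is an assumption for illustration: the `ONTOLOGY` and `REPOSITORY` contents are hypothetical, and since the relevance ratio is not defined in this paper, it is taken to be the fraction of the chosen context's related terms that occur in a page.

```python
# Hypothetical ontology: keyword -> context -> related terms.
ONTOLOGY = {
    "pen": {
        "writing instrument": ["ink", "paper", "nib"],
        "enclosure": ["livestock", "fence", "sheep"],
    }
}
# Hypothetical web page repository: URL -> page text.
REPOSITORY = {
    "u1": "fountain pen with ink and a steel nib for paper",
    "u2": "a sheep pen needs a strong fence",
    "u3": "pen and ink drawings",
}

def preprocess(query):
    """Query preprocessing: trim whitespace and lowercase."""
    return query.strip().lower()

def relevance_ratio(text, terms):
    """Assumed definition: share of context terms that occur in the page."""
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms)

def search(query, choose_context):
    keyword = preprocess(query)
    contexts = ONTOLOGY.get(keyword, {})
    if not contexts:
        return []
    # Ask the user only when the keyword has more than one context.
    if len(contexts) > 1:
        context = choose_context(list(contexts))
    else:
        context = next(iter(contexts))
    terms = contexts[context]
    scored = [(relevance_ratio(text, terms), url)
              for url, text in REPOSITORY.items()
              if keyword in text.lower().split()]
    # Rank by descending relevance ratio; drop pages with no related term.
    return [url for score, url in sorted(scored, key=lambda p: (-p[0], p[1]))
            if score > 0]
```

For the query "pen" with the context "writing instrument" selected, the page about a sheep pen is pruned even though it contains the keyword.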

VI. Conclusion
Our proposed model helps provide a solution to two of the most critical problems of information retrieval, synonymy and polysemy. This study proposes a systematic methodology, called DocOnto (Document-based Ontology), to develop the ontology in a bottom-up style from engineering documents. Our methodology is composed of three main phases: defining the ontology, integrating the ontology with semantic networks, and pruning the ontology for practical use. This ontology can be updated and generalized through a much easier and less time-consuming process, and it holds a specific definition of each word in the form of attributes.
It reduces the number of results extracted. Through focused searching, irrelevant results are pruned, which reduces the search time. The multiple contexts of a keyword and their related topics are displayed on a web page in graphical form, making it easier for the user to extract what is desired. The advantage of our ontology is that, once built, it can be used by any search engine. It thus improves search performance in terms of precision and relevance.