Introduction To Web Search Engine

In 2008, Google reported that it had discovered more than 1 trillion unique Uniform Resource Locators on the Web. And, as previous research has shown, neither Google nor any other search engine comes close to discovering all the available content on the Web. The Web is a very large place, and finding information on it without a web search engine is a difficult task, which is why web search engines are so useful.
Regular Web Search: The most popular type of search, which searches an index built from web pages of all types. Other online textual resources, such as PDFs and Microsoft Office formats, are also available through regular web search.
News Search: It searches only news-related websites. Typically the search results are ordered based on age of the story.
Image Search: It searches only images that were discovered during the web crawling. Images are normally indexed by using the image's filename and text surrounding the image. Artificial intelligence techniques for trying to discover what is actually pictured in the image are slowly emerging.
Video Search: It searches the text accompanying videos on the Web. Like image search, it depends heavily on people to provide text that accurately describes the video.
Search starts with the web, which is made up of a huge number of individual pages and is constantly growing. Google navigates the web by crawling, that is, by following links from page A to page B. Site owners generally want their sites to be crawled. Pages are then sorted by content and other factors, and the index keeps track of it all. When you search for a keyword, algorithms go to work looking for clues to better understand what you mean, as follows:
Spelling: Identifies and corrects possible spelling errors and provides alternatives.
Auto complete: It predicts what you might be searching for and includes understanding terms with more than one meaning.
Google Instant: It displays immediate results as you type.
Query Understanding: It gets to the deeper meaning of the words you type.
Synonyms: It recognizes words with similar meanings.
Based on these clues, the relevant documents are pulled from the index. The web pages in the search results are then ranked by algorithms that use factors such as freshness and site and page quality. These algorithms are constantly changing. The changes start as ideas in the minds of engineers, who implement them, analyze the results, tweak them, and run them again and again. The most popular algorithm for ranking web pages is the PageRank algorithm, which is explained in the next topic.

Overview of PageRank Algorithm [1]
PageRank was proposed by Sergey Brin and Larry Page, students at Stanford University and the founders of Google. According to Google, PageRank considers the number and quality of links to a web page to estimate the importance of the website. The underlying assumption is that more important websites are likely to receive more links from other websites.
PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element A is referred to as the PageRank of A and denoted by PR(A). PageRank results from a mathematical algorithm based on the webgraph, which is formed by taking all World Wide Web pages as nodes and hyperlinks as edges. The rank value indicates the importance of a particular page; a hyperlink to a page is counted as a vote of support. The PageRank of a web page is defined recursively and depends on the number and PageRank values of all pages that link to it. A page that is linked to by many pages with high PageRank receives a high rank itself.
A web page is important if it is pointed to by other important web pages. Google calculates a page's importance from the votes cast for it, and the importance of each voting page is taken into account when a page's PageRank is calculated. PageRank is Google's way of deciding the importance of a web page. It matters because it is one of the factors that determine a page's position in the search results. PageRank is commonly abbreviated 'PR'.
Objective
The main objective of this research is to provide an efficient algorithm for sorting the web pages generated as search results during web search. This gives users more efficient and relevant results for their queries. Our main purpose behind this work is that a user should not need to check thousands of search results to find the desired information. The proposed algorithm ranks pages according to the user's query: the web page whose content best matches the query receives the highest PageRank value, and the pages are sorted and displayed accordingly. The algorithm therefore saves the user's time and gives better results in a shorter duration.
Motivation
As the World Wide Web develops rapidly, the Internet has become the world's richest and densest source of information. Users face the problem of how to get relevant and useful information from this large amount of disordered information [2]. Search engines have become an important tool for users to retrieve information for a particular query. However, current search engines cannot fully satisfy the user's need for high-quality information search services, which raises many new challenges for information retrieval. This motivated me to enhance the PageRank calculation.

CHAPTER 2
Theoretical Background & Literature Review

2.1 Theoretical Background
2.1.1 Pagerank Algorithm [1]
Formula for PageRank calculation:
PR(A) = (1-d)/N + d(PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)) [1]
Where,
PR(A) - PageRank of page A,
N - total number of pages considered,
d - damping factor,
L(B) - total number of outbound links of page B (similarly for C and D).
So any page's PageRank is derived in large part from the PageRanks of the pages that link to it, and the damping factor scales this derived value down. The following formula is also sometimes used, which has led to some confusion:
PR(A) = 1 - d + d(PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)) [1]
The difference between the two formulas is that with the first one the PageRank values sum to one, whereas with the second each PageRank value is effectively multiplied by N, so the values sum to N. Page and Brin's paper states that the sum of all PageRank values is one, which corresponds to the first formula.
Step 1: At t = 0, an initial probability distribution is assumed, usually
PR(p_i, 0) = 1/N
Step 2: At each time step, the computation, as detailed above, yields
PR(p_i, t+1) = (1-d)/N + d * Σ_{p_j ∈ M(p_i)} PR(p_j, t)/L(p_j)
where p_1, p_2, ..., p_N are the pages under consideration,
M(p_i) is the set of pages that link to p_i,
L(p_j) is the number of outbound links on page p_j, and N is the total number of pages.
The calculations do not work if they are performed just once; accurate values are obtained through many iterations.
Suppose we have two pages, A and B, which link to each other, and neither has any other links of any kind. The PageRank of A depends on the PageRank value of B, and the PageRank of B depends on the PageRank value of A. We cannot work out A's PageRank until we know B's PageRank, and we cannot work out B's PageRank until we know A's PageRank. Performing more iterations brings the values to a stage where they no longer change. Therefore multiple iterations are necessary when calculating PageRank.
Simple Example for calculating Pagerank:
Consider the following figure:

d=0.85
The number of outlinks from A and from B is the same, i.e. 1.
PR(A) = (1-d)/n + d(PR(B)/1)
PR(B) = (1-d)/n + d(PR(A)/1)
i.e.
PR(A) = 0.15/2 + 0.85(0.5/1) = 0.5
PR(B) = 0.15/2 + 0.85(0.5/1) = 0.5
Why do we need the damping factor?
Assume the following scenario, where A, B, C and D are web pages linked as shown in the graph:
Figure 1 Inlinks and Outlinks for web pages A, B, C and D

After the calculation, the ranks of pages A, B and C look fine, but how does D get a rank? It has no backlinks.

Figure 2 PageRank values for web pages A, B, C and D

The answer lies in the formula.
PR(D)= 0.15/4+0.85(0)= 0.038
So every page has at least a PR of 0.15/n to share out.
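To make the iterative nature of the calculation concrete, the following short Java sketch (illustrative only, not taken from [1]) repeats the PageRank update for a small link graph until the values stop changing; the class name, graph representation and convergence tolerance are assumptions chosen for the two-page example above.

import java.util.*;

public class SimplePageRank {

    // One full PageRank computation: PR(p) = (1-d)/N + d * sum over inlinks q of PR(q)/L(q),
    // repeated until no value changes by more than the tolerance.
    public static Map<String, Double> compute(Map<String, List<String>> outLinks,
                                              double d, int maxIter, double tol) {
        int n = outLinks.size();
        Map<String, Double> pr = new HashMap<>();
        for (String page : outLinks.keySet()) {
            pr.put(page, 1.0 / n);                       // initial distribution: 1/N for every page
        }
        for (int iter = 0; iter < maxIter; iter++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : outLinks.keySet()) {
                next.put(page, (1 - d) / n);             // the (1-d)/N base share
            }
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) continue;         // a page with no outlinks contributes nothing here
                double share = d * pr.get(e.getKey()) / targets.size();
                for (String t : targets) {
                    next.merge(t, share, Double::sum);   // distribute the page's rank over its outlinks
                }
            }
            double maxDelta = 0;
            for (String page : pr.keySet()) {
                maxDelta = Math.max(maxDelta, Math.abs(next.get(page) - pr.get(page)));
            }
            pr = next;
            if (maxDelta < tol) break;                   // values have stopped changing
        }
        return pr;
    }

    public static void main(String[] args) {
        // The two-page example above: A and B link only to each other.
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", List.of("B"));
        links.put("B", List.of("A"));
        System.out.println(compute(links, 0.85, 100, 1e-9));   // both pages converge to 0.5
    }
}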

2.2 Literature Review
In this context we discuss some papers that introduce and analyze various ranking algorithms. Some papers analyze the PageRank algorithm and its limitations and propose improved versions that give better results. Some papers present ranking algorithms that use web structure mining, categorized by the input parameters they use: some use the weights of inlinks, some the weights of outlinks, some both, and some also use the number of user clicks when calculating PageRank values. We then show a comparison among the various ranking algorithms along with their strengths and weaknesses. We also discuss some content-based algorithms that use web content mining.
In the next section we discuss those research papers.

2.2.1 RatioRank: Enhancing the impact of Inlinks and Outlinks [3]
Author Name: Ranveer Singh and Dilip Kumar Sharma
Published Year: 2012
Topic Addressed:
In this paper various ranking algorithms are analyzed and explained; different algorithms use different parameters. Some use inlink weights, some use outlink weights, some use the number of visits of the links of a webpage, and some use both inlink and outlink weights. They are compared to identify their advantages and disadvantages. A new algorithm called RatioRank is then presented, in which the inlink and outlink weights are used together with the visit count, and it is compared with some existing algorithms on certain parameters.
Table 1 Comparison of various ranking algorithms
Algorithm | Technique | Strength | Limitations
Page Ranking Algorithm | Based on the web graph of the webpages; considers the inbound links to rank the pages | Ranking is done on the basis of the importance of the pages | Ranking is done at indexing time
HITS Algorithm | Hub and authority scores are used to rank the web pages | Relevancy of the pages is high | Problem of theme drift; all links are considered of the same importance
Weighted Page Ranking Algorithm | Page ranking is done on the basis of the popularity of inlinks and outlinks | Higher relevancy because the popularity of the links is considered | Theme drift
Weighted Page Ranking (VOL) | Ranks on the basis of the visit count of inlinks and the popularity of the inlinks | User input is considered, hence relevancy is higher | Theme drift and surfer jamming
Page Ranking Based on Visits of Links of Pages | Extends the standard page ranking by considering the visit count of the links of webpages | User input is considered, hence the relevancy of the pages is higher | Theme drift
Improved Page Ranking Algorithm | Replaces the rank of pages by the mean of the rank values of all pages and repeats until the values of two consecutive rounds are equal | Reduces the computational complexity | Relevancy is low; theme drift

The above table compares various ranking algorithms; from it we can observe their strengths and limitations.
New RatioRank Method:
In RatioRank, both the inlinks and the outlinks are considered for computing the page ranking, along with the number of times the user visits each particular link.
Figure 3 Shows inlinks and Outlinks between pages A, B and C

As shown in the figure, page B has one inlink and one outlink, so the number of times the inlink is visited is the quantity of interest, together with the outlink from B. The importance or popularity of page C, which B links out to, also votes in the ranking computation.
The equation for the proposed algorithm is as follows:
RR(u) = (1-d) + d * Σ_{v ∈ B(u)} (V_u * x * W^in_(v,u) + y * W^out_(v,u)) * RR(v)/TL(v)

Where RR(u) and RR(v) are the rankings of the webpages u and v respectively,
W^in_(v,u) and W^out_(v,u) record the popularity of the inlinks and the outlinks and are calculated as in the Weighted Page Ranking algorithm,
d is the damping factor,
V_u is the number of visits of the link that points from v to u,
TL(v) is the total number of visits of all links present on v,
B(u) is the set of pages that point to webpage u, x is the ratio given to the inlink weights and y is the ratio given to the outlink weights.
The values of x and y are set between 0 and 1 on the basis of empirical results, with the inlink ratio kept higher than the outlink ratio because inlinks are considered more important for ranking webpages.

Algorithm: how RatioRank actually works:
Step 1: Initially assign 1 as the rank of all the retrieved webpages.
Step 2: In the next round compute the rank again, using the initialization made in step 1 and the ranking formula given in the equation above.
Step 3: Assign the final rank values to the webpages as obtained after step 2 and sort the pages accordingly. Since the inlinks of a webpage contribute more to its popularity than its outlinks, more weightage is assigned to the inlink weights than to the outlink weights.
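The following is a minimal Java sketch of one RatioRank update according to the equation above. It is an illustration under assumed data structures: the InLink class, its field names and the sample numbers are invented for the example and are not from the paper.

import java.util.*;

public class RatioRank {

    // One page v that links to u, together with the quantities the RatioRank equation needs.
    static class InLink {
        final double rankOfV;       // RR(v) from the previous iteration
        final double visitsVtoU;    // V_u: visits of the link v -> u
        final double totalVisitsV;  // TL(v): total visits of all links on v
        final double wIn, wOut;     // W^in_(v,u) and W^out_(v,u)
        InLink(double rankOfV, double visitsVtoU, double totalVisitsV, double wIn, double wOut) {
            this.rankOfV = rankOfV; this.visitsVtoU = visitsVtoU;
            this.totalVisitsV = totalVisitsV; this.wIn = wIn; this.wOut = wOut;
        }
    }

    // RR(u) = (1-d) + d * sum over v in B(u) of (V_u * x * W^in + y * W^out) * RR(v) / TL(v)
    static double rank(List<InLink> inLinksOfU, double d, double x, double y) {
        double sum = 0;
        for (InLink v : inLinksOfU) {
            sum += (v.visitsVtoU * x * v.wIn + y * v.wOut) * v.rankOfV / v.totalVisitsV;
        }
        return (1 - d) + d * sum;
    }

    public static void main(String[] args) {
        // Hypothetical page u with two inlinking pages; all ranks start at 1 as in step 1.
        List<InLink> b = List.of(
                new InLink(1.0, 4, 10, 0.6, 0.4),
                new InLink(1.0, 2, 5, 0.5, 0.5));
        // x and y lie between 0 and 1, with the inlink ratio x kept higher than y.
        System.out.println(rank(b, 0.85, 0.7, 0.3));
    }
}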
The paper also presents a parametric comparison of the algorithms, as follows.

Table 2 Parametric Comparison of Various Page Ranking Algorithms
Algorithm | Description | Mining Technique Used | I/P Parameters | Relevancy of the Pages
RatioRank (Proposed Algorithm) | Static computation of page rank using the inlinks and outlinks, with the number of VOL as user input | Web Structure Mining | Inlinks, outlinks and the number of VOL from the user | More than the Weighted Page Ranking algorithm
Page Ranking Algorithm | Computes scores at indexing time | Web Structure Mining | Inbound links of the pages | Medium
HITS Algorithm | Computes the hub and authority scores of the relevant pages | Web Structure Mining, Web Content Mining | Content, back and forward links | Less than PR
Weighted Page Ranking Algorithm | Static computation of page ranks according to the observed importance level of a page | Web Structure Mining | Backlinks, forward links | Higher than PR

Conclusion:
The proposed algorithm considers the inlinks, the outlinks and the number of times a user visits the inlinks of a webpage. The algorithm should give better results in terms of the relevance of web pages, because no page ranking algorithm other than RatioRank uses all of these features together: the inlinks and outlinks of a webpage and the visit-of-link count.

2.2.2 Enhanced-RatioRank: Enhancing Impact of Inlinks and Outlinks [4]
Author Name: Ranveer Singh and Dilip Kumar Sharma
Published Year: 2013
Topic Addressed:
In this paper a few link-based page ranking algorithms are reviewed and a new page ranking algorithm named Enhanced-RatioRank is proposed, in which the inlinks, the outlinks and the number of times users visit the links of the webpages are used. The relevancy of the returned web pages is higher because user behaviour is also considered when ranking. The paper also considers the ratio of the weight of inlinks to the weight of outlinks and checks which ratio gives the best result, i.e. which ratio yields the best relevancy of the web pages. The problems of theme drift (some links may not give data about the query) and surfer jamming are the major limitations of the basic page ranking algorithm; the HITS algorithm considers all links to be of the same importance, which is its major limitation, and the problem of topic drift also exists in it. In the Weighted PageRank algorithm [5], theme drift and surfer jamming are counted as limitations. The limitation of the weighted page ranking algorithm based on visits of links is that it uses only the inlinks and assigns similar rankings to non-visited webpages. The new enhanced algorithm is proposed to remove some of these limitations, as follows:

Enhanced RatioRank Algorithm
In Enhanced-RatioRank, both the inlinks and the outlinks are considered for computing the page ranking, along with the number of times the user visits each particular link.

Figure 4 Shows inlinks and outlinks in web pages X, Y, Z and V

As shown in the figure, the links from pages V, X, Y and Z have different visit counts, so the number of times the inlinks are visited reflects user interest.
The mathematical equations for the inlink and outlink weights are as follows:
W^in_(v,u) = I_u / Σ_{p ∈ R(v)} I_p    ...(1)
W^out_(v,u) = O_u / Σ_{p ∈ R(v)} O_p    ...(2)
where I_u and O_u are the numbers of inlinks and outlinks of page u, and R(v) is the reference page list of page v.
The equation for the proposed algorithm is as follows:

RR(u) = (1-d) + d * Σ_{v ∈ B(u)} (V_u * 0.7 * W^in_(v,u) + 0.3 * W^out_(v,u)) * RR(v)/TL(v)    ...(3)

Where RR(u) and RR(v) are the rankings of the webpages u and v respectively,
d is the damping factor,
V_u is the number of visits of the link that points from v to u,
TL(v) is the total number of visits of all links present on v,
B(u) is the set of pages that point to webpage u.
As equation (3) shows, only 70 percent of the inlink weight and 30 percent of the outlink weight is used; compared with other ratios, this ratio gives the better result.

Algorithm: how Enhanced-RatioRank actually works:
Step 1: Take the link structure of the retrieved webpages from the crawler [6].
Step 2: Obtain the webgraph from the link structure of the retrieved webpages.
Step 3: Assign 1 as the initial ranking to all the webpages.
Step 4: Calculate the weights of the inlinks and outlinks using equations (1) and (2).
Step 5: Apply the proposed algorithm as in equation (3).
Step 6: Repeat the process iteratively until the ranks of all webpages are stable, i.e. the same in two consecutive iterations.
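A small sketch of how the inlink and outlink weights from equations (1) and (2) might be computed from raw link counts. Passing the reference pages of v as plain lists of counts, and the sample numbers in main, are assumptions made only for illustration.

import java.util.*;

public class LinkWeights {

    // W^in_(v,u) = I_u / (sum of I_p over the reference pages p of v), equation (1)
    static double weightIn(int inlinksOfU, List<Integer> inlinksOfReferencePages) {
        int total = inlinksOfReferencePages.stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0 : (double) inlinksOfU / total;
    }

    // W^out_(v,u) = O_u / (sum of O_p over the reference pages p of v), equation (2)
    static double weightOut(int outlinksOfU, List<Integer> outlinksOfReferencePages) {
        int total = outlinksOfReferencePages.stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0 : (double) outlinksOfU / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts: u has 3 inlinks and 2 outlinks; the reference pages of v
        // have 3 + 5 + 2 inlinks and 2 + 4 + 4 outlinks in total.
        System.out.println(weightIn(3, List.of(3, 5, 2)));    // 0.3
        System.out.println(weightOut(2, List.of(2, 4, 4)));   // 0.2
    }
}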

Disadvantage: the problem of theme drift still exists in this algorithm.
Conclusion:
Using all three parameters to compute the PageRank, and taking a particular ratio of the inlink and outlink weights, gives better relevancy of the web pages.

2.2.3 Improved PageRank Algorithm based on feedback of userclicks [7]
Author Name: Zhou Cailan and Chen Kai, Li Shasha
Published Year: 2011
Topic Addressed:
This paper analyzes the PageRank algorithm, identifies its shortcomings and puts forward an improved PageRank algorithm. Based on user click behaviour, the improved algorithm collects the clicks on result pages and the most recent click time of result pages, sets the weight of clicks and of click time as parameters in the PageRank formula, and verifies through experiments that it improves the precision ratio of search engines.

The paper analyzes some limitations of PageRank:
PageRank emphasizes old pages: the number of links pointing to a page plays a decisive role in its PR value. An old page has existed longer, so it can accumulate links and improve its PR value more easily than a new page. But most users want to search for content that may well be in the newest pages, so PR values do not necessarily reflect the real importance of pages.
Topic drift problems: the PageRank algorithm does not depend on the keywords or on user feedback, and it cannot judge whether a hyperlink is related to the subject of the page or to the user's inquiry, so the problem of topic drift arises easily. Some large websites have large numbers of hyperlinks and get a higher PR value without necessarily being relevant to the subject of the user's search.
Page cheating: some rogue websites deliberately add many popular keywords that are unrelated to the content into the title or body of the page to cheat search engines. The PageRank algorithm does not examine the content of pages, so if such a website maintains a certain number of hyperlinks it will be ranked at the forefront of the search results.

New Improved Method:
This paper uses the feedback information of users to optimize the PR formula and adjust the PR value of a page appropriately. In order to reduce the probability of topic drift and page cheating, weight parameters derived from the collected click information are added to the PR value.
S(i) = α ln(N + 1) + β
Where,
S(i) = clicks weight
N = clicks on page i in a certain period of time
α = attenuation coefficient, controls the weight of clicks, generally set to 0.3
β = compensation coefficient, reflects the importance attached to a new page, set to 1

The weight of click time is set as follows:
T(i) = 1 if T ≤ 1, and T(i) = 1 + γT if T > 1,
in which T = T_now − T_last if T_last ≠ NULL, and T = T_now − T_update if T_last = NULL
Where,
T = time interval for page i
T_now = time of the user's real-time search
T_last = last click time of webpage i
T_update = modification time of page i
γ = attenuation coefficient, controls the weight of the time interval, usually set to 1/12

The new formula proposed by them:
PR(i) = PR(i) × S(i) × T(i)
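The clicks weight and time weight above can be sketched in Java as follows. This is an illustrative reading of the formulas, not the authors' code; in particular, the time unit of the interval is not specified in the paper, so the units in the example are assumed.

public class ClickFeedbackRank {

    // Coefficients as described above: alpha = 0.3, beta = 1, gamma = 1/12.
    static final double ALPHA = 0.3, BETA = 1.0, GAMMA = 1.0 / 12.0;

    // Clicks weight: S(i) = alpha * ln(N + 1) + beta, where N is the number of clicks on page i.
    static double clicksWeight(long clicks) {
        return ALPHA * Math.log(clicks + 1) + BETA;
    }

    // Time weight: T(i) = 1 if the interval T <= 1, otherwise 1 + gamma * T.
    static double timeWeight(double interval) {
        return interval <= 1 ? 1.0 : 1.0 + GAMMA * interval;
    }

    // Adjusted rank: PR(i) scaled by the clicks weight and the time weight.
    // The interval uses the last click time when one exists, otherwise the page update time.
    static double adjustedRank(double pageRank, long clicks, Double lastClick,
                               double now, double lastUpdate) {
        double interval = (lastClick != null) ? now - lastClick : now - lastUpdate;
        return pageRank * clicksWeight(clicks) * timeWeight(interval);
    }

    public static void main(String[] args) {
        // Hypothetical page: PR 0.4, 50 clicks, last clicked 6 time units before the search.
        System.out.println(adjustedRank(0.4, 50, 10.0, 16.0, 0.0));
    }
}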

They carried out an experimental analysis. Using the open source search engine Nutch, they captured and analyzed pages from http://news.qq.com and obtained 181,348 valid pages. The standard PageRank algorithm and the improved PageRank algorithm were deployed on two identical computers, the users' search click logs were set up, and the relevant click information was recorded. The experiments show that the improved PageRank algorithm improves the accuracy of the search results to some extent and gives good results.
Conclusion:
This paper starts from the user's point of view: based on user search click feedback, it adds the clicks weight and the click time weight as two parameters to the classic PageRank algorithm. Experimental analysis and comparison show that the improved PageRank algorithm can properly adjust a page's PR value and produce ranking results that better satisfy the user.
2.2.4 Web Page Indexing through Page Ranking for Effective Semantic Search [8]
Author Name: Robin Sharma, Ankita Kandpal, Priyanka Bhakuni, Rashmi Chauhan, R.H. Goudar and Asit Tyagi
Published Year: 2013
Topic Addressed:
The realization of the Semantic Web is based on the availability of a significant mass of metadata for web content, associated with suitable information about the world. Searching the Internet's large body of information for a small, specific piece of information exposes many deviations in present searching techniques, such as imprecision, large irrelevant result sets, and the inability to interpret the user's query. The authors propose an architecture for semantic information retrieval to enhance the relevancy of search results. An algorithm is proposed to calculate the PageRank of documents, after which semantic indexing [13] is performed over the ranked web pages. The algorithm considers two factors when computing the page rank: one is the frequency of the keyword occurring in the web page; the other is an associative factor of the same keyword with meaningful interrogative words, which are generally ignored by existing search schemes.
Rank of web page is calculated as follows:
Rank [WPj, Keywords[K], Qp]=(11 - Disp)+(Freqk / Wordsj)
Where,
WPj = jth Web Page
Keywords[K] = kth main keyword of query
Qp = pth interrogative word
Disp = Distance between main keyword and pth interrogative word
Freqk = frequency of kth keyword

Conclusion:
The proposed architecture of information retrieval presents a vision of semantic search. To provide the most relevant results to the user, content is matched to the user's query and relevancy is taken into consideration. To improve performance, a semantic index over the ranked web pages is created. To compute the page rank, an algorithm is proposed that considers the associative factor with the interrogative words together with the frequency factor of the keywords, so as to provide relevant and precise results. On the basis of the experiments done, creating semantic indexes from the calculated ranks is reasonable in either hierarchical databases such as RDF (Resource Description Framework) stores or relational databases.
2.2.5 PageRank optimization applied to spam detection [9]
Author Name: Olivier Fercoq
Published Year: 2012
Topic Addressed:
In this paper a new link spam detection and PageRank demotion algorithm called MaxRank is introduced. Like TrustRank and Anti-TrustRank, it starts with a seed of hand-picked trusted and spam pages. The MaxRank of a web page is the frequency of visits to that page by a random surfer minimizing an average cost per time unit. On a given page, the random surfer selects a set of hyperlinks and clicks with uniform probability on any of them. The cost function is used to penalize spam pages and to remove hyperlinks; the goal is to determine which hyperlinks have minimum score. MaxRank is interpreted as a modified PageRank vector and is used to sort web pages instead of the usual PageRank vector. The algorithm uses TrustRank and Anti-TrustRank for non-spam and spam page detection respectively.
P = αS + (1 − α) e z
Where,
P = transition matrix
e = column vector with all entries equal to 1
α = damping factor
S = hyperlink matrix (entries 0 or 1)
z = zapping or teleportation vector (a row vector)
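The equation P = αS + (1 − α)ez is the usual damped (teleportation-adjusted) transition-matrix construction; a small Java sketch of it is given below. The two-page example and the uniform teleportation vector are assumptions chosen for illustration and are not taken from the MaxRank paper.

public class DampedTransitionMatrix {

    // P = alpha * S + (1 - alpha) * e * z, where S is an n x n stochastic matrix, z is a
    // teleportation row vector and e is the all-ones column vector, so (e*z)[i][j] = z[j].
    static double[][] build(double[][] s, double[] z, double alpha) {
        int n = s.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                p[i][j] = alpha * s[i][j] + (1 - alpha) * z[j];
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Two pages that link only to each other, with uniform teleportation.
        double[][] s = { {0, 1}, {1, 0} };
        double[] z = { 0.5, 0.5 };
        double[][] p = build(s, z, 0.85);
        System.out.println(p[0][0] + " " + p[0][1]);   // 0.075 and 0.925
    }
}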
Conclusion:
As in TrustRank [10] and Anti-TrustRank [11], the algorithm starts with a seed of trusted pages and known spam pages. The basic idea is to minimize the PageRank scores of spam pages and maximize the PageRank scores of trusted web pages. Hyperlinks that point to web pages with low PageRank values are removed; hence the set of known spam pages receives fewer inlinks and ends up with a lower PageRank.


CHAPTER 3
Problem Statement

3.1 Motivation
The World-Wide Web was invented by Tim Berners-Lee circa 1991. By the late 1990s, the amount of content on the Web had exploded, but searching this content was becoming an increasingly pressing problem. Human-generated indices (e.g. Yahoo!) could not keep up, and automatically generated indices (e.g. AltaVista) were sometimes of low quality. The PageRank algorithm essentially solved the web search problem and was the basis for Google's success. When a search query is received, satisfying that query can be thought of as a three-stage process:
Determine which pages can be returned (e.g. which ones contain the query string).
Rank these pages in order of importance. PageRank solves this stage.
Format the ranked list and display it to the user.
It is therefore very important to provide efficient and relevant search results in response to a particular query. The PageRank algorithm is the basic method for sorting the search results, but it has some limitations, and many enhancements have been made so far to obtain more advanced algorithms. Even so, a returned web page sometimes does not contain the content we want. One reason for irrelevant search results arises mainly when the search is performed using keywords stored in the meta-tags of web pages. Today the Internet has become the main source for retrieving information, and finding relevant information on a particular topic within the huge amount of data present on the Internet requires an efficient searching technique. The traditional keyword-based search techniques adopted by many web search engines have advantages: they are simple, quick and easy to implement. On the other hand they have shortcomings: low precision, large result sets that are time consuming for the user to sift through, and an inability to judge the meaning of the user's query. Many web designers add keywords to meta-tags irrespective of the content of the web pages in order to boost the ranking, so that the link to the web page is shown at the top of the search results.

3.2 Analysis of Gap
Many improvements to the PageRank algorithm have been made so far.
Some algorithms use a content-based information retrieval (web content) [14] approach to calculate the PageRank value, and some use link-based (web structure) algorithms to calculate it.
The content-based information retrieval approach does not consider the pages linked by a page while ranking it, which affects the quality of the ranked web documents.
The link-based approach has the limitation of topic drift.
3.3 Problem Definition
'Web Page Indexing through Page Ranking for Effective Semantic Search' [8] is a content-based algorithm in which main keywords, interrogative words and web pages are taken as the input parameters for calculating the PageRank value. It gives better accuracy in terms of relevancy of content than the standard PageRank algorithm, but some room for improvement, for example regarding the topic theme, still exists. Our work therefore aims to reduce the topic-theme problem in these algorithms. For this work I use web content mining for ranking: the content relevance of web pages to the query is calculated, and two new parameters, TrustRank [10] and Anti-TrustRank [11], are introduced. The TrustRank of a web page is increased when the keywords are found in both the metadata and the content of the web page. The Anti-TrustRank of a web page is increased when the keywords are found in the metadata but not in the content of the web page.

Rank of web page is calculated as follows:
Rank [WPj, Keywords[K], Qp]= (11 - Disp) * (Freqk / Wordsj)
Where,
WPj = jth Web Page
Keywords[K] = kth keyword of the query
Qp = pth interrogative word
Disp = distance between the kth keyword and the pth interrogative word
Freqk = frequency of kth keyword

PageRank [WPj] = PageRank [WPj] + TrustRank[WPj] − AntiTrustRank[WPj]
Where,
TrustRank[WPj] = TrustRank of the jth web page
AntiTrustRank[WPj] = Anti-TrustRank of the jth web page

CHAPTER 4
Proposed Method
4.1 Flow chart of the Proposed Algorithm

Figure 5 Flow chart of proposed algorithm


4.2 Proposed Algorithm
Input: Keywords[], WP[], Q[] , AntiTrustRank, TrustRank
Output: PageRank [WPj]
Algorithm:
for each WPj of set WP do
    for each Keywords[k] of set Keywords do
        Freqk = 0
        for each word WPj[i] of page WPj do
            increase Wordsj by 1
            if (Keywords[k] = WPj[i]) then
                increase Freqk by 1
            for each Qp of set Q do
                Disp = infinity
                for (l = i-5 to i+5) do
                    if (Qp = WPj[l]) then
                        tempp = | i - l |
                        Disp = minimum(Disp, tempp)
                end for
                for each Disp do
                    find closep (closeness = 6 - Disp)
                end for
            end for
        end for
        for each closep do
            if (closep != 0) then
                Rank [WPj, Keywords[k], Qp] = closep * (Freqk / Wordsj)
        end for
    end for
    if (isSpamPage(WPj)) then increase AntiTrustRank[WPj]
    else increase TrustRank[WPj]
    PageRank [WPj] = calculateFinalPageRank()
end for
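For readers who prefer working code, here is a compact Java sketch of the scoring loop described by the pseudocode above. It is an illustrative interpretation, not the project's actual source: the page is represented as an array of lowercase words, the TrustRank and Anti-TrustRank values live in in-memory maps instead of the database, and the per-(K, Qp) rank bookkeeping is collapsed into a single page score.

import java.util.*;

public class ProposedPageRank {

    static final int WINDOW = 5;   // Q keywords are searched between indexes i-5 and i+5

    // Scores one page (an array of words) against the main keywords and the Q words,
    // updates the in-memory TrustRank / Anti-TrustRank maps, and returns the final value.
    static double scorePage(String[] words, List<String> keywords, List<String> qWords,
                            Map<String, Double> trustRank, Map<String, Double> antiTrustRank,
                            String pageName) {
        double score = 0;
        boolean keywordFound = false;

        for (String k : keywords) {
            int freq = 0;
            double bestCloseness = 0;
            for (int i = 0; i < words.length; i++) {
                if (!words[i].equalsIgnoreCase(k)) continue;
                freq++;
                keywordFound = true;
                // look for Q words near this occurrence of k and keep the smallest distance
                for (String q : qWords) {
                    int minDist = Integer.MAX_VALUE;
                    int from = Math.max(0, i - WINDOW), to = Math.min(words.length - 1, i + WINDOW);
                    for (int l = from; l <= to; l++) {
                        if (words[l].equalsIgnoreCase(q)) minDist = Math.min(minDist, Math.abs(i - l));
                    }
                    if (minDist != Integer.MAX_VALUE) {
                        bestCloseness = Math.max(bestCloseness, (WINDOW + 1) - minDist);  // closeness = 6 - distance
                    }
                }
            }
            if (freq > 0 && bestCloseness > 0) {
                score += bestCloseness * ((double) freq / words.length);   // closeness * (Freq / Words)
            }
        }

        // Spam handling: the metadata matched the query but the content has no keyword -> spam page.
        double delta = 1.0 / words.length;
        if (!keywordFound) antiTrustRank.merge(pageName, delta, Double::sum);
        else trustRank.merge(pageName, delta, Double::sum);

        return score + trustRank.getOrDefault(pageName, 0.0)
                     - antiTrustRank.getOrDefault(pageName, 0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> trust = new HashMap<>(), antiTrust = new HashMap<>();
        String[] page = "sachin tendulkar is a cricketer who scored many runs".split(" ");
        System.out.println(scorePage(page, List.of("sachin"), List.of("tendulkar"),
                                     trust, antiTrust, "Sachin.htm"));
    }
}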

CHAPTER 5
Implementation and Results
5.1 Implementation

Figure 6 Architecture used in implementation

We have used the MVC (Model-View-Controller) architecture in our implementation: the database acts as the model, JSP pages act as the view and a servlet acts as the controller. The user's query is submitted from a JSP page to the servlet, where all the logic is implemented, and the servlet makes the database connection using JDBC. We have used Apache as the web server and Tomcat as the servlet container.
In our algorithm we have added two new parameters, TrustRank [10] and Anti-TrustRank [11]. TrustRank is a measurement of the trust of a page and Anti-TrustRank is a measurement of its anti-trust; both are stored in the database. The input parameters of the algorithm are the user's query string, the web pages, and the Anti-TrustRank and TrustRank of the web pages; the output is the PageRank of each web page. If a page contains one of the keywords of the user's query string in its metadata, that page is considered related to the user's query. If keywords are also found in the content of the web page, the page is not a spam page [15], so its TrustRank is increased inversely proportionally to the number of words in the page. If no keywords are found in the content, the page is a spam page, so its Anti-TrustRank is increased inversely proportionally to the number of words in the page. Thus each time a page is related to the user's query, either its TrustRank or its Anti-TrustRank is increased. The modified TrustRank and Anti-TrustRank values are stored back in the database, so the effect of past results is retained. The owner of a web page is expected to insert only related keywords in the metadata.
Say a keyword K is selected from the set of keywords. For K we select a web page WP on which the search is done, and we scan the whole page comparing each word with K. Whenever we find K at some index i of WP, we increment the frequency Freq for K by 1 and try to find the other keywords from Q between indexes i-5 and i+5. If a keyword from Q is found at index l, the distance |i - l| is measured. These distances are collected over the whole web page and the closeness is found for each keyword in Q. We also count the total number of words of web page WP as Words. The rank of WP for keywords K and Q is calculated by multiplying the closeness by the frequency factor Freq/Words. If no keyword K is found in webpage WP, its PageRank value is 0; the webpage is considered a spam page and its Anti-TrustRank value is increased by 1/Words. If even a single keyword is found in the web page, the page is not a spam page and its TrustRank is increased by 1/Words. After finding the rank values for all keywords K and Q, the final PageRank value is calculated by adding these values.
The minimum(Disp, tempp) function takes two integer arguments; Disp is the minimum distance between K and Q when a keyword from Q is found. For example, if a Q keyword is found twice near a single occurrence, the closest one is taken into consideration. Closeness is the reverse of distance: if the distance is high, the closeness is low, and if the distance is low, the closeness is high. So each distance is subtracted from 6, since the maximum possible distance between keyword K and Q within the window is 5.
The isSpamPage() function returns true if the web page is a spam page and false otherwise. A web page is a spam page when its metadata contains keywords related to the user's query but its content does not contain a single keyword related to the query. If the web page is a spam page its Anti-TrustRank value is increased; otherwise its TrustRank value is increased.
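For orientation only, the following stripped-down controller servlet sketches the JSP-to-servlet-to-JDBC flow described above. The URL pattern, database URL, credentials, table and column names are placeholders and not the actual project code; the ranking step itself is elided.

import java.io.IOException;
import java.sql.*;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.*;

// Controller in the MVC setup: the JSP posts the query here, the servlet runs the ranking
// logic and reads/writes the TrustRank and Anti-TrustRank values through JDBC.
@WebServlet("/search")
public class SearchController extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String query = req.getParameter("q");   // user's query string submitted from the JSP form

        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/searchdb", "user", "password");      // placeholder connection
             PreparedStatement ps = con.prepareStatement(
                 "SELECT page_name, trust_rank, anti_trust_rank FROM pages");      // placeholder schema
             ResultSet rs = ps.executeQuery()) {
            // ... score each page against the query, update TrustRank or Anti-TrustRank,
            //     and write the new values back (omitted in this sketch) ...
        } catch (SQLException e) {
            throw new ServletException(e);
        }

        req.setAttribute("query", query);
        req.getRequestDispatcher("/results.jsp").forward(req, resp);   // the JSP view renders the ranked list
    }
}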
5.2 Results
Table 3 Result of modified algorithm for 5 web pages and 5 queries
Page | Metric | sachin tendulkar | sachin tendulkar | sachin tendulkar | football | sachin tendulkar
Sachin.htm | PageRank | 0.67511 | 0.67529 | 0.67548 | 0 | 0.67548
Sachin.htm | TrustRank | 0.00018 | 0.00036 | 0.00054 | 0.00054 | 0.00072
Sachin.htm | Anti-TrustRank | 0 | 0 | 0 | 0.00018 | 0.00018
SachinBiography.htm | PageRank | 0.6565 | 0.65811 | 0.65971 | 0 | 0.65971
SachinBiography.htm | TrustRank | 0.00161 | 0.00321 | 0.00482 | 0.00482 | 0.00642
SachinBiography.htm | Anti-TrustRank | 0 | 0 | 0 | 0.00161 | 0.00161
SachinESPN.htm | PageRank | 0.08554 | 0.08586 | 0.08618 | 0 | 0.08618
SachinESPN.htm | TrustRank | 0.00032 | 0.00065 | 0.00097 | 0.00097 | 0.00129
SachinESPN.htm | Anti-TrustRank | 0 | 0 | 0 | 0.00032 | 0.00032
SachinYahoo.html | PageRank | 0.14089 | 0.14195 | 0.14301 | 0 | 0.14301
SachinYahoo.html | TrustRank | 0.00106 | 0.00212 | 0.00318 | 0.00318 | 0.00424
SachinYahoo.html | Anti-TrustRank | 0 | 0 | 0 | 0.00106 | 0.00106
SachinInfo.html | PageRank | 0.94634 | 0.94688 | 0.94743 | 0 | 0.94743
SachinInfo.html | TrustRank | 0.00054 | 0.00108 | 0.00163 | 0.00163 | 0.00217
SachinInfo.html | Anti-TrustRank | 0 | 0 | 0 | 0.00054 | 0.00054

All results are displayed in the table above, which contains the results of 5 executions. In each execution, 5 web pages are filtered according to the user's query and the metadata of the web pages. The table shows the PageRank values for the 5 executions for the 5 different web pages, along with the corresponding TrustRank and Anti-TrustRank values. In the 4th execution the PageRank values are 0 because the web pages are considered spam pages, so their Anti-TrustRank values are increased as well.

CHAPTER 6
Analysis
6.1 Analysis
For the analysis we tested the existing and the modified algorithm 5 times. The existing algorithm gives the same value for the same query, whereas the modified algorithm gives different values depending on previous queries, because of the changes to the TrustRank and Anti-TrustRank parameters. The results are displayed in Table 4, from which we can see that the modified algorithm takes the user's previous queries into account while the existing algorithm gives a constant value. Figure 7 is a graphical presentation of Table 4: one line has constant height, corresponding to the existing algorithm, while the other line increases, corresponding to the modified algorithm. The PageRank value increases for the same query on the same web page because the TrustRank value is increased each time the page is found not to be a spam page. In the 4th test the web page is considered a spam page, so the Anti-TrustRank value is increased; therefore the 3rd and 5th tests have the same PageRank value, since in the 3rd and 4th tests the TrustRank and Anti-TrustRank values are increased respectively.
We also tested the modified algorithm on 5 different pages 5 times; the results are shown in Table 5. Figure 8 compares the PageRank values of the 5 different web pages for the 5 queries. The TrustRank and Anti-TrustRank parameters were 0 for all web pages before the analysis.

Table 4 Comparison of Existing algorithm and Modified algorithm
Query Modified Algorithm Existing Algorithm
sachin tendulkar 0.675113122 0.675113122
sachin tendulkar 0.675294118 0.675113122
sachin tendulkar 0.675475113 0.675113122
Football 0 0
sachin tendulkar 0.675475113 0.675113122

Figure 7 Comparison of Existing algorithm and Modified algorithm

Table 5 PageRank for 5 different web pages for 5 Queries
Query Sachin.htm SachinBiography.htm SachinESPN.htm SachinYahoo.html SachinInfo.html
sachin tendulkar 0.675113122 0.656500803 0.08553906 0.140889831 0.9463415
sachin tendulkar 0.675294118 0.658105939 0.08586185 0.141949153 0.9468835
sachin tendulkar 0.675475113 0.659711075 0.08618464 0.143008475 0.9474255
Football 0 0 0 0 0
sachin tendulkar 0.675475113 0.659711075 0.08618464 0.143008475 0.9474255

Figure 8 Comparison of PageRank values

During the analysis the TrustRank and Anti-TrustRank values changed, as shown in Table 6 below. For example, for the query 'sachin tendulkar' on the web page 'Sachin.htm', the TrustRank and Anti-TrustRank values are 0.00018 and 0 respectively. The keyword 'football' is stored in the metadata of all web pages but is not found in their content, so for that query all pages are considered spam pages and the Anti-TrustRank values of all web pages are increased.

Table 6 TrustRank and Anti-TrustRank values for 5 different web pages for 5 Queries (each cell shows TrustRank / Anti-TrustRank)
Query | Sachin.htm | SachinBiography.htm | SachinESPN.htm | SachinYahoo.html | SachinInfo.html
sachin tendulkar | 0.00018 / 0 | 0.00161 / 0 | 0.00032 / 0 | 0.00106 / 0 | 0.00054 / 0
sachin tendulkar | 0.00036 / 0 | 0.00321 / 0 | 0.00064 / 0 | 0.00212 / 0 | 0.00108 / 0
sachin tendulkar | 0.00054 / 0 | 0.0048 / 0 | 0.00097 / 0 | 0.00318 / 0 | 0.00163 / 0
Football | 0.00054 / 0.000181 | 0.0048 / 0.0016 | 0.00097 / 0.00032 | 0.00318 / 0.00106 | 0.00163 / 0.00054
sachin tendulkar | 0.00072 / 0.000181 | 0.00642 / 0.0016 | 0.00129 / 0.00032 | 0.00424 / 0.00106 | 0.00217 / 0.00054


CHAPTER 7
Conclusion and Future Work

7.1 Conclusion
In this work we analyzed various PageRank algorithms. The basic PageRank algorithm has several drawbacks, such as theme drift and over-emphasis on old pages. We analyzed the algorithm and addressed some of its limitations. After studying various content-based ranking algorithms, we found that they effectively rely on a single trust measure, so some pages do not give results related to the user's query, and a content-matching method is needed to reduce this problem. We therefore added two new parameters, TrustRank and Anti-TrustRank, to the existing algorithm: if a page is not a spam page its TrustRank value is increased, and if it is a spam page its Anti-TrustRank value is increased. TrustRank is a measurement of the trust of a page, and Anti-TrustRank of its anti-trust. The algorithm also measures the relatedness of two keywords within a page. Proper analysis of this algorithm shows that these parameters make the results more efficient.

7.2 Future Work
As future work, the accuracy of the algorithm can be improved by handling synonyms: the algorithm could take synonyms of the user's query into consideration so that search results can be filtered more accurately. The algorithm can also be modified to use indexing to improve performance. Meta-tags could be used instead of metadata. The current algorithm uses text as content; it could be enhanced to handle media files as well.

References

Papers:
S. Brin and L. Page, 'The Anatomy of a Large-Scale Hypertextual Web Search Engine,' 7th International WWW Conference Proceedings, Australia, April 1998.

Wei Huang and Bin Li, 'An Improved Method for the Computation of PageRank', Mechatronic Science, Electric Engineering and Computer (MEC), 2011 International Conference

Ranveer Singh and Dilip Kumar Sharma,' RatioRank: Enhancing the Impact of Inlinks and Outlinks' 3rd IEEE International Advance Computing Conference (IACC) 2013.

Ranveer Singh and Dilip Kumar Sharma,' Enhanced-RATIORANK: Enhancing Impact of Inlinks and Outlinks' IEEE Conference on Information and Communication Technologies 2013.

Wenpu Xing and Ali Ghorbani,' Weighted PageRank Algorithm', Proceedings of the Second Annual Conference on Communication Networks and Services Research (CNSR'04),IEEE 2004.

Ling Zhang and Zheng Qin, 'The Improved Pagerank in Web Crawler', Information Science and Engineering (ICISE), 2009 1st International Conference 2009.

Zhou Cailan and Chen Kai,Li Shasha,' Improved PageRank Algorithm Based on Feedback of User Clicks' IEEE 2011.

Robin Sharma, Ankita Kandpal, Priyanka Bhakuni, Rashmi Chauhan, R.H. Goudar and Asit Tyagi, ' Web Page Indexing through Page Ranking for Effective Semantic Search', Intelligent Systems and Control (ISCO), 2013 7th International Conference.

Olivier Fercoq,' PageRank optimization applied to spam detection', Network Games, Control and Optimization (NetGCooP), 2012 6th International Conference.

Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen, 'Combating web spam with TrustRank,' in Proc. 30th International Conference on Very Large Data Bases, vol. 30, VLDB Endowment, 2004, pp. 576-587.

V. Krishnan and R. Raj, 'Web spam detection with anti-trust rank,' in Proc. ACM SIGIR workshop on Adversarial Information Retrieval on the Web, 2006.

L. Page, S. Brin, R. Motwani, and T. Winograd, 'The PageRank Citation Ranking: Bringing Order to the Web', Technical report, Stanford Digital Library Technologies Project, 1998.

Ashish Jain, Rajeev Sharma, Gireesh Dixit and Varsha Tomar, 'Page Ranking Algorithms in Web Mining, Limitations of Existing Methods and a New Method for Indexing Web Pages', International Conference on Communication Systems and Network Technologies, IEEE 2013.

Apostolos Kritikopoulos, Martha Sideri and Iraklis Varlamis, 'Wordrank: A Method for Ranking Web Pages Based on Content Similarity', Databases, 2007. BNCOD '07. 24th British National Conference 2007.

Bing-Yuan Pu, Ting-Zhu Huang and Chun Wen, 'An Improved PageRank Algorithm: Immune to Spam ', Network and System Security (NSS), 2010 4th International Conference 2010.

Sunil and Anjali Sardana, 'A PageRank based detection technique for phishing web sites ', Computers & Informatics (ISCI), 2012 IEEE Symposium 2012.

WebSites:
www.pr.efactory.de/e-pagerank-algorithm.shtml
PageRank Algorithm - The Mathematics of Google Search
www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
www.en.wikipedia.org/wiki/PageRank

Books:
'Head First Servlets and JSP' by Bert Bates and Kathy Sierra.
'Java: The Complete Reference, J2SE 5 Edition' by Herbert Schildt.
