A Flexible Tool for Jaccard Score Evaluation

Nela Tomca, Laslo Kraus, Veljko Milutinovic

Department of Computer Engineering,
School of Electrical Engineering,
University of Belgrade,
POB 35-54,
11120 Belgrade, Serbia, Yugoslavia

E-mail: nela@galeb.etf.bg.ac.yu,
ekraus@etf.bg.ac.yu, vm@etf.bg.ac.yu

Abstract

Internet search is becoming problematic due to Information overload on the Internet. In order to help users in information searching, programs known as agents appeared. An agent is a program that can operate autonomously to accomplish unique tasks without direct human supervision. An agent presented in this paper is an agent for local search, written in the Java programming language. For a set of input documents, the agent finds and displays similar documents and the information regarding how similar they are.

Introduction

The Internet enables a computer user to be connected to a virtually endless number of sites on the network. The World Wide Web (WWW) uses the Internet to transmit hypermedia documents among computer users located around the world. In order to fully utilize the power of the WWW as a gigantic information source, it is essential to develop intelligent software systems on top of the Web, to assist users in retrieving relevant documents.

Information searching over the cyberspace is becoming more and more important. Two major approaches have been developed for Internet searching: client-based searching (implemented through agents - programs which look for documents similar to the set of input documents) and online database indexing and searching (machines like Yahoo, Altavista,...).

Problem Definition

Client-based searching agents can be agents for local search (which means that they search for documents linked to the input documents) and agents for non-local search (which can find documents that are not linked to the input documents). There is a research in progress at the Department of Computer Engineering, the University of Belgrade [3], related to client-based non-local Internet search (using genetic algorithms).

Broadly defined, an agent is a program that can operate autonomously to accomplish unique tasks without direct human supervision.

The agent presented in this paper named "BFS Agent" is an agent for local search, written in the Java programming language. For set of input documents, the agent finds and displays similar documents and the information regarding how similar they are. It uses another agent, called "Spider" [4] (part of the aforementioned project) for document fetching.

A Discussion of Existing Solutions

TueMosaic and WebCrawler are two prominent examples of agents based on Best First Search algorithm. One agent for local searching, called "BFS Spider" [2] can be found on the Internet. But, "BFS Spider" is written partly in C++ (the search engine) and partly in Java (the user interface), so it cannot be meaningfully compared with the agent presented in this paper.

Proposed Solution and its Essence

In this research, we have developed an agent for local Internet search based on the Best First Search algorithm. In order to compare similarity between two homepages, Jaccard's scores are used.

Jaccard's Scores [1]

Given two homepages, x and y, and their links in sets X and Y, the Jaccard's score between x and y based on links is then computed as follows:

where #(S) indicates the cardinality of set S.

 

Given a set of homepages the terms in these homepages are identified (keywords). The term frequency and the homepages frequency are then computed. Term frequency, tfxj, represents the number of occurrences of term j in homepage x. Homepage frequency, dfj, represents the number of homepages in a collection of N homepages in which term j occurs. The combined weight of term j in homepage x, dxj, is computed as follows:

where wj represents the number of words in term j, and N represents the total number of homepages. The Jaccard's score between homepages x and y based on indexing is then computed as follows:

where L is the total number of terms.

Best First Search Algorithm [1]

1. Input homepages and Initialization:

Initialize k to 1. A set of homepages, input1, input2,..., inputm are obtained from users. These input homepages are fetched and their linked homepages are saved in H={h1, h2...}.

2. Determine the best homepage:

Determine the best homepage in H which has the highest Jaccard's score among all the homepages in H and save it as outputk. The Jaccard's score based on links for hi is computed as follows:

where JSlinks(inputj,hi) represents the Jaccard's score between inputj and hi based on links. Jaccard's score based on indexing for hi is also computed similarly:

where JSindex(inputj,hi) represents the Jaccard's score between inputj and hi based on indexing. The Jaccard's score for hi is computed as follows:

The homepage in H which have the highest Jaccard's score is saved as an output, outputk.

3. Fetch the best homepage:

Fetch the best homepage, outputk, and add all its linked homepages to H. Increase k by 1.

4. Repeat until all the output homepages are obtained:

Repeat steps 2 and 3 until k equals to the total number of outputs plus one required by user.

A Graphical example of the way the Best First Search algorithm works is given in Figure 1.

Figure 1. The Best First Search algorithm
(an example)

Some Relevant Details

Structure of "BFS Agent" is given in Figure 2. Ellipses represent classes of "BFS Agent," and rounded rectangle represents the "Spider" agent.

Class "agent" represents an executable class which performs the Best First Search algorithm, and fetches the relevant documents and stores them on the local disk. Fields and methods needed for implementation of the Best First Search algorithm can be found in class "BFS." The role of "BFSURL" class is to compute Jaccard's scores. Class "KeyParser" extracts keywords from a set of input documents. Class "BFSParser" performs search for keywords and hyperlinks through the document.

Figure 2. Cooperation between "BFS Agent" classes and program "Spider"

Conclusion

Agent for local search was meant to be a part of a bigger project, but also to be able to do the autonomous work. During "Agent" development, it was noticed that it can be improved. For example, improvement of class which performs extraction of keywords, creation of graphical user interface, creation of mobile agents... Agent presented in this paper can give better search results than index engines in cases of local search regarding time of search and quality of the documents found.

References

  1. Chen, H., Chung, Y., Ramsey, M., Yang, C., Ma, P., Yen, J., "Intelligent Spider for Internet Searching," Proceedings of the 30th Annual Hawaii International Conference on System Sciences - Volume IV, Kailua-Kona, Hawaii, USA, January 1997,
    pp. 178-188.
  2. Ramsey, M., "AI Internet Search Spider," http://ai.bpa.arizona.edu/~mramsey/, December 1997.
  3. Kraus, L., Milutinovic, V., Technical Report on A New Genetic Algorithm for Internet Search based on Principles of Spatial and Temporal Locality, Number p30/SinfoN '97, University of Belgrade, Belgrade, Serbia, Yugoslavia, November 1997.
  4. Slijepcevic, S., A programmable Agent for Internet Retrieval, B.Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, October 1997.

Appendix

We would like to express our gratitude to the members of the Internet group at the Department of Computer Engineering, School of Electrical Engineering, the University of Belgrade for their support and suggestions in writing this paper. Therefore we thank to Jelena Mirkovic, Suzana Cveticanin, Davor Magdic, and Sasa Slijepcevic.