A Genetic Algorithm for Internet Search
Using a DB-Oriented Topic Sorted Mutation

Jelena Mirkovic, Laslo Kraus, Veljko Milutinovic

Department of Computer Engineering,
School of Electrical Engineering,
University of Belgrade,
POB 35-54,
11120 Belgrade, Serbia, Yugoslavia

E-mail: sunshine@galeb.etf.bg.ac.yu,
ekraus@etf.bg.ac.yu, vm@etf.bg.ac.yu

Abstract

Internet search is becoming problematic due to Information overload on the Internet. In order to help users in information searching, various applications appeared. There are two basic approaches to Internet search: indexing and agent search. An agent presented in this paper uses genetic algorithm for global search. It is written in the Java programming language. For a set of input documents, the agent finds and displays similar documents and the information regarding how similar they are. It uses database of topic sorted URLs to perform mutation.

Introduction

Each day number of documents and servers on the Internet grows rapidly. In such an abundance of data it is hard to get the information needed in the shortest time possible. There are two approaches to Internet search:

  1. Indexing and database searching (e.g. Altavista, Yahoo...) - tools for indexing fetch all available documents on the Internet, parse their contents, and store it in an indexed database. Based on the keywords, submitted by the user, the tool searches the database and displays all documents containing desired keywords. Advantage of this approach is that it covers a big search space, but these tools have usually poor evaluation function, and keywords that user submits have to be carefully picked up, in order to obtain a desired number and quality of resulting documents.
  2. Agent search - autonomous program (agent) performs the search without user supervision. It takes the number of URLs as input parameters and follows their links in order to find similar documents on the Web.

Program package Tropical, developed in this B.Sc. thesis is agent that performs Internet search using genetic algorithm with database mutation. It is written in Java programming language.

Problem Definition

Client-based searching agents can be agents for local search (which means that they search for documents linked to the input documents) and agents for non-local search (which can find documents that are not linked to the input documents).

Broadly defined, an agent is a program that can operate autonomously to accomplish unique tasks without direct human supervision.

The agent presented in this paper named Tropical is an agent for global search, written in the Java programming language. For set of input documents, the agent finds and displays similar documents and the information regarding how similar they are. It uses services of following packages:

A Discussion of Existing Solutions

One agent for Internet search using genetic algorithm is described in the paper [2]. Aim of this B.Sc. thesis is repeating experiment described in [2], but also its improvement using new methods of mutation.

Proposed Solution and its Essence

Basic Genetic Algorithm

Genetic algorithm is a search method that is used for covering a big search space, and finding the optimal solution. Its application is especially important in AI. It is frequently used for finding optimal schedule of resource. Basic steps in the workflow of genetic algorithm are:

  1. representing each solution uniformly in a way suitable for computer processing (often like arrays or strings)
  2. initialization of the current set of solutions that is examined (current configuration set) by picking randomly from the whole search space desired number of potential solutions
  3. Forming set of promising solutions for further examination (mating pool) by picking the best ones from current configuration set. How good is the solution, determines the fitness function which is defined for each problem separately.
  4. Performing operators of crossover and mutation on mating pool and thus generating new solutions. Crossover operator generates new solution by using genetic material of existing solutions in the mating pool. Mutation operator generates a whole new solution randomly. New solutions are then inserted in the current configuration set.
  5. If stopping criterion is satisfied algorithm ends, otherwise steps 3-5 are repeated.

Genetic Algorithm for Internet Search

Algorithm performs following steps:

  1. User submits input set of documents like string of URLs.
  2. From the Web are fetched the input documents and parsed in order to extract keywords.
  3. From the Web are fetched documents that are pointed to by the hyperlinks in input documents and they form the current configuration set H.
  4. From the database is randomly picked desired number of URLs that cover the same topic as do the input documents (this topic is specified by the user at the beginning of the program) and corresponding documents are fetched from the Web and inserted in the set H.
  5. Determining the documents from the set H that are most similar to the input ones by calculating Jaccard Score for each of them.
  6. Inserting the most similar document in the output set and adding the documents linked to it in the set H.
  7. Repeating steps 4-6 until desired number of output documents is obtained.

Some Relevant Details

Linking the database - just once before you first start the program

Start button in Windows environment, Settings, Control Panel, ODBC, User DSN Window, Add, Microsoft Access Driver, Finish, In the field Data Source Name put URLbase, Select, select the database url.mdb in the tropical directory, Advanced, in the fields for username and password type mladen and java

Starting the program

In JDK 1.1.4 Package:

java tropical -dProxySet=true

-dProxyHost=proksi

-dProxyPort=port

proksi is IP address of proxy server you are using
port port number of this proxy server

I suggest you put this line in tropical.bat file and then just type tropical each time.

Example of the command line is:

java tropical -dProxySet=true

-dProxyHost=147.91.8.62

-dProxyPort=8080

Input parameters

Output window

Result of the program work can also be found in the file izlaz.html in the Tropical directory and can be viewed during program execution.

 

Conclusion

Agent for global search was meant to be a part of a bigger project, but also to be able to do the autonomous work. During "Tropical" development, it was noticed that it could be improved. For example, improvement of class which performs extraction of keywords, creation of mobile agents, intelligent search, cooperation with Altavista tool in order to cover a bigger search space... Agent presented in this paper can give better search results than index engines in cases when desired URLs are in the database regarding time of search and quality of the documents found.

References

  1. Chen, H., Chung, Y., Ramsey, M., Yang, C., Ma, P., Yen, J., "Intelligent Spider for Internet Searching," Proceedings of the 30th Annual Hawaii International Conference on System Sciences - Volume IV, Kailua-Kona, Hawaii, USA, January 1997,
    pp. 178-188.
  2. Ramsey, M., "AI Internet Search Spider," http://ai.bpa.arizona.edu/~mramsey/, December 1997.
  3. Slijepcevic, S., A Programmable Agent for Internet Retrieval,
    B. Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, Oktober 1997.
  4. Ramsey, M., "AI Internet Search Spider," http://ai.bpa.arizona.edu/~mramsey, December 1997.
  5. Tomca, N., A Flexible Tool for Jaccard Score Evaluation ,
    B. Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, December 1997.
  6. Mrkic, M., Obradovic, V., Managing an Internet Address Database, B. Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, May 1997.
  7. Cakulev, I., Program Package Host for Searching the DSN Domain, B. Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, June 1997.

Appendix

We would like to express our gratitude to the members of the Internet group at the Department of Computer Engineering, School of Electrical Engineering, the University of Belgrade for their support and suggestions in writing this paper.