A Genetic Algorithm for Internet Search
Using a DB-Oriented Topic Sorted Mutation
Jelena Mirkovic, Laslo Kraus, Veljko Milutinovic
Department of Computer Engineering,
School of Electrical Engineering,
University of Belgrade,
POB 35-54,
11120 Belgrade, Serbia, Yugoslavia
E-mail: sunshine@galeb.etf.bg.ac.yu,
ekraus@etf.bg.ac.yu, vm@etf.bg.ac.yu
Abstract
Internet search is becoming problematic due to Information overload on the Internet.
In order to help users in information searching, various applications appeared. There are two basic approaches to Internet search: indexing and agent search. An agent presented in this paper uses genetic algorithm for global search. It is written in the Java programming language. For a set of input documents, the agent finds and displays similar documents and the information regarding how similar they are. It uses database of topic sorted URLs to perform mutation.
Introduction
Each day number of documents and servers on the Internet grows rapidly. In such an abundance of data it is hard to get the information needed in the shortest time possible. There are two approaches to Internet search:
- Indexing and database searching (e.g. Altavista, Yahoo...) - tools for indexing fetch all available documents on the Internet, parse their contents, and store it in an indexed database. Based on the keywords, submitted by the user, the tool searches the database and displays all documents containing desired keywords. Advantage of this approach is that it covers a big search space, but these tools have usually poor evaluation function, and keywords that user submits have to be carefully picked up, in order to obtain a desired number and quality of resulting documents.
- Agent search - autonomous program (agent) performs the search without user supervision. It takes the number of URLs as input parameters and follows their links in order to find similar documents on the Web.
Program package Tropical, developed in this B.Sc. thesis is agent that performs Internet search using genetic algorithm with database mutation. It is written in Java programming language.
Problem Definition
Client-based searching agents can be agents for local search (which means that they search for documents linked to the input documents) and agents for non-local search (which can find documents that are not linked to the input documents).
Broadly defined, an agent is a program that can operate autonomously to accomplish unique tasks without direct human supervision.
The agent presented in this paper named Tropical is an agent for global search, written in the Java programming language. For set of input documents, the agent finds and displays similar documents and the information regarding how similar they are. It uses services of following packages:
- Spider [3] for document fetching
- Agent [5] for Best First Search algorithm
- Generator [6] for designing and communication with the database of topic sorted URLs.
A Discussion of Existing Solutions
One agent for Internet search using genetic algorithm is described in the paper [2]. Aim of this B.Sc. thesis is repeating experiment described in [2], but also its improvement using new methods of mutation.
Proposed Solution and its Essence
Basic Genetic Algorithm
Genetic algorithm is a search method that is used for covering a big search space, and finding the optimal solution. Its application is especially important in AI. It is frequently used for finding optimal schedule of resource. Basic steps in the workflow of genetic algorithm are:
- representing each solution uniformly in a way suitable for computer processing (often like arrays or strings)
- initialization of the current set of solutions that is examined (current configuration set) by picking randomly from the whole search space desired number of potential solutions
- Forming set of promising solutions for further examination (mating pool) by picking the best ones from current configuration set. How good is the solution, determines the fitness function which is defined for each problem separately.
- Performing operators of crossover and mutation on mating pool and thus generating new solutions. Crossover operator generates new solution by using genetic material of existing solutions in the mating pool. Mutation operator generates a whole new solution randomly. New solutions are then inserted in the current configuration set.
- If stopping criterion is satisfied algorithm ends, otherwise steps 3-5 are repeated.
Genetic Algorithm for Internet Search
Algorithm performs following steps:
- User submits input set of documents like string of URLs.
- From the Web are fetched the input documents and parsed in order to extract keywords.
- From the Web are fetched documents that are pointed to by the hyperlinks in input documents and they form the current configuration set H.
- From the database is randomly picked desired number of URLs that cover the same topic as do the input documents (this topic is specified by the user at the beginning of the program) and corresponding documents are fetched from the Web and inserted in the set H.
- Determining the documents from the set H that are most similar to the input ones by calculating Jaccard Score for each of them.
- Inserting the most similar document in the output set and adding the documents linked to it in the set H.
- Repeating steps 4-6 until desired number of output documents is obtained.
Some Relevant Details
Linking the database - just once before you first start the program
Start button in Windows environment, Settings, Control Panel, ODBC, User DSN Window, Add, Microsoft Access Driver, Finish, In the field Data Source Name put URLbase, Select, select the database url.mdb in the tropical directory, Advanced, in the fields for username and password type mladen and java
Starting the program
In JDK 1.1.4 Package:
java tropical -dProxySet=true
-dProxyHost=
proksi
-dProxyPort=
port
proksi is IP address of proxy server you are using
port port number of this proxy server
I suggest you put this line in tropical.bat file and then just type tropical each time.
Example of the command line is:
java tropical
-dProxySet=true
-dProxyHost=
147.91.8.62
-dProxyPort=
8080
Input parameters
- Input Set - field for input set of URLs. It is necessary that you enter either a set of URLs or a name of the file containing URLs or both.
- Input file - field for the name of the file with URLs (each in the new line)
- Number of outputs - field for the required number of output documents. Default number is 10.
- Database mutation - list for desired number of mutated documents per iteration. Default number is 0.
- Spatial mutation - not realized yet.
- Temporal mutation - not realized yet.
- Topic - list for choosing one or more topics covered by mutation.
- Comments - comments on/off field.
Output window
- Program Flow - some program messages that keep you informed about program execution.
- Result Set - result set of URLs.
- Medium Jaccard's Score - medium Jaccard's Score for the result set.
Result of the program work can also be found in the file izlaz.html in the Tropical directory and can be viewed during program execution.
Conclusion
Agent for global search was meant to be a part of a bigger project, but also to be able to do the autonomous work. During "Tropical" development, it was noticed that it could be improved. For example, improvement of class which performs extraction of keywords, creation of mobile agents, intelligent search, cooperation with Altavista tool in order to cover a bigger search space... Agent presented in this paper can give better search results than index engines in cases when desired URLs are in the database regarding time of search and quality of the documents found.
References
- Chen, H., Chung, Y., Ramsey, M., Yang, C., Ma, P., Yen, J., "Intelligent Spider for Internet Searching," Proceedings of the 30th Annual Hawaii International Conference on System Sciences - Volume IV, Kailua-Kona, Hawaii, USA, January 1997,
pp. 178-188.
- Ramsey, M., "AI Internet Search Spider," http://ai.bpa.arizona.edu/~mramsey/, December 1997.
- Slijepcevic, S., A Programmable Agent for Internet Retrieval,
B. Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, Oktober 1997.
- Ramsey, M., "AI Internet Search Spider," http://ai.bpa.arizona.edu/~mramsey, December 1997.
- Tomca, N., A Flexible Tool for Jaccard Score Evaluation ,
B. Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, December 1997.
- Mrkic, M., Obradovic, V., Managing an Internet Address Database, B. Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, May 1997.
- Cakulev, I., Program Package Host for Searching the DSN Domain, B. Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, June 1997.
Appendix
We would like to express our gratitude to the members of the Internet group at the Department of Computer Engineering, School of Electrical Engineering, the University of Belgrade for their support and suggestions in writing this paper.