A Software Package for Experimenting in Genetic Search on Internet:
Spatial versus Temporal Locality and Static versus Mobile Agents

Veljko Milutinović, Laslo Kraus, Jelena Mirković, Ljiljana Nešić, Nela Tomča, Suzana Cvetićanin,
Dragana Cvetković, Saša Slijepčević, Vladan Obradović, Mladen Mrkić, Igor Čakulev

Department of Computer Engineering
School of Electrical Engineering
University of Belgrade
P.O. Box 35-54, 11120 Belgrade, Serbia, Yugoslavia

E-mail: vm@etf.bg.ac.yu

 

Abstract

As the number of documents and servers on the Internet grows at an enormous speed, it becomes necessary to design efficient tools for the search and retrieval of documents. This paper describes several existing solutions to the problem and discusses the implementation of one of them: the design of intelligent genetic algorithms for Internet search.

 

Introduction

Among the huge number of documents and servers on the Internet, it is hard to quickly locate documents that contain potentially useful information. Therefore, a key factor in software development nowadays should be the design of applications that efficiently locate and retrieve the Internet documents that best meet the user's requests. The accent is on intelligent content examination and on selecting the documents most similar to those submitted by the user as input.

One approach to this problem is to index all accessible Web pages and store this information in a database. When the application is started, it extracts keywords from the user-supplied documents and consults the database to find documents in which the given keywords appear with the greatest frequency. This approach, besides requiring the maintenance of a huge database, suffers from poor precision: it returns numerous documents totally unconnected to the user's topic of interest.
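A minimal sketch of this indexing approach, using a toy inverted index (the documents, keywords, and counts below are made up for illustration):

```python
from collections import Counter

# Toy inverted index: keyword -> {document: occurrence count}.
# A real engine would build this by crawling and indexing Web pages.
index = {
    "genetic":  {"doc_a.html": 12, "doc_b.html": 1},
    "search":   {"doc_a.html": 8,  "doc_c.html": 5},
    "mutation": {"doc_a.html": 3},
}

def rank_by_keyword_frequency(keywords, index):
    """Return documents sorted by total occurrences of the given keywords."""
    scores = Counter()
    for kw in keywords:
        for doc, count in index.get(kw, {}).items():
            scores[doc] += count
    return [doc for doc, _ in scores.most_common()]

print(rank_by_keyword_frequency(["genetic", "search"], index))
# ['doc_a.html', 'doc_c.html', 'doc_b.html']
```

Note how a document can rank first on raw keyword frequency even if it is unrelated to the user's topic; this is exactly the precision problem described above.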

The second approach is to follow links from a number of documents submitted by the user and to find the most similar ones, performing a genetic search on the Internet. The application starts from a set of input documents and, by following their links, finds the documents most similar to them. The search and evaluation are performed using genetic algorithms as a heuristic search method. If only links from the input documents are followed, this is Best First Search, or genetic search without mutation. If, besides the links of the input documents, some other links are also examined, it is genetic search with mutation.

The second approach was realized and tested at the University of Hong Kong, and the results were presented at the 30th Annual Hawaii International Conference on System Sciences [4]. Best First Search was compared to a genetic search in which mutation was performed by picking a URL from a subset of URLs covering the selected topic; that subset was obtained from a database generated at compile time. It was shown that search using genetic algorithms gives better results than Best First Search for a small set of input documents, because it is able to step outside the local domain and examine a larger search space.

There is ongoing research at the University of Belgrade on mutation that exploits spatial and temporal locality. The idea of spatial locality exploitation is to examine documents in the neighborhood of the best-ranked documents found so far, i.e., on the same server or local network. Temporal locality concerns maintaining information about previous search results and performing mutation by picking URLs from that set.

If either of the above methods is used, a lot of time is spent fetching documents from the Internet onto the local disk, because content examination and evaluation must be performed off-line. Thus, a huge amount of data is transferred through the network in vain, because only a small percentage of the fetched documents will turn out to be useful. The logical improvement is the construction of mobile agents that would browse through the network and perform the search locally, on the remote servers, transferring only the needed documents and data.

 

Problem statement: Making a package

While there are undoubtedly numerous search engines designed to locate and retrieve the Internet documents that best meet the user's requests, many of them operate in a non-user-friendly manner and return a huge number of files, many of which are completely unrelated to what the user originally attempted to find. This is mostly the case with indexing engines, which have great potential for finding almost any desired document, but also have poor evaluation functions. It is hard for a user to form a query: a small number of keywords, submitted as input, will result in too many documents retrieved, while too many keywords will produce a result set that is too small. Due to the simple evaluation function, documents placed at the top of the result list are often less acceptable than lower-ranked ones. If the list is long, one might never reach those better documents, because the user will very likely get bored after "clicking" on several hundred links and give up.

The links approach usually has a better evaluation function and is user-friendly, since the user only has to submit documents, not keywords. It explores a smaller search space, but with greater efficiency. Of course, this approach would be severely limited if only the links contained in the user's documents were followed, so external help is needed, in the form of a database of links or some sort of indexing. The major advantage is that it is not necessary to keep a large database of indexes, and the search remains roughly within the area of the user's interest.

 

We decided to follow the second approach and, since we wanted to explore how different mutation strategies affect search efficiency, we developed a set of software packages that perform Internet search using genetic algorithms as decision-making tools.

These packages are developed using the LEGO approach: they are able to operate as stand-alone applications, but are also easily interfaced with one another to combine different search methods.

 

Existing solutions

As already indicated, paper [4] describes an experiment with genetic search on the Internet. The experiment was conducted at the University of Hong Kong, and the following steps were performed:

1. A set of input documents, submitted by the user, was indexed using the SWISH package for indexing HTML documents, and keywords were extracted. The current set was initialized with the input set.

2. Documents that the hyperlinks from the current set refer to were fetched, and their similarity to the input documents was evaluated using Jaccard's score as the evaluation function. The best ones were injected into the current set.

3. From the prepared database of topic-sorted URLs, a number of URLs was selected randomly (under user selected category) and injected into the current set.

4. The best document from the current set was injected into the output set, and the documents that its hyperlinks pointed to were injected into the current set.

5. Steps 2, 3, and 4 were repeated until the predefined size of the output set was attained, or until the current set was emptied.
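The five steps above can be sketched as the following loop; `fetch_links`, `jaccard_score`, and `pick_topic_urls` are hypothetical placeholders for the real fetching, evaluation, and URL-database components:

```python
def genetic_search(input_docs, fetch_links, jaccard_score, pick_topic_urls,
                   output_size, mutation_rate):
    """Skeleton of the search loop described in [4] (steps 1-5)."""
    current = list(input_docs)                  # step 1: current set = input set
    output = []
    while current and len(output) < output_size:
        # step 2: fetch documents linked from the current set
        # ([4] injects only the best of them; all are kept here for brevity,
        # and duplicate fetches are not filtered in this sketch)
        for doc in list(current):
            current.extend(fetch_links(doc))
        # step 3: mutation - inject randomly chosen topic URLs
        current.extend(pick_topic_urls(mutation_rate))
        # step 4: move the fittest document into the output set
        best = max(current, key=jaccard_score)
        current.remove(best)
        output.append(best)
        current.extend(fetch_links(best))
    return output
```

With no links and no mutation (both callbacks returning empty lists), the loop degenerates to repeatedly picking the fittest remaining input document.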

 

Our solution

In our project, a set of software packages was developed for experimenting in genetic search on the Internet. These packages are compatible and easily interfaced with one another to support the reconfigurable nature of the project. Figure 1 shows the block diagram of our project:

Figure 1: Block diagram of the static implementation of the project. Continuous lines show the flow of data, and dashed lines show the control flow. Rectangles represent the applications and ovals the input and output data structures.

Spider is the application for fetching Internet documents, starting from one input URL, up to a certain depth specified by the user. Documents are stored on the local disk in a folder structure resembling the one on the remote server. A separate folder is created for each remote server, and hyperlinks in the stored documents are changed so that they point to the documents on the local disk.

Agent is the application that performs Best First Search. It starts from the set of input documents and, by following their links and evaluating the Jaccard's score of the documents found, finds the Internet pages most similar to them.
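Jaccard's score for two documents is commonly computed over their keyword sets as |A ∩ B| / |A ∪ B|; a minimal version (the exact variant used by Agent may differ in detail):

```python
def jaccard_score(keywords_a, keywords_b):
    """Jaccard similarity of two keyword sets: |A & B| / |A | B|."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0                      # convention for two empty sets
    return len(a & b) / len(a | b)

print(jaccard_score({"genetic", "search", "internet"},
                    {"genetic", "search", "agents"}))   # 2 shared of 4 total -> 0.5
```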

Generator is the application that generates and updates the database of URLs that are classified according to the topic they cover. This database is used for mutation, when repeating the Hong Kong experiment.

Topic is the application that performs mutation as described in [4], selecting URLs from the previously generated database and injecting them into the current set, thus performing so-called "topic mutation".

Space is the application that performs mutation using spatial locality. A document is spatially local to another document if they are located on the same server or the same local network. If the genetic algorithm finds a document with a high fitness value on a particular site, there is a strong possibility that it can find similar documents somewhere on the same local network, because people who have accounts on the same server or network usually have similar interests (which is most likely on academic networks). This approach is somewhat hard to conduct, since the genetic algorithm has to either examine all sites on a server/network (which is time consuming) or randomly pick a subset of them.

Time is the application that performs mutation using temporal locality. A database of URLs that appeared in the output sets of previously performed searches is maintained, along with a counter for each URL. This counter shows the number of times each particular URL occurred in the output set of any search. Mutation is performed by picking from this database a number of URLs with the highest counter values and inserting them into the current set.

Control Program is the control layer that manages program interfacing and execution. It is possible to select the mutation rate and type (topic, spatial, temporal) and to follow the progress of the search by examining the average Jaccard's score of the documents in the output set.

 

Details of our work

Here is an overview of input and output parameters of the aforementioned applications:

 

Application    Input              Output
-----------    ---------------    ----------------------------------
Spider         one URL            folder structure on the local disk
Agent          a set of URLs      a set of URLs
Generator      topic selection    database of URLs
Topic          a set of URLs      a set of URLs
Space          a set of URLs      a set of URLs
Time           a set of URLs      a set of URLs

 

Spider takes one URL as input, fetches the corresponding document, follows the links contained in it, and fetches all documents (HTML files, pictures, animations, etc.) they point to, up to a certain depth specified by the user. These documents are stored on the local disk in a folder structure identical to the one on the remote server, and all links are changed to point to the documents on the disk.
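A simplified sketch of such a depth-limited fetcher; the `fetch` callback and the link-extraction pattern are illustrative, and the real Spider additionally mirrors the folder structure on disk and rewrites hyperlinks, which is omitted here:

```python
import re

def spider(url, fetch, depth, seen=None):
    """Fetch `url` and, recursively, the documents it links to, up to `depth`
    levels (depth 1 fetches only the document the URL itself points to).
    `fetch` maps a URL to its HTML text; returns {url: html} for every page."""
    if seen is None:
        seen = {}
    if depth < 1 or url in seen:
        return seen
    seen[url] = html = fetch(url)
    # Follow every hyperlink in the page, one level deeper.
    for link in re.findall(r'href="([^"]+)"', html):
        spider(link, fetch, depth - 1, seen)
    return seen
```

Passing `fetch` as a parameter keeps the sketch testable without network access; in the real application it would wrap an HTTP request.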

Agent takes a set of URLs as input and calls the Spider application for each of them, with a fixed depth of 1 (only the document that the URL points to); the fetched documents form the input set, or generation. It extracts keywords from the documents in the set and stores them in a convenient structure. It also extracts hyperlinks from the input documents. The program then enters a loop inside which the following steps are performed:

1. Spider is called with the extracted hyperlinks and a depth of 1; thus the current set is formed.

2. The fetched documents are evaluated using the Jaccard's score function and the earlier extracted keywords, and the one most similar to the input documents (the one with the highest Jaccard's score) is inserted into the output set.

3. The documents that the hyperlinks of the best document point to are fetched using the Spider application and inserted into the current set.

4. Steps 1, 2, and 3 are repeated until the desired number of output documents is reached or until the current set is empty.

Generator takes the desired topic as input, calls the Yahoo engine, and submits a query looking for all documents covering the specified topic. It extracts the URLs from Yahoo's result page and fills the TopData database with them. This database has only two fields: URL and topic. The topic field is filled with the input parameter of the application.
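The URL-extraction step can be sketched as follows; the markup pattern is a generic illustration, not Yahoo's actual result-page format, and TopData is modeled as a plain list of records:

```python
import re

def extract_result_urls(result_html):
    """Pull absolute hyperlink targets out of a search engine's result page.
    The pattern below is a generic illustration of result-page scraping."""
    return re.findall(r'href="(https?://[^"]+)"', result_html)

def fill_topdata(topic, result_html, topdata):
    """Append (URL, topic) rows to the TopData database (here a plain list)."""
    for url in extract_result_urls(result_html):
        topdata.append({"url": url, "topic": topic})
    return topdata
```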

Topic takes as input the current set from the Agent application and a topic that the user selects among those existing in the TopData database. It performs mutation on the current set in the following way: it selects from TopData all documents that belong to the specified topic, randomly picks the specified number of them, and injects them into the current set.
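A minimal sketch of this topic mutation, with TopData modeled as a list of {url, topic} records:

```python
import random

def topic_mutation(current_set, topdata, topic, rate, rng=random):
    """Inject up to `rate` randomly chosen URLs of the given topic
    into the current set."""
    candidates = [row["url"] for row in topdata if row["topic"] == topic]
    picked = rng.sample(candidates, min(rate, len(candidates)))
    current_set.extend(picked)
    return current_set
```

The `rng` parameter makes the randomness injectable, which is convenient for reproducible experiments.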

Space takes as input the current set from the Agent application and searches for Internet documents from the local network on which the best-ranked document is located. It does not examine all existing documents (that would incur too much overhead), but randomly picks a subset of them and evaluates them by calculating Jaccard's score. The best ones are injected into the current set.
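A sketch of this spatial mutation; `list_server_urls`, the scoring function, and the `sample_size` parameter are hypothetical placeholders for the real components:

```python
import random
from urllib.parse import urlparse

def space_mutation(current_set, best_url, list_server_urls, jaccard, rate,
                   sample_size=50, rng=random):
    """Spatial mutation: sample documents from the server hosting the best
    document, score them, and inject the `rate` highest-scoring URLs."""
    server = urlparse(best_url).netloc          # e.g. "galeb.etf.bg.ac.yu"
    candidates = list_server_urls(server)       # all known URLs on that server
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    sample.sort(key=jaccard, reverse=True)      # best-scoring first
    current_set.extend(sample[:rate])
    return current_set
```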

Time takes as input the current set from the Agent application and injects into it those URLs from the NetData database that appeared with the greatest frequency in the output sets of previous searches. This database has two fields, URL and count number, and is updated at the end of each run of the main application, Control Program, in the following way:

1. Each URL from the output set is searched for in NetData.

2. If it is found, its count number is incremented.

3. If it is not found, it is inserted into NetData with count number equal to 1.

4. In the case of overflow (the size limit of NetData is reached), the documents with the lowest count numbers are deleted from the database, and free space is thus gained.
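The update procedure above and the mutation itself can be sketched as follows, with NetData modeled as a plain dict from URL to count (the real database layer is omitted):

```python
def update_netdata(netdata, output_set, max_size):
    """Steps 1-4 above: bump counters for output URLs, evicting the
    lowest-count entries when the database exceeds `max_size`."""
    for url in output_set:
        netdata[url] = netdata.get(url, 0) + 1      # steps 1-3
    while len(netdata) > max_size:                  # step 4: overflow
        victim = min(netdata, key=netdata.get)
        del netdata[victim]
    return netdata

def time_mutation(current_set, netdata, rate):
    """Inject the `rate` most frequently seen URLs into the current set."""
    best = sorted(netdata, key=netdata.get, reverse=True)[:rate]
    current_set.extend(best)
    return current_set
```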

Control Program takes as input a set of URLs, the desired number of output documents, and the mutation type and rate. If the mutation rate for a certain type is greater than zero, that mutation type is performed; if the mutation rate is zero for all three mutation types (topic, spatial, temporal), Best First Search is performed. Any combination of topic, spatial, and temporal mutation is possible. If topic mutation is selected, the user is asked to select the desired topic. Control Program then calls the Agent, interrupting it after each iteration to perform the selected kind of mutation. Afterwards, the processing continues.
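The control loop can be sketched as follows; `agent_iteration` and the mutation callbacks are hypothetical placeholders for the real applications:

```python
def control_program(input_urls, output_size, rates, agent_iteration, mutations):
    """Skeleton of the control layer: run Agent iterations, interrupting
    after each one to apply every mutation type whose rate is above zero."""
    current, output = list(input_urls), []
    while current and len(output) < output_size:
        current, best = agent_iteration(current)     # one Agent iteration
        output.append(best)
        for kind, mutate in mutations.items():
            if rates.get(kind, 0) > 0:               # zero rate = disabled
                current = mutate(current, rates[kind])
    return output
```

When every rate is zero, no mutation callback runs and the loop reduces to Best First Search, matching the description above.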

 

An Experiment

In the experiment we performed, we measured the average Jaccard's score of the documents in the output set while changing the type and rate of mutation. The input set was fixed and consisted of three files. The desired number of output documents was fixed at 10. The database used for topic mutation was filled at compile time in as convenient a way as possible, in order to achieve the best simulation results. The input window of the program used for performing the experiment is shown in Figure 2.

Figure 2: The application window

We measured the average Jaccard's score for the following cases:

In all cases when mutation was employed, the mutation rate (the number of URLs inserted per iteration) was varied from 0 to 30 in steps of one.

In the case when Best First Search was performed, the resulting Jaccard's score was 0.038252.

The simulation results for the experiment in which only topic mutation was performed are shown in Figure 3. As can be seen from the graph, topic mutation significantly increases the quality of the pages found using this search tool. The increase becomes smaller as the mutation rate grows, but the score still increases monotonically with the mutation rate. The conclusion is that mutation should be performed in the search, to a certain extent, to obtain satisfactory results. A mutation rate of 10 already gives pages of 300% better quality than when no mutation is used. However, a mutation rate of 20 gives pages only 30% better than a mutation rate of 10.

 Figure 3: The simulation results for topic mutation

Our next step was to determine whether any additional gain in page quality could be achieved by employing spatial and temporal mutation.

The simulation results for employing temporal and spatial mutation in addition to topic mutation are shown in figure 4.

 Figure 4: The simulation results for temporal and spatial mutation combined with topic mutation

As can be seen from the graph, spatial mutation brings only a sporadic increase; in some cases we get even worse results than with topic mutation alone. Temporal mutation brings a consistent increase in the quality of the pages found.

In the next experiment we employed all three types of mutation; the results are shown in Figure 5.

Figure 5: The simulation results for topic, spatial and temporal mutation combined

The effect of employing all three types of mutation is a consistent increase in the quality of pages found. This increase is greater than when any of the mutation types is performed separately or in a combination of two.

 

Reimplementation in the mobile domain

As the main part of the work in the application is done by fetching and examining Internet documents, often in vain, the logical solution to the problem is to construct mobile agents that would transfer themselves onto the remote servers and examine documents at the remote sites, transferring back only those that are likely to be useful. This approach would achieve a significant gain in time and disk space, especially in the module that calculates Jaccard's score, since only these numbers could be sent back through the network, instead of all the documents that are potential candidates for the result set.

If we express the program running time in the form of an equation, we get:

where:

If a static method is used for Jaccard's score evaluation, the documents themselves must be transferred for off-line evaluation: for a generation size of 100 documents, 50KB each, this amounts to 5MB.

If a mobile agent is used for Jaccard's score evaluation, only the scores are sent back: for a generation size of 100 documents, at 4B per floating-point number, this amounts to 400B.
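A quick check of the arithmetic behind these two transfer volumes (sizes in decimal units, as in the text):

```python
GS = 100                 # generation size (documents per generation)
DOC_SIZE = 50_000        # assumed average document size: 50 KB
FLOAT_SIZE = 4           # bytes per floating-point Jaccard's score

static_transfer = GS * DOC_SIZE      # whole documents fetched off-line
mobile_transfer = GS * FLOAT_SIZE    # only the scores sent back

print(static_transfer, "B vs", mobile_transfer, "B")   # 5000000 B vs 400 B
```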

Figure 3 shows graphs of the time needed to run the application for different generation sizes. One term of the equation is estimated at 5ms, and the other at 0.1ms*GS, where GS is the generation size.

 

Figure 3: Estimated running time of the application as a function of generation size, for static and mobile implementation. Time is given on the logarithmic scale.

As can be seen, the gain in time is enormous, and one can also expect a corresponding reduction in network traffic, because a significantly smaller amount of data is transferred in the mobile implementation. Figure 4 gives the block diagram of our program in the mobile implementation.

Figure 4: Block diagram of the mobile implementation of the project. Continuous lines show the flow of data, and dashed lines show the control flow. Rectangles represent the applications and ovals the input and output data structures. Waveforms represent mobile applications.

As can be seen from Figure 3, the time-critical parts of the application should be realized as mobile agents.

JC is the agent that browses through the network and performs Jaccard's score evaluation, sending back only the best URLs and their Jaccard's score.

SL is the agent that browses through the network neighborhood of the highest-ranked documents and performs Jaccard's score evaluation, sending back only the best URLs and their Jaccard's scores. The rest of the applications process their data as usual.

 

About complexity

So far, the Spider, Agent, Generator, and Topic applications are completely realized and consist of 24KB, 58KB, 40KB, and 64KB of code, respectively. None of the other applications exceeds 80KB of code.

This project was started in August 1997, and the static version was fully accomplished and ready for testing by August 1998.

Since the LEGO approach of our work is well suited for mobile implementation, we are eager to realize this project in a mobile environment and observe the expected gain in performance. We expect that the complexity of the applications will not grow by more than 50% as a consequence of this improvement.

 

Conclusion

Because of the fast growth of the quantity and variety of Internet sites, finding the needed information as quickly and thoroughly as possible has become a most important research issue. There are two approaches to Internet search: indexed search and the design of intelligent agents. The genetic algorithm is a search method that can be used in the design of intelligent agents. Incorporating knowledge about spatial and temporal locality, and making these agents mobile, can improve the performance of the application and reduce network traffic.

 

Acknowledgements

The authors would like to express their gratitude to the members of the EBI group at the Department of Computer Engineering, School of Electrical Engineering, University of Belgrade, for their assistance, suggestions, and support in writing this paper. In particular, we thank Vladan Dugarić, Dejan Petković, Saša Slijepčević, and Božidar Radunović. The authors are also thankful to Dr. Dejan Milojičić for his outstanding discussions during his recent lecture at our department.

A version of this paper was presented at [10]; further details about the applications and the executable code can be found at the following locations:

http://galeb.etf.bg.ac.yu/~ebi

http://galeb.etf.bg.ac.yu/~vm

 

References

  1. Goldberg, D., Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, Massachusetts, USA, 1989.
  2. Milojičić, S., Musliner, D., Schroeder-Preikschat, W., "Agents: Mobility and Communication," Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences, Maui, Hawaii, USA, January 1998.
  3. Mueller, J. P., The Design of Intelligent Agents: A Layered Approach, Springer-Verlag, Germany, 1997.
  4. Chen, H., Chung, Y., Ramsey, M., Yang, C., Ma, P., Yen, J., "Intelligent Spider for Internet Searching," Proceedings of the Thirtieth Annual Hawaii International Conference on System Sciences, Maui, Hawaii, USA, January 1997.
  5. Kraus, L., Milutinovic, V., "Technical Report on a New Genetic Algorithm for Internet Search Based on Principles of Spatial and Temporal Locality," Proceedings of SinfoN '97, Zlatibor, Serbia, Yugoslavia, November 1997.
  6. Tomca, N., A Flexible Tool for Jaccard Score Evaluation, B.Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, November 1997. Award paper at SinfoN '97, Zlatibor, Serbia, Yugoslavia, October 1997.
  7. Slijepcevic, S., A Programmable Agent for Internet Retrieval, B.Sc. Thesis, University of Belgrade, Belgrade, Serbia, Yugoslavia, October 1997. Award paper at SinfoN '97, Zlatibor, Serbia, Yugoslavia, October 1997.
  8. Milojičić, D., "MOA," Invited Lecture, University of Belgrade, Serbia, Yugoslavia, February 1998.
  9. Milutinović, V., "Electronic Business Internet," Lecture Notes, University of Belgrade, Belgrade, Serbia, Yugoslavia. http://galeb.etf.bg.ac.yu/~vm/tutorial/tutorial.html
  10. Milutinovic, V., Kraus, L., Mirkovic, J., et al., "A Software Package for Experimenting in Genetic Search on Internet: Spatial versus Temporal Locality and Static versus Mobile Agents," Proceedings of the WETICE '98 Conference, Stanford, California, USA, June 1998.