An Agent for Internet Search

Sasa Slijepcevic, Laslo Kraus, and Veljko Milutinovic

 

Department of Computer Engineering,
School of Electrical Engineering,
University of Belgrade,
POB 35-54,
11120 Belgrade, Serbia, Yugoslavia

 

E-mail: sascha@galeb.etf.bg.ac.yu,
ekraus@etf.bg.ac.yu, vm@etf.bg.ac.yu

 

 

Abstract

In the last few years, the Internet has become the most valuable source of information. This growth has been accompanied by a great increase in the amount of data and in the time needed to retrieve it. One solution to these problems lies in automatically fetching documents onto the local disk and browsing through them off-line later. The program presented in this paper fetches documents onto the local disk by following hyperlinks, up to a depth given by an input parameter. It also accepts the number of parallel processes as an option, which improves the multithreading of the application. The outcome of the program is a file structure in which each WWW server from which files have been retrieved has its own directory. Each directory contains files whose hyperlinks have been changed to point to the files on the local disk.

 

1. Introduction

During the last decade of the 20th century, the Internet has become the biggest and most significant source of information in many fields of human activity. Along with the great increase in the number of computers connected to the worldwide network [1], there has been an important change in the purpose of the Internet. It is no longer the property of computer experts, but an indispensable tool in almost all professions.

Such fast expansion of the Internet elicits problems that obstruct efficient utilization of its capacities. The main problem is the great amount of data, which sometimes prevents the retrieval of information altogether, or makes that retrieval too time-consuming.

2. Problem definition

The rapid growth in the number of users, hosts, information servers, and network traffic has produced a corresponding increase in network loads and user response times. Time and money could be saved through automatic fetching of information. One prospective approach to automatic fetching is the use of agents for Internet search. Such an agent usually accepts one URL, or a set of URLs, and then fetches the documents that the hyperlinks point to, up to a certain depth defined by the user. The fetched documents can be reviewed later off-line, i.e., without a connection to the Internet. In computer terminology, these programs are called off-line browsers.
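
To make the idea of depth-limited fetching concrete, the following Java sketch illustrates it. This is not the "Spider" source; all class and method names are illustrative, and hyperlink extraction is reduced to a simple pattern match.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DepthLimitedFetcher {

        // Naive hyperlink extraction; a real HTML parser would do better.
        private static final Pattern HREF =
            Pattern.compile("href=\"(http://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

        private final Set<String> visited = new HashSet<>();

        // Fetch the document at 'url', then recurse into its hyperlinks
        // until the given depth is exhausted.
        public void fetch(String url, int depth) {
            if (depth < 0 || !visited.add(url)) {
                return;                           // too deep, or already fetched
            }
            try {
                String page = download(url);
                System.out.println("Fetched " + url + " (" + page.length() + " chars)");
                Matcher m = HREF.matcher(page);
                while (m.find()) {
                    fetch(m.group(1), depth - 1); // follow each hyperlink
                }
            } catch (IOException e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }

        private String download(String url) throws IOException {
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            new DepthLimitedFetcher().fetch(args[0], Integer.parseInt(args[1]));
        }
    }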

3. Existing solutions

There are quite a number of off-line browsers on the software market [2, 3], which shows the demand for this kind of application. A demand of corresponding extent also exists in the academic environment. However, the existing applications, though of good quality, are less suitable for use in such an environment because of their commercial nature.

4. Proposed solution

The application presented in this paper, "Spider", is written in the Java programming language. The program is started from the command line. Its input parameters and options are the starting URL (or a set of URLs), the depth to which hyperlinks are followed, and the number of parallel processes, as illustrated below.
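
A hypothetical invocation and argument layout might look as follows; the actual option names and their order in "Spider" may differ.

    // Hypothetical invocation (actual "Spider" options may differ):
    //
    //   java Spider http://www.cmu.edu/ 3 5
    //   (starting URL, depth 3, 5 parallel processes)
    public class SpiderArgs {
        public static void main(String[] args) {
            if (args.length != 3) {
                System.err.println("usage: java Spider <start-url> <depth> <threads>");
                System.exit(1);
            }
            String startUrl = args[0];               // where fetching begins
            int depth = Integer.parseInt(args[1]);   // how far to follow hyperlinks
            int threads = Integer.parseInt(args[2]); // number of parallel processes
            System.out.printf("url=%s depth=%d threads=%d%n", startUrl, depth, threads);
        }
    }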

The result of the execution is a folder tree. The root of that structure is the working folder from which the program was invoked. For each HTTP server from which files are fetched (in the example: www.nba.com, galeb.etf.bg.ac.yu, afrodita.rcub.bg.ac.yu, www.cmu.edu), a folder is created containing the files originating on that server. The structure of such a folder reflects the structure of the folders on the server. The proposed organization uniquely defines a location for every file, and thus enables the changing of hyperlinks and a simple check of whether a file is already on the disk. An example of such a structure is shown in Figure 1.
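
The mapping from a URL to a location in this folder tree can be sketched in Java as follows. The class and method names are illustrative, not taken from the "Spider" source.

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class LocalPathMapper {

        // Map http://host/path to <workingFolder>/host/path, giving
        // directory URLs an index.html file name.
        static Path toLocalPath(Path workingFolder, String url)
                throws MalformedURLException {
            URL u = new URL(url);
            String path = u.getPath();
            if (path.isEmpty()) {
                path = "/";
            }
            if (path.endsWith("/")) {
                path += "index.html";
            }
            return workingFolder.resolve(u.getHost()).resolve(path.substring(1));
        }

        public static void main(String[] args) throws MalformedURLException {
            Path root = Paths.get(".");
            System.out.println(toLocalPath(root, "http://www.nba.com/news/index.html"));
            System.out.println(toLocalPath(root, "http://galeb.etf.bg.ac.yu/"));
        }
    }

Because every URL maps to exactly one local path, checking whether a document has already been fetched reduces to checking whether that file exists.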

The content of the files fetched from the WWW is changed so that the hyperlinks in the files point to the locally saved files instead of to the files at their original locations on the Web. No change takes place if any protocol other than HTTP is encountered. Hyperlinks containing CGI-script calls are also left unchanged. An example of an HTML file before and after the change is given in Figure 2.
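
The rewriting rule described above can be sketched as follows. This is an illustration, not the actual "Spider" code: it turns absolute HTTP hyperlinks into host-relative local paths, and leaves non-HTTP links and apparent CGI-script calls (detected here with a simple heuristic) unchanged.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkRewriter {

        private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

        // Rewrite absolute HTTP hyperlinks to local host/path form;
        // leave other protocols and apparent CGI calls untouched.
        static String rewrite(String html) {
            Matcher m = HREF.matcher(html);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                String link = m.group(1);
                boolean cgi = link.contains("cgi-bin") || link.contains("?");
                if (!link.startsWith("http://") || cgi) {
                    m.appendReplacement(out, Matcher.quoteReplacement(m.group()));
                } else {
                    String local = link.substring("http://".length());
                    m.appendReplacement(out,
                        Matcher.quoteReplacement("href=\"" + local + "\""));
                }
            }
            m.appendTail(out);
            return out.toString();
        }

        public static void main(String[] args) {
            String html = "<a href=\"http://www.cmu.edu/index.html\">CMU</a> "
                        + "<a href=\"ftp://ftp.etf.bg.ac.yu/pub\">FTP</a>";
            System.out.println(rewrite(html));
        }
    }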

5. Details of the proposed solution

Figure 3 shows the class hierarchy, including the standard classes from the Java programming environment and the classes developed for the "Spider" application. The classes developed for the application are positioned within the standard class structure delivered with the Java programming environment. The classes Object and Thread are standard classes; the other classes were developed in the course of programming "Spider".
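
As an illustration of that hierarchy, a fetching worker derived from the standard Thread class might look as follows. The class and field names are assumptions, not taken from the "Spider" source.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    class FetchWorker extends Thread {

        private final Queue<String> urls;   // work queue shared by all workers

        FetchWorker(Queue<String> urls) {
            this.urls = urls;
        }

        @Override
        public void run() {
            String url;
            while ((url = urls.poll()) != null) {
                System.out.println(getName() + " fetching " + url);
                // ... download and save the document here ...
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Queue<String> work = new ConcurrentLinkedQueue<>();
            work.add("http://www.cmu.edu/");
            work.add("http://galeb.etf.bg.ac.yu/");
            int parallel = 2;                    // the "parallel processes" option
            FetchWorker[] workers = new FetchWorker[parallel];
            for (int i = 0; i < parallel; i++) {
                workers[i] = new FetchWorker(work);
                workers[i].start();
            }
            for (FetchWorker w : workers) {
                w.join();                        // wait for all workers to finish
            }
        }
    }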

6. Conclusion

The main goals determined at the beginning of the application development have been achieved. A simple but efficient off-line browser has been developed; at the same time, it can serve as a base for the development of a more sophisticated system.

Further development of the application will focus on the graphical user interface (GUI) and on communication with HTTP servers.

The GUI can be a real advance only if accompanied by an altered program structure, because dynamic control of the application has to be built in along with the GUI.

Another part of the application where significant improvement could be achieved is communication with the HTTP servers. These and other improvements will be included in new versions very soon.

7. References

[1] Mayr, D., "The History of the Internet and the WWW," http://ourworld.compuserve.com/homepages/dmayr/history.htm, October 1997.

[2] Mendelson, E., "PC Magazine - Utility Guide (Off-Line Browsers: Editor's Choice)," http://www.zdnet.com/pcmag/features/utility/offbrwsr/uobf.htm, October 1997.

[3] Mendelson, E., "PC Magazine - Utility Guide (Off-Line Browsers: Editor's Choice)," http://www.zdnet.com/pcmag/features/utility/offbrwsr/uobec.htm, October 1997.