Index module

--------------------------------------------------------------------------------------------------------------------------------------------------------

The Indexing Module
Developer : Ndapandula Nakashole
Indexing is a vital part of a search engine. The role of the indexing module is to record which words appear in each page. This project aimed to design and implement an indexing system that runs on a cluster, uses an efficient index maintenance strategy and is scalable.

Implementation
Indexing subsystem overview

 

pic1
Figure 1: The core components of the indexing subsystem

The Dispatcher is the central component responsible for invoking the other components of the system. The Indexer and Updater components index the HTML documents that are made available by the crawler. The system employs an existing crawler, GNU Wget. GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is a non-interactive command line tool.

Parallel Indexing

A master-slave approach was used to achieve parallel indexing. The idea behind this approach is that one process is responsible for coordinating the work of others. This mechanism is particularly useful when there is little to no communication between the slave processes and when the amount of work that each slave has to perform is difficult to predic. Both of the above cases apply to the task of indexing, as a result the model was chosen to accomplish parallel indexing. 

Results
The results were not absolutely rigorous but provide indications of some degree of success and serve as a starting point for future, more comprehensive evaluation.

Index creating
The graph below shows performance of the index creating program both with static and dynamic allocation. From graph, it can be seen that for small data sizes, the time taken to index data for dynamic and static allocations is almost the same. However, as the size of the data increases, the static allocation line is significantly above the dynamic allocation lines.

pic3
Figure 3: Performance of static and dynamic allocation

 

Index updating
The figure below shows the time taken to update an index versus the amount. From the figure, it can be seen that index updating is faster with dynamic allocation than with static allocation even when the size of the inverted index being updated is big. The index size of 128 Megabytes contains 5148 inverted files obtained by indexing of 100 000 files.

pic4
Figure 4: Time to update versus data size

Conclusions
From the project implementation, it can be concluded that the indexing module met the project requirements. However, the testing of the project requires future work, to test the indexing module on more varied and larger data sets in order for the results to be generalized to wider application areas.

 

Copyright 2006 Ndapandula Nakashole & Calvin Pedzai. All rights reserved.Last update :10 November 2006