MapReduce Paradigm for Distributed Inverted Index Creation
The MapReduce programming model supports the development of parallel, distributed systems, mainly for processing large data sets. The paradigm converts the available dataset into a set of tuples (key-value pairs), which are then combined and reduced into a smaller set of tuples. A cluster of computing nodes runs the map and reduce tasks that the user has defined in the program.
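The tuple conversion and reduction described above can be illustrated with a minimal, single-machine sketch. It is not a distributed implementation: the function names `map_phase` and `reduce_phase` and the sample records are assumptions made for illustration, and word counting is used only because it is a simple case of reducing many tuples to fewer.

```python
from collections import defaultdict


def map_phase(records):
    """Convert each input record into (key, value) tuples, here (word, 1)."""
    for record in records:
        for word in record.lower().split():
            yield (word, 1)


def reduce_phase(pairs):
    """Combine tuples that share a key into a smaller set of tuples."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical sample input: three short records.
records = ["new home sales", "home sales rise", "new home"]
counts = reduce_phase(map_phase(records))
print(counts)
```

Eight input tuples are reduced to four output tuples, one per distinct word. In a real cluster the map and reduce calls would run on different nodes, but the data flow is the same.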
Steps in Developing an Inverted Index with MapReduce
As described earlier, MapReduce reduces a large dataset to smaller sets of tuples that can be managed independently. The main steps involved in building an inverted index are:
- Collecting the documents that require indexing. During this phase, a programmer may choose an ordered set of strings to serve as the working documents.
- Text tokenization. Each collected document needs to be converted into a list of tokens.
- Linguistic processing. The programmer ensures that a list of normalized indexing terms is produced from the tokens.
- Indexing. The documents are indexed by the terms that occur in them, creating an inverted index that comprises a dictionary and postings.
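The four steps above can be sketched sequentially in Python before any distribution is introduced. This is a minimal sketch under stated assumptions: the helper names `tokenize`, `normalize`, and `build_index` are hypothetical, tokenization is a simple alphanumeric split, and lowercasing stands in for fuller linguistic processing such as stemming.

```python
import re


def tokenize(text):
    """Step 2: convert a document into a list of tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens):
    """Step 3: stand-in for linguistic processing (lowercasing only)."""
    return [token.lower() for token in tokens]


def build_index(documents):
    """Step 4: map each term to the list of document IDs that contain it."""
    index = {}
    # Document IDs are assigned as successive integers starting at 1.
    for doc_id, text in enumerate(documents, start=1):
        for term in set(normalize(tokenize(text))):
            index.setdefault(term, []).append(doc_id)
    # Return the terms in alphabetical order.
    return {term: index[term] for term in sorted(index)}


# Step 1: a hypothetical working collection of two documents.
documents = ["Caesar was ambitious", "Brutus killed Caesar"]
index = build_index(documents)
for term, postings in index.items():
    print(term, postings)
```

Because documents are visited in ID order, each postings list comes out sorted without an extra sorting pass.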
Fundamental Concepts behind MapReduce that Contribute to Its Scalability
MapReduce is built on components such as the input reader, the map function, and the shuffle function. During document collection, each document is assumed to have a unique serial number, commonly referred to as the document ID. Successive integers can therefore be assigned to each new document when it is first encountered during index construction. The normalized tokens of every document form the input to indexing. The primary indexing step is to sort the resulting list of term and document ID pairs alphabetically by term. Multiple occurrences of the same term within an individual document are then merged. The resulting terms are then grouped and divided into a dictionary and postings.
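The flow described in this section can be approximated on a single machine as follows. This is a sketch, not the API of any MapReduce framework: the names `input_reader`, `map_function`, `shuffle`, and `reduce_function` are hypothetical, the sample collection is invented, and storing a document frequency per term in the dictionary is one common choice rather than something the text prescribes.

```python
from itertools import groupby
from operator import itemgetter


def input_reader(documents):
    """Assign successive integers as document IDs on first encounter."""
    for doc_id, tokens in enumerate(documents, start=1):
        yield doc_id, tokens


def map_function(doc_id, tokens):
    """Emit one (term, doc_id) pair per normalized token."""
    for term in tokens:
        yield (term, doc_id)


def shuffle(pairs):
    """Sort pairs alphabetically by term and group the IDs of each term."""
    ordered = sorted(pairs)
    for term, group in groupby(ordered, key=itemgetter(0)):
        yield term, [doc_id for _, doc_id in group]


def reduce_function(term, doc_ids):
    """Merge repeated occurrences from the same document into one posting."""
    return term, sorted(set(doc_ids))


# Hypothetical collection of already normalized tokens.
documents = [["new", "home", "sales", "home"], ["home", "sales", "rise"]]

pairs = [
    pair
    for doc_id, tokens in input_reader(documents)
    for pair in map_function(doc_id, tokens)
]
postings = dict(reduce_function(term, ids) for term, ids in shuffle(pairs))
# Dictionary: each term with the number of documents that contain it.
dictionary = {term: len(ids) for term, ids in postings.items()}
print(postings)
print(dictionary)
```

The term "home" appears twice in the first document, yet its postings list contains document 1 only once, which reflects the merging step. Each map call depends only on its own document and each reduce call only on its own term, so both can be spread across nodes.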