MapReduce Paradigm for Distributed Inverted Index Creation
Steps in Developing the MapReduce Paradigm for an Inverted Index
- Collect the documents that require indexing. During this phase, a programmer may choose a set of strings to serve as the working documents.
- Text tokenization. Each collected document needs to be converted into a list of tokens.
- Carry out linguistic processing: the programmer ensures that a list of indexing terms is produced from the tokens.
- Index the documents by the terms that occur in them, creating an inverted index comprising a dictionary and postings.
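The steps above can be sketched in a few lines of Python. This is a minimal single-process simulation, not a Hadoop job: the sample documents, the function names, and the choice of lowercasing as the only linguistic processing step are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical working documents: document ID -> text.
DOCS = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "Quick dogs and a brown dog",
}

def tokenize(text):
    """Split text into raw tokens (runs of letters and digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Stand-in for linguistic processing: lowercase only (assumption)."""
    return token.lower()

def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per token in the document."""
    for token in tokenize(text):
        yield normalize(token), doc_id

def shuffle(pairs):
    """Group the emitted document IDs by term, as the framework would."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_phase(term, doc_ids):
    """Reduce: build a sorted, de-duplicated postings list for one term."""
    return term, sorted(set(doc_ids))

def build_index(docs):
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    grouped = shuffle(pairs)
    # The dictionary is the set of keys; the postings are the values.
    return dict(reduce_phase(term, ids) for term, ids in sorted(grouped.items()))

index = build_index(DOCS)
print(index["brown"])  # [1, 3]
print(index["dog"])    # [2, 3]
```

Because no stemming is applied in this sketch, "dogs" and "dog" end up as separate dictionary entries; a fuller linguistic processing step would typically merge them.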
Fundamental Concepts behind MapReduce That Contribute To Its Scalability
- File bottleneck principle: by default, Hadoop uses a single name node to manage metadata for all the other nodes, which can become a bottleneck. Adopting a distributed metadata structure removes this dependence on a single node and saves on cost.
- Node expansion principle: users can adjust the number of nodes to match the processing power needed. Smaller Hadoop users, who have lower processing and storage requirements, can run fewer nodes and keep costs affordable.
- Node capacity principle: to maximize storage and processing capacity, the inverted index MapReduce concept allows the number of nodes to be reduced when physical storage becomes a limiting factor.
How the MapReduce Paradigm Can Be Used to Count the Occurrences of Each Word in a Large Collection of Documents
To count the occurrences of each word in a large document collection, a map function is run over the dataset, generating key-value pairs in which each word is a key and its value is a count of 1. The map tasks are distributed among different nodes and executed simultaneously. The map output is then grouped by key, with a hash function assigning each key to a reduce task, so that a single reduce task handles all occurrences of a given word in the document set and sums them into a total.
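The word count flow can be illustrated with a short Python sketch. It runs in a single process and only simulates the distribution across nodes; the number of reduce tasks, the function names, and the use of CRC32 as the hash function are assumptions chosen for the example.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical number of reduce tasks

def map_words(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def partition(word, num_reducers=NUM_REDUCERS):
    """Choose a reduce task for a word with a deterministic hash."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers

def word_count(documents):
    # One bucket per reduce task, each mapping word -> list of emitted 1s.
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for document in documents:  # each document stands in for one map task
        for word, one in map_words(document):
            buckets[partition(word)][word].append(one)
    # Reduce: each bucket sums the values for the words it owns.
    totals = {}
    for bucket in buckets:
        for word, ones in bucket.items():
            totals[word] = sum(ones)
    return totals

counts = word_count(["to be or not to be", "to do is to be"])
print(counts["to"])  # 4
print(counts["be"])  # 3
```

Because the partition depends only on the word itself, every pair for a given word lands in the same bucket, so no word is ever summed by two different reduce tasks.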