Various benchmarks are used for testing MapReduce libraries. Common ones include Matrix Multiplication (MM), which multiplies two large square matrices; Sparse Integer Occurrence (SIO), which counts the number of times each integer appears in a large dataset; Word Occurrence (WO), which counts the number of times each word occurs in a text corpus; Linear Regression (LR), which computes a linear model of a set of data; and KMeans Clustering (KMC), which partitions a set of data points into clusters (Stuart & Owens, Multi-GPU MapReduce on GPU Clusters). Each of these benchmarks exercises a different aspect of the library.
We decided to go with Word Occurrence because of the following characteristics:
• Non-Uniform Records: MapReduce (MR) processes data record by record. A record could be a line, a paragraph, or a row. A text document can contain records of fixed as well as variable length. Also, some keys might appear in one part of the document and not in another. Working with such an example makes the system capable of dealing with all kinds of records.
• Many Key/Value Pairs: Text documents can contain an enormous number of distinct keys, and because each key repeats a different number of times, the set of values collected per key varies in size.
• Scalable: Because we are dealing with a cluster of nodes, scalability is one of the most important aspects to keep an eye on. The output set for WO is comparatively small, which leads to a different configuration of the pipeline and drastically different scaling behavior.
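
To make these characteristics concrete, the following is a minimal single-process sketch of Word Occurrence expressed as map, shuffle, and reduce steps. It is an illustration only, not the library's implementation: the function names (map_record, shuffle, reduce_group, word_occurrence) are hypothetical, records are assumed to be lines of text, and words are assumed to be whitespace-separated and case-insensitive.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit a (word, 1) pair for every word in one record.

    Records may be empty or of any length (non-uniform records).
    """
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group values by key.

    The list of values per key varies in size with how often the key repeats.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_occurrence(records):
    """Run map, shuffle, and reduce sequentially over an iterable of records."""
    pairs = [pair for record in records for pair in map_record(record)]
    groups = shuffle(pairs)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Hypothetical input: records of different lengths, including an empty one.
    records = ["the quick brown fox", "the lazy dog", "", "The end"]
    counts = word_occurrence(records)
    for word in sorted(counts):
        print(word, counts[word])
```

In this sketch the map step emits one pair per word occurrence, while the reduce step emits one pair per distinct word, so the output is never larger than the intermediate data and is usually far smaller on natural-language text.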