MCQs
The output of the reduce task is typically written to the FileSystem. The output of the Reducer is not sorted.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
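The merge step can be pictured with a small Python sketch. This is only an illustration of merging already-sorted runs, not Hadoop's actual implementation; the sample map outputs and the name merge_map_outputs are hypothetical.

```python
import heapq

# Hypothetical map outputs: each run is already sorted by key,
# as it would be when a map task finishes.
map_output_a = [("apple", 1), ("cherry", 1)]
map_output_b = [("apple", 1), ("banana", 1)]
map_output_c = [("banana", 1), ("cherry", 1), ("date", 1)]


def merge_map_outputs(*sorted_runs):
    """Lazily merge already-sorted runs into one key-ordered stream."""
    # heapq.merge never re-sorts the full data set; it only compares
    # the current head of each run, so runs can be merged as they arrive.
    return heapq.merge(*sorted_runs, key=lambda kv: kv[0])


merged = list(merge_map_outputs(map_output_a, map_output_b, map_output_c))
merged_keys = [key for key, _ in merged]
```

Because each run is sorted before it is fetched, the reduce side only has to merge, which is why fetching and sorting can overlap.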
HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables (billions of rows by millions of columns) on clusters built with commodity hardware.
HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out), but even they require node configuration with elements of scale-up.
The -archives option is also a generic option.
The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.
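A rough Python sketch of this secondary-sort idea follows. It assumes records are (primary, secondary) tuples and uses a CRC32 hash as a stand-in partitioner; the function names and sample data are hypothetical and do not correspond to any Hadoop API.

```python
import zlib
from collections import defaultdict


def partition_for(primary, num_reducers):
    """Choose a partition from the primary key only (stand-in hash)."""
    return zlib.crc32(str(primary).encode("utf-8")) % num_reducers


def secondary_sort(records, num_reducers):
    """Partition on the primary key, then sort on (primary, secondary)."""
    partitions = defaultdict(list)
    for primary, secondary in records:
        index = partition_for(primary, num_reducers)
        partitions[index].append((primary, secondary))
    # Tuples compare field by field, so sorted() orders by the primary
    # key first and by the secondary key within each primary key.
    return {index: sorted(recs) for index, recs in partitions.items()}


# Hypothetical (year, temperature) records.
records = [(1901, 317), (1900, 35), (1901, 294), (1900, 34), (1901, 318)]
result = secondary_sort(records, num_reducers=2)
```

Since the partition depends only on the primary key, every record for a given year lands in the same partition, and the composite sort puts its values in order there.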
As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat.
One input might be tab-separated plain text, the other a binary sequence file. Even if they are in the same format, they may have different representations and therefore need to be parsed differently.
A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function.
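The iterator idea can be sketched in Python as a generator. This is a simplified model for line-oriented text input, where the key is the byte offset of the line and the value is the line itself; line_record_reader and map_function are hypothetical names, not Hadoop classes.

```python
import io


def line_record_reader(stream):
    """Yield (byte_offset, line_text) pairs from a binary stream."""
    offset = 0
    for raw in stream:
        # The key is the offset at which the line starts; the value is
        # the line with its trailing newline removed.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def map_function(key, value):
    """Hypothetical map function: emit the first tab-separated field."""
    return (value.split("\t")[0], 1)


# A small in-memory stand-in for one input split.
split = io.BytesIO(b"alpha\t1\nbeta\t2\n")
records = list(line_record_reader(split))
mapped = [map_function(key, value) for key, value in records]
```

The map side only ever sees key-value pairs; how the bytes are cut into records stays inside the reader.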
With multiple reducers, records are spread across the reduce tasks (evenly, provided the keys are well distributed), and all records that share the same key are processed by the same reduce task.
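A minimal Python sketch of hash partitioning shows why the same key always reaches the same reduce task. The CRC32 hash here is an assumption chosen because it is stable across runs; it is not the hash Hadoop uses, and reducer_for is a hypothetical name.

```python
import zlib
from collections import Counter


def reducer_for(key, num_reducers):
    """Map a key to a reduce task index with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Hypothetical keys; each distinct key is assigned to exactly one reducer.
keys = ["k%d" % i for i in range(1000)]
num_reducers = 4
load = Counter(reducer_for(key, num_reducers) for key in keys)
```

How even the resulting load is depends on the key distribution: many distinct keys tend to spread out, while a few very frequent keys concentrate work on the reducers that own them.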