Performance Optimizations in the Dataset Record Reader

The Dataset Record Reader is designed to store key/value pairs in the IMDG with minimal network overhead and maximum storage efficiency. Using the splits defined for the HDFS file, it creates "chunks" of key/value pairs in the IMDG with overlapped updates while each HDFS record reader reads from the HDFS file and supplies key/value pairs to its mapper. These chunks are stored as highly available objects within the IMDG.
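The write path above can be sketched as a record reader that forwards each pair to its mapper while batching pairs into chunks destined for the grid. This is a minimal illustration only: the `RecordReader` interface, the `grid` map, the chunk-naming scheme, and `CHUNK_SIZE` are simplified stand-ins, not the actual Hadoop or ScaleOut hServer APIs, and a plain map takes the place of the overlapped, highly available IMDG updates.

```java
import java.util.*;

public class ChunkingReaderSketch {
    // Stand-in for an HDFS record reader: returns {key, value} or null at end of split.
    interface RecordReader { String[] next(); }

    static final int CHUNK_SIZE = 3; // pairs per chunk; real chunks would be far larger

    // "IMDG" stand-in: chunk id -> list of key/value pairs
    static Map<String, List<String[]>> grid = new HashMap<>();

    static List<String[]> readAndCache(String splitId, RecordReader hdfsReader) {
        List<String[]> delivered = new ArrayList<>();
        List<String[]> chunk = new ArrayList<>();
        int chunkIndex = 0;
        String[] pair;
        while ((pair = hdfsReader.next()) != null) {
            delivered.add(pair);  // the pair flows on to the mapper as usual
            chunk.add(pair);      // and is buffered for storage in the grid
            if (chunk.size() == CHUNK_SIZE) {
                // In the real system this store is overlapped with further HDFS reads.
                grid.put(splitId + "#" + chunkIndex++, chunk);
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) grid.put(splitId + "#" + chunkIndex, chunk);
        return delivered;
    }

    public static void main(String[] args) {
        Iterator<String[]> it = Arrays.asList(
            new String[]{"k1", "v1"}, new String[]{"k2", "v2"},
            new String[]{"k3", "v3"}, new String[]{"k4", "v4"}).iterator();
        RecordReader reader = () -> it.hasNext() ? it.next() : null;
        List<String[]> out = readAndCache("split-0", reader);
        System.out.println(out.size() + " pairs delivered, " + grid.size() + " chunks stored");
    }
}
```

Keying chunks by split and chunk index lets a later run locate them with the same set of splits, which is what the retrieval step relies on.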

images/fig7_hserver_detail_1.png

On subsequent Hadoop MapReduce runs in which the key/value pairs are already available in the IMDG, the Dataset Record Reader bypasses the underlying HDFS record reader and supplies key/value pairs directly from the IMDG. ScaleOut hServer uses the same set of splits to retrieve the key/value chunks from the IMDG in an overlapped manner that minimizes latency. To minimize network overhead, chunks are served from the ScaleOut hServer service process running on the same Hadoop worker node as the requesting mapper.
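The read path can be sketched the same way: if chunks for a split are present in the grid, replay them and skip the HDFS reader entirely; otherwise fall back to a cold HDFS read. Again, the `grid` map, split/chunk naming, and the `Supplier`-based fallback are illustrative assumptions, not the actual APIs, and locality (serving from the co-located service process) is not modeled here.

```java
import java.util.*;
import java.util.function.Supplier;

public class CachedReadSketch {
    // "IMDG" stand-in: chunk id -> list of key/value pairs
    static Map<String, List<String[]>> grid = new HashMap<>();

    static List<String[]> read(String splitId, Supplier<List<String[]>> hdfsRead) {
        List<String[]> cached = new ArrayList<>();
        // Walk the chunks for this split in order; the same split definitions
        // used on the first run locate the chunks on later runs.
        for (int i = 0; grid.containsKey(splitId + "#" + i); i++)
            cached.addAll(grid.get(splitId + "#" + i));
        // Serve from the grid when possible; only a cold run touches HDFS.
        return cached.isEmpty() ? hdfsRead.get() : cached;
    }

    public static void main(String[] args) {
        grid.put("split-0#0", Arrays.asList(
            new String[]{"k1", "v1"}, new String[]{"k2", "v2"}));
        // The supplier throws to demonstrate that HDFS is never consulted
        // when the chunks are already in the grid.
        List<String[]> pairs = read("split-0",
            () -> { throw new IllegalStateException("HDFS read should be bypassed"); });
        System.out.println(pairs.size() + " pairs served from the grid");
    }
}
```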

images/fig8_hserver_detail_2.png