Handling Modifications to the Input Files

When the dataset input format creates splits, it compares the modification time of each input file with the modification time recorded in the IMDG. If the times do not match, one of two actions is taken, depending on the EnableAppends property, which can be set with DatasetInputFormat.setEnableAppends(…).

If this property is set to its default value of false, the cached data set in the IMDG is deleted and a new set of splits is calculated from the new file. These splits are recorded in the IMDG during the subsequent job run. If an input file may be deleted and replaced by a different file, leave the property set to false so that the dataset input format does not serve recorded splits from the old file.

If the property is set to true, it is assumed that the file was appended to. The dataset input format reuses the splits that were already recorded and, on the next run, reads the appended portion of the file from HDFS and records additional splits for it.
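
The decision described in this section can be modeled in a few lines of plain Java. This is only a sketch of the documented behavior, not the hServer implementation: the class SplitRefreshSketch, the Split type, the fixed split size, and the refresh and computeSplits methods are all hypothetical names introduced for illustration, and the sketch assumes that splits are simple byte ranges and that previously recorded splits are kept unchanged when a file is appended to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the behavior described above; not the hServer code.
final class SplitRefreshSketch {

    /** A byte range of the input file (illustrative stand-in for an input split). */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "[start=" + start + ", length=" + length + "]";
        }
    }

    /** Cuts the byte range [from, to) into splits of at most splitSize bytes. */
    static List<Split> computeSplits(long from, long to, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long pos = from; pos < to; pos += splitSize) {
            splits.add(new Split(pos, Math.min(splitSize, to - pos)));
        }
        return splits;
    }

    /**
     * Returns the splits to use for the next run.
     * recordedMtime is the modification time stored with the recorded splits;
     * currentMtime is the modification time of the file as it exists now.
     */
    static List<Split> refresh(List<Split> recorded, long recordedMtime, long currentMtime,
                               long fileLength, long splitSize, boolean enableAppends) {
        if (recordedMtime == currentMtime) {
            // File unchanged: the recorded splits are still valid.
            return new ArrayList<>(recorded);
        }
        if (!enableAppends) {
            // Default: discard the recorded splits and recompute from the new file.
            return computeSplits(0, fileLength, splitSize);
        }
        // Appends enabled: assume the old content is intact and only new bytes were added.
        long recordedEnd = 0;
        for (Split s : recorded) {
            recordedEnd = Math.max(recordedEnd, s.start + s.length);
        }
        List<Split> result = new ArrayList<>(recorded);
        result.addAll(computeSplits(recordedEnd, fileLength, splitSize));
        return result;
    }

    public static void main(String[] args) {
        // A 250-byte file was recorded earlier; it has since grown to 400 bytes.
        List<Split> recorded = computeSplits(0, 250, 100);
        System.out.println("appends disabled: " + refresh(recorded, 1L, 2L, 400, 100, false));
        System.out.println("appends enabled:  " + refresh(recorded, 1L, 2L, 400, 100, true));
    }
}
```

The sketch also shows why the property should stay false when a file can be replaced: with appends enabled, the model trusts every recorded split and only looks at bytes past the recorded end, so a replaced file would be served with splits computed for the old content.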