While migrating towards Datasets, I encountered an odd performance degradation: training suddenly slows down dramatically. I train with an image dataset using Keras and execute a to_tf_dataset
just before training.
First, I have optimized my dataset following [Solved] Image dataset seems slow for larger image size - #6 by ydshieh, which actually improved the situation from what I had before but did not completely solve it.
Second, I saved and loaded my dataset again using tf.data.experimental.save
and tf.data.experimental.load
before training (for which I would have expected no performance change during training). However, I ended up with the performance I had before tinkering with Datasets.
Any idea what’s the reason for this behavior and how to speed-up training with Datasets without my save/load hack?
Some details below.
Environment
- Datasets 2.2.2
- TensorFlow/Keras 2.8.0
- Training on GPU (NVIDIA Tesla V100-SXM2-32GB)
Epoch Breakdown (without my hack):
- Epoch 1/10
41s 2s/step - loss: 1.6302 - accuracy: 0.5048 - val_loss: 1.4713 - val_accuracy: 0.3273 - lr: 0.0010 - Epoch 2/10
32s 2s/step - loss: 0.5357 - accuracy: 0.8510 - val_loss: 1.0447 - val_accuracy: 0.5818 - lr: 0.0010 - Epoch 3/10
36s 3s/step - loss: 0.3547 - accuracy: 0.9231 - val_loss: 0.6245 - val_accuracy: 0.7091 - lr: 0.0010 - Epoch 4/10
36s 3s/step - loss: 0.2721 - accuracy: 0.9231 - val_loss: 0.3395 - val_accuracy: 0.9091 - lr: 0.0010 - Epoch 5/10
32s 2s/step - loss: 0.1676 - accuracy: 0.9856 - val_loss: 0.2187 - val_accuracy: 0.9636 - lr: 0.0010 - Epoch 6/10
42s 3s/step - loss: 0.2066 - accuracy: 0.9615 - val_loss: 0.1635 - val_accuracy: 0.9636 - lr: 0.0010 - Epoch 7/10
32s 2s/step - loss: 0.1814 - accuracy: 0.9423 - val_loss: 0.1418 - val_accuracy: 0.9636 - lr: 0.0010 - Epoch 8/10
32s 2s/step - loss: 0.1301 - accuracy: 0.9856 - val_loss: 0.1388 - val_accuracy: 0.9818 - lr: 0.0010 - Epoch 9/10
loss: 0.1102 - accuracy: 0.9856 - val_loss: 0.1185 - val_accuracy: 0.9818 - lr: 0.0010 - Epoch 10/10
32s 2s/step - loss: 0.1013 - accuracy: 0.9808 - val_loss: 0.0978 - val_accuracy: 0.9818 - lr: 0.0010
Epoch Breakdown (using my save & load before training hack):
- Epoch 1/10
13s 625ms/step - loss: 3.0478 - accuracy: 0.1146 - val_loss: 2.3061 - val_accuracy: 0.0727 - lr: 0.0010 - Epoch 2/10
0s 80ms/step - loss: 2.3105 - accuracy: 0.2656 - val_loss: 2.3085 - val_accuracy: 0.0909 - lr: 0.0010 - Epoch 3/10
0s 77ms/step - loss: 1.8608 - accuracy: 0.3542 - val_loss: 2.3130 - val_accuracy: 0.0909 - lr: 0.0010 - Epoch 4/10
1s 98ms/step - loss: 1.8677 - accuracy: 0.3750 - val_loss: 2.3157 - val_accuracy: 0.0909 - lr: 0.0010 - Epoch 5/10
1s 204ms/step - loss: 1.5561 - accuracy: 0.4583 - val_loss: 2.3049 - val_accuracy: 0.0909 - lr: 0.0010 - Epoch 6/10
1s 210ms/step - loss: 1.4657 - accuracy: 0.4896 - val_loss: 2.2944 - val_accuracy: 0.0909 - lr: 0.0010 - Epoch 7/10
1s 205ms/step - loss: 1.4018 - accuracy: 0.5312 - val_loss: 2.2917 - val_accuracy: 0.0909 - lr: 0.0010 - Epoch 8/10
1s 207ms/step - loss: 1.2370 - accuracy: 0.5729 - val_loss: 2.2814 - val_accuracy: 0.0909 - lr: 0.0010 - Epoch 9/10
1s 214ms/step - loss: 1.1190 - accuracy: 0.6250 - val_loss: 2.2733 - val_accuracy: 0.0909 - lr: 0.0010 - Epoch 10/10
1s 207ms/step - loss: 1.1484 - accuracy: 0.6302 - val_loss: 2.2624 - val_accuracy: 0.0909 - lr: 0.0010
Data Loading
dataset = load_dataset("Lehrig/Monkey-Species-Collection", "downsized")
def read_image_file(example):
with open(example["image"].filename, "rb") as f:
example["image"] = {"bytes": f.read()}
return example
dataset = dataset.map(read_image_file)
dataset.save_to_disk(dataset_dir)
Preprocessing
dataset = load_from_disk(dataset_dir)
# Preprocess
num_classes = dataset["train"].features["label"].num_classes
one_hot_matrix = np.eye(num_classes)
feature_extractor = ImageFeatureExtractionMixin()
def to_pixels(image):
image = feature_extractor.resize(image, size=size)
image = feature_extractor.to_numpy_array(image, channel_first=False)
image = image / 255.0
return image
def process(examples):
examples["pixel_values"] = [
to_pixels(image) for image in examples["image"]
]
examples["label"] = [
one_hot_matrix[label] for label in examples["label"]
]
return examples
features = Features({
"pixel_values": Array3D(dtype="float32", shape=(size, size, 3)),
"label": Sequence(feature=Value(dtype="int32"), length=num_classes)
})
prep_dataset = dataset.map(
process,
remove_columns=["image"],
batched=True,
batch_size=batch_size,
num_proc=2,
features=features,
)
prep_dataset = prep_dataset.with_format("numpy")
# Split
train_dev_dataset = prep_dataset['test'].train_test_split(
test_size=test_size,
shuffle=True,
seed=seed
)
train_dev_test_dataset = DatasetDict({
'train': train_dev_dataset['train'],
'dev': train_dev_dataset['test'],
'test': prep_dataset['test'],
})
train_dev_test_dataset.save_to_disk(prep_dataset_dir)
Train
dataset = load_from_disk(prep_data_dir)
data_collator = DefaultDataCollator(return_tensors="tf")
train_dataset = dataset["train"].to_tf_dataset(
columns=['pixel_values'],
label_cols=['label'],
shuffle=True,
batch_size=batch_size,
collate_fn=data_collator
)
validation_dataset = dataset["dev"].to_tf_dataset(
columns=['pixel_values'],
label_cols=['label'],
shuffle=False,
batch_size=batch_size,
collate_fn=data_collator
)
print(f'{datetime.datetime.now()} - Saving Data')
tf.data.experimental.save(train_dataset, model_dir+"/train")
tf.data.experimental.save(validation_dataset, model_dir+"/val")
print(f'{datetime.datetime.now()} - Loading Data')
train_dataset = tf.data.experimental.load(model_dir+"/train")
validation_dataset = tf.data.experimental.load(model_dir+"/val")
...
model.fit(
train_dataset,
epochs=epochs,
validation_data=validation_dataset,
callbacks=[earlyStopping, mcp_save, reduce_lr_loss]
)