Dataset slow during model training

While migrating to :hugs: Datasets, I encountered an odd performance degradation: training suddenly slows down dramatically. I train on an image dataset with Keras and call to_tf_dataset just before training.

First, I optimized my dataset following [Solved] Image dataset seems slow for larger image size - #6 by ydshieh, which improved the situation compared to my initial attempt but did not fully solve it.

Second, I saved and reloaded the resulting tf.data datasets with tf.data.experimental.save and tf.data.experimental.load right before training (which I would have expected to make no difference to training speed; see the Train section below). Surprisingly, this restored the performance I had before switching to :hugs: Datasets.

Any idea what causes this behavior and how to speed up training with :hugs: Datasets without my save/load hack?
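
For context, this is roughly how I measure the input pipeline in isolation, without any model involved (a minimal sketch, not part of my training script; train_dataset is the tf.data dataset built in the Train section below):

    import time

    def time_one_pass(tf_dataset):
        # One full pass over the tf.data pipeline, so the measured time
        # reflects data loading and collation only, not the model.
        start = time.perf_counter()
        num_batches = sum(1 for _ in tf_dataset)
        elapsed = time.perf_counter() - start
        print(f"{num_batches} batches in {elapsed:.1f}s "
              f"({elapsed / max(num_batches, 1):.2f}s/batch)")

    time_one_pass(train_dataset)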

Some details below.

Environment

  • :hugs: Datasets 2.2.2
  • TensorFlow/Keras 2.8.0
  • Training on GPU (NVIDIA Tesla V100-SXM2-32GB)

Epoch Breakdown (without my hack):

  • Epoch 1/10
    41s 2s/step - loss: 1.6302 - accuracy: 0.5048 - val_loss: 1.4713 - val_accuracy: 0.3273 - lr: 0.0010
  • Epoch 2/10
    32s 2s/step - loss: 0.5357 - accuracy: 0.8510 - val_loss: 1.0447 - val_accuracy: 0.5818 - lr: 0.0010
  • Epoch 3/10
    36s 3s/step - loss: 0.3547 - accuracy: 0.9231 - val_loss: 0.6245 - val_accuracy: 0.7091 - lr: 0.0010
  • Epoch 4/10
    36s 3s/step - loss: 0.2721 - accuracy: 0.9231 - val_loss: 0.3395 - val_accuracy: 0.9091 - lr: 0.0010
  • Epoch 5/10
    32s 2s/step - loss: 0.1676 - accuracy: 0.9856 - val_loss: 0.2187 - val_accuracy: 0.9636 - lr: 0.0010
  • Epoch 6/10
    42s 3s/step - loss: 0.2066 - accuracy: 0.9615 - val_loss: 0.1635 - val_accuracy: 0.9636 - lr: 0.0010
  • Epoch 7/10
    32s 2s/step - loss: 0.1814 - accuracy: 0.9423 - val_loss: 0.1418 - val_accuracy: 0.9636 - lr: 0.0010
  • Epoch 8/10
    32s 2s/step - loss: 0.1301 - accuracy: 0.9856 - val_loss: 0.1388 - val_accuracy: 0.9818 - lr: 0.0010
  • Epoch 9/10
    loss: 0.1102 - accuracy: 0.9856 - val_loss: 0.1185 - val_accuracy: 0.9818 - lr: 0.0010
  • Epoch 10/10
    32s 2s/step - loss: 0.1013 - accuracy: 0.9808 - val_loss: 0.0978 - val_accuracy: 0.9818 - lr: 0.0010

Epoch Breakdown (with my save & load hack before training):

  • Epoch 1/10
    13s 625ms/step - loss: 3.0478 - accuracy: 0.1146 - val_loss: 2.3061 - val_accuracy: 0.0727 - lr: 0.0010
  • Epoch 2/10
    0s 80ms/step - loss: 2.3105 - accuracy: 0.2656 - val_loss: 2.3085 - val_accuracy: 0.0909 - lr: 0.0010
  • Epoch 3/10
    0s 77ms/step - loss: 1.8608 - accuracy: 0.3542 - val_loss: 2.3130 - val_accuracy: 0.0909 - lr: 0.0010
  • Epoch 4/10
    1s 98ms/step - loss: 1.8677 - accuracy: 0.3750 - val_loss: 2.3157 - val_accuracy: 0.0909 - lr: 0.0010
  • Epoch 5/10
    1s 204ms/step - loss: 1.5561 - accuracy: 0.4583 - val_loss: 2.3049 - val_accuracy: 0.0909 - lr: 0.0010
  • Epoch 6/10
    1s 210ms/step - loss: 1.4657 - accuracy: 0.4896 - val_loss: 2.2944 - val_accuracy: 0.0909 - lr: 0.0010
  • Epoch 7/10
    1s 205ms/step - loss: 1.4018 - accuracy: 0.5312 - val_loss: 2.2917 - val_accuracy: 0.0909 - lr: 0.0010
  • Epoch 8/10
    1s 207ms/step - loss: 1.2370 - accuracy: 0.5729 - val_loss: 2.2814 - val_accuracy: 0.0909 - lr: 0.0010
  • Epoch 9/10
    1s 214ms/step - loss: 1.1190 - accuracy: 0.6250 - val_loss: 2.2733 - val_accuracy: 0.0909 - lr: 0.0010
  • Epoch 10/10
    1s 207ms/step - loss: 1.1484 - accuracy: 0.6302 - val_loss: 2.2624 - val_accuracy: 0.0909 - lr: 0.0010

Data Loading

    from datasets import load_dataset

    dataset = load_dataset("Lehrig/Monkey-Species-Collection", "downsized")

    def read_image_file(example):
        # Store the raw encoded image bytes instead of the decoded PIL image
        # (the optimization from the forum thread linked above)
        with open(example["image"].filename, "rb") as f:
            example["image"] = {"bytes": f.read()}
        return example

    dataset = dataset.map(read_image_file)

    dataset.save_to_disk(dataset_dir)
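
To sanity-check the round trip, I reload what was just saved and print the splits and column types (purely illustrative, not part of the pipeline):

    from datasets import load_from_disk

    reloaded = load_from_disk(dataset_dir)
    print(reloaded)                     # splits and number of rows
    print(reloaded["train"].features)   # column types after the byte-level rewrite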

Preprocessing

    import numpy as np
    from datasets import (
        Array3D, DatasetDict, Features, Sequence, Value, load_from_disk
    )
    from transformers.image_utils import ImageFeatureExtractionMixin

    dataset = load_from_disk(dataset_dir)

    # Preprocess
    num_classes = dataset["train"].features["label"].num_classes
    one_hot_matrix = np.eye(num_classes)
    feature_extractor = ImageFeatureExtractionMixin()

    def to_pixels(image):
        image = feature_extractor.resize(image, size=size)
        image = feature_extractor.to_numpy_array(image, channel_first=False)
        image = image / 255.0
        return image

    def process(examples):
        examples["pixel_values"] = [
            to_pixels(image) for image in examples["image"]
        ]
        examples["label"] = [
            one_hot_matrix[label] for label in examples["label"]
        ]
        return examples

    features = Features({
        "pixel_values": Array3D(dtype="float32", shape=(size, size, 3)),
        "label": Sequence(feature=Value(dtype="int32"), length=num_classes)
    })

    prep_dataset = dataset.map(
        process,
        remove_columns=["image"],
        batched=True,
        batch_size=batch_size,
        num_proc=2,
        features=features,
    )

    prep_dataset = prep_dataset.with_format("numpy")

    # Split
    train_dev_dataset = prep_dataset['test'].train_test_split(
        test_size=test_size,
        shuffle=True,
        seed=seed
    )

    train_dev_test_dataset = DatasetDict({
        'train': train_dev_dataset['train'],
        'dev': train_dev_dataset['test'],
        'test': prep_dataset['test'],
    })

    train_dev_test_dataset.save_to_disk(prep_dataset_dir)
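
Before training, I quickly verify the preprocessed splits (a sanity check only; size and num_classes are the values used above):

    import numpy as np
    from datasets import load_from_disk

    check = load_from_disk(prep_dataset_dir)
    print(check)  # should list the train/dev/test splits

    sample = check["train"][0]
    print(np.asarray(sample["pixel_values"]).shape)  # expected: (size, size, 3)
    print(np.asarray(sample["label"]).shape)         # expected: (num_classes,)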

Train

    import datetime

    import tensorflow as tf
    from datasets import load_from_disk
    from transformers import DefaultDataCollator

    dataset = load_from_disk(prep_dataset_dir)

    data_collator = DefaultDataCollator(return_tensors="tf")

    train_dataset = dataset["train"].to_tf_dataset(
        columns=['pixel_values'],
        label_cols=['label'],
        shuffle=True,
        batch_size=batch_size,
        collate_fn=data_collator
    )

    validation_dataset = dataset["dev"].to_tf_dataset(
        columns=['pixel_values'],
        label_cols=['label'],
        shuffle=False,
        batch_size=batch_size,
        collate_fn=data_collator
    )
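
    # Without the save/load workaround below, these two tf.data datasets
    # would be passed to model.fit directly (the slow case shown above).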

    print(f'{datetime.datetime.now()} - Saving Data')
    tf.data.experimental.save(train_dataset, model_dir+"/train")
    tf.data.experimental.save(validation_dataset, model_dir+"/val")

    print(f'{datetime.datetime.now()} - Loading Data')
    train_dataset = tf.data.experimental.load(model_dir+"/train")
    validation_dataset = tf.data.experimental.load(model_dir+"/val")

    ...

    model.fit(
        train_dataset,
        epochs=epochs,
        validation_data=validation_dataset,
        callbacks=[earlyStopping, mcp_save, reduce_lr_loss]
    )

I also opened a corresponding GitHub issue: Dataset slow during model training · Issue #4478 · huggingface/datasets · GitHub