to_json Performance

Hello,

I split the Wikipedia data into train, test, and validation sets, and now I want to save them to the filesystem with the following code.

import os

output_directory = "/media/rainer/T7/Notebooks"
os.makedirs(output_directory, exist_ok=True)

train_testvalid['train'].to_json(
    os.path.join(output_directory, "de_wikipedia_train.json"),
    batch_size=1000,
    num_proc=6,
    orient="records",
    lines=True,
)

The performance is very poor. The estimate says 90 hours…
CPU utilization is at 0.9 %.
Disk utilization is below 1 %.

What I tried:

  • Use only one CPU core.
  • Use multiple CPU cores.
  • Use a bigger batch_size.
  • Use different storage devices for reading and writing.
  • Use a RAM disk.

None of it helped…

I hope you can give me a hint on how to speed up this process.

Kind regards

Rainer

Hi!

Let’s try working on the script.

You are using the to_json method of a Hugging Face datasets Dataset (the batch_size and num_proc arguments belong to that method, not to pandas), which under the hood converts each batch to a pandas DataFrame and serializes it with pandas' to_json. Depending on the size and complexity of the data, this can be a slow process, especially with orient="records" and lines=True, since every row is serialized individually.

So you can try the following suggestions:

  • Use the standard to_csv method instead of to_json for initial testing; CSV is often faster to write (see the first sketch below).
  • If JSON is a requirement, consider faster serialization libraries like orjson or ujson (see the second sketch below).
  • Ensure that the source dataset (train_testvalid['train']) is held in memory and not being read from disk during this operation (see the third sketch below).
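
A minimal sketch of the CSV route, assuming train_testvalid['train'] is a datasets.Dataset as in your snippet (the .csv filename is just illustrative). Dataset.to_csv accepts the same batch_size and num_proc arguments as to_json:

import os

output_directory = "/media/rainer/T7/Notebooks"
os.makedirs(output_directory, exist_ok=True)

# write the split as CSV; often faster than row-by-row JSON serialization
train_testvalid['train'].to_csv(
    os.path.join(output_directory, "de_wikipedia_train.csv"),
    batch_size=1000,
    num_proc=6,
)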
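
If it has to be JSON Lines, here is a hedged sketch that skips pandas and serializes each row with orjson instead. It assumes orjson is installed (pip install orjson) and that your datasets version has Dataset.iter, which recent releases do:

import orjson

output_path = os.path.join(output_directory, "de_wikipedia_train.json")

# open in binary mode because orjson.dumps returns bytes
with open(output_path, "wb") as f:
    # iterate in batches so the whole split is never materialized at once
    for batch in train_testvalid['train'].iter(batch_size=1000):
        # each batch is a dict mapping column name -> list of values
        columns = list(batch.keys())
        for values in zip(*batch.values()):
            f.write(orjson.dumps(dict(zip(columns, values))))
            f.write(b"\n")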
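
To check the third point: if cache_files is non-empty, the split is memory-mapped from Arrow files on disk rather than held in RAM. One way to force it into memory, assuming the data fits, is keep_in_memory=True at load time. The dataset name below is only illustrative; substitute your actual loading and splitting code:

from datasets import load_dataset

# a non-empty list means the split is read from disk during serialization
print(train_testvalid['train'].cache_files)

# illustrative reload that keeps the Arrow data in RAM instead of on disk
wiki = load_dataset("wikipedia", "20220301.de", keep_in_memory=True)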

You can try these and see where that gets you.
Hope that helps!