Hello,
I split the Wikipedia data into train, test, and validation sets, and now I want to save them to the filesystem with the following code.
import os
output_directory = "/media/rainer/T7/Notebooks"
os.makedirs(output_directory, exist_ok=True)
train_testvalid['train'].to_json(os.path.join(output_directory, "de_wikipedia_train.json"), batch_size=1000, num_proc=6, orient="records", lines=True)
The performance is very poor. The estimate says 90 hours…
My CPU utilization is 0.9 %.
The disk is working at < 1 %.
What I tried:
- Use only one CPU core.
- Use multiple CPU cores.
- Use a bigger batch_size.
- Use different storage devices for reading and writing.
- Use a RAM disk.
None of it helped…
I hope you can give me a hint on how to speed up this process.
Kind regards
Rainer
Hi!
Let’s try working on the script.
You are calling to_json on a 🤗 Datasets split, which under the hood converts each batch to a pandas DataFrame and serializes it with pandas' to_json. Depending on the size and complexity of the data, this can be a slow process, especially with the orient="records" and lines=True parameters, as these can be computationally intensive.
So you can try the following suggestions:
For initial testing, use the standard to_csv method instead of to_json. CSV is often faster to write.
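As a quick sanity check, you can compare the two writers on a small synthetic DataFrame (the columns and paths below are made up for illustration; they are not your actual Wikipedia split):

```python
import os
import tempfile

import pandas as pd

# Synthetic stand-in data, not the real Wikipedia dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Gamma"],
    "text": ["first article", "second article", "third article"],
})

out_dir = tempfile.mkdtemp()

# CSV: usually the fastest built-in writer for flat, tabular data.
csv_path = os.path.join(out_dir, "sample.csv")
df.to_csv(csv_path, index=False)

# JSON lines: the same layout as your original call, for comparison.
jsonl_path = os.path.join(out_dir, "sample.json")
df.to_json(jsonl_path, orient="records", lines=True)
```

Timing both calls on a representative slice of your data should tell you whether serialization is really the bottleneck.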
If JSON is a requirement, consider faster serialization libraries such as orjson or ujson.
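A minimal sketch of writing JSON lines yourself with a faster serializer. orjson is optional here: if it is not installed, the code falls back to the standard-library json module, and the records and output path are made-up examples:

```python
import json
import os
import tempfile

try:
    import orjson  # optional fast JSON serializer

    def dumps(record):
        return orjson.dumps(record)
except ImportError:
    # Fallback: standard-library json, encoded to bytes for a uniform interface.
    def dumps(record):
        return json.dumps(record, ensure_ascii=False).encode("utf-8")

# Hypothetical records standing in for dataset rows.
records = [
    {"id": 1, "title": "Alpha", "text": "first article"},
    {"id": 2, "title": "Beta", "text": "second article"},
]

# One JSON object per line, matching the lines=True layout of your call.
out_path = os.path.join(tempfile.mkdtemp(), "sample_fast.json")
with open(out_path, "wb") as f:
    for record in records:
        f.write(dumps(record))
        f.write(b"\n")
```

Iterating over the dataset in batches and writing lines this way also avoids the per-batch pandas conversion.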
Ensure that the source dataset (train_testvalid['train']) is in memory and not being read from disk during this operation.
You can try these and see where that gets you.
Hope that helps!