to_json Performance

Hello,

I split the Wikipedia data into train, test, and validation sets, and now I want to save them to the filesystem with the following code.

import os

output_directory = "/media/rainer/T7/Notebooks"
os.makedirs(output_directory, exist_ok=True)

train_testvalid['train'].to_json(
    os.path.join(output_directory, "de_wikipedia_train.json"),
    batch_size=1000,
    num_proc=6,
    orient="records",
    lines=True,
)

The performance is very poor. The estimate says 90 hours…
CPU utilization is at 0.9 %.
Disk utilization is below 1 %.

What I tried:

  • Use only one CPU core.
  • Use multiple CPU cores.
  • Use a bigger batch_size.
  • Use different storage devices for reading and writing.
  • Use a RAM disk.

None of it helped…

I hope you can give me a hint on how to speed up this process.

Kind regards

Rainer

Hi!

Let’s try working on the script.

You are using the to_json method of a Hugging Face datasets Dataset (the batch_size and num_proc arguments belong to that method, not to pandas), which under the hood converts each batch to a pandas DataFrame and serializes it with pandas' to_json. Depending on the size and complexity of the data, this can be a slow process, especially with orient="records" and lines=True, since every row is serialized individually.

So you can try the following suggestions:

  • Use the standard to_csv method instead of to_json for initial testing; CSV is often faster to write (see the first sketch below).
  • If JSON is a requirement, consider faster serialization libraries like orjson or ujson (see the second sketch below).
  • Ensure that the source dataset (train_testvalid['train']) is held in memory and not being read from disk during this operation (see the third sketch below).
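
A minimal sketch of the CSV route, assuming train_testvalid['train'] is a datasets.Dataset as in your snippet (the .csv filename is just illustrative). Dataset.to_csv accepts the same batch_size and num_proc arguments as to_json:

import os

output_directory = "/media/rainer/T7/Notebooks"
os.makedirs(output_directory, exist_ok=True)

# write the split as CSV; often faster than row-by-row JSON serialization
train_testvalid['train'].to_csv(
    os.path.join(output_directory, "de_wikipedia_train.csv"),
    batch_size=1000,
    num_proc=6,
)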
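
If it has to be JSON Lines, here is a hedged sketch that skips pandas and serializes each row with orjson instead. It assumes orjson is installed (pip install orjson) and that your datasets version has Dataset.iter, which recent releases do:

import orjson

output_path = os.path.join(output_directory, "de_wikipedia_train.json")

# open in binary mode because orjson.dumps returns bytes
with open(output_path, "wb") as f:
    # iterate in batches so the whole split is never materialized at once
    for batch in train_testvalid['train'].iter(batch_size=1000):
        # each batch is a dict mapping column name -> list of values
        columns = list(batch.keys())
        for values in zip(*batch.values()):
            f.write(orjson.dumps(dict(zip(columns, values))))
            f.write(b"\n")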
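
To check the third point: if cache_files is non-empty, the split is memory-mapped from Arrow files on disk rather than held in RAM. One way to force it into memory, assuming the data fits, is keep_in_memory=True at load time. The dataset name below is only illustrative; substitute your actual loading and splitting code:

from datasets import load_dataset

# a non-empty list means the split is read from disk during serialization
print(train_testvalid['train'].cache_files)

# illustrative reload that keeps the Arrow data in RAM instead of on disk
wiki = load_dataset("wikipedia", "20220301.de", keep_in_memory=True)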

You can try these and see where that gets you.
Hope that helps!