I have collected text data for a language translation project. How can I add it to datasets?

I am just trying to keep a repository of my efforts to translate a language. I have some 7k sentences in CSV format, and I want to know how I can upload them to HF datasets. This may be a basic question, but if there is already a page with detailed instructions, do let me know.
I tried, but no options seem to exist.
Thanks.

Hi !
Currently a HF dataset on the Hub requires a Python script to be loaded (see here for an example).
But we are adding a feature that will allow you to upload to the Hub with a method my_dataset.push_to_hub().

You can track the progress of the first PR we made here: it allows loading a dataset without a Python script. Then we’ll add the actual push_to_hub method.
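
For reference, here is a minimal sketch of what such a loading script could look like, assuming a dataset repository containing a translation.csv file with "source" and "target" columns (the class name, file name and column names are placeholders to adapt to your own data):

import csv
import datasets

class MyTranslations(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def _info(self):
        # Declare the schema of the dataset.
        return datasets.DatasetInfo(
            description="Sentence pairs for translation.",
            features=datasets.Features(
                {"source": datasets.Value("string"), "target": datasets.Value("string")}
            ),
        )

    def _split_generators(self, dl_manager):
        # The CSV is hosted alongside the script in the dataset repository.
        path = dl_manager.download("translation.csv")
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        # Yield one (key, example) pair per CSV row.
        with open(filepath, encoding="utf-8") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"source": row["source"], "target": row["target"]}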


@lhoestq on a similar note, I have been preparing a large corpus of medical data for T5 denoising fine-tuning, adding all the random sentinel tokens and spans to the input and target columns.

I find that when trying to load larger data files using the run_summarization.py example I get a memory overrun. So first question: is loading from a dataset much more efficient in memory usage? That is, instead of loading the whole file to hold the whole dataset in memory, does it allow access but bring spans of data into memory only when needed (as is suggested on this page)?

Second question: if loading as a dataset is what I need to do to minimise memory usage, what is the best/easiest route to get this done? I have the CSV file with the two columns for input and target (and other metadata in other columns, such as the original text), so this is a very simple file. Just to be clear, there isn’t an available web resource to load the original data from; it is data I have generated and have myself on my PC. Can I simply load the CSV data in a Python script into a dataset using load_dataset() and then just use your new my_dataset.push_to_hub() method, which would then host/store the data on the Hub for retrieval by me/others who wanted to try it?

And thanks in advance for any help; I'm new to this, so pointers would be very much appreciated!

Hi !

I find that when trying to load larger data files using the run_summarization.py example I get a memory overrun. So first question: is loading from a dataset much more efficient in memory usage?

In run_summarization.py the dataset is loaded using something similar to load_dataset("csv", data_files=["path/to/my/data.csv"]). It converts your CSV into an Arrow file which can be loaded without filling your RAM. Therefore I’m not sure how it could cause memory issues in your case; can you provide more details about what happened?
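
If you want to check this yourself, here is a small sketch (the CSV path is a placeholder): the loaded dataset is backed by an Arrow cache file on disk, and only the rows you access are brought into memory.

from datasets import load_dataset

dataset = load_dataset("csv", data_files=["path/to/my/data.csv"], split="train")

print(dataset.cache_files)  # the Arrow file(s) backing the dataset on disk
print(dataset.num_rows)     # row count, known without loading the table into RAM
print(dataset[0])           # only this record is materialised in memory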

Second question: if loading as a dataset is what I need to do to minimise memory usage, what is the best/easiest route to get this done?

The data_files argument to pass to load_dataset works for both local and remote data. In your case, since your CSV is local, you can just do:

from datasets import load_dataset

my_dataset = load_dataset("csv", data_files=["local/path/to/my/data.csv"])

Ultimately you will be able to do my_dataset.push_to_hub() to share it with other people.

As I mentioned in my previous message, we also plan to let users upload their data files (CSV, JSON, etc.) directly to a dataset repository on the Hugging Face Hub and allow them to load it with load_dataset("username/dataset_name"). I expect the PR to get merged next week and a new release of datasets to happen right after :slight_smile:
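
Roughly, the workflow should end up looking like this once the feature is out (the repository name is a placeholder, and the exact signature may still change):

from datasets import load_dataset

# Load the local CSV into a Dataset, then push it to a Hub repository.
my_dataset = load_dataset("csv", data_files=["local/path/to/my/data.csv"], split="train")
my_dataset.push_to_hub("username/my_dataset")

# Later, anyone can reload it directly from the Hub.
reloaded = load_dataset("username/my_dataset")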

@lhoestq thanks for the quick (and helpful!) response. I decided to deal with things one at a time, so I have concentrated on getting you more information on the memory issue when loading larger files. I tried various runs with my data (a T5 denoising fine-tuning exercise). I have two sets of train/validate/test files: one set where the training CSV file is about 1.4GB (400k rows), and one set where the training CSV file is 600MB (225k rows). I tried t5-small, t5-base and t5-large models, with and without fp16. So large/small files x t5-small/base/large x fp16/no fp16 = 12 runs. I am on a Windows 10 machine with 1 NVIDIA GPU (RTX 3090) with 24GB of memory. I was running run_summarization.py with max_train_samples 5000, max_eval_samples 500, per_device_train_batch_size 4 and eval_steps 500.

I logged the first 500 steps and the 1st eval on wandb and put together this set of results showing GPU memory consumption.

Findings:

  • t5-large with/without fp16 just blew memory; these are the two single dots on the left of the chart (not shown above 100% in this wandb chart because effectively they didn’t start)
  • t5-large with the small files without/with fp16 is up near 100% and 80% GPU memory allocated
  • t5-small with the small files without/with fp16 is up near 20% and 15% GPU memory allocated
  • t5-small with the large files without/with fp16 is up near 40% and 30% GPU memory allocated
  • t5-base with the small files without/with fp16 is up near 45% and 39% GPU memory allocated
  • t5-base with the large files without fp16 rose to 100% then settled down to 60% (this seemed strange/interesting so I did two runs of this to check it repeated)
  • t5-base with the large files with fp16 rose to 100% and stayed there (this seemed strange/interesting so I did two runs of this to check it repeated)

Comments:

  • With t5-large no fp16 with small files I am banging up against the 24GB available on the GPU
  • (fp16 gives me some headroom at 80% but also returns NaN, a different but maybe related issue?)
  • Using larger input files (i.e. more samples) seems to require more GPU memory for the model?
  • The up to 100% then back down to 60% behaviour of t5-base with large files seems interesting
  • (adding fp16 takes it up to 100% and there it stays for t5-base)

And here is one of the two stack trace dumps from when t5-large immediately runs out of GPU memory:

wandb: Run wandb offline to turn off syncing.
0%| | 1/3750 [00:02<2:09:23, 2.07s/it]Traceback (most recent call last):
  File "C:\Users\BrianS\PycharmProjects\PyTorch\transformers\examples\pytorch\summarization\run_summarization.py", line 624, in <module>
    main()
  File "C:\Users\BrianS\PycharmProjects\PyTorch\transformers\examples\pytorch\summarization\run_summarization.py", line 548, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\transformers\trainer.py", line 1287, in train
    tr_loss += self.training_step(model, inputs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\transformers\trainer.py", line 1782, in training_step
    loss = self.compute_loss(model, inputs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\transformers\trainer.py", line 1814, in compute_loss
    outputs = model(**inputs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1561, in forward
    encoder_outputs = self.encoder(
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\transformers\models\t5\modeling_t5.py", line 998, in forward
    layer_outputs = layer_module(
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\transformers\models\t5\modeling_t5.py", line 691, in forward
    hidden_states = self.layer[-1](hidden_states)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\transformers\models\t5\modeling_t5.py", line 301, in forward
    forwarded_states = self.DenseReluDense(forwarded_states)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\transformers\models\t5\modeling_t5.py", line 261, in forward
    hidden_states = self.dropout(hidden_states)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\nn\modules\dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "C:\Users\BrianS\.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\nn\functional.py", line 1168, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 24.00 GiB total capacity; 21.38 GiB already allocated; 11.31 MiB free; 21.97 GiB reserved in total by PyTorch)
wandb: Waiting for W&B process to finish, PID 1268
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: - 0.00MB of 0.00MB uploaded (0.00MB deduped)
wandb:
wandb: Find user logs for this run at: C:\Users\BrianS\PycharmProjects\PyTorch\transformers\examples\pytorch\summarization\wandb\run-20210821_122954-1rtixwev\logs\debug.log
wandb: Find internal logs for this run at: C:\Users\BrianS\PycharmProjects\PyTorch\transformers\examples\pytorch\summarization\wandb\run-20210821_122954-1rtixwev\logs\debug-internal.log
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced wandb T5-Large no FP16 MIMIC_III large file: Weights & Biases

Process finished with exit code 1

Just checking one more thing to be more consistent with the above. I noticed that the larger files I was using also had larger amounts of text/tokens as input/output, so I am checking whether the length of the input/output text is a factor rather than the number of samples per file.

It seems to be more about the size of the text input to the model than the size of the file in terms of physical size or number of rows. When I use files where the number of input (text) tokens is 400, t5-large blows up, but with 200 it doesn’t, regardless of the length of the file (number of samples).

So given that I am near the limit of GPU memory, maybe staying at or below 200 text tokens means I don’t breach the memory limit because the memory the model needs is (a tiny bit) smaller?

The amount of memory depends on the model, your batch size, and the number of tokens in your batches. So in your case you have to either lower your batch size or keep the maximum number of tokens per batch from getting too high (like 200 in your example).

It’s not easy to know in advance how much GPU memory will be used, so feel free to test different values if needed.
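
As an illustration of the token-count side of this, here is a rough sketch (the model name and the 200-token cap are just examples taken from the discussion above) of how truncating at tokenization time bounds the size of each batch fed to the model:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

texts = ["a long medical report ...", "another document ..."]

# Truncating to 200 tokens caps the sequence length, and with it the
# activation memory used per batch during training.
batch = tokenizer(texts, max_length=200, truncation=True, padding="longest", return_tensors="pt")
print(batch["input_ids"].shape)  # (batch_size, <= 200)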