Hi, I have created a Streamlit Space with CPU Upgraded hardware and I am trying to run a Dataset.map. After about 20 minutes the Space goes into “Connection timeout”, in the container log I see “stopping…”, my Dataset.map procedure stops, and I have to restart…
On the screen I see a popup with the message: “Sorry, there is an error on our side.”
How can I resolve this? I have also tried building the tokenized dataset locally on my PC, but it is 45 GB and I don’t know how to upload it to Hugging Face for the Space to use, so that I can skip the Dataset.map step.
To resolve the connection timeout and error during Dataset.map in your Streamlit Space:
Increase Resources: Ensure your Streamlit Space has sufficient CPU and memory for processing large datasets.
Optimize Dataset Processing: Process the dataset in smaller batches using a chunked approach to avoid timeouts.
Pre-tokenize Locally: Tokenize the dataset locally, save it in a format like Parquet/JSON, and upload it to Hugging Face. Load the tokenized dataset directly in your Streamlit Space.
Use map Optimizations: Use batched=True and num_proc in Dataset.map for better performance (see the sketch after this list).
Check Logs: Review container logs for detailed error messages to identify specific issues.
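As a rough sketch of points 2–4 above (the dataset, model checkpoint, “text” column, and batch sizes are assumptions, not your actual code), a batched, multi-process Dataset.map followed by saving the result to Parquet could look like this:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# assumption: a dataset with a "text" column; swap in your own dataset and checkpoint
dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_function(batch):
    # with batched=True, batch["text"] is a list of strings
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,        # tokenize many examples per call instead of one at a time
    batch_size=1000,     # tune to your memory budget
    num_proc=4,          # set to the number of vCPUs your hardware actually has
    remove_columns=["text"],
)

# persist the result so the map step never has to run again
tokenized_dataset.to_parquet("tokenized_train.parquet")
```

If the Space still stops mid-map, pre-tokenizing locally (point 3) and uploading the result to a dataset repo sidesteps the problem entirely.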
In the log I have no other information, only the connection timeout, and I am using CPU Upgraded. It should not be possible to get a connection timeout while the Space is running with no other error; I pay for the hourly use of these resources.
Yes, to use your dataset with the Trainer, you need to map the tokenizer over the dataset to tokenize the text data. You’ve already set up the tokenizer and preprocessing function correctly. By applying the tokenizer with dataset.map, you prepare the dataset for use with the Trainer.
I have also tokenized my dataset locally and it is about 45 GB, but it is not possible to upload it to the Space: error “reach limit 1GB”.
Model repos and dataset repos support large files and fast transfers, but you cannot upload large files to a Space repo.
A Space can download from model repos and dataset repos after it starts, and that is fast, so it is best to put your dataset in a dataset repo.
Edit:
However, the free Spaces disk space is still 50 GB even after startup, so if you only want to read a 45 GB dataset it should be fine, but processing it might be difficult… it’s 50-50.
How can I load the tokenized dataset from the dataset repo?
And my other question is… if the Space can’t even run a Dataset.map, is it possible to run training on a T4 small, or will it always go into connection timeout?
Now I’m trying to run Dataset.map on a T4 small, but it is slower than CPU Upgraded (T4 small: about 7,000 examples/s; CPU Upgraded: about 20,000 examples/s).
How can I load the tokenized dataset from the dataset repo?
I’ve found a way to save it. However, I think it would be more reliable to split the dataset into several parts and run the training in several sessions, if possible. It would be difficult if the data cannot be divided due to its nature…
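For what it’s worth, a minimal sketch of that save/load round trip, assuming the tokenized dataset was written locally with save_to_disk and that your-username/tokenized-dataset is a placeholder repo id:

```python
from datasets import load_dataset, load_from_disk

# --- locally, after tokenizing ---
tokenized = load_from_disk("tokenized_dataset")  # assumption: saved earlier with save_to_disk
tokenized.push_to_hub(
    "your-username/tokenized-dataset",  # placeholder dataset repo id
    private=True,
    max_shard_size="500MB",             # upload as many small Parquet shards
)

# --- inside the Space ---
tokenized = load_dataset("your-username/tokenized-dataset", split="train")

# or stream it so the full 45 GB never has to sit on the Space's disk at once:
streamed = load_dataset("your-username/tokenized-dataset", split="train", streaming=True)
```

This uploads to a dataset repo (which handles large files, as noted above) rather than to the Space repo.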
Another option would be to customize the Trainer’s data collator to incorporate the tokenizer and use the Trainer with an IterableDataset.
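A rough sketch of that second option, assuming a raw-text dataset with “text” and “label” columns streamed from the Hub (the repo id, model checkpoint, and column names are all placeholders): tokenization happens inside the collator, one batch at a time, so the full Dataset.map pass is never needed.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# streaming=True returns an IterableDataset, so the 45 GB is never fully materialized
train_stream = load_dataset("your-username/raw-dataset", split="train", streaming=True)

def collate_and_tokenize(examples):
    # the collator receives a list of raw examples and tokenizes them on the fly
    batch = tokenizer(
        [ex["text"] for ex in examples],
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors="pt",
    )
    batch["labels"] = torch.tensor([ex["label"] for ex in examples])
    return batch

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,
    max_steps=10_000,             # required with an IterableDataset (its length is unknown)
    remove_unused_columns=False,  # keep the raw "text" column so the collator can see it
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_stream,
    data_collator=collate_and_tokenize,
)
trainer.train()
```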