Hi, I have created a Streamlit Space with CPU Upgraded hardware and I am trying to run a Dataset.map. After about 20 minutes the Space goes into “Connection timeout”, in the container log I see “stopping…”, my Dataset.map procedure stops, and I have to restart…
On the screen I see a popup with the message: “Sorry, there is an error on our side.”
How can I resolve this? I have also tried building the tokenized dataset locally on my PC, but it is 45 GB and I don’t know how to upload it to Hugging Face for the Space to use, so that I can skip the Dataset.map step.
To resolve the connection timeout and error during Dataset.map in your Streamlit Space:
Increase Resources: Ensure your Streamlit Space has sufficient CPU and memory for processing large datasets.
Optimize Dataset Processing: Process the dataset in smaller batches using a chunked approach to avoid timeouts.
Pre-tokenize Locally: Tokenize the dataset locally, save it in a format like Parquet/JSON, and upload it to Hugging Face. Load the tokenized dataset directly in your Streamlit Space.
Use map Optimizations: Use batched=True and num_proc in Dataset.map for better performance (see the sketch after this list).
Check Logs: Review container logs for detailed error messages to identify specific issues.
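As a rough sketch of points 2–4 above (the dataset, model checkpoint, “text” column, and batch sizes are assumptions, not your actual code), a batched, multi-process Dataset.map followed by saving the result to Parquet could look like this:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# assumption: a dataset with a "text" column; swap in your own dataset and checkpoint
dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_function(batch):
    # with batched=True, batch["text"] is a list of strings
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,        # tokenize many examples per call instead of one at a time
    batch_size=1000,     # tune to your memory budget
    num_proc=4,          # set to the number of vCPUs your hardware actually has
    remove_columns=["text"],
)

# persist the result so the map step never has to run again
tokenized_dataset.to_parquet("tokenized_train.parquet")
```

If the Space still stops mid-map, pre-tokenizing locally (point 3) and uploading the result to a dataset repo sidesteps the problem entirely.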
In the log I have no other information, only the connection timeout, and I am using CPU Upgraded. It should not be possible to get a connection timeout while the Space is running with no other error; I pay for the hourly use of these resources.
Yes, to use your dataset with the Trainer, you need to map the tokenizer over the dataset to tokenize the text data. You’ve already set up the tokenizer and preprocessing function correctly. By applying the tokenizer with dataset.map, you prepare the dataset for use with the Trainer.
I have also tokenized my dataset locally and it is about 45 GB, but it is not possible to upload it to the Space: error “reach limit 1GB”.
Model repos and dataset repos support large files and fast transfers, but you cannot upload large files to a Space repo.
A Space can download from model repos and dataset repos after it starts, and that is fast, so it is best to put your dataset in a dataset repo.
Edit:
However, the free Spaces disk space is still 50 GB even after startup, so if you only want to read a 45 GB dataset it should be fine, but processing it might be difficult… it’s 50-50.
How can I load the tokenized dataset from the dataset repo?
And my other question is… if the Space can’t even run a Dataset.map, is it possible to run training on a T4 small, or will it always go into connection timeout?
Now I’m trying to run Dataset.map on a T4 small, but it is slower than CPU Upgraded (T4 small: about 7,000 examples/s; CPU Upgraded: about 20,000 examples/s).
How can I load the tokenized dataset from the dataset repo?
I’ve found a way to save it. However, I think it would be more reliable to split the dataset into several parts and run the training in several sessions, if possible. It would be difficult if the data cannot be divided due to its nature…
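For what it’s worth, a minimal sketch of that save/load round trip, assuming the tokenized dataset was written locally with save_to_disk and that your-username/tokenized-dataset is a placeholder repo id:

```python
from datasets import load_dataset, load_from_disk

# --- locally, after tokenizing ---
tokenized = load_from_disk("tokenized_dataset")  # assumption: saved earlier with save_to_disk
tokenized.push_to_hub(
    "your-username/tokenized-dataset",  # placeholder dataset repo id
    private=True,
    max_shard_size="500MB",             # upload as many small Parquet shards
)

# --- inside the Space ---
tokenized = load_dataset("your-username/tokenized-dataset", split="train")

# or stream it so the full 45 GB never has to sit on the Space's disk at once:
streamed = load_dataset("your-username/tokenized-dataset", split="train", streaming=True)
```

This uploads to a dataset repo (which handles large files, as noted above) rather than to the Space repo.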
Another option would be to customize the Trainer’s data collator to incorporate the tokenizer and use the Trainer with an IterableDataset.
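A rough sketch of that second option, assuming a raw-text dataset with “text” and “label” columns streamed from the Hub (the repo id, model checkpoint, and column names are all placeholders): tokenization happens inside the collator, one batch at a time, so the full Dataset.map pass is never needed.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# streaming=True returns an IterableDataset, so the 45 GB is never fully materialized
train_stream = load_dataset("your-username/raw-dataset", split="train", streaming=True)

def collate_and_tokenize(examples):
    # the collator receives a list of raw examples and tokenizes them on the fly
    batch = tokenizer(
        [ex["text"] for ex in examples],
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors="pt",
    )
    batch["labels"] = torch.tensor([ex["label"] for ex in examples])
    return batch

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,
    max_steps=10_000,             # required with an IterableDataset (its length is unknown)
    remove_unused_columns=False,  # keep the raw "text" column so the collator can see it
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_stream,
    data_collator=collate_and_tokenize,
)
trainer.train()
```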