load_dataset() doesn't load one of the subsets

So I have a dataset: stepkurniawan/qa-rag-llama · Datasets at Hugging Face

In that dataset, I made three subsets:

  1. “default”,
  2. “Llama-2-13b-chat-hf”, and
  3. “Llama-2-7b-chat-hf”.

I load it using: qa_dataset = load_dataset("stepkurniawan/qa-rag-llama", "Llama-2-13b-chat-hf")

When I load “Llama-2-7b-chat-hf”, everything works well. However, when I try to load “Llama-2-13b-chat-hf”, it gives me the “default” dataset instead.

Why?
And how can I fix it?

Thank you!

Update: I have now changed the “Llama-2-7b-chat-hf” subset. The changes are visible in the Dataset Viewer, but when I run

HF_HUB_QA_LLAMA = "stepkurniawan/qa-rag-llama"
qa_dataset = load_dataset(HF_HUB_QA_LLAMA, "Llama-2-7b-chat-hf")

it still gives me the old dataset… I don't understand…

Update 2:
The difference is just the number of rows… Previously they had only 3-5 rows, but after my change everything becomes 50 rows. Does HF have a row limit?

pushing for visibility

Anyone?
Is this a bug?

Hi! Maybe you tried to run push_to_hub twice at the same time, and that caused conflicts? Feel free to retry and push the subsets one at a time.

Does HF have a row limit?

No, you can have as many rows as you want.

Hi @lhoestq, thanks for answering!

I'm sure I didn't do two pushes at the same time.
I tried with 2 subsets (“Llama2-13b” & “Llama2-7b”), and both of them behave weirdly in the same way.

Can you confirm that what I am doing is correct?

  1. push_to_hub: result_dataset.push_to_hub(HF_HUB_QA_LLAMA, token=hf_token, config_name=subset_name)

  2. load_dataset : qa_dataset = load_dataset(HF_HUB_QA_LLAMA, subset_name)

And another question: as a best practice, do you recommend push_to_hub(), or uploading directly to HF using the terminal?

Ok cool! What you're doing is correct.

And another question: as a best practice, do you recommend push_to_hub(), or uploading directly to HF using the terminal?

push_to_hub() is the recommended way. It optimizes for I/O efficiency, allows resuming, offers a better experience for the dataset viewer, and also allows defining multiple subsets.

Can you try to reupload your subsets, and ping me if you still have an issue when you reload the data using load_dataset?

Ok…
That is bad news…
since now I don't know what I did wrong and it's still unsolved :frowning:
It normally works with datasets without subsets; however, I'm struggling to do it with subsets…

I can try it once more and ping you if the bad news persists, sure.

@lhoestq
Hey, I've tested it again.
It turns out that using another laptop gives me 50 rows, while on my laptop I still get 3 rows.
Does that mean I have to clear a cache somehow/somewhere?

The cache is automatically refreshed as long as you have an internet connection: it fetches the latest files from the HF website. Otherwise it can use the local cache and show a WARNING log message saying that it loaded the dataset from your local cache.

I certainly don’t get the warning.

qa_dataset = load_dataset("stepkurniawan/qa-rag-llama", "Llama-2-13b-chat-hf")
print("something")

and this is in my terminal:

(base) [xxx@ml3-gpu2 RAG-comparation]$ /usr/bin/env /home/xxx/python /home/xxx/RAG-comparation/rag_ragas.py 
something

btw, how do I delete the cache?

Actually it's at the “info” log level

@lhoestq
I turned on the INFO log level using

transformers.logging.set_verbosity_info()

But I still didn't get the cache warning. Is the code above correct?
It could be that the problem is not the cache.

Maybe we could check that by deleting the cache?
How do I do that?

The cache is at ~/.cache/huggingface/datasets by default; feel free to delete it (make sure nothing is using datasets while you delete it).
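For example, from a shell (this path assumes the default cache location; it moves if HF_HOME or HF_DATASETS_CACHE is set):

```shell
# remove the local datasets cache; stop any running code that uses datasets first
rm -rf ~/.cache/huggingface/datasets
```

The next load_dataset call will then re-download everything from the Hub.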

Thank you! It works after I deleted the cache!!!
But it is weird, since I don't want my code to use the cache…
Is there any way I can force the program to NOT use the cache?

Yes, you can do

import datasets

datasets.disable_caching()

and

load_dataset(..., download_mode="force_redownload")