load_dataset() doesn't load one of the subsets

So I have a dataset: stepkurniawan/qa-rag-llama · Datasets at Hugging Face

In that dataset, I made three subsets:

  1. “default”,
  2. “Llama-2-13b-chat-hf”, and
  3. “Llama-2-7b-chat-hf”.

I load it using: qa_dataset = load_dataset("stepkurniawan/qa-rag-llama", "Llama-2-13b-chat-hf")

When I load “Llama-2-7b-chat-hf”, everything works well. However, when I try to load “Llama-2-13b-chat-hf”, it gives me the “default” dataset instead.

Why?
And how can I fix it?

Thank you!

Update: I have now changed the “Llama-2-7b-chat-hf” subset. The changes are visible in the Dataset Viewer, but when I run

HF_HUB_QA_LLAMA = "stepkurniawan/qa-rag-llama"
qa_dataset = load_dataset(HF_HUB_QA_LLAMA, "Llama-2-7b-chat-hf")

it still gives me the old dataset… I don't understand…

Update 2:
The difference is just the number of rows… Previously they had only 3-5 rows, but after my change everything becomes 50 rows. Does HF have a row limit?

pushing for visibility

Anyone?
Is this a bug?

Hi! Maybe you tried to run push_to_hub twice at the same time, and that caused conflicts? Feel free to retry and push the subsets one at a time.

Does HF have a row limit?

No, you can have as many rows as you want.

Hi @lhoestq, thanks for answering!

I'm sure I didn't do two pushes at the same time.
I tried with 2 subsets (“Llama2-13b” & “Llama2-7b”), and both of them behave weirdly in the same way.

Can you confirm that what I am doing is correct?

  1. push_to_hub: result_dataset.push_to_hub(HF_HUB_QA_LLAMA, token=hf_token, config_name=subset_name)

  2. load_dataset : qa_dataset = load_dataset(HF_HUB_QA_LLAMA, subset_name)

And another question: as a best practice, do you recommend push_to_hub(), or uploading directly to HF using the terminal?

Ok cool! What you're doing is correct.

And another question: as a best practice, do you recommend push_to_hub(), or uploading directly to HF using the terminal?

push_to_hub() is the recommended way. It optimizes for I/O efficiency, allows resuming, offers a better experience for the dataset viewer, and also allows defining multiple subsets.

Can you try to reupload your subsets, and ping me if you still have an issue when you reload the data using load_dataset?

Ok…
That is bad news…
since now I don't know what I did wrong and it's still unsolved :frowning:
It normally works with datasets without subsets; however, I'm struggling to do it with subsets…

I can try it once more and ping you if the bad news persists, sure.

@lhoestq
Hey, I've tested it again.
It turns out that using another laptop gives me 50 rows, while on my laptop I still get 3 rows.
Does that mean I have to clear a cache somehow/somewhere?

The cache is automatically refreshed as long as you have an internet connection: it fetches the latest files from the HF website. Otherwise it can use the local cache and show a WARNING log message saying that it loaded the dataset from your local cache.

I certainly don’t get the warning.

qa_dataset = load_dataset("stepkurniawan/qa-rag-llama", "Llama-2-13b-chat-hf")
print("something")

and this is in my terminal:

(base) [xxx@ml3-gpu2 RAG-comparation]$ /usr/bin/env /home/xxx/python /home/xxx/RAG-comparation/rag_ragas.py 
something

btw, how do I delete the cache?

Actually it's at the “info” log level

@lhoestq
I turned on the INFO log level using

transformers.logging.set_verbosity_info()

But I still didn't get the cache warning. Is the code above correct?
It could be that the problem is not the cache.

Maybe we could check that by deleting the cache?
How do I do that?

The cache is at ~/.cache/huggingface/datasets by default; feel free to delete it (make sure nothing is using datasets while you delete it).
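For example, from a shell (this path assumes the default cache location; it moves if HF_HOME or HF_DATASETS_CACHE is set):

```shell
# remove the local datasets cache; stop any running code that uses datasets first
rm -rf ~/.cache/huggingface/datasets
```

The next load_dataset call will then re-download everything from the Hub.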

Thank you! It works after I deleted the cache!!!
But it is weird, since I don't want my code to use the cache…
Is there any way I can force the program to NOT use the cache?

Yes, you can do

import datasets

datasets.disable_caching()

and

load_dataset(..., download_mode="force_redownload")