Thanks!
So now I have two options:
- use the parquet branch: OpenCoder-LLM/opc-annealing-corpus at refs/convert/parquet
- try load_dataset with the data_files option (requires playing with URLs and likely wildcards)
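For anyone landing here later, this is roughly how I picture the two options; a minimal sketch, not something I have verified against this repo. The `revision` and `data_files` parameters of `load_dataset` are real, but the helper names and the wildcard pattern are my own assumptions and would need adjusting to the actual file layout.

```python
REPO_ID = "OpenCoder-LLM/opc-annealing-corpus"
PARQUET_REVISION = "refs/convert/parquet"  # the auto-converted branch mentioned above


def option_parquet_branch(repo_id=REPO_ID):
    """Option 1: arguments pointing load_dataset at the Parquet branch."""
    return {"path": repo_id, "revision": PARQUET_REVISION}


def option_data_files(repo_id=REPO_ID, pattern="**/*.arrow"):
    """Option 2: stay on the default branch and pick files with a wildcard.

    The default pattern is a guess (hypothetical); check the repo's file
    listing before relying on it.
    """
    return {"path": repo_id, "data_files": pattern}


def load(kwargs):
    """Call datasets.load_dataset with the arguments built above."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(**kwargs)
```

Calling `load(option_parquet_branch())` or `load(option_data_files())` would then trigger the actual download, so I would try it on a small subset first.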
It’s a bit frustrating that the repo contains all the metadata, yet either a specific loading approach is required or the dataset has to be duplicated in another format.
A unified interface that loads all sorts of dataset formats would be great; it seems almost implemented already, since the load_dataset function loads all the arrow files by itself.
I might look into the code to see if I can come up with a change.
Thanks again!
Best.