Help: Not able to find Cohere wikipedia/miracl datasets

Hi All,

I am new to this forum. I am looking for the Cohere datasets below, but both links return a 404. Could someone point me to an alternative location for these?

https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings

Both the en and de versions are needed.

https://huggingface.co/datasets/Cohere/miracl-en-corpus-22-12

Thanks in advance

GRB

Seems same here…

Can anyone help, or could an admin respond?

Only third-party mirrors can be found. If you absolutely need the genuine article, you may have to contact Cohere via the Community tab on one of their remaining model or dataset pages.


Mirrors / alternative hosts (not the Hugging Face dataset pages)

Wikipedia embeddings mirrors

  • Gitee mirror for the simple subset: hf-datasets/wikipedia-22-12-simple-embeddings (Gitee)
  • Gitee AI mirrors for Cohere’s language datasets (including DE): Cohere/wikipedia-22-12-de-embeddings (Gitee AI)
  • Another mirror endpoint showing EN: (Gitee AI)

MIRACL EN corpus mirror

  • Gitee AI mirror: Cohere/miracl-en-corpus-22-12 (Gitee AI)
  • Elastic Rally track hosts small packaged subsets for benchmarking (useful if you only need a sample): (rally-tracks.elastic.co)

MIRACL documents (no Cohere embeddings)

  • The upstream MIRACL corpus mirror (Apache-2.0): hf-datasets/miracl-corpus (Gitee)
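
If you end up juggling the original repo plus one or more of the mirrors above, a small fallback helper may save some trial and error. This is only a rough sketch: `first_available` is a hypothetical helper name, the loader is passed in by you, and nothing here confirms that any particular mirror can be read directly by a given library (you may need to download a mirror first and point the loader at a local path).

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def first_available(
    candidates: Iterable[str],
    loader: Callable[[str], Any],
) -> Tuple[Optional[str], Any, List[Tuple[str, str]]]:
    """Try each candidate source in order and return the first one that loads.

    Returns (name, loaded_object, errors). `errors` lists every candidate
    that failed before the successful one, with the exception text.
    If nothing loads, name and loaded_object are both None.
    """
    errors: List[Tuple[str, str]] = []
    for name in candidates:
        try:
            return name, loader(name), errors
        except Exception as exc:  # e.g. a 404 / "repo not found" style error
            errors.append((name, repr(exc)))
    return None, None, errors


# Example usage (assumption: the `datasets` package is installed and you have
# network access; the second entry is a placeholder, not a real path):
#
# from datasets import load_dataset
# name, ds, errors = first_available(
#     ["Cohere/wikipedia-22-12-simple-embeddings", "<local-path-to-downloaded-mirror>"],
#     lambda n: load_dataset(n, split="train", streaming=True),
# )
# print("loaded from:", name, "failed:", errors)
```

Keeping the loader as a parameter means the same helper works whether you load from the Hub, from a local copy of a mirror, or from something else entirely.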