BigPatent - cased version

Hi! I am trying to work with the BigPatent dataset.

In TensorFlow Datasets (TFDS), there are several versions of the dataset:

  • 1.0.0 : lower cased tokenized words
  • 2.0.0 : Update to use cased raw strings
  • 2.1.2 (default): Fix update to cased raw strings.

The default version in the datasets library seems to be 1.0.0.

Is it possible to load the cased raw strings version?
I have tried with

ds = load_dataset("big_patent", "g", revision="2.1.2", split="validation", download_mode="force_redownload")

but this does not seem to work.

Hi! We don’t follow the TFDS versioning scheme. Instead, our versioning of the GH datasets is synced with the version of the datasets package, i.e., when you load a GH dataset using datasets==x.y.z, the version of the dataset will be the one at the git tag x.y.z.

Yes, our big_patent script matches version 1.0.0, so feel free to open an issue in our GH repo to add support for version 2.1.2.
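
If you want to confirm which version the installed loading script declares, something like the sketch below might help. The helper names (parse_version, is_cased, report_big_patent_version) are made up for illustration and are not part of the datasets API. The cased/lower-cased rule is taken only from the version list quoted in the question, and the last function assumes that load_dataset_builder can still resolve "big_patent" with your installed datasets version and that builder.info.version is populated.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a version string such as "2.1.2" into a comparable tuple (2, 1, 2)."""
    return tuple(int(part) for part in version.split("."))


def is_cased(version: str) -> bool:
    """Apply the rule from the version list above (an assumption, not an API):
    1.x holds lower-cased tokenized words, 2.0.0 and later hold cased raw strings."""
    return parse_version(version) >= (2, 0, 0)


def report_big_patent_version(config: str = "g") -> str:
    """Return the version declared by the installed big_patent loading script.

    Needs the `datasets` package and network access, so it is never called
    automatically; the import is kept local for the same reason.
    """
    import datasets

    # Builds the dataset builder without downloading the data itself.
    builder = datasets.load_dataset_builder("big_patent", config)
    return str(builder.info.version)
```

Calling is_cased(report_big_patent_version()) would then tell you whether the script you get corresponds to the cased release, under the assumptions above.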

Hi, thanks for your reply. I have already opened an issue (#3861) for adding the cased version. Thanks.
