BigPatent - cased version

Hi! I am trying to work with the BigPatent dataset.

In TensorFlow Datasets (TFDS), the dataset is available in several versions:

  • 1.0.0 : lower cased tokenized words
  • 2.0.0 : Update to use cased raw strings
  • 2.1.2 (default): Fix update to cased raw strings.

The default version in the datasets library seems to be 1.0.0.

Is it possible to load the cased raw strings version?
I have tried with

ds = load_dataset("big_patent", "g", revision="2.1.2", split="validation", download_mode="force_redownload")

but this does not seem to work.

Hi! We don’t follow the TFDS versioning scheme. Instead, our versioning of the GH datasets is synced with the version of the datasets package, i.e., when you load a GH dataset using datasets==x.y.z, the version of the dataset will be the one at the git tag x.y.z.

Yes, our big_patent script matches version 1.0.0, so feel free to open an issue in our GH repo to add support for version 2.1.2.
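
If it helps to confirm which variant a given load actually returned, a rough check is to measure how many letters in a sample of texts are uppercase. This is only a heuristic sketch: the function names and the threshold are made up for illustration and are not part of the datasets library.

```python
def uppercase_ratio(texts):
    """Fraction of alphabetic characters that are uppercase across all texts."""
    letters = [ch for text in texts for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def looks_lowercased(texts, threshold=0.001):
    """Heuristic: True when almost no uppercase letters appear.

    The threshold is an arbitrary assumption, not a documented value.
    """
    return uppercase_ratio(texts) < threshold
```

With a loaded split, one could pass it a handful of documents, for example looks_lowercased(ds["description"][:100]), assuming the text field is named "description" in the version you have. A True result would suggest the lowercased 1.0.0 variant.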

Hi, thanks for your reply. I have already opened an issue (#3861) for adding the cased version. Thanks.


To follow up on this: the cased and uncased versions actually contain different content, not just different casing. The cased one is an easier summarization task, since its input includes the Summary of the Invention section.

See the paper describing the issue here: