BigPatent - cased version

silvia-casola · March 8, 2022, 1:50pm

Hi! I am trying to work with the BigPatent dataset.

In TensorFlow Datasets (TFDS), there are several versions of the dataset:

1.0.0: lower cased tokenized words
2.0.0: Update to use cased raw strings
2.1.2 (default): Fix update to cased raw strings.

The default version in the datasets library seems to be 1.0.0.

Is it possible to load the cased raw strings version?
I have tried with

ds = load_dataset("big_patent", "g", revision="2.1.2", split="validation", download_mode="force_redownload")

but this does not seem to work.

mariosasko · April 7, 2022, 1:29pm

Hi! We don’t follow the TFDS versioning scheme. Instead, our versioning of the GH datasets is synced with the version of the datasets package, i.e., when you load a GH dataset using datasets==x.y.z, the version of the dataset will be the one at the git tag x.y.z.
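To make the rule above concrete, here is a minimal sketch that reads the installed package version, which (per the description above) is also the git tag the dataset script would come from. The helper name `installed_tag` is hypothetical and not part of the datasets API; only the standard-library `importlib.metadata` is used.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_tag(package="datasets"):
    """Return the installed version string of `package`, or None if it is missing.

    `installed_tag` is a hypothetical helper for illustration only.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    tag = installed_tag()
    if tag is None:
        print("datasets is not installed")
    else:
        # Per the versioning rule described above, the script version
        # follows the git tag that matches the package version.
        print(f"datasets=={tag}: dataset script taken from git tag {tag}")
```

Under this scheme, a `revision="2.1.2"` argument would presumably be read as a git reference rather than as a TFDS dataset version, which would explain why the call in the question does not select the cased data.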

Yes, our big_patent script matches version 1.0.0, so feel free to open an issue in our GH repo to add support for version 2.1.2.
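If you want to confirm which variant you actually got, a rough heuristic is to check whether a sample of the text contains any uppercase letters. This is only a sketch: `has_uppercase` is a hypothetical helper, and it assumes the lowercased 1.0.0 text contains no capitals at all.

```python
def has_uppercase(texts):
    """Return True if any string in `texts` contains an uppercase character.

    Hypothetical helper: a False result over a reasonably large sample
    suggests the lowercased (1.0.0-style) variant of the data.
    """
    return any(ch.isupper() for text in texts for ch in text)


if __name__ == "__main__":
    # Made-up sample strings, not taken from the dataset.
    lowercased = ["the present invention relates to a valve ."]
    cased = ["The present invention relates to a valve."]
    print(has_uppercase(lowercased))
    print(has_uppercase(cased))
```

With a loaded split `ds`, this could be applied to something like `ds["description"][:100]`, assuming the text field is named `description`.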

silvia-casola · April 14, 2022, 10:22am

Hi, thanks for your reply. I have already opened an issue (#3861) for adding the cased version. Thanks.

silvia-casola · April 19, 2023, 4:41pm

To follow up on this: the cased and uncased versions actually contain different content, and the cased one is easier, since its input includes the Summary of the Invention section.

See the paper describing the issue here:
