Bookcorpus dataset format

vblagoje · October 8, 2020, 9:25am

The current book corpus dataset is parsed into sentences directly, which is great, but then there is no way to determine document boundaries. Would it be useful to have another bookcorpus dataset that is chunked into books rather than sentences directly?

Shawn Presser went to great lengths to preserve the structure of the books’ text, and it is available for download at https://github.com/soskek/bookcorpus/issues/27.

lhoestq · October 12, 2020, 8:30am

Indeed! It was already suggested in https://github.com/huggingface/datasets/issues/486 to use this link. It would be very cool to add it to the library. You can make a script to use the new link if you want. You can take some inspiration from the docs and from the current bookcorpus script.
Let me know if you have questions; you can ping me on the forum or on GitHub.

vblagoje · October 12, 2020, 8:45am

Ok, deal, I’ll do this. I need this dataset ready for use…yesterday

lhoestq · April 26, 2023, 5:00pm

Since this thread still has views:

bookcorpusopen is available in Datasets at Hugging Face and is at document level.

from datasets import load_dataset
bookcorpusopen = load_dataset("bookcorpusopen", split="train")
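
If you still want sentences but need to keep the document boundaries, one option is to split each book yourself and tag every sentence with the book it came from. This is only a rough sketch: it assumes each record has "title" and "text" fields (check the dataset card to confirm), the helper name sentences_with_book_ids is made up for illustration, and the regex split is deliberately naive. It runs on a tiny in-memory sample instead of the real dataset.

```python
import re
from typing import Dict, Iterable, Iterator


def sentences_with_book_ids(books: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Yield one record per sentence, tagged with the index of its source book.

    Hypothetical helper: assumes every record has "title" and "text" keys.
    """
    for book_id, book in enumerate(books):
        # Naive split after '.', '!' or '?' followed by whitespace.
        # A real sentence tokenizer would handle abbreviations, quotes, etc.
        for sentence in re.split(r"(?<=[.!?])\s+", book["text"].strip()):
            if sentence:
                yield {
                    "book_id": book_id,
                    "title": book["title"],
                    "sentence": sentence,
                }


# Tiny in-memory stand-in for document-level rows (made-up sample data).
sample_books = [
    {"title": "book-a", "text": "First sentence. Second one!"},
    {"title": "book-b", "text": "Only sentence here."},
]

rows = list(sentences_with_book_ids(sample_books))
for row in rows:
    print(row["book_id"], row["title"], row["sentence"])
```

Because every sentence carries its book_id, a change in book_id marks a document boundary, which is exactly what the sentence-level bookcorpus cannot tell you. The same generator should also accept the loaded dataset directly, since iterating over it yields one dict per record.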
