Bookcorpus dataset format

The current bookcorpus dataset is split directly into sentences, which is great, but it leaves no way to determine document boundaries. Would it be useful to have another bookcorpus dataset that is chunked into books rather than sentences?

Shawn Presser went to great lengths to preserve the structure of the books’ text, and it is available for download at https://github.com/soskek/bookcorpus/issues/27.

Indeed! It was already suggested in https://github.com/huggingface/datasets/issues/486 to use this link. It would be very cool to add it to the library. You can write a script that uses the new link if you want, taking some inspiration from the docs and from the current bookcorpus script.
Let me know if you have questions; you can ping me on the forum or on GitHub.

Ok, deal, I’ll do this. I need this dataset ready for use…yesterday :slight_smile:

Since this thread still has views:

bookcorpusopen is now available on the Hugging Face Hub (bookcorpusopen · Datasets at Hugging Face) and is stored at the document level.

from datasets import load_dataset
bookcorpusopen = load_dataset("bookcorpusopen", split="train")
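
In case it helps anyone who still wants sentence-level rows but needs to keep the document boundaries, here is a minimal sketch of flattening document-level records into sentences while carrying a book index along. It uses toy in-memory records instead of the real dataset; the "title" and "text" field names, the `split_sentences` and `to_sentence_rows` helpers, and the naive regex splitter are all assumptions for illustration, not something confirmed in this thread.

```python
import re

# Toy stand-ins for document-level records. The "title" and "text" field
# names are assumptions; check the actual dataset features before relying on them.
books = [
    {"title": "book-a.epub.txt", "text": "First sentence. Second sentence!"},
    {"title": "book-b.epub.txt", "text": "Only one sentence here."},
]


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def to_sentence_rows(records):
    """Flatten books into sentence rows, keeping the index of the source book."""
    rows = []
    for book_id, record in enumerate(records):
        for sentence in split_sentences(record["text"]):
            rows.append(
                {"book_id": book_id, "title": record["title"], "sentence": sentence}
            )
    return rows


rows = to_sentence_rows(books)
for row in rows:
    print(row["book_id"], row["sentence"])
```

With the real dataset you would probably iterate over the loaded split instead of the toy `books` list, and swap the regex for a proper sentence tokenizer.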