Hi, Hugging Face community! (or dear @mariosasko again)
I’m currently following this tutorial, where the dataset is created as follows:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="../dataset/dutch.txt",
    block_size=128,
)
This method is straightforward for text files, but I’m working with a dataset in the Hugging Face .arrow format, created using datasets.Dataset.save_to_disk. I noticed that transformers.TextDataset and transformers.LineByLineTextDataset don’t seem to support reading from a Hugging Face dataset folder. The source code is here.
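For context, this is roughly how I load the saved dataset back with datasets.load_from_disk (the folder path below is just a placeholder for my actual dataset directory):

from datasets import load_from_disk

# Load the dataset previously written with datasets.Dataset.save_to_disk
raw_dataset = load_from_disk("../dataset/dutch_arrow")
print(raw_dataset)  # shows the features, including the "text" column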
Furthermore, when using the Trainer, it seems to require a transformers.Dataset:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
This is where my confusion lies. transformers.Dataset doesn’t allow reading from a Hugging Face dataset folder and doesn’t seem to let me specify a column (in my case, column='text'). On the other hand, datasets.Dataset doesn’t allow setting a block_size, which seems crucial for my task, and it’s unclear whether it’s compatible with Trainer.
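To make the question concrete, this is the kind of mapping I imagine would stand in for block_size on a datasets.Dataset (truncating each example at 128 tokens, which is my understanding of what LineByLineTextDataset does per line). raw_dataset here is the dataset loaded with load_from_disk above, and tokenize_function is just my own sketch, not something from the tutorial:

block_size = 128

def tokenize_function(examples):
    # Tokenize the "text" column, cutting each example off at block_size tokens
    return tokenizer(examples["text"], truncation=True, max_length=block_size)

lm_dataset = raw_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],  # drop the raw string column so only model inputs remain
)

The idea would then be to pass lm_dataset as train_dataset to the Trainer above, but I don’t know whether that’s how it’s meant to be done.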
I’m trying to understand which .Dataset class would be the most appropriate for my scenario. Should I use transformers.Dataset, adapting it somehow to read from Hugging Face dataset folders, or is there a way to use datasets.Dataset with the necessary block_size and ensure compatibility with Trainer?
Any guidance or suggestions on how to approach this would be greatly appreciated!
Thank you in advance!