How to use several datasets that each fit into RAM?


I have the following issue. I am trying to train a language model in TensorFlow using the nice new TF notebook: notebooks/language_modeling_from_scratch-tf.ipynb at master · huggingface/notebooks · GitHub

I know that the starting point of the training is to actually load the data using the datasets package.

from datasets import load_dataset
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

and then do something like

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

My issue is that my text data is stored on my disk as several parquet files with the following constraints:

  1. each file fits into the RAM easily (say in a Pandas dataframe)
  2. there are too many files to load them all at once in the RAM

What is the best way to proceed with datasets here? Concatenating the files will not work, as my RAM is not sufficient, and the training code in the notebook seems to work with only one dataset instance.

What do you think?

Hi ! You can load all your parquet files this way:

datasets = load_dataset("path/to/the/directory/containing/your/parquet/files")

It does two things:

  • it converts all the parquet files into Arrow files and stores them in the datasets cache
  • it opens the Arrow files using memory mapping - that means the part of your disk that contains the data is used as virtual memory to load the dataset. Therefore it gives you access to all your data without filling up your RAM.
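To illustrate the memory-mapping idea itself (this is a minimal stdlib sketch, not the datasets library's internals), here is how a file can be opened as virtual memory with Python's `mmap` module - the OS pages the file contents in on demand, so even a file much larger than this toy one would not be copied into RAM up front:

```python
import mmap
import os
import tempfile

# Write a file to disk; pretend it is a big Arrow file.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

# Open it via memory mapping: no bytes are read until we touch them,
# and only the touched pages are actually loaded into memory.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = bytes(mm[:1])  # reading this slice pages in only one page
    mm.close()
```

This is the same mechanism that lets a dataset larger than RAM behave as if it were an in-memory array.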

Then once your dataset is loaded you can tokenize it:

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
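For context, here is a library-free sketch of what a batched mapping function sees: with `batched=True`, `map` passes a dict mapping column names to lists of values, one list entry per example in the batch. The `tokenize_function` below is illustrative only - it fakes tokenization by counting whitespace-separated words rather than calling the notebook's actual tokenizer:

```python
def tokenize_function(batch):
    # batch["text"] is a list of strings (one per example in the batch).
    # A real tokenizer would return input_ids, attention_mask, etc.;
    # here we just return a token count per example for illustration.
    return {"num_tokens": [len(text.split()) for text in batch["text"]]}

batch = {"text": ["hello world", "one two three"]}
out = tokenize_function(batch)
# out == {"num_tokens": [2, 3]}
```

The returned dict's keys become the new columns of the tokenized dataset, which is why the original "text" column is dropped via remove_columns.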