Can't save Dataset as Parquet file since updating Datasets?

Hi Guys,

I was using datasets==1.5.0 while preprocessing my dataset and saving it. I updated to the latest datasets version (1.6.1), and since then I can't export my datasets as Parquet files.

With version 1.5.0 I could just do:

import pyarrow.parquet as pq
...
...
pq.write_table(train_dataset.data, 'train.parquet')
pq.write_table(eval_dataset.data, 'eval.parquet')

When I run the same code with the latest datasets version I get:

  File "../preprocess_dataset.py", line 132, in <module>
    pq.write_table(train_dataset.data, f'{resampled_data_dir}/{data_args.dataset_config_name}.train.parquet')
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 1674, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 588, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
TypeError: Argument 'table' has incorrect type (expected pyarrow.lib.Table, got ConcatenationTable)

Should I just use 1.5.0, or is there a quick and easy workaround?

I'm not that familiar with Python. In Java I could just use one version for one project and another version for another project. Can / should I do the same here?
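
On the version question: a common way in Python to keep one library version per project is a separate virtual environment for each project. Below is a minimal sketch using the standard-library venv module. The helper name create_project_env and the directory name in the comment are made up for illustration, and installing the pinned version afterwards is left as a manual step.

```python
import os
import venv
from pathlib import Path


def create_project_env(env_dir, with_pip=True):
    """Create an isolated environment and return the path to its interpreter."""
    env_path = Path(env_dir)
    # with_pip=True asks venv to bootstrap pip into the new environment.
    venv.create(env_path, with_pip=with_pip)
    # Virtual environments use "Scripts" on Windows and "bin" elsewhere.
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return env_path / bin_dir / exe


# Example (hypothetical directory name):
#   py = create_project_env("env-datasets-1.5.0")
#   then, in a shell:  <py> -m pip install "datasets==1.5.0"
```

Each project then runs with its own interpreter, so one project can stay on 1.5.0 while another uses 1.6.1.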

pq.write_table(train_dataset.data.table, 
pq.write_table(eval_dataset.data.table, 'eval.parquet')

is working, so this can be closed.
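
In case the same script has to run on both versions, here is a small sketch of a helper that picks the right object to pass to pq.write_table. It assumes that the wrapper tables in newer versions (like the ConcatenationTable from the traceback) expose the underlying pyarrow table as .table, and that a plain pyarrow.Table has no such attribute. The name unwrap_arrow_table and the stand-in classes are hypothetical and only there so the sketch runs without datasets or pyarrow installed.

```python
def unwrap_arrow_table(data):
    """Return the object to hand to pyarrow.parquet.write_table.

    Assumption: wrapper tables expose the underlying pyarrow.Table as
    `.table`; anything without that attribute is returned unchanged.
    """
    return getattr(data, "table", data)


# Stand-in classes (hypothetical) that mimic the two cases.
class FakeArrowTable:
    pass


class FakeWrapper:
    def __init__(self, table):
        self.table = table


raw = FakeArrowTable()
print(unwrap_arrow_table(raw) is raw)               # plain table passes through
print(unwrap_arrow_table(FakeWrapper(raw)) is raw)  # wrapper is unwrapped
```

In the real script this would be used as pq.write_table(unwrap_arrow_table(train_dataset.data), 'train.parquet').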
