Dataset set_format

Hello everyone,

The datasets library provides a great feature: formatting datasets with set_format and choosing the desired output format (numpy, torch, etc.). The encoded dataset I prepared has columns/features of various data types (int32, int8, etc.), but HF models require all features to be of dtype torch.long/int64. Is there a simple trick to convert all features to torch.long tensors when selecting the torch format?

I understand that I could have prepared the dataset with int64 types, but that significantly increases the dataset's on-disk footprint.


Never mind, I found a way. RTFM.

format = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}

Here is a related issue that may be helpful.

@BramVanroy it really was. You got me at “After a weekend of debugging”. I can totally relate :slight_smile:

I have pre-processed my model input data and saved it as Python lists. When loading the training data, I set the format to torch. From your discussion with @lhoestq, this conversion is fast because loading converts numpy arrays to torch tensors, right? Do I need to save my data as numpy arrays as well? It doesn't seem to be needed. What is the best format to save training data in so that it can be loaded into torch tensors as fast as possible?

It doesn’t really matter: your data is converted into a list to be compatible with the Arrow data format. Then, when you request items from the dataset, the data is cast from the underlying Arrow format (list-like) to numpy, and then to the format you requested with set_format. My guess is that this last step also applies the precision information (e.g. float32 vs float64), although ideally the precision should already be correct after the numpy conversion to prevent data loss.

@lhoestq Please correct me if I’m wrong.

Hi! You’re right @BramVanroy.
Under the hood, all the data is in Arrow format. Arrow has good interoperability with numpy, which allows casting Arrow objects to numpy quickly and with zero copy. The conversion from numpy to pytorch is also very fast.

To summarize, this is what it looks like when you load a pytorch tensor from a dataset:

memory-mapped arrow file -> load a sample in arrow format -> convert to numpy -> convert to pytorch (using the set_format args)
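To illustrate the last hop of that chain, here is a small sketch (using plain numpy and torch, not the datasets internals) showing why the numpy-to-torch step can be so cheap: torch.from_numpy shares memory with the numpy array instead of copying it.

```python
import numpy as np
import torch

# A sample as it might look after the Arrow -> numpy step
arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# torch.from_numpy shares memory with the numpy array: no copy is made
t = torch.from_numpy(arr)

# Mutating the numpy array is visible through the tensor,
# which proves the two views share the same buffer
arr[0] = 42.0
print(t[0].item())  # 42.0
print(t.dtype)      # torch.float32 (precision is preserved)
```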

Thanks for the clarification @lhoestq! Would it be possible to move the precision handling to the numpy cast, or is it guaranteed that the numpy cast always uses the highest possible precision? I’m asking because, imagine you want to save data in double precision and you set set_format to double-precision torch tensors, but your data is first cast to single precision and only becomes double when casting to torch. You’d lose a lot of data, I think?

The numpy cast from Arrow reuses the precision of the Arrow data. It doesn’t reduce the precision :slight_smile:

The precision of the Arrow data is defined by the features field of the dataset. For example, to get double precision you need:

features = Features({"col1": Value("double")})

That is not entirely correct, I think. If you remember, I had an issue where a torch.float32 would unexpectedly end up as a torch.float64 if the precision was not manually specified. So the input precision is not automatically used in the output. It went like this:

torch.float32 -> list -> float64 (numpy) -> torch.float64.
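That round trip can be reproduced with plain numpy: once a float32 value becomes a Python list, the dtype information is gone, and re-inferring it yields float64. A minimal sketch:

```python
import numpy as np

# Start with single-precision data
original = np.array([0.5, 1.5], dtype=np.float32)

# Going through a Python list drops the dtype information:
# the elements become plain Python floats (double precision)
as_list = original.tolist()

# Re-inferring the dtype from plain Python floats gives float64
reinferred = np.array(as_list)
print(reinferred.dtype)  # float64
```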

Well, the precision that is used is the one of the Arrow data that is read from the file on disk.

You’re right that when writing an Arrow file, on the other hand, there’s currently an issue. You’re hitting it because when you write floats to a dataset (using map, for example), the library tries to infer the precision and picks the highest one if you don’t manually specify which one to use: here float64, even though you’re providing float32 tensors to map.

Well, I guess that if it picks the highest possible precision then no information is ever lost, so that’s okay.

Thanks for the reply!