Dataset set_format

Hello everyone,

Datasets provide this great feature of formatting datasets using set_format and then choosing the desired format (numpy, torch etc). The encoded dataset I prepared has columns/features of various data types (int32, int8 etc) but HF models require all features to be dtype torch.long/int64. Is there a simple trick to convert all features to torch.long tensors when selecting torch format?

I understand that I could have prepared the dataset with int64 type but that significantly increases the dataset file size footprint.
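
For a rough sense of the footprint difference, here is a minimal NumPy sketch. The array length is a hypothetical figure, and NumPy arrays stand in for the stored columns. It only shows how storage scales with integer width, and that widening can be deferred to read time.

```python
import numpy as np

n_tokens = 1_000_000  # hypothetical number of stored token ids

# The same values stored at different integer widths.
small = np.zeros(n_tokens, dtype=np.int8)
medium = np.zeros(n_tokens, dtype=np.int32)
large = np.zeros(n_tokens, dtype=np.int64)

# Bytes needed per width: int64 costs 8x what int8 does.
sizes = {a.dtype.name: a.nbytes for a in (small, medium, large)}
print(sizes)

# Widening at read time returns a new array; the compact one is untouched.
as_long = small.astype(np.int64)
print(small.dtype, "->", as_long.dtype)
```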

Thanks,
Vladimir

Never mind, I found a way. RTFM.

import torch

# Named fmt so that the built-in format() is not shadowed
fmt = {'type': 'torch', 'format_kwargs': {'dtype': torch.long}}
dataset.set_format(**fmt)

Here is a related issue that may be helpful.

@BramVanroy it really was. You got me at “After a weekend of debugging”. I can totally relate :slight_smile:

I have pre-processed my model input data and saved it as Python lists. When loading the training data, I set the format to torch. From your discussion with @lhoestq, I gather this conversion is just as fast, because loading goes through numpy arrays on the way to torch tensors anyway, right? Do I need to save my data as numpy arrays as well? It doesn’t seem to be needed. What is the best format for saving training data so that it can be loaded into torch tensors as fast as possible?

It doesn’t really matter: your data is converted into a list to be compatible with the Arrow data format. Then, when you request items from the dataset, the data is cast from the underlying Arrow format (list-like) to numpy, and then to the format that you requested with set_format. My guess is that this last step also applies the precision information (e.g. float32 vs float64), although ideally the precision should already be correct in the numpy conversion to prevent data loss.

@lhoestq Please correct me if I’m wrong.

Hi! You’re right, @BramVanroy.
Under the hood, all the data is in Arrow format. Arrow has good interoperability with numpy, which makes it possible to cast Arrow objects to numpy quickly and with zero copy. From numpy, the conversion to pytorch is also very fast.

To summarize, this is what it looks like when you load a pytorch tensor from a dataset:

memory-mapped arrow file -> load a sample in arrow format -> convert to numpy -> convert to pytorch (using the set_format args)

Thanks for the clarification @lhoestq! Would it be possible to move the precision to the numpy cast, or is the numpy cast guaranteed to always use the highest possible precision? I’m asking because, imagine you save data in double precision and use set_format to request double-precision torch tensors, but the data is first cast to single precision in numpy and only becomes double when cast to torch. You’d lose a lot of data, I think?

The numpy cast from Arrow uses the same precision as the Arrow data. It doesn’t reduce the precision :slight_smile:

The precision of the Arrow data is defined by the features field of the dataset. For example, to have double precision you need:

from datasets import Features, Value
features = Features({"col1": Value("double")})

That is not entirely correct, I think. If you remember, I had an issue where a torch.float32 would unexpectedly end up as a torch.float64 if the precision is not specified manually. So the input precision is not automatically used in the output. It went like this:

torch.float32 -> list -> float64 (numpy) -> torch.float64.
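
The list step in that chain can be reproduced with NumPy alone. This is a sketch in which a float32 NumPy array stands in for the torch.float32 tensor; it is not the code path the library actually runs.

```python
import numpy as np

x32 = np.array([0.1, 0.2, 0.3], dtype=np.float32)

as_list = x32.tolist()   # plain Python floats, which are 64-bit
x64 = np.array(as_list)  # NumPy infers float64 from Python floats

print(x32.dtype, "->", type(as_list[0]).__name__, "->", x64.dtype)

# Widening float32 to float64 is exact, so narrowing again
# recovers the original values bit for bit.
roundtrip = x64.astype(np.float32)
print(np.array_equal(roundtrip, x32))
```

The dtype changes because a Python list carries no precision information, but the values themselves survive the widening unchanged.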

Well, the precision that is used is that of the Arrow data, which is read from the file on disk.

You’re right that there is currently an issue on the other side, when writing an Arrow file. When you write floats to a dataset, for example with map, and the precision is not specified manually, the library tries to infer it and picks the highest one (here float64, even though you are providing float32 tensors through map).

Well, I guess that if it picks the highest possible precision available then no information is lost ever, so that’s okay.

Thanks for the reply!

This seems inconsistent with what the documentation at

to_numpy(self, zero_copy_only=False)

Return a NumPy copy of this array (experimental).

Parameters:

zero_copy_only (bool, default False)
Introduced for signature consistence with pyarrow.Array.to_numpy. This must be False here since NumPy arrays’ buffer must be contiguous.

This suggests that it makes a copy of the data rather than doing a zero-copy conversion to numpy arrays. I’m also running into a problem where loading the data as a numpy array or as a Python list seems equally slow. Maybe I’m doing something horribly wrong. See: Create batch from list of ids in the dataset is very slow - #4