Token Offsets in Rust vs. Python

Hi,

I’m working on Scala bindings for the (Rust) Tokenizers library. While writing tests, I noticed that the offsets for certain strings differ between the Python bindings and the underlying Rust implementation called directly.

This can also be seen directly in the quicktour docs when looking at the offsets of the 😁 emoji:

# Python
print(output.offsets[9])
# (26, 27)

// Rust
println!("{:?}", output.get_offsets()[9]);
// (26, 30)
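For what it’s worth, the Rust byte offsets can be mapped to Python-style character offsets by decoding the UTF-8 prefix up to each offset. A minimal sketch (the input string is assumed to be the one from the quicktour, and the offsets are assumed to fall on valid UTF-8 character boundaries):

```python
def byte_to_char_offsets(text: str, start: int, end: int) -> tuple:
    """Convert UTF-8 byte offsets into character (code point) offsets.

    Assumes start/end land on valid UTF-8 character boundaries,
    otherwise the decode will raise.
    """
    data = text.encode("utf-8")
    return (len(data[:start].decode("utf-8")),
            len(data[:end].decode("utf-8")))

# Assumed quicktour input; the emoji is character 26, bytes 26..30.
text = "Hello, y'all! How are you 😁 ?"
print(byte_to_char_offsets(text, 26, 30))
# (26, 27)
```

This reproduces the Python-side offsets from the Rust-side byte offsets, which supports the bytes-vs-characters theory.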

I suspect the difference comes down to strings being measured in bytes vs. characters: calling len() on a Rust str returns the number of UTF-8 bytes, whereas Python’s len() counts Unicode code points:

let input = "šŸ˜";
println!("{}", input.len());
// 4
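The Python equivalent (a hypothetical snippet, not from the docs) shows both counts for the same emoji:

```python
s = "😁"
# Python's len() counts Unicode code points.
print(len(s))
# 1
# Encoding to UTF-8 and counting bytes matches Rust's str::len().
print(len(s.encode("utf-8")))
# 4
```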

There’s also OffsetType, which might be related, but I haven’t yet been able to figure out where this is handled on the Python side.

Any hints appreciated. Thanks!

I just realized that there’s also encode_char_offsets, and that the encode method in Python calls it.