Token Offsets in Rust vs. Python

Hi,

I’m working on Scala bindings for the (Rust) Tokenizers library. While writing tests, I noticed that the offsets for certain strings differ between the Python bindings and the underlying Rust implementation called directly.

This can also be seen directly in the quicktour docs when looking at the offsets of the 😁 emoji:

# Python
print(output.offsets[9])
# (26, 27)

// Rust
println!("{:?}", output.get_offsets()[9]);
// (26, 30)
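For what it’s worth, the Rust byte offsets can be mapped to Python-style character offsets by decoding the UTF-8 prefix up to each offset. A minimal sketch (the input string is assumed to be the one from the quicktour, and the offsets are assumed to fall on valid UTF-8 character boundaries):

```python
def byte_to_char_offsets(text: str, start: int, end: int) -> tuple:
    """Convert UTF-8 byte offsets into character (code point) offsets.

    Assumes start/end land on valid UTF-8 character boundaries,
    otherwise the decode will raise.
    """
    data = text.encode("utf-8")
    return (len(data[:start].decode("utf-8")),
            len(data[:end].decode("utf-8")))

# Assumed quicktour input; the emoji is character 26, bytes 26..30.
text = "Hello, y'all! How are you 😁 ?"
print(byte_to_char_offsets(text, 26, 30))
# (26, 27)
```

This reproduces the Python-side offsets from the Rust-side byte offsets, which supports the bytes-vs-characters theory.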

I suspect the difference comes down to strings being measured in bytes vs. characters: calling len() on a Rust str returns the number of UTF-8 bytes, whereas Python’s len() counts Unicode code points:

let input = "šŸ˜";
println!("{}", input.len());
// 4
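The Python equivalent (a hypothetical snippet, not from the docs) shows both counts for the same emoji:

```python
s = "😁"
# Python's len() counts Unicode code points.
print(len(s))
# 1
# Encoding to UTF-8 and counting bytes matches Rust's str::len().
print(len(s.encode("utf-8")))
# 4
```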

There’s also OffsetType, which might be related, but I haven’t yet been able to figure out where this is handled on the Python side.

Any hints appreciated. Thanks!

I just realized that there’s also encode_char_offsets, and that the encode method in Python calls it.