Hi,
I'm working on Scala bindings for the (Rust) Tokenizers library. While writing tests, I realized that the offsets for certain strings differ between the Python bindings and the underlying Rust implementation when called directly.
This can also be seen directly in the quicktour docs when looking at the offsets of a 😁:
print(output.offsets[9])
# (26, 27)
println!("{:?}", output.get_offsets()[9]);
// (26, 30)
I suspect the difference comes down to offsets being counted in bytes on one side and in characters on the other, i.e. calling len() on a Rust str returns the number of bytes:
let input = "😁";
println!("{}", input.len());
// 4
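If that hypothesis is right, the same string measures differently on the Python side, where len() counts Unicode code points rather than UTF-8 bytes. A minimal sketch (assuming the emoji from the quicktour example):

```python
s = "😁"  # U+1F601, a single code point

# Python's len() counts code points
print(len(s))
# 1

# Encoding to UTF-8 exposes the byte length Rust's str::len() reports
print(len(s.encode("utf-8")))
# 4
```

That would explain the one-character span (26, 27) in Python versus the four-byte span (26, 30) in Rust for the same token.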
There's also OffsetType, which might be related, but I haven't been able to figure out yet where this is handled on the Python side.
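For what it's worth, a byte-offset span can be mapped back to character offsets by decoding the prefix up to each boundary and counting its code points. A hedged sketch, assuming the quicktour sentence and the (26, 30) byte span from the Rust output above:

```python
text = "Hello, y'all! How are you 😁 ?"  # assumed quicktour input
data = text.encode("utf-8")
byte_start, byte_end = 26, 30  # byte span reported by the Rust side

# Character offset = number of code points in the decoded byte prefix.
# This only works if byte_start/byte_end fall on UTF-8 char boundaries.
char_start = len(data[:byte_start].decode("utf-8"))
char_end = len(data[:byte_end].decode("utf-8"))
print((char_start, char_end))
# (26, 27)
```

This reproduces the Python-side offsets, which is why I suspect the bindings convert byte offsets to char offsets somewhere.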
Any hints appreciated. Thanks!