Does the string length distribution in the Hugging Face dataset viewer represent token length or character length?

In the Hugging Face dataset viewer, is the string length distribution measured in tokens or in characters?

It’s the number of characters.
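To make the distinction concrete, here is a minimal Python sketch. It is not the viewer's actual code: the helper names are hypothetical, whitespace splitting is only a crude stand-in for a real tokenizer, and it assumes a "character" means what Python's `len()` counts (Unicode code points).

```python
def char_lengths(texts):
    """Return the number of characters (Unicode code points) in each string."""
    return [len(t) for t in texts]


def whitespace_token_lengths(texts):
    """Crude tokenizer stand-in: count whitespace-separated pieces per string."""
    return [len(t.split()) for t in texts]


samples = ["Hello world", "tokenization"]
print(char_lengths(samples))              # character counts per string
print(whitespace_token_lengths(samples))  # rough token counts per string
```

The two measures can differ a lot for the same text, so a character-length histogram should not be read as a token budget.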

I appreciate your response. By the way, could you share a specific reference for your answer (such as the URL of the relevant guideline)? I tried to find the answer in the Hugging Face docs but couldn't.

I think we only mention it here in the docs: Explore statistics over split data

I also checked the “string_text” field in this doc earlier, but I couldn’t find any details on how its length is calculated. Anyway, thank you for your assistance.

the code is here:

Feel free to open a PR on the docs (dataset-viewer/docs/source/statistics.md at main · huggingface/dataset-viewer · GitHub), it would be much appreciated!

cc @polinaeterna.

Awesome! A perfect answer. Thank you for pointing out the code.

I've opened a PR to update statistics.md, specifically the statement on the “string_text” field.

Update statistics.md: Add the length calculation statement. by Siki-cloud · Pull Request #2976 · huggingface/dataset-viewer (github.com)
