Does the string length distribution in the Hugging Face dataset viewer represent token length or character length?
It’s the number of characters.
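To make the difference concrete, here is a minimal sketch (not the dataset viewer's actual code) comparing a character count with a token count. The helper names `char_length` and `naive_token_length` are hypothetical, and the whitespace split is only a stand-in for a real tokenizer, which would typically give yet another number.

```python
def char_length(text: str) -> int:
    # Python's len() on a str counts characters (Unicode code points).
    return len(text)


def naive_token_length(text: str) -> int:
    # Assumption for illustration only: split on whitespace as a crude
    # stand-in for a real tokenizer.
    return len(text.split())


sample = "Hello, dataset viewer!"
print(char_length(sample))         # 22
print(naive_token_length(sample))  # 3
```

So a histogram of character lengths will generally show much larger values than a histogram of token lengths for the same column.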
I appreciate your response. By the way, could you share a reference for your answer (such as the URL of the relevant guideline)? I tried to find it in the Hugging Face docs but couldn’t.
I think we only mention it here in the docs: Explore statistics over split data
I also checked the “string_text” field in this doc earlier, but I couldn’t find any details on how to calculate its length. Anyway, thank you for your assistance.
the code is here:
Feel free to open a PR on the docs (dataset-viewer/docs/source/statistics.md at main · huggingface/dataset-viewer · GitHub), it would be much appreciated!
cc @polinaeterna.
Awesome! A perfect answer. Thank you for pointing out the code.
I’ve also opened a PR to update statistics.md, specifically the description of the “string_text” field.