Does the string length distribution in the Hugging Face dataset viewer represent token length or character length?

Siki-77 · July 8, 2024, 4:43am

Does the string length distribution in the Hugging Face dataset viewer represent token length or character length?

severo · July 8, 2024, 8:42am

It’s the number of characters.

Siki-77 · July 8, 2024, 9:42am

I appreciate your response. By the way, could you share a reference for your answer (such as the URL of the relevant guideline)? I tried to find the answer in the Hugging Face docs but couldn’t.

severo · July 8, 2024, 10:16am

I think we only mention it here in the docs: Explore statistics over split data

Siki-77 · July 8, 2024, 12:27pm

I also checked the “string_text” field in this doc earlier, but I couldn’t find any details on how its length is calculated. Anyway, thank you for your assistance.

severo · July 8, 2024, 12:41pm

The code is here:

https://github.com/huggingface/dataset-viewer/blob/fe605b0c8364b5c0f61bb3e0571c9652660b24b4/services/worker/src/worker/statistics_utils.py#L467


      
        ) or n_unique <= NUM_BINS

    @classmethod
    def compute_transformed_data(
        cls,
        data: pl.DataFrame,
        column_name: str,
        transformed_column_name: str,
    ) -> pl.DataFrame:
        return data.select(pl.col(column_name)).with_columns(
            pl.col(column_name).str.len_chars().alias(transformed_column_name)
        )

    @classmethod
    def _compute_statistics(
        cls,
        data: pl.DataFrame,
        column_name: str,
        n_samples: int,
    ) -> Union[CategoricalStatisticsItem, NumericalStatisticsItem]:
        nan_count, nan_proportion = nan_count_proportion(data, column_name, n_samples)
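
To illustrate why the distinction matters, here is a minimal stdlib-only sketch (not part of the dataset-viewer code). It assumes that Python's `len()` on a `str` is a reasonable stand-in for the `str.len_chars()` call in the snippet above, since both count characters rather than bytes or tokens. The helper names and the whitespace "tokenizer" are hypothetical and only there for comparison; a real model tokenizer would generally give different counts.

```python
def char_length(text: str) -> int:
    """Number of characters (Unicode code points) in the string.

    Assumed here to be a plain-Python stand-in for polars' `str.len_chars()`.
    """
    return len(text)


def byte_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding; differs for non-ASCII text."""
    return len(text.encode("utf-8"))


def naive_token_length(text: str) -> int:
    """Whitespace-split word count; a hypothetical stand-in for a tokenizer."""
    return len(text.split())


# Escapes keep the accented characters unambiguous (single code points).
samples = ["hello world", "na\u00efve caf\u00e9", "数据集"]

# One (characters, bytes, naive tokens) tuple per sample.
lengths = [
    (char_length(s), byte_length(s), naive_token_length(s)) for s in samples
]

for sample, row in zip(samples, lengths):
    print(repr(sample), row)
```

The three counts diverge as soon as the text is not plain ASCII or contains more than one word, so a histogram of character lengths should not be read as a histogram of token lengths.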

Feel free to open a PR on the docs (dataset-viewer/docs/source/statistics.md at main · huggingface/dataset-viewer · GitHub), it would be much appreciated!

cc @polinaeterna.

Siki-77 · July 8, 2024, 12:57pm

Awesome! A perfect answer. Thank you for pointing out the code.

Siki-77 · July 8, 2024, 1:07pm

I have also opened a PR to update statistics.md, specifically the description of the “string_text” field.

Update statistics.md: Add the length calculation statement. by Siki-cloud · Pull Request #2976 · huggingface/dataset-viewer (github.com)
