Truncation strategy for long text documents

Hello,

My study partner and I are doing research on Twitter data for our Master’s Thesis. We have collected a dataset of tweets aggregated at the user level. Each entry in the dataset corresponds to a user, and each user has a text document and a classification label. Each text document consists of several tweets from one user concatenated into one long string (not sentences or word tokens, just one long string).

We use BertForSequenceClassification for this, but have a problem with truncation. The average number of tokens in these text documents is 28,000(!), so with a sequence length of 512 a huge number of tokens is obviously dropped.

Our question is about the truncation strategy. We set the parameter truncation=True when calling the BertTokenizer. Will the truncation just keep the first 512 tokens with this strategy, or will it keep the 512 tokens with the highest weights/WordPiece “scores”? In other words, if the tokenizer’s scoring were e.g. TF-IDF, would the truncation process keep the top-512 TF-IDF-scoring tokens or just the first 512 tokens? We don’t fully understand how WordPiece assigns weights/scores to tokens and whether these “scores” are used in truncation.

Hi @eirikdahlen, from the docs one sees that truncation=True is equivalent to the longest_first strategy which just truncates all tokens beyond the maximum context size of the model (e.g. 512 for BERT-base).

The other strategies are only_first and only_second which refer to whether one should apply the truncation exclusively on the first or second set of inputs, e.g. if you’re doing something like entailment where the inputs are a premise and hypothesis.
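
For illustration, here is a minimal sketch of both behaviours (bert-base-uncased is just an assumed checkpoint, and the example texts are placeholders):

from transformers import BertTokenizer

# bert-base-uncased is an assumption; any BERT checkpoint behaves the same way
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

long_text = "one tweet after another " * 2000  # stand-in for a user's concatenated tweets

# truncation=True is the longest_first strategy: keep [CLS] + the first 510 tokens + [SEP]
encoded = tokenizer(long_text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512

# only_second: in a (premise, hypothesis) pair, only the second text gets truncated
pair = tokenizer(
    "a short premise",
    "a very long hypothesis " * 200,
    truncation="only_second",
    max_length=64,
)
print(len(pair["input_ids"]))  # 64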

Since you’re dealing with long texts, you might want to check out the Longformer model - it can handle input sequences of up to 4,096 tokens, so it should be able to capture more context in your use case 🙂
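
A rough sketch of what swapping in Longformer could look like (the allenai/longformer-base-4096 checkpoint and num_labels=2 are assumptions for illustration, not a recommendation):

import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2  # num_labels is assumed
)

user_document = "all of one user's tweets concatenated " * 300  # placeholder text
inputs = tokenizer(user_document, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, num_labels)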


Hi @eirikdahlen,
WordPiece and tf-idf are different concepts. WordPiece takes a piece of text and converts it into a sequence of tokens. tf-idf, on the other hand, takes a sequence of tokens and converts it into a vector (a fixed-size list of numbers).

Most implementations of tf-idf take raw text as input and do the tokenization step implicitly, which is why it “doesn’t feel” like tf-idf takes tokens as input, but it does.

In other words, tf-idf has nothing to do with truncation strategies.
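
To make the distinction concrete, here is a small illustrative sketch (scikit-learn’s TfidfVectorizer and the bert-base-uncased WordPiece tokenizer are just example choices):

from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer

texts = ["the cat sat on the mat", "untokenizable words get split into subwords"]

# WordPiece: text -> an ordered sequence of (sub)word tokens
wordpiece = BertTokenizer.from_pretrained("bert-base-uncased")
print(wordpiece.tokenize(texts[1]))  # rare words come out as several '##' pieces

# tf-idf: (implicitly tokenized) text -> a fixed-size vector of weights, order discarded
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
print(matrix.shape)  # (2 documents, vocabulary size)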

That still leaves you with your original problem: you have 28K tokens in a document but can only input 512, and that’s a lot of data loss…

The strategy to use here depends on your end goals with the model. Since the data comes in the form of tweets and is then aggregated, you might benefit from giving up the aggregation of the tweets and instead inputting individual tweets.
Can you share more about what you are trying to achieve?
Tal


Hi @lewtun @talolard ,

Thank you both for your thorough replies, it is greatly appreciated.
We will check out Longformer! Hopefully, BigBird will be available through Hugging Face in the near future as well!

I understand TF-IDF and WordPiece are not the same thing; I just “hoped” there was some ranking of tokens involved there. I see now that ranking tokens by a WordPiece score really doesn’t make that much sense.

We are considering not aggregating the tweets at the user level. Our only problem is that our dataset is annotated at the user level (it was not collected by us), and classification has previously been done on users in related research. If we converted it to the tweet level, it would take a lot of hours to label these tweets manually or semi-automatically (as there are 10 million tweets).

If you look at the first row of the Wikipedia dataset, it is a very long string: wikipedia · Datasets at Hugging Face

So in a typical use case of tokenizing the input texts, all of the text after max_length will be ignored?

from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer

# bert-base-uncased is assumed here; swap in whichever checkpoint you actually use
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki = wiki.remove_columns(
    [col for col in wiki.column_names if col != "text"]
)  # only keep the 'text' column

assert bookcorpus.features.type == wiki.features.type
train_dataset = concatenate_datasets([bookcorpus, wiki])


def group_texts(examples):
    # with truncation=True, everything beyond the first 128 tokens of each
    # text is silently dropped
    tokenized_inputs = tokenizer(
        examples["text"],
        return_special_tokens_mask=True,
        truncation=True,
        max_length=128,
    )
    return tokenized_inputs


train_dataset.set_transform(group_texts)

It seems like it would be better to go through each row in the dataset, apply a segmenter to the text column, and create chunks of text up to max_length so that more of the data is seen by the model.
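
For example, here is a rough sketch building on the snippet above (tokenizer and train_dataset as defined there; the 128-token chunk size and the use of return_overflowing_tokens are assumptions, not the only way to do this):

def chunk_texts(examples):
    # return_overflowing_tokens (fast tokenizers only) keeps the tokens beyond
    # max_length as additional 128-token chunks instead of silently dropping them
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
        return_special_tokens_mask=True,
    )


# map (rather than set_transform) is used because each row can expand into
# several chunks, so the number of rows changes
chunked_dataset = train_dataset.map(
    chunk_texts, batched=True, remove_columns=train_dataset.column_names
)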