I’m trying to encode a list of profiles so that I can later feed them into a RoBERTa model. The profiles are sentences that all follow the same structure; one profile (row) looks like this:
My name is x. I live in x. I went to x in x. I have a degree in x. I have worked as x in companies x. I have worked from x. My job locations have been x. The average distance between my locations is x.
The relevant code is:
from transformers import RobertaForSequenceClassification, RobertaTokenizer
import pandas as pd
import torch

model = RobertaForSequenceClassification.from_pretrained(model_name)
tokenizer = RobertaTokenizer.from_pretrained(model_name)

def encode_df(dataframe):
    profiles = dataframe.apply(lambda row: tokenizer.encode(list(row), axis=1, add_special_tokens=True))
    return profiles
df = pd.read_excel("/content/profile sentences into df.xlsx")
df = pd.DataFrame(df)
profiles_encoded = encode_df(df)
profiles_input = profiles_encoded.values.tolist()
input_ids = torch.tensor(profiles_input).squeeze(0)
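To sanity-check the output, I print the encodings and the resulting tensor shape (just a quick inspection):

print(profiles_encoded.head())
print(input_ids.shape)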
input_ids is then passed into the ONNX export:
torch.onnx.export(roberta_model,
                  (input_ids),
                  ONNX_path,
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size',
                                          1: 'sentence_length'},
                                'output': {0: 'batch_size'}})
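I then run inference on the exported model one profile at a time, roughly like this (a sketch of my loop; the onnxruntime calls are the standard API, the variable names are mine):

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(ONNX_path)
# feed one profile at a time; 'input' and 'output' match the names used in the export
for i in range(input_ids.shape[0]):
    logits = session.run(['output'], {'input': input_ids[i:i+1].numpy().astype(np.int64)})
    print(logits)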
When I look at profiles_encoded, the first row is all 0s and the rest of the values are all 3s. I’m guessing this is why I get the same result for every profile when I run inference.
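For reference, I checked what those ids map to in the tokenizer’s vocabulary (a standard transformers call):

print(tokenizer.convert_ids_to_tokens([0, 3]))
# ['<s>', '<unk>'] in the pretrained RoBERTa vocabulary

So each encoded profile appears to be the start token followed by nothing but unknown tokens.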
I have run this same code with another, more traditional dataframe, where the columns held the raw data rather than full sentences:

fullName, Location, College, Degree_location, Degree, Jobs, Company, Date, LocationJob, Avg_rounded

and it worked fine.
I have also tried splitting each sentence into its own column, closer to how the data was structured before, but it still gave me the same values for every profile.
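That variant looked roughly like this (a sketch; the dataframe name and the per-sentence columns are hypothetical, the encode call is the same one as above):

# one sentence per column: name_sentence, location_sentence, college_sentence, ...
profiles = df_sentences.apply(lambda row: tokenizer.encode(list(row), axis=1, add_special_tokens=True))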
Why is it encoding the profiles this way? Is it because all the data follows the same sentence structure?