I’m trying to encode a list of profiles so that I can later feed them into a RoBERTa model. The profiles are sentences that all follow the same structure; one profile (row) looks like this:
My name is x. I live in x. I went to x in x. I have a degree in x. I have worked as x in companies x. I have worked from x. My job locations have been x. The average distance between my locations is x.
The relevant code is:
from transformers import RobertaForSequenceClassification, RobertaTokenizer
import pandas as pd
import torch

model = RobertaForSequenceClassification.from_pretrained(model_name)
tokenizer = RobertaTokenizer.from_pretrained(model_name)

def encode_df(dataframe):
    profiles = dataframe.apply(lambda row: tokenizer.encode(list(row), axis=1, add_special_tokens=True))
    return profiles
df = pd.read_excel("/content/profile sentences into df.xlsx")
df = pd.DataFrame(df)
profiles_encoded = encode_df(df)
profiles_input = profiles_encoded.values.tolist()
input_ids = torch.tensor(profiles_input).squeeze(0)
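To sanity-check the output, I print the encodings and the resulting tensor shape (just a quick inspection):

print(profiles_encoded.head())
print(input_ids.shape)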
input_ids is then passed into the ONNX export:
torch.onnx.export(roberta_model,
                  (input_ids),
                  ONNX_path,
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size',
                                          1: 'sentence_length'},
                                'output': {0: 'batch_size'}})
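I then run inference on the exported model one profile at a time, roughly like this (a sketch of my loop; the onnxruntime calls are the standard API, the variable names are mine):

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(ONNX_path)
# feed one profile at a time; 'input' and 'output' match the names used in the export
for i in range(input_ids.shape[0]):
    logits = session.run(['output'], {'input': input_ids[i:i+1].numpy().astype(np.int64)})
    print(logits)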
When I look at profiles_encoded, the first row is all 0s and the rest of the values are all 3s. I’m guessing this is why I get the same result for every profile when I run inference.
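For reference, I checked what those ids map to in the tokenizer’s vocabulary (a standard transformers call):

print(tokenizer.convert_ids_to_tokens([0, 3]))
# ['<s>', '<unk>'] in the pretrained RoBERTa vocabulary

So each encoded profile appears to be the start token followed by nothing but unknown tokens.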
I have run this same code with another, more traditional dataframe, where the columns held the raw data rather than full sentences:

fullName, Location, College, Degree_location, Degree, Jobs, Company, Date, LocationJob, Avg_rounded

and it worked fine.
I have also tried splitting each sentence into its own column, closer to how the data was structured before, but it still gave me the same values for every profile.
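That variant looked roughly like this (a sketch; the dataframe name and the per-sentence columns are hypothetical, the encode call is the same one as above):

# one sentence per column: name_sentence, location_sentence, college_sentence, ...
profiles = df_sentences.apply(lambda row: tokenizer.encode(list(row), axis=1, add_special_tokens=True))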
Why is it encoding the profiles this way? Is it because all the data follows the same sentence structure?