Hi,
I'm trying to train a model in TensorFlow on SMILES strings, a line notation that encodes the structure of a molecule as text. I thought a transformer would work well because the meaning of each character in a SMILES string depends on its context.
I am having an issue running the model. This is the error I get:
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:653 distributed_training_steps *
self.args.strategy.run(self.apply_gradients, batch)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:618 apply_gradients *
gradients = self.training_step(features, labels)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:601 training_step *
per_example_loss, _ = self.run_model(features, labels, True)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:682 run_model *
outputs = self.model(features, labels=labels, training=training)[:2]
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:1127 call *
outputs = self.bert(
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:615 call *
embedding_output = self.embeddings(input_ids, position_ids, token_type_ids, inputs_embeds, training=training)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:191 call *
return self._embedding(input_ids, position_ids, token_type_ids, inputs_embeds, training=training)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:206 _embedding *
seq_length = input_shape[1]
IndexError: list index out of range
The error is raised when I call trainer.train().
I have checked that there are no zero values in my data. The dataset contains only the SMILES (string), which is the feature, and the CCS (collisional cross section, float), which is the target.
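To make the input format concrete: the traceback stops at `seq_length = input_shape[1]`, which suggests the model expects a 2-D batch of integer token IDs (batch, sequence length) rather than raw strings. Below is a minimal sketch, in plain Python, of a character-level encoding of SMILES strings into padded ID lists. The names `build_char_vocab`, `encode` and `PAD_ID` are hypothetical and the example molecules are only illustrative. This is not the tokenizer that ships with `bert-large-cased`, and a character-level split treats two-letter atoms such as Cl as two separate tokens.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical padding id; real tokenizers define their own


def build_char_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map every character seen in the strings to an integer id.

    Ids start at 1 because 0 is reserved for padding.
    """
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(smiles_list: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Encode each string as a list of ids, right-padded to the longest string."""
    max_len = max(len(s) for s in smiles_list)
    return [
        [vocab[ch] for ch in s] + [PAD_ID] * (max_len - len(s))
        for s in smiles_list
    ]


# Illustrative molecules: ethanol, benzene, acetic acid
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_char_vocab(smiles)
input_ids = encode(smiles, vocab)

# (batch, seq_length): the second entry is what input_shape[1] would read
shape = (len(input_ids), len(input_ids[0]))
print(shape)
```

Here `input_ids` has one row per molecule and one column per character position, so indexing its shape at position 1 is valid. This sketch is separate from the training script that follows.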
This is my code:
# imports used by the code below
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import TFBertForSequenceClassification, TFTrainer, TFTrainingArguments
dataFrameData = pd.DataFrame(numpyArrayOfData, columns=['CAS', 'CCS', 'Compound', 'Adducts', 'Mass', 'SMILES']).iloc[1:]
# splitting the data into testing and training
train, test = train_test_split(dataFrameData, test_size=0.2)
# getting the CCS as the target from the data
targetTrain = train.pop('CCS').astype(float)
targetTest = test.pop('CCS').astype(float)
# getting the SMILES as first feature from data
dfStrippedTrain = train[['SMILES']].copy()
dfStrippedTest = test[['SMILES']].copy()
# combine features and targets into training and test datasets, then use enumerate so each element is indexed
dataSetTrain = tf.data.Dataset.from_tensor_slices((dfStrippedTrain.values, targetTrain.values)).enumerate()
dataSetTest = tf.data.Dataset.from_tensor_slices((dfStrippedTest.values, targetTest.values)).enumerate()
model = TFBertForSequenceClassification.from_pretrained("bert-large-cased")
training_args = TFTrainingArguments(
    output_dir='/results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    logging_dir='/logs'
)
trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=dataSetTrain,
    eval_dataset=dataSetTest
)
trainer.train()
trainer.evaluate()
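One detail that may be relevant is the `.enumerate()` call. `tf.data.Dataset.enumerate()` mirrors Python's built-in `enumerate`, so each element becomes `(index, (feature, target))` instead of `(feature, target)`. The traceback shows the trainer calling `training_step(features, labels)`, so a two-way unpack of an enumerated element would put the index in `features`. The sketch below uses plain Python lists as a stand-in for the dataset, under that assumption; the target values are placeholders, not real CCS measurements.

```python
# Stand-in data: illustrative SMILES strings and placeholder targets
features = ["CCO", "c1ccccc1"]
targets = [1.0, 2.0]  # placeholder values, not real CCS data

# Analogue of Dataset.from_tensor_slices((features, targets))
pairs = list(zip(features, targets))

# Analogue of calling .enumerate() on that dataset
indexed = list(enumerate(pairs))

# A consumer that unpacks each element into two parts sees different things
x, y = pairs[0]      # x is the SMILES string, y is the target
ix, iy = indexed[0]  # ix is the integer index, iy is the whole (SMILES, target) pair

print("without enumerate:", x, y)
print("with enumerate:   ", ix, iy)
```

If this is what happens inside the trainer, the model would receive the index where it expects token IDs, which might explain why `input_shape` has no second dimension.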
Any help you can give me would be greatly appreciated. Thank you in advance.