IndexError: list index out of range

Hi,

So I’m currently trying to train a model in TensorFlow on SMILES, a text notation that encodes the structure of a molecule as a string of characters (for example, ethanol is CCO). I thought a transformer would work well because the meaning of each character in a SMILES string depends heavily on its context.

I am currently having an issue running the model. This is the error I get:

C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:653 distributed_training_steps  *
    self.args.strategy.run(self.apply_gradients, batch)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:618 apply_gradients  *
    gradients = self.training_step(features, labels)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:601 training_step  *
    per_example_loss, _ = self.run_model(features, labels, True)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:682 run_model  *
    outputs = self.model(features, labels=labels, training=training)[:2]
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:1127 call  *
    outputs = self.bert(
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:615 call  *
    embedding_output = self.embeddings(input_ids, position_ids, token_type_ids, inputs_embeds, training=training)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:191 call  *
    return self._embedding(input_ids, position_ids, token_type_ids, inputs_embeds, training=training)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:206 _embedding  *
    seq_length = input_shape[1]

IndexError: list index out of range

This error is raised when I run trainer.train()

I have checked to make sure there are no zero values in my data. The dataset just contains the SMILES string as the feature and the CCS (Collisional Cross Section), a float, as the target.

This is my code:

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import TFBertForSequenceClassification, TFTrainer, TFTrainingArguments

# numpyArrayOfData is loaded earlier (not shown)
dataFrameData = pd.DataFrame(numpyArrayOfData, columns=['CAS', 'CCS', 'Compound', 'Adducts', 'Mass', 'SMILES']).iloc[1:]

# splitting the data into training and testing sets
train, test = train_test_split(dataFrameData, test_size=0.2)

# getting the CCS as the target from the data
targetTrain = train.pop('CCS').astype(float)
targetTest = test.pop('CCS').astype(float)

# getting the SMILES as the feature from the data
dfStrippedTrain = train[['SMILES']].copy()
dfStrippedTest = test[['SMILES']].copy()

# compiling these into training and test datasets, with enumerate for indexing
dataSetTrain = tf.data.Dataset.from_tensor_slices((dfStrippedTrain.values, targetTrain.values)).enumerate()

dataSetTest = tf.data.Dataset.from_tensor_slices((dfStrippedTest.values, targetTest.values)).enumerate()

model = TFBertForSequenceClassification.from_pretrained("bert-large-cased")

training_args = TFTrainingArguments(
    output_dir='/results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    logging_dir='/logs'
)

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=dataSetTrain,
    eval_dataset=dataSetTest
)

trainer.train()
trainer.evaluate()

Any help you can give me would be greatly appreciated. Thank you in advance.

Hi,

What does your SMILES data look like? Have you considered tokenizing it?
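
For example, something like this minimal sketch, using the tokenizer that matches your bert-large-cased checkpoint (CCO, ethanol, is just an example SMILES):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")

# "CCO" is the SMILES for ethanol, used here only as an illustration
encoded = tokenizer("CCO")
print(encoded["input_ids"])  # integer token ids the model can embed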


Hi,

Ahhh, thanks for the suggestion. I have looked at the SMILES data more closely and I think it’s stored as bytes instead of a string:

array([b'FC(C1=CC(C(C2=C(S3)C=CC=C2)=CCCN4CCNCC4)=C3C=C1)(F)F'],
dtype=object)>, <tf.Tensor: shape=(), dtype=float64, numpy=186.0>)

I’m going to convert them into strings and see if that helps.
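
Something like this one-liner (though from what I’ve read, TensorFlow exposes tf.string tensors as bytes in Python anyway, so this alone might not fix it):

# decode the raw bytes value from the dataset back to a Python string
smiles_bytes = b'FC(C1=CC(C(C2=C(S3)C=CC=C2)=CCCN4CCNCC4)=C3C=C1)(F)F'
smiles_str = smiles_bytes.decode('utf-8')
print(smiles_str)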

If this doesn’t work I’ll have a look at tokenization. When would tokenization need to be used?

EDIT:
I’ve done some more reading and now understand that the error is probably because the input data needs to be tokenized and padded (the index error is likely due to the passed sequences not being of constant length).
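
Here’s a rough sketch of what I think the fix looks like. The max_length of 128 and the bytes handling are my guesses, not something I’ve tested yet:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")

# decode any bytes back to str, then tokenize with padding so every
# sequence has the same fixed length (max_length=128 is an arbitrary choice)
train_smiles = [
    s.decode('utf-8') if isinstance(s, bytes) else s
    for s in dfStrippedTrain['SMILES'].values
]
train_encodings = tokenizer(
    train_smiles,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors='tf',
)

# TFTrainer expects (features_dict, label) pairs, not raw strings
dataSetTrain = tf.data.Dataset.from_tensor_slices(
    (dict(train_encodings), targetTrain.values)
)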

Thank you
A