IndexError: list index out of range

Hi,

So I’m currently trying to train a model in TensorFlow on SMILES, a text notation that encodes the structure of a molecule as a string of characters (for example, ethanol is CCO). I thought a transformer would work well because the meaning of each character in a SMILES string depends heavily on its context.

I am currently having an issue running the model. This is the error I get:

C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:653 distributed_training_steps  *
    self.args.strategy.run(self.apply_gradients, batch)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:618 apply_gradients  *
    gradients = self.training_step(features, labels)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:601 training_step  *
    per_example_loss, _ = self.run_model(features, labels, True)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\trainer_tf.py:682 run_model  *
    outputs = self.model(features, labels=labels, training=training)[:2]
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:1127 call  *
    outputs = self.bert(
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:615 call  *
    embedding_output = self.embeddings(input_ids, position_ids, token_type_ids, inputs_embeds, training=training)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:191 call  *
    return self._embedding(input_ids, position_ids, token_type_ids, inputs_embeds, training=training)
C:\ProgramData\Anaconda3\envs\tf-transformer\lib\site-packages\transformers\modeling_tf_bert.py:206 _embedding  *
    seq_length = input_shape[1]

IndexError: list index out of range

This error is raised when I run trainer.train()

I have checked to make sure there are no zero values in my data. The dataset just contains the SMILES string as the feature and the CCS (Collisional Cross Section), a float, as the target.

This is my code:

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import TFBertForSequenceClassification, TFTrainer, TFTrainingArguments

# numpyArrayOfData is loaded earlier (not shown)
dataFrameData = pd.DataFrame(numpyArrayOfData, columns=['CAS', 'CCS', 'Compound', 'Adducts', 'Mass', 'SMILES']).iloc[1:]

# splitting the data into training and testing sets
train, test = train_test_split(dataFrameData, test_size=0.2)

# getting the CCS as the target from the data
targetTrain = train.pop('CCS').astype(float)
targetTest = test.pop('CCS').astype(float)

# getting the SMILES as the feature from the data
dfStrippedTrain = train[['SMILES']].copy()
dfStrippedTest = test[['SMILES']].copy()

# compiling these into training and test datasets, with enumerate for indexing
dataSetTrain = tf.data.Dataset.from_tensor_slices((dfStrippedTrain.values, targetTrain.values)).enumerate()

dataSetTest = tf.data.Dataset.from_tensor_slices((dfStrippedTest.values, targetTest.values)).enumerate()

model = TFBertForSequenceClassification.from_pretrained("bert-large-cased")

training_args = TFTrainingArguments(
    output_dir='/results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    logging_dir='/logs'
)

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=dataSetTrain,
    eval_dataset=dataSetTest
)

trainer.train()
trainer.evaluate()

Any help you can give me would be greatly appreciated. Thank you in advance.

Hi,

What does your SMILES data look like? Have you considered tokenizing it?
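
For example, something like this minimal sketch, using the tokenizer that matches your bert-large-cased checkpoint (CCO, ethanol, is just an example SMILES):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")

# "CCO" is the SMILES for ethanol, used here only as an illustration
encoded = tokenizer("CCO")
print(encoded["input_ids"])  # integer token ids the model can embed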


Hi,

Ahhh, thanks for the suggestion. I have looked at the SMILES data more closely and I think it’s stored as bytes instead of a string:

array([b'FC(C1=CC(C(C2=C(S3)C=CC=C2)=CCCN4CCNCC4)=C3C=C1)(F)F'],
dtype=object)>, <tf.Tensor: shape=(), dtype=float64, numpy=186.0>)

I’m going to convert them into strings and see if that helps.
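
Something like this one-liner (though from what I’ve read, TensorFlow exposes tf.string tensors as bytes in Python anyway, so this alone might not fix it):

# decode the raw bytes value from the dataset back to a Python string
smiles_bytes = b'FC(C1=CC(C(C2=C(S3)C=CC=C2)=CCCN4CCNCC4)=C3C=C1)(F)F'
smiles_str = smiles_bytes.decode('utf-8')
print(smiles_str)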

If this doesn’t work I’ll have a look at tokenization. When would tokenization need to be used?

EDIT:
I’ve done some more reading and now understand that the error is probably because the input data needs to be tokenized and padded (the index error is likely due to the passed sequences not being of constant length).
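
Here’s a rough sketch of what I think the fix looks like. The max_length of 128 and the bytes handling are my guesses, not something I’ve tested yet:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")

# decode any bytes back to str, then tokenize with padding so every
# sequence has the same fixed length (max_length=128 is an arbitrary choice)
train_smiles = [
    s.decode('utf-8') if isinstance(s, bytes) else s
    for s in dfStrippedTrain['SMILES'].values
]
train_encodings = tokenizer(
    train_smiles,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors='tf',
)

# TFTrainer expects (features_dict, label) pairs, not raw strings
dataSetTrain = tf.data.Dataset.from_tensor_slices(
    (dict(train_encodings), targetTrain.values)
)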

Thank you
A