Error when doing tokenization

Hello,

I have a dataset of documents stored in a JSONL file, where each document is an entry with an id field and a contents field. The contents field holds all the information about the document, distributed over 5 fields separated by '\n'. For instance:

{"id": "NCT03538132", "contents": "Patients' Perception on Bone Grafts\nPatients' Perception on Bone Biomaterials Used in Dentistry : a Multicentric Study\nThe goal of this study is to collect the patients' opinion about the different types of bone graft, to assess which are the most rejected by the patients and if the demographic variables (such as the gender or the age) and the level of education influence their decision.\nNowadays, many procedures may need regenerative techniques. Some studies have already assessed the patients' opinion regarding soft tissue grafts, some investigators have centered their studies on the techniques' efficiency without assessing the patient's perception.\nInclusion criteria: - Adult (18 years old or more) - Able to read and write - Not under the influence of alcohol or drugs - Had not previously undergone any surgery involving bone graft or bone augmentation. Exclusion criteria: - Any patient who doesn't fullfill the inclusion criterias."}
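
If it helps, this is how I read one document: I split contents on '\n' and map the 5 parts, in order, to the field names I use in the code further down (brief_title, official_title, brief_summary, detailed_description, criteria). Just a sketch of my understanding of the format:

import json

field_names = ["brief_title", "official_title", "brief_summary",
               "detailed_description", "criteria"]

with open("documents.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        # contents holds the 5 fields, one per '\n'-separated part
        parts = doc["contents"].split("\n")
        fields = dict(zip(field_names, parts))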

For each document I would like to know how many tokens are generated for each field and plot the distribution of token counts per field. I can only get the tokens for about 66% of my dataset; after that I get this error:

line 145, in <module>
    inputs = tokenizer(
  File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2413, in __call__
    return self.batch_encode_plus(
  File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2598, in batch_encode_plus
    return self._batch_encode_plus(
  File "/usr/local/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 439, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range

I saved the documents that were breaking to a log file and tried running again with just those few documents, and it still breaks on some of them. I also tried changing the order of the documents, putting the ones that were breaking at the beginning: it then worked for those but broke on other documents. Can anyone help me?
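
In case it helps, looking at the traceback, the error comes from the fast tokenizer while it is assembling the batch, and I seem to get the same IndexError when I call the tokenizer with an empty list, so my current guess is that some documents end up with an empty value for one of the fields. This is a quick check I am running with the same collection_iterator as in my code below to look for such documents (just a sketch; I am assuming batch_info[field] is a list of strings, one per document in the batch):

# Calling the tokenizer with an empty batch reproduces the error for me:
# tokenizer([], padding='longest', return_tensors='pt')  # -> IndexError: list index out of range

# Look for documents where a field value is empty or whitespace only
for index, batch_info in enumerate(collection_iterator(1, 0, 1)):
    for field in collection_iterator.fields:
        texts = batch_info[field]
        if not texts or any(t is None or not str(t).strip() for t in texts):
            print(batch_info["id"][0], field, repr(texts))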

This is my code:

import json

# tokenizer (a Hugging Face fast tokenizer), JsonlCollectionIterator and
# ct_utils are imported/created earlier in the script (not shown here).

collection_iterator = JsonlCollectionIterator(
    f'{ct_utils.paths["input_dir"]}/documents.jsonl',
    fields=["brief_title", "official_title", "brief_summary",
            "detailed_description", "criteria"]
)

fields_tokens = {
    "brief_title": [],
    "official_title": [],
    "brief_summary": [],
    "detailed_description": [],
    "criteria": []
}

log_file = open('documents_log.txt', 'w')

# one document per batch (batch size 1, single shard)
for index, batch_info in enumerate(collection_iterator(1, 0, 1)):

    for field in collection_iterator.fields:

        # try:
        inputs = tokenizer(
            batch_info[field],
            padding='longest',
            truncation=False,
            add_special_tokens=True,
            return_tensors='pt'
        )
        # except:
        #     log_file.write(batch_info["id"][0] + "\n")

        fields_tokens[field].append({
            'index': index + 1,
            'document_id': batch_info['id'][0],
            'tokens_count': inputs["input_ids"].shape[1]
        })

# log_file.close()

with open("fields_tokens.json", "w") as file:
    json.dump(fields_tokens, file, indent=4)

I didn't pass a max length and set truncation to False in the tokenizer because I want it to generate the tokens for the complete text.
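
And for reference, this is roughly how I intend to plot the distribution of token counts per field once the counts are collected (just a sketch with matplotlib, reading the fields_tokens.json written above):

import json
import matplotlib.pyplot as plt

with open("fields_tokens.json") as f:
    fields_tokens = json.load(f)

# One histogram per field, overlaid
for field, entries in fields_tokens.items():
    counts = [entry["tokens_count"] for entry in entries]
    plt.hist(counts, bins=50, alpha=0.5, label=field)

plt.xlabel("number of tokens")
plt.ylabel("number of documents")
plt.legend()
plt.show()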