I'm getting the error in the topic title when I try to run the datasets .map()
function.
I'm preprocessing ~3,000 large PDF documents that have been converted into pages of images, words, and bboxes. .map()
is being called as follows:
dataset = Dataset.from_pandas(data)
features = Features({
    'image': Array3D(dtype="int64", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'token_type_ids': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'label': ClassLabel(num_classes=len(unique_labels), names=unique_labels),
})
encoded_dataset = dataset.map(preprocess_data, features=features)
and the preprocess_data
function looks like:
def preprocess_data(examples):
    directory = os.path.join(FILES_PATH, examples['file_location'])
    images_dir = os.path.join(directory, PDF_IMAGE_DIR)
    textract_response_path = os.path.join(directory, 'textract_response.json')
    doc_meta_path = os.path.join(directory, 'pdf_meta.json')
    textract_document = get_textract_document(textract_response_path, doc_meta_path)
    images, words, bboxes = get_doc_training_data(images_dir, textract_document)
    encoded_inputs = PROCESSOR(images, words, boxes=bboxes, padding="max_length", truncation=True)
    # https://github.com/NielsRogge/Transformers-Tutorials/issues/36
    encoded_inputs["image"] = np.array(encoded_inputs["image"])
    encoded_inputs["label"] = examples['label_id']
    return encoded_inputs
This setup worked fine for the first 998/2933 instances; on instance 999 I first got this error:
ArrowInvalid: Can only convert 1-dimensional array values
which I fixed by adding:
encoded_inputs["image"] = np.array(encoded_inputs["image"])
As specified in ArrowInvalid: Can only convert 1-dimensional array values · Issue #36 · NielsRogge/Transformers-Tutorials · GitHub
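For context, here is a minimal illustration (with a hypothetical toy batch, not my real data) of why the np.array conversion helps: Arrow can't infer a fixed multi-dimensional shape from plain nested Python lists, but a NumPy array carries an explicit shape that matches the Array3D feature.

```python
import numpy as np

# A toy "batch" of 2 images as nested Python lists, the kind of structure
# the processor returns before the np.array conversion.
images_as_lists = [[[[0] * 224] * 224] * 3 for _ in range(2)]

# Converting to a NumPy array gives Arrow a fixed (batch, 3, 224, 224) shape.
images_as_array = np.array(images_as_lists, dtype=np.int64)
print(images_as_array.shape)  # (2, 3, 224, 224)
```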
However, on the exact same instance (999) I now get:
OverflowError: There was an overflow with type <class 'list'>. Try to reduce writer_batch_size to have batches smaller than 2GB.
(offset overflow while concatenating arrays)
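If it's relevant, a quick back-of-the-envelope check (assuming .map()'s default writer_batch_size of 1000 and the int64 image feature above) suggests the image column alone gets close to the 2 GB limit the error mentions:

```python
# Rough size of one default writer batch (writer_batch_size=1000) for the
# "image" column alone, with int64 pixels as in the Array3D feature above.
bytes_per_image = 3 * 224 * 224 * 8      # int64 = 8 bytes per element
batch_bytes = 1000 * bytes_per_image
print(f"{batch_bytes / 2**30:.2f} GiB")  # ~1.12 GiB before the other columns
```

So, as the error message itself suggests, passing something like dataset.map(preprocess_data, features=features, writer_batch_size=100) should keep each Arrow write well under the limit (100 is just an illustrative value).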
Any idea what is going on here?