DocBank dataset for fine-tuning a Hugging Face pre-trained model

Hi,

I’m looking for advice on how to do the following.

  • I have a dataset called DocBank (DocBank Website)
  • The annotation is available in MS-COCO format
  • There is a dataloader available that exposes the following attributes on each example:

example.filepath # The image filepath
example.pagesize # The image size
example.words # The tokens
example.bboxes # The normalized bboxes
example.rgbs # The RGB values
example.fontnames # The fontnames
example.structures # The structure labels

Basically, I want to fine-tune a model much like a user did here (Fine-tuning Notebook), but on the DocBank dataset instead.

Any help pointing me in the right direction on how to create a Hugging Face-compatible dataset would be much appreciated.

Hi!

To create a dataset similar to the one in the linked notebook, you need to write a loading script. You can find the instructions on how to do that here: Create a dataset loading script — datasets 1.18.3 documentation, and the loading script of the linked dataset here: funsd.py · nielsr/funsd at main.
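As a rough starting point, the core of such a loading script is a generator that turns each DocBank example into a plain dictionary. The sketch below is only an illustration under some assumptions: `DocBankExample` is a hypothetical stand-in for the objects your dataloader returns (using the attribute names from your post), the output field names (`id`, `words`, `bboxes`, `ner_tags`, `image_path`) are illustrative and loosely modeled on the FUNSD-style layout, and the helper names `build_label_map` and `generate_examples` are made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class DocBankExample:
    """Hypothetical stand-in for one example from the DocBank dataloader.

    Only the attributes used below are modeled; the real loader exposes more
    (pagesize, rgbs, fontnames).
    """
    filepath: str
    words: List[str]
    bboxes: List[List[int]]
    structures: List[str]


def build_label_map(examples: Iterable[DocBankExample]) -> Dict[str, int]:
    """Assign an integer id to every structure label, in sorted order."""
    labels = sorted({label for ex in examples for label in ex.structures})
    return {label: idx for idx, label in enumerate(labels)}


def generate_examples(
    examples: Iterable[DocBankExample], label2id: Dict[str, int]
) -> Iterator[Tuple[int, dict]]:
    """Yield (key, record) pairs, one record per page image."""
    for guid, ex in enumerate(examples):
        yield guid, {
            "id": str(guid),
            "words": list(ex.words),
            # Assumes the bboxes are already normalized, as stated in the post.
            "bboxes": [list(box) for box in ex.bboxes],
            "ner_tags": [label2id[label] for label in ex.structures],
            "image_path": ex.filepath,
        }


if __name__ == "__main__":
    # Made-up toy data, just to show the shape of the output.
    toy = [
        DocBankExample(
            filepath="page_0.jpg",
            words=["Deep", "Learning"],
            bboxes=[[10, 10, 50, 20], [55, 10, 120, 20]],
            structures=["title", "title"],
        )
    ]
    label2id = build_label_map(toy)
    for key, record in generate_examples(toy, label2id):
        print(key, record)
```

In an actual loading script, the body of `generate_examples` would go into the `_generate_examples` method of a `datasets.GeneratorBasedBuilder` subclass, with the matching feature types declared in `_info`. The linked funsd.py shows the exact structure to copy.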