DocBank dataset for fine-tuning huggingface pre-trained model

crazybird · March 2, 2022, 3:29pm

Hi,

I’m looking for advice on how to do the following.

I have a dataset called DocBank (DocBank Website).
The annotations are available in MS-COCO format.
There is a dataloader available whose examples expose the following attributes:

example.filepath # The image filepath
example.pagesize # The image size
example.words # The tokens
example.bboxes # The normalized bboxes
example.rgbs # The RGB values
example.fontnames # The fontnames
example.structures # The structure labels

Basically, I want to fine-tune a pre-trained model much like a user did here (Fine-tuning Notebook), but on the DocBank dataset instead.

Any help pointing me in the correct direction for what I need to create a “huggingface” compatible dataset would be much appreciated.

mariosasko · March 4, 2022, 1:48pm

Hi!

To create a dataset similar to the one in the linked notebook, you need to write a loading script. You can find the instructions on how to do that here: Create a dataset loading script — datasets 1.18.3 documentation, and the loading script of the linked dataset here: funsd.py · nielsr/funsd at main.
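As a rough illustration of the data such a loading script would need to yield, here is a minimal sketch that maps DocBank-style examples onto per-page records. It is not the DocBank loader or the FUNSD script: `FakeExample` is a hypothetical stand-in that only mimics the `filepath`, `words`, `bboxes` and `structures` attributes listed in the question, the output field names (`id`, `words`, `bboxes`, `ner_tags`, `image_path`) are an assumption loosely modelled on the FUNSD example, and the label list is derived from the data rather than hard-coded.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FakeExample:
    """Hypothetical stand-in exposing the attributes listed in the question."""
    filepath: str
    words: List[str]
    bboxes: List[Tuple[int, int, int, int]]
    structures: List[str]


def build_label_map(examples) -> Dict[str, int]:
    # Collect every structure label seen in the data and give it a stable id.
    labels = sorted({label for ex in examples for label in ex.structures})
    return {label: i for i, label in enumerate(labels)}


def to_records(examples, label2id: Dict[str, int]) -> List[dict]:
    # One record per page: tokens, boxes and integer labels stay aligned.
    records = []
    for idx, ex in enumerate(examples):
        if not (len(ex.words) == len(ex.bboxes) == len(ex.structures)):
            raise ValueError(f"misaligned annotations in {ex.filepath}")
        records.append({
            "id": str(idx),
            "words": list(ex.words),
            "bboxes": [list(box) for box in ex.bboxes],
            "ner_tags": [label2id[s] for s in ex.structures],
            "image_path": ex.filepath,
        })
    return records


def to_hf_dataset(records: List[dict]):
    # Optional: needs the `datasets` package; builds an in-memory Dataset
    # from column lists instead of going through a loading script.
    from datasets import Dataset
    columns = {key: [r[key] for r in records] for key in records[0]}
    return Dataset.from_dict(columns)


# Made-up sample data, for illustration only.
examples = [
    FakeExample("page_0.jpg", ["Deep", "Learning"],
                [(10, 10, 60, 30), (65, 10, 140, 30)], ["title", "title"]),
    FakeExample("page_1.jpg", ["Hello"], [(5, 5, 50, 20)], ["paragraph"]),
]
label2id = build_label_map(examples)
records = to_records(examples, label2id)
```

In a loading script, the same dictionaries would typically be yielded from `_generate_examples` instead of being collected in a list, and `label2id` would be replaced by a fixed list of class names declared in the dataset features.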
