Fine-tuning Donut for full table data extraction

Hi, is it possible to train Donut for table data extraction, and if so, how would one build the metadata.jsonl gt_parse to include rowspans and colspans?
I want to extract all rows/columns of all tables in the image.

For example, this table:

Is this format allowed, or is it a better option to explicitly specify when a row/col spans multiple rows/cols?

{
 "table": [
  {
   "rows": [
    [
     { "0": "Day", "1": "Seminar", "2": "Seminar", "3": "Seminar" },
     { "0": "Day", "1": "Schedule", "2": "Schedule", "3": "Topic" },
     { "0": "Day", "1": "Begin", "2": "End", "3": "Topic" }
    ],
    [
     { "0": "Monday", "1": "8:00 a.m.", "2": "5:00 p.m.", "3": "Introduction to XML" },
     { "0": "Monday", "1": "8:00 a.m.", "2": "5:00 p.m.", "3": "Validity DTD and Relax NG" }
    ]
   ]
  }
 ],
...
}
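An alternative I've been considering (just a sketch, not validated against Donut training): store each cell once with explicit "rowspan"/"colspan" fields instead of repeating spanned values in every slot, and serialize that into a metadata.jsonl line. The file name and field names here are my own invention:

```python
import json

# Hypothetical gt_parse: one entry per physical cell, with explicit spans,
# rather than duplicating "Day"/"Seminar" across every spanned slot.
table = {
    "table": [
        {
            "rows": [
                [
                    {"text": "Day", "rowspan": 3, "colspan": 1},
                    {"text": "Seminar", "rowspan": 1, "colspan": 3},
                ],
                [
                    {"text": "Schedule", "rowspan": 1, "colspan": 2},
                    {"text": "Topic", "rowspan": 2, "colspan": 1},
                ],
                [
                    {"text": "Begin"},
                    {"text": "End"},
                ],
            ]
        }
    ]
}

# One metadata.jsonl line: "ground_truth" is a JSON *string* containing gt_parse.
line = json.dumps({
    "file_name": "table_0.jpg",  # hypothetical file name
    "ground_truth": json.dumps({"gt_parse": table}),
})
print(line)
```

The trade-off is that the decoder must learn to emit span counts as numbers, but the annotation stays compact and unambiguous for merged cells.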

Thanks

Hi, did you build this model? How did it turn out? What accuracy do you get on tables? I searched around and found suggestions that for tables it helps to include bounding boxes with the cell values in the Donut dataset, but I haven't tested that. Do you have any experience with it by now?

I've done some trials using PubTables-1M with a structure like this:
{"file_name": "PMC1064074_table_0.jpg", "ground_truth": "{\"gt_parse\": {\"cells\": [{\"row_0_col_0\": \"Kinetic parameter\"}, {\"row_0_col_1\": \"ND\"}, {\"row_0_col_2\": \"D\"}, {\"row_0_col_3\": \"D + dn-RhoA\"}, {\"row_0_col_4\": \"D + dn-Rac1\"}, {\"row_1_col_0\": \"Vmax\"}, {\"row_1_col_1\": \"19.6 ± 0.75\"}, {\"row_1_col_2\": \"26.2 ± 0.86*\"}, {\"row_1_col_3\": \"31.3 ± 0.88†\"}, {\"row_1_col_4\": \"21.6 ± 0.9\"}, {\"row_2_col_0\": \"K’ for H+\"}, {\"row_2_col_1\": \"0.150 ± 0.02\"}, {\"row_2_col_2\": \"0.113 ± 0.05†\"}, {\"row_2_col_3\": \"0.105 ± 0.07†\"}, {\"row_2_col_4\": \"0.137 ± 0.023\"},
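In case it helps anyone reproduce this, the cells list above can be generated from a plain 2-D grid of strings. A minimal sketch (the helper name is mine, the grid is shortened, and spanned cells are not handled):

```python
import json

def cells_to_gt_parse(grid):
    """Flatten a 2-D list of cell strings into the row_i_col_j format."""
    cells = []
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            cells.append({f"row_{r}_col_{c}": text})
    return {"gt_parse": {"cells": cells}}

grid = [
    ["Kinetic parameter", "ND", "D"],
    ["Vmax", "19.6 ± 0.75", "26.2 ± 0.86*"],
]
record = {
    "file_name": "PMC1064074_table_0.jpg",
    # Donut expects ground_truth to be a JSON-encoded string.
    "ground_truth": json.dumps(cells_to_gt_parse(grid), ensure_ascii=False),
}
print(json.dumps(record, ensure_ascii=False))
```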

It works on images that contain only the table, or by first cropping the tables out of the image and running Donut on just those crops, but I'm not happy with the accuracy, and I get wrong cell coordinates for tables that contain rowspans or colspans. I'm still working on it…
Also, I only trained it on 10k images; I'll try training on the full dataset, but our GPUs are busy with another project right now…

A friend of mine had better results using pero-ocr: extracting word coordinates, assigning each word to a row and column based on those coordinates, and building a JSON from that.
It's still a WIP…
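For what it's worth, a minimal sketch of that coordinate-bucketing idea (function name, tolerance value, and sample boxes are all made up; it only groups words into rows by vertical center and sorts each row left-to-right, and does not handle spans):

```python
def words_to_grid(words, row_tol=10):
    """words: list of (x0, y0, x1, y1, text) boxes from any OCR engine.
    Groups words into rows by vertical center (within row_tol pixels),
    then orders each row left-to-right by x0."""
    rows = []  # list of (row_y_center, [word, ...])
    for w in sorted(words, key=lambda w: (w[1] + w[3]) / 2):
        yc = (w[1] + w[3]) / 2
        if rows and abs(rows[-1][0] - yc) <= row_tol:
            rows[-1][1].append(w)  # same row: close enough vertically
        else:
            rows.append((yc, [w]))  # start a new row
    return [[w[4] for w in sorted(r, key=lambda w: w[0])] for _, r in rows]

words = [
    (0, 0, 40, 10, "Day"), (50, 0, 90, 10, "Topic"),
    (0, 20, 40, 30, "Monday"), (50, 20, 120, 30, "Introduction"),
]
print(words_to_grid(words))  # [['Day', 'Topic'], ['Monday', 'Introduction']]
```

Column assignment for spanned cells would need an extra step, e.g. clustering x-ranges across all rows into column boundaries and marking cells that overlap several of them.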

Has anyone tried this? I'm eager to know whether we can build a dataset for full table extraction, and if so, what the proper format would be when cells span multiple columns and rows.