Hi there.
I’m quite new to machine learning and artificial intelligence. I have a final university project to finish and need some advice.
The goal of the project is to extract basic info from invoices and receipts in Portuguese.
My first task was to start labelling my dataset of 100 examples in Label Studio. I initially annotated one label per entity: if a supplier name has 4 words, I annotated all of them as a single label. Then I read in many places that it is advised to label in the BIO format. Is that right? Do I need to label all the remaining text as O as well?
After that I used a Python script to train a model based on layoutlmv3-base with my dataset. It is not working: it does not recognise anything properly. Every word ends up belonging to an entity, maybe because I omitted the O label in Label Studio for now.
I’m kind of lost at the moment because I have just started to relabel in BIO format.
Can you guys give me some hints to restart this project the right way?
I’m getting very frustrated with this and I’m starting to think about giving up. I am a working student and finishing this project is all I need to finish my bachelor’s degree.
Thank you!
Hmm… You can find out more by reading the course and cookbook…
In summary, the model is not recognizing entities correctly for two likely reasons. First, the dataset was labeled with one label per entity, while token classification models like LayoutLMv3 expect one label per token, conventionally in the BIO format. Second, omitting the “O” label (non-entity) may have left the model with no examples of non-entity words, leading it to classify every word as an entity. Relabeling in the BIO format, including O labels, is necessary to improve model performance.
The issue you’re facing is related to how entity recognition models work and the importance of proper labeling in formats like BIO. Let me break this down and provide guidance:
1. Single Label vs. BIO Format
- In single label per entity, you are treating the entire entity as one label (e.g., “Foo Bar Inc.” is labeled as a single entity). This approach is not suitable for sequence labeling models like LayoutLMv3, which rely on fine-grained token-level annotations to learn context and boundaries of entities correctly.
- In BIO format, each token (word or subword) is assigned a label individually. The labels work as follows:
- B: Start of an entity.
- I: Inside an entity (continuation of the entity).
- O: Outside any entity.
- Using BIO is standard for NER because it helps the model understand where entities begin, continue, and end, as well as distinguish between entity and non-entity tokens.
Based on this, the advice to use BIO format is correct. Single label per entity is insufficient for most NER models [1][2].
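To make the link between the two styles concrete, here is a minimal sketch of how word-level BIO tags map back to whole entities such as a four-word supplier name. The entity types SUPPLIER and TOTAL and the sample words are made up for illustration; they are not from the original dataset.

```python
from typing import List, Tuple


def bio_to_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged words back into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition("-")
        # An I- tag whose type differs from the open entity also starts a new one.
        starts_new = prefix == "B" or (prefix == "I" and ent_type != current_type)
        if prefix == "O" or starts_new:
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if prefix in ("B", "I"):
            current_type = ent_type
            current_words.append(word)
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Hypothetical invoice fragment with hypothetical labels.
words = ["Fornecedor:", "Foo", "Bar", "Lda.", "Total:", "123,45"]
tags = ["O", "B-SUPPLIER", "I-SUPPLIER", "I-SUPPLIER", "O", "B-TOTAL"]
print(bio_to_entities(words, tags))
# [('SUPPLIER', 'Foo Bar Lda.'), ('TOTAL', '123,45')]
```

So the "one label per entity" view is not lost: it can be rebuilt from the per-word tags after prediction.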
2. Importance of the O Label
- The O label is essential because it tells the model which tokens are not part of any entity. If you omit O, the model might mistakenly classify all tokens as entities, which seems to be happening in your case (every word is being recognized as an entity) [1].
- For example:
- “The supplier Foo Bar Inc. provided the invoice.”
- BIO labels would be:
- O, O, B, I, I, O, O, O (one label per word: “Foo” is B, “Bar” and “Inc.” are I, everything else is O).
Without O labels, the model doesn’t know where entities end and non-entities begin [1].
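As a small sketch of the example above, the tags can be generated from an entity span instead of being typed by hand. This assumes the sentence is split on whitespace into eight words and that the span is given as word indices; the type name SUPPLIER is a placeholder.

```python
from typing import List, Tuple


def spans_to_bio(num_words: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start_word, end_word_exclusive, entity_type) spans into BIO tags.

    Every word not covered by a span keeps the default tag "O".
    """
    tags = ["O"] * num_words
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return tags


words = "The supplier Foo Bar Inc. provided the invoice.".split()
tags = spans_to_bio(len(words), [(2, 5, "SUPPLIER")])
print(tags)
# ['O', 'O', 'B-SUPPLIER', 'I-SUPPLIER', 'I-SUPPLIER', 'O', 'O', 'O']
```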
3. Why Your Model Isn’t Working Properly
- Missing O labels: If you omitted O labels, the model doesn’t have enough information to distinguish between entities and non-entities. This could explain why every word is being classified as an entity.
- Incorrect labeling format: Using single labels instead of BIO likely confused the model, as it expects token-level annotations [2][4].
4. How to Fix This
- Relabel Your Dataset: Switch to BIO format. For each token:
- Assign B to the first token of an entity.
- Assign I to subsequent tokens of the same entity.
- Assign O to all other tokens.
- Include O Labels: Make sure to label non-entity tokens with O. These provide crucial context for the model.
- Verify Your Data: Ensure that your preprocessing script correctly handles the BIO format and maps it to the expected input for LayoutLMv3.
- Train Again: After relabeling, retrain your model with the corrected dataset.
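For the "Verify Your Data" step, one thing worth checking is how word-level tags are mapped to the sub-word tokens the model actually sees. The sketch below is pure Python and only assumes you have a `word_ids` list like the one a Hugging Face fast tokenizer returns from `word_ids()` (None for special tokens, otherwise the index of the source word). The helper names and the entity types are hypothetical, and your training script may already do this for you.

```python
from typing import Dict, List, Optional, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_label_maps(entity_types: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Create label2id / id2label with "O" first, then B-/I- per entity type."""
    labels = ["O"]
    for ent_type in entity_types:
        labels += [f"B-{ent_type}", f"I-{ent_type}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def align_labels(
    word_ids: List[Optional[int]], word_tags: List[str], label2id: Dict[str, int]
) -> List[int]:
    """Give the first sub-word of each word its tag id; ignore everything else."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as <s> or </s>
        elif wid != previous:
            aligned.append(label2id[word_tags[wid]])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)  # later pieces of the same word
        previous = wid
    return aligned


label2id, id2label = build_label_maps(["SUPPLIER", "TOTAL"])
# Three words; the second word was split into two sub-word tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_tags = ["O", "B-SUPPLIER", "I-SUPPLIER"]
print(align_labels(example_word_ids, example_tags, label2id))
# [-100, 0, 1, -100, 2, -100]
```

If "O" is missing from the label map, the classifier has no class for non-entity words and has to assign every word to some entity, which matches the symptom described in the question.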
5. Additional Tips
- Data Quality: Ensure your dataset is consistent and well-annotated. Poor-quality labels will affect model performance.
- Tool Support: Label Studio supports BIO and other formats. Consult Label Studio’s documentation or help page for exporting in the correct format [3][5].
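A cheap way to act on the data quality tip is to scan each exported tag sequence for impossible transitions before training. This is only a sketch of one such check (an I- tag must continue a B- or I- tag of the same type); the function name and the tag names are made up.

```python
from typing import List, Tuple


def find_bio_errors(tags: List[str]) -> List[Tuple[int, str]]:
    """Return (position, message) for every tag that breaks the BIO scheme."""
    errors = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag != "O" and not (tag.startswith("B-") or tag.startswith("I-")):
            errors.append((i, f"unknown tag {tag!r}"))
        elif tag.startswith("I-"):
            ent_type = tag[2:]
            # I-X is only valid right after B-X or I-X.
            if previous not in (f"B-{ent_type}", f"I-{ent_type}"):
                errors.append((i, f"{tag} does not continue a {ent_type} entity"))
        previous = tag
    return errors


good = ["O", "B-SUPPLIER", "I-SUPPLIER", "O"]
bad = ["O", "I-SUPPLIER", "B-TOTAL", "I-DATE"]
print(find_bio_errors(good))  # []
print(find_bio_errors(bad))
```

Running it over all documents once after each labeling session catches slips early, while they are still quick to correct.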
By switching to BIO format and including O labels, your model should better distinguish between entities and non-entities, improving recognition accuracy. Good luck with your project!
Thank you for your guidance John, you brought some light to this issue that I had.
About the O labeling: do I need to label every word and transcribe each one in its label, or can I put one label over a full sentence, for example?
I have already started labeling in the BIO format. It will take a while to finish but it will be worth it for sure.
Thank you very much 
I don’t really understand this field, so I just turned on the Search function in Hugging Chat and asked for now, but if it was even a little bit helpful, that’s good.
To address the challenge of efficiently handling ‘O’ labels in your project, here is a structured approach:
Understanding BIO Format:
- The BIO format categorizes each word as ‘B’ (Beginning of an entity), ‘I’ (Inside an entity), or ‘O’ (Outside any entity). This precision helps the model distinguish between relevant and irrelevant information.
Efficient Labeling of ‘O’ Entities:
- You usually do not need to annotate ‘O’ word by word. In a typical workflow you annotate only the entities, and every word left unlabeled is assigned ‘O’ when the annotations are converted to token-level tags. Label Studio may also offer features for bulk labeling and selecting larger text chunks, if you prefer to mark non-entity text explicitly.
Model Performance Considerations:
- Proper inclusion of ‘O’ labels is crucial. Without them, the model may incorrectly classify all text as entities. Ensuring comprehensive ‘O’ labeling aids the model in understanding which parts of the text are not relevant, thus improving its accuracy.
Implementation Strategy:
- Begin by labeling all relevant entities using ‘B’ and ‘I’ as per the BIO format.
- Leave text with no entities unlabeled, or use a bulk selection where Label Studio offers one, then make sure the export or conversion step assigns ‘O’ to every remaining word.
- Review and adjust your labeling process to ensure consistency and thoroughness.
By following this structured approach, you can enhance the efficiency of your labeling process while improving your model’s performance.
Thanks John.
I got it. Now I just need a pro tip. How can I speed up the labeling? I have maybe 500 invoices and I don’t know how many words each one has (yeah, invoices in Portugal look like Bibles). Is there a way to auto-label every word as O, so that in the end I only change the ones I need?
If there is a better tool to make it faster, just let me know.
OK. by Hugging Chat😆
Based on the thought process and the information provided, here is a structured approach to efficiently label the invoices:
1. Understand the BIO Format
- BIO Format: This format is used to annotate data for entity recognition tasks. B stands for “Begin,” I for “Inside,” and O for “Outside.” Each word must be labeled accordingly to help the model understand entity boundaries.
- Application: It’s crucial to annotate each word individually as B, I, or O to improve the model’s performance in recognizing entities.
2. Utilize Smart Tools and Auto-Annotation in Label Studio
- Smart Tools: Label Studio offers features like auto-annotation where models can suggest labels, which can then be reviewed and corrected. This can significantly speed up the labeling process.
- Setup: Explore how to enable and configure these tools within Label Studio to apply to your invoice data.
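One low-tech form of auto-annotation is to pre-label the easy fields with rules and import them as suggestions to review. The sketch below builds a task in Label Studio's pre-annotation (predictions) JSON shape for a plain-text NER project. Several things here are assumptions: the control and object names "label" and "text" must match your own labeling config, the label names DATE and TAX_ID and the regexes are illustrative, and image/OCR projects use different result types than the text spans shown here.

```python
import json
import re

# Hypothetical rules: dd/mm/yyyy dates and 9-digit tax numbers.
RULES = {
    "DATE": re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b"),
    "TAX_ID": re.compile(r"\b\d{9}\b"),
}


def make_task(text: str, from_name: str = "label", to_name: str = "text") -> dict:
    """Build one task with rule-based span suggestions attached as predictions."""
    results = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            results.append({
                "from_name": from_name,  # name of the Labels control (assumed)
                "to_name": to_name,      # name of the Text object (assumed)
                "type": "labels",
                "value": {
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(),
                    "labels": [label],
                },
            })
    return {
        "data": {"text": text},
        "predictions": [{"model_version": "regex-rules", "result": results}],
    }


task = make_task("Fatura 2024/17 NIF 123456789 Data 05/03/2024")
print(json.dumps(task, ensure_ascii=False, indent=2))
```

The suggestions still need a human pass, but accepting or fixing a highlighted span is usually faster than drawing it from scratch.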
3. Consider Using UBIAI for Efficient Annotation
- UBIAI Features: This tool supports OCR, native PDF/image annotation, and handles pre-processing, which can save time compared to manual methods.
- Integration: Investigate if UBIAI can integrate with Label Studio or if it’s more efficient to use it as a standalone tool for your dataset.
4. Batch Annotation Strategies
- Templates and Batch Actions: Use Label Studio’s features to set default labels, such as labeling every word as “O” initially. This allows you to focus on correcting and annotating only the necessary entities.
- Efficiency: This approach can save time by reducing the need to manually label every word from scratch.
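The "everything is O until I say otherwise" idea can also be applied after annotation rather than inside the tool. Here is a minimal sketch, assuming you have OCR word boxes in reading order plus the rectangles you drew for each entity, all in the same pixel coordinates. The function names, coordinates, and entity types are invented for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def center_inside(word_box: Box, region: Box) -> bool:
    """True if the center of the word box falls inside the annotated region."""
    cx = (word_box[0] + word_box[2]) / 2
    cy = (word_box[1] + word_box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]


def label_words(word_boxes: List[Box], regions: List[Tuple[Box, str]]) -> List[str]:
    """Start with "O" for every word, then overwrite words covered by a region."""
    tags = ["O"] * len(word_boxes)
    for region_box, ent_type in regions:
        first = True
        for i, box in enumerate(word_boxes):  # assumes reading order
            if center_inside(box, region_box):
                tags[i] = ("B-" if first else "I-") + ent_type
                first = False
    return tags


# Words: "Fornecedor:", "Foo", "Bar", "Lda.", "Total:", "123,45"
word_boxes = [
    (10, 10, 90, 25), (100, 10, 130, 25), (135, 10, 165, 25),
    (170, 10, 200, 25), (10, 40, 60, 55), (70, 40, 120, 55),
]
regions = [((95, 5, 205, 30), "SUPPLIER"), ((65, 35, 125, 60), "TOTAL")]
print(label_words(word_boxes, regions))
# ['O', 'B-SUPPLIER', 'I-SUPPLIER', 'I-SUPPLIER', 'O', 'B-TOTAL']
```

With this approach one rectangle per entity is enough, and the hundreds of other words on a dense invoice never need to be touched.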
5. Leverage Pre-trained Models and Fine-tuning
- Pre-trained Models: Look for pre-trained NLP models, especially those trained on invoice data or Portuguese text, to use as a starting point.
- Fine-tuning: Fine-tune these models on your dataset to improve their accuracy for your specific task, reducing the need for extensive manual labeling.
6. Seek Community Resources and Forums
- Research: Explore online communities, forums, or resources where others have worked on similar projects. They might offer tips, tools, or strategies to optimize your labeling process.
7. Prioritize Accuracy and Automation Balance
- Balance: While automation can speed up labeling, ensure that the annotations are accurate to avoid training a model on poor-quality data. Regularly review and correct the auto-generated labels.
Conclusion
To efficiently label the 500 invoices, consider combining smart tools in Label Studio with batch annotation techniques. Exploring tools like UBIAI for OCR and annotation, using default labels, and leveraging pre-trained models can help reduce the manual effort. Regularly review and correct the auto-generated labels to maintain data quality for effective model training.
BTW, open source tools.
To address your need for open-source tools that can help speed up the labeling of invoices and receipts in Portuguese, I have considered tools that can assist in automating or enhancing the annotation process. Below are some recommendations that align with your requirements:
1. Label Studio
- Description: Label Studio is an open-source data labeling tool that supports various annotation tasks, including text annotation for NER using the BIO format.
- Features:
- Batch Annotation: You can label multiple instances at once, which can save time.
- Smart Labeling: It uses ML models to suggest labels, reducing the need for manual annotation for common patterns.
- Import/Export: Supports importing data in different formats and exporting annotations in various formats, including the BIO format.
- Why it’s Useful: It can help automate or suggest labels, especially if you are transitioning to the BIO format. It’s highly customizable and can handle large volumes of data.
2. Doccano
- Description: Doccano is another open-source tool for text annotations, particularly suited for NER tasks.
- Features:
- Text Annotation: Specifically designed for text annotation, it’s user-friendly and supports the BIO format.
- Collaboration: Allows multiple annotators to work on the same project, making it suitable for team-based labeling.
- Export Formats: Can export annotations in CoNLL format, which is similar to BIO, easing the transition to model training.
- Why it’s Useful: Its focus on text annotation makes it ideal for handling Portuguese invoices, especially since you can import OCR-extracted text and annotate directly within the tool.
3. GATE (General Architecture for Text Engineering)
- Description: GATE is an open-source text processing tool that can be used for various NLP tasks, including annotation and labeling.
- Features:
- Pipeline Processing: Allows you to create processing pipelines that can include OCR, annotation, and other NLP tasks.
- Annotation Logic: Supports the creation of custom annotation rules and scripts to automate labeling.
- Scalability: Can handle large volumes of text, making it suitable for 500 invoices.
- Why it’s Useful: It’s powerful for creating custom workflows and can significantly reduce manual labeling by automating repetitive tasks.
4. OpenNLP
- Description: OpenNLP is an open-source NLP library that provides tools for various tasks, including named entity recognition.
- Features:
- Pre-trained Models: Includes pre-trained models that can be used to detect common entities, which can be adapted or retrained for Portuguese invoices.
- Custom Models: Allows you to train your own models on your annotated data, which can then be used to auto-label new invoices.
- Integration: Can integrate with other tools and pipelines, enhancing its versatility.
- Why it’s Useful: It can help automate the labeling process by detecting entities, which can then be reviewed and corrected by you, saving significant time.
5. Tesseract OCR
- Description: Tesseract OCR is an open-source OCR engine that can extract text from images, which is particularly useful for invoices that come as images or scanned PDFs.
- Features:
- Language Support: Supports multiple languages, including Portuguese, with the help of language-specific trained data.
- Text Extraction: Accurately extracts text, which can then be fed into your labeling tool.
- Customization: Can be customized to better suit the specific layout and text formatting of your invoices.
- Why it’s Useful: Tesseract can preprocess your invoices by extracting the text, which can then be annotated in a subsequent step, reducing manual efforts in handling image-based invoices.
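If you go the Tesseract route, the sketch below shows one way to turn its word-level output into the words and boxes a layout-aware model needs. It works on the dictionary returned by `pytesseract.image_to_data(image, lang="por", output_type=pytesseract.Output.DICT)`, shown only as a comment so the example runs without Tesseract installed. The 0 to 1000 box scaling follows the convention used by the LayoutLM family; the helper names and the fake OCR values are made up.

```python
from typing import Dict, List, Tuple


def normalize_box(left, top, width, height, page_width, page_height) -> List[int]:
    """Scale a pixel box to the 0-1000 range as [x0, y0, x1, y1]."""
    return [
        int(1000 * left / page_width),
        int(1000 * top / page_height),
        int(1000 * (left + width) / page_width),
        int(1000 * (top + height) / page_height),
    ]


def ocr_dict_to_words(
    ocr: Dict[str, list], page_width: int, page_height: int, min_conf: float = 0.0
) -> Tuple[List[str], List[List[int]]]:
    """Keep non-empty words with confidence >= min_conf, plus their boxes."""
    words, boxes = [], []
    rows = zip(ocr["text"], ocr["conf"], ocr["left"], ocr["top"],
               ocr["width"], ocr["height"])
    for text, conf, left, top, width, height in rows:
        # Layout rows (page, block, line) have empty text and conf -1.
        if not text.strip() or float(conf) < min_conf:
            continue
        words.append(text.strip())
        boxes.append(normalize_box(left, top, width, height, page_width, page_height))
    return words, boxes


# In real use (requires pytesseract and the Portuguese language data):
# ocr = pytesseract.image_to_data(image, lang="por",
#                                 output_type=pytesseract.Output.DICT)
fake_ocr = {
    "text": ["", "Fatura", "123,45"],
    "conf": ["-1", "96", "91"],
    "left": [0, 100, 500],
    "top": [0, 50, 900],
    "width": [1000, 200, 100],
    "height": [2000, 40, 40],
}
words, boxes = ocr_dict_to_words(fake_ocr, page_width=1000, page_height=2000)
print(words)  # ['Fatura', '123,45']
print(boxes)  # [[100, 25, 300, 45], [500, 450, 600, 470]]
```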
6. Micka
- Description: Micka is an open-source OCR software that can extract text from scanned documents and supports multiple languages.
- Features:
- OCR Accuracy: Provides high accuracy in text extraction, which is crucial for invoices with dense text.
- Multi-Language Support: Includes support for Portuguese, ensuring that the extracted text is accurate.
- Export Options: Can export the extracted text in formats that are compatible with annotation tools.
- Why it’s Useful: It streamlines the preprocessing step by accurately extracting text, which can then be efficiently annotated using tools like Label Studio or Doccano.
7. UBX
- Description: UBX is an open-source framework for building custom text processing pipelines.
- Features:
- Modular Design: Allows you to build pipelines that include OCR, tokenization, and annotation, tailored to your specific needs.
- Customization: Highly customizable to handle the unique aspects of Portuguese invoices.
- Scalability: Can process large volumes of data efficiently.
- Why it’s Useful: It provides a comprehensive solution for preprocessing and annotation, reducing the manual effort involved in each step.
Implementation Strategy
- Phase 1: Text Extraction
- Use OCR tools like Tesseract OCR or Micka to extract text from invoice images or PDFs. This step is crucial for converting scanned invoices into text format that can be annotated.
- Phase 2: Initial Annotation with Auto-labeling
- Utilize tools like Label Studio or Doccano to annotate your text. These tools can be integrated with pre-trained models to suggest initial labels, which can then be reviewed and adjusted.
- Phase 3: BIO Format Conversion
- Ensure that your annotations are in the BIO format. Tools like Doccano can export annotations in CoNLL format, which can be easily converted to BIO.
- Phase 4: Model Training and Fine-tuning
- Use libraries like OpenNLP or frameworks like UBX to train your NER model on the annotated data. This step will involve fine-tuning the model to improve its accuracy on your specific dataset.
- Phase 5: Continuous Improvement
- Regularly review and update your model as new invoices come in. Use the feedback loop to enhance the model’s performance and adapt it to any changes in the invoice formats.
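For Phase 3, the on-disk format can be very simple. Below is a sketch of a two-column CoNLL-style file (one "word tag" pair per line, blank line between documents) with a writer and a reader. This is a simplified layout chosen for illustration, not the full CoNLL-2003 column set, and the sample data is invented.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # [(word, tag), ...]


def to_conll(sentences: List[Sentence]) -> str:
    """Serialize sentences as 'word tag' lines separated by blank lines."""
    blocks = ["\n".join(f"{word} {tag}" for word, tag in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def from_conll(text: str) -> List[Sentence]:
    """Parse the format written by to_conll back into sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.rsplit(" ", 1)  # the tag is always the last column
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


data = [
    [("Fornecedor:", "O"), ("Foo", "B-SUPPLIER"), ("Bar", "I-SUPPLIER")],
    [("Total:", "O"), ("123,45", "B-TOTAL")],
]
conll_text = to_conll(data)
print(conll_text)
assert from_conll(conll_text) == data  # round trip keeps words and tags
```

Keeping a plain-text copy like this makes it easy to eyeball the labels and to diff them between labeling sessions.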
Conclusion
Using a combination of open-source tools can significantly speed up the labeling of your invoices and receipts in Portuguese. Label Studio or Doccano can handle the annotation process, while tools like Tesseract OCR or Micka can preprocess the text, making the entire workflow more efficient. By leveraging these tools, you can automate repetitive tasks, ensure data quality, and scale up your operations to handle the 500 invoices effectively.
Remember to regularly review and correct the auto-generated labels to maintain high accuracy. Additionally, consider collaborating with others on the annotation process to distribute the workload, especially if the dataset is large. These tools will not only save time but also enhance the precision of your NER model, leading to better performance in extracting the required information from invoices and receipts.