Creating a Cyrillic(Bulgarian) Handwritten OCR Dataset - Guidance needed

Hello! I am at the start of creating an OCR dataset for handwritten Bulgarian text, a Cyrillic script. After some back and forth with LLMs and searching for existing HTR dataset guidelines and best practices, I’ve settled on the following setup:

  • LabelStudio running from a docker container on my Windows laptop.

    • Labeling Interface
      <Rectangle name="bbox" toName="image" strokeWidth="3"/>
      
      <Header value="Transcription"/>
      <TextArea name="transcription" toName="image" editable="true" perRegion="true" required="true" rows="5" placeholder="Type Cyrillic transcription here..."/>
      
  • Text is labeled at the line level with a simple bounding box that I aim to keep mostly covering the current row. I split a line into multiple boxes if it is too curved

  • Illegible words are not included in a bounding box

Sample image and transcription:

The dataset for now includes about 500 images of student notebooks, and I am in the process of acquiring more. It covers about 15 writing styles (ranging from very pretty and legible to the thing above), and the images are of varying quality, mostly on the higher-quality side.

I have a few questions on which I need guidance:

  • Do I have to switch to a finer-grained polygon method?
  • Some of the images contain English text at times. I have just transcribed it as written: “normal transcription english normal transcription”. Are there issues with this?
  • Some of the images contain more advanced notation such as x^2 or similar formulas. I have currently just skipped those

Considering the amount of effort it will take:

  • Is such a dataset even worth creating?

    • It is quite niche - Bulgarian, and handwriting at that
    • LLMs with vision capabilities manage to transcribe accurately at times. But in the non-perfect scenarios they produce something that fits the surrounding words, yet it is incorrect most of the time.

I have tried to utilise Russian HTR models provided by Transkribus, for example, but their results are very much unusable. I also spent an evening setting up EasyOCR in the LabelStudio docker container, though the bounding boxes from that weren’t helpful at all. That is most definitely a me issue, though.

Thanks in advance for your time and answers!


I think it is worth creating, too.


Your setup is a good version 1.

The right mental model is not “build a perfect OCR system for everything Bulgarian handwritten.” It is “build a line-level handwritten text recognition dataset for Bulgarian student notebooks.” That is a much more realistic target, and it matches how current HTR work is still organized: line-level recognition remains a core task, while page-level layout and end-to-end document reading are treated as a harder, separate layer. (arXiv)

Overall answer

My answer to your main questions is:

  • No, you do not need to switch to polygons now.
  • Occasional English is fine. Keep it as written.
  • Do not skip all notation. Keep simple inline forms like x^2, but exclude complex math in v1.
  • Yes, the dataset is worth creating.
  • The poor results from Russian models and EasyOCR are not surprising.

That is the shortest correct answer. The reasons matter.

1. Your current annotation design is basically correct

You are annotating line by line, with one box and one transcription. That is a standard and sensible choice for a first HTR dataset. Kraken’s training documentation makes the separation very explicit: segmentation is about finding lines and regions on the page, while recognition is about converting line images into text. For recognition, the target unit is the text contained in a line. (Kraken)

That matters because it means your boxes are not a compromise that ruins the project. They are a practical way to create the exact supervision signal a recognizer needs.

Your current strategy also fits the way the recent HTR survey frames the field. It explicitly distinguishes between work up to line level and work beyond line level. Your project belongs in the first category, and that is a good place to start. (arXiv)
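To make that concrete: the line-level labels convert directly into recognizer training pairs. The sketch below turns a Label Studio JSON export into cropped line images plus their transcriptions, assuming the "bbox"/"transcription" names from the config above, Label Studio's percent-based rectangle coordinates, and the standard export field names (`file_upload`, `annotations`, `result`); check these against your actual export before relying on it.

```python
import json

def to_pixel_box(value, img_w, img_h):
    """Label Studio stores rectangle coordinates as percentages of the image."""
    x0 = value["x"] * img_w / 100.0
    y0 = value["y"] * img_h / 100.0
    return (x0, y0,
            x0 + value["width"] * img_w / 100.0,
            y0 + value["height"] * img_h / 100.0)

def export_lines(export_path, image_dir, out_dir):
    """Yield (crop_filename, transcription) pairs and save the crops."""
    from PIL import Image  # Pillow; needed only for the actual cropping
    with open(export_path, encoding="utf-8") as f:
        tasks = json.load(f)
    for task in tasks:
        img = Image.open(f"{image_dir}/{task['file_upload']}")
        # Results belonging to one region (rectangle + text area) share an id.
        regions = {}
        for res in task["annotations"][0]["result"]:
            regions.setdefault(res["id"], {}).update(res["value"])
        for i, region in enumerate(regions.values()):
            if "text" not in region:
                continue  # box drawn but never transcribed
            crop = img.crop(to_pixel_box(region, *img.size))
            name = f"{task['id']}_{i:03d}.png"
            crop.save(f"{out_dir}/{name}")
            yield name, region["text"][0]
```

The point is that rectangles plus per-region text areas already contain everything a line recognizer needs; no richer geometry is required to get training data out.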

2. Do you need polygons or baselines?

Not yet.

You only need to move from rectangles to polygons or baselines when layout becomes the bottleneck. The recent line-segmentation survey is useful here because it explains the different representations clearly: text lines may be represented by bounding boxes, polygons, or baselines, and the right choice depends on the extraction problem, not on abstract purity. It also stresses that text-line extraction matters because it affects downstream HTR accuracy. (Springer Link)

For your case, rectangles are enough when:

  • one box mostly contains one line
  • neighboring lines do not overlap too much
  • ascenders and descenders are not being chopped off
  • the line is not so curved that the crop includes too much irrelevant text

So I would keep your current approach for most pages.

I would only introduce polygons or baselines for a hard subset where one of these keeps happening:

  • the line is strongly curved
  • neighboring lines touch
  • slant or perspective makes a rectangle inefficient
  • a box must include too much of the line above or below

That gives you the right tradeoff: fast annotation for most of the corpus, extra detail only where it clearly buys accuracy. (Kraken)
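You can even detect that hard subset automatically instead of eyeballing it. A minimal stdlib sketch that flags line rectangles which overlap a vertical neighbor too much (the 0.2 threshold is an arbitrary starting point, not an established value):

```python
def vertical_overlap_ratio(box_a, box_b):
    """Fraction of the shorter box's height shared with the other box.
    Boxes are (x0, y0, x1, y1) in pixels."""
    top = max(box_a[1], box_b[1])
    bottom = min(box_a[3], box_b[3])
    overlap = max(0.0, bottom - top)
    shorter = min(box_a[3] - box_a[1], box_b[3] - box_b[1])
    return overlap / shorter if shorter > 0 else 0.0

def flag_hard_lines(boxes, threshold=0.2):
    """Return indices of line boxes that overlap a vertical neighbor too much."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][1])  # top to bottom
    flagged = set()
    for a, b in zip(order, order[1:]):
        if vertical_overlap_ratio(boxes[a], boxes[b]) > threshold:
            flagged.update((a, b))
    return sorted(flagged)
```

Pages where many boxes get flagged are the candidates for polygon or baseline annotation; everything else stays cheap.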

3. English inside Bulgarian lines

This is not a problem by itself.

Transkribus’ data-preparation guide says a model can be trained to recognize two or more hands, languages, types of writing, or alphabets at the same time, but those variants must appear in the ground truth in a representative way. In other words, mixed Bulgarian and English is allowed. The real issue is not the existence of English. The real issue is whether it appears often enough, and whether it is transcribed consistently. (Transkribus Help Center)

So your current policy is good:

normal transcription english normal transcription

That is better than inventing special markers around English words, because special markers would become part of the target text.

What I would add is metadata. Mark lines as:

  • bg
  • en
  • mixed

That gives you a way to evaluate later whether mixed-script lines are harder than pure Bulgarian lines. Without that tag, you will not know whether the model is failing because of handwriting difficulty or script mixing. That is an inference from the current guidance on representative ground truth and the known importance of domain mismatch in HTR. (Transkribus Help Center)
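Those tags do not even need to be assigned by hand: a script heuristic over the transcriptions gets most of the way there. A stdlib sketch, where the 90% share threshold is my arbitrary choice and lines with no letters at all are left for a manual policy decision:

```python
import unicodedata

def language_tag(text, min_share=0.9):
    """Tag a line as 'bg', 'en', or 'mixed' from its script composition.
    Heuristic only: Cyrillic letters count toward bg, Latin toward en."""
    cyr = lat = 0
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            cyr += 1
        elif name.startswith("LATIN"):
            lat += 1
    total = cyr + lat
    if total == 0:
        return "mixed"  # digits/punctuation only; decide per policy
    if cyr / total >= min_share:
        return "bg"
    if lat / total >= min_share:
        return "en"
    return "mixed"
```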

4. What to do with formulas like x^2

You should not skip all of them.

Transkribus recommends a consistent, accurate transcript that reflects what is on the page, and it explicitly discusses the value of a diplomatic transcription where punctuation, superscripts, and subscripts are transcribed as they appear. It also notes that, if conventions are consistent enough, the model can learn them. (Transkribus Help Center)

That means simple inline notation should stay in the dataset:

  • x^2
  • a+b
  • y=7
  • dates
  • percentages
  • short Latin variable names

But complex mathematical layout is different. CROHME exists precisely because handwritten mathematical expression recognition is treated as its own task, not just ordinary OCR with a few special characters added. (Transkribus Help Center)

So the practical rule should be:

  • Keep simple inline notation in the main dataset.
  • Use one encoding consistently. For example, always x^2, never a mix of x^2 and a Unicode superscript like x².
  • Exclude or separately flag complex displayed formulas in version 1.

That keeps the task coherent.
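The one-encoding rule is also mechanically checkable: Unicode marks superscript and subscript characters through their compatibility decompositions, so a short pass over the transcriptions can flag any line that sneaks in a ² where the policy says ^2. A sketch:

```python
import unicodedata

def notation_violations(lines):
    """Return (line_index, character) pairs for Unicode super/subscript
    characters, which the policy says should be written inline (x^2)."""
    bad = []
    for i, text in enumerate(lines):
        for ch in text:
            d = unicodedata.decomposition(ch)
            if d.startswith("<super>") or d.startswith("<sub>"):
                bad.append((i, ch))
    return bad
```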

5. Illegible words

Your current rule needs one refinement.

Right now you say illegible words are not included in a bounding box. That is workable only if the rule is consistent. The Transkribus guide is blunt: ground truth should be as accurate as possible, because mistakes in ground truth teach the model the wrong thing. It also repeatedly stresses consistency of editorial choices. (Transkribus Help Center)

A cleaner rule would be:

  • if the whole line is too unclear, exclude the line
  • if only one short span is unclear, use one fixed unreadable-span convention
  • if uncertainty is frequent in that line, exclude it from the core training set

The exact placeholder matters less than consistency.
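For illustration only, one possible shape of such a rule, assuming a hypothetical fixed placeholder token `<gap>` for unreadable spans (the token itself is a free choice; the thresholds below are arbitrary defaults):

```python
GAP = "<gap>"  # hypothetical unreadable-span placeholder, fixed by policy

def usable_for_training(transcription, max_gap_share=0.2):
    """Exclude lines where too many tokens are unreadable."""
    tokens = transcription.split()
    if not tokens:
        return False
    gaps = sum(1 for t in tokens if t == GAP)
    if gaps == len(tokens):
        return False  # whole line unclear: drop it
    return gaps / len(tokens) <= max_gap_share
```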

6. Is the dataset worth creating?

Yes.

This is the strongest part of the answer.

Why it is worth it

Current HTR models still suffer from distribution shift. A 2025 study on HTR generalization found that out-of-distribution performance drops are driven first by textual divergence and then by visual divergence. That is directly relevant to you. “Cyrillic” is not enough. “Handwriting” is not enough. “Russian HTR” is not enough. Bulgarian student notebooks have their own text distribution, spelling, symbols, classroom notation, page layout, and writer habits. (arXiv)

There is also still visible scarcity in Bulgarian OCR resources. A recent Bulgarian paper describes creating the first benchmark dataset for OCR text correction in historical Bulgarian orthography, which is a strong sign that Bulgarian OCR remains under-resourced enough that new datasets still matter. That paper is about historical print correction, not modern handwriting, but that actually strengthens the case: the public ecosystem is still building core Bulgarian resources rather than already being saturated. (Springer Link)

So yes, the niche is real. That is exactly why the dataset is valuable.

Why vision LLMs do not remove the need

A 2025 benchmark of large language models for handwritten text recognition found that these models perform strongly on English, more weakly on other languages, and do not show a significant self-correction capability. The comparison with Transkribus-style models was mixed rather than uniformly in favor of LLMs. (Science Direct)

That matches your observation very well. A vision model can produce something plausible. But “plausible” is not the same as “faithful transcription.” For OCR and HTR, exact character fidelity matters.

7. Why the Russian HTR models failed on your pages

That result is not surprising.

The current HTR evidence says transfer depends on more than script. Writer style, domain, lexicon, notation, and image conditions all matter. The OOD study above is the core reason. A Russian model may know Cyrillic strokes, but still fail on Bulgarian school notebooks because the target distribution is different. (arXiv)

The same logic explains why EasyOCR was not very helpful. EasyOCR describes itself as a general OCR system that reads scene text and dense document text. It is broad and convenient, but it is not a handwriting-specialized, notebook-line HTR system. That makes it a poor fit for your exact use case, especially for generating useful line boxes on messy handwritten pages. (GitHub)

So no, this does not look like a “you issue.”

8. Is 500 images enough?

It is enough to start. It is not enough to be done.

Transkribus recommends 5,000 to 15,000 words as a starting range, around 25 to 75 pages, and specifically advises at least 10,000 words for each hand for handwritten documents. It also says that models trained on much larger multi-hand corpora can start to generalize to unseen hands, though with weaker performance than on in-domain validation. (Transkribus Help Center)

That means your current collection is likely enough for:

  • defining a transcription policy
  • building a pilot model
  • discovering the main failure modes

But it is probably not enough to support a strong claim like “general Bulgarian handwriting OCR” yet.

The biggest scaling priority from here is probably more writers, not just more pages from the same writers. That follows directly from the per-hand guidance above. (Transkribus Help Center)

9. What matters most in your project

The most important decision is not box versus polygon.

It is transcription policy consistency.

The Transkribus guide says this several times in different ways:

  • ground truth must be accurate
  • editorial choices must be consistent
  • the most common approach is a consistent transcript that accurately represents what is read
  • if conventions are consistent enough, the model can learn them (Transkribus Help Center)

For your case, the highest-value work is writing a short policy that fixes decisions for:

  • spaces
  • punctuation
  • capitalization
  • English words
  • digits
  • inline formulas
  • strike-throughs
  • unreadable spans
  • end-of-line hyphenation

That will help your model more than switching every line to a polygon.
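One practical trick is to encode whatever handbook decisions are mechanically checkable as named rules a script can run over every new batch. The rules below are illustrative placeholders, not a proposed policy:

```python
import re

# Illustrative policy rules: name -> regex that matches a violation.
# The real handbook decisions belong to the annotation team.
POLICY = {
    "no_double_space": r"  ",
    "no_tabs": r"\t",
    "no_leading_or_trailing_space": r"^\s|\s$",
}

def policy_violations(text):
    """Return the names of policy rules a transcription line breaks."""
    return [name for name, pat in POLICY.items() if re.search(pat, text)]
```

Running this in a pre-commit or pre-export step catches drift early, before inconsistency has contaminated thousands of lines.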

10. My concrete recommendation for your exact setup

I would do this:

Keep

  • line-level annotation
  • one transcription per line
  • rectangular boxes for most data
  • English inline as written
  • simple inline notation such as x^2

Add

  • metadata: writer_id, page_id, language_tag, quality_tag, notation_tag
  • a one-page transcription handbook
  • a writer-independent validation and test split

Exclude for version 1

  • highly uncertain lines
  • complex displayed math
  • pages where layout is so messy that your current boxes stop isolating lines reliably

Introduce later, only if needed

  • polygon or baseline annotation for a hard subset
  • page-level segmentation models
  • post-correction as a second stage

This is the version of the project most likely to produce a usable dataset with a realistic annotation cost. It also matches how current HTR tools and surveys structure the problem. (Kraken)
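The writer-independent split from the "Add" list above is simple to do deterministically once every line record carries the writer_id metadata. A stdlib sketch, with arbitrary default fractions, that guarantees no writer appears in more than one split:

```python
import random

def writer_independent_split(records, val_frac=0.1, test_frac=0.1, seed=13):
    """Split line records into train/val/test so that no writer_id
    appears in more than one split."""
    writers = sorted({r["writer_id"] for r in records})
    rng = random.Random(seed)  # fixed seed -> reproducible split
    rng.shuffle(writers)
    n_test = max(1, round(len(writers) * test_frac))
    n_val = max(1, round(len(writers) * val_frac))
    test_w = set(writers[:n_test])
    val_w = set(writers[n_test:n_test + n_val])
    split = {"train": [], "val": [], "test": []}
    for r in records:
        if r["writer_id"] in test_w:
            split["test"].append(r)
        elif r["writer_id"] in val_w:
            split["val"].append(r)
        else:
            split["train"].append(r)
    return split
```

Evaluating on held-out writers, not held-out pages, is what lets you honestly claim anything about generalization to new hands.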

Final verdict

Your current plan is good.

Not perfect. Good.

The adjustments I would make are:

  1. Do not switch the whole corpus to polygons.
  2. Do not remove English.
  3. Do not skip all formulas. Keep simple inline notation.
  4. Write strict transcription rules now.
  5. Collect more writer diversity.
  6. Treat this as a line-level Bulgarian HTR dataset, not a general OCR dataset.

That is a project worth continuing. It fills a real gap, and the failure of off-the-shelf Russian models is evidence of that gap, not evidence that the project is misguided. (Springer Link)