Reinforcement Structure Analysis

I’m working on an AI/ML solution to automatically count the number of horizontal iron bars in rebar cage images taken at construction sites.

The challenge is that a single image contains:

  1. Front-face horizontal bars (the ones that should be counted),
  2. Rear bars visible through the cage,
  3. Interior bars and stirrups,
  4. Heavy occlusion and overlapping steel members.

For example, in the attached image, a human inspector would count only the front-facing horizontal iron bar levels.

My questions are:

  1. What would be the most robust approach to distinguish front-face bars from interior/rear bars in a single RGB image (or multiple if required)?
  2. Has anyone solved a similar problem involving dense repetitive construction structures and occlusion?
  3. Would monocular depth estimation (e.g., Depth Anything, MiDaS) be sufficient?
  4. Are there classical CV techniques that could outperform deep learning for this specific task?
  5. If you were building a production-grade solution, how would you structure the pipeline?

Any suggestions, papers, datasets, or practical experiences would be greatly appreciated. I do not have to stick to classical CV approaches only.

Attached image for reference (example: one green mark means one horizontal bar).

Hmm, maybe something like this?:


I would treat this as a front-face rebar perception / bar-level counting problem, not mainly as a generic monocular-depth problem.

My short answer would be:

Monocular depth estimation may help, but I would not rely on Depth Anything / MiDaS alone. I would first use rebar-specific detection or segmentation to get candidate rebars, then use geometry, occlusion cues, apparent thickness, continuity, optional depth, and vertical clustering to decide which horizontal bar levels belong to the front face.

The important distinction is that your target is probably not “all visible rebars.” It is more specific:

Count the front-face horizontal bar levels while ignoring rear/interior/stirrup bars visible through the cage.

That means the core problem is not only detection. It is target-layer selection.

1. First, I would reframe the task

Your image contains several visually similar things:

Visible structure | Should it be counted? | Why it is difficult
Front-face horizontal bars | Yes | These are the target rows/levels
Rear-face horizontal bars | No | They look similar and are visible through the cage
Interior bars | No | They create false horizontal/diagonal candidates
Stirrups / hoops / vertical members | No | They overlap and occlude the target bars
Occluded or partially visible front bars | Maybe / yes | They may be broken into fragments in the image

So I would formulate the problem as:

detect/segment rebar candidates
→ identify which candidates belong to the front face
→ cluster front-face horizontal candidates by vertical position
→ count clusters as horizontal bar levels

This is different from simply detecting every steel bar instance.
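
To make the last two steps concrete, here is a minimal sketch of the vertical clustering. It assumes an earlier stage has already reduced each front-face candidate to a single y-coordinate in pixels; the function name `count_levels` and the `gap_px` value are illustrative and would need tuning on real images.

```python
def count_levels(y_centers, gap_px=12.0):
    """Group candidate y-coordinates into levels.

    A new level starts whenever the gap to the previous candidate
    (in sorted order) exceeds gap_px. Returns the mean y of each level.
    """
    levels = []
    for y in sorted(y_centers):
        if levels and y - levels[-1][-1] <= gap_px:
            levels[-1].append(y)   # same level as the previous candidate
        else:
            levels.append([y])     # start a new level
    return [sum(level) / len(level) for level in levels]


# Hypothetical y-centers of fragments belonging to three bar levels.
ys = [101, 99, 100, 151, 149, 203, 200, 201]
level_centers = count_levels(ys)
print(len(level_centers), level_centers)
```

The count is simply the number of clusters, so several fragments of the same bar still count as one level.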

2. Existing rebar-specific work is directly relevant

I would look at rebar-specific detection and segmentation work before relying only on generic monocular depth.

A useful recent reference is the ROI-1555 Rebar Detection and Instance Segmentation Dataset. The Hugging Face dataset page says it contains 1555 rebar images with bounding boxes and pixel-wise masks, covering diverse specifications, layouts, scenarios, and environmental conditions.

That line of work is useful because it treats this as a rebar perception problem: detect/segment steel bars under varying layouts, camera views, and assembly stages.

However, I would be careful not to assume that a generic rebar segmenter immediately solves your exact task. A rebar segmenter gives you candidate rebars. Your harder task is then:

Which of these detected/segmented bars belong to the front face?
Which horizontal candidates form one countable bar level?

So I would think of rebar-specific segmentation as the first stage, not the whole solution.

3. A practical pipeline I would try

A production-ish pipeline could look like this:

Input image
  ↓
Crop / detect the rebar cage region
  ↓
Detect or segment rebar candidates
  ↓
Keep near-horizontal elongated candidates
  ↓
Score each candidate for "front-face likelihood"
  ↓
Cluster selected candidates by vertical position
  ↓
Return count + overlay + confidence / review flag

More concretely:

Stage | Method | Notes
Cage / ROI extraction | Manual crop, detector, or simple image UI | Reduces background false positives
Rebar candidate extraction | Rebar-specific detector/segmenter, YOLO-seg, Mask R-CNN, Mask2Former, etc. | This should probably be learned rather than pure classical CV
Horizontal filtering | Orientation, aspect ratio, skeletonization, Hough lines, connected components | Classical CV is useful here
Front-face selection | Geometry, apparent thickness, contrast, continuity, occlusion order, optional depth | This is the main hard part
Level counting | Cluster by y-coordinate / projected cage coordinate | Count row/level clusters, not necessarily individual fragments
Output | Count + visual overlay + confidence | Important for inspection use cases
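
For the horizontal filtering stage, a rough sketch of an orientation and length test is shown below. It assumes each candidate has already been reduced to a line segment with two pixel endpoints (for example from a skeleton or a fitted line); the thresholds `max_angle_deg` and `min_length_px` are made-up starting values, not recommendations.

```python
import math


def is_near_horizontal(segment, max_angle_deg=10.0, min_length_px=80.0):
    """Return True for long segments whose direction is close to horizontal.

    segment is ((x1, y1), (x2, y2)) in pixel coordinates.
    """
    (x1, y1), (x2, y2) = segment
    dx, dy = x2 - x1, y2 - y1
    if math.hypot(dx, dy) < min_length_px:
        return False  # too short to be a reliable bar candidate
    angle = abs(math.degrees(math.atan2(dy, dx)))
    angle = min(angle, 180.0 - angle)  # ignore endpoint order
    return angle <= max_angle_deg


segments = [
    ((0, 100), (300, 105)),  # long, almost horizontal
    ((50, 0), (55, 300)),    # long, almost vertical
    ((0, 0), (20, 1)),       # horizontal but very short
]
print([is_near_horizontal(s) for s in segments])
```

If the cage is tilted in the image, the angle would have to be measured relative to the estimated cage orientation instead of the image x-axis.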

4. Why monocular depth alone is probably not sufficient

Depth Anything / MiDaS can be useful, but I would use them as one cue, not as the final authority.

MiDaS is commonly described as producing relative inverse depth from a single image, not guaranteed metric 3D geometry.

Depth Anything is also very useful for robust monocular depth estimation.

But in a dense rebar cage, there are several reasons monocular depth can be unreliable as the only decision signal:

Issue | Why it matters here
Repetitive structure | Front and rear bars have similar appearance
Thin objects | Depth boundaries around thin steel bars can be unstable
Occlusion | A front bar may be partially hidden or broken in the image
Relative depth | You may get useful ordering, but not always a reliable construction-level separation
Similar material/color | Steel bars may not provide strong semantic cues for depth

So I would use depth like this:

front-face score =
  geometry score
+ continuity score
+ apparent thickness / sharpness score
+ occlusion cue score
+ optional monocular depth score

Not like this:

depth map → threshold → front bars

The second approach is probably too brittle.
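
A soft combination of this kind could be sketched as a weighted mean of per-cue scores. Everything specific here is an assumption on my side: the cue names, the weights, and the idea that each cue has already been normalized to the range 0..1. The point is only that depth is one optional term among several.

```python
# Hypothetical weights; they would have to be tuned on validation images.
WEIGHTS = {
    "geometry": 0.25,
    "continuity": 0.25,
    "thickness": 0.20,
    "occlusion": 0.20,
    "depth": 0.10,  # optional monocular depth cue
}


def front_face_score(cues, weights=WEIGHTS):
    """Weighted mean of cue scores in [0, 1].

    Cues that are missing or None (e.g. no depth model was run) are skipped
    and the remaining weights are renormalized.
    """
    used = {name: w for name, w in weights.items() if cues.get(name) is not None}
    total = sum(used.values())
    if total == 0:
        return 0.0
    return sum(w * cues[name] for name, w in used.items()) / total


candidate = {"geometry": 0.9, "continuity": 0.8, "thickness": 0.7, "occlusion": 0.6}
print(round(front_face_score(candidate), 3))
```

Because the depth weight is small and optional, a wrong depth estimate can lower a score but cannot decide the outcome alone.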

5. If multiple images, video, stereo, or RGB-D are possible, use them

If you can capture more than one image, I would prefer that over trying to solve everything from one RGB image.

Useful options:

Capture setup | Benefit
Single RGB image | Cheapest, but hardest for front/rear separation
Short video / slight camera motion | Parallax helps distinguish front and rear structures
Two or more views | Easier to infer cage planes and target layer
Stereo camera | More reliable depth than monocular depth
RGB-D camera | Useful for spacing and target-layer extraction if the sensor works in the environment

There is relevant work on steel-bar installation inspection using Mask R-CNN + stereo vision, where CNN-based detection is combined with stereo-based attribute estimation.

There is also work on rebar spacing inspection using vision-based deep learning with RGB-D cameras.

This does not mean RGB-D is mandatory, but it suggests that for production inspection, adding geometric information can be more robust than expecting a single RGB monocular-depth model to infer everything.

6. Classical CV can help, but I would not use it alone

Classical CV may be useful after candidate extraction.

For example:

  • edge detection
  • morphology
  • skeletonization
  • connected components
  • Hough line transform
  • horizontal projection profiles
  • y-coordinate clustering
  • line-fragment merging

OpenCV’s Hough line transform / probabilistic Hough transform is a standard tool for line detection.

But I would not expect pure Hough lines on the raw image to solve the full task. The rear bars and interior bars can also produce strong line candidates. So classical CV is probably best used as post-processing:

segmentation mask
→ horizontal line / skeleton extraction
→ merge fragments
→ cluster rows
→ apply front-face filtering

Not as:

raw image
→ Hough lines
→ count
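
The "merge fragments" step in the recommended chain is where classical processing earns its keep. Below is a minimal sketch, assuming each fragment is a near-horizontal piece described as `(x_start, x_end, y)` in pixels. The tolerances `y_tol` and `max_gap` are illustrative; a bar hidden behind a wide vertical member may need a larger gap.

```python
def merge_fragments(fragments, y_tol=6.0, max_gap=40.0):
    """Merge broken horizontal fragments that belong to the same bar.

    fragments: iterable of (x_start, x_end, y).
    Returns a list of (x_start, x_end, mean_y) spans, ordered by row.
    """
    # 1) Group fragments into rows by y (gap-based grouping on sorted y).
    rows = []
    for frag in sorted(fragments, key=lambda f: f[2]):
        if rows and frag[2] - rows[-1][-1][2] <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    # 2) Inside each row, join spans separated by small horizontal gaps.
    merged = []
    for row in rows:
        mean_y = sum(f[2] for f in row) / len(row)
        spans = []
        for x0, x1, _ in sorted(row):
            if spans and x0 - spans[-1][1] <= max_gap:
                spans[-1][1] = max(spans[-1][1], x1)
            else:
                spans.append([x0, x1])
        merged.extend((x0, x1, mean_y) for x0, x1 in spans)
    return merged


pieces = [(0, 150, 101), (170, 300, 100), (320, 500, 99), (0, 480, 160), (600, 700, 100)]
print(merge_fragments(pieces))
```

The merged span length also gives a simple continuity measure: a row that covers most of the cage width is a more plausible front-face bar than a short isolated piece.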

7. Annotation strategy matters

If you train or fine-tune a model, the label design should match the real goal.

Possible annotation strategies:

Annotation strategy | Pros | Cons
rebar as one class | Easy; close to existing datasets | Still need front/rear separation later
horizontal_rebar, vertical_rebar, stirrup | Better structure awareness | Still may not identify front face
front_horizontal_bar, other_rebar | Directly aligned with your goal | Requires custom labels
row/level annotations as polylines | Very close to the final count | Less like standard object detection
keypoints at intersections | Useful for spacing/geometry | More annotation effort

For your exact problem, I would probably prefer one of these:

front_horizontal_bar
other_visible_rebar
ambiguous_or_occluded

or, if the goal is only counting levels:

front_horizontal_bar_level_polyline

That way the model learns the distinction you actually care about, instead of learning only “steel bar vs background.”

8. A useful mental model: detection first, then layer assignment

I would separate the problem into two subproblems:

A. Rebar perception

Detect or segment steel bars.

Relevant approaches:

  • YOLO-style object detection
  • YOLO-seg
  • Mask R-CNN
  • Mask2Former
  • Deformable DETR
  • SAM-assisted annotation

For general rebar counting, there is already work using YOLOv3 on construction-site rebar images.

That paper is not exactly the same as your problem, because counting rebar sections is different from counting front-face horizontal cage levels. But it is a useful signal that rebar counting/detection is a normal and practical CV task, not an exotic one.

B. Target-layer / front-face selection

After you have rebar candidates, decide which ones are on the front plane.

Possible cues:

Cue | Why it helps
Apparent thickness | Front bars may look thicker / clearer
Sharpness / contrast | Front bars may have stronger edges
Continuity | Front-face horizontal bars often continue across the cage width
Occlusion order | Front bars may visually occlude rear bars
Regular spacing | Front-face levels should form a plausible repeated pattern
Cage geometry | Candidate bars should lie on the same front plane
Monocular depth | Helpful as a soft cue, not absolute truth
Multi-view / RGB-D geometry | Much stronger if available

This split is important because an off-the-shelf depth model and an off-the-shelf detector are both incomplete in different ways.
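
As one example of turning a cue from the table above into something computable, the regular-spacing cue can be checked by comparing each gap between neighbouring levels with the typical gap. This is a sketch under a strong assumption: the face is viewed roughly straight-on, so equal real spacing gives roughly equal pixel spacing. The name `spacing_outliers` and the `rel_tol` value are hypothetical.

```python
from statistics import median


def spacing_outliers(level_ys, rel_tol=0.25):
    """Find gaps between neighbouring levels that deviate from the median gap.

    Returns (typical_gap, suspicious_pairs). With fewer than two gaps there is
    nothing to compare, so (None, []) is returned.
    """
    ys = sorted(level_ys)
    gaps = [b - a for a, b in zip(ys, ys[1:])]
    if len(gaps) < 2:
        return None, []
    typical = median(gaps)
    suspicious = [
        (ys[i], ys[i + 1])
        for i, gap in enumerate(gaps)
        if abs(gap - typical) > rel_tol * typical
    ]
    return typical, suspicious


# A level at y=225 sits halfway between two regular levels: possibly a rear bar.
print(spacing_outliers([100, 150, 200, 225, 250, 300]))
```

A suspicious pair does not prove that one of the rows is a rear bar; it is a reason to lower the score or send the image to review.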

9. Where SAM may fit

SAM can be useful, but I would not assume it solves the full dense-rebar problem out of the box.

SAM is a promptable segmentation model designed for zero-shot transfer.

For this task, I would use SAM mainly for:

Use | Recommendation
Annotation bootstrapping | Good idea
Interactive correction UI | Good idea
Quickly testing masks around bars | Good idea
Fully automatic dense rebar separation | I would be cautious
Final production model | Fine only after validation/fine-tuning/workflow testing

Dense, thin, overlapping structures are exactly where generic segmentation can become fragile. A rebar-specific model plus simple geometric post-processing may be more predictable.

10. Dependency / backend note

If Depth Anything / MiDaS and basic CV tooling are already in your stack, I would not overstate the dependency problem. The main advice is simply: do not add every heavy model family at once.

For a local or backend implementation, I would keep the first version small:

one detector/segmenter
+ OpenCV / NumPy post-processing
+ optional monocular depth cue

I would avoid starting with:

YOLO
+ SAM/SAM2
+ Depth Anything
+ MiDaS
+ multiple segmentation frameworks
+ complex 3D reconstruction

The more practical path is:

Phase 1:
  one rebar detector/segmenter
  horizontal filtering
  y-clustering
  visual overlay

Phase 2:
  add front-face scoring rules

Phase 3:
  add monocular depth only if it improves validation results

Phase 4:
  add multi-view/RGB-D/stereo if production accuracy requires it

11. What I would build first

If I had to build a first prototype, I would do this:

1. Collect 50–200 representative images.
2. Label only the target front horizontal bar levels, or label
   front_horizontal_bar vs other_rebar.
3. Train or fine-tune one detector/segmenter.
4. Extract near-horizontal elongated components.
5. Merge fragmented detections belonging to the same row.
6. Cluster by vertical position.
7. Output count + overlay.
8. Mark low-confidence cases for human review.

The overlay is important. For inspection tasks, a wrong count without explanation is not very useful. A count with an overlay lets the inspector see what the model counted.

12. What I would not do first

I would not start with:

single RGB image
→ Depth Anything
→ threshold depth
→ count front bars

That may work on some images, but I would expect it to fail when:

  • front and rear bars have similar appearance,
  • bars overlap heavily,
  • the cage is not perfectly frontal,
  • lighting changes,
  • the rear bars are sharp and visible,
  • the front bars are partially occluded.

13. Suggested answer to your direct questions

Question | My answer
Most robust way to distinguish front-face from rear/interior bars? | Rebar-specific detection/segmentation first, then front-layer selection using geometry, continuity, occlusion cues, and optional depth.
Similar problems? | Yes: rebar detection/counting, rebar instance segmentation, steel-bar installation inspection, and RGB-D rebar spacing inspection.
Is monocular depth enough? | I would not rely on it alone. It can help as a soft cue.
Could classical CV outperform deep learning? | For the whole task, probably not. For post-processing after segmentation, yes, classical CV can be very useful.
Production pipeline? | Controlled capture if possible, rebar detector/segmenter, target-layer selection, row clustering, confidence scoring, and human review for uncertain cases.

14. Final recommendation

My recommendation would be:

Use rebar-specific detection/segmentation as the foundation.
Do not make monocular depth the foundation.
Use depth only as one cue for front-face selection.
Use geometric post-processing to convert detections into countable horizontal levels.
If production accuracy matters, prefer multi-view, stereo, or RGB-D capture over single-image monocular depth alone.

So the strongest formulation is probably:

This is not just a depth-estimation problem. It is a rebar-specific target-layer counting problem. Detect/segment the rebar candidates first, then solve front-face assignment and horizontal-level clustering.

Okay, understood.
I am providing a few images of the reinforcement structures to better understand the situation.
What I can do to solve the problem:

  1. take multiple images of the same face
  2. video of the same face
  3. use of drone (in future)
  4. specialized cameras, if extremely required

1.

2.

3.

4.

Oh! So you can use multiple images for a single target/face? Then the set of useful models and algorithms changes quite a lot, in a good way:


The extra images and your capture options change the recommendation quite a bit.

If you can take multiple images or a video of the same target face, I would not keep the system framed as a strict single-image problem. I would treat each target face as an inspection set:

one target face
→ one straight-on reference image
→ several slightly shifted images, or a short slow video
→ candidate detection / segmentation per frame
→ cross-frame consistency / motion / geometry checks
→ front-layer selection
→ horizontal-level clustering
→ count + overlay + confidence

That change matters because some methods that are weak or ambiguous in a single RGB image become much more useful once the input becomes a same-target image/video set.

1. Why multiple images change the problem

In a single image, you mostly have:

  • appearance,
  • line geometry,
  • local occlusion cues,
  • apparent thickness,
  • monocular depth as a soft cue,
  • learned segmentation/detection.

That can help, but the front/rear separation is still underconstrained.

With multiple images or video of the same target face, you also get:

  • temporal consistency,
  • parallax,
  • cross-frame voting,
  • possible optical flow,
  • possible object/mask tracking,
  • possible multi-view geometry,
  • more chances to see around partial occlusions,
  • a better confidence signal.

So I would think of the input not as:

single image → count

but as:

same-target capture set → count

This is a major design change.

2. The lowest-friction experiment

If you already have a single-image pipeline, or if you are already testing Depth Anything / MiDaS / classical CV, I would not throw that work away.

The smallest useful extension is:

1. Choose one target face.
2. Record a slow 5–10 second video while moving slightly left/right,
   or take 5โ€“15 slightly shifted still images.
3. Extract frames.
4. Run your existing single-image processing on each frame.
5. Extract candidate horizontal bars or horizontal bar levels per frame.
6. Fuse the results across frames.
7. Keep candidates that are stable and plausible across the same-target set.
8. Cluster the remaining front-face candidates by vertical position.
9. Output count + visual overlay + confidence.

In pseudo-code:

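# The four helper names below are placeholders for the stages described above.
frames = extract_frames(video_or_image_set)

# Run the existing single-image processing on every frame.
per_frame_results = []
for frame in frames:
    result = single_image_pipeline(frame)
    per_frame_results.append(result)

# Fuse the per-frame evidence first, then count levels on the fused result.
fused = fuse_same_target_results(per_frame_results)
count = count_front_horizontal_levels(fused)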

This is useful because it does not require a completely new system. It changes the unit of analysis from one image to one capture set.
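
One possible shape for the fusion step is a simple cross-frame vote. The sketch below reuses the name `fuse_same_target_results` from the pseudo-code, but the body is my own assumption: each per-frame result is taken to be a list of row y-positions that have already been aligned to the reference frame, and `gap_px` and `min_support` are illustrative values.

```python
def fuse_same_target_results(per_frame_rows, gap_px=10.0, min_support=0.6):
    """Keep row positions that are detected in enough frames.

    per_frame_rows: list (one entry per frame) of lists of row y-positions,
    assumed to be registered to a common reference frame.
    Returns the mean y of every row cluster seen in at least
    min_support (fraction) of the frames.
    """
    tagged = sorted(
        (y, frame_idx)
        for frame_idx, rows in enumerate(per_frame_rows)
        for y in rows
    )
    clusters = []
    for y, frame_idx in tagged:
        if clusters and y - clusters[-1][-1][0] <= gap_px:
            clusters[-1].append((y, frame_idx))
        else:
            clusters.append([(y, frame_idx)])

    n_frames = len(per_frame_rows)
    fused = []
    for cluster in clusters:
        support = len({idx for _, idx in cluster}) / n_frames
        if support >= min_support:
            fused.append(sum(y for y, _ in cluster) / len(cluster))
    return fused


frames = [[100, 150, 200], [101, 151, 199, 175], [99, 149, 201], [100, 150], [102, 152, 200]]
print(fuse_same_target_results(frames))
```

In this toy input the row near y=175 appears in only one of five frames and is dropped, while the row near y=200 survives even though one frame missed it.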

3. Methods that become more useful with multiple images/video

Some of the methods you already mentioned are weak as single-image final answers, but become more interesting when repeated over a same-target capture set.

Method | In a single image | With multiple images / video
Depth Anything / MiDaS | Useful relative-depth cue, but not reliable enough as final authority | Can be checked for temporal consistency and combined with motion/parallax cues
Classical CV | Hough lines / edges may over-detect rear bars | Optical flow, feature tracking, line stability, and cross-frame voting become possible
Rebar segmentation / detection | Gives visible rebar candidates | Candidates can be fused and validated across frames
SAM / SAM-like segmentation | Helpful for masks, but fragile on dense repeated bars | SAM 2-style video mask propagation or interactive correction becomes more useful
COLMAP / SfM / modern 3D models | Not applicable | Can be tested as diagnostic geometry cues
RGB-D / stereo | Not relevant if only RGB | Becomes a strong option if specialized cameras are acceptable

So I would not discard the original ideas. I would change their role.

For example:

Depth Anything as a one-frame decision maker: risky
Depth Anything as a repeated soft cue across frames: more useful

and:

Hough lines on one image: many false positives
line candidates stable across a same-target video: more meaningful

4. A practical branch tree

I would choose the pipeline depending on what capture is possible.

Can you capture multiple images or video of the same target face?

├─ No, single image only
│  └─ Endpoint:
│     rebar detection/segmentation
│     + geometric filtering
│     + optional monocular depth cue
│     + confidence / human review
│
└─ Yes
   ├─ Short video is available
   │  └─ Endpoint:
   │     per-frame candidates
   │     + tracking / optical flow / SAM 2 mask propagation
   │     + temporal consistency
   │     + row clustering
   │
   ├─ Multiple still images are available
   │  └─ Endpoint:
   │     same-target inspection set
   │     + multi-view consistency
   │     + optional SfM / DUSt3R / MASt3R / VGGT diagnostic
   │     + fused row candidates
   │
   └─ Specialized camera is possible
      └─ Endpoint:
         stereo or RGB-D
         + point cloud / plane fitting
         + target-layer extraction
         + spacing/count validation

I would start with the lowest-cost branch and only move to heavier hardware or heavier 3D reconstruction if the simpler route fails.

5. Suggested priority order

My practical priority order would be:

Priority | Option | Why
1 | Controlled same-target video | No special hardware; adds temporal consistency and parallax
2 | Multiple same-target still images | Easy to collect; supports cross-view checking
3 | Rebar-specific detection/segmentation | Gives candidate bars before layer selection
4 | Optical flow / tracking / temporal voting | Low-cost way to use video
5 | SAM 2 video propagation | Useful for interactive mask propagation / annotation
6 | COLMAP / DUSt3R / MASt3R / VGGT | Useful diagnostic geometry, but not guaranteed on repetitive rebar
7 | Stereo / RGB-D | Stronger geometry if special cameras are acceptable
8 | Drone | Useful for access/safety/repeatability, but not automatically a better CV solution

I would not start with the drone unless access or safety requires it. A drone changes the camera position and may help collect images from safer or more repeatable viewpoints, but it does not automatically solve front/rear bar separation. A controlled handheld same-target video may be more valuable for algorithm development.

6. Path A: single-image fallback

If only one RGB image is available, I would use the earlier kind of pipeline:

image
→ crop/select target face
→ detect or segment rebar candidates
→ keep near-horizontal elongated candidates
→ score front-face likelihood
→ cluster by vertical position
→ count

The front-face score could combine:

apparent thickness
+ edge sharpness
+ continuity across width
+ occlusion order
+ regular spacing
+ optional monocular depth

But I would still treat this as the least robust path. The output should probably include a visual overlay and a confidence score, because there will be ambiguous cases.

7. Path B: same-target video

If video is available, I would try this first.

video of same target face
→ sample frames
→ run candidate detection/segmentation per frame
→ associate candidates across frames
→ keep temporally stable row candidates
→ use motion/parallax to suppress rear/interior candidates
→ cluster rows

This does not require full 3D reconstruction.

It can be implemented with relatively ordinary tools:

  • per-frame detection/segmentation,
  • optical flow,
  • tracker association,
  • temporal voting,
  • row-level clustering.

Ultralytics YOLO has a tracking mode using trackers such as BoT-SORT and ByteTrack.

However, I would be careful with the tracking unit. Tracking each individual thin bar may be fragile. For dense, repeated rebar, I would probably track or stabilize row candidates or regions, not depend too heavily on perfect per-bar IDs.

OpenCV optical flow can also be useful.

But again, I would not use optical flow as a magic answer. I would use it as another cue:

Does this horizontal candidate move like the front layer?
Is it stable across nearby frames?
Does it remain in the same row-level cluster?
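
The first question can be given a rough numerical form. For a camera that translates sideways, a point at depth Z shifts in the image by about f·t/Z, so nearer structures move more between frames than farther ones. The sketch below assumes you already have one displacement magnitude per candidate, measured at trackable points such as bar intersections; the function name and `min_gap_px` are hypothetical.

```python
def split_by_parallax(displacements_px, min_gap_px=2.0):
    """Return indices of candidates in the larger-displacement (nearer) group.

    The sorted displacements are split at their largest gap. If even the
    largest gap is below min_gap_px, there is no clear layer separation and
    an empty list is returned so that other cues decide.
    """
    order = sorted(range(len(displacements_px)), key=lambda i: displacements_px[i])
    gaps = [
        (displacements_px[order[k + 1]] - displacements_px[order[k]], k)
        for k in range(len(order) - 1)
    ]
    if not gaps:
        return []
    best_gap, split_at = max(gaps)
    if best_gap < min_gap_px:
        return []
    return sorted(order[split_at + 1:])


# Hypothetical per-candidate displacements between two frames, in pixels.
moves = [12.1, 6.0, 11.8, 5.7, 12.4, 6.3]
print(split_by_parallax(moves))
```

This two-group split is deliberately crude. With interior bars at several depths there can be more than two groups, so I would treat the result as one more soft vote rather than a final label.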

8. Path C: SAM 2 as a video/annotation helper

SAM 2 is relevant here because it is designed for promptable segmentation in both images and videos.

I would not assume SAM 2 will automatically separate all dense rebars correctly. The structure is thin, repetitive, and heavily occluded.

But SAM 2 may be useful in this workflow:

first frame:
  prompt target face / front bars / cage region

video:
  propagate mask or region through frames

post-processing:
  count horizontal levels inside the propagated target face

I would especially consider it for:

  • annotation acceleration,
  • interactive correction,
  • propagating a manually selected target face through video,
  • building a training dataset faster.

So the role is not necessarily:

SAM 2 → final count

but rather:

SAM 2 → useful masks / annotations / target region propagation

9. Path D: multiple still images and multi-view geometry

If multiple still images of the same target face are available, I would treat them as an inspection set.

same-target still images
→ run candidate detection per image
→ match/fuse row candidates across images
→ optionally run a geometry diagnostic
→ count stable front-face levels

This opens up tools that are impossible with a single image.

For classic multi-view reconstruction, COLMAP is a standard SfM/MVS tool. COLMAP can be useful to test whether there is enough camera motion and texture to recover a meaningful geometry signal.

However, I would not make COLMAP the first production assumption. Rebar cages are difficult for SfM because they contain:

  • repeated patterns,
  • thin lines,
  • many similar intersections,
  • occlusion,
  • background construction clutter.

Repeating structure can cause wrong correspondences or wrong camera poses in SfM systems; this is a known type of failure, not something specific to this task.

So I would treat SfM as:

good diagnostic if it works
not a guaranteed core pipeline

10. Modern 3D foundation models may be worth testing

Because you can collect multiple images, newer 3D models may also be worth testing as diagnostics.

Examples:

DUSt3R is designed for dense 3D reconstruction from arbitrary image collections without known camera calibration or poses.

VGGT predicts key 3D scene attributes such as camera parameters, point maps, depth maps, and 3D point tracks from one, a few, or hundreds of views.

These are not rebar-specific models. I would not assume they solve the task directly. But they may be useful to answer a practical question:

Does the same-target image set contain enough geometric signal
to separate the front layer from the background/rear layer?

If the answer is yes, then geometry can become part of the pipeline. If not, it is better to focus on detection/segmentation and capture control.

11. Path E: RGB-D or stereo, if specialized cameras are acceptable

If specialized cameras are acceptable, I would consider stereo or RGB-D before thinking of a drone as the main algorithmic solution.

The reason is simple:

The hard part is layer separation, not only image access.

RGB-D or stereo can directly help with front/rear separation.

There is relevant work on rebar spacing inspection using RGB-D and point-cloud processing.

That work is interesting because it uses depth/point-cloud processing to filter background rebar layers before measuring the target layer. That is conceptually close to your front/rear separation problem.
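
To illustrate the idea of filtering background layers by depth (this is my own simplified sketch, not the method from that work), suppose each bar candidate already has a median depth in metres taken from the RGB-D frame. The nearest group of candidates can then be kept as the front layer. The tolerance `layer_tol_m` is a guess, and treating a depth of 0 as an invalid reading is an assumption that matches many, but not all, sensors.

```python
def front_layer(candidates, layer_tol_m=0.05):
    """Select candidates belonging to the nearest depth layer.

    candidates: list of (candidate_id, median_depth_m). Entries with a depth
    of None or <= 0 are treated as invalid readings and ignored.
    Walks outward from the nearest candidate and stops at the first depth
    jump larger than layer_tol_m.
    """
    valid = [(cid, d) for cid, d in candidates if d is not None and d > 0]
    if not valid:
        return []
    valid.sort(key=lambda item: item[1])
    layer = [valid[0]]
    for cid, depth in valid[1:]:
        if depth - layer[-1][1] > layer_tol_m:
            break  # depth jump: everything beyond is a deeper layer
        layer.append((cid, depth))
    return [cid for cid, _ in layer]


bars = [("h1", 1.20), ("h2", 1.22), ("r1", 1.55), ("h3", 1.21), ("bad", 0.0), ("r2", 1.58)]
print(front_layer(bars))
```

This only works when the face is roughly parallel to the sensor. For tilted or cylindrical cages, fitting a plane or surface to the point cloud and measuring distance to that surface would be the more appropriate version of the same idea.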

There is also related work combining instance segmentation and stereo vision for steel-bar installation inspection.

So if special cameras are realistic, I would think in this order:

normal video / multi-image capture first
→ if front/rear separation is still unreliable:
   stereo or RGB-D
→ drone only if access/safety/repeatability requires it

12. How I would use Depth Anything / MiDaS in this new setting

Your original monocular-depth idea becomes more useful once there are multiple frames.

Single-frame usage:

Depth Anything / MiDaS
→ relative depth map
→ maybe front/rear cue

This is weak as a final decision.

Same-target video usage:

Depth Anything / MiDaS per frame
→ check whether front-layer candidates remain consistently closer
→ combine with candidate continuity and parallax
→ use depth as a soft vote

This is a better role.

There are also video-focused depth models such as Video Depth Anything.

That does not mean it is automatically necessary, but it supports the general point: video depth consistency is a different problem from single-image depth.

So I would phrase it like this:

Monocular depth is not sufficient as a single-frame authority,
but it may become useful as a repeated soft cue across a same-target capture set.

13. How I would use classical CV in this new setting

Classical CV also becomes more useful with video.

Single-image classical CV:

edges / Hough lines / morphology
→ many false positives from rear/interior bars

Video classical CV:

line candidates
+ optical flow
+ frame-to-frame consistency
+ row-level voting

This is much more useful.

For example:

1. Extract near-horizontal candidates in each frame.
2. Cluster them into row candidates.
3. Track row candidates across frames.
4. Keep rows that remain stable and plausible.
5. Downweight rows that appear only in a few frames or move inconsistently.

This keeps classical CV in a realistic role: not the whole solution, but a useful stabilizer.

14. A possible minimal prototype

If I were testing this with ordinary camera/video first, I would implement:

Input:
  same-target short video or 5–15 same-target images

Step 1:
  manually or automatically crop/select the target face

Step 2:
  run existing per-image processing:
    - rebar candidate detection/segmentation
    - optional depth
    - near-horizontal candidate extraction

Step 3:
  aggregate across frames:
    - group candidates by row
    - check temporal consistency
    - check depth consistency if available
    - check motion/parallax behavior if available

Step 4:
  produce:
    - counted horizontal levels
    - overlay on reference image
    - confidence score
    - low-confidence review flag

A very simple scoring idea:

row_score =
    number_of_frames_detected
  + horizontal_continuity_score
  + row_spacing_plausibility_score
  + front_depth_consistency_score
  + motion_consistency_score

Then count rows above a threshold, and always show the overlay.
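
A sketch of that scoring and thresholding is below. It differs from the formula above in one deliberate way: the number of frames detected is divided by the total number of frames so that all five terms are on the same 0..1 scale. The dictionary keys, the threshold, and the review margin are all assumptions for illustration.

```python
def row_score(row, n_frames):
    """Mean of five terms, each expected to be in [0, 1]."""
    terms = [
        row["frames_detected"] / n_frames,  # frame support as a fraction
        row["continuity"],
        row["spacing"],
        row["depth"],
        row["motion"],
    ]
    return sum(terms) / len(terms)


def count_rows(rows, n_frames, threshold=0.6, review_margin=0.1):
    """Count rows at or above the threshold and list borderline rows for review."""
    scored = [(row_score(r, n_frames), r) for r in rows]
    counted = [r for score, r in scored if score >= threshold]
    review = [r for score, r in scored if abs(score - threshold) < review_margin]
    return len(counted), review


rows = [
    {"y": 100, "frames_detected": 10, "continuity": 0.9, "spacing": 0.9, "depth": 0.8, "motion": 0.9},
    {"y": 150, "frames_detected": 9, "continuity": 0.8, "spacing": 0.8, "depth": 0.7, "motion": 0.8},
    {"y": 175, "frames_detected": 3, "continuity": 0.4, "spacing": 0.2, "depth": 0.3, "motion": 0.3},
    {"y": 200, "frames_detected": 6, "continuity": 0.7, "spacing": 0.6, "depth": 0.6, "motion": 0.7},
]
count, needs_review = count_rows(rows, n_frames=10)
print(count, [r["y"] for r in needs_review])
```

Rows that land close to the threshold are exactly the ones worth highlighting in the overlay, since a small change in any cue would flip the count.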

15. Questions that would decide the branch

The next useful questions are probably:

Question | Why it matters
Do you need only count, or also spacing/compliance? | Count can be simpler; spacing needs scale/calibration
Is the target always one face of the cage? | Same-target capture set assumes this
Can the target face be manually cropped/selected? | This greatly reduces difficulty
Can you capture a short slow video during inspection? | Enables temporal consistency and parallax
Can you place a known-size marker or use design dimensions? | Helps scale and validation
Are RGB-D/stereo cameras acceptable in the field, or only for R&D? | Decides whether depth/point cloud routes are realistic
Is a drone needed for access/safety, or mainly for better vision? | These are different reasons
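
The first row is the reason scale matters: a count needs no units, but spacing does. Under the standard pinhole camera model, a length of p pixels at depth Z with focal length f (in pixels) corresponds to a real length of p·Z/f. The sketch below applies that to level positions; it assumes an undistorted image, a face roughly parallel to the image plane, and a single depth for the whole face, and the argument names are my own.

```python
def level_spacings_mm(level_ys_px, depth_mm, focal_px):
    """Convert pixel gaps between neighbouring levels into millimetres.

    Pinhole model: real_size = pixel_size * depth / focal_length_in_pixels.
    depth_mm would come from an RGB-D/stereo reading (or a known-size marker),
    focal_px from the camera intrinsics.
    """
    ys = sorted(level_ys_px)
    return [(b - a) * depth_mm / focal_px for a, b in zip(ys, ys[1:])]


# Hypothetical values: three levels seen at 1.5 m with a 1000 px focal length.
print(level_spacings_mm([200, 300, 402], depth_mm=1500, focal_px=1000))
```

For tall cages photographed from below, the single-depth assumption breaks down, so each bar would need its own depth or a proper 3D measurement.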

16. My revised recommendation

Given your new constraints, I would revise the earlier recommendation to:

Do not treat this as only a single-image depth problem.
Treat each target face as a same-target inspection set.

Start with ordinary camera video or multiple still images.
Run your current per-image model/CV pipeline on frames.
Fuse the evidence across frames.
Use temporal consistency, parallax, optional depth, and row clustering
to select the front horizontal levels.

Only move to stereo/RGB-D if normal same-target capture is not reliable enough.
Use drones mainly for access/safety/repeatability, not as the core CV solution.

This does not give a guaranteed final answer, but it should make the search space much better constrained. The key shift is:

single image:
  semantic + appearance problem

multiple images/video:
  semantic + appearance + temporal + geometric problem

That second formulation gives you many more practical options.

Do you need only count, or also spacing/compliance?
I need to count the number of horizontal as well as vertical bars, measure the distance between two consecutive horizontal bars and between two consecutive vertical bars, and measure the length of the vertical and horizontal bars.

Is the target always one face of the cage?
No, the target would be each face of the cage. The cage structure will vary: for example, cages with 4, 5, or 6 sides, cylindrical cages, vertically standing cages, tilted cages, slab-style cages, etc.

Can the target face be manually cropped/selected?
No, since it will be an automated system, I can’t introduce manual cropping.

Can you capture a short slow video during inspection?
Yes, I can capture a short video.

Can you place a known-size marker or use design dimensions?
If absolutely required, I can place a known-size marker. But I would like to avoid using a marker for every rebar structure. I can place a few markers throughout the construction site.

Are RGB-D/stereo cameras acceptable in the field, or only for R&D?
Yes, an RGB-D camera is acceptable in the field as well as in R&D.

Is a drone needed for access/safety, or mainly for better vision?
The reason is that those rebar structures are very long. If I capture an image from the bottom, the upper portion will appear very thin, and individual bars will be impossible to identify even for human eyes. The cage will look narrow, as in image number 2.

Additional information:

  1. I can use a high-end GPU (for example, 32GB) for training, but not for inference when deployed at the client. There I can only use a smaller GPU, such as 16GB.
  2. I can afford the time if computation is time-consuming.
  3. If required, I can use a small local VLM as well.
  4. For the future: I can use additional small sensors (gyroscope, etc.) when using drones. If I use a drone, I will be able to capture a top view, if that helps in any situation.

Oh. If RGB-D is acceptable, I think this can be made much more concrete:


Your latest constraints clarify the problem a lot.

Since you need:

  • number of horizontal bars,
  • number of vertical bars,
  • horizontal spacing,
  • vertical spacing,
  • horizontal/vertical bar length,
  • multiple cage shapes,
  • no manual cropping,
  • short video capture,
  • RGB-D/stereo acceptable in the field,
  • training on a larger GPU but deployment on a smaller GPU,

I would now treat this less as a “rebar counting from images” problem and more as a geometry-aware rebar measurement system.

My direct answer would be:

If RGB-D is acceptable, I would make RGB-D / point cloud geometry the main measurement path. Use RGB for detecting rebars, keypoints, crosspoints, faces, or candidate regions; use depth/point cloud geometry for face/layer assignment, metric spacing, and length measurement.

In other words:

RGB:
  what is rebar?
  where are the visible bars/keypoints/crosspoints?

Depth / point cloud:
  which face/layer do they belong to?
  what is the metric distance/length?
  is this measurement geometrically plausible?

That is much more concrete than trying to solve everything with monocular depth or a pure 2D detector.

1. The problem has changed from “counting” to “measurement”

Your latest requirements are broader than a count-only task.

| Requirement | What it implies |
| --- | --- |
| Count horizontal and vertical bars | detection / segmentation / row-column grouping |
| Measure spacing | metric scale is needed |
| Measure bar length | endpoints or fitted 3D line segments are needed |
| Multiple cage geometries | structure-type routing is needed |
| No manual crop | automatic ROI / structure / face detection is needed |
| Short video allowed | multi-frame consistency becomes useful |
| RGB-D acceptable | real depth / point cloud geometry becomes a strong option |
| 16GB GPU at deployment | keep the deployed model reasonably light |

So I would not design the system as:

one image → one detector → count

I would design it more like:

RGB-D capture set
→ automatic structure/face detection
→ RGB rebar/keypoint detection
→ depth/point-cloud geometry
→ face/layer assignment
→ count + spacing + length
→ overlay + confidence/review

2. RGB-D gives a clean division of labor

A practical RGB-D system could be divided like this:

| Stage | Input | Main tools | Output |
| --- | --- | --- | --- |
| Capture | RGB-D frames/video | RealSense / RGB-D SDK / Open3D capture | synchronized RGB + depth |
| ROI / structure detection | RGB | detector / segmenter / VLM-assisted QA | cage / slab / face candidate regions |
| Rebar perception | RGB | YOLO / YOLO-seg / Mask R-CNN / keypoint model | bars, masks, endpoints, crosspoints |
| 3D lifting | RGB detections + depth | camera intrinsics / aligned depth | 3D points / 3D segments |
| Face/layer extraction | point cloud | plane fitting, clustering, RANSAC, DBSCAN | faces/layers |
| Measurement | 3D points/lines | line fitting, distance computation | count, spacing, length |
| QA | RGB + overlay + metrics | confidence rules / VLM optional | reviewable result |

The dependency-safe version is:

Use deep learning mainly for RGB perception.
Use deterministic geometry first for the depth/point-cloud part.

That means you avoid starting with a heavy 3D point-cloud deep learning stack. You can add it later if needed.

3. A very practical first RGB-D prototype

I would start with one constrained subtype, not all cage types at once.

For example:

Prototype target:
  vertical rectangular cage
  one or two visible faces
  RGB-D video
  known camera intrinsics
  no manual crop at final deployment,
  but manual crop may be allowed during early debugging

A first prototype pipeline:

1. Record RGB-D video of one cage.
2. Align depth to RGB.
3. Detect 2D rebar features in RGB:
     - crosspoints, or
     - bar endpoints, or
     - bar masks/segments.
4. Lift those 2D detections into 3D using depth.
5. Fit planes/layers in the point cloud.
6. Assign bars/keypoints to faces/layers.
7. Cluster bars into horizontal/vertical groups.
8. Measure spacing and length in 3D.
9. Show overlay + confidence + failure reasons.

The core operation is simple:

2D detection:
  pixel coordinate = (u, v)

Depth:
  z = depth(u, v)

Camera intrinsics:
  (u, v, z) → 3D point (X, Y, Z)

For Intel RealSense, this is the kind of operation handled by rs2_deproject_pixel_to_point(...).
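To make the lifting step concrete without depending on any sensor SDK, here is a minimal NumPy sketch of the same idea using an ideal pinhole model. It assumes the depth image is already aligned to RGB, is in metres, and that lens distortion can be ignored; the intrinsics `fx, fy, cx, cy` and the function name `deproject_pixels` are placeholders for illustration, not values or APIs from a real camera.

```python
import numpy as np

def deproject_pixels(uv, depth_m, fx, fy, cx, cy):
    """Lift pixel coordinates to 3D camera-frame points (pinhole, no distortion).

    uv:      (N, 2) array-like of (u, v) = (column, row) pixel coordinates.
    depth_m: (H, W) depth image in metres, aligned to the RGB image.
    Returns (points, valid): (N, 3) XYZ in metres, and an (N,) boolean mask
    that is False where the depth reading is missing (zero, negative or NaN).
    """
    uv = np.asarray(uv, dtype=float)
    cols = np.clip(np.round(uv[:, 0]).astype(int), 0, depth_m.shape[1] - 1)
    rows = np.clip(np.round(uv[:, 1]).astype(int), 0, depth_m.shape[0] - 1)
    z = depth_m[rows, cols].astype(float)
    valid = np.isfinite(z) & (z > 0)
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1), valid

# Example with made-up intrinsics: two detected crosspoints 100 px apart
# on a surface 2 m away are 0.4 m apart in 3D.
depth = np.full((480, 640), 2.0)
pts, ok = deproject_pixels([[320, 240], [420, 240]], depth, 500.0, 500.0, 320.0, 240.0)
print(np.linalg.norm(pts[1] - pts[0]), ok)
```

In practice I would sample a small neighbourhood around each detection and take the median depth, because single-pixel depth on thin bars is often missing or noisy.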

So a very useful pattern is:

detect in RGB
→ read aligned depth at the detected pixels
→ deproject to 3D
→ measure in 3D

4. Related RGB-D rebar work

There is relevant work very close to this direction.

The paper “Automatic Quality Inspection of Rebar Spacing Using Vision-Based Deep Learning with RGBD Camera” proposes RGB-D + deep learning for rebar spacing inspection.

What is especially relevant is the idea of using depth/point-cloud processing to handle different layers. That is close to your front/rear/layer separation issue.

There is also related work combining instance segmentation and stereo vision for steel-bar installation inspection.

And there is recent work around Rebar-YOLOv8-seg + depth / point cloud for rebar spacing measurement.

I would not assume any of these is a drop-in solution for your exact cage variations, but they support the same overall direction:

RGB recognition
+ depth / point-cloud geometry
+ metric measurement

5. Bar masks vs keypoints vs crosspoints

For your outputs, I would consider more than one detection target.

| Detection target | Good for | Weakness |
| --- | --- | --- |
| Bar mask / instance | visible bar count, approximate length, face assignment | thin/overlapping bars can break masks |
| Bar centerline | count, spacing, row/column clustering | needs good line extraction |
| Endpoints | length measurement | endpoints may be occluded or outside image |
| Crosspoints / intersections | spacing measurement, grid geometry | not always visible on all cage faces |
| Face / layer plane | face assignment | depends on depth quality |
| Whole cage / structure region | automatic ROI | not enough for measurement alone |

For spacing, crosspoints/keypoints can be very useful.

For length, endpoints or 3D line fitting are useful.

For count, row/column line clustering may be more stable than trying to perfectly count every visible fragment.
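A minimal sketch of that row/column clustering idea: take one coordinate per detected fragment (for example the height of each horizontal segment on the face), sort, and start a new level whenever the gap to the previous value exceeds a threshold. The function name `cluster_levels` and the `gap` value are assumptions for illustration; the threshold would need tuning against real bar diameters and spacing.

```python
import numpy as np

def cluster_levels(coords, gap):
    """Group 1D positions into bar levels by splitting at large gaps.

    coords: positions of detected fragments along one axis (e.g. metres).
    gap:    a new level starts when consecutive sorted values differ by more
            than this (assumed smaller than the true bar spacing).
    Returns an array with the mean position of each level.
    """
    coords = np.sort(np.asarray(coords, dtype=float))
    if coords.size == 0:
        return np.array([])
    breaks = np.where(np.diff(coords) > gap)[0] + 1
    return np.array([g.mean() for g in np.split(coords, breaks)])

# Six fragments that belong to three horizontal levels.
levels = cluster_levels([0.00, 0.01, 0.15, 0.16, 0.149, 0.30], gap=0.05)
print(len(levels), np.diff(levels))  # count, and spacing between levels
```

Several fragments of the same bar collapse into one level, so broken masks or partially occluded bars do not inflate the count.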

A reasonable combined strategy is:

Detect:
  - cage / face region
  - visible rebar lines or masks
  - crosspoints / intersections where possible
  - endpoints where visible

Then:
  use RGB-D geometry to assign them to faces/layers
  and measure in 3D.

6. Suggested structure-type routing

Since your targets can be rectangular, 5-sided, 6-sided, cylindrical, tilted, slab-style, etc., I would not try to solve all geometry with one post-processing rule.

I would route by structure type first.

| Structure type | Suggested measurement strategy |
| --- | --- |
| Flat slab mesh | plane extraction + 2D grid/keypoint measurement on the plane |
| Vertical rectangular cage | face plane extraction + horizontal/vertical line clustering |
| Tilted rectangular cage | same as rectangular, but use fitted plane coordinates instead of image coordinates |
| 5/6-sided cage | detect multiple face planes, process each face separately |
| Cylindrical cage | fit cylinder / unwrap to angular-height coordinates |
| Very tall cage | use drone/video to capture upper sections at usable resolution |
| Complex cluttered scene | automatic structure/ROI detection becomes the first hard problem |

A good design is probably:

global pipeline:
  detect structure type
  detect candidate faces/layers
  choose measurement logic per structure type

This is where a small VLM may help, but I would not use it as the measurement engine.

7. Role of a VLM

You mentioned a small local VLM if required.

I would use a VLM for:

  • scene/structure-type routing,
  • sanity checking,
  • explaining overlays,
  • flagging strange cases,
  • choosing between “slab mesh / vertical cage / cylindrical cage” categories,
  • generating inspection notes for human review.

I would not use a VLM as the primary source of numeric measurements.

So:

Good VLM use:
  "This looks like a tilted vertical cage with two visible faces."
  "The top region is too thin/low-resolution for reliable bar counting."
  "The overlay appears to miss the rear face."

Risky VLM use:
  "There are exactly 17 bars and spacing is 143 mm."

For numeric measurements, I would trust geometry and explicit detections more than a language/vision model.

8. Face/layer extraction with RGB-D

This is one of the most important parts.

If you have an RGB-D frame, you can create a point cloud and try:

point cloud
→ remove obvious background
→ find dominant planes/layers
→ assign detected bars/keypoints to the nearest plane/layer

Tools/ideas:

  • RANSAC plane fitting,
  • DBSCAN clustering,
  • normal estimation,
  • line fitting,
  • repeated plane extraction,
  • distance-to-plane filtering,
  • multi-frame averaging.
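To show what RANSAC plane fitting looks like at its simplest, here is a NumPy-only sketch. It is a stand-in for illustration, not a tuned implementation; as far as I know Open3D provides a comparable routine through `PointCloud.segment_plane`, which I would use in a real pipeline. The function name `fit_plane_ransac` and the default thresholds are assumptions.

```python
import numpy as np

def fit_plane_ransac(points, dist_thresh=0.01, iters=200, seed=0):
    """Fit one dominant plane n.p + d = 0 with a basic RANSAC loop.

    points: (N, 3) array in metres. dist_thresh: inlier distance in metres.
    Returns ((unit_normal, d), inlier_mask); the plane is None if every
    random sample was degenerate.
    """
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        raise ValueError("need at least 3 points to fit a plane")
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-12:
            continue  # collinear sample, no unique plane
        normal = normal / norm
        d = -float(normal @ p0)
        inliers = np.abs(points @ normal + d) < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (normal, d), inliers
    return best_plane, best_inliers

# Synthetic cage: a front layer at z = 1.0 m and a sparser rear layer at z = 1.3 m.
rng = np.random.default_rng(1)
front = np.column_stack([rng.uniform(-1, 1, (200, 2)), np.full(200, 1.0)])
rear = np.column_stack([rng.uniform(-1, 1, (80, 2)), np.full(80, 1.3)])
plane, inliers = fit_plane_ransac(np.vstack([front, rear]))
print(plane, inliers.sum())
```

"Repeated plane extraction" then just means removing the inliers and running the same fit again on the remaining points to find the next face or layer.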

Open3D is a practical library for this kind of point-cloud work.

A minimal geometry path:

RGB-D frame
→ point cloud
→ plane segmentation
→ distance-to-plane assignment
→ process each face/layer separately

This is much easier to deploy than starting with point-cloud neural networks.
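The distance-to-plane assignment step can be sketched in a few lines, assuming the planes have already been fitted and are given as `(unit_normal, d)` pairs with `normal . p + d = 0`. The name `assign_to_planes`, the `max_dist` default, and the idea of returning a margin are my own choices for illustration.

```python
import numpy as np

def assign_to_planes(points, planes, max_dist=0.03):
    """Assign each 3D point to its nearest plane.

    planes: list of (unit_normal, d) tuples.
    Returns (labels, margin):
      labels[i] = index of the nearest plane, or -1 if it is farther than max_dist.
      margin[i] = distance to second-nearest plane minus distance to nearest
                  (inf when only one plane is given); small margins are ambiguous.
    """
    points = np.asarray(points, dtype=float)
    dists = np.stack(
        [np.abs(points @ np.asarray(n, dtype=float) + d) for n, d in planes], axis=1
    )
    order = np.argsort(dists, axis=1)
    rows = np.arange(len(points))
    nearest = dists[rows, order[:, 0]]
    if dists.shape[1] > 1:
        margin = dists[rows, order[:, 1]] - nearest
    else:
        margin = np.full(len(points), np.inf)
    labels = np.where(nearest <= max_dist, order[:, 0], -1)
    return labels, margin

# Front face at z = 1.0 m, rear face at z = 1.3 m, plus one background point.
planes = [(np.array([0.0, 0.0, 1.0]), -1.0), (np.array([0.0, 0.0, 1.0]), -1.3)]
labels, margin = assign_to_planes([[0, 0, 1.01], [0, 0, 1.29], [0, 0, 2.0]], planes)
print(labels, margin)
```

The margin is worth keeping because it can later feed a confidence score: a keypoint that sits halfway between two layers should not be counted with the same certainty as one lying clearly on the front face.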

9. Projecting to a face plane

Once you detect a face plane, measurement becomes easier.

For a flat face:

1. Fit the face plane.
2. Define local face coordinates:
     x_axis = horizontal direction on face
     y_axis = vertical direction on face
     z_axis = face normal
3. Project detected 3D points/lines onto that face coordinate system.
4. Count rows/columns in 2D face coordinates.
5. Measure spacing/length in metric units.

This avoids measuring in raw image coordinates, which are distorted by camera perspective.

For tilted cages, this is especially important. A tilted face may look compressed in the image, but in the fitted face coordinate system the spacing becomes more meaningful.
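The face-coordinate steps above can be sketched as follows. This assumes the plane normal is already known and that a rough "up" direction is available; I use `(0, -1, 0)` as a default because many camera frames have the y axis pointing down, but that is an assumption about the sensor convention. The function names are hypothetical.

```python
import numpy as np

def face_basis(normal, up_hint=(0.0, -1.0, 0.0)):
    """Build orthonormal axes (x_axis, y_axis, z_axis) for a face plane.

    z_axis is the unit normal; y_axis is up_hint projected into the plane;
    x_axis completes a right-handed frame. Breaks down if up_hint is
    (nearly) parallel to the normal, e.g. a slab viewed straight from above.
    """
    z_axis = np.asarray(normal, dtype=float)
    z_axis = z_axis / np.linalg.norm(z_axis)
    up = np.asarray(up_hint, dtype=float)
    y_axis = up - (up @ z_axis) * z_axis  # remove the component along the normal
    y_axis = y_axis / np.linalg.norm(y_axis)
    x_axis = np.cross(y_axis, z_axis)
    return x_axis, y_axis, z_axis

def to_face_coords(points, origin, basis):
    """Express 3D points in face coordinates: (along x_axis, along y_axis, off-plane)."""
    x_axis, y_axis, z_axis = basis
    rel = np.asarray(points, dtype=float) - np.asarray(origin, dtype=float)
    return np.stack([rel @ x_axis, rel @ y_axis, rel @ z_axis], axis=1)

# Tilted face: four bar levels spaced 0.15 m apart along the face's own vertical axis.
basis = face_basis([0.0, 0.6, 0.8])
origin = np.array([0.1, 0.2, 1.5])
pts = origin + np.outer(np.arange(4) * 0.15, basis[1])
print(np.diff(pts[:, 1]))                                # camera-frame y: compressed
print(np.diff(to_face_coords(pts, origin, basis)[:, 1])) # face coordinates: 0.15
```

The first printed spacing is the foreshortened value you would get from camera-frame coordinates; the second is the spacing measured on the face itself.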

10. Cylindrical cages

For cylindrical cages, a flat-plane strategy will fail.

A rough path:

1. Detect/segment the cylindrical cage region.
2. Fit a cylinder or estimate cylinder axis.
3. Convert 3D points to cylindrical coordinates:
     angle, height, radius
4. Count vertical bars by angular clusters.
5. Count horizontal rings by height clusters.
6. Measure ring spacing by height distance.
7. Estimate circumference / arc spacing if needed.

So I would treat cylindrical cages as a separate branch, not as just another rectangular cage.
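For the coordinate conversion in step 3, here is a minimal sketch that assumes the cylinder axis (a point on it and its direction) has already been estimated by some other means. The zero-angle direction is arbitrary, and the function name `to_cylindrical` is my own.

```python
import numpy as np

def to_cylindrical(points, axis_point, axis_dir):
    """Convert 3D points to (angle, height, radius) around a given axis.

    axis_point: any point on the cylinder axis; axis_dir: axis direction.
    angle is in radians in (-pi, pi], measured from an arbitrary reference
    direction perpendicular to the axis.
    """
    a = np.asarray(axis_dir, dtype=float)
    a = a / np.linalg.norm(a)
    rel = np.asarray(points, dtype=float) - np.asarray(axis_point, dtype=float)
    height = rel @ a
    radial = rel - np.outer(height, a)
    radius = np.linalg.norm(radial, axis=1)
    # Pick a helper vector that is not parallel to the axis to define angle zero.
    helper = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(a, helper)
    e1 = e1 / np.linalg.norm(e1)
    e2 = np.cross(a, e1)
    angle = np.arctan2(radial @ e2, radial @ e1)
    return angle, height, radius

# Eight vertical bars on a 0.5 m radius cage, sampled at two heights.
theta = np.arange(8) * np.pi / 4 + 0.1
ring = np.column_stack([0.5 * np.cos(theta), np.zeros(8), 0.5 * np.sin(theta)])
angle, height, radius = to_cylindrical(np.vstack([ring, ring + [0, 1.0, 0]]),
                                       [0, 0, 0], [0, 1, 0])
print(np.degrees(np.diff(np.sort(angle[:8]))))  # angular spacing between bars
```

One caveat when clustering angles: they wrap around at ±pi, so a bar sitting near that boundary can be split into two clusters unless the clustering handles the wrap-around. Arc spacing between bars is then radius times angular spacing.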

11. Drone capture

Your drone motivation makes sense: if a tall cage is captured only from below, the upper part may be too thin even for human inspection.

I would frame the drone as a capture-quality tool, not the core algorithm.

Drone helps with:

  • top view,
  • safer access,
  • more repeatable viewpoints,
  • better angle for tall structures,
  • capturing upper sections at usable resolution,
  • video path around the structure.

Drone does not automatically solve:

  • rebar segmentation,
  • layer assignment,
  • metric measurement,
  • depth errors,
  • face extraction.

So the drone branch could be:

drone video / RGB-D video
→ better viewpoint coverage
→ same RGB-D/geometry pipeline

If the drone can carry RGB-D or stereo, that is stronger. If it only captures RGB, it still helps by improving view angle and resolution.

12. RGB-D feasibility test before full build

Before building the full pipeline, I would do a small field feasibility test.

For a few representative structures, check:

1. Does the sensor return usable depth at the real inspection distance?
2. Does sunlight break the depth stream?
3. Do steel bars create missing depth / noisy depth?
4. Is the point cloud dense enough around thin rebars?
5. Can front/rear/layer separation be seen in the point cloud?
6. Can 2D detections be lifted to stable 3D points?
7. Are spacing/length measurements stable across frames?

This is important because RGB-D sensors can fail in practical field conditions.
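A quick way to put a number on checks 3 and 4 is to measure how much of a detected bar mask actually has usable depth. This is a minimal sketch; the function name `valid_depth_ratio` and the `min_m` / `max_m` working range are placeholder assumptions that should be replaced with the real sensor's specified range.

```python
import numpy as np

def valid_depth_ratio(depth_m, mask, min_m=0.2, max_m=5.0):
    """Fraction of masked pixels whose depth is finite and within [min_m, max_m].

    depth_m: (H, W) depth image in metres (0 or NaN where the sensor failed).
    mask:    (H, W) boolean mask of a detected bar or keypoint region.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0
    d = np.asarray(depth_m, dtype=float)[mask]
    ok = np.isfinite(d) & (d >= min_m) & (d <= max_m)
    return float(ok.mean())

# Toy example: a 4-pixel bar mask where two pixels returned no depth.
depth = np.ones((4, 4))
depth[0, 0] = 0.0
depth[0, 1] = np.nan
bar_mask = np.zeros((4, 4), dtype=bool)
bar_mask[0, :] = True
print(valid_depth_ratio(depth, bar_mask))
```

Logging this ratio per bar during the field test gives an early indication of whether thin bars are measurable directly or only through neighbouring crosspoints and plane fits.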

Potential failure modes:

| Failure mode | Why it matters |
| --- | --- |
| Missing depth on thin bars | cannot measure certain bars directly |
| Noisy depth on steel/reflective surfaces | unstable spacing/length |
| Sunlight / IR interference | outdoor depth degradation |
| Too much distance | sparse/noisy point cloud |
| Motion blur | bad RGB detection |
| Misalignment RGB-depth | wrong 3D points |
| Occlusion | visible RGB point may not have valid depth |
| Background layers | wrong face/layer assignment |

Intel RealSense documentation discusses projection, texture mapping, stream alignment, occlusion, and calibration concepts.

There are also real-world reports of outdoor sunlight and reflective-surface issues, so a small field test is worth doing before committing to the full design.

13. Multi-frame RGB-D is better than one RGB-D frame

Even with RGB-D, I would use video if available.

A single RGB-D frame may have missing depth. A short sequence can help:

single RGB-D frame:
  maybe missing depth on some bars

RGB-D video:
  aggregate measurements across frames
  reject outliers
  fill missing observations
  estimate confidence

A simple multi-frame strategy:

for each frame:
  detect keypoints/bars
  lift to 3D
  assign to face/layer
  measure spacing/length

then:
  group measurements by structure element
  take robust median
  compute variance/confidence
  flag unstable measurements

This is useful because field depth data can be noisy. Repeated measurements are valuable.
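The "robust median" step could look roughly like this, using the median absolute deviation (MAD) to reject outlier frames before taking the final median. The function name, the returned keys, and the `mad_scale=3.0` cutoff are assumptions for this sketch rather than a recommended standard.

```python
import numpy as np

def aggregate_measurements(values, mad_scale=3.0):
    """Combine repeated measurements of one quantity (e.g. a spacing in mm).

    Drops NaNs, rejects values farther than mad_scale robust sigmas from the
    median, and returns the median of the rest plus simple stability stats.
    Returns None if there is no finite measurement.
    """
    v = np.asarray(values, dtype=float)
    v = v[np.isfinite(v)]
    if v.size == 0:
        return None
    med = np.median(v)
    # 1.4826 * MAD approximates the standard deviation for Gaussian noise.
    sigma = 1.4826 * np.median(np.abs(v - med))
    keep = np.abs(v - med) <= max(mad_scale * sigma, 1e-9)
    return {
        "median": float(np.median(v[keep])),
        "spread": float(sigma),
        "n_used": int(keep.sum()),
        "n_total": int(v.size),
    }

# One spacing measured in seven frames: one frame failed (NaN), one is an outlier.
print(aggregate_measurements([145, 146, 144, 145, 190, np.nan, 147]))
```

A large `spread` or a low `n_used / n_total` ratio is a natural trigger for the "flag unstable measurements" step.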

14. Confidence scoring

For inspection, I would not output only numbers. I would output:

count
spacing
length
overlay
confidence
failure reason if low confidence

Possible confidence factors:

| Factor | Meaning |
| --- | --- |
| RGB detection confidence | model certainty |
| valid depth ratio | how much of the detected bar/keypoint has usable depth |
| plane assignment margin | distance to assigned face vs other faces |
| multi-frame consistency | same measurement stable across frames |
| spacing regularity | measurements form plausible pattern |
| visibility | bar not too thin / occluded |
| calibration/scale quality | sensor or marker reliability |

A useful output might be:

Face A:
  horizontal bars: 12
  vertical bars: 8
  horizontal spacing median: 145 mm
  vertical spacing median: 200 mm
  confidence: high

Face B:
  horizontal bars: 11 or 12
  confidence: low
  reason: upper region too thin / missing depth

This is much more useful than a single hard count.
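One simple way to turn the confidence factors into the kind of output shown above is a "weakest link" rule: score each factor in [0, 1], and let the lowest score decide the label and the reported reason. This is only a sketch under that assumption; the class name `FaceResult`, the factor names, and the 0.8 / 0.5 thresholds are hypothetical and would need calibration against reviewed inspections.

```python
from dataclasses import dataclass, field

@dataclass
class FaceResult:
    """Per-face inspection result with per-factor confidence scores in [0, 1]."""
    face_id: str
    horizontal_bars: int
    vertical_bars: int
    factors: dict = field(default_factory=dict)  # factor name -> score

    def confidence(self, high=0.8, low=0.5):
        """Return (label, reason) based on the weakest factor."""
        if not self.factors:
            return "low", "no confidence factors recorded"
        name, score = min(self.factors.items(), key=lambda kv: kv[1])
        if score >= high:
            return "high", None
        label = "medium" if score >= low else "low"
        return label, f"weakest factor: {name} ({score:.2f})"

face_a = FaceResult("A", 12, 8, {"valid_depth_ratio": 0.9, "multi_frame_consistency": 0.85})
face_b = FaceResult("B", 11, 8, {"valid_depth_ratio": 0.3, "multi_frame_consistency": 0.7})
print(face_a.confidence())
print(face_b.confidence())
```

The weakest-link rule is deliberately conservative: a single bad factor, such as missing depth in the upper region, is enough to send the face to human review.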

15. Training and deployment split

Your GPU situation suggests a normal training/deployment split:

Training:
  32GB GPU
  larger model
  augmentation
  annotation experiments

Deployment:
  16GB GPU
  smaller detector/keypoint model
  deterministic geometry
  optional model export

For deployment, I would avoid making the point-cloud side too heavy.

A deployment-friendly design:

RGB inference:
  lightweight detector / segmentation / keypoint model

Depth geometry:
  CPU/GPU-light Open3D/OpenCV/NumPy operations

Measurement:
  deterministic post-processing

If deployment becomes an issue, models from Ultralytics-style pipelines can often be exported to ONNX/TensorRT/OpenVINO/CoreML formats.

16. Backend / dependency note

RGB-D pipelines can become annoying because they combine several dependency-heavy pieces:

  • sensor SDK,
  • PyTorch/CUDA,
  • OpenCV,
  • Open3D/PCL,
  • GPU runtime,
  • possibly drone SDK,
  • possibly VLM runtime.

I would keep the pipeline modular:

capture module/service:
  RGB-D camera / drone / sensor SDK

RGB inference module/service:
  detector / keypoint / segmentation model

geometry module/service:
  point cloud, plane fitting, clustering, measurement

QA/output module/service:
  overlays, confidence, reports, optional VLM

If dependency conflicts become painful, these can be separated into local services. That avoids forcing the sensor SDK, PyTorch runtime, and geometry stack to live in one fragile environment.

The practical principle:

Use deep learning mainly for RGB perception.
Use deterministic geometry first for RGB-D / point-cloud measurement.
Keep modules separable.

17. Cheaper fallback if RGB-D becomes inconvenient

Since RGB-D is acceptable, I would treat it as the main path.

But if it becomes too expensive, unreliable, or operationally awkward, the lower-cost fallback would be:

| Fallback | What it gives | Limitation |
| --- | --- | --- |
| RGB video / multi-image capture | temporal consistency, parallax, cross-frame voting | weak metric scale |
| RGB + a few known-size markers | scale / pose aid | requires marker placement |
| RGB + design dimensions | scale / validation | depends on design availability |
| two calibrated ordinary cameras | stereo triangulation | calibration/synchronization effort |
| photogrammetry / COLMAP | possible 3D diagnostic | repeated rebars may break matching |

This fallback is not as clean for metric spacing/length, but it may still help when the problem is mostly visibility and consistency rather than precise measurement.

18. My suggested roadmap

I would sequence the work like this:

| Phase | Goal | Output |
| --- | --- | --- |
| 0 | RGB-D field feasibility | know whether depth is usable on real rebars |
| 1 | one structure subtype | rectangular vertical cage or slab mesh |
| 2 | RGB detection/keypoint model | bars/crosspoints/endpoints in RGB |
| 3 | 2D detections to 3D | aligned depth + camera intrinsics |
| 4 | face/layer assignment | plane fitting / clustering |
| 5 | measurement | count, spacing, length |
| 6 | multi-frame robustness | median/variance/confidence |
| 7 | more structure types | tilted, 5/6-sided, cylindrical |
| 8 | deployment optimization | 16GB GPU, export, modular backend |
| 9 | drone/sensor integration | better capture geometry for tall structures |

I would not start by solving every cage type at once. I would pick one high-value subtype, prove the measurement pipeline, then generalize.

19. Revised recommendation

My revised recommendation would be:

Since RGB-D is acceptable and you need count + spacing + length,
treat the problem as metric rebar measurement, not only detection.

Use RGB for rebar/keypoint/face perception.
Use depth/point clouds for face/layer assignment and metric distances.
Start with deterministic geometry before point-cloud deep learning.
Use video/multiple frames for robustness and confidence.
Handle different cage types with separate geometry branches.
Use drones mainly to improve capture viewpoint for tall structures.

This gives a concrete first route:

RGB-D video
→ align depth to RGB
→ detect 2D rebar keypoints/masks
→ deproject detections to 3D
→ fit faces/layers
→ measure count/spacing/length
→ aggregate across frames
→ output overlay + confidence

That is probably the most practical path I would try first.