Oh. If RGB-D is acceptable, I think this can be made much more concrete.
Your latest constraints clarify the problem a lot.
Since you need:
- number of horizontal bars,
- number of vertical bars,
- horizontal spacing,
- vertical spacing,
- horizontal/vertical bar length,
- multiple cage shapes,
- no manual cropping,
- short video capture,
- RGB-D/stereo acceptable in the field,
- training on a larger GPU but deployment on a smaller GPU,
I would now treat this less as a "rebar counting from images" problem and more as a geometry-aware rebar measurement system.
My direct answer would be:
If RGB-D is acceptable, I would make RGB-D / point cloud geometry the main measurement path. Use RGB for detecting rebars, keypoints, crosspoints, faces, or candidate regions; use depth/point cloud geometry for face/layer assignment, metric spacing, and length measurement.
In other words:
RGB:
    what is rebar?
    where are the visible bars/keypoints/crosspoints?
Depth / point cloud:
    which face/layer do they belong to?
    what is the metric distance/length?
    is this measurement geometrically plausible?
That is much more concrete than trying to solve everything with monocular depth or a pure 2D detector.
1. The problem has changed from "counting" to "measurement"
Your latest requirements are broader than a count-only task.
| Requirement | What it implies |
|---|---|
| Count horizontal and vertical bars | detection / segmentation / row-column grouping |
| Measure spacing | metric scale is needed |
| Measure bar length | endpoints or fitted 3D line segments are needed |
| Multiple cage geometries | structure-type routing is needed |
| No manual crop | automatic ROI / structure / face detection is needed |
| Short video allowed | multi-frame consistency becomes useful |
| RGB-D acceptable | real depth / point cloud geometry becomes a strong option |
| 16GB GPU at deployment | keep the deployed model reasonably light |
So I would not design the system as:
one image → one detector → count
I would design it more like:
RGB-D capture set
→ automatic structure/face detection
→ RGB rebar/keypoint detection
→ depth/point-cloud geometry
→ face/layer assignment
→ count + spacing + length
→ overlay + confidence/review
2. RGB-D gives a clean division of labor
A practical RGB-D system could be divided like this:
| Stage | Input | Main tools | Output |
|---|---|---|---|
| Capture | RGB-D frames/video | RealSense / RGB-D SDK / Open3D capture | synchronized RGB + depth |
| ROI / structure detection | RGB | detector / segmenter / VLM-assisted QA | cage / slab / face candidate regions |
| Rebar perception | RGB | YOLO / YOLO-seg / Mask R-CNN / keypoint model | bars, masks, endpoints, crosspoints |
| 3D lifting | RGB detections + depth | camera intrinsics / aligned depth | 3D points / 3D segments |
| Face/layer extraction | point cloud | plane fitting, clustering, RANSAC, DBSCAN | faces/layers |
| Measurement | 3D points/lines | line fitting, distance computation | count, spacing, length |
| QA | RGB + overlay + metrics | confidence rules / VLM optional | reviewable result |
The dependency-safe version is:
Use deep learning mainly for RGB perception.
Use deterministic geometry first for the depth/point-cloud part.
That means you avoid starting with a heavy 3D point-cloud deep learning stack. You can add it later if needed.
3. A very practical first RGB-D prototype
I would start with one constrained subtype, not all cage types at once.
For example:
Prototype target:
vertical rectangular cage
one or two visible faces
RGB-D video
known camera intrinsics
no manual crop at final deployment,
but manual crop may be allowed during early debugging
A first prototype pipeline:
1. Record RGB-D video of one cage.
2. Align depth to RGB.
3. Detect 2D rebar features in RGB:
- crosspoints, or
- bar endpoints, or
- bar masks/segments.
4. Lift those 2D detections into 3D using depth.
5. Fit planes/layers in the point cloud.
6. Assign bars/keypoints to faces/layers.
7. Cluster bars into horizontal/vertical groups.
8. Measure spacing and length in 3D.
9. Show overlay + confidence + failure reasons.
The core operation is simple:
2D detection:
pixel coordinate = (u, v)
Depth:
z = depth(u, v)
Camera intrinsics:
(u, v, z) → 3D point (X, Y, Z)
For Intel RealSense, this is the kind of operation handled by rs2_deproject_pixel_to_point(...).
So a very useful pattern is:
detect in RGB
→ read aligned depth at the detected pixels
→ deproject to 3D
→ measure in 3D
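As a rough sketch of the "read aligned depth, then deproject" step, here is a plain-Python version that assumes an undistorted pinhole camera. The names `median_depth`, `deproject`, `fx`, `fy`, `cx`, `cy` and the sample patch are my own illustrative assumptions, as is the convention that a depth of 0 means "no data". A real sensor SDK would also handle lens distortion.

```python
import statistics
from typing import List, Optional, Sequence, Tuple


def median_depth(depth: Sequence[Sequence[float]], u: int, v: int,
                 radius: int = 2) -> Optional[float]:
    """Median of the valid depth values in a window around pixel (u, v).

    Assumes `depth` is aligned to the RGB image, indexed as depth[row][col],
    in metres, with 0 marking missing depth (an assumption, check your sensor).
    """
    rows, cols = len(depth), len(depth[0])
    valid: List[float] = []
    for r in range(max(0, v - radius), min(rows, v + radius + 1)):
        for c in range(max(0, u - radius), min(cols, u + radius + 1)):
            if depth[r][c] > 0:
                valid.append(depth[r][c])
    return statistics.median(valid) if valid else None


def deproject(u: float, v: float, z: float, fx: float, fy: float,
              cx: float, cy: float) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) at depth z using an undistorted pinhole model."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


# Hypothetical 3x3 depth patch: the detected pixel itself has no depth,
# so the neighbourhood median is used instead.
patch = [[1.98, 2.00, 2.02],
         [2.00, 0.00, 2.01],
         [1.99, 2.00, 0.00]]
z = median_depth(patch, u=1, v=1, radius=1)
point = deproject(u=1, v=1, z=z, fx=600.0, fy=600.0, cx=1.0, cy=1.0)
```

Reading a small window rather than a single pixel is a design choice for thin bars, where the exact detected pixel often has a depth hole. A window that is too large can pick up the background instead, so the radius needs checking on real data.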
4. Related RGB-D rebar work
There is relevant work very close to this direction.
The paper "Automatic Quality Inspection of Rebar Spacing Using Vision-Based Deep Learning with RGBD Camera" proposes RGB-D + deep learning for rebar spacing inspection.
What is especially relevant is the idea of using depth/point-cloud processing to handle different layers. That is close to your front/rear/layer separation issue.
There is also related work combining instance segmentation and stereo vision for steel-bar installation inspection.
And there is recent work around Rebar-YOLOv8-seg + depth / point cloud for rebar spacing measurement.
I would not assume any of these is a drop-in solution for your exact cage variations, but they support the same overall direction:
RGB recognition
+ depth / point-cloud geometry
+ metric measurement
5. Bar masks vs keypoints vs crosspoints
For your outputs, I would consider more than one detection target.
| Detection target | Good for | Weakness |
|---|---|---|
| Bar mask / instance | visible bar count, approximate length, face assignment | thin/overlapping bars can break masks |
| Bar centerline | count, spacing, row/column clustering | needs good line extraction |
| Endpoints | length measurement | endpoints may be occluded or outside image |
| Crosspoints / intersections | spacing measurement, grid geometry | not always visible on all cage faces |
| Face / layer plane | face assignment | depends on depth quality |
| Whole cage / structure region | automatic ROI | not enough for measurement alone |
For spacing, crosspoints/keypoints can be very useful.
For length, endpoints or 3D line fitting are useful.
For count, row/column line clustering may be more stable than trying to perfectly count every visible fragment.
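To illustrate why row/column clustering can be more stable than counting fragments, here is a minimal sketch that groups the positions of detected fragments along one axis and counts the groups. The function names, the `max_gap` tolerance, and the sample heights are illustrative assumptions, not values from any of the cited work.

```python
from typing import List


def cluster_1d(positions: List[float], max_gap: float) -> List[List[float]]:
    """Group scalar positions; a gap larger than max_gap starts a new cluster."""
    clusters: List[List[float]] = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def bar_positions(positions: List[float], max_gap: float) -> List[float]:
    """One representative position (the mean) per cluster, i.e. per bar."""
    return [sum(c) / len(c) for c in cluster_1d(positions, max_gap)]


# Hypothetical heights (metres) of horizontal-bar fragments on one face.
# Six fragments, but only three distinct bars.
fragment_heights = [0.00, 0.01, 0.149, 0.15, 0.16, 0.30]
rows = bar_positions(fragment_heights, max_gap=0.03)
bar_count = len(rows)
```

The tolerance has to stay well below the expected bar spacing, otherwise neighbouring bars chain into one cluster.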
A reasonable combined strategy is:
Detect:
- cage / face region
- visible rebar lines or masks
- crosspoints / intersections where possible
- endpoints where visible
Then:
use RGB-D geometry to assign them to faces/layers
and measure in 3D.
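For the length part, one possible sketch is to fit a 3D line through the lifted points of a single bar and take the extent of the points along that line. This version finds the dominant direction with power iteration on the scatter matrix so it needs no external library. `fit_bar_segment` and the sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def fit_bar_segment(points: List[Point],
                    iterations: int = 50) -> Tuple[List[float], float]:
    """Return (unit direction, length) of a best-fit 3D line through points."""
    n = len(points)
    if n == 0:
        raise ValueError("need at least one point")
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    # 3x3 scatter matrix of the centred points
    scatter = [[sum(c[i] * c[j] for c in centered) for j in range(3)]
               for i in range(3)]
    # Start from the point farthest from the centroid, which already
    # points roughly along the bar.
    d = max(centered, key=lambda c: sum(x * x for x in c))
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0.0:
        return [0.0, 0.0, 0.0], 0.0  # all points coincide
    d = [x / norm for x in d]
    # Power iteration converges to the dominant eigenvector (bar direction)
    for _ in range(iterations):
        d = [sum(scatter[i][j] * d[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in d))
        d = [x / norm for x in d]
    # Length = extent of the projections onto the fitted direction
    t = [sum(c[i] * d[i] for i in range(3)) for c in centered]
    return d, max(t) - min(t)


# Hypothetical lifted points along one horizontal bar (metres)
bar_points = [(0.0, 0.0, 2.0), (0.4, 0.001, 2.0),
              (0.8, -0.001, 2.0), (1.2, 0.0, 2.0)]
direction, length = fit_bar_segment(bar_points)
```

This only measures the visible extent. If an end is occluded or outside the image, the result is a lower bound on the true length, not the length itself.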
6. Suggested structure-type routing
Since your targets can be rectangular, 5-sided, 6-sided, cylindrical, tilted, slab-style, etc., I would not try to solve all geometry with one post-processing rule.
I would route by structure type first.
| Structure type | Suggested measurement strategy |
|---|---|
| Flat slab mesh | plane extraction + 2D grid/keypoint measurement on the plane |
| Vertical rectangular cage | face plane extraction + horizontal/vertical line clustering |
| Tilted rectangular cage | same as rectangular, but use fitted plane coordinates instead of image coordinates |
| 5/6-sided cage | detect multiple face planes, process each face separately |
| Cylindrical cage | fit cylinder / unwrap to angular-height coordinates |
| Very tall cage | use drone/video to capture upper sections at usable resolution |
| Complex cluttered scene | automatic structure/ROI detection becomes the first hard problem |
A good design is probably:
global pipeline:
detect structure type
detect candidate faces/layers
choose measurement logic per structure type
This is where a small VLM may help, but I would not use it as the measurement engine.
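The routing idea can be sketched as a plain dispatch table: a classifier (or VLM) only has to produce a structure-type label, and each label maps to its own measurement branch. The labels, branch functions, and return strings below are placeholders I made up for illustration.

```python
from typing import Callable, Dict


def measure_planar(scene: dict) -> str:
    """Placeholder for the plane-based branch (slab, rectangular, multi-sided)."""
    return f"planar branch with {scene.get('faces', 1)} face(s)"


def measure_cylindrical(scene: dict) -> str:
    """Placeholder for the cylinder-fit branch."""
    return "cylindrical branch"


# Hypothetical structure-type labels mapped to measurement branches
BRANCHES: Dict[str, Callable[[dict], str]] = {
    "slab_mesh": measure_planar,
    "rectangular_cage": measure_planar,
    "tilted_rectangular_cage": measure_planar,
    "multi_sided_cage": measure_planar,
    "cylindrical_cage": measure_cylindrical,
}


def route(structure_type: str, scene: dict) -> str:
    """Pick the measurement branch; unknown types go to human review."""
    branch = BRANCHES.get(structure_type)
    if branch is None:
        return "needs human review: unknown structure type"
    return branch(scene)


result = route("multi_sided_cage", {"faces": 5})
```

Sending unknown labels to review rather than to a default branch keeps a wrong routing decision from silently producing numbers.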
7. Role of a VLM
You mentioned a small local VLM if required.
I would use a VLM for:
- scene/structure-type routing,
- sanity checking,
- explaining overlays,
- flagging strange cases,
- choosing between "slab mesh / vertical cage / cylindrical cage" categories,
- generating inspection notes for human review.
I would not use a VLM as the primary source of numeric measurements.
So:
Good VLM use:
"This looks like a tilted vertical cage with two visible faces."
"The top region is too thin/low-resolution for reliable bar counting."
"The overlay appears to miss the rear face."
Risky VLM use:
"There are exactly 17 bars and spacing is 143 mm."
For numeric measurements, I would trust geometry and explicit detections more than a language/vision model.
8. Face/layer extraction with RGB-D
This is one of the most important parts.
If you have an RGB-D frame, you can create a point cloud and try:
point cloud
→ remove obvious background
→ find dominant planes/layers
→ assign detected bars/keypoints to the nearest plane/layer
Tools/ideas:
- RANSAC plane fitting,
- DBSCAN clustering,
- normal estimation,
- line fitting,
- repeated plane extraction,
- distance-to-plane filtering,
- multi-frame averaging.
Open3D is a practical library for this kind of point-cloud work.
A minimal geometry path:
RGB-D frame
→ point cloud
→ plane segmentation
→ distance-to-plane assignment
→ process each face/layer separately
This is much easier to deploy than starting with point-cloud neural networks.
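To make the minimal geometry path concrete, here is a small standard-library sketch of repeated RANSAC plane extraction followed by distance-to-plane assignment. In practice a library such as Open3D would do this much faster. All function names, thresholds, and the synthetic front/rear points are assumptions for illustration only.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]
Plane = Tuple[List[float], float]  # (unit normal n, offset d) with n.x + d = 0


def plane_from_points(p: Point, q: Point, r: Point) -> Optional[Plane]:
    a = [q[i] - p[i] for i in range(3)]
    b = [r[i] - p[i] for i in range(3)]
    n = [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0]]
    norm = math.sqrt(sum(c * c for c in n))
    if norm < 1e-12:
        return None  # the three samples are collinear
    n = [c / norm for c in n]
    return n, -sum(n[i] * p[i] for i in range(3))


def point_plane_distance(pt: Point, plane: Plane) -> float:
    n, d = plane
    return abs(sum(n[i] * pt[i] for i in range(3)) + d)


def ransac_plane(points: List[Point], threshold: float = 0.01,
                 iterations: int = 200, seed: int = 0):
    """Return (plane, inliers) for the plane supported by the most points."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        inliers = [p for p in points
                   if point_plane_distance(p, plane) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


def extract_planes(points: List[Point], max_planes: int = 2,
                   min_inliers: int = 6, threshold: float = 0.01) -> List[Plane]:
    """Repeated extraction: fit a plane, remove its inliers, fit again."""
    planes, rest = [], list(points)
    while len(planes) < max_planes and len(rest) >= 3:
        plane, inliers = ransac_plane(rest, threshold)
        if plane is None or len(inliers) < min_inliers:
            break
        planes.append(plane)
        rest = [p for p in rest if point_plane_distance(p, plane) > threshold]
    return planes


def assign_to_nearest_plane(pt: Point, planes: List[Plane],
                            max_dist: float = 0.03) -> Optional[int]:
    """Index of the nearest plane, or None if the point fits no plane."""
    dists = [point_plane_distance(pt, pl) for pl in planes]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best if dists[best] <= max_dist else None


# Synthetic example: a front face at z = 1.0 m and a rear face at z = 1.3 m
front = [(0.1 * i, 0.1 * j, 1.0) for i in range(5) for j in range(4)]
rear = [(0.1 * i, 0.1 * j, 1.3) for i in range(4) for j in range(2)]
planes = extract_planes(front + rear)
face_index = assign_to_nearest_plane((0.12, 0.05, 1.29), planes)
```

Rebar point clouds are sparse and mostly holes, so on real data the inlier threshold and the minimum inlier count matter a lot and need tuning per sensor and distance.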
9. Projecting to a face plane
Once you detect a face plane, measurement becomes easier.
For a flat face:
1. Fit the face plane.
2. Define local face coordinates:
x_axis = horizontal direction on face
y_axis = vertical direction on face
z_axis = face normal
3. Project detected 3D points/lines onto that face coordinate system.
4. Count rows/columns in 2D face coordinates.
5. Measure spacing/length in metric units.
This avoids measuring in raw image coordinates, which are distorted by camera perspective.
For tilted cages, this is especially important. A tilted face may look compressed in the image, but in the fitted face coordinate system the spacing becomes more meaningful.
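Here is a small sketch of the face-coordinate idea, using a face tilted 30 degrees away from the camera. The helper names and the `up_hint` default are my assumptions: I assume a camera frame where the Y axis points down, so "up" is (0, -1, 0). If the sensor provides an IMU gravity vector, that would be a better hint.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross(a: Vec, b: Vec) -> List[float]:
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def unit(a: Vec) -> List[float]:
    n = math.sqrt(dot(a, a))
    if n < 1e-12:
        raise ValueError("zero-length vector")
    return [x / n for x in a]


def face_axes(normal: Vec, up_hint: Vec = (0.0, -1.0, 0.0)):
    """Right-handed (x_axis, y_axis, z_axis) with z_axis = face normal."""
    z_axis = unit(normal)
    k = dot(up_hint, z_axis)
    # Remove the part of up_hint along the normal to get the in-plane "up"
    y_axis = unit([up_hint[i] - k * z_axis[i] for i in range(3)])
    x_axis = cross(y_axis, z_axis)
    return x_axis, y_axis, z_axis


def to_face_coords(point: Vec, origin: Vec, axes) -> Tuple[float, float]:
    x_axis, y_axis, _ = axes
    rel = [point[i] - origin[i] for i in range(3)]
    return dot(rel, x_axis), dot(rel, y_axis)


# Four vertical bars spaced 0.15 m apart on a face tilted by 30 degrees
tilt = math.radians(30)
normal = (math.sin(tilt), 0.0, math.cos(tilt))
along = (math.cos(tilt), 0.0, -math.sin(tilt))  # in-plane horizontal direction
origin = (0.0, 0.0, 2.0)
bars = [[origin[i] + k * 0.15 * along[i] for i in range(3)] for k in range(4)]

axes = face_axes(normal)
xs = sorted(to_face_coords(p, origin, axes)[0] for p in bars)
face_gaps = [b - a for a, b in zip(xs, xs[1:])]
camera_gaps = [b[0] - a[0] for a, b in zip(bars, bars[1:])]
```

In this example the gaps measured along the camera X axis come out foreshortened by cos(30°), about 0.13 m, while the gaps in face coordinates recover the 0.15 m spacing.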
10. Cylindrical cages
For cylindrical cages, a flat-plane strategy will fail.
A rough path:
1. Detect/segment the cylindrical cage region.
2. Fit a cylinder or estimate cylinder axis.
3. Convert 3D points to cylindrical coordinates:
angle, height, radius
4. Count vertical bars by angular clusters.
5. Count horizontal rings by height clusters.
6. Measure ring spacing by height distance.
7. Estimate circumference / arc spacing if needed.
So I would treat cylindrical cages as a separate branch, not as just another rectangular cage.
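A sketch of steps 3 to 6, assuming the cylinder axis has already been estimated and that the RGB detections already tell us which points belong to vertical bars and which to rings. The function names, the tolerances, and the synthetic cage are illustrative assumptions. The angular count has to handle the wrap-around at ±π, which is the easy thing to forget.

```python
import math
from typing import Iterable, List, Optional, Sequence, Tuple

Vec = Sequence[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross(a: Vec, b: Vec) -> List[float]:
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def cylindrical_coords(point: Vec, axis_point: Vec,
                       axis_dir: Vec) -> Tuple[float, float, float]:
    """Return (angle, height, radius); axis_dir must be a unit vector."""
    rel = [point[i] - axis_point[i] for i in range(3)]
    height = dot(rel, axis_dir)
    radial = [rel[i] - height * axis_dir[i] for i in range(3)]
    # Any two orthonormal vectors perpendicular to the axis define angle 0
    helper = (1.0, 0.0, 0.0) if abs(axis_dir[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = cross(axis_dir, helper)
    n1 = math.sqrt(dot(e1, e1))
    e1 = [x / n1 for x in e1]
    e2 = cross(axis_dir, e1)
    angle = math.atan2(dot(radial, e2), dot(radial, e1))
    return angle, height, math.sqrt(dot(radial, radial))


def count_clusters(values: Iterable[float], max_gap: float,
                   period: Optional[float] = None) -> int:
    """Count gap-separated clusters; `period` merges across the wrap-around."""
    ordered = sorted(values)
    if not ordered:
        return 0
    count = 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b - a > max_gap)
    if period is not None and count > 1 \
            and ordered[0] + period - ordered[-1] <= max_gap:
        count -= 1  # first and last cluster are the same bar across the seam
    return count


# Synthetic cage: radius 0.3 m, vertical axis, 8 vertical bars, 3 rings
axis_point, axis_dir = (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
vertical_pts = [(0.3 * math.cos(a + j), h, 0.3 * math.sin(a + j))
                for a in (k * math.pi / 4 for k in range(8))
                for h, j in ((0.0, -0.01), (0.2, 0.0), (0.4, 0.01))]
ring_pts = [(0.3 * math.cos(a), h, 0.3 * math.sin(a))
            for h in (0.0, 0.2, 0.4)
            for a in (k * math.pi / 6 for k in range(12))]

angles = [cylindrical_coords(p, axis_point, axis_dir)[0] for p in vertical_pts]
heights = [cylindrical_coords(p, axis_point, axis_dir)[1] for p in ring_pts]
n_vertical = count_clusters(angles, math.radians(10), period=2 * math.pi)
n_rings = count_clusters(heights, 0.05)
```

With only part of the cylinder visible from one viewpoint, the angular count covers only the visible arc, so a full count needs views from around the cage.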
11. Drone capture
Your drone motivation makes sense: if a tall cage is captured only from below, the upper part may be too thin even for human inspection.
I would frame the drone as a capture-quality tool, not the core algorithm.
Drone helps with:
- top view,
- safer access,
- more repeatable viewpoints,
- better angle for tall structures,
- capturing upper sections at usable resolution,
- video path around the structure.
Drone does not automatically solve:
- rebar segmentation,
- layer assignment,
- metric measurement,
- depth errors,
- face extraction.
So the drone branch could be:
drone video / RGB-D video
→ better viewpoint coverage
→ same RGB-D/geometry pipeline
If the drone can carry RGB-D or stereo, that is stronger. If it only captures RGB, it still helps by improving view angle and resolution.
12. RGB-D feasibility test before full build
Before building the full pipeline, I would do a small field feasibility test.
For a few representative structures, check:
1. Does the sensor return usable depth at the real inspection distance?
2. Does sunlight break the depth stream?
3. Do steel bars create missing depth / noisy depth?
4. Is the point cloud dense enough around thin rebars?
5. Can front/rear/layer separation be seen in the point cloud?
6. Can 2D detections be lifted to stable 3D points?
7. Are spacing/length measurements stable across frames?
This is important because RGB-D sensors can fail in practical field conditions.
Potential failure modes:
| Failure mode | Why it matters |
|---|---|
| Missing depth on thin bars | cannot measure certain bars directly |
| Noisy depth on steel/reflective surfaces | unstable spacing/length |
| Sunlight / IR interference | outdoor depth degradation |
| Too much distance | sparse/noisy point cloud |
| Motion blur | bad RGB detection |
| Misalignment RGB-depth | wrong 3D points |
| Occlusion | visible RGB point may not have valid depth |
| Background layers | wrong face/layer assignment |
Intel RealSense documentation discusses projection, texture mapping, stream alignment, occlusion, and calibration concepts.
There are also real-world reports of outdoor sunlight and reflective-surface issues, so a small field test is worth doing before committing to the full design.
13. Multi-frame RGB-D is better than one RGB-D frame
Even with RGB-D, I would use video if available.
A single RGB-D frame may have missing depth. A short sequence can help:
single RGB-D frame:
    maybe missing depth on some bars
RGB-D video:
    aggregate measurements across frames
    reject outliers
    fill missing observations
    estimate confidence
A simple multi-frame strategy:
for each frame:
    detect keypoints/bars
    lift to 3D
    assign to face/layer
    measure spacing/length
then:
    group measurements by structure element
    take robust median
    compute variance/confidence
    flag unstable measurements
This is useful because field depth data can be noisy. Repeated measurements are valuable.
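The "then" half of that strategy could look roughly like this for one structure element. It uses the median and the median absolute deviation (MAD) to reject outlier frames before reporting a value. The function name, the thresholds (`k`, `min_scale`, `max_spread`, `min_frames`), and the sample values are assumptions for illustration, with measurements in millimetres and `None` meaning the element was not measurable in that frame.

```python
import statistics
from typing import Dict, List, Optional


def aggregate_measurement(per_frame: List[Optional[float]], k: float = 3.0,
                          min_scale: float = 1.0, max_spread: float = 3.0,
                          min_frames: int = 3) -> Dict[str, object]:
    """Robustly combine per-frame measurements of one element."""
    values = [v for v in per_frame if v is not None]
    if len(values) < min_frames:
        return {"value": None, "spread": None, "used": len(values),
                "rejected": 0, "stable": False}
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    # 1.4826 * MAD approximates the standard deviation for Gaussian noise;
    # min_scale keeps the limit from collapsing when most frames agree exactly.
    limit = k * max(1.4826 * mad, min_scale)
    kept = [v for v in values if abs(v - med) <= limit]
    spread = statistics.pstdev(kept)
    return {
        "value": statistics.median(kept),
        "spread": spread,
        "used": len(kept),
        "rejected": len(values) - len(kept),
        "stable": len(kept) >= min_frames and spread <= max_spread,
    }


# Hypothetical spacing of one bar pair over seven frames: one frame has
# no depth and one frame is a gross outlier.
spacing = aggregate_measurement([145, 146, 144, 145, None, 210, 145])
```

The median is preferred over the mean here because a single bad depth frame can shift a mean by several millimetres while leaving the median untouched.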
14. Confidence scoring
For inspection, I would not output only numbers. I would output:
count
spacing
length
overlay
confidence
failure reason if low confidence
Possible confidence factors:
| Factor | Meaning |
|---|---|
| RGB detection confidence | model certainty |
| valid depth ratio | how much of the detected bar/keypoint has usable depth |
| plane assignment margin | distance to assigned face vs other faces |
| multi-frame consistency | same measurement stable across frames |
| spacing regularity | measurements form plausible pattern |
| visibility | bar not too thin / occluded |
| calibration/scale quality | sensor or marker reliability |
A useful output might be:
Face A:
    horizontal bars: 12
    vertical bars: 8
    horizontal spacing median: 145 mm
    vertical spacing median: 200 mm
    confidence: high
Face B:
    horizontal bars: 11 or 12
    confidence: low
    reason: upper region too thin / missing depth
This is much more useful than a single hard count.
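One simple way to turn those factors into a level plus a failure reason is a weakest-link rule: the overall confidence is the lowest factor, and every factor below a threshold is reported as a reason. This is only one possible scheme. The factor names, the 0 to 1 scaling, and the `low`/`high` thresholds are assumptions I made for the sketch.

```python
from typing import Dict, List


def valid_depth_ratio(depth_samples: List[float]) -> float:
    """Fraction of depth samples along a detection that are usable (> 0)."""
    if not depth_samples:
        return 0.0
    return sum(1 for d in depth_samples if d > 0) / len(depth_samples)


def summarize_confidence(factors: Dict[str, float], low: float = 0.5,
                         high: float = 0.8) -> Dict[str, object]:
    """Weakest-link confidence: the lowest factor decides the level."""
    score = min(factors.values())
    level = "high" if score >= high else "medium" if score >= low else "low"
    reasons = sorted(name for name, value in factors.items() if value < low)
    return {"confidence": level, "score": score, "reasons": reasons}


# Hypothetical factors for one face: good RGB detection, stable across
# frames, but most depth samples along the bar are missing.
face_b = summarize_confidence({
    "rgb_detection": 0.92,
    "valid_depth_ratio": valid_depth_ratio([2.01, 0.0, 0.0, 2.03, 0.0]),
    "multi_frame_consistency": 0.85,
})
```

I would pick a weakest-link rule over a weighted average for inspection, because a high RGB score should not be able to hide a measurement that has almost no valid depth behind it.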
15. Training and deployment split
Your GPU situation suggests a normal training/deployment split:
Training:
    32GB GPU
    larger model
    augmentation
    annotation experiments
Deployment:
    16GB GPU
    smaller detector/keypoint model
    deterministic geometry
    optional model export
For deployment, I would avoid making the point-cloud side too heavy.
A deployment-friendly design:
RGB inference:
    lightweight detector / segmentation / keypoint model
Depth geometry:
    CPU/GPU-light Open3D/OpenCV/NumPy operations
Measurement:
    deterministic post-processing
If deployment becomes an issue, models from Ultralytics-style pipelines can often be exported to ONNX/TensorRT/OpenVINO/CoreML formats.
16. Backend / dependency note
RGB-D pipelines can become annoying because they combine several dependency-heavy pieces:
- sensor SDK,
- PyTorch/CUDA,
- OpenCV,
- Open3D/PCL,
- GPU runtime,
- possibly drone SDK,
- possibly VLM runtime.
I would keep the pipeline modular:
capture module/service:
    RGB-D camera / drone / sensor SDK
RGB inference module/service:
    detector / keypoint / segmentation model
geometry module/service:
    point cloud, plane fitting, clustering, measurement
QA/output module/service:
    overlays, confidence, reports, optional VLM
If dependency conflicts become painful, these can be separated into local services. That avoids forcing the sensor SDK, PyTorch runtime, and geometry stack to live in one fragile environment.
The practical principle:
Use deep learning mainly for RGB perception.
Use deterministic geometry first for RGB-D / point-cloud measurement.
Keep modules separable.
17. Cheaper fallback if RGB-D becomes inconvenient
Since RGB-D is acceptable, I would treat it as the main path.
But if it becomes too expensive, unreliable, or operationally awkward, the lower-cost fallback would be:
| Fallback | What it gives | Limitation |
|---|---|---|
| RGB video / multi-image capture | temporal consistency, parallax, cross-frame voting | weak metric scale |
| RGB + a few known-size markers | scale / pose aid | requires marker placement |
| RGB + design dimensions | scale / validation | depends on design availability |
| two calibrated ordinary cameras | stereo triangulation | calibration/synchronization effort |
| photogrammetry / COLMAP | possible 3D diagnostic | repeated rebars may break matching |
This fallback is not as clean for metric spacing/length, but it may still help when the problem is mostly visibility and consistency rather than precise measurement.
18. My suggested roadmap
I would sequence the work like this:
| Phase | Goal | Output |
|---|---|---|
| 0 | RGB-D field feasibility | know whether depth is usable on real rebars |
| 1 | one structure subtype | rectangular vertical cage or slab mesh |
| 2 | RGB detection/keypoint model | bars/crosspoints/endpoints in RGB |
| 3 | 2D detections to 3D | aligned depth + camera intrinsics |
| 4 | face/layer assignment | plane fitting / clustering |
| 5 | measurement | count, spacing, length |
| 6 | multi-frame robustness | median/variance/confidence |
| 7 | more structure types | tilted, 5/6-sided, cylindrical |
| 8 | deployment optimization | 16GB GPU, export, modular backend |
| 9 | drone/sensor integration | better capture geometry for tall structures |
I would not start by solving every cage type at once. I would pick one high-value subtype, prove the measurement pipeline, then generalize.
19. Revised recommendation
My revised recommendation would be:
Since RGB-D is acceptable and you need count + spacing + length,
treat the problem as metric rebar measurement, not only detection.
Use RGB for rebar/keypoint/face perception.
Use depth/point clouds for face/layer assignment and metric distances.
Start with deterministic geometry before point-cloud deep learning.
Use video/multiple frames for robustness and confidence.
Handle different cage types with separate geometry branches.
Use drones mainly to improve capture viewpoint for tall structures.
This gives a concrete first route:
RGB-D video
→ align depth to RGB
→ detect 2D rebar keypoints/masks
→ deproject detections to 3D
→ fit faces/layers
→ measure count/spacing/length
→ aggregate across frames
→ output overlay + confidence
That is probably the most practical path I would try first.