I’m working on a project where I need to extract readings from Blood Pressure and Glucose Machines using Machine Learning. These devices typically display values using 7-segment digits, which makes OCR challenging.
What I’ve Tried So Far:
Open-source OCR models (e.g., models on Hugging Face, Tesseract, EasyOCR) – but they struggle with 7-segment digits.
Google Cloud Vision API – This gives much better accuracy, but the problem is:
Different devices show varying amounts of information (e.g., time, date, previous readings, current readings, etc.).
The API returns a long string, making it difficult to extract the specific readings I need.
Additional Challenge:
I also attempted to fine-tune an open-source AI model that accepts image data, but I couldn’t train it on Google Colab’s T4 GPU due to memory limitations.
Need Help With:
How can I accurately extract the correct values (e.g., systolic, diastolic, BPM, glucose level) from the text output of the Cloud Vision API? (See the parsing sketch after this post.)
Are there any efficient open-source models or techniques that handle 7-segment OCR better?
Any recommendations on training an AI model on a lower-memory environment?
I’d really appreciate any guidance or suggestions to overcome these issues. Thanks in advance!
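For question (1) above, one low-tech option is a keyword/regex pass over the text Cloud Vision returns, anchored on the labels the device prints next to each number. A minimal sketch in Python (the SYS/DIA/PUL labels, the mg/dL pattern, and the sample string are assumptions and would need adjusting per device):
import re

# Hypothetical example of the kind of string Cloud Vision might return for a BP monitor
vision_text = "12:45 03-18\nSYS 128 mmHg\nDIA 82 mmHg\nPUL 71 /min"

def parse_readings(text):
    """Pull known fields out of the OCR text with label-anchored regexes."""
    patterns = {
        "systolic":  r"SYS\D*(\d{2,3})",
        "diastolic": r"DIA\D*(\d{2,3})",
        "bpm":       r"(?:PULSE|PUL)\D*(\d{2,3})",
        "glucose":   r"(\d{2,3})\s*mg/?dL",
    }
    readings = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            readings[name] = int(match.group(1))
    return readings

print(parse_readings(vision_text))  # {'systolic': 128, 'diastolic': 82, 'bpm': 71}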
There also seem to be some lightweight methods that extract the digits using classical image processing with OpenCV etc., without any ML (a sketch follows below), but how about trying one of the VLMs provided by Google, Microsoft, etc.?
These models are relatively small, so training them doesn’t take as many resources as larger models.
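As a rough illustration of that non-ML route: classic seven-segment decoding binarizes the display, crops each digit, and checks which of the seven segment regions are lit. A minimal sketch with OpenCV (the segment geometry and the on/off threshold are assumptions that need tuning per device; locating and cropping the digits is left out):
import cv2
import numpy as np

# Which segments are lit for each digit, in the order:
# (top, top-left, top-right, middle, bottom-left, bottom-right, bottom)
DIGIT_SEGMENTS = {
    (1, 1, 1, 0, 1, 1, 1): 0,
    (0, 0, 1, 0, 0, 1, 0): 1,
    (1, 0, 1, 1, 1, 0, 1): 2,
    (1, 0, 1, 1, 0, 1, 1): 3,
    (0, 1, 1, 1, 0, 1, 0): 4,
    (1, 1, 0, 1, 0, 1, 1): 5,
    (1, 1, 0, 1, 1, 1, 1): 6,
    (1, 0, 1, 0, 0, 1, 0): 7,
    (1, 1, 1, 1, 1, 1, 1): 8,
    (1, 1, 1, 1, 0, 1, 1): 9,
}

def decode_digit(roi, on_ratio=0.4):
    """roi: binarized (white-on-black) crop of a single digit."""
    h, w = roi.shape
    dh, dw = int(h * 0.15), int(w * 0.25)   # assumed segment thickness
    # (x, y, width, height) of each of the seven segment regions
    regions = [
        (0, 0, w, dh),                    # top
        (0, 0, dw, h // 2),               # top-left
        (w - dw, 0, dw, h // 2),          # top-right
        (0, h // 2 - dh // 2, w, dh),     # middle
        (0, h // 2, dw, h // 2),          # bottom-left
        (w - dw, h // 2, dw, h // 2),     # bottom-right
        (0, h - dh, w, dh),               # bottom
    ]
    lit = []
    for (x, y, sw, sh) in regions:
        segment = roi[y:y + sh, x:x + sw]
        lit.append(1 if cv2.countNonZero(segment) / float(sw * sh) > on_ratio else 0)
    return DIGIT_SEGMENTS.get(tuple(lit))  # None if the pattern is not recognized

# Quick self-test on a synthetic fully-lit digit (all segments on -> 8)
print(decode_digit(np.full((60, 40), 255, dtype=np.uint8)))  # 8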
Hi, thanks for trying to help me. But when I want to run Qwen2-VL-2B / 3B / 7B or other models, the common problem I face is:
OutOfMemoryError: CUDA out of memory. Tried to allocate 230.66 GiB. GPU 0 has a total capacity of 39.56 GiB of which 3.03 GiB is free. Process 24867 has 36.52 GiB memory in use. Of the allocated memory 35.26 GiB is allocated by PyTorch, and 774.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This is while using Colab Pro with a 40GB GPU. I have no idea how to fix it. I tried some optimizations to save GPU memory, but nothing helped.
Can you tell me how I can fix this issue or run this model on Colab?
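One thing worth noting from the error text itself: the expandable_segments hint only helps with allocator fragmentation, not with a request for ~230 GiB, but it is easy to try in Colab by setting the variable before torch initializes CUDA. A minimal sketch:
import os
# Set before torch touches the GPU (easiest: before the first `import torch` in the session)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # subsequent CUDA allocations use expandable segments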
Can you share the code for the model loading part?
According to the error message, it seems that the program is trying to allocate about 230GB of VRAM, which is strange no matter how you look at it…
Or, are you loading the model itself multiple times in the loop?
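A quick way to check for accidental repeated loading is to print the allocated VRAM right after loading the model and again inside the processing loop; if the number keeps climbing, something is being re-created or kept on the GPU. A minimal sketch:
import torch

def report_vram(tag):
    # Allocated vs. reserved memory on the current CUDA device, in GiB
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# e.g. call report_vram("after model load") once, then report_vram("in loop") each iteration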
# Fix PyTorch & torchvision CUDA mismatch
!pip uninstall -y torch torchvision torchaudio
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install required libraries
!pip install transformers accelerate peft safetensors
!pip install openai qwen-vl-utils  # qwen-vl-utils is the image-handling helper used with Qwen2-VL
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
# Model name
model_name = "Qwen/Qwen2-VL-7B"
# Load processor (for handling both text and images)
processor = AutoProcessor.from_pretrained(model_name)
# Load model (correct model type for VL tasks)
model = AutoModelForVision2Seq.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
# Move to GPU
model.to("cuda")
This model-loading part runs fine on my GPU, using around 15GB or less. However, when I provide an image for processing, I encounter a CUDA out-of-memory error.
def generate_text(prompt, image, max_new_tokens=1000):
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output, skip_special_tokens=True)[0]
from google.colab import files
from PIL import Image
# Upload image
uploaded = files.upload()
image_path = list(uploaded.keys())[0]
# Open & resize image
image = Image.open(image_path)#.resize((512, 512)) # Reduce resolution
prompt = "describe and give me full reading from this picture!"
output_text = generate_text(prompt, image)
It seems that the error was probably just the result of forgetting to apply the Chat Template. The pipeline will handle all of that for you, but in many cases it is more memory efficient to do it manually.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
# Model name
#model_name = "Qwen/Qwen2-VL-7B"
model_name = "Qwen/Qwen2-VL-2B-Instruct"
# Load processor (for handling both text and images)
processor = AutoProcessor.from_pretrained(model_name)
# Load model (correct model type for VL tasks)
model = AutoModelForVision2Seq.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
# Move to GPU
model#.to("cuda") # If you do this, there is no point in having device_map="auto", so delete one of them.
def generate_text(prompt, image, max_new_tokens=1000):
    import gc
    inputs = processor(images=[image], text=[prompt], return_tensors="pt").to("cuda")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Clear GPU cache
    inputs.to("cpu")
    del inputs
    gc.collect()
    torch.cuda.empty_cache()
    return processor.batch_decode(output, skip_special_tokens=True)[0]
#from google.colab import files
from PIL import Image
# Upload image
#uploaded = files.upload()
#image_path = list(uploaded.keys())[0]
# Open & resize image
#image = Image.open(image_path)#.resize((512, 512)) # Reduce resolution
prompt = "describe and give me full reading from this picture!"
import requests
from io import BytesIO
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
messages = [{"role": "user", "content": [{"type": "image", "image": url}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output_text = generate_text(text, image)
print(output_text)
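If the 7B variant still doesn't fit after these fixes, a further option not shown in the thread is 4-bit loading with bitsandbytes; a minimal sketch, assuming bitsandbytes is installed (some quality loss is possible, and this covers inference, not training):
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

model_name = "Qwen/Qwen2-VL-7B-Instruct"

# 4-bit NF4 quantization roughly quarters the weight memory compared to float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
For fine-tuning on limited memory, the same 4-bit config is the usual starting point for QLoRA with peft (which is already in the pip install list above).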