Using Hugging Face for computer vision (TensorFlow)?

Hi,

I would like to use Hugging Face to train some computer vision models. The issue is that I use TensorFlow, and I can only see PyTorch models on the Hub.

Is TensorFlow supported for computer vision? Is there a TF notebook I could use to train my own models?

Thanks!

We’re currently adding a TF implementation of the Vision Transformer: Add TFViTModel by ydshieh · Pull Request #13778 · huggingface/transformers · GitHub

This will also make it easier to add the other vision models (DeiT, BEiT), which are very similar to ViT.


Thanks @nielsr!! Super useful. I imagine there is a similar TensorFlow notebook for audio recognition? :pray: :pray: :pray:

Here is a cheat sheet organized by task type (CV / NLP) and scenario; each entry gives a preprocessing strategy and recommended model choices / tactics.

**CV – Basic Image Classification**

Preprocessing:
• Resize all images to a uniform size (e.g. 224×224).
• Normalize pixel values (ImageNet mean/std if using a pretrained backbone).
• Apply basic augmentations (random flip, random crop, color jitter) to reduce overfitting.
• If classes are imbalanced, consider oversampling or class weighting.

Recommended models / tactics:
• Transfer learning on a small-footprint backbone:
 – MobileNetV3 or EfficientNet-Lite (fast training within 5–10 min).
 – ResNet-18 or ResNet-34 with a custom FC head.
• If allowed, fine-tune a pretrained ResNet-50/ResNet-101 (ImageNet) for better accuracy.
• If compute is very tight, use a simple sklearn pipeline: extract HOG or color-histogram features → logistic regression or random forest.
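
For example, a minimal PyTorch/torchvision sketch of the transfer-learning recipe above (ResNet-18 with a new FC head); the `data/train` ImageFolder path and the hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Resize, augment, and normalize with ImageNet statistics (pretrained backbone).
train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Hypothetical ImageFolder layout: data/train/<class_name>/*.jpg
train_ds = datasets.ImageFolder("data/train", transform=train_tfms)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=2)

# Pretrained ResNet-18 with a new classification head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for epoch in range(3):  # a few epochs are usually enough for transfer learning
    model.train()
    for images, labels in train_dl:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```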

**CV – Multi-Label / Fine-Grained Classification**

Preprocessing:
• Same resizing/normalization as above.
• Stronger augmentations: random rotation, mixup, cutout to increase variance.
• Ensure label binarization (multi-hot encoding).

Recommended models / tactics:
• Lightweight CNN (e.g. MobileNetV3) with a multi-sigmoid head (one output per label).
• Alternatively, use a pretrained ResNet/ViT and replace the final layer with a multi-head classifier.
• If the number of labels is large, consider freezing most layers and training only the head for speed.
• If data is especially scarce, use CLIP embeddings (compute image embeddings with CLIP, then train a simple MLP for multi-label).
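
A minimal sketch of the multi-sigmoid head: swap the MobileNetV3 classifier for one logit per label and train with `BCEWithLogitsLoss` on multi-hot targets (NUM_LABELS and the dummy batch are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_LABELS = 10  # hypothetical number of labels

# Pretrained backbone with a multi-sigmoid head: one logit per label.
model = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.IMAGENET1K_V1)
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, NUM_LABELS)

# Optionally freeze the backbone and train only the head for speed.
for p in model.features.parameters():
    p.requires_grad = False

criterion = nn.BCEWithLogitsLoss()   # expects multi-hot float targets
optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)

images = torch.randn(4, 3, 224, 224)                     # dummy batch
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()   # multi-hot encoding

logits = model(images)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()

# At inference, apply a sigmoid and threshold each label independently.
preds = (torch.sigmoid(logits) > 0.5).int()
```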

**CV – Object Detection (YOLO, SSD, Faster R-CNN)**

Preprocessing:
• Resize images to the model’s recommended input size (e.g. 640×640 for YOLOv5).
• Normalize with the model’s mean/std.
• Augment with mosaic (for YOLO), random scale, random flip; also apply bounding-box scaling if needed.
• Ensure bounding boxes are clamped to the image after augmentation.

Recommended models / tactics:
• YOLOv5-Nano/Small or YOLOv8-Small (fast inference/training, <5 min per epoch).
• SSD MobileNetV2 (lightweight, but modest accuracy).
• If higher accuracy is needed and compute allows: Faster R-CNN with a ResNet-50 backbone (freeze early conv layers, train RPN + heads).
• Use pretrained COCO weights (if allowed), then fine-tune on the provided dataset.
• For very constrained tasks, use a corner-based (CornerNet) or anchor-free model if prepackaged.
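
A sketch of the COCO-pretrained fine-tuning route with torchvision’s Faster R-CNN (the YOLO route goes through the Ultralytics tooling instead); NUM_CLASSES is a placeholder and includes the background class:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 3  # hypothetical: 2 object classes + background

# Pretrained on COCO; replace the box predictor for the new class count.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.COCO_V1
)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Optionally freeze early backbone layers and train the RPN + heads first.
for p in model.backbone.body.layer1.parameters():
    p.requires_grad = False

# Training expects a list of images and a list of target dicts with
# "boxes" (xyxy, clamped to the image) and "labels" tensors.
```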

**CV – Semantic Segmentation (U-Net, DeepLab)**

Preprocessing:
• Resize/crop images to fit GPU memory (e.g. 512×512 or 256×256).
• Normalize using ImageNet stats.
• Apply strong augmentations: random flip, random crop, color augmentation, random rotation, elastic deformation (if medical).
• If classes are imbalanced, use class-balanced cropping or loss weighting.

Recommended models / tactics:
• U-Net (if the dataset is small, use fewer filters to reduce memory).
• DeepLabV3 with a MobileNetV3 backbone (light, good mIoU).
• SegFormer-B0/B1 (lightweight transformer segmentation).
• Use a pretrained (ImageNet) backbone and train only the decoder head for speed.
• If memory is tight, apply patch-based inference (tile the image, stitch predictions).
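
A sketch of the “pretrained backbone, train only the decoder” tactic with torchvision’s DeepLabV3 + MobileNetV3; NUM_CLASSES, the image size and the dummy batch are placeholders:

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import (
    deeplabv3_mobilenet_v3_large,
    DeepLabV3_MobileNet_V3_Large_Weights,
)

NUM_CLASSES = 5  # hypothetical number of segmentation classes

# Pretrained DeepLabV3 + MobileNetV3 backbone; swap the classifier for NUM_CLASSES.
weights = DeepLabV3_MobileNet_V3_Large_Weights.COCO_WITH_VOC_LABELS_V1
model = deeplabv3_mobilenet_v3_large(weights=weights)
model.classifier[-1] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)
model.aux_classifier = None  # drop the auxiliary head for simplicity

# Train only the decoder head first (backbone frozen) for speed.
for p in model.backbone.parameters():
    p.requires_grad = False

criterion = nn.CrossEntropyLoss()          # optionally pass class weights for imbalance
images = torch.randn(2, 3, 512, 512)       # dummy batch
masks = torch.randint(0, NUM_CLASSES, (2, 512, 512))

out = model(images)["out"]                 # (N, NUM_CLASSES, H, W)
loss = criterion(out, masks)
loss.backward()
```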

**CV – Instance Segmentation (Mask R-CNN)**

Preprocessing:
• Same as detection + segmentation: resize to ~800 px on the shorter side.
• Normalize and augment (random flip).
• For mask training, keep the aspect ratio or pad to square.

Recommended models / tactics:
• Mask R-CNN with ResNet-50 + FPN (pretrained COCO weights, then fine-tune).
• If speed is critical, use Detectron2’s Mask R-CNN with a MobileNet/ResNet-18 backbone.
• Freeze backbone layers and train the box/mask heads first; if time remains, unfreeze progressively.
• For very small datasets, extract features via a pretrained backbone and train a lightweight mask head separately (e.g. Linear + Conv layers).
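
A sketch of the standard Mask R-CNN fine-tuning setup (replace the box and mask predictors, freeze the backbone first); NUM_CLASSES is a placeholder:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 2  # hypothetical: 1 object class + background

# COCO-pretrained Mask R-CNN; replace both the box and the mask heads.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=torchvision.models.detection.MaskRCNN_ResNet50_FPN_Weights.COCO_V1
)

in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, NUM_CLASSES)

# Freeze the backbone initially; unfreeze progressively if time remains.
for p in model.backbone.parameters():
    p.requires_grad = False
```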

**CV – Image Segmentation (Panoptic / Panoptic Quality)**

Preprocessing:
• Preprocess similarly to semantic segmentation.
• Ensure instance and semantic labels are properly aligned.

Recommended models / tactics:
• Panoptic-FPN (Mask R-CNN + semantic head) or Detectron2 panoptic models (if available).
• If not, ensemble: run semantic segmentation (e.g. DeepLab) + instance segmentation (Mask R-CNN) and fuse the results in post-processing.

**CV – Transfer Learning for Classification**

Preprocessing:
• Resize/normalize as above.
• Minimal augmentations (flip, crop).
• Fine-tuning strategy: freeze early layers, train the new head; if time remains, unfreeze the final block.

Recommended models / tactics:
• ResNet-50/ResNet-101, MobileNetV3, EfficientNet-B0/B1 pretrained on ImageNet.
• Replace the final FC with a task-specific head (e.g. a smaller MLP + softmax).
• Use gradual unfreezing: train the head → unfreeze the top conv block → full fine-tune if necessary.
• Use learning-rate scheduling (one-cycle LR, cosine annealing) to converge in a few epochs.
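
A sketch of gradual unfreezing plus a one-cycle schedule on ResNet-50; the class count, loader length and learning rates are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # hypothetical

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Stage 1: freeze everything except the new head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)

# One-cycle schedule to converge in a few epochs (step it once per batch).
steps_per_epoch, epochs = 100, 3  # hypothetical loader length
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, steps_per_epoch=steps_per_epoch, epochs=epochs
)

# Stage 2 (if time remains): unfreeze the top conv block and continue
# with a fresh optimizer, using a lower LR for the unfrozen block.
for p in model.layer4.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(
    [{"params": model.fc.parameters(), "lr": 1e-3},
     {"params": model.layer4.parameters(), "lr": 1e-4}]
)
```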

**CV – Feature Extraction Using Pretrained Models**

Preprocessing:
• Resize/normalize images to the backbone’s input (e.g. 224×224).
• No heavy augmentation unless you want multiple views for pooling.

Recommended models / tactics:
• Use ResNet/BiT/ViT: remove the final head and extract penultimate features (e.g. 2048-dim for ResNet-50).
• For smaller features: use MobileNetV3 (1280-dim).
• For multimodal: use CLIP to extract image embeddings.
• Pool features over spatial dims (avg pool) or use global avg + max pool.
• Feed extracted features to simple classifiers (logistic regression, SVM, or a small MLP).
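
A sketch of penultimate-feature extraction with ResNet-50 (2048-dim) feeding a simple sklearn classifier; the batch and labels are dummies:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.linear_model import LogisticRegression

# ResNet-50 with the final FC removed: outputs 2048-dim pooled features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(batch):          # batch: (N, 3, 224, 224) tensor
    return backbone(batch).cpu().numpy()

# Dummy example: features for a random batch, then a simple classifier on top.
feats = extract_features(torch.randn(8, 3, 224, 224))
labels = [0, 1, 0, 1, 0, 1, 0, 1]     # hypothetical labels
clf = LogisticRegression(max_iter=1000).fit(feats, labels)
```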

**CV – Transfer Learning for Feature Extraction in Retrieval**

Preprocessing:
• Same resizing/normalization.
• Possibly apply a center crop to match the retrieval dataset distribution.
• If using CLIP, ensure text preprocessing matches (lowercase, tokenize).

Recommended models / tactics:
• CLIP (ViT-B/32 or RN50) to embed both image & text; compute cosine similarity for retrieval.
• Alternatives: ResNet + contrastive head (if fine-tuning is allowed).
• For pure image retrieval: use pretrained ResNet/ViT features + Faiss-style nearest-neighbor search (or sklearn NearestNeighbors).
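
A sketch of the nearest-neighbor retrieval step over precomputed embeddings with sklearn (Faiss is the drop-in replacement at scale); the embeddings here are random placeholders:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical precomputed image embeddings (e.g. 2048-dim ResNet features
# or 512-dim CLIP image embeddings), one row per gallery image.
gallery = np.random.rand(1000, 512).astype("float32")
query = np.random.rand(1, 512).astype("float32")

# Cosine-distance nearest-neighbor index.
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(gallery)
distances, indices = index.kneighbors(query)
print(indices[0])   # indices of the 5 most similar gallery images
```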

**CV – Data Augmentation Techniques (General CV)**

Preprocessing:
• Geometric: random flip (horizontal/vertical), random rotation (±15°), random scale (±10%), random crop/resize.
• Color: brightness/contrast jitter, hue/saturation shift.
• Advanced: cutout, mixup, mosaic (for detection).
• Normalize as per the model.

Recommended models / tactics:
• For classification: basic augmentations via torchvision.transforms or albumentations (if available).
• For detection: Mosaic & MixUp for YOLO, CutMix for classification.
• For segmentation: elastic deformation (via Albumentations) if medical.
• If doing self-supervised learning: strong augmentation pairs (random crop + color jitter + grayscale + Gaussian blur).
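
A typical torchvision augmentation pipeline combining the geometric, color and cutout-style transforms listed above (`RandomErasing` stands in for cutout):

```python
from torchvision import transforms

# Classification-style augmentation: geometric + color + cutout-like erasing.
train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop / scale
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),                      # cutout-style occlusion
])
```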

**CV – Self-Supervised Learning (SimCLR, MoCo, BYOL)**

Preprocessing:
• Create two augmented “views” of each image using strong augmentations: random crop, color jitter, flip, blur.
• Normalize both views identically.
• Resize (e.g. 224×224) for the model input.

Recommended models / tactics:
• SimCLR (ResNet-50 backbone) with a projection head (2-layer MLP).
• MoCo v2 (ResNet backbone + momentum encoder).
• BYOL (ResNet + online/target networks).
• Train for 100–200 epochs if time allows (likely not feasible in 6 h of full training; instead, use a smaller dataset or fewer epochs).
• After pretraining, freeze the encoder, attach a linear classification head, then fine-tune.
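
A sketch of the SimCLR-style setup: a two-view strong-augmentation wrapper plus a ResNet-50 encoder with a 2-layer projection head (the contrastive NT-Xent loss and the training loop are omitted):

```python
import torch.nn as nn
from torchvision import models, transforms

class TwoViews:
    """Return two independently augmented 'views' of the same image (SimCLR-style)."""
    def __init__(self, base_transform):
        self.t = base_transform
    def __call__(self, img):
        return self.t(img), self.t(img)

strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])
two_view_aug = TwoViews(strong_aug)

# Encoder + 2-layer MLP projection head, as in SimCLR.
encoder = models.resnet50(weights=None)
feat_dim = encoder.fc.in_features
encoder.fc = nn.Identity()
projector = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, 128))
# Train with a contrastive (NT-Xent) loss on the two projected views;
# afterwards freeze the encoder and attach a linear classification head.
```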

**CV – Vision Transformers (ViT)**

Preprocessing:
• Resize to a patch-friendly size (e.g. 224×224).
• Normalize with ImageNet mean/std.
• Minimal augmentation: random flip, random crop.

Recommended models / tactics:
• ViT-Base/16 or ViT-Small/16 pretrained on ImageNet; fine-tune for classification.
• Data-efficient Image Transformers (DeiT-Small) if you need faster training.
• For segmentation: use SegFormer or Swin Transformer variants.
• If compute is tight, use a hybrid: ResNet backbone with a transformer head (e.g. ResNet+ViT).
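
A sketch of fine-tuning ViT-Base/16 for classification with Transformers; the checkpoint id, class count and placeholder image are assumptions:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTForImageClassification

NUM_CLASSES = 5  # hypothetical
ckpt = "google/vit-base-patch16-224-in21k"

# Pretrained ViT-Base/16 with a fresh classification head for NUM_CLASSES.
processor = AutoImageProcessor.from_pretrained(ckpt)
model = ViTForImageClassification.from_pretrained(ckpt, num_labels=NUM_CLASSES)

# The processor handles resizing to 224x224 and the normalization.
image = Image.new("RGB", (640, 480))                 # placeholder image
inputs = processor(images=image, return_tensors="pt")
labels = torch.tensor([2])

outputs = model(**inputs, labels=labels)             # returns loss + logits
outputs.loss.backward()
```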

**CV – CLIP (Contrastive Multimodal Learning)**

Preprocessing:
• Resize/normalize images to CLIP’s input (224×224).
• For text: lowercase, tokenize with the CLIP tokenizer (byte-pair encoding).
• Minimal augmentation on images (center crop, resize).

Recommended models / tactics:
• OpenAI CLIP ViT-B/32 or RN50: generate image and text embeddings.
• For classification: take the CLIP image embedding and compute cosine similarity to class text embeddings (zero-shot or few-shot).
• For retrieval: match image↔text via cosine similarity.
• For few-shot tasks: prompt CLIP with class names to get classifier weights.
• If allowed, fine-tune CLIP’s projection layers on the small dataset (freeze the backbone).
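
A sketch of zero-shot classification with CLIP via Transformers: embed the image and one text prompt per class, then softmax the image–text similarities (the prompts and placeholder image are illustrative):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))   # placeholder image
class_names = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=class_names, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax gives zero-shot probs.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```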

**CV – GANs & Generative Models (DCGAN, StyleGAN, Stable Diffusion)**

Preprocessing:
• Scale images to the required resolution (e.g. 64×64 for DCGAN, 128×128 or 256×256 for StyleGAN).
• Normalize pixel values to [−1, 1].
• If training, apply random horizontal flip.
• For Stable Diffusion: preprocess via CLIP-like normalization (specific to the model).

Recommended models / tactics:
• DCGAN or WGAN-GP (for small image-generation tasks) if training from scratch.
• StyleGAN2-ADA if sample quality matters, using adaptive augmentations.
• Stable Diffusion v1/v2 for text-to-image (inference only): use a pretrained checkpoint and a text prompt to generate images.
• For conditional generation: use a cGAN (class-conditional) or Pix2Pix (paired data).
• Note: training large GANs is not feasible here; focus on inference or fine-tuning a small model.
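
A sketch of inference-only text-to-image with a pretrained Stable Diffusion checkpoint via diffusers (the checkpoint id is the commonly used v1.5 one; substitute whichever checkpoint you actually have access to):

```python
import torch
from diffusers import StableDiffusionPipeline

# Inference-only text-to-image with a pretrained checkpoint (training large GANs
# or diffusion models from scratch is out of scope here).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("a watercolor painting of a lighthouse at sunset").images[0]
image.save("lighthouse.png")
```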

**NLP – Word Embeddings (Word2Vec, GloVe, FastText)**

Preprocessing:
• Lowercase; remove punctuation, or keep it if the embeddings expect it.
• Tokenize by whitespace or using NLTK/spaCy.
• Build a vocabulary and filter rare words (min_count).
• For subword models (FastText), ensure splitting on character n-grams.

Recommended models / tactics:
• Gensim Word2Vec (CBOW or skip-gram), vector size 100–300, window 5–10.
• GloVe (if precomputed vectors are available, load via Gensim or manually).
• FastText for OOV handling (subwords) using Gensim’s FastText class.
• For downstream tasks: average word vectors or use TF-IDF + embeddings as features.
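
A sketch of training skip-gram Word2Vec and FastText with Gensim on a toy tokenized corpus (min_count=1 only because the corpus is tiny):

```python
from gensim.models import Word2Vec, FastText

# Toy corpus: list of tokenized (lowercased, punctuation-stripped) sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
    ["the", "dog", "chased", "the", "cat"],
]

# Skip-gram Word2Vec: vector_size 100-300, window 5-10, min_count filters rare words.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50)
print(w2v.wv.most_similar("cat", topn=3))

# FastText handles OOV words via character n-grams.
ft = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=50)
print(ft.wv["catz"][:5])   # works even for an unseen token
```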

**NLP – Text Classification (Supervised)**

Preprocessing:
• Clean text: lowercase, remove special characters if using simple models.
• Tokenize with NLTK/spaCy, or use a Hugging Face tokenizer directly if fine-tuning.
• For classical ML: compute TF-IDF vectors (unigrams + bigrams).
• For deep models: pad/truncate to a max length (e.g. 128 tokens).

Recommended models / tactics:
• Sklearn pipeline: TfidfVectorizer → LogisticRegression or XGBoost/CatBoost (quick and robust).
• fastText supervised (extremely fast, good baseline).
• BERT-Base or DistilBERT (fine-tuned via the Hugging Face Trainer for 2–3 epochs).
• If memory is tight, use TinyBERT or MobileBERT (smaller footprint).
• If multi-label, use a sigmoid head on BERT or multiple binary logistic regressions on TF-IDF features.
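
The TF-IDF → logistic-regression baseline as a two-step sklearn pipeline (the toy texts/labels are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data (hypothetical).
texts = ["great product, works well", "terrible, broke after a day",
         "absolutely love it", "waste of money"]
labels = [1, 0, 1, 0]

# TF-IDF (unigrams + bigrams) -> logistic regression: a fast, robust baseline.
clf = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["this is awful"]))   # -> [0]
```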

**NLP – Sequence Labeling (NER, POS Tagging)**

Preprocessing:
• Tokenize using spaCy or NLTK’s word_tokenize.
• Possibly lowercase and replace digits with a special token.
• Create BIO/BIOES labels aligned to tokens.
• Pad sequences to the same length and create attention masks (for transformers).

Recommended models / tactics:
• spaCy pretrained pipelines (if allowed) for NER/POS (fast inference).
• BiLSTM + CRF: embed tokens (pretrained or random embeddings), BiLSTM encoder, CRF decoder (if implementing from scratch).
• BERT for token classification (AutoModelForTokenClassification), fine-tuned on the labeled data (1–2 epochs likely suffice).
• Flair embeddings (stacked character + word embeddings) + a linear layer (if the library is available).
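
A sketch of off-the-shelf NER with the token-classification pipeline; the checkpoint named here is one public BERT NER model and stands in for whatever fine-tuned model you end up with:

```python
from transformers import pipeline

# Off-the-shelf BERT token-classification model fine-tuned for NER.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",   # merge word-piece tokens into entity spans
)
print(ner("Hugging Face was founded in New York by Clément Delangue."))
```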

**NLP – Transformers / Attention-Based Modeling (Seq2Seq, Translation)**

Preprocessing:
• Clean text minimally (preserve casing if using a cased model).
• Tokenize with the model’s AutoTokenizer (handles subwords).
• Pad/truncate to the model’s max length (512 for BERT, 1024 for BART; T5 uses relative position embeddings, with 512 a common default).
• For translation: build parallel corpus pairs; possibly filter very long pairs.

Recommended models / tactics:
• T5-Small, or mBART if multilingual, for translation or general seq2seq.
• MarianMT (many language pairs) via pipeline("translation") for quick inference.
• BERT (encoder) + Transformer decoder (if implementing a custom model).
• If only an understanding of the attention mechanism is needed, implement a minimal PyTorch self-attention module (small d_model, e.g. 128) to demonstrate the QKV operations.
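
A minimal single-head self-attention module in PyTorch (d_model = 128), as suggested above, to demonstrate the QKV operations:

```python
import math
import torch
import torch.nn as nn

class MinimalSelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention illustrating the QKV operations."""
    def __init__(self, d_model=128):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, x, mask=None):                    # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / self.scale   # (batch, seq, seq)
        if mask is not None:                            # e.g. padding or causal mask
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)
        return attn @ v                                 # weighted sum of the values

x = torch.randn(2, 10, 128)                             # batch of 2 sequences, length 10
out = MinimalSelfAttention()(x)
print(out.shape)                                        # torch.Size([2, 10, 128])
```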

**NLP – Pre-trained NLP Models (BERT, GPT)**

Preprocessing:
• Same as above: use the model’s tokenizer, pad/truncate consistently.
• Clean text per model (e.g. BERT-uncased → lowercase).
• Create the appropriate inputs: token IDs, attention masks (and segment IDs for BERT).

Recommended models / tactics:
• BERT-Base-Uncased (110M params) for classification/QA.
• DistilBERT (66M) or TinyBERT (15M) if memory/time is tight.
• GPT-2 small (124M) for generation/chat tasks; use pipeline("text-generation") for a quick setup.
• RoBERTa or ALBERT if accuracy is critical and memory allows.
• For quick inference tasks (e.g. mask filling), use pipeline("fill-mask", model="bert-base-uncased").
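
Quick-inference sketches with the pipelines mentioned above (fill-mask with BERT, text generation with GPT-2 small):

```python
from transformers import pipeline

# Masked-word prediction with BERT.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK].")[0]["token_str"])

# Quick text generation with GPT-2 small.
gen = pipeline("text-generation", model="gpt2")
print(gen("Once upon a time", max_new_tokens=20)[0]["generated_text"])
```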

**NLP – Question Answering (Extractive, SQuAD-style)**

Preprocessing:
• Clean the context minimally (preserve punctuation).
• Tokenize with the model’s tokenizer (e.g. AutoTokenizer).
• For long contexts (>512 tokens), apply sliding-window chunking with overlap.
• Map answer spans to token positions (if fine-tuning).

Recommended models / tactics:
• DistilBERT distilled on SQuAD (e.g. distilbert-base-cased-distilled-squad) or a BERT-Base model fine-tuned on SQuAD, via pipeline("question-answering"), for inference.
• If fine-tuning: AutoModelForQuestionAnswering + Trainer (1–2 epochs on the provided QA data).
• For faster inference, use DistilRoBERTa (lightweight) fine-tuned on SQuAD.
• If the domain is specialized, fine-tune on the provided context–question pairs, freezing most layers except the QA head.
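
A sketch of extractive QA inference with a SQuAD-distilled DistilBERT checkpoint; for contexts longer than the model’s max length the pipeline applies sliding-window chunking for you:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "The Hugging Face Hub hosts models, datasets and demos. "
    "Transformers provides APIs to download and fine-tune pretrained models."
)
result = qa(question="What does the Hub host?", context=context)
print(result["answer"], result["score"])
```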

**NLP – Question Answering (Generative / Open-Ended)**

Preprocessing:
• Preprocess similarly to extractive QA, but also prepare target sequences.
• Tokenize prompt + answer (for training) using the BART/T5 tokenizer.
• Add special tokens (e.g. BOS/EOS) as needed.

Recommended models / tactics:
• T5-Small or BART-Base for generative QA (use AutoModelForSeq2SeqLM).
• Use the Trainer to fine-tune on (question ⇒ answer) pairs for 1–2 epochs.
• For inference, use pipeline("text2text-generation", model="t5-small").
• If only inference is needed and the model is allowed, use a pretrained BART-large-CNN (summarization) to paraphrase context + question into answers.
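
A sketch of generative QA with t5-small via the text2text-generation pipeline, using the SQuAD-style "question: … context: …" prompt format; for reliable answers you would still fine-tune on your own question ⇒ answer pairs:

```python
from transformers import pipeline

# T5 treats QA as text-to-text: feed "question: ... context: ..." and generate the answer.
t2t = pipeline("text2text-generation", model="t5-small")
prompt = (
    "question: What does the Hub host? "
    "context: The Hugging Face Hub hosts models, datasets and demos."
)
print(t2t(prompt, max_new_tokens=32)[0]["generated_text"])
```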

**NLP – Text Generation / Summarization**

Preprocessing:
• Minimal cleaning (preserve punctuation).
• Tokenize with AutoTokenizer (T5/BART).
• For abstractive summarization, truncate long docs into chunks (512 tokens), generate per chunk, then merge.
• For extractive summarization, select the top-K sentences via TF-IDF or embeddings.

Recommended models / tactics:
• BART-Base/CNN or T5-Small (fine-tune on summarization data if allowed).
• PEGASUS (if available offline) for strong summarization.
• For a quick extractive baseline, use TextRank (via networkx + NLTK) or LexRank (create sentence vectors with TF-IDF, then score).
• For abstractive summarization without fine-tuning, use pipeline("summarization", model="facebook/bart-large-cnn") if allowed.
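
A sketch of a very small extractive baseline: score sentences by total TF-IDF weight and keep the top-k (the naive "."-based sentence split stands in for a proper sentence tokenizer):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, k=2):
    """Score sentences by TF-IDF weight and keep the top-k, in original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()   # total TF-IDF mass per sentence
    top = sorted(np.argsort(scores)[-k:])            # top-k, restored to document order
    return ". ".join(sentences[i] for i in top) + "."

doc = ("Transformers are widely used in NLP. They rely on self-attention. "
       "Attention lets each token weigh every other token. "
       "This makes long-range dependencies easier to model.")
print(extractive_summary(doc, k=2))
```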

**NLP – Semantic Similarity / Retrieval (Text)**

Preprocessing:
• Lowercase and remove punctuation (if using TF-IDF).
• Tokenize into words or subwords (if using embeddings).
• For embedding methods: compute sentence embeddings (average of word2vec vectors, or use the BERT [CLS] token).
• Index embeddings with cosine similarity.

Recommended models / tactics:
• Sentence-Transformers (all-MiniLM-L6-v2) for fast 384-dim embeddings; compute cosine similarity.
• BERT [CLS] embedding (take the [CLS] output, optionally pool).
• GloVe embeddings + average pooling, or FastText + character n-grams.
• For retrieval: use Faiss or sklearn NearestNeighbors (cosine).
• For IR tasks: combine BM25 (via the rank_bm25 package) with embedding reranking for improved accuracy.
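
A sketch of semantic search with Sentence-Transformers (all-MiniLM-L6-v2) and cosine similarity; the corpus and query are toy examples:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # fast 384-dim sentence embeddings

corpus = ["How do I reset my password?",
          "Where can I download the invoice?",
          "The app crashes when I upload a photo."]
query = "I forgot my login credentials"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]   # cosine similarity to each corpus entry
best = scores.argmax().item()
print(corpus[best], scores[best].item())
```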

**NLP – Chatbot / Conversational Agent**

Preprocessing:
• Minimal cleaning (preserve user text); possibly remove profanity.
• Tokenize with the GPT-2/BART tokenizer if generative; or use an intent-classification pipeline (TF-IDF or BERT tokenization).
• If retrieval-based: preprocess the knowledge base (split into sentences, compute embeddings).

Recommended models / tactics:
• GPT-2 small/medium for a generative chatbot (use pipeline("text-generation")).
• DialoGPT (Microsoft) for dialogue out of the box, if allowed.
• Retrieval-augmented: embed KB sentences with Sentence-Transformers; on each query, find the top-K, then feed them + the user prompt into a small GPT for the response.
• Intent + response: train a small BERT classifier on intent examples and map intents to canned responses.
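
A sketch of a single-turn generative reply with DialoGPT (the -small checkpoint is an assumption; for multi-turn chat, keep appending the history plus EOS tokens):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Single-turn example; for multi-turn chat, append the dialogue history + EOS tokens.
user_input = "Hi, can you recommend a good sci-fi book?"
input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")

reply_ids = model.generate(
    input_ids,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,   # DialoGPT has no pad token by default
    do_sample=True,
    top_p=0.9,
)
print(tokenizer.decode(reply_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```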

**NLP – Model Fine-Tuning (LoRA, Adapters)**

Preprocessing:
• Tokenize with AutoTokenizer.
• Clean text lightly, as for the base model.
• For LoRA: freeze the base model and insert LoRA modules into the linear layers (no additional data prep).
• For Adapters: insert adapter modules; keep data prep the same as for normal fine-tuning.

Recommended models / tactics:
• LoRA via Hugging Face’s peft library or a manual implementation: add low-rank matrices to each nn.Linear.
• Adapters via the Adapters library: model.add_adapter("task") → model.train_adapter("task").
• Use DistilBERT or T5-Small as the base to keep memory low.
• Train only the LoRA/Adapter params (everything else frozen) for 1–2 epochs.
• Use 8-bit or 4-bit quantization (e.g. QLoRA with bitsandbytes) if needed to reduce memory when fine-tuning a 7B model within 24 GB of VRAM.
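
A sketch of LoRA fine-tuning with the peft library on DistilBERT; the rank, alpha and target module names are typical choices for DistilBERT’s attention projections, not values from the post above:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# DistilBERT base + LoRA adapters: only the low-rank matrices are trained.
base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                 # rank of the low-rank update
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],   # DistilBERT attention projection layers
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()       # only a small fraction of the full model

# Train as usual (e.g. with Trainer) for 1-2 epochs; only LoRA params receive gradients.
```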
