| Domain | Task Type / Scenario | Preprocessing Strategy | Recommended Model Choice / Tactics |
|CV|Basic Image Classification|• Resize all images to a uniform size (e.g. 224×224).
• Normalize pixel values (mean/std of ImageNet if using pretrained).
• Apply basic augmentations (random flip, random crop, color jitter) to reduce overfitting.
• If classes are imbalanced, consider oversampling or class-weighting.|• Transfer-learning on a small-footprint backbone:
– MobileNetV3 or EfficientNet-Lite (fast training within 5–10 min).
– ResNet-18 or ResNet-34 with a custom FC “head.”
• If allowed, fine-tune a pretrained ResNet50/ResNet101 (on ImageNet) for better accuracy.
• If compute is very tight, use a simple sklearn pipeline: extract HOG or color-histogram features → logistic regression or random forest (see the sketch below).|
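A minimal sketch of the low-compute fallback above, assuming `images` is an array of RGB images already resized to a uniform size and `labels` is the matching array of class ids (the variable names are placeholders):

```python
# HOG features + logistic regression (scikit-image + scikit-learn).
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def extract_hog(img):
    # 9 orientations, 8x8 cells, 2x2 blocks on the grayscale image
    return hog(rgb2gray(img), orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

X = np.stack([extract_hog(img) for img in images])          # images: placeholder
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print("val accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```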
| — | — |
||Multi-Label / Fine-Grained Classification|• Same resizing/normalization as above.
• Stronger augmentations: random rotation, mixup, cutout to increase variance.
• Ensure label binarization (multi-hot encoding).|• Lightweight CNN (e.g. MobileNetV3) with a multi-sigmoid head (one output per label).
• Alternatively, use a pretrained ResNet/ViT and replace final layer with a multi-head classifier.
• If the number of labels is large, consider freezing most layers and training only head for speed.
• For extra data scarcity, use CLIP embeddings (compute image embeddings with CLIP, then train a simple MLP for multi-label).|
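A sketch of the multi-sigmoid head tactic, assuming a torchvision MobileNetV3 backbone and a `train_loader` that yields images with multi-hot targets; `num_labels` is a placeholder:

```python
import torch
import torch.nn as nn
from torchvision import models

num_labels = 20                                  # placeholder: dataset-specific
model = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
# Replace the final classifier layer with one output per label.
model.classifier[3] = nn.Linear(model.classifier[3].in_features, num_labels)

criterion = nn.BCEWithLogitsLoss()               # sigmoid applied inside the loss
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for images, targets in train_loader:             # targets: (batch, num_labels) multi-hot
    logits = model(images)
    loss = criterion(logits, targets.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference: threshold per-label probabilities.
# preds = (torch.sigmoid(model(images)) > 0.5).int()
```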
| — | — |
||Object Detection (YOLO, SSD, Faster R-CNN)|• Resize images to the model’s recommended input size (e.g., 640×640 for YOLOv5).
• Normalize with model’s mean/std.
• Augment with mosaic (for YOLO), random scale, random flip; also apply bounding-box scaling if needed.
• Ensure bounding boxes are clamped after augmentation.|• YOLOv5-Nano/Small or YOLOv8-Small (fast inference/training, <5 min per epoch).
• SSD MobileNetV2 (lightweight, but modest accuracy).
• If higher accuracy needed and compute allows: Faster R-CNN with ResNet-50 backbone (freeze early conv layers, train RPN + heads).
• Use pretrained COCO weights (if allowed) then fine-tune on provided dataset.
• For very constrained tasks, use a corner-based (CornerNet) or anchor-free model if prepackaged.|
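If the ultralytics package is available, fine-tuning a small YOLOv8 checkpoint takes only a few lines; the dataset YAML path, epoch count, and image path below are placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                        # nano checkpoint pretrained on COCO
model.train(data="data.yaml", epochs=20, imgsz=640, batch=16)   # data.yaml: placeholder

metrics = model.val()                             # mAP on the validation split
results = model.predict("sample.jpg", conf=0.25)  # boxes, classes, confidence scores
```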
| — | — |
||Semantic Segmentation (U-Net, DeepLab)|• Resize/crop images to fit GPU memory (e.g. 512×512 or 256×256).
• Normalize using ImageNet stats.
• Apply strong augmentations: random flip, random crop, color augment, random rotation, elastic deformation (if medical).
• If classes are imbalanced, use class-balanced cropping or loss weighting.|• U-Net (if dataset is small, fewer filters to reduce memory).
• DeepLabV3 with MobileNetV3 backbone (light, good mIoU).
• SegFormer-B0/B1 (light transformer seg).
• Use pretrained backbone (ImageNet) and train only decoder head for speed.
• If memory is tight, apply patch-based inference (tile the image, stitch predictions).|
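A sketch of the "pretrained backbone, retrain the head" tactic with torchvision's DeepLabV3 + MobileNetV3; `num_classes` and the training data are assumptions:

```python
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_mobilenet_v3_large

num_classes = 5                                   # placeholder: dataset-specific
model = deeplabv3_mobilenet_v3_large(weights="DEFAULT")

# The DeepLab head ends in a 1x1 conv producing per-class logits; swap it.
model.classifier[4] = nn.Conv2d(model.classifier[4].in_channels, num_classes, kernel_size=1)
if getattr(model, "aux_classifier", None) is not None:
    in_ch = model.aux_classifier[4].in_channels
    model.aux_classifier[4] = nn.Conv2d(in_ch, num_classes, kernel_size=1)

# Training: logits = model(images)["out"]  -> (batch, num_classes, H, W),
# compared against (H, W) class-index masks with nn.CrossEntropyLoss().
```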
| — | — |
||Instance Segmentation (Mask R-CNN)|• Same as detection + segmentation: resize to ~800 px on shorter side.
• Normalize and augment (random flip).
• For mask training, keep aspect ratio or pad to square.|• Mask R-CNN with ResNet-50 + FPN (pretrained COCO weights, then fine-tune).
• If speed is critical, use Detectron2’s Mask R-CNN with a MobileNet/ResNet-18 backbone.
• Freeze backbone layers and train box/mask heads first; if time remains, unfreeze progressively.
• For very small datasets, extract features via pretrained backbone and train a lightweight mask head separately (e.g., Linear + Conv layers).|
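The standard torchvision recipe for adapting Mask R-CNN to a new label set is to swap the box and mask predictors; `num_classes` counts the background class and is a placeholder:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 3                                   # placeholder: background + 2 classes
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box classification head.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask prediction head.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)

# Optionally freeze the backbone first to save time; unfreeze later if time remains.
for p in model.backbone.parameters():
    p.requires_grad = False
```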
| — | — |
||Image Segmentation (Panoptic / Panoptic Quality)|• Preprocess similarly to semantic segmentation.
• Ensure instance and semantic labels are properly aligned.|• Panoptic-FPN (Mask R-CNN + semantic head) or Detectron2 panoptic models (if available).
• If not, ensemble: run semantic segmentation (e.g., DeepLab) + instance segmentation (Mask R-CNN) and fuse results (post-processing).|
| — | — |
||Transfer Learning for Classification|• Resize/normalize as above.
• Minimal augmentations (flip, crop).
• Fine-tuning strategy: freeze early layers, train new head; if time remains, unfreeze final block.|• ResNet-50/ResNet-101, MobileNetV3, EfficientNet-B0/B1 from ImageNet.
• Replace final FC with task-specific head (e.g., smaller MLP + softmax).
• Use Gradual Unfreezing: train head → unfreeze top conv block → full fine-tune if necessary.
• Use Learning Rate Scheduling (one-cycle lr, cosine annealing) to converge in few epochs.|
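A sketch of gradual unfreezing combined with a one-cycle schedule on ResNet-50; `train_loader` and `num_classes` are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                  # placeholder: dataset-specific
model = models.resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Phase 1: freeze everything except the new head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

epochs = 3
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, steps_per_epoch=len(train_loader), epochs=epochs)
criterion = nn.CrossEntropyLoss()

model.train()
for _ in range(epochs):
    for images, labels in train_loader:
        loss = criterion(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

# Phase 2 (if time remains): unfreeze the last block and repeat with a smaller max_lr.
for p in model.layer4.parameters():
    p.requires_grad = True
```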
| — | — |
||Feature Extraction Using Pretrained Models|• Resize/normalize images to backbone’s input (e.g. 224×224).
• No heavy augment unless you want multiple views for pooling.|• Use ResNet/BiT/ViT: remove final head, extract penultimate features (e.g., 2048-dim for ResNet-50).
• For smaller features: use MobileNetV3 (1280-dim).
• For multimodal: use CLIP to extract image embeddings.
• Pool features over spatial dims (avg pool) or use global avg + max pool.
• Feed extracted features to simple classifiers (logistic regression, SVM, or small MLP).|
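A sketch of the feature-extraction route: replace the ResNet head with `nn.Identity()`, cache the 2048-dim pooled features, and fit a scikit-learn classifier (the `loader` of already-normalized images is an assumption):

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression

backbone = models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = nn.Identity()                       # keep the global-average-pooled features
backbone.eval()

feats, labels = [], []
with torch.no_grad():
    for images, y in loader:                      # loader: placeholder DataLoader
        feats.append(backbone(images))            # (batch, 2048)
        labels.append(y)

X = torch.cat(feats).numpy()
y = torch.cat(labels).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, y)
```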
| — | — |
||Transfer Learning for Feature Extraction in Retrieval|• Same resizing/normalization.
• Possibly apply center crop to match retrieval dataset distribution.
• If using CLIP, ensure text preprocessing matches (lowercase, tokenize).|• CLIP (ViT-B/32 or RN50) to embed both image & text; compute cosine similarity for retrieval.
• Alternatives: ResNet + contrastive head (if fine-tuning is allowed).
• For pure image retrieval: use pretrained ResNet/ViT features + Faiss-like nearest neighbor search (or sklearn NearestNeighbors).|
| — | — |
||Data Augmentation Techniques (General CV)|• Geometric: random flip (horizontal/vertical), random rotation (±15°), random scale (±10%), random crop/resize.
• Color: brightness/contrast jitter, hue/saturation shift.
• Advanced: cutout, mixup, mosaic (for detection).
• Normalize as per model.|• For classification: basic augmentations via torchvision.transforms or albumentations (if available); see the transform sketch below.
• For detection: Mosaic & MixUp for YOLO, CutMix for classification.
• For segmentation: elastic deformation (via Albumentations) if medical.
• If doing self-supervised: strong augment pairs (random crop + color jitter + grayscale + Gaussian blur).|
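A typical torchvision transform pair covering the geometric/color augmentations listed above (the ±15° rotation, jitter strengths, and ImageNet statistics follow the bullets and should be tuned per dataset):

```python
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),     # random crop/resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),                           # ±15°
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```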
| — | — |
||Self-Supervised Learning (SimCLR, MoCo, BYOL)|• Create two augmented “views” of each image using strong augmentations: random crop, color jitter, flip, blur.
• Normalize identically.
• Potential resizing (224×224) for model input.|• SimCLR (ResNet-50 backbone) with projection head (2-layer MLP).
• MoCo v2 (ResNet backbone + momentum encoder).
• BYOL (ResNet + online/target networks).
• Train for 100–200 epochs if time allows (a full run is likely not feasible within ~6 h; use a smaller dataset or fewer epochs instead).
• After pretraining, freeze encoder and attach linear classification head, then fine-tune.|
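A minimal sketch of the SimCLR ingredients: a strong augmentation pipeline, a two-view wrapper, and the NT-Xent contrastive loss (temperature and augmentation strengths are common defaults, not tuned values):

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

class TwoViews:
    """Return two independently augmented views of the same image."""
    def __init__(self, tf):
        self.tf = tf
    def __call__(self, img):
        return self.tf(img), self.tf(img)

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (batch, dim) projections of the two views of the same images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / temperature                            # pairwise cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    # The positive for sample i is its other view at index i + n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```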
| — | — |
||Vision Transformers (ViT)|• Resize to patch-friendly size (e.g. 224×224).
• Normalize with ImageNet mean/std.
• Minimal augment: random flip, random crop.|• ViT-Base/16 or ViT-Small/16 pretrained on ImageNet; fine-tune for classification.
• Data-efficient Image Transformers (DeiT-Small) if faster.
• For segmentation: use SegFormer or Swin Transformer variants.
• If compute is tight, use a hybrid: ResNet backbone with transformer head (e.g., ResNet+ViT).|
| — | — |
||CLIP (Contrastive Multimodal Learning)|• Resize/normalize images to CLIP’s input (224×224).
• For text: lowercase, tokenize with CLIP tokenizer (Byte-Pair Encoding).
• Minimal augment on images (center crop, resize).|• OpenAI/CLIP ViT-B/32 or RN50: generate image and text embeddings.
• For classification: take CLIP image embedding and compute cosine similarity to class text embeddings (zero-shot or few-shot).
• For retrieval: match image↔text via cosine.
• For few-shot tasks: prompt CLIP with class names to get weights.
• If allowed, fine-tune CLIP’s projection layers on small dataset (freeze backbone).|
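A zero-shot classification sketch with the Hugging Face CLIP classes; the checkpoint id, class names, prompt template, and image path are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "bird"]                             # placeholder class names
prompts = [f"a photo of a {c}" for c in classes]
image = Image.open("sample.jpg")                             # placeholder image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)                 # scaled cosine similarities
print(classes[probs.argmax().item()])
```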
| — | — |
||GANs & Generative Models (DCGAN, StyleGAN, Stable Diffusion)|• Scale images to required resolution (e.g. 64×64 for DCGAN, 128×128 or 256×256 for StyleGAN).
• Normalize pixel values to [−1,1].
• If training, apply random horizontal flip.
• For Stable Diffusion: scale images to [−1, 1] for the VAE and tokenize prompts with the CLIP tokenizer (preprocessing is checkpoint-specific).|• DCGAN or WGAN-GP (for small image generation tasks) if training from scratch.
• StyleGAN2-ADA if sample quality matters, using adaptive augmentations.
• Stable Diffusion v1/v2 for text-to-image (if inference only): use a pretrained checkpoint and a text prompt to generate images.
• For conditional generation: use cGAN (class-conditional) or Pix2Pix (paired data).
• Note: training large GANs not feasible; focus on inference or fine-tuning a small model.|
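An inference-only text-to-image sketch with the diffusers library, assuming a Stable Diffusion checkpoint is available (locally or cached) and a CUDA GPU is present:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",             # placeholder checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dusk",
             num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("generated.png")
```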
| — | — |
|NLP|Word Embeddings (Word2Vec, GloVe, FastText)|• Lowercase; remove punctuation (or keep it if the pretrained embeddings expect it).
• Tokenize by whitespace or using NLTK/spaCy.
• Build vocabulary, filter rare words (min_count).
• For subword (FastText), ensure splitting on character n-grams.|• Gensim Word2Vec (CBOW or skip-gram), vector size 100–300, window 5–10.
• GloVe (if precomputed available, load via Gensim or manual).
• FastText for OOV handling (subword) using Gensim’s FastText class.
• For downstream tasks: average word vectors or use TF-IDF + embeddings as features.|
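A gensim skip-gram sketch on a toy tokenized corpus; vector size, window, and epochs follow the ranges above and should be tuned on the real data:

```python
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],         # toy token lists
          ["dogs", "chase", "cats"]]

model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=1, workers=4, epochs=10)    # sg=1 -> skip-gram

vec = model.wv["cat"]                                        # 100-dim vector
print(model.wv.most_similar("cat", topn=3))
```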
| — | — |
||Text Classification (supervised)|• Clean text: lowercase, remove special chars if using simple models.
• Tokenize with NLTK/spaCy or directly use Hugging Face Tokenizer if fine-tuning.
• For classical ML: compute TF-IDF vectors (unigrams + bigrams).
• For deep models: pad/truncate to max length (e.g. 128 tokens).|• Sklearn pipeline: TfidfVectorizer → LogisticRegression or XGBoost/CatBoost (quick and robust).
• fastText supervised (extremely fast, good baseline).
• BERT-Base or DistilBERT (fine-tuned via Hugging Face Trainer for 2–3 epochs).
• If memory is tight, use TinyBERT or MobileBERT (smaller footprint).
• If multi-label, use a sigmoid head on BERT or multiple binary logistic regressions on TF-IDF features.|
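The TF-IDF → logistic-regression baseline as a single scikit-learn pipeline, assuming parallel `texts` and `labels` lists:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, lowercase=True)),
    ("lr", LogisticRegression(max_iter=1000, C=1.0)),
])

# texts, labels: placeholder parallel lists of raw strings and class ids
print(cross_val_score(clf, texts, labels, cv=5, scoring="f1_macro").mean())
clf.fit(texts, labels)
preds = clf.predict(["an unseen example to classify"])
```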
| — | — |
||Sequence Labeling (NER, POS tagging)|• Tokenize using spaCy or NLTK word_tokenize.
• Possibly lowercase and replace digits with a special token.
• Create BIO/BIOES labels aligned to tokens.
• Pad sequences to same length and create attention masks (for transformer).|• spaCy pretrained pipelines (if allowed) for NER/POS (fast inference).
• BiLSTM + CRF: embed tokens (via pretrained embedding or random), BiLSTM encoder, CRF decoder (if implementing from scratch).
• BERT for token classification (AutoModelForTokenClassification), fine-tune on labeled data (likely 1–2 epochs suffice).
• Flair embeddings (stacked character + word embeddings) + linear layer (if library available).|
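A sketch of aligning word-level BIO labels to sub-word tokens for BERT-style token classification; the checkpoint id and the toy label ids are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")   # placeholder checkpoint

def encode_example(words, word_labels, max_length=128):
    enc = tokenizer(words, is_split_into_words=True,
                    truncation=True, max_length=max_length)
    labels, prev = [], None
    for wid in enc.word_ids():
        if wid is None:
            labels.append(-100)                  # special tokens: ignored by the loss
        elif wid != prev:
            labels.append(word_labels[wid])      # first sub-token keeps the word's label
        else:
            labels.append(-100)                  # continuation sub-tokens are ignored
        prev = wid
    enc["labels"] = labels
    return enc

example = encode_example(["John", "lives", "in", "Berlin"], [1, 0, 0, 3])  # toy BIO ids
```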
| — | — |
||Transformers / Attention-Based Modeling (Seq2Seq, Translation)|• Clean text minimally (preserve casing if using cased model).
• Tokenize with model’s AutoTokenizer (handles subwords).
• Pad/truncate to the model's max length (512 for BERT, 1024 for BART; T5 uses relative position embeddings and is typically trained at 512).
• For translation: build parallel sentence pairs; optionally filter out very long pairs.|• T5-Small (or mT5-Small/mBART if multilingual) for translation or general seq2seq.
• MarianMT (many language pairs) via pipeline("translation") for quick inference.
• BERT (encoder) + Transformer decoder (if implementing custom).
• If only attention‐mechanism understanding is needed, implement a minimal PyTorch self-attention module (small d_model e.g. 128) to demonstrate QKV operations.|
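As the last bullet above suggests, a minimal single-head scaled dot-product self-attention module (d_model = 128) that makes the Q/K/V operations explicit:

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))   # (batch, L, L)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)            # attention weights over the sequence
        return attn @ v                          # weighted sum of the values

out = SelfAttention()(torch.randn(2, 10, 128))   # -> (2, 10, 128)
```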
| — | — |
||Pre-trained NLP Models (BERT, GPT)|• Same as above: use model’s tokenizer, pad/truncate consistently.
• Clean text per model (e.g. BERT-uncased → lowercase).
• Create appropriate inputs: token IDs, attention masks, (segment IDs if BERT).|• BERT-Base/Uncased (110M) for classification/QA.
• DistilBERT (66M) or TinyBERT (15M) if memory/time is tight.
• GPT-2 small (124M) for generation/chat tasks; use pipeline("text-generation") for quick setup.
• RoBERTa or ALBERT if accuracy is critical and memory allows.
• For quick inference tasks (e.g. mask fill), use pipeline("fill-mask", model="bert-base-uncased").|
| — | — |
||Question Answering (Extractive, SQuAD-style)|• Clean context minimally (preserve punctuation).
• Tokenize with model’s tokenizer (e.g. AutoTokenizer).
• For long contexts (>512 tokens), apply sliding window chunking with overlap.
• Map answer spans to token positions (if fine-tuning).|• DistilBERT distilled on SQuAD (e.g. distilbert-base-cased-distilled-squad) or BERT-Base-Uncased fine-tuned on SQuAD, via pipeline("question-answering") for inference (see the sketch below).
• If fine-tuning: AutoModelForQuestionAnswering + Trainer (1–2 epochs on provided QA data).
• For faster inference, use DistilRoBERTa (lightweight) fine-tuned on SQuAD.
• If domain is specialized, fine-tune on provided context–question pairs, freeze most layers except QA head.|
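The pipeline-based inference route mentioned above, using the distilled-SQuAD checkpoint (question and context are toy examples):

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(question="Where was the treaty signed?",
            context="The treaty was signed in Paris in 1898 after months of negotiation.")
print(result["answer"], result["score"])
```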
| — | — |
||Question Answering (Generative / Open-Ended)|• Preprocess similarly to the extractive setup, but also prepare target sequences.
• Tokenize prompt + answer (for training) using the BART/T5 tokenizer.
• Add the tokenizer's special tokens (e.g. BOS/EOS) as needed.|
• Use Trainer to fine-tune on (question ⇒ answer) pairs for 1–2 epochs.
• For inference, use pipeline("text2text-generation", model="t5-small").
• If only inference needed and model allowed, use a pretrained BART-large-CNN (summarization) to paraphrase context + question into answers.|
| — | — |
||Text Generation / Summarization|• Minimal cleaning (preserve punctuation).
• Tokenize with AutoTokenizer (T5/BART).
• For abstractive summarization, split long docs into chunks (~512 tokens each), generate a summary per chunk, then merge (see the chunking sketch below).
• For extractive, select top-K sentences via TF-IDF or embeddings.|• BART-Base/CNN or T5-Small (fine-tune on summarization data if allowed).
• PEGASUS (if available offline) for strong summarization.
• For a quick extractive baseline, use TextRank (via networkx + NLTK) or LexRank (create sentence vectors with TF-IDF, score).
• For abstractive without fine-tuning, use pipeline("summarization", model="facebook/bart-large-cnn") if allowed.|
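A sketch of the chunk-summarize-merge idea referenced above; the word budget and the DistilBART checkpoint are assumptions, and NLTK's punkt sentence tokenizer data must be downloaded once:

```python
from nltk.tokenize import sent_tokenize          # requires nltk's punkt data
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")  # placeholder model

def summarize_long(text, max_words=350):
    chunks, current = [], []
    for sent in sent_tokenize(text):
        if current and sum(len(s.split()) for s in current) + len(sent.split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    partial = [summarizer(c, max_length=120, min_length=30, do_sample=False)[0]["summary_text"]
               for c in chunks]
    return " ".join(partial)
```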
| — | — |
||Semantic Similarity / Retrieval (Text)|• Lowercase, remove punctuation (if using TF-IDF).
• Tokenize into words or subwords (if using embeddings).
• For embedding methods: compute sentence embeddings (avg of word2vec, or use BERT CLS‐token).
• Index embeddings with cosine similarity.|• Sentence-Transformers (all-MiniLM-L6-v2) for fast 384-dim embeddings; compute cosine similarity.
• BERT CLS embedding (take [CLS] output, optionally pool).
• GloVe embeddings + average pooling or FastText + char n-gram.
• For retrieval: use Faiss or sklearn NearestNeighbors (cosine).
• For IR tasks: combine BM25 (via rank_bm25 package) with embedding reranking for improved accuracy.|
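A Sentence-Transformers similarity/retrieval sketch with all-MiniLM-L6-v2, plus an optional scikit-learn nearest-neighbour index for larger corpora (the toy corpus and queries are illustrative):

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.neighbors import NearestNeighbors

model = SentenceTransformer("all-MiniLM-L6-v2")              # 384-dim embeddings

corpus = ["How do I reset my password?",
          "Shipping usually takes 3-5 business days.",
          "You can cancel your order within 24 hours."]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("when will my package arrive", convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]              # cosine similarities
print(corpus[int(scores.argmax())])

# For larger corpora, index numpy embeddings with a cosine nearest-neighbour search.
nn_index = NearestNeighbors(metric="cosine").fit(model.encode(corpus))
dist, idx = nn_index.kneighbors(model.encode(["package arrival time"]), n_neighbors=2)
```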
| — | — |
||Chatbot / Conversational Agent|• Minimal cleaning (preserve user text), possibly remove profanity.
• Tokenize with GPT-2/BART tokenizer if generative; or intent classification pipeline (TF-IDF or BERT tokenization).
• If retrieval-based: preprocess knowledge base (split into sentences, compute embeddings).|• GPT-2 small/medium for generative chatbot (use pipeline("text-generation")).
• DialoGPT (Microsoft) for dialogue out-of-the-box if allowed.
• Retrieval-augmented: embed KB sentences with Sentence-Transformers, on query find top-K, then feed them + user prompt into a small GPT for response.
• Intent + Response: train a small BERT classifier on intent examples, map to canned responses.|
| — | — |
||Model Fine-Tuning (LoRA, Adapters)|• Tokenize with AutoTokenizer.
• Clean text lightly as per base model.
• For LoRA: freeze base model, insert LoRA modules into linear layers (no additional data prep).
• For Adapters: insert adapter modules; keep data prep same as normal fine-tuning.|• LoRA via Hugging Face’s peft library or manual implementation: add low-rank matrices to each nn.Linear.
• Adapters via HF Adapters: model.add_adapter("task") → model.train_adapter("task").
• Use DistilBERT or T5-Small as base to keep memory low.
• Train only LoRA/Adapter params (others frozen) for 1–2 epochs.
• Use 4-bit quantization (QLoRA, via bitsandbytes) if needed to reduce memory when fine-tuning a ~7B model within 24 GB VRAM (see the peft sketch below).|
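A peft-based LoRA sketch on DistilBERT, as referenced above; the target module names (q_lin/v_lin) are DistilBERT's attention projections and must be adapted for other architectures:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)                 # placeholder base model/labels

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_lin", "v_lin"],           # DistilBERT attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()               # typically well under 1% trainable
# Train with the usual Trainer / training loop; only the LoRA parameters receive gradients.
```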
| — | — |