Debugging inf/NaN Loss in Multi-Process Optuna/PyTorch Lightning HPO in Colab

Update: the initial issues are resolved; I'm currently working on perplexity comparisons.

Hello Hugging Face Community,

I’m the founder of a new AI startup, Exorobourii, and I’m developing a novel hybrid Transformer architecture called VSM-PSO-Attn. The core idea is to use Particle Swarm Optimization to optimize the attention mechanism’s QKV weights during training.

The project is built with PyTorch Lightning, Hydra, and uses a custom nn.Module for the PSO-Attention layer.

The Core Problem & Paradox

I’ve hit a roadblock specifically with hyperparameter optimization using Optuna in a Google Colab environment.

What Works: A full, multi-epoch training run in a single process (e.g., a standard trainer.fit() call) works perfectly. The model trains, the loss decreases, and there are no NaNs.

What Fails: When I try to run an HPO study with Optuna (study.optimize(…)), every single trial fails almost instantly. The progress bar shows trials completing in seconds, but they all return an inf or NaN loss.

This paradox—working in a single process but failing in a multi-process HPO context—is what I’m struggling to debug.

My Prime Hypothesis: DataLoader Workers in Child Processes

My leading theory is that this is a data loading issue specific to the multi-process environment.

Optuna spawns a new process for each trial.

Inside that trial, my LightningDataModule’s DataLoader (with num_workers > 0) tries to spawn its own worker processes.

I suspect this “double-fork” of processes is failing silently. The DataLoader workers might be unable to access the dataset file on Google Drive (an np.memmap object), or they might be returning corrupted/empty batches.

The model then receives this garbage data, leading to a NaN loss, and the trial fails.

Environment Details:

Python: 3.12

Platform: Google Colab (L4 GPU)

Key Libraries:

lightning: 2.2.5

torch: 2.3.0+cu121

optuna: 3.6.1

hydra-core: 1.3.2

My Question for the Community:

Does my hypothesis about DataLoader workers being the cause seem plausible?

Has anyone successfully run a multi-process Optuna HPO study with PyTorch Lightning in Colab, and if so, what is the most robust pattern for handling data loading?

Are there known issues with np.memmap when used with nested DataLoader workers in a spawn context?

Thank you so much for any insights or guidance you can provide. This is a critical step for my project, and I’d be incredibly grateful for your help.

Best,

J

Founder, Exorobourii


Colab-specific errors are common, so if that were the cause, it would be easy…


Short version up front:

  • Yes, your “DataLoader workers inside Optuna trials” hypothesis is not only plausible, it matches well-known failure patterns in Jupyter/Colab.
  • There are stable, well-tested patterns for Lightning + Optuna HPO in notebooks: they all avoid nested multiprocessing (serial trials, num_workers=0, simple Trainer setup).
  • np.memmap + DataLoader + multiple workers is fragile: each worker may create its own mapping of the file, blowing up memory and interacting badly with spawn/fork semantics.

Then I’ll go through your three questions and give you a concrete step-by-step plan for VSM-PSO-Attn.


1. Is the “DataLoader workers in child processes” hypothesis plausible?

1.1. What actually happens in your stack

You effectively have several layers that can involve multiprocessing or multithreading:

  1. Optuna

    • study.optimize(objective, n_trials=..., n_jobs=...)
    • When n_jobs > 1, Optuna runs multiple trials in parallel. Historically this has been done with threads; users sometimes wrap Optuna in their own multiprocessing, or deploy samplers on external executors. Either way, you’re adding outer concurrency around training.
  2. PyTorch Lightning Trainer

    • Lightning can spawn processes for distributed strategies (DDP, ddp_spawn, etc.).
    • Lightning docs (even older versions) explicitly warn that interactive environments (Jupyter/Colab) are not compatible with “real” DDP and recommend simpler setups.
  3. PyTorch DataLoader

    • num_workers > 0 creates worker processes that run your Dataset.__getitem__ in parallel.
    • These workers are separate OS processes, not threads.

In your case, a single HPO trial might look like:

Jupyter / Colab kernel
→ Optuna worker (thread or process, depending on setup)
→ Lightning Trainer (possibly with its own spawning strategy)
→ DataLoader workers (each worker is a process)

That is exactly the sort of nested multiprocessing that is brittle in notebooks.
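A quick way to confirm which layer a given piece of code is actually running in is a small logging helper that you can call from the Optuna objective, from your DataModule's setup(), and from Dataset.__getitem__. This is only a diagnostic sketch; log_process_context and its tag argument are illustrative names, not part of any library:

import multiprocessing as mp
from torch.utils.data import get_worker_info

def log_process_context(tag):
    # Prints which process we are in and whether we are inside a DataLoader worker.
    info = get_worker_info()  # None unless called from a DataLoader worker
    print(
        f"[{tag}] process={mp.current_process().name} "
        f"start_method={mp.get_start_method(allow_none=True)} "
        f"dataloader_worker={info.id if info is not None else None}"
    )

If a trial never logs anything from __getitem__, that alone tells you the workers never delivered a batch.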

1.2. What we know about DataLoader in notebooks

There is a long history of “DataLoader + num_workers>0 + Jupyter/Colab” misbehaving:

  • A StackOverflow question shows Jupyter notebooks freezing when num_workers > 0; top answer: “Jupyter notebooks have known issues with multiprocessing; set num_workers=0 or move to a .py script.”
  • Multiple PyTorch issues and forum threads report that DataLoader freezes or fails when using num_workers>0 in multi-thread or multi-process contexts; everything works with num_workers=0.

Your symptom is slightly different (NaN/inf loss instead of a visible hang), but the pattern is similar:

  • Single process (trainer.fit() alone) works perfectly.
  • When the code is invoked from inside an Optuna trial, trials “finish” instantly with bad metrics.

A plausible mechanism:

  1. In the Optuna context, DataLoader workers fail to start or fail to read properly from the memmap file (or they crash silently).
  2. The DataLoader yields empty or corrupt batches (e.g., tensors with all zeros, or even wrong shapes that trigger later NaNs).
  3. Your LightningModule computes a loss on this garbage and the objective returns NaN or inf, so Optuna marks the trial as failed.

Because Optuna treats a NaN objective value as a failed trial (and an inf value simply as a very poor result) and moves on to the next trial, you see “lots of trials completing in seconds with NaN/inf” rather than a crash. Optuna’s docs are explicit that NaN is treated as a failed trial, not as a fatal error.
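You can verify this directly from the study object after a run; a short inspection snippet (assuming your study object is named study):

import optuna

# List every trial with its state and objective value.
print(study.trials_dataframe(attrs=("number", "state", "value")))

failed = [t for t in study.trials if t.state == optuna.trial.TrialState.FAIL]
print(f"{len(failed)} of {len(study.trials)} trials failed")

If nearly every trial shows up as FAIL (or with a non-finite value) after only a few seconds, the data path is the first place to look.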

So yes: your hypothesis that nested DataLoader workers in child processes are the culprit is not just plausible, it is very consistent with known interactions between:

  • notebooks/Colab,
  • PyTorch DataLoader with num_workers>0, and
  • outer parallelism (threads/processes).

2. Has anyone run Lightning + Optuna HPO in Colab, and what pattern is robust?

There are multiple working examples of PyTorch Lightning + Optuna HPO in notebook environments (Kaggle, Colab-style). They are not doing PSO over QKV, but structurally the Optuna integration is the same.

2.1. Common pattern in working examples

Typical examples:

  • MachineLearningMastery: “PyTorch Lightning Hyperparameter Optimization with Optuna” – shows how to define an objective(trial), create a LightningModule and Trainer inside it, then run Optuna. It’s written as a notebook-friendly tutorial.
  • Kaggle: “Complete Optuna Hparam Tuning – PyTorch Lightning” – a full notebook using Lightning + Optuna to tune a model, again with a standard objective function.
  • Japanese tech blogs (for example, on Zenn) combining Lightning + Optuna – these demonstrate training BERT and CNN models with Lightning and Optuna in notebook-like settings.

From these and Optuna’s own examples repo, certain patterns are consistent:

  1. Trials are serial (or minimally parallel) inside notebooks

    • study.optimize(..., n_jobs=1) in most notebook examples.
    • Parallel multi-process HPO is usually reserved for script/cluster environments, not Colab.
  2. DataLoaders are simple

    • Many examples just use num_workers=0 or small values.
    • They do not combine “outer” parallelism (e.g., multi-process trials) with “inner” DataLoader worker multiprocessing.
  3. New model + trainer per trial

    • objective(trial) constructs:

      • LightningModule with hyperparameters from trial.
      • DataModule/DataLoaders (with simple num_workers).
      • Trainer(accelerator="gpu" or "cpu", devices=1, max_epochs=...).
    • After training, they read a metric from trainer.callback_metrics and return it.

  4. No complicated distributed strategies in the objective

    • They avoid DDP / ddp_spawn inside the Optuna objective when running in notebooks.
    • Lightning’s own documentation strongly discourages DDP from interactive environments, noting that DataLoader with num_workers>0 can bottleneck or misbehave within DDP spawn.

This is why single trainer.fit() works for you: you are effectively in the “simple Lightning training” case. When you wrap it in Optuna and keep num_workers>0 and possibly multiple trials in parallel, you cross into a configuration that those examples explicitly avoid.

2.2. A robust pattern you can adopt for VSM-PSO-Attn in Colab

In Colab, the most robust pattern for Lightning + Optuna is:

  1. Run trials serially

    • Ensure study.optimize(objective, n_trials=..., n_jobs=1).
  2. Use single-process data loading inside trials

    • Your LightningDataModule should accept num_workers as an argument.
    • For HPO runs in Colab, pass num_workers=0 from the objective.
    • Each DataLoader inside the DataModule uses that num_workers.
  3. Use a minimal Trainer setup

    • In objective(trial):

      def objective(trial):
          # Sample hyperparameters
          lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
          batch_size = trial.suggest_int("batch_size", 16, 128)
      
          model = VSMPSOAttnModule(lr=lr, ...)
          datamodule = VSMPSODataModule(
              batch_size=batch_size,
              num_workers=0,          # critical in Colab
              memmap_path=...,        # your dataset
          )
      
          trainer = lightning.Trainer(
              accelerator="gpu" if torch.cuda.is_available() else "cpu",
              devices=1,
              max_epochs=trial.suggest_int("epochs", 2, 6),
              logger=False,
              enable_checkpointing=False,
          )
      
          trainer.fit(model, datamodule=datamodule)
      
          val_loss = trainer.callback_metrics["val_loss"].item()
      
          del trainer, model, datamodule
          if torch.cuda.is_available():
              torch.cuda.empty_cache()
      
          return val_loss
      
    • Note:

      • devices=1.
      • No explicit distributed strategy.
      • No nested spawns.
  4. Free resources between trials

    • Explicitly delete the Trainer, model, and datamodule.
    • Clear CUDA cache if needed.
    • Set persistent_workers=False in your DataLoaders while debugging, to avoid lingering worker processes between trials.

If you make those changes and trials suddenly stop failing instantly (they take as long as a “normal” training run and return finite losses), that is strong confirmation that your original configuration was hitting exactly the nested multiprocessing/DataLoader problem.
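To make points 2 and 4 above concrete, here is a minimal DataModule sketch that exposes num_workers and persistent_workers as constructor arguments, so the HPO path can force single-process loading. The class name echoes the one used earlier, but the constructor (taking pre-built train/val Dataset objects) is a simplification, not your actual module:

import lightning as L
from torch.utils.data import DataLoader

class VSMPSODataModule(L.LightningDataModule):
    def __init__(self, train_ds, val_ds, batch_size=32, num_workers=0, persistent_workers=False):
        super().__init__()
        self.train_ds = train_ds
        self.val_ds = val_ds
        self.batch_size = batch_size
        self.num_workers = num_workers
        # persistent_workers is only valid when num_workers > 0
        self.persistent_workers = persistent_workers and num_workers > 0

    def _loader(self, ds, shuffle):
        return DataLoader(
            ds,
            batch_size=self.batch_size,
            shuffle=shuffle,
            num_workers=self.num_workers,
            persistent_workers=self.persistent_workers,
        )

    def train_dataloader(self):
        return self._loader(self.train_ds, shuffle=True)

    def val_dataloader(self):
        return self._loader(self.val_ds, shuffle=False)

With this in place, switching between “debug HPO in Colab” (num_workers=0) and “full training in a script” (num_workers=2 or more) is a single argument.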


3. Are there known issues with np.memmap + nested DataLoader workers in a spawn context?

There are no official “this exact pattern is forbidden” warnings, but there are multiple closely related issues:

3.1. np.memmap with DataLoader workers

Several PyTorch forum threads and Q&A posts discuss np.memmap datasets with num_workers>0:

  • A user with a 32 GB memmap file found that each DataLoader worker effectively loaded its own copy, causing huge memory usage; suggestions included:

    • Avoid loading the entire memmap in __init__.
    • Open the memmap lazily inside each worker (__getitem__ or a worker-local initializer).
    • Limit the number of workers.
  • Another discussion shows that np.memmap can defeat the whole point of “using more workers,” because each worker separately maps the file and can thrash memory. They recommend chunked loading and careful Dataset design rather than storing a big memmap array as a Dataset attribute.

The high-level lesson:

  • With num_workers>0, each worker process potentially opens the memmap independently.
  • Each worker has its own address space; if your Dataset.__init__ pre-loads or wraps the entire memmap, you can accidentally multiply RAM usage by the number of workers.
  • If the memmap is on a slow or networked filesystem (e.g., Google Drive FUSE mount), concurrent access from multiple workers can be very slow or brittle.
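If you want to see this with your own eyes, a tiny self-contained experiment shows that each worker is a separate process holding its own copy of the dataset object (run it from a plain script if the notebook itself misbehaves; DummyDataset is purely illustrative):

import os
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info

class DummyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        info = get_worker_info()
        worker_id = info.id if info is not None else None
        # Each worker prints a different PID: separate address spaces, separate memmaps.
        print(f"__getitem__({idx}) pid={os.getpid()} worker={worker_id}")
        return torch.zeros(4)

loader = DataLoader(DummyDataset(), batch_size=2, num_workers=2)
for _ in loader:
    pass

Swap DummyDataset for your memmap-backed dataset and the same experiment tells you whether the mapping is opened once per worker, as expected, or not at all.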

3.2. DataLoader + multiprocessing issues (independent of memmap)

Even without memmap, there are well-documented problems when:

  • DataLoader uses num_workers>0, and
  • you run multiple DataLoaders in parallel threads/processes.

Examples:

  • A GitHub issue where multiple threads each use a DataLoader with num_workers>0 leads to freezing; using num_workers=0 fixes it.
  • Threads on the PyTorch forums discuss how DataLoader + num_workers>0 can hang or throw errors in many non-trivial multiprocessing scenarios, especially on Windows or in Jupyter notebooks.

Your configuration adds a memmap and a FUSE filesystem on top of this, so you are stacking three things that all stress multiprocessing:

  • Jupyter/Colab (non-standard process model).
  • DataLoader with worker processes.
  • memmap on a remote or pseudo filesystem.

3.3. Safer Dataset pattern for memmap + DataLoader

The pattern that is generally recommended for np.memmap + DataLoader is:

import numpy as np
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    def __init__(self, path, shape, dtype="float32"):
        # Store only metadata; the mapping itself is opened lazily per process.
        self.path = path
        self.shape = shape
        self.dtype = dtype
        self._mm = None      # lazy

    def _lazy_init(self):
        if self._mm is None:
            self._mm = np.memmap(
                self.path,
                mode="r",     # read-only for safety
                dtype=self.dtype,
                shape=self.shape,
            )

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, idx):
        self._lazy_init()
        x = self._mm[idx]
        # convert x to torch.Tensor, process, etc.
        return x

This design (explicitly recommended in memmap-related discussions) ensures:

  • The Dataset only stores metadata (path, shape, dtype).
  • Each worker creates its own memmap lazily when needed.
  • You do not try to pickle a huge memmap array across worker processes.

Combine this with num_workers=0 in Optuna trials in Colab, and you avoid:

  • nested multiproc + memmap,
  • and the majority of the known problem cases.
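Putting it together for a Colab HPO trial might look like this (the path and shape are placeholders for your actual memmap file):

from torch.utils.data import DataLoader

train_ds = MemmapDataset(
    path="/content/drive/MyDrive/data/train.bin",  # placeholder path
    shape=(100_000, 512),                          # placeholder (num_samples, features)
    dtype="float32",
)

# Single-process loading during Optuna trials in Colab.
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)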

Practical debugging steps specific to your VSM-PSO-Attn setup

Below is a concrete sequence you can follow in your notebook to both validate the hypothesis and converge to a stable HPO setup.

Step 1 – Make trials strictly single process

In your Colab:

  1. Ensure study.optimize(objective, n_trials=..., n_jobs=1).

  2. In the Optuna objective(trial):

    • Create the DataModule with num_workers=0 (and verify that every DataLoader in it uses that argument).

    • Create a Trainer with:

      • devices=1,
      • accelerator="gpu" or "cpu",
      • no DDP/ddp_spawn/strategy customizations.
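A minimal serial study setup for this step might look like the following (direction and trial count are just examples):

import optuna

study = optuna.create_study(direction="minimize")
# n_jobs=1 keeps trials strictly sequential inside the notebook kernel.
study.optimize(objective, n_trials=20, n_jobs=1)

print(study.best_trial.params)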

Outcome to look for:

  • If trials now run for realistic training time and produce finite losses (no more instant NaN/inf), then your nested multiprocessing hypothesis is confirmed.

Step 2 – Replace the dataset with a synthetic DataModule

Define a simple LightningDataModule for debugging:

import torch
import lightning as L
from torch.utils.data import DataLoader, TensorDataset

class SyntheticDM(L.LightningDataModule):
    def __init__(self, batch_size=32, seq_len=128, d_model=256, num_classes=10):
        super().__init__()
        # seq_len / d_model / num_classes should match what your model expects.
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.d_model = d_model
        self.num_classes = num_classes

    def setup(self, stage=None):
        self.train_data = torch.randn(1024, self.seq_len, self.d_model)
        self.train_targets = torch.randint(self.num_classes, (1024,))
        self.val_data = torch.randn(256, self.seq_len, self.d_model)
        self.val_targets = torch.randint(self.num_classes, (256,))

    def train_dataloader(self):
        ds = TensorDataset(self.train_data, self.train_targets)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=True, num_workers=0)

    def val_dataloader(self):
        ds = TensorDataset(self.val_data, self.val_targets)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=False, num_workers=0)

Run Optuna HPO using this synthetic module, keeping PSO-Attn active.

  • If synthetic data works but your memmap DataModule fails, the problem is definitely connected to memmap + DataLoader + nested multiproc.

  • If synthetic also fails in HPO (even with num_workers=0), then the culprit is more likely:

    • how the objective is written (return value, exception handling),
    • the PSO hyperparameter ranges (some trials explode numerically),
    • or interactions with Hydra (e.g., re-initializing configs per trial).
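If Hydra turns out to be involved, one notebook-friendly pattern is to build a fresh config per trial with Hydra's Compose API instead of relying on a single @hydra.main entry point. This is only a sketch; the config path, config name, and override key are placeholders for your own setup:

from hydra import compose, initialize
from hydra.core.global_hydra import GlobalHydra

def build_cfg_for_trial(trial):
    # Clear any Hydra state left over from a previous cell or trial.
    GlobalHydra.instance().clear()
    with initialize(config_path="conf", version_base=None):
        cfg = compose(
            config_name="config",
            overrides=[f"model.lr={trial.suggest_float('lr', 1e-5, 1e-3, log=True)}"],
        )
    return cfg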

Step 3 – Add explicit NaN/inf checks in training_step

In your LightningModule:

def training_step(self, batch, batch_idx):
    x, y = batch

    if not torch.isfinite(x).all():
        raise RuntimeError("Non-finite input batch")

    logits = self(x)
    loss = self.loss_fn(logits, y)

    if not torch.isfinite(loss):
        raise RuntimeError("Non-finite loss")

    self.log("train_loss", loss)
    return loss

Then in your Optuna objective:

def objective(trial):
    try:
        # build model, dm, trainer
        trainer.fit(model, datamodule=dm)
        val_loss = trainer.callback_metrics["val_loss"].item()
        return val_loss
    except Exception as e:
        # optionally log e
        raise  # let Optuna record the exception

This ensures that:

  • You see real errors if the batch is NaN/inf.
  • Trials that fail because of data/PSO issues produce stack traces instead of quietly returning NaN.
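If the non-finite values only appear during the backward pass, Lightning's detect_anomaly flag (a thin wrapper around torch.autograd anomaly detection) will point at the operation that produced the bad gradient. It slows training down considerably, so enable it only while debugging:

import lightning

trainer = lightning.Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=1,
    detect_anomaly=True,      # debug only: reports the op that produced a NaN gradient
    logger=False,
    enable_checkpointing=False,
)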

Step 4 – Tighten PSO + learning rate ranges

Because PSO-over-QKV is aggressive, some hyperparameter combinations can legitimately explode:

  • Very large LR or PSO coefficients (w, c1, c2) → huge Q/K/V → massive attention logits → NaNs.
  • In single manual runs you use a “safe” combo; in Optuna, the search may sample extreme values.

As a test, narrow the search space around known-good values:

  • Lock PSO coefficients to one configuration that you know is stable.
  • Only let Optuna vary “boring” things (learning rate, batch size).
  • Once that works, gradually widen PSO-specific ranges.

If narrowing the search space makes HPO stable, you’ve identified hyperparameter-driven NaNs as a secondary factor.
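For example, a deliberately conservative sampling function that you can plug into the objective from section 2.2 might look like this (the PSO coefficient names and values are hypothetical placeholders for whatever your layer actually uses):

def sample_conservative_hparams(trial):
    # Narrowed search space: only "boring" hyperparameters are searched.
    return {
        "lr": trial.suggest_float("lr", 1e-4, 3e-4, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64]),
        # PSO coefficients pinned to known-stable values (hypothetical).
        "pso_w": 0.5,
        "pso_c1": 1.0,
        "pso_c2": 1.0,
    }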


Direct answers to your three questions

1. Does the DataLoader worker hypothesis seem plausible?

Yes. Combining:

  • notebook/Colab,
  • Optuna’s trial loop, and
  • Lightning DataLoaders with num_workers>0 and memmap,

matches known scenarios where DataLoaders freeze, misbehave, or blow up memory when used with outer parallelism. Jupyter/Colab + num_workers>0 is specifically known to be problematic, with common advice being “use num_workers=0 in notebooks or move to a .py script.”

Given that:

  • Single trainer.fit() runs fine.
  • Only Optuna trials fail instantly with NaN/inf.

This is exactly the pattern you’d expect from nested multiprocessing/DataLoader problems rather than from your PSO-Attention logic itself.

2. Has anyone run multi-process Optuna HPO with Lightning in Colab, and what’s the robust data loading pattern?

Yes, there are multiple Lightning + Optuna notebook examples (Kaggle, tutorials, blog posts). In practice they:

  • Run trials serially (n_jobs=1) in notebooks.
  • Use DataLoaders with num_workers=0 (or very small) inside each trial.
  • Use a single Trainer per trial with devices=1 and no complex distributed strategies.
  • Construct fresh model + DataModule per trial and free them afterwards.

That pattern is what you should copy for your PSO-based model in Colab. True multi-process HPO (many trials in parallel) is best done in a non-notebook script on a server/cluster.

3. Are there known issues with np.memmap + nested DataLoader workers in spawn context?

There is not a single official “do not do this” sentence, but there are strong related warnings:

  • With num_workers>0, each DataLoader worker may open its own np.memmap, leading to very high memory usage and possible instability.
  • DataLoader and multiprocessing have many known caveats; combining them with outer parallelism (like multiple Optuna jobs or threads) can cause freezes or erratic behavior.

The safer approach is:

  • Use num_workers=0 during HPO (especially in Colab).
  • Implement the dataset with lazy memmap opening (store path, shape, dtype, open inside each worker).
  • Move to a proper Python script with tuned num_workers and no nested multiproc if you later need performance-oriented HPO at scale.

If you reconfigure Optuna to run serially (n_jobs=1), set num_workers=0 in your HPO DataModule, and keep the Trainer single-device/no DDP, you will isolate the PSO-QKV part from the multiprocessing stack. If HPO becomes stable under these conditions, the paradox is resolved: the bug was the environment (nested multiprocessing + memmap), not your PSO-Attention itself.


Wow. I can’t thank you enough for the detailed breakdown here.

Let me parse this (gotta hit that day job lol) and respond in full.

But mostly, truly, thank you for your insight. It is invaluable to me.

Current status: working through perplexity scoring. The baseline Transformer sits at a perplexity of 303 on WikiText-2 and the hybrid at 378, both from pre-HPO runs. I'll share a link to the full code base once I move it to a Git repo.

