Short version up front:
- Yes, your “DataLoader workers inside Optuna trials” hypothesis is not only plausible, it matches well-known failure patterns in Jupyter/Colab.
- There are stable, well-tested patterns for Lightning + Optuna HPO in notebooks: they all avoid nested multiprocessing (serial trials, num_workers=0, simple Trainer setup).
- np.memmap + DataLoader + multiple workers is fragile: each worker may create its own mapping of the file, blowing up memory and interacting badly with spawn/fork semantics.
Below, I'll go through your three questions and then give a concrete, step-by-step debugging plan for VSM-PSO-Attn.
1. Is the “DataLoader workers in child processes” hypothesis plausible?
1.1. What actually happens in your stack
You effectively have several layers that can involve multiprocessing or multithreading:
- Optuna: study.optimize(objective, n_trials=..., n_jobs=...)
  - When n_jobs > 1, Optuna runs multiple trials in parallel. Historically this has been done with threads; users sometimes wrap Optuna in their own multiprocessing, or deploy samplers on external executors. Either way, you're adding outer concurrency around training.
- PyTorch Lightning Trainer
  - Lightning can spawn processes for distributed strategies (DDP, ddp_spawn, etc.).
  - Lightning docs (even older versions) explicitly warn that interactive environments (Jupyter/Colab) are not compatible with "real" DDP and recommend simpler setups.
- PyTorch DataLoader
  - num_workers > 0 creates worker processes that run your Dataset.__getitem__ in parallel.
  - These workers are separate OS processes, not threads.
In your case, a single HPO trial might look like:
Jupyter / Colab kernel
→ Optuna worker (thread or process, depending on setup)
→ Lightning Trainer (possibly with its own spawning strategy)
→ DataLoader workers (each worker is a process)
That is exactly the sort of nested multiprocessing that is brittle in notebooks.
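For concreteness, the brittle configuration looks roughly like the sketch below (illustrative names; VSMPSOAttnModule and MemmapDataModule stand in for your own classes):

import lightning
import optuna

def objective(trial):
    model = VSMPSOAttnModule(lr=trial.suggest_float("lr", 1e-5, 1e-3, log=True))
    # Inner multiprocessing: each trial's DataLoaders spawn worker processes.
    datamodule = MemmapDataModule(num_workers=4)
    trainer = lightning.Trainer(devices=1, max_epochs=3)
    trainer.fit(model, datamodule=datamodule)
    return trainer.callback_metrics["val_loss"].item()

study = optuna.create_study(direction="minimize")
# Outer concurrency: several trials run at once (threads), each with its own workers.
study.optimize(objective, n_trials=20, n_jobs=2)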
1.2. What we know about DataLoader in notebooks
There is a long history of “DataLoader + num_workers>0 + Jupyter/Colab” misbehaving:
- A StackOverflow question shows Jupyter notebooks freezing when num_workers > 0; top answer: "Jupyter notebooks have known issues with multiprocessing; set num_workers=0 or move to a .py script."
- Multiple PyTorch issues and forum threads report that DataLoader freezes or fails when using num_workers>0 in multi-thread or multi-process contexts; everything works with num_workers=0.
Your symptom is slightly different (NaN/inf loss instead of a visible hang), but the pattern is similar:
- A single process (trainer.fit() alone) works perfectly.
- When the code is invoked from inside an Optuna trial, trials “finish” instantly with bad metrics.
A plausible mechanism:
- In the Optuna context, DataLoader workers fail to start or fail to read properly from the memmap file (or they crash silently).
- The DataLoader yields empty or corrupt batches (e.g., tensors with all zeros, or even wrong shapes that trigger later NaNs).
- Your LightningModule computes a loss on this garbage and the objective returns NaN or inf, so Optuna marks the trial as failed.
Because Optuna interprets NaN / inf as a failed objective and simply continues to the next trial, you see “lots of trials completing in seconds with NaN/inf” rather than a crash. Optuna’s docs are explicit that NaN is treated as a failed trial, not as a fatal error.
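You can see this directly on the study object with standard Optuna APIs: trials that raised an exception or returned NaN end up with state FAIL, for example:

from optuna.trial import TrialState

# Count how many trials Optuna recorded as failed (exception or NaN objective).
failed = [t for t in study.trials if t.state == TrialState.FAIL]
print(f"{len(failed)} of {len(study.trials)} trials failed")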
So yes: your hypothesis that nested DataLoader workers in child processes are the culprit is not just plausible, it is very consistent with known interactions between:
- notebooks/Colab,
- PyTorch DataLoader with num_workers>0, and
- outer parallelism (threads/processes).
2. Has anyone run Lightning + Optuna HPO in Colab, and what pattern is robust?
There are multiple working examples of PyTorch Lightning + Optuna HPO in notebook environments (Kaggle, Colab-style). They are not doing PSO over QKV, but structurally the Optuna integration is the same.
2.1. Common pattern in working examples
Typical examples:
- MachineLearningMastery: "PyTorch Lightning Hyperparameter Optimization with Optuna" – shows how to define an objective(trial), create a LightningModule and Trainer inside it, then run Optuna. It's written as a notebook-friendly tutorial.
- Kaggle: "Complete Optuna Hparam Tuning – PyTorch Lightning" – a full notebook using Lightning + Optuna to tune a model, again with a standard objective function.
- Japanese blog posts (e.g., on Zenn) combining Lightning + Optuna – demonstrate training BERT and CNN models in notebook-like settings.
From these and Optuna’s own examples repo, certain patterns are consistent:
- Trials are serial (or minimally parallel) inside notebooks
  - study.optimize(..., n_jobs=1) in most notebook examples.
  - Parallel multi-process HPO is usually reserved for script/cluster environments, not Colab.
- DataLoaders are simple
  - Many examples just use num_workers=0 or small values.
  - They do not combine "outer" parallelism (e.g., multi-process trials) with "inner" DataLoader worker multiprocessing.
- New model + trainer per trial
  - objective(trial) constructs:
    - a LightningModule with hyperparameters from trial.
    - DataModule/DataLoaders (with simple num_workers).
    - Trainer(accelerator="gpu" or "cpu", devices=1, max_epochs=...).
  - After training, they read a metric from trainer.callback_metrics and return it.
- No complicated distributed strategies in the objective
  - They avoid DDP / ddp_spawn inside the Optuna objective when running in notebooks.
  - Lightning's own documentation strongly discourages DDP from interactive environments, noting that DataLoader with num_workers>0 can bottleneck or misbehave within DDP spawn.
This is why single trainer.fit() works for you: you are effectively in the “simple Lightning training” case. When you wrap it in Optuna and keep num_workers>0 and possibly multiple trials in parallel, you cross into a configuration that those examples explicitly avoid.
2.2. A robust pattern you can adopt for VSM-PSO-Attn in Colab
In Colab, the most robust pattern for Lightning + Optuna is:
- Run trials serially
  - Ensure study.optimize(objective, n_trials=..., n_jobs=1).
- Use single-process data loading inside trials
  - Your LightningDataModule should accept num_workers as an argument.
  - For HPO runs in Colab, pass num_workers=0 from the objective.
  - Each DataLoader inside the DataModule uses that num_workers.
- Use a minimal Trainer setup in objective(trial):
import lightning
import torch

def objective(trial):
    # Sample hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_int("batch_size", 16, 128)

    model = VSMPSOAttnModule(lr=lr, ...)
    datamodule = VSMPSODataModule(
        batch_size=batch_size,
        num_workers=0,    # critical in Colab
        memmap_path=...,  # your dataset
    )

    trainer = lightning.Trainer(
        accelerator="gpu" if torch.cuda.is_available() else "cpu",
        devices=1,
        max_epochs=trial.suggest_int("epochs", 2, 6),
        logger=False,
        enable_checkpointing=False,
    )
    trainer.fit(model, datamodule=datamodule)

    val_loss = trainer.callback_metrics["val_loss"].item()

    # Free resources before the next trial
    del trainer, model, datamodule
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return val_loss
- Note:
  - devices=1.
  - No explicit distributed strategy.
  - No nested spawns.
- Free resources between trials
  - Explicitly delete the Trainer, model, and datamodule.
  - Clear the CUDA cache if needed.
  - Avoid persistent_workers=True in your DataLoader while debugging, to prevent lingering worker processes between trials.
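Putting it together, the study driver in the notebook stays deliberately simple (direction and n_trials below are just examples):

import optuna

study = optuna.create_study(direction="minimize")
# n_jobs=1 keeps trials strictly serial inside the notebook;
# parallel trials are better run from a .py script on a server.
study.optimize(objective, n_trials=30, n_jobs=1)

print("Best value:", study.best_value)
print("Best params:", study.best_params)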
If you make those changes and trials suddenly stop failing instantly (they take as long as a “normal” training run and return finite losses), that is strong confirmation that your original configuration was hitting exactly the nested multiprocessing/DataLoader problem.
3. Are there known issues with np.memmap + nested DataLoader workers in a spawn context?
There are no official “this exact pattern is forbidden” warnings, but there are multiple closely related issues:
3.1. np.memmap with DataLoader workers
Several PyTorch forum threads and Q&A posts discuss np.memmap datasets with num_workers>0:
- A user with a 32 GB memmap file found that each DataLoader worker effectively loaded its own copy, causing huge memory usage; suggestions included:
  - Avoid loading the entire memmap in __init__.
  - Open the memmap lazily inside each worker (__getitem__ or a worker-local initializer).
  - Limit the number of workers.
- Another discussion shows that np.memmap can defeat the whole point of "using more workers," because each worker separately maps the file and can thrash memory. They recommend chunked loading and careful Dataset design rather than storing a big memmap array as a Dataset attribute.
The high-level lesson:
- With num_workers>0, each worker process potentially opens the memmap independently.
- Each worker has its own address space; if your Dataset.__init__ pre-loads or wraps the entire memmap, you can accidentally multiply RAM usage by the number of workers.
- If the memmap is on a slow or networked filesystem (e.g., Google Drive FUSE mount), concurrent access from multiple workers can be very slow or brittle.
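If you want to observe the per-worker mapping directly, here is a small diagnostic sketch (the MemmapProbe class and the file path are hypothetical; the torch.utils.data calls are standard). Each sample carries back the pid and worker id that produced it, so the main process can count how many separate processes opened the file:

import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, get_worker_info

class MemmapProbe(Dataset):
    def __init__(self, path, shape, dtype="float32"):
        self.path, self.shape, self.dtype = path, shape, dtype
        self._mm = None

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, idx):
        if self._mm is None:
            # Runs lazily in whichever process touches the dataset first:
            # with num_workers>0, that is once per worker process.
            self._mm = np.memmap(self.path, mode="r", dtype=self.dtype, shape=self.shape)
        info = get_worker_info()
        worker_id = -1 if info is None else info.id
        sample = torch.from_numpy(np.array(self._mm[idx]))
        return torch.tensor([os.getpid(), worker_id]), sample

# Usage (hypothetical file): expect one distinct (pid, worker_id) pair per mapping.
# loader = DataLoader(MemmapProbe("train.dat", shape=(1000, 16)), batch_size=8, num_workers=2)
# print({tuple(p.tolist()) for ids, _ in loader for p in ids})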
3.2. DataLoader + multiprocessing issues (independent of memmap)
Even without memmap, there are well-documented problems when:
- DataLoader uses num_workers>0, and
- you run multiple DataLoaders in parallel threads/processes.
Examples:
- A GitHub issue reports that multiple threads each using a DataLoader with num_workers>0 leads to freezing; using num_workers=0 fixes it.
- Threads on the PyTorch forums discuss how DataLoader + num_workers>0 can hang or throw errors in many non-trivial multiprocessing scenarios, especially on Windows or in Jupyter notebooks.
Your configuration adds a memmap and a FUSE filesystem on top of this, so you are stacking three things that all stress multiprocessing:
- Jupyter/Colab (non-standard process model).
- DataLoader with worker processes.
- memmap on a remote or pseudo filesystem.
3.3. Safer Dataset pattern for memmap + DataLoader
The pattern that is generally recommended for np.memmap + DataLoader is:
import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    def __init__(self, path, shape, dtype="float32"):
        self.path = path
        self.shape = shape
        self.dtype = dtype
        self._mm = None  # lazy: opened on first access, per process

    def _lazy_init(self):
        if self._mm is None:
            self._mm = np.memmap(
                self.path,
                mode="r",  # read-only for safety
                dtype=self.dtype,
                shape=self.shape,
            )

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, idx):
        self._lazy_init()
        x = self._mm[idx]
        # Copy out of the memmap and convert to a torch.Tensor;
        # add any further preprocessing here.
        return torch.from_numpy(np.array(x))
This design (explicitly recommended in memmap-related discussions) ensures:
- The Dataset only stores metadata (path, shape, dtype).
- Each worker creates its own memmap lazily when needed.
- You do not try to pickle a huge memmap array across worker processes.
Combine this with num_workers=0 in Optuna trials in Colab, and you avoid:
- nested multiproc + memmap,
- and the majority of the known problem cases.
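For the HPO runs themselves, the matching DataLoader is then built single-process; the path and shape below are placeholders for your own data:

from torch.utils.data import DataLoader

train_ds = MemmapDataset(
    path="/content/drive/MyDrive/train.dat",  # example path - substitute your own
    shape=(100_000, 256, 64),                 # example (N, seq_len, d_model)
    dtype="float32",
)
train_loader = DataLoader(
    train_ds,
    batch_size=64,
    shuffle=True,
    num_workers=0,  # single-process loading while tuning in Colab
)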
Practical debugging steps specific to your VSM-PSO-Attn setup
Below is a concrete sequence you can follow in your notebook to both validate the hypothesis and converge to a stable HPO setup.
Step 1 – Make trials strictly single process
In your Colab:
- Ensure study.optimize(objective, n_trials=..., n_jobs=1).
- In the Optuna objective(trial), use the minimal setup from section 2.2: num_workers=0, devices=1, no distributed strategy.
Outcome to look for:
- If trials now run for realistic training time and produce finite losses (no more instant NaN/inf), then your nested multiprocessing hypothesis is confirmed.
Step 2 – Replace the dataset with a synthetic DataModule
Define a simple LightningDataModule for debugging:
import torch
import lightning as L
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dimensions - set these to match your model
seq_len, d_model, num_classes = 128, 64, 10

class SyntheticDM(L.LightningDataModule):
    def __init__(self, batch_size=32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        self.train_data = torch.randn(1024, seq_len, d_model)
        self.train_targets = torch.randint(num_classes, (1024,))
        self.val_data = torch.randn(256, seq_len, d_model)
        self.val_targets = torch.randint(num_classes, (256,))

    def train_dataloader(self):
        ds = TensorDataset(self.train_data, self.train_targets)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=True, num_workers=0)

    def val_dataloader(self):
        ds = TensorDataset(self.val_data, self.val_targets)
        return DataLoader(ds, batch_size=self.batch_size, shuffle=False, num_workers=0)
Run Optuna HPO using this synthetic module, keeping PSO-Attn active.
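The objective only needs to swap in this DataModule; a sketch reusing the hypothetical VSMPSOAttnModule from section 2.2:

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    model = VSMPSOAttnModule(lr=lr)  # PSO-Attn stays active; add your other fixed hparams
    datamodule = SyntheticDM(batch_size=trial.suggest_int("batch_size", 16, 128))
    trainer = lightning.Trainer(accelerator="auto", devices=1, max_epochs=2,
                                logger=False, enable_checkpointing=False)
    trainer.fit(model, datamodule=datamodule)
    return trainer.callback_metrics["val_loss"].item()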
- If synthetic data works but your memmap DataModule fails, the problem is definitely connected to memmap + DataLoader + nested multiproc.
- If synthetic data also fails in HPO (even with num_workers=0), then the culprit is more likely:
  - how the objective is written (return value, exception handling),
  - the PSO hyperparameter ranges (some trials explode numerically),
  - or interactions with Hydra (e.g., re-initializing configs per trial).
Step 3 – Add explicit NaN/inf checks in training_step
In your LightningModule:
def training_step(self, batch, batch_idx):
    x, y = batch
    if not torch.isfinite(x).all():
        raise RuntimeError("Non-finite input batch")
    logits = self(x)
    loss = self.loss_fn(logits, y)
    if not torch.isfinite(loss):
        raise RuntimeError("Non-finite loss")
    self.log("train_loss", loss)
    return loss
Then in your Optuna objective:
def objective(trial):
    try:
        # build model, dm, trainer
        trainer.fit(model, datamodule=dm)
        val_loss = trainer.callback_metrics["val_loss"].item()
        return val_loss
    except Exception as e:
        # optionally log e
        raise  # let Optuna record the exception
This ensures that:
- You see real errors if the batch is NaN/inf.
- Trials that fail because of data/PSO issues produce stack traces instead of quietly returning NaN.
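If you want the study to keep going past these explicit failures instead of stopping at the first raised exception, Optuna's optimize() takes a catch argument; trials whose exceptions match it are marked as failed and the study continues:

study.optimize(
    objective,
    n_trials=30,
    n_jobs=1,
    catch=(RuntimeError,),  # RuntimeError trials are recorded as FAIL, the study moves on
)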
Step 4 – Tighten PSO + learning rate ranges
Because PSO-over-QKV is aggressive, some hyperparameter combinations can legitimately explode:
- Very large LR or PSO coefficients (w, c1, c2) → huge Q/K/V → massive attention logits → NaNs.
- In single manual runs you use a “safe” combo; in Optuna, the search may sample extreme values.
As a test, narrow the search space around known-good values:
- Lock PSO coefficients to one configuration that you know is stable.
- Only let Optuna vary “boring” things (learning rate, batch size).
- Once that works, gradually widen PSO-specific ranges.
If narrowing the search space makes HPO stable, you’ve identified hyperparameter-driven NaNs as a secondary factor.
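A narrowed objective for this test could look like the sketch below (w, c1, c2 stand in for however your PSO coefficients are named; the fixed values are placeholders for your known-good configuration):

def objective(trial):
    # PSO coefficients locked to a configuration that is stable in manual runs
    pso_cfg = dict(w=0.7, c1=1.5, c2=1.5)  # placeholder values - use your known-good ones

    # Only "boring" hyperparameters vary for now
    lr = trial.suggest_float("lr", 1e-4, 3e-4, log=True)
    batch_size = trial.suggest_int("batch_size", 32, 64)

    model = VSMPSOAttnModule(lr=lr, **pso_cfg)
    datamodule = VSMPSODataModule(batch_size=batch_size, num_workers=0, memmap_path=...)
    trainer = lightning.Trainer(accelerator="auto", devices=1, max_epochs=3,
                                logger=False, enable_checkpointing=False)
    trainer.fit(model, datamodule=datamodule)
    return trainer.callback_metrics["val_loss"].item()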
Direct answers to your three questions
1. Does the DataLoader worker hypothesis seem plausible?
Yes. Combining:
- notebook/Colab,
- Optuna’s trial loop, and
- Lightning DataLoaders with num_workers>0 and memmap,
matches known scenarios where DataLoaders freeze, misbehave, or blow up memory when used with outer parallelism. Jupyter/Colab + num_workers>0 is specifically known to be problematic, with common advice being “use num_workers=0 in notebooks or move to a .py script.”
Given that:
- A single trainer.fit() runs fine.
- Only Optuna trials fail instantly with NaN/inf.
This is exactly the pattern you’d expect from nested multiprocessing/DataLoader problems rather than from your PSO-Attention logic itself.
2. Has anyone run multi-process Optuna HPO with Lightning in Colab, and what’s the robust data loading pattern?
Yes, there are multiple Lightning + Optuna notebook examples (Kaggle, tutorials, blog posts). In practice they:
- Run trials serially (n_jobs=1) in notebooks.
- Use DataLoaders with num_workers=0 (or very small) inside each trial.
- Use a single Trainer per trial with devices=1 and no complex distributed strategies.
- Construct fresh model + DataModule per trial and free them afterwards.
That pattern is what you should copy for your PSO-based model in Colab. True multi-process HPO (many trials in parallel) is best done in a non-notebook script on a server/cluster.
3. Are there known issues with np.memmap + nested DataLoader workers in a spawn context?
There is not a single official “do not do this” sentence, but there are strong related warnings:
- With num_workers>0, each DataLoader worker may open its own np.memmap, leading to very high memory usage and possible instability.
- DataLoader and multiprocessing have many known caveats; combining them with outer parallelism (like multiple Optuna jobs or threads) can cause freezes or erratic behavior.
The safer approach is:
- Use num_workers=0 during HPO (especially in Colab).
- Implement the dataset with lazy memmap opening (store path, shape, dtype, open inside each worker).
- Move to a proper Python script with tuned num_workers and no nested multiproc if you later need performance-oriented HPO at scale.
If you reconfigure Optuna to run serially (n_jobs=1), set num_workers=0 in your HPO DataModule, and keep the Trainer single-device/no DDP, you will isolate the PSO-QKV part from the multiprocessing stack. If HPO becomes stable under these conditions, the paradox is resolved: the bug was the environment (nested multiprocessing + memmap), not your PSO-Attention itself.