Is this possible?

I thought about extracting the neurons from a model like gpt2-xl, creating an infinite space, and following a center-focused computation logic. In that space, I would create a sphere that expands with each added neuron. I wanted to construct each new neuron tree to grow outwards from the center like lightning. The plan was to move the model into this sphere and then ‘close’ the sphere.

Afterwards, I would add parallels and meridians to the sphere, like those on Earth, and then add an MoE (Mixture of Experts) controller tied to the Graticule structures formed by these parallels and meridians. I also wanted to add depth to these Graticule structures—I deemed a maximum of 10 layers to be appropriate. I aimed to scale the maximum depth proportionally to the longest neuron tree and apply it across all Graticule layers.

But the key point here, and the part I struggled with the most, is creating a model with weights that are open, modifiable, and trainable. This was a prerequisite for my goal of storing and transporting the model in an expandable VHD (Virtual Hard Disk) file, as the weights themselves needed to be in a modifiable file format.

I must say, trying to run inference with an AMD graphics card at this stage is a real pain. I tried 5 or 6 times with different methods and eventually gave up. I hope someone else gives it a try. I’m curious about both its performance and whether different training methods could be created for it.

A little research follows, done with Gemini Deep Research:

Evaluation of Inference Applications for Dynamic LLM Architectures: Low VRAM and MoE Layer Management on AMD GPUs

1. Introduction: A Vision for Dynamic LLM Architectures

This report addresses the innovative Large Language Model (LLM) architecture concept proposed by the user and its advanced technical requirements. The user’s vision includes a “sphere” model that “evolves within an infinite space,” expanding as neurons are added, radiating from a central point like “lightning-like neuron trees.” This sphere is enclosed by a surface featuring “Graticule layers” and “10 layers in depth,” with the number of parallels and meridians equal to those on Earth. A core technical requirement for this architecture is the ability to dynamically load, infer, and then release specific MoE (Mixture of Experts) layers, such as the “2nd layer of the 18th Graticule,” to the GPU.

The primary goals for this architecture include achieving high inference performance with low VRAM (Video Random Access Memory) consumption and supporting open-weight MoE models to allow flexibility in custom training strategies. On the hardware side, given the user’s use of an AMD graphics card, the research specifically investigates inference applications with Vulkan API support. The purpose of this report is to systematically review and analyze existing LLM inference applications (such as LM Studio, llama.cpp, Ollama, etc.) against these highly specific and advanced requirements. In light of the findings, a feasibility assessment for the user’s proposed architecture will be provided, along with actionable recommendations.

2. The AMD GPU LLM Inference Ecosystem: ROCm vs. Vulkan

This section delves into AMD’s software support for LLM inference, critically comparing ROCm and Vulkan backends, particularly in the context of consumer-grade AMD GPUs.

Overview of AMD’s LLM Support

AMD is a significant player in open-source AI and a founding member of the PyTorch Foundation. It offers support for major frameworks like TensorFlow, JAX, and Triton. ROCm is AMD’s open-source software stack for GPU computing, providing a robust and scalable environment for AI and high-performance computing (HPC). Recent versions like ROCm 6.0 and the upcoming 7.0 have brought significant improvements in performance and compatibility.

vLLM, a successful LLM inference and serving engine, has been optimized for AMD Instinct GPUs (e.g., MI300X) via ROCm. vLLM V1 introduces architectural improvements that enhance flexibility and scalability while retaining core features. These include an asynchronous scheduler that separates CPU-intensive operations (tokenization/detokenization, image preprocessing) from GPU-intensive model execution without blocking it, and advanced features like chunked prefill and prefix caching enabled by default. These advancements demonstrate AMD’s commitment to high-performance LLM inference on its professional-grade hardware.

The Rise of Vulkan for Consumer AMD GPUs

While ROCm is AMD’s native compute platform, its support for consumer-grade AMD GPUs has historically been limited or challenging to set up. Many AMD GPUs (e.g., RX 6600M, 7900, 7600) do not fully support ROCm, prompting users to seek alternative solutions. This situation has led to the community embracing Vulkan as a practical bridge for consumer cards, despite AMD’s official strategy focusing ROCm on high-performance data center GPUs. In the user’s hardware context, Vulkan’s practical accessibility and existing community support might be more decisive than theoretical peak performance figures.

Vulkan, though primarily a graphics library, has emerged as a viable and often more accessible backend for LLM inference across a wide range of AMD cards, especially consumer ones. Community efforts have enabled Vulkan support in popular tools like LM Studio (which uses llama.cpp as its backend) and Ollama (via actively developed forks). TabbyML explicitly supports Vulkan to provide GPU acceleration on cards not supported by CUDA or ROCm.

Performance Assessment: ROCm vs. Vulkan on AMD

There are conflicting anecdotal reports regarding ROCm and Vulkan performance. Some users report ROCm being “noticeably faster” for LLM inference, while others state that “Vulkan works fine” and that performance can be comparable to, or even faster than, ROCm in certain scenarios. This suggests that the performance difference is not always definitive. For a user focused on low VRAM and dynamic loading, Vulkan’s ease of setup and broad compatibility (as evidenced by its widespread adoption in llama.cpp and community forks) might be a more critical factor than marginal gains from a potentially more difficult ROCm setup. This necessitates a focus on efficiency and memory management as much as raw throughput.

The following table presents concrete performance measurements (tokens per second) obtained when running the Llama 2 7B model using llama.cpp’s Vulkan backend on various AMD Radeon GPUs (e.g., RX 7900 XT, RX 6800 XT, RX 7600 XT). This data is crucial for assessing the practical applicability of Vulkan for the intended application and setting realistic performance expectations.

Table 2.1: AMD GPU Vulkan Performance in llama.cpp (Llama 2 7B, Q4_0)

Chip (GPU model) | pp512 t/s (prompt processing) | tg128 t/s (token generation) | Commit (version ID)
AMD Radeon RX 7900 XT | 2941.58 ± 17.17 | 123.18 ± 0.40 | 71e74a3
AMD Radeon RX 7800 XT | 1260.54 ± 10.51 | 107.53 ± 0.07 | ee02ad0
AMD Radeon RX 6900 XT | 1257.98 ± 1.55 | 101.42 ± 0.02 | 44e18ef
AMD Radeon RX 6800 XT | 1533.60 ± 2.47 | 95.56 ± 0.72 | N/A
AMD Radeon RX 6750 XT | 1040.58 ± 0.35 | 81.98 ± 0.03 | 228f34c
AMD Radeon RX 7600 XT | 632.88 ± 0.70 | 58.44 ± 0.01 | 3b24d26
AMD Radeon RX 6600 XT | 574.65 ± 0.86 | 53.92 ± 0.11 | 091592d
AMD Radeon RX 6600M | 439.42 ± 0.34 | 54.69 ± 0.03 | 2739a71


3. Mixture of Experts (MoE) Models: Principles and Inference Challenges

This section addresses the MoE architecture, its benefits, and the inherent VRAM challenges central to the user’s query.

Principles of MoE Architecture

Mixture of Experts (MoE) models offer a paradigm shift in scaling LLMs, allowing for larger total parameter counts without a proportional increase in computational cost during inference. The core concept involves “sparse activation,” where only a small subset of specialized subnetworks, known as “experts” (typically feed-forward networks, FFNs), is activated for each input token. A “gating network” or “router” dynamically determines the most relevant k experts (e.g., 2 out of 8) for a given token and directs computation accordingly. The outputs from these selected experts are then combined. This sparse activation leads to significant computational savings per token (lower FLOPs) compared to dense models of equivalent total parameter size.
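To make the routing mechanics concrete, the following is a minimal PyTorch sketch of a sparsely activated layer with a top-k router; the class name, dimensions, and expert structure are illustrative assumptions rather than any particular model’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative sparse MoE layer: a router scores every expert per token,
    only the top-k expert FFNs run, and their outputs are mixed with the
    normalized router weights."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # the "gating network"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (n_tokens, d_model)
        scores = self.router(x)                        # (n_tokens, n_experts)
        weights, idx = torch.topk(scores, self.k, -1)  # choose k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):      # only selected tokens reach each expert
            for slot in range(self.k):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer()
print(layer(torch.randn(4, 512)).shape)                # torch.Size([4, 512])
```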

Inherent VRAM Challenges of MoE Inference

Despite the computational efficiency derived from sparse activation, a major challenge for MoE models during inference is their high VRAM requirements. The dynamic nature of the router’s expert selection necessitates that all potential experts across all MoE layers are simultaneously loaded into GPU VRAM for fast inference. If an expert’s parameters are not in VRAM, they must be fetched from slower system RAM or storage, introducing significant delays.

This creates a crucial distinction: while MoE models are often touted as computationally efficient (“faster,” “more efficient”), the total parameter count remains high, and all experts must reside in VRAM for optimal inference speed. This forms a tension where MoE is computationally efficient but memory-capacity-demanding. It provides a critical clarification: the user’s “low VRAM” goal must be understood in terms of the active parameter set rather than the entire model’s footprint.

For instance, Mixtral-8x7B, despite having 46 billion total parameters (but only 13 billion active per token), requires at least 92GB of VRAM for FP16 deployment because all experts must be accessible. This “all experts in VRAM” requirement limits the deployment of large MoE models on consumer devices with limited GPU memory, despite their theoretical computational efficiencies.
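That figure follows from simple arithmetic, as the quick check below shows (ignoring the KV cache and runtime overhead):

```python
def fp16_weight_gb(params_in_billions: float) -> float:
    """FP16 stores 2 bytes per parameter; KV cache, activations, and runtime
    overhead are ignored in this rough estimate."""
    return params_in_billions * 1e9 * 2 / 1e9

print(fp16_weight_gb(46))  # ~92 GB: every Mixtral-8x7B expert resident in VRAM
print(fp16_weight_gb(13))  # ~26 GB: only the ~13B parameters active per token
```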

The user’s vision of dynamically loading and releasing specific “Graticule layers” implies fetching them only when needed. However, research explicitly states that this approach would lead to significant performance degradation: loading selected experts on demand from disk would be “awfully slow”. Even fetching 11.3B parameters from CPU RAM via PCIe 4.0 x16 can incur a substantial latency of 0.7 seconds per step. This is unacceptable for real-time LLM inference and directly contradicts the user’s “high performance” goal. This points to a fundamental trade-off, requiring either loading everything for speed or accepting severe performance degradation with true dynamic loading from slow storage.
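That latency figure can be sanity-checked with a back-of-the-envelope calculation; the bandwidth below is an assumed round number, not a measurement:

```python
params_moved = 11.3e9        # parameters fetched per step (figure from the text)
bytes_per_param = 2          # FP16
pcie4_x16_bw = 32e9          # ~32 GB/s theoretical PCIe 4.0 x16 bandwidth (assumed)

transfer_s = params_moved * bytes_per_param / pcie4_x16_bw
print(round(transfer_s, 2))  # ~0.71 s per step just to move the weights;
                             # a SATA SSD or hard disk would be far slower still
```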

4. VRAM Optimization and Dynamic Layer/Expert Management Strategies

This section examines existing techniques for VRAM optimization and critically evaluates the feasibility and performance implications of dynamic, on-demand loading of specific layers or experts, directly addressing the user’s core architectural concept.

Built-in VRAM Optimization Techniques

  • Quantization: This is the primary method for reducing the memory footprint of LLMs. By storing model weights at lower precision (e.g., INT4 or FP16 instead of FP32), it shrinks model size (e.g., roughly 8x for INT4) and can accelerate computation. Quantization is widely adopted for LLM inference (a rough size sketch follows this list).
  • Offloading (CPU/Disk): For models too large to fit entirely into GPU VRAM, offloading involves moving parts of the model (weights, KV cache) to slower CPU memory or even disk. llama.cpp specifically supports distributing model layers between the CPU and GPU. Offloading a portion of layers (e.g., 50%) can significantly boost inference speed compared to CPU-only inference, with performance scaling proportionally to the percentage of layers offloaded to the GPU. For MoE models, keeping active expert weights and the KV cache on the GPU is crucial for significant speedup, even if the majority of total model weights reside in system RAM.
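The following sketch shows roughly what those savings look like; the bits-per-weight values are approximate figures for common GGUF quantization types (assumed here) and vary by model and quant mix:

```python
# Approximate effective bits per weight (assumed values; actual GGUF sizes vary).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "IQ4_XS": 4.3}

def weight_gb(params_in_billions: float, quant: str) -> float:
    return params_in_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"7B model at {quant}: ~{weight_gb(7, quant):.1f} GB of weights")
```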

Layer Dropping and Its Variants

  • Concept: This technique involves removing less critical layers from a pre-trained LLM to reduce computational load and memory usage. This enables faster inference with minimal performance degradation (often retaining 95-99% of original performance even after removing 25-50% of layers) and provides a 2x to 5x speedup.
  • Static vs. Dynamic Layer Dropping:
    • Static Layer Dropping: A fixed set of layers is removed for all inputs, simplifying implementation and reducing computational costs during inference.
    • Dynamic Layer Dropping: The layers to be skipped are adjusted based on the characteristics of each input, often using a “router” to dynamically determine layer usage. While more complex, this method can adapt better to diverse inputs (see the sketch after this list).
  • Distinction from On-Demand Expert Loading: Layer dropping, even when dynamic, typically refers to skipping computations for certain layers within an already loaded model. It differs from the user’s concept of loading and unloading specific MoE experts from external memory.
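As a reference point for the dynamic variant, here is a minimal PyTorch sketch of an input-conditioned skippable block; the gating scheme and names are hypothetical simplifications (the skip decision here is per batch, not per token):

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Illustrative dynamic layer dropping: a tiny per-layer gate decides, from
    the current hidden state, whether to run this block or skip it entirely."""
    def __init__(self, block: nn.Module, d_model: int, threshold: float = 0.5):
        super().__init__()
        self.block = block
        self.gate = nn.Linear(d_model, 1)   # per-layer router
        self.threshold = threshold

    def forward(self, x):                   # x: (batch, seq, d_model)
        score = torch.sigmoid(self.gate(x.mean(dim=1)))  # one score per sequence
        if score.mean().item() < self.threshold:
            return x                        # layer skipped: no compute for this input
        return self.block(x)                # layer executed as usual

block = SkippableBlock(nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), d_model=64)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```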

Feasibility of On-Demand Expert Loading (MoE Specific)

As previously noted, loading selected experts on demand from disk would be “awfully slow” and impractical for real-time inference. Even fetching from CPU RAM via PCIe can introduce significant latency (e.g., 0.7 seconds per step for 11.3B parameters over PCIe 4.0 x16).

The router in an MoE model can select different experts for consecutive tokens, which would necessitate continuous fetching and potentially “swapping in and out,” thereby “killing” any benefit. The probability of switching between expert groups is generally hard to predict, limiting the effectiveness of prefetching strategies. Theoretically, if a “single full vertical slice of the model” can be loaded into memory, it might be possible to “hot swap” experts from storage with “just barely” a severe slowdown. However, this is a compromise where performance is sacrificed for VRAM reduction.

Emerging MoE architectures like FloE (Mixture of Lookup Experts) are being explored. FloE aims to re-parameterize experts as “computation-free LUTs” (Look-Up Tables) during inference, thereby significantly reducing memory footprint and communication overhead by eliminating the need to load them into VRAM for computation. This approach addresses the dynamic loading problem through a fundamental architectural change.
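A very loose sketch of that idea as it is summarized here; the table shape, dtype, and indexing by token id are illustrative assumptions, not the published method:

```python
import numpy as np

vocab_size, d_model = 32000, 512

# Each expert's contribution per vocabulary entry is precomputed offline into a
# table; in practice such a table could live on disk (e.g., np.memmap) and be
# streamed row by row instead of holding full expert weights in VRAM.
expert_lut = np.zeros((vocab_size, d_model), dtype=np.float16)

def lut_expert(token_ids: np.ndarray) -> np.ndarray:
    # Inference reduces to a row lookup: no matmul against expert weights.
    return expert_lut[token_ids]

print(lut_expert(np.array([11, 42, 7])).shape)   # (3, 512)
```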

General Dynamic Loading/Scaling Challenges

General LLM inference systems face “cold start” issues due to large model weights (tens of GBs) and heavy container images (8-10 GB), leading to prolonged startup and loading times. This implies that dynamically loading model components would encounter similar I/O and initialization bottlenecks. Current systems often lack “on-demand streaming” for model files, requiring them to be fully downloaded and written to disk before inference can commence.

The user’s specific request to “load the 2nd layer of the 18th Graticule to the GPU, perform inference, and then release it” implies a highly granular, runtime-dependent memory management strategy. While frameworks like llama.cpp offer static layer/tensor offloading (configured at model load time), they lack an API for such dynamic, per-step swapping. This means the user’s vision of true, per-step dynamic loading is not a ready-made feature in popular existing frameworks and would require significant custom development. This points to a system design and implementation challenge rather than a product selection problem.
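To make the missing capability concrete, here is a purely hypothetical sketch of the kind of runtime interface that behavior would require; none of these names or methods exist in llama.cpp, ggml, or any other framework discussed in this report.

```python
class GraticuleRuntime:
    """Hypothetical 'load, infer, release' interface; every method body is a
    placeholder for work no current framework exposes at runtime."""
    def __init__(self, model_path: str, device: str = "vulkan0"):
        self.model_path = model_path
        self.device = device
        self.resident = {}                       # (graticule, depth) -> loaded tensors

    def load_expert(self, graticule: int, depth: int):
        # Would stream that expert's tensors from disk/RAM into VRAM.
        self.resident[(graticule, depth)] = f"tensors:{graticule}.{depth}"

    def infer(self, tokens, graticule: int, depth: int):
        assert (graticule, depth) in self.resident, "expert not loaded"
        return f"logits for {len(tokens)} tokens via expert {graticule}.{depth}"

    def release_expert(self, graticule: int, depth: int):
        self.resident.pop((graticule, depth), None)  # would free the VRAM again

rt = GraticuleRuntime("sphere-model.gguf")
rt.load_expert(graticule=18, depth=2)    # "the 2nd layer of the 18th Graticule"
rt.infer([1, 2, 3], graticule=18, depth=2)
rt.release_expert(graticule=18, depth=2)
```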

Furthermore, there is a clear contradiction between the user’s explicit goal of “achieving high performance with low VRAM” and the method of dynamically loading layers/experts. Research clearly indicates a strong negative correlation between dynamic loading from slow memory and inference speed. This means the proposed method directly undermines one of the primary goals. The fundamental takeaway is that the user must prioritize: either accept higher VRAM usage for high performance (by keeping all active experts on the GPU) or accept significantly lower performance for very low VRAM via dynamic loading from disk. This “trade-off triangle” is unavoidable with current technology.

5. Evaluation of Existing LLM Inference Applications on AMD GPUs

This section provides a detailed analysis of key LLM inference applications, specifically evaluating their Vulkan support on AMD GPUs, MoE model capabilities, and fine-grained layer/tensor offloading features.

5.1. llama.cpp

llama.cpp explicitly supports Vulkan as a backend for AMD GPU acceleration. This is a significant advantage, especially for AMD GPU users with consumer-grade cards where ROCm setup can be problematic. Compiling llama.cpp with Vulkan support typically involves setting GGML_VULKAN=ON during CMake configuration. Performance benchmarks using the Vulkan backend on various AMD Radeon GPUs (e.g., RX 7900 XT, RX 6800 XT, RX 7600 XT) are available, demonstrating its practical viability for LLM inference.

llama.cpp is the primary engine for GGUF models, a format recommended for efficient loading and inference. It also supports large MoE models such as Qwen3-235B-A22B, which are distributed in the GGUF format. GGUF models are organized into “blocks” (layers), with each block containing various “tensors” (e.g., attention tensors, FFN expert tensors). The current llama.cpp implementation for MoE models does not optimally utilize NUMA architecture, which can lead to performance bottlenecks in multi-socket systems.

For VRAM management, llama.cpp offers command-line options (combined in a usage sketch after this list):

  • --gpu-layers (-ngl): Allows specifying the number of model layers (blocks) to offload to the GPU. If multiple GPUs are present, layers are typically assigned evenly.
  • --override-tensor (-ot): This powerful flag provides more granular control by allowing specific tensors within layers to be offloaded to the CPU, even if the rest of the layer remains on the GPU. For MoE models, this is particularly useful for offloading large FFN expert tensors (which are less GPU-intensive and larger) to the CPU while keeping smaller, GPU-intensive attention tensors on the GPU. Example usage: -ot ".ffn_.*_exps.=CPU".
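A hedged sketch of how the two flags combine in practice, wrapped here in Python; the binary name, model file, and values are placeholders, and flag spellings should be checked against the local build’s --help output since they change between llama.cpp releases.

```python
import subprocess

cmd = [
    "./llama-server",                            # placeholder path to a llama.cpp binary
    "-m", "models/my-moe-model-q4_k_m.gguf",     # hypothetical GGUF MoE model file
    "-ngl", "99",                                # offload as many blocks as fit in VRAM
    "--override-tensor", ".ffn_.*_exps.=CPU",    # keep large FFN expert tensors in CPU RAM
    "-c", "8192",                                # context length
]
subprocess.run(cmd, check=True)                  # placement is fixed for the whole run
```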

This offloading is a static configuration made at model load time, not a dynamic, per-token or per-layer runtime API. Given the user’s specific needs, llama.cpp’s direct Vulkan support, its ability to handle open-weight GGUF MoE models, and its provision of precise (though static) offloading options like --gpu-layers and --override-tensor make it the most suitable and flexible choice. This combination is the closest available tool for fine-grained control, especially for a doctoral researcher who might require low-level access and customization.

5.2. LM Studio

LM Studio is a user-friendly application that typically uses llama.cpp as its backend. Users have reported successfully running LM Studio with llama.cpp and Vulkan acceleration on AMD Radeon GPUs. However, some users have noted that LM Studio might inaccurately report VRAM usage (e.g., showing 0 VRAM) even when GPU computation is active.

As it leverages llama.cpp’s capabilities, LM Studio inherits MoE model support and layer/tensor offloading features. It provides a visual indicator (a green rocket ship for full offloading, blue for partial) to show how much of the model is loaded into VRAM. However, while offering a simplified user experience, LM Studio may provide less fine-grained control than the direct llama.cpp CLI for advanced, experimental scenarios.

5.3. Ollama

Officially, Ollama’s GPU support is limited to CUDA (NVIDIA) and Apple Metal, excluding native Vulkan support for AMD/Intel GPUs. However, actively developed community forks, such as ollama-vulkan (developed by @Whyvl), enable Vulkan support by compiling Ollama with LLAMA_VULKAN=1. This requires manual compilation steps.

Ollama supports GGUF models and includes CPU and memory optimizations. Specific details regarding fine-grained MoE dynamic layer management or tensor offloading capabilities beyond general GGUF support are not explicitly mentioned in the provided research materials.

5.4. Other Relevant Frameworks

  • vLLM: Primarily optimized for NVIDIA CUDA and AMD Instinct GPUs via ROCm. While it excels at high-throughput serving with features like PagedAttention, it does not focus on Vulkan for consumer AMD cards or fine-grained dynamic layer/expert loading for VRAM optimization in the user’s specific conceptual model.
  • TabbyML: Has explicitly announced Vulkan support to provide GPU acceleration for cards not supported by CUDA or ROCm. It offers pre-built Vulkan binaries for ease of use. However, the provided research material does not contain information about its specific support for MoE models and dynamic layer loading.

Overall, despite the ability of existing inference frameworks to configure layer/tensor placement, the research explicitly indicates a lack of information regarding dynamic loading/unloading of specific layers or experts at runtime for MoE models. This is a critical limitation for the user’s vision of “loading the 2nd layer of the 18th Graticule to the GPU, performing inference, and then releasing it.” Current mechanisms allow for static partitioning of the model between GPU and CPU, but not on-demand swapping of individual MoE experts during inference. This highlights a significant technical gap between the user’s conceptual model and the capabilities of existing off-the-shelf inference frameworks.

6. Feasibility Assessment of the User’s Proposed Model

This section synthesizes the findings to evaluate the feasibility of the user’s innovative “sphere” model architecture with existing LLM inference applications, focusing on the interplay between low VRAM, high performance, and dynamic layer management.

Alignment with Existing Capabilities

  • Vulkan on AMD: The good news is that Vulkan is a well-supported and performant backend for LLM inference on AMD GPUs, particularly through llama.cpp and its derivatives. This fundamental requirement is met.
  • Open-Weight MoE Models: Standard open-weight MoE models (in GGUF format) are fully supported by llama.cpp. This provides the flexibility the user desires for custom training strategies.
  • Partial Layer/Tensor Offloading (Static): Existing frameworks, especially llama.cpp, offer robust mechanisms (--gpu-layers, --override-tensor) to statically offload specific layers or even fine-grained tensors (like FFN experts) to CPU RAM. This is crucial for managing VRAM on consumer GPUs and achieving better performance than full CPU inference. Quantization also aids in VRAM reduction.

Challenges of Dynamic, Per-Layer/Per-Expert Loading (Runtime)

  • Lack of Native Runtime Dynamic Loading API: The user’s vision of “loading the 2nd layer of the 18th Graticule to the GPU, performing inference, and then releasing it” (load, infer, release) is not natively supported as a runtime dynamic feature in current mainstream LLM inference frameworks (like llama.cpp). Existing offloading mechanisms are configured at model initialization.
  • Unacceptable I/O Latency: Even if a custom implementation were attempted, the fundamental bottleneck of I/O speed (fetching experts from disk or even CPU RAM) would severely compromise the “high performance” goal. Loading 11.3B parameters per step from CPU RAM over PCIe 4.0 x16 can take 0.7 seconds, and from disk, it could exceed 10 seconds. This makes real-time, fine-grained expert swapping impractical.
  • MoE’s “All in VRAM” Requirement: For optimal performance, MoE models fundamentally require all potential experts to be accessible in fast VRAM, as the router’s selection is dynamic and unpredictable at inference time. The “low VRAM” goal must be balanced against this reality; if high performance is desired, this applies more to the active parameter set and KV cache, not the total model footprint.
  • Conceptual vs. Technical Match: The “infinite space,” “growing sphere,” and “Graticule layers” represent a highly abstract and evolving model architecture. Current inference runtimes are designed for static, pre-defined computational graphs. Adapting to a dynamically growing and evolving model that adds/removes neurons/layers would require fundamental changes to the core design of the inference engine, far beyond simple configuration.

The Trade-off Triangle: VRAM, Performance, and Dynamic Control

The user’s core objective of “achieving high performance with low VRAM” through dynamic loading of specific layers presents a significant trade-off. While dynamic loading aims to reduce the active VRAM footprint, the latency incurred by moving data from slower memory (CPU RAM, disk) directly contradicts the high-performance requirement.

Current technology clearly indicates that the user’s vision of dynamic, on-demand MoE layer/expert loading goes significantly beyond the existing capabilities of off-the-shelf LLM inference frameworks. Therefore, the fundamental conclusion is that achieving this vision will likely require a dedicated research and development effort, involving modifications to existing inference engines (like llama.cpp at the ggml level) or even the design of a new custom inference runtime. This represents a shift from a product selection problem to a system design and implementation challenge.

The most viable strategy for MoE inference with current tools is to fit all active parameters and the KV cache onto the GPU. If the total model size exceeds VRAM, statically offloading less frequently used or less GPU-intensive components (like FFN experts) to the CPU is the current practical compromise. The emergence of research like FloE suggests that the most effective solutions for highly dynamic, low-VRAM MoE inference may come from co-designing the model architecture and the inference runtime. Instead of forcing a standard MoE model into an unsuitable dynamic loading paradigm, the user’s novel “sphere” model could be designed from the ground up to be inherently more amenable to efficient dynamic memory management (e.g., by making “Graticule MoE layers” computation-free or highly compressible for fast swapping). This implies a deeper, more integrated research direction for the user.

7. Recommendations for Implementation and Future Research

This concluding section provides concrete and actionable recommendations for the user to realize their ambitious model vision, distinguishing between immediate practical steps and longer-term research avenues.

7.1. Immediate Practical Steps (Leveraging Existing Frameworks)

  • Primary Framework Choice: llama.cpp: Given its strong Vulkan support for AMD GPUs, native GGUF MoE model compatibility, and fine-grained static offloading capabilities, llama.cpp (or its direct CLI) is the most suitable tool for the user’s experiments. While LM Studio offers a more user-friendly interface, the direct llama.cpp CLI is recommended for deeper control and debugging.
  • Model Format: GGUF: Always use GGUF formatted models for llama.cpp due to their efficiency and comprehensive metadata.
  • Strategic Static Offloading for VRAM Optimization:
    • Maximize GPU Layers: Use --gpu-layers (-ngl) to offload as many model layers (blocks) as possible to your AMD GPU, prioritizing attention layers. This ensures the most computationally intensive parts run on faster hardware.
    • Targeted Tensor Offloading: Employ --override-tensor (-ot) to offload large FFN expert tensors (e.g., -ot ".ffn_.*_exps.=CPU") to CPU RAM. FFN experts are generally larger and less critical to keep on the GPU for performance compared to attention tensors. This is the best current practice for balancing VRAM and performance in MoE models.
    • Prioritize Active Parameters and KV Cache on GPU: The key to performance for MoE models is ensuring that the currently active parameters and the KV cache reside entirely in GPU VRAM. This will define the practical limits of “low VRAM” for high performance (a rough budget estimator follows this list).
  • Quantization Experiments: Experiment with different quantization levels (e.g., Q4_K_M, IQ4_XS) to further reduce the model’s VRAM footprint. This is a highly effective way to fit larger models into limited VRAM.
  • System Resources: Ensure sufficient system RAM (CPU RAM) and a fast SSD (NVMe) to minimize latency when offloading layers/tensors to the CPU or for initial model loading.
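A rough budget estimator tying these points together; the architecture numbers are placeholders to be replaced with the target model’s actual values, and the overhead term is a guess:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # K and V per layer: ctx_len * n_kv_heads * head_dim elements each (FP16).
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

def fits_in_vram(active_params_b, quant_bits, kv_gb, vram_gb, overhead_gb=1.0):
    weights_gb = active_params_b * 1e9 * quant_bits / 8 / 1e9
    return weights_gb + kv_gb + overhead_gb <= vram_gb

kv = kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, ctx_len=8192)
print(round(kv, 2), "GB of KV cache")            # grows linearly with context length
print(fits_in_vram(active_params_b=13, quant_bits=4.8, kv_gb=kv, vram_gb=16))
```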

7.2. Approach Towards Dynamic Loading Vision (Long-Term/Research)

  • Re-evaluate “On-Demand” Granularity: Current technology makes true dynamic loading of individual MoE experts per token from disk/RAM impractical due to severe I/O latency. Consider if “Graticule layers” can be defined as larger, more stable computational units that are loaded perhaps only once per complex query or conversational turn. This would reduce the frequency of costly memory transfers.
  • Explore Architectural Innovations: The user’s “sphere” model is a novel architecture. Investigate cutting-edge MoE architectures like FloE (Mixture of Lookup Experts). FloE aims to address the VRAM and communication overhead challenges of sparse experts by re-parameterizing them as computation-free Look-Up Tables (LUTs) during inference. This approach fundamentally changes how experts are accessed and could align with the user’s dynamic, low-VRAM inference goal by eliminating the need to load full expert weights into VRAM for computation. This would require modifying not just the inference framework, but the model architecture itself.
  • Custom Runtime Development (Advanced): To achieve the exact “load, infer, release” behavior for layers/experts as envisioned, the user would likely need to delve deep into llama.cpp’s underlying ggml library and potentially modify its source code or develop a custom runtime layer. This would be a significant and complex development effort, requiring in-depth knowledge of ggml’s tensor and memory management. Research into predictive prefetching or caching of experts could be explored, though current research indicates the difficulty of predicting expert activation patterns across different requests.
  • Hybrid Memory Management: While full dynamic loading from disk is too slow, a hybrid approach could involve keeping the most frequently accessed “core” layers/experts and the KV cache on the GPU, a larger “warm” set in fast system RAM, and the full model on a very fast NVMe SSD. Intelligent caching and prefetching could manage transfers between RAM and the GPU (a tiering sketch follows below).
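A hypothetical sketch of such a tiered scheme: an LRU-managed “hot” set on the GPU, a larger “warm” set in system RAM, and everything else left on disk. The class and its placeholder transfers are illustrative only, not part of any existing framework.

```python
from collections import OrderedDict

class TieredExpertCache:
    """Hot tier (GPU) and warm tier (RAM) with LRU eviction; cold misses fall
    back to a slow disk read. Transfers are stubbed out with placeholders."""
    def __init__(self, gpu_slots=4, ram_slots=32):
        self.gpu = OrderedDict()        # expert_id -> tensors resident in VRAM
        self.ram = OrderedDict()        # expert_id -> tensors staged in system RAM
        self.gpu_slots, self.ram_slots = gpu_slots, ram_slots

    def get(self, expert_id):
        if expert_id in self.gpu:                      # hot hit: no transfer needed
            self.gpu.move_to_end(expert_id)
            return self.gpu[expert_id]
        tensors = self.ram.pop(expert_id, None)        # warm hit: RAM -> VRAM copy
        if tensors is None:
            tensors = self._load_from_disk(expert_id)  # cold miss: slow NVMe read
        self._promote(expert_id, tensors)
        return tensors

    def _promote(self, expert_id, tensors):
        if len(self.gpu) >= self.gpu_slots:            # evict least-recently-used expert
            evicted_id, evicted = self.gpu.popitem(last=False)
            self.ram[evicted_id] = evicted             # demote it to the warm tier
            while len(self.ram) > self.ram_slots:
                self.ram.popitem(last=False)           # fall back to disk-only
        self.gpu[expert_id] = tensors

    def _load_from_disk(self, expert_id):
        return f"tensors<{expert_id}>"                 # placeholder for an NVMe read
```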

7.3. Implications for Custom Training Strategies

The open-weight GGUF models and the open-source philosophy of llama.cpp inherently provide excellent flexibility for custom training and fine-tuning. The concept of “Graticule layers” and a “growing sphere” implies a dynamic training process (e.g., continuous learning, modular growth). This would necessitate a training framework capable of dynamically adding, modifying, or fine-tuning specific “Graticule” or “depth-wise” MoE layers and then exporting these changes into a compatible GGUF inference format. This goes beyond standard fine-tuning techniques like LoRA, which typically apply to static model updates.

The vision of “every layer being open” (meaning open for training) aligns with the inherent modularity of MoE. However, building a training pipeline that can selectively train or update specific “Graticule” MoE layers and integrate them into a growing sphere would be a substantial research and engineering undertaking. As concluded in Section 6, realizing dynamic, on-demand MoE layer/expert loading with high performance at low VRAM goes significantly beyond the ready-made capabilities of current LLM inference frameworks and will require dedicated research and development, whether by modifying existing inference engines (such as llama.cpp at the ggml level) or by designing a new custom runtime; it is a system design and implementation challenge rather than a product selection problem.

The discussion of FloE highlights an important point: the most effective solutions for highly dynamic, low-VRAM MoE inference may come from co-designing the model architecture and the inference runtime. Instead of forcing a standard MoE model into an unsuitable dynamic loading paradigm, the user’s novel “sphere” model could be designed from the ground up to be inherently more amenable to efficient dynamic memory management (e.g., by making “Graticule MoE layers” computation-free or highly compressible for fast swapping). This implies a deeper, more integrated research direction for the user.

I can see you’ve put a lot of creative thought into this spherical model design with the graticule-based MoE system; it’s quite an ambitious approach to neural network organization.

Yeah, there are some real technical challenges here, particularly the AMD GPU inference issues. That’s definitely a common frustration for developers working outside the typical NVIDIA ecosystem.

Just help me clarify a few things:

  1. Are you looking to implement this from scratch, or were you planning to modify an existing framework like PyTorch or TensorFlow?

  2. For the modifiable weights requirement, have you considered formats like ONNX or custom serialization methods that might work well with your VHD storage approach?

  3. Regarding the AMD GPU issues - would it be helpful to explore CPU-based inference for now, or look into AMD’s ROCm platform as an alternative?

I’d be happy to discuss this with my research team to find some potential solutions, or to help break the implementation down into smaller, manageable steps. As far as I know, this kind of novel architecture design is exactly the type of challenge that can lead to interesting breakthroughs.

Let me know which aspect you’d like to tackle first, and I’ll do my best to discuss it with my team and support this project.