What type of generative model fits my use case?

Hi, I’m a cloud architect with GenAI inference and fine-tuning experience. I’m new to HF and to training models. I’m passionate about learning by creating, so I would like to develop a new generative model that creates continuous line art.

Kinetic art sand tables draw one continuous line in sand to produce an image from a track file [1]. A track file is a long list (5,000–10,000 entries) of (theta, rho) coordinates in a .THR text file [2]. I used LLM-assisted development to prepare a dataset of more than 1,000 tracks, and I can add more. Each track has its original THR text file converted to PNG and SVG formats [3]. This dataset combines text, tabular, image, and linear data, which makes model choice more difficult.

I would like to train a model that already has a large vocabulary to produce new tracks creatively. Where should I continue my research? If you have any suggestions or questions for me, please reply! I’m looking forward to discussing this with the community.

[1] Sisyphus Industries – A Sisyphus table merges kinetic art, technology and design in stunning meditative beauty. It’s a computer-controlled Zen garden for your living room.

[2] JSisyphus/example tracks/BinarySupport.thr at master · SlightlyLoony/JSisyphus · GitHub

[3]

Hmm… Complicated case…


What you are actually modeling

A sand-table “image” is the side effect of a single continuous trajectory.

  • A .thr track is a plain-text list of vertices as (theta, rho) pairs. (GitHub)
  • Motion between successive points is fundamentally polar, and Sisyphus explicitly notes that moves between points are spiral moves (pure rho or pure theta are special cases). (Sisyphus Industries)
  • There are hard validity constraints: the official programming logic PDF states files must start and end with rho = 0 or 1, and theta should be a “smooth flow” (small differentials) or you get “jump spirals.” (Sisyphus Industries)
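
For concreteness, here is a minimal .thr reader (a sketch, assuming whitespace-separated `theta rho` pairs and `#`-prefixed comment lines; check your own files against the JSisyphus examples):

```python
def read_thr(path):
    """Parse a .thr file into a list of (theta, rho) float pairs."""
    points = []
    with open(path) as f:
        for line in f:
            # Strip comments (assumed to start with '#') and whitespace.
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            theta, rho = map(float, line.split())
            points.append((theta, rho))
    return points
```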

So your dataset is “multimodal” in storage (THR text, numeric pairs, PNG renders, SVG), but the generative object is one thing: a long-horizon continuous path with constraints.

That framing makes model choice much easier.


The best-fitting model type for your use case

You want a trajectory generative model. Concretely, pick one of these families:

1) Tokenized trajectory + causal language model (most practical first choice)

Treat the track like a time series, convert numbers into tokens, and train a decoder-only model to predict the next tokens.

Why this fits:

  • Tracks are long (5k–10k steps). Autoregressive models naturally generate long sequences.
  • You want a “large vocabulary.” You can create a large motion vocabulary (bins or learned codes) instead of relying on English subwords.
  • You can enforce constraints during decoding (clamp rho, cap step size, enforce start/end rules).

This is very close in spirit to Chronos, which explicitly describes scaling + quantization into tokens, LM training with cross-entropy, and autoregressive sampling back into numeric values. (Hugging Face)

If you do one thing first, do this.

2) Learned discrete latent codes (VQ-VAE) + a prior model (best scaling path)

If you want better global structure and less drift over 10k steps, compress the trajectory into discrete codes, then model the codes.

VQ-VAE is the canonical reference: the encoder outputs discrete codes and you learn a prior over them. (arXiv)

Why it fits:

  • You can turn a 10k-step track into a few hundred tokens (each code represents a chunk of steps).
  • Priors over shorter sequences are easier to train and keep coherent.
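
To make the "one code per chunk" idea concrete, here is only the nearest-neighbor assignment step (a sketch: a real VQ-VAE learns the codebook jointly with an encoder and decoder, per van den Oord et al.; this just illustrates how a chunk of deltas collapses to a single discrete code):

```python
import numpy as np

def quantize_chunks(deltas, codebook, chunk=32):
    """Map each chunk of (d_theta, d_rho) steps to its nearest codebook entry.

    deltas: (T, 2) array; codebook: (K, chunk * 2) array.
    Returns one integer code per chunk (trailing partial chunk dropped).
    """
    n = len(deltas) // chunk
    flat = np.asarray(deltas, dtype=float)[: n * chunk].reshape(n, -1)
    # Squared Euclidean distance from every chunk to every codebook entry.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)
```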

3) Diffusion over trajectories (strong global coherence, slower sampling)

Diffuser frames planning as iterative denoising of entire trajectories, not step-by-step generation. (arXiv)

Why it fits:

  • Whole-trajectory generation reduces autoregressive “early mistakes ruin the rest” failure mode.
  • You can incorporate constraints and preferences as guidance terms.

Tradeoff: more complex and slower than a token LM.


Key design choice that matters more than the architecture: representation

If you feed floats as text (e.g., 1.2345 0.6789), you waste modeling power on formatting and the token count explodes. Your first goal is a representation that is:

  • compact
  • stable under angle wraparound
  • easy to constrain and repair

Recommended canonicalization

  1. Unwrap theta so it evolves continuously (avoid the 2π discontinuity).
  2. Convert to deltas per step: (Δtheta, Δrho)
  3. Resample so step statistics are consistent (optional but usually helps)
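
Steps 1 and 2 can be sketched with NumPy (a minimal version; the optional resampling step is omitted):

```python
import numpy as np

def canonicalize(theta, rho):
    """Return the start pose and per-step (d_theta, d_rho) deltas."""
    # 1) Unwrap theta to remove 2*pi discontinuities (a no-op if your
    #    files already store cumulative theta).
    theta = np.unwrap(np.asarray(theta, dtype=float))
    rho = np.asarray(rho, dtype=float)
    # 2) Per-step deltas; the first point is kept as the start pose.
    d_theta = np.diff(theta)
    d_rho = np.diff(rho)
    return (theta[0], rho[0]), np.stack([d_theta, d_rho], axis=1)
```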

This aligns directly with the “theta values must be a smooth flow” guidance to avoid “jump spirals.” (Sisyphus Industries)

Two practical tokenization strategies

  • Fixed bins: quantize Δtheta and Δrho into N bins each (N=256…4096). Each step becomes 2 tokens.
  • Chunk codes: learn a VQ-VAE codebook where 1 code represents (say) 32–128 steps. (arXiv)

Bins get you moving fast. VQ-VAE gets you scale and coherence.
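
A fixed-bin tokenizer sketch. The clipping ranges `dt_max` and `dr_max` are placeholder assumptions; you would fit them to your dataset's step statistics:

```python
import numpy as np

def make_tokenizer(n_bins=256, dt_max=0.2, dr_max=0.05):
    """Uniform-bin tokenizer: each step becomes 2 tokens (theta bin, rho bin)."""
    def encode(deltas):
        dt = np.clip(deltas[:, 0], -dt_max, dt_max)
        dr = np.clip(deltas[:, 1], -dr_max, dr_max)
        it = np.floor((dt + dt_max) / (2 * dt_max) * (n_bins - 1)).astype(int)
        ir = np.floor((dr + dr_max) / (2 * dr_max) * (n_bins - 1)).astype(int)
        # Interleave: theta-bin token, then rho-bin token (offset by n_bins
        # so the two sub-vocabularies do not collide).
        tokens = np.empty(2 * len(it), dtype=int)
        tokens[0::2] = it
        tokens[1::2] = n_bins + ir
        return tokens

    def decode(tokens):
        tokens = np.asarray(tokens)
        it = tokens[0::2]
        ir = tokens[1::2] - n_bins
        dt = it / (n_bins - 1) * 2 * dt_max - dt_max
        dr = ir / (n_bins - 1) * 2 * dr_max - dr_max
        return np.stack([dt, dr], axis=1)

    return encode, decode
```

Round-trip error is bounded by one bin width, which is the knob you trade against vocabulary size.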


“Large vocabulary” in your context: what it should mean

A pretrained LLM’s “large vocabulary” is mostly natural language subwords, which is not what you need.

What you actually want is:

  • a large set of motion primitives (bins or codes)
  • optional text conditioning (style tags, prompts) for creativity

So you can still use an LLM backbone, but the vocabulary that matters is your motion vocabulary.


How to use PNG and SVG without turning this into a multimodal training headache

Use the rendered PNG/SVG as supervision and scoring, not necessarily as primary model input.

Differentiable rendering for losses and refinement

DiffVG is a differentiable vector graphics rasterizer that “bridges” vector and raster domains, enabling raster-based loss functions for learning/editing. (GitHub)

What that buys you:

  • render generated paths to an image
  • penalize artifacts (over-dense areas, sharp corners, self-crossing if you want)
  • optionally train a refiner that improves a rough trajectory

Using powerful vision priors as critics (optional but effective)

CLIPDraw shows you can optimize vector strokes using a pretrained language-image encoder as a metric, biasing toward simple recognizable shapes. (arXiv)

VectorFusion shows how to distill knowledge from a text-to-image diffusion model into SVG by optimizing a differentiable vector rasterizer. (arXiv)

You do not need to adopt these methods end-to-end. For your case, they’re most useful as:

  • reranking scores
  • reward models for preference tuning
  • generators of synthetic “good” exemplars

Long-sequence modeling options that match 5k–10k steps

If you stay at “2 tokens per step,” 10k steps is ~20k tokens plus metadata. That is feasible, but you should still plan for long-context behavior.

Strong references:

  • Transformer-XL adds segment-level recurrence to learn dependencies beyond fixed context and reduce fragmentation. (ACL Anthology)
  • Mamba (selective state space model) emphasizes linear scaling in sequence length and strong long-context behavior. (arXiv)
  • Hyena proposes long convolutions + gating as a subquadratic attention replacement for long contexts. (arXiv)

Practical guidance:

  • Start with a small decoder-only Transformer baseline (easiest tooling).
  • If you hit memory/time limits, evaluate Mamba/Hyena-style backbones.

If you insist on starting from an existing HF LLM backbone

This is workable only after you adopt a compact motion tokenization. Then you can fine-tune a long-context model and keep your motion vocabulary as additional tokens.

Examples of long-context open models on HF:

  • Qwen2.5-7B-Instruct lists 131,072 token context length on its model card. (Hugging Face)
  • Mistral-Nemo-Instruct-2407 lists 128k context window and Apache-2.0 licensing. (Hugging Face)
  • Qwen2.5-7B-Instruct-1M lists support up to 1M tokens (with obvious compute implications). (Hugging Face)

This path can give you:

  • a strong pretrained prior for structured generation
  • nice instruction conditioning (“make a floral rosette, dense center, low edge density”)

But again: do not feed raw float text if you can avoid it.


A clear research path to continue from here

Track generation literature (vector strokes / trajectories)

  • SketchRNN is the classic “stroke-based drawings in vector format” model and a good conceptual baseline for sequence generation. (arXiv)
  • Diffuser is the clearest trajectory-diffusion reference. (arXiv)

Vector graphics + raster losses

  • DiffVG is the core differentiable rendering tool. (GitHub)

Text-to-vector and vector diffusion guidance (useful as priors/critics)

  • VectorFusion (Text-to-SVG via diffusion + differentiable rasterizer optimization). (arXiv)
  • DiffSketcher (text-guided vector sketch synthesis through latent diffusion). (GitHub)
  • SVGDreamer (text-guided SVG generation with diffusion, particle methods, and reward reweighting). (arXiv)

Tokenization and training mechanics in HF

  • HF LLM course on training tokenizers (conceptual and practical). (Hugging Face)
  • Transformers Trainer for feature-complete PyTorch training loops. (Hugging Face)

A concrete build plan that works with your current dataset

Phase 0: Build a “track compiler” and validator

Do this before modeling. It becomes your unit test.

Validate:

  • rho stays within [0, 1]
  • the track starts and ends with rho = 0 or 1
  • theta changes smoothly (small per-step deltas, no “jump spirals”)
  • no NaNs, malformed lines, or duplicate consecutive points

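A validator sketch based on the documented constraints (rho within [0, 1], endpoints at rho = 0 or 1, small theta steps). `MAX_DTHETA` is an assumed threshold you would tune for your table:

```python
import numpy as np

MAX_DTHETA = 0.5  # assumed per-step theta limit (radians); tune to your table

def validate_track(theta, rho):
    """Return a list of constraint violations (empty list == valid)."""
    theta = np.asarray(theta, dtype=float)
    rho = np.asarray(rho, dtype=float)
    errors = []
    if np.any((rho < 0) | (rho > 1)):
        errors.append("rho out of [0, 1]")
    for r in (rho[0], rho[-1]):
        if not (np.isclose(r, 0.0) or np.isclose(r, 1.0)):
            errors.append("track must start and end at rho = 0 or 1")
            break
    if np.any(np.abs(np.diff(theta)) > MAX_DTHETA):
        errors.append("theta step too large (risk of jump spirals)")
    return errors
```
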
Phase 1: Baseline model (quantized deltas + small causal LM)

  1. unwrap theta
  2. compute (Δtheta, Δrho)
  3. quantize to bins
  4. train a small decoder-only Transformer to predict next tokens

Add constrained decoding:

  • clamp rho
  • cap step size
  • force correct ending (last token must bring rho to 0 or 1)
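
A constrained-sampling sketch: mask out any next token whose decoded delta would push rho outside [0, 1], renormalize, and sample. `decode_step` (token → (Δtheta, Δrho)) is a hypothetical hook into whichever tokenizer you chose, and at least one token is assumed to remain valid:

```python
import numpy as np

def sample_step(probs, rho, decode_step, rng):
    """Sample one motion token subject to the rho-range constraint.

    probs: next-token distribution over the motion vocabulary.
    rho: current rho position. decode_step(token) -> (d_theta, d_rho).
    """
    allowed = np.array(
        [0.0 <= rho + decode_step(t)[1] <= 1.0 for t in range(len(probs))]
    )
    # Zero out forbidden tokens, renormalize, then sample.
    masked = np.where(allowed, probs, 0.0)
    masked = masked / masked.sum()
    return rng.choice(len(probs), p=masked)
```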

Phase 2: Add image-based scoring

Render the generated track (your existing PNG pipeline is enough) and compute simple metrics:

  • density uniformity (avoid over-etching)
  • radial distribution (center vs edge occupancy)
  • symmetry proxies (optional)
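
The radial-distribution and density-uniformity metrics can be computed directly from the (theta, rho) samples, without rendering at all (a sketch; the ring count is an arbitrary choice):

```python
import numpy as np

def radial_distribution(rho, n_rings=10):
    """Fraction of track points falling in each concentric ring of rho."""
    hist, _ = np.histogram(np.asarray(rho, dtype=float), bins=n_rings, range=(0, 1))
    return hist / hist.sum()

def density_uniformity(rho, n_rings=10):
    """Normalized entropy of ring occupancy: 1.0 = even, -> 0 as it concentrates."""
    p = radial_distribution(rho, n_rings)
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))
    return entropy / np.log(n_rings)
```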

Later, add differentiable losses with DiffVG. (GitHub)

Phase 3: Upgrade to VQ-VAE codes (optional, but likely worth it)

  • Train VQ-VAE to compress chunks of deltas into discrete codes. (arXiv)
  • Train a prior over codes (Transformer-XL, Mamba, etc.). (ACL Anthology)

This is where coherence usually improves.


Common pitfalls you should plan for

  1. Theta wraparound discontinuities
    Fix: unwrap theta before deltas.

  2. Autoregressive drift over 10k steps
    Fixes: chunk codes (VQ-VAE), hierarchical generation, diffusion refinement, or periodic “anchor constraints.”

  3. Model learns formatting instead of geometry
    Fix: do not train on float text. Quantize or codebook.

  4. You generate tracks that are “valid” but boring
    Fix: sampling temperature control plus a novelty-aware reranker; optionally use a critic inspired by CLIPDraw/VectorFusion. (arXiv)


Curated links (copy/paste)

Sisyphus track format and constraints
- https://github.com/SlightlyLoony/JSisyphus
- https://sisyphus-industries.com/community/community-tracks/sisyphus-programming-guide-so-far/
- https://sisyphus-industries.com/wp-content/uploads/wpforo/default_attachments/1568584503-sisyphus-table-programming-logic3.pdf

Trajectory / stroke generative modeling
- https://arxiv.org/abs/1704.03477

Tokenized numeric sequences as LM
- https://huggingface.co/amazon/chronos-t5-large

Long-sequence backbones
- https://aclanthology.org/P19-1285/
- https://arxiv.org/abs/2312.00752
- https://arxiv.org/abs/2302.10866

Trajectory diffusion
- https://arxiv.org/abs/2205.09991

Differentiable vector rendering
- https://github.com/BachiLi/diffvg

Vector synthesis using diffusion priors
- https://arxiv.org/abs/2211.11319
- https://github.com/ximinng/VectorFusion-pytorch
- https://github.com/ximinng/DiffSketcher
- https://github.com/ximinng/SVGDreamer
- https://huggingface.co/blog/xingxm/svgdreamer

HF training building blocks
- https://huggingface.co/learn/llm-course/en/chapter6/1
- https://huggingface.co/learn/llm-course/en/chapter6/2
- https://huggingface.co/docs/transformers/main/main_classes/trainer

Long-context open LLM backbones (only after motion tokenization)
- https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
- https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M

Summary bullets

  • Your best-fit model type is a trajectory generator, not a generic multimodal model. .thr is the primary object. (GitHub)
  • Start with quantized delta tokens + a causal LM, inspired by Chronos tokenization, then sample and validate. (Hugging Face)
  • Upgrade to VQ-VAE discrete codes + a prior for better long-range coherence. (arXiv)
  • Use PNG/SVG mainly for render-and-score, with DiffVG when you want differentiable image-space losses. (GitHub)
  • Long-sequence modeling references worth studying: Transformer-XL, Mamba, Hyena, Diffuser. (ACL Anthology)

Thanks for your reply. I agree that a time series forecasting model would be ideal, but I’m not aware of a text encoder that works with time series models to produce “a drawing of a chameleon” for example. If I missed this in your message, please share a simpler explanation.

Thanks

Indeed… the available information is insufficient, so for now I’ve widened the search area.
I hope you find a model closer to your purpose…