Your project is strong because it starts from a puzzle that already has hard rules and real strategy, instead of trying to invent fun from unconstrained text generation. That matters. A lot of “LLM game” ideas collapse because the model is asked to be both the rules engine and the entertainer. Your setup is better because the rules can stay exact while the model layer adds ranking, hints, and adaptive interaction. The current tooling also fits that split well: Hugging Face Datasets can load plain CSV data without a custom dataset script, Sentence Transformers recommends a retrieve-then-rerank pipeline for harder selection tasks, and Qwen’s current embedding/reranker family is explicitly built for text embedding and ranking. (Sbert)
What your recent runs already proved
Your project is already past the “can I wire HF into this?” stage. You have shown that a public HF CSV lexicon can load cleanly, that legality checks can be enforced deterministically, and that the search layer can find cover-all chains. The odd outputs you saw are not a failure of Hugging Face integration. They are a sign that your lexicon is broad and your score is solver-oriented, not player-oriented. The English-Valid-Words dataset card says it contains valid English words with frequency, stem, and stem valid probability, which makes it a good bootstrap lexicon, but not automatically a polished game vocabulary. (Hugging Face)
The most important design decision
The best decision for your project is to keep legality symbolic. I would not let the language model decide whether a move is valid. Public Letter Boxed solver repos that actually work do this symbolically with backtracking, chaining, or bitmasking, not with free-form generation. Hugging Face’s constrained beam search and prefix_allowed_tokens_fn are useful tools, but they are best treated as optional control layers, not as the final judge of a puzzle’s rules. (GitHub)
That means your architecture should stay split like this:
- rules engine: board, legality, chaining, search
- lexicon layer: which words are allowed and which are player-friendly
- ranking layer: which legal move is best for the current player state
- assistant layer: how to explain, hint, and pace the experience
That is the version of the project I would bet on.
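That four-layer split can be sketched in a few lines. Everything below is an illustrative stand-in (toy board, two-word lexicon, made-up frequencies), not your real codebase; the point is only that legality never passes through the model layer.

```python
# Toy sketch of the four-layer split. Board, lexicon, thresholds, and
# wording are all illustrative stand-ins.
SIDES = ["aeo", "rtn", "sli", "cdp"]
SIDE_OF = {ch: i for i, side in enumerate(SIDES) for ch in side}

def legal(word: str) -> bool:
    """Rules engine: on-board letters only, no same-side adjacency."""
    if any(ch not in SIDE_OF for ch in word):
        return False
    return all(SIDE_OF[a] != SIDE_OF[b] for a, b in zip(word, word[1:]))

LEXICON = {"stone": 9_000_000, "psst": 1_000}  # lexicon layer: word -> frequency

def player_friendly(word: str) -> bool:
    """Lexicon layer: gameplay filter on top of raw validity."""
    return LEXICON.get(word, 0) >= 100_000

def rank(words):
    """Ranking layer: order the already-legal moves."""
    return sorted(words, key=lambda w: -LEXICON.get(w, 0))

def hint(word: str) -> str:
    """Assistant layer: spoiler-safe presentation."""
    return f"Try a {len(word)}-letter word starting with '{word[0]}'."

moves = rank(w for w in LEXICON if legal(w) and player_friendly(w))
print(moves)  # only legal, player-friendly words ever reach ranking
```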
Where the real value is
The direct solver is useful, but the most valuable part of your project is the assistant behavior. A solver asks, “What works?” A good assistant asks, “What works, what feels fair, and what helps without spoiling?” That difference is where your project becomes more than a clone. Sentence Transformers’ retrieve-and-rerank guidance maps almost perfectly onto this: first get the candidate set efficiently, then rerank for precision. In your game, the symbolic engine produces the legal set, and the semantic layer reranks that legal set for hint quality, thematic fit, or beginner-friendliness. (Sbert)
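The retrieve-then-rerank pattern looks like this in miniature. The rerank score here is a deliberately crude letter-overlap similarity so the sketch runs without model downloads; in production that function would be replaced by cosine similarity over real embeddings or a cross-encoder score.

```python
# Two-stage selection in miniature: a symbolic stage produces the legal
# set, then a (stand-in) semantic stage reranks it.

def retrieve(candidates, legal):
    """Stage 1: cheap and exact -- only legality decides membership."""
    return [w for w in candidates if legal(w)]

def rerank(words, theme: str):
    """Stage 2: soft -- order the legal set by similarity to a theme."""
    def sim(w):  # stand-in for embedding cosine similarity
        return len(set(w) & set(theme)) / len(set(w) | set(theme))
    return sorted(words, key=sim, reverse=True)

# Toy legality predicate for the demo: pretend 'x' is off the board.
legal_set = retrieve(["train", "xylem", "plane"], lambda w: "x" not in w)
print(rerank(legal_set, "transport"))
```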
My view on your current lexicon problem
Your biggest short-term problem is not the model. It is the vocabulary surface. The HF lexicon you are using is broad enough to include words that are technically valid but poor for casual gameplay. That is why you started getting outputs that are structurally efficient but aesthetically bad. I would treat English-Valid-Words as a bootstrap source, not as the final player-facing lexicon. A better long-term approach is to intersect a broad lexicon with a more curated common-word list. SCOWL-style wordlists are useful here because they are explicitly organized by commonness: the en-wl/SCOWL project says size 35 is a recommended small list, 50 medium, 70 large, and 80 starts to include the strange and unusual words people like to use in word games. That is almost exactly the distinction your demo exposed. (Hugging Face)
So my recommendation is:
- keep the broad HF lexicon for coverage and internal search
- build a clean gameplay lexicon on top of it for hints and first suggestions
- use frequency and stem-validity metadata to suppress ugly entries
- optionally intersect with a SCOWL-style common-word list for a “human mode”
That will improve the game more than changing models will.
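The layering can be as simple as set intersection plus a frequency floor. The word lists and threshold below are made up; in practice the broad set comes from the HF CSV and the curated set from a SCOWL-style common-word tier.

```python
# Sketch of a gameplay lexicon layered on top of a broad one.
broad = {"stone": 9_000_000, "tranq": 40_000, "plane": 7_000_000}
curated_common = {"stone", "plane", "trade"}  # e.g. a SCOWL size-35/50 tier

MIN_FREQ = 1_000_000

def gameplay_lexicon(broad, curated, min_freq):
    """Player-facing set: frequent AND on the curated list."""
    return {w for w, f in broad.items() if f >= min_freq and w in curated}

def search_lexicon(broad, min_freq=0):
    """Broad coverage set for internal solving; frequency floor only."""
    return {w for w, f in broad.items() if f >= min_freq}

print(sorted(gameplay_lexicon(broad, curated_common, MIN_FREQ)))
print(sorted(search_lexicon(broad)))
```

The solver keeps full coverage from the broad set, while hints and first suggestions draw only from the intersected set.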
My view on models for your project
For demos and local experiments, sentence-transformers/all-MiniLM-L6-v2 is a good fit because it is intended for sentence and short paragraph encoding and is used for retrieval, clustering, and sentence similarity. That is the right job description for “semantic hinting” or “theme-aware reranking.” It is not a legality model, and it does not need to be. (Hugging Face)
For a stronger production path, I would move toward a real embedding-plus-reranker stack such as Qwen/Qwen3-Embedding-0.6B with Qwen/Qwen3-Reranker-0.6B. The Qwen model cards say the series is specifically designed for text embedding and ranking tasks, with sizes from 0.6B to 8B, and inherits multilingual and long-context strengths from the Qwen3 base models. That makes it a better long-term fit than a small general-purpose sentence embedder once your candidate set and scoring logic are already solid. (Hugging Face)
The key point is that the model should rank already legal candidates. It should not generate the legal set from scratch.
Where I would not spend time yet
I would not fine-tune early. The project still has higher-leverage work in:
- lexicon curation
- score design
- hint ladder design
- board generation and evaluation
Fine-tuning a model to enumerate legal words would be the wrong abstraction. Exact combinatorial legality is cheaper and more reliable in code. If you fine-tune anything later, fine-tune a reranker or a hint model, not the legality engine. Sentence Transformers’ docs are very clear that rerankers are second-stage precision tools, and that is much closer to your actual bottleneck. (Sbert)
The project directions I think are strongest
I see three especially strong directions.
1. Puzzle assistant
This is the safest and most immediately useful version. It validates words, explains failures, ranks next moves, and offers spoiler-controlled hints. It is easy to test and easy to understand.
2. Semantic variant
This is the most original version. The next word is not only legal by letters; it must also be semantically related, contrastive, or theme-consistent. This is where embeddings and rerankers become central rather than optional.
3. Board generator
This is where the project becomes more than a helper. You can score candidate boards by solvability, number of short solutions, branching factor, and the quality of beginner-friendly hints. Solver repos show how to solve boards; your bigger opportunity is to generate boards that are actually fun. (GitHub)
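A board-quality probe can start very small: count playable words and check whether the lexicon can cover every letter at all. Real evaluation would also measure chain depth, branching factor, and hint quality; this only shows the shape of the loop, with a toy lexicon.

```python
# Toy board-quality probe for candidate board generation.
def board_stats(sides, lexicon):
    side_of = {ch: i for i, s in enumerate(sides) for ch in s}
    board = set(side_of)
    def legal(w):
        return (all(c in side_of for c in w)
                and all(side_of[a] != side_of[b] for a, b in zip(w, w[1:])))
    playable = [w for w in lexicon if legal(w)]
    covered = set("".join(playable))
    return {"playable": len(playable), "coverable": covered >= board}

print(board_stats(["aeo", "rtn", "sli", "cdp"], ["stone", "plaid", "crate"]))
```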
The real hard problem
The hardest problem in your project is not legality. It is taste.
A mathematically efficient move is not always a good move for a player. Your current scores already showed that. So I would explicitly separate:
- solver score: shortest or strongest completion
- assistant score: common, elegant, hintable, human-friendly
- semantic score: theme fit or conceptual continuity
If you do not separate those, the assistant will keep sounding like a brute-force optimizer.
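Keeping the scores separate can mean literally one function per score, each with its own inputs. The weights below are illustrative, not tuned; the sketch just shows that a long rare word can win on the solver score while a short common word wins on the assistant score.

```python
import math

def solver_score(word, new_letters_covered):
    """Solver: coverage dominates, length is a tiebreaker."""
    return 6.0 * new_letters_covered + 0.1 * len(word)

def assistant_score(word, new_letters_covered, freq):
    """Assistant: commonness matters as much as coverage; long openers cost."""
    penalty = 2.0 if len(word) > 7 else 0.0
    return 1.0 * new_letters_covered + 1.0 * math.log1p(freq) - penalty

def semantic_score(theme_sim):
    """Semantic: in practice this would come from an embedding model."""
    return theme_sim

print(solver_score("pterodactyls", 9), assistant_score("pterodactyls", 9, 5_000))
print(solver_score("stone", 4), assistant_score("stone", 4, 9_000_000))
```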
What I would build next
I would do the next phase in this order:
- keep the current symbolic engine
- add stronger lexicon filters for player-facing suggestions
- create a clean assistant-mode score that penalizes overlong or obscure first moves
- only then turn semantic reranking back on
- after that, add a hint ladder: structural hint, semantic hint, constrained shortlist, explanation
That order keeps you focused on the actual player experience rather than on model novelty.
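The hint ladder itself can be one function with one rung per disclosure level. The rung wording and decoy mechanism below are made up; the invariant worth keeping is that each level reveals strictly more than the last.

```python
# Sketch of a spoiler-controlled hint ladder.
def hint_ladder(answer: str, theme: str, decoys: list[str], level: int) -> str:
    shortlist = ", ".join(sorted(set(decoys) | {answer}))
    rungs = [
        f"Structural hint: {len(answer)} letters, starts with '{answer[0]}'.",
        f"Semantic hint: think about {theme}.",
        f"Constrained shortlist: one of {shortlist}.",
        f"Explanation: the word is '{answer}'.",
    ]
    return rungs[min(level, len(rungs) - 1)]

for lvl in range(4):
    print(hint_ladder("train", "travel", ["plane", "crane"], lvl))
```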
My bottom line
Your project is good because it uses models where models actually help: ranking, hinting, theming, and adaptation. It avoids the trap of asking the model to replace exact rules. The current Hugging Face ecosystem supports this kind of system well, and your latest runs already showed that the technical foundation works. The next leap is not a bigger model. It is a better vocabulary policy and a better assistant score. (Hugging Face)
The shortest version of my advice is:
Keep rules symbolic. Curate the lexicon aggressively. Use embeddings to improve candidate quality, not legality. Treat the assistant as a game designer, not a raw solver.
Below is a single-file demo.
It is designed around three current facts:
- Hugging Face Datasets can load plain CSV files with the generic csv loader, so you do not need a custom dataset script for a dataset like this. (Hugging Face)
- Maximax67/English-Valid-Words explicitly says it contains valid English words plus frequency, stem, and stem valid probability. (Hugging Face)
- sentence-transformers/all-MiniLM-L6-v2 is intended for sentence and short paragraph encoding for retrieval, clustering, and similarity, and inputs longer than 256 word pieces are truncated. That makes it fine for lightweight semantic reranking of short word candidates. (Hugging Face)
# deps:
# pip install datasets sentence-transformers transformers torch numpy
#
# demo goals:
# - one file
# - no argparse
# - public Hugging Face dataset
# - no dataset builder script required
# - CPU-safe by default
# - GPU-safer if CUDA is available
# - cleaner vocabulary than the earlier demos
# - separate "assistant" scoring from "solver" scoring
#
# URLs used:
# Dataset page:
# https://huggingface.co/datasets/Maximax67/English-Valid-Words
# Raw CSV:
# https://huggingface.co/datasets/Maximax67/English-Valid-Words/resolve/main/valid_words_sorted_by_frequency.csv
# Model page:
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# HF datasets docs:
# https://huggingface.co/docs/datasets/main/en/dataset_script
# Sentence Transformers semantic similarity docs:
# https://sbert.net/docs/sentence_transformer/usage/semantic_textual_similarity.html
#
# Why these choices:
# - HF docs say generic loaders are provided for CSV / JSON / text style data,
# so a plain CSV dataset can be loaded without a custom dataset script.
# - The chosen dataset exposes word/frequency/stem fields, which is useful
# for filtering common playable words.
# - The chosen embedding model is small and practical for a 10GB / 16GB setup,
# and is meant for similarity / retrieval style tasks rather than generation.
#
# Notes:
# - This demo keeps legality fully symbolic.
# - The HF model is optional and only used to rerank a short legal shortlist.
# - On CPU, float32 is preferred for safety.
# - On CUDA, float16 is attempted to save VRAM.
# - This is "assistant-first", not "brute-force shortest-solver-first".
from __future__ import annotations
import math
from collections import defaultdict
import numpy as np
import torch
from datasets import load_dataset
# ============================================================
# 1) EDITABLE SETTINGS
# ============================================================
# Friendlier default board than abc/def/ghi/jkl
BOARD_SIDES = ["aeo", "rtn", "sli", "cdp"]
# Use a standard Letter Boxed-like rule:
# consecutive letters cannot come from the same side
FORBID_SAME_SIDE_ADJACENT = True
# This is NOT standard Letter Boxed, but some variants want it.
# Keep False for a friendlier game.
FORBID_REPEATED_LETTERS_IN_WORD = False
# Standard chaining rule: next word starts with previous word's last letter
REQUIRE_LAST_TO_FIRST_CHAIN = True
MIN_WORD_LEN = 3
MAX_WORD_LEN = 8
# Stronger lexical filters for cleaner gameplay
MIN_FREQUENCY = 1_000_000
MIN_STEM_VALID_PROB = 0.60
# Scan only part of the dataset for a safer demo
MAX_WORDS_TO_SCAN = 150_000
# Search / output
MAX_CHAIN_LEN = 4
MAX_RESULTS = 10
TOP_K = 12
# Assistant vs solver mode:
# - "assistant" prefers more common, cleaner, easier-to-hint words
# - "solver" prefers bigger coverage and fast completion
MODE = "assistant" # or "solver"
# Optional semantic rerank
USE_SEMANTIC_RERANK = False
SEMANTIC_THEME = "movement, travel, transport"
SEMANTIC_SHORTLIST = 40
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
# Partial chain, if any
CURRENT_CHAIN = []
# Example words to validate
EXAMPLE_WORDS = ["stone", "crane", "plane", "trade", "cider", "loop"]
# ============================================================
# 2) PUBLIC HF DATA SOURCE
# ============================================================
WORD_CSV_URL = (
"https://huggingface.co/datasets/Maximax67/English-Valid-Words/"
"resolve/main/valid_words_sorted_by_frequency.csv"
)
# ============================================================
# 3) HELPERS
# ============================================================
def normalize(text: str) -> str:
return "".join(ch.lower() for ch in text if ch.isalpha())
def find_column(columns, exact_names, contains_tokens, required=True):
lower_map = {c.lower(): c for c in columns}
for name in exact_names:
if name.lower() in lower_map:
return lower_map[name.lower()]
for c in columns:
cl = c.lower()
if any(tok in cl for tok in contains_tokens):
return c
if required:
raise ValueError(f"Could not find expected column in {columns}")
return None
def safe_float(x, default=None):
try:
if x in (None, ""):
return default
return float(x)
except Exception:
return default
def safe_int(x, default=0):
try:
if x in (None, ""):
return default
return int(float(x))
except Exception:
return default
# ============================================================
# 4) BOARD / RULE ENGINE
# ============================================================
BOARD_SIDES = [normalize(s) for s in BOARD_SIDES if normalize(s)]
BOARD_LETTERS = set("".join(BOARD_SIDES))
SIDE_OF = {ch: i for i, side in enumerate(BOARD_SIDES) for ch in side}
CURRENT_CHAIN = [normalize(w) for w in CURRENT_CHAIN if normalize(w)]
if len(BOARD_LETTERS) != sum(len(s) for s in BOARD_SIDES):
raise ValueError("Board letters must be unique across sides for this demo.")
VOWELS = set("aeiou")
def invalid_reason(word: str) -> str | None:
w = normalize(word)
if len(w) < MIN_WORD_LEN:
return f"too short (minimum is {MIN_WORD_LEN})"
if len(w) > MAX_WORD_LEN:
return f"too long (maximum is {MAX_WORD_LEN})"
bad = sorted({ch for ch in w if ch not in BOARD_LETTERS})
if bad:
return f"contains letters not on the board: {', '.join(bad)}"
if FORBID_REPEATED_LETTERS_IN_WORD and len(set(w)) != len(w):
return "repeats a letter, which this variant forbids"
if FORBID_SAME_SIDE_ADJACENT:
for a, b in zip(w, w[1:]):
if SIDE_OF[a] == SIDE_OF[b]:
return f"uses the same side twice in a row at '{a}{b}'"
return None
def is_legal_word(word: str) -> bool:
return invalid_reason(word) is None
def chain_ok(prev_word: str, next_word: str) -> bool:
if not REQUIRE_LAST_TO_FIRST_CHAIN:
return True
return normalize(prev_word)[-1] == normalize(next_word)[0]
def explain_word(word: str, prev_word: str | None = None) -> str:
reason = invalid_reason(word)
if reason is not None:
return f"INVALID: {reason}"
if prev_word is not None and not chain_ok(prev_word, word):
return (
f"INVALID CHAIN: previous word ends with '{normalize(prev_word)[-1]}', "
f"but '{normalize(word)}' starts with '{normalize(word)[0]}'."
)
return "VALID"
# ============================================================
# 5) HUMAN-FRIENDLY FILTERS
# ============================================================
def looks_human_friendly(word: str, stem_prob: float | None) -> bool:
    # Require at least one classic vowel. This also rejects all-consonant
    # abbreviation-like strings ("pct", "tbsp"), so no separate short-form
    # check is needed: any word of length <= 3 with 3+ consonants has no
    # vowel and is already caught here.
    if not any(ch in VOWELS for ch in word):
        return False
    # If stem-validity exists, require a reasonably confident value
    if stem_prob is not None and stem_prob < MIN_STEM_VALID_PROB:
        return False
    return True
# ============================================================
# 6) LOAD HF CSV WITH GENERIC CSV LOADER
# ============================================================
print("Loading word list from Hugging Face CSV...")
ds = load_dataset("csv", data_files=WORD_CSV_URL, split="train")
print("Columns:", ds.column_names)
word_col = find_column(
ds.column_names,
exact_names=["Word", "word"],
contains_tokens=["word"],
)
freq_col = find_column(
ds.column_names,
exact_names=["Frequency count", "frequency count", "frequency", "freq", "count"],
contains_tokens=["frequency", "freq", "count"],
required=False,
)
stem_col = find_column(
ds.column_names,
exact_names=["Stem", "stem"],
contains_tokens=["stem"],
required=False,
)
stem_prob_col = find_column(
ds.column_names,
exact_names=["Stem valid probability", "stem valid probability"],
contains_tokens=["stem valid probability", "probability"],
required=False,
)
# ============================================================
# 7) BUILD FILTERED LEGAL LEXICON
# ============================================================
lexicon = []
seen = set()
for i, row in enumerate(ds):
if i >= MAX_WORDS_TO_SCAN:
break
word = normalize(str(row[word_col]))
if not word or word in seen:
continue
seen.add(word)
freq = safe_int(row.get(freq_col) if freq_col else None, default=0)
stem = normalize(str(row.get(stem_col))) if stem_col and row.get(stem_col) is not None else ""
stem_prob = safe_float(row.get(stem_prob_col) if stem_prob_col else None, default=None)
# Stronger filtering than earlier demos
if freq < MIN_FREQUENCY:
continue
if len(word) > MAX_WORD_LEN:
continue
if not looks_human_friendly(word, stem_prob):
continue
if not is_legal_word(word):
continue
lexicon.append(
{
"word": word,
"freq": freq,
"stem": stem,
"stem_prob": stem_prob,
"letters": set(word),
}
)
print(f"Loaded {len(lexicon):,} filtered legal words.")
# ============================================================
# 8) FAST INDICES
# ============================================================
by_start = defaultdict(list)
for item in lexicon:
by_start[item["word"][0]].append(item)
for ch in by_start:
by_start[ch].sort(key=lambda x: (-len(x["letters"]), -x["freq"], x["word"]))
# ============================================================
# 9) SCORING
# ============================================================
def assistant_score(item, uncovered_letters, is_first_move: bool) -> float:
"""
Assistant mode:
- prefer common words
- prefer decent continuation count
- still value new coverage
- penalize overlong first words
"""
new_cover = len(item["letters"] & uncovered_letters)
continuation_count = len(by_start.get(item["word"][-1], []))
length_penalty = 0.0
if is_first_move and len(item["word"]) > 7:
length_penalty = 3.0 + 0.8 * (len(item["word"]) - 7)
return (
3.5 * new_cover
+ 0.45 * continuation_count
+ 0.45 * math.log1p(item["freq"])
- length_penalty
)
def solver_score(item, uncovered_letters) -> float:
"""
Solver mode:
- aggressively reward big coverage
- still prefer continuation count and commonness
"""
new_cover = len(item["letters"] & uncovered_letters)
continuation_count = len(by_start.get(item["word"][-1], []))
return (
6.0 * new_cover
+ 0.25 * continuation_count
+ 0.20 * math.log1p(item["freq"])
+ 0.10 * len(item["word"])
)
def score_item(item, uncovered_letters, is_first_move: bool) -> float:
if MODE == "assistant":
return assistant_score(item, uncovered_letters, is_first_move)
return solver_score(item, uncovered_letters)
def candidate_pool(chain_words):
if not chain_words:
return lexicon
last_word = chain_words[-1]
used_words = set(chain_words)
pool = []
for item in by_start.get(last_word[-1], []):
if item["word"] not in used_words and chain_ok(last_word, item["word"]):
pool.append(item)
return pool
def rank_candidates(chain_words):
used_letters = set("".join(chain_words))
uncovered = BOARD_LETTERS - used_letters
is_first_move = len(chain_words) == 0
ranked = []
for item in candidate_pool(chain_words):
sc = score_item(item, uncovered, is_first_move)
ranked.append(
{
**item,
"symbolic_score": sc,
"score": sc,
}
)
ranked.sort(key=lambda x: x["score"], reverse=True)
return ranked
# ============================================================
# 10) OPTIONAL SEMANTIC RERANK
# ============================================================
def maybe_semantic_rerank(candidates, theme_text):
if not USE_SEMANTIC_RERANK or not theme_text or not candidates:
return candidates
try:
from sentence_transformers import SentenceTransformer
except ImportError:
print("sentence-transformers not installed; skipping semantic rerank.")
return candidates
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading semantic model on {device}: {EMBED_MODEL}")
model = SentenceTransformer(EMBED_MODEL, device=device)
# CPU-safe preference
if device == "cpu":
try:
model = model.float()
except Exception:
pass
else:
# CUDA-safer preference
try:
model = model.half()
except Exception:
pass
short = candidates[:SEMANTIC_SHORTLIST]
texts = [theme_text] + [c["word"] for c in short]
embs = model.encode(
texts,
batch_size=32,
convert_to_numpy=True,
normalize_embeddings=True,
show_progress_bar=False,
)
query = embs[0]
docs = embs[1:]
sims = docs @ query # normalized => dot product == cosine similarity
reranked = []
for cand, sim in zip(short, sims):
item = dict(cand)
item["semantic_score"] = float(sim)
item["score"] = item["symbolic_score"] + 1.20 * float(sim)
reranked.append(item)
reranked.sort(key=lambda x: x["score"], reverse=True)
return reranked + candidates[SEMANTIC_SHORTLIST:]
# ============================================================
# 11) DFS SOLVER
# ============================================================
def solve_cover_all(max_results=MAX_RESULTS):
target = BOARD_LETTERS
results = []
# Start from a moderately strong seed set
seed_words = sorted(
lexicon,
key=lambda x: (-len(x["letters"]), -x["freq"], x["word"])
)[:500]
def dfs(chain_items, covered_letters):
if len(results) >= max_results:
return
if covered_letters == target:
results.append([x["word"] for x in chain_items])
return
if len(chain_items) >= MAX_CHAIN_LEN:
return
if not chain_items:
candidates = seed_words
else:
used_words = {x["word"] for x in chain_items}
next_start = chain_items[-1]["word"][-1]
candidates = [
x for x in by_start.get(next_start, [])
if x["word"] not in used_words
]
uncovered = target - covered_letters
is_first_move = len(chain_items) == 0
scored = sorted(
candidates,
key=lambda x: score_item(x, uncovered, is_first_move),
reverse=True,
)[:120]
for nxt in scored:
dfs(chain_items + [nxt], covered_letters | nxt["letters"])
dfs([], set())
seen = set()
dedup = []
for chain in results:
key = tuple(chain)
if key not in seen:
seen.add(key)
dedup.append(chain)
return dedup
# ============================================================
# 12) HINTS
# ============================================================
def make_hint(chain_words, ranked):
if not ranked:
return "No legal next move found."
used_letters = set("".join(chain_words))
uncovered = sorted(BOARD_LETTERS - used_letters)
best = ranked[0]
if not chain_words:
return (
f"Structural hint: start with a common word of length {len(best['word'])} "
f"that covers letters like {', '.join(sorted(best['letters'] & set(uncovered)))}."
)
return (
f"Structural hint: the next word should start with '{best['word'][0]}' "
f"and helps cover {', '.join(sorted(best['letters'] & set(uncovered)))}."
)
# ============================================================
# 13) RUN DEMO
# ============================================================
print("\nBoard sides:", BOARD_SIDES)
print("Board letters:", "".join(sorted(BOARD_LETTERS)))
print("Current chain:", CURRENT_CHAIN if CURRENT_CHAIN else "(empty)")
print("Mode:", MODE)
print("Semantic rerank:", "ON" if USE_SEMANTIC_RERANK else "OFF")
print("\nValidation examples:")
for word in EXAMPLE_WORDS:
prev = CURRENT_CHAIN[-1] if CURRENT_CHAIN else None
print(f" {word:>8} -> {explain_word(word, prev)}")
ranked = rank_candidates(CURRENT_CHAIN)
ranked = maybe_semantic_rerank(ranked, SEMANTIC_THEME)
print("\nTop next moves:")
if not ranked:
print(" No legal next moves found.")
else:
for item in ranked[:TOP_K]:
stem_prob_text = f"{item['stem_prob']:.3f}" if item["stem_prob"] is not None else "None"
extra = f", semantic={item['semantic_score']:.3f}" if "semantic_score" in item else ""
print(
f" {item['word']:<12} "
f"score={item['score']:.2f}, "
f"freq={item['freq']}, "
f"stem_prob={stem_prob_text}{extra}"
)
print("\nShort cover-all chains:")
solutions = solve_cover_all()
if not solutions:
print(" No chain found with current depth / search budget.")
else:
for i, chain in enumerate(solutions, 1):
covered = "".join(sorted(set("".join(chain))))
print(f" {i}. {' -> '.join(chain)} [covers: {covered}]")
print("\nHint:")
print(" ", make_hint(CURRENT_CHAIN, ranked))