Naming is the primary interface between an LLM's semantic understanding and your organization's structured artifacts. Bad naming degrades agent performance the same way bad tokenization degrades model training.
At a quantitative investment firm, the same concept — "market data" — decomposes along exchange, asset class, instrument, frequency, vendor, and storage format axes simultaneously. When we started deploying Claude Code across our research workflow, one issue became acute: every ls, grep, and git log the agent runs produces output that enters the context window, and the names in that output are the first layer — often the only layer — of information the agent uses to understand the system.
A path like mktdata/shfe/tick/cu and a path like data_feed_03_raw are both tolerable for humans. The first requires knowing that "shfe" means the Shanghai Futures Exchange; the second requires knowing what feed 03 maps to. But for an LLM, the two have fundamentally different parseability. This article explains why, and proposes an actionable naming standard.
Modern LLMs use Byte Pair Encoding (Sennrich et al., 2016, "Neural Machine Translation of Rare Words with Subword Units"). The key mechanism is two-stage encoding: a regex-based pre-tokenizer splits text into chunks, then BPE merge operations run independently within each chunk — merges never cross chunk boundaries.
This means delimiter choice directly determines tokenization output:
- market_data → pre-tokenizer produces market + _data; BPE typically keeps this as 2–3 tokens
- market-data → pre-tokenizer produces market + - + data; roughly 3–5 tokens
- mktdata → may split into mkt + data or even mk + t + data, fragmenting the semantics

The core pattern: common English words (market, data, tick, config) are almost certainly single tokens in any major LLM vocabulary. Domain-specific abbreviations (shfe, mktdata) will likely fragment into meaningless subword pieces. The model can reconstruct meaning from fragments, but less reliably than from whole semantic units.
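The chunk-boundary mechanism can be sketched directly. The ASCII regex below is an illustrative stand-in for a GPT-style pre-tokenizer pattern, not any model's actual splitter (real patterns, such as GPT-2's, use Unicode character categories):

```python
import re

# Simplified stand-in for a GPT-style pre-tokenizer pattern: runs of
# letters, runs of digits, runs of punctuation, runs of whitespace.
# An illustrative assumption, not any production tokenizer's regex.
PRETOK = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]+|\s+")

def pretokenize(name: str) -> list[str]:
    """Split a name into chunks; BPE merges never cross these boundaries."""
    return PRETOK.findall(name)

print(pretokenize("market_data"))  # ['market', '_', 'data']
print(pretokenize("market-data"))  # ['market', '-', 'data']
print(pretokenize("mktdata"))      # ['mktdata']
```

Because merges run independently within each chunk, the common words market and data reach BPE as whole units in the first two cases, while mktdata enters as one opaque chunk that the merge table must fragment however it can.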
This isn't just theoretical. Wang et al. (2023, Penn State & CMU, arXiv:2307.12488) ran experiments showing that anonymizing variable names in Python code caused significant MRR drops on CodeBERT's code search task. The more striking finding: deliberately misleading names (names suggesting the wrong functionality) degraded performance more than random names did, evidence that the model actively uses name semantics for reasoning, not just structural cues.
Le et al. (2025, arXiv:2510.03178, "When Names Disappear") validated similar conclusions on newer models (GPT-4o, DeepSeek V3, Llama 4). They decompose code understanding into a structural semantics channel (formal behavior) and a naming channel (human-readable intent). Removing the naming channel dropped class-level summarization accuracy by up to 30 percentage points — DeepSeek V3 went from 87.7 to 58.7 on ClassEval.
The classic word2vec relationship king - man + woman ≈ queen demonstrates that consistent semantic relationships create linear structure in embedding space. Applied to naming: if an LLM repeatedly sees the pattern {exchange}/tick/{instrument} — shfe/tick/cu, cme/tick/es, ice/tick/cl — it builds a representation where "tick" occupies a stable position along a frequency dimension, and a novel combination like nymex/tick/gc can be understood by analogy. When naming is inconsistent (shfe_ticks_cu, CME-ES-TICK, ice/realtime/cl), each variation requires independent memorization.
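The analogy can be made concrete with a hypothetical path schema: any path matching a single template parses compositionally, including combinations never seen before, while inconsistent variants need case-by-case handling. The vocabulary constraints below are illustrative assumptions:

```python
import re

# Hypothetical schema for the {exchange}/tick/{instrument} pattern:
# lowercase exchange, a fixed frequency vocabulary, lowercase symbol.
TEMPLATE = re.compile(
    r"^(?P<exchange>[a-z]+)/(?P<freq>tick|daily)/(?P<instrument>[a-z0-9]+)$"
)

def parse(path: str):
    """Return the path's fields if it follows the template, else None."""
    m = TEMPLATE.match(path)
    return m.groupdict() if m else None

print(parse("nymex/tick/gc"))   # novel combination, still parses
print(parse("shfe_ticks_cu"))   # inconsistent variant -> None
print(parse("CME-ES-TICK"))     # inconsistent variant -> None
```

The point mirrors the embedding-space argument: one template means one rule to learn, and nymex/tick/gc is understood by analogy rather than memorized.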
Min et al. (2022, EMNLP, arXiv:2202.12837) provide further support. Their experiments across 12 models including GPT-3 showed that in few-shot in-context learning, the format and distribution of examples matter more than whether individual labels are correct — randomly replacing labels barely hurt performance. For naming, this is the key insight: a filesystem where every path follows {category}/{source}/{type}/{symbol} acts as a many-shot prompt. Each path reinforces the format pattern.
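In that spirit, a directory listing can be linted for format consistency before it ever reaches an agent's context window. A minimal sketch, assuming a four-level schema and a lowercase-alphanumeric segment rule (both hypothetical, not an established standard):

```python
# Sketch of a listing-as-prompt consistency check: every conforming
# path reinforces the format; every outlier dilutes it. The schema
# depth and segment rule here are assumptions for illustration.
SCHEMA = ("category", "source", "type", "symbol")

def conforms(path: str) -> bool:
    """True if the path has one segment per schema level,
    each lowercase and alphanumeric."""
    parts = path.split("/")
    return len(parts) == len(SCHEMA) and all(
        p.isalnum() and p.islower() for p in parts
    )

listing = ["mktdata/shfe/tick/cu", "mktdata/cme/tick/es", "data_feed_03_raw"]
for p in listing:
    print(f"{p}: {'ok' if conforms(p) else 'breaks the pattern'}")
```

Every conforming path the agent sees in ls or grep output is, in effect, one more in-context example of the format, which is exactly the kind of distributional signal Min et al. found models exploit.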