Strategy
Every Research-Backed Way to Be More Convincing to an LLM (The Complete Cheat Sheet)
Eytan Buchman
2026-03-10
10 min read
We spent two articles going deep on how LLMs process content differently than humans. Here is every research-backed tactic in one place — 34 tactics from 19 studies, with the specific data, the models tested, and the papers behind them.
Bookmark this.
Formatting & Structure
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 1 | Use clean, consistent separators | Choose separators (spaces, dashes, newlines) deliberately; avoid unpredictable punctuation between fields | `passage {} answer {}` hit 82.6% accuracy vs. `passage:{} answer:{}` at 4.3% (same model, same task) | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 2 | Bold your key claims | Use bold for important statements, numbers, and conclusions | Bold text hit up to 99% win rate vs. non-bold (Skywork-Critic); GPT-4 Turbo: 89.5% | Zhang et al. | GPT-4 Turbo, Skywork-Critic, ArmoRM, Pairwise-Llama-3 | 2025 |
| 3 | Use bullet/numbered lists | Structure key points as lists rather than prose | Lists hit up to 93.5% win rate (Pairwise-model); GPT-4 Turbo: 75.75%; even debiased models still showed 84% list preference | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Pairwise-Llama-3, OffsetBias-RM | 2025 |
| 4 | Add hyperlinks | Include relevant links to sources, related content, and references | Hyperlinks hit 87.25% win rate on GPT-4 Turbo; 84.75% on Pairwise-model | Zhang et al. | GPT-4 Turbo, Pairwise-Llama-3, Zephyr-Mistral-7B | 2025 |
| 5 | Use exclamation marks (sparingly) | Add occasional exclamation marks for emphasis on key points | Exclamation marks hit 80.5% win rate on GPT-4 Turbo; 77.75% on Skywork-Critic | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Zephyr-Mistral-7B | 2025 |
| 6 | Prioritize structure over label copy | Focus on clear H1/H2/H3 hierarchy and grouped sections — the words in your headers matter less than having them | Random/nonsensical labels ("similar tennis") performed as well as correct labels; attention analysis showed models barely read descriptive nouns | Tang et al. | XGLM-7.5B, Alpaca-7B, Llama3.1-8B, Mistral-7B, GPT-3.5 | 2025 |
| 7 | Group content into multiple labeled sections | Use two or more clearly delineated sections rather than one flat block | Ensemble format with two labeled groups outperformed single-block prompts across commonsense, math, and reasoning tasks — even with random labels | Tang et al. | XGLM-7.5B, Alpaca-7B, Llama3.1-8B, Mistral-7B, GPT-3.5 | 2025 |
| 8 | Provide clean, extractable text | Use structured HTML, clean Markdown, clear heading hierarchy — make content easy to parse, quote, and cite | WebGPT's accuracy improved dramatically when it could extract clean, structured text; messy formatting meant worse quotes and worse answers | Nakano et al. | GPT-3 (175B) | 2021 |
| 9 | Use Markdown over plain text | When serving content to AI, Markdown with semantic markers (tables, headings, hierarchies) outperforms stripped plain text | "Plain-text conversion strips essential semantic markers... vital for deep document understanding"; LLMs get structure right (89% Key F1) but values wrong (46%) | Brach et al. | GPT-4o-mini, Qwen3-1.7B/4B/30B | 2026 |
| 10 | Keep structural complexity under the cliff-edge | Stay under schema depth 7 and under 200 distinct data fields for LLM-facing content | Validation rates stay ~95% for moderate schemas but crash to ~20% at depth >=7; failures are non-linear cliffs, not gradual declines | Brach et al. | GPT-4o-mini, Qwen3-1.7B/4B/30B | 2026 |
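Tactic 10's cliff-edge thresholds are easy to check before you publish. Here's a minimal sketch (mine, not from the Brach et al. paper) that walks a JSON-like structure and reports nesting depth and distinct field count against the depth-7 / 200-field limits. The counting convention — each dict or list adds one level — is an assumption; adjust it to match however your pipeline measures depth.

```python
def schema_stats(node, depth=1):
    """Return (max_depth, distinct_field_names) for nested dicts/lists.

    Convention (an assumption): every dict or list adds one nesting level.
    """
    if isinstance(node, dict):
        max_d, fields = depth, set(node)
        for value in node.values():
            d, f = schema_stats(value, depth + 1)
            max_d, fields = max(max_d, d), fields | f
        return max_d, fields
    if isinstance(node, list):
        max_d, fields = depth, set()
        for item in node:
            d, f = schema_stats(item, depth + 1)
            max_d, fields = max(max_d, d), fields | f
        return max_d, fields
    return depth, set()  # leaf value

doc = {"product": {"specs": {"cpu": {"cores": 8}}}, "reviews": [{"rating": 5}]}
depth, fields = schema_stats(doc)
under_cliff = depth < 7 and len(fields) < 200
```

If `under_cliff` comes back `False`, flatten the schema before serving it to an agent — remember the failures are cliffs, not slopes.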
Content & Length
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 11 | Be comprehensive (longer wins) | Include full detail; don't rely on scannable summaries alone | All LLM judges showed verbosity bias; once length difference exceeded ~40 tokens, preference scores consistently exceeded 0.7 | Chen et al. | GPT-4, GPT-4-Turbo, Claude-2, PaLM-2, LLaMA2-70B | 2024 |
| 12 | Maintain logical rigor | Ensure every claim adds up; avoid misleading comparisons or hand-wavy logic | GPT-4 catches factual errors 94% of the time vs. humans at 79%; factual errors cause the single largest penalties (5+ point drop on a 10-pt scale) | Chen et al., Gao et al. | GPT-4, GPT-5.1, Claude Sonnet 4.5 | 2024-2026 |
| 13 | Use an affirmative, confident tone | Open with phrases like "Here's what we found:" rather than hedging; avoid "might," "perhaps," "it's possible" | Affirmative tone hit 88.75% win rate on GPT-4 Turbo; LLMs are mathematically trained to reward confidence over abstention (guessing ALWAYS beats IDK under binary grading) | Zhang et al., Kalai et al. (OpenAI) | GPT-4 Turbo, Skywork-Critic; theoretical (all LLMs) | 2025 |
| 14 | Repeat key claims across passages | State important facts more than once, in different contexts and phrasings | Repeating a low-credibility source's claim once flipped preferences away from a government source (gap of 30-34 points); repetition even overrides source attribution | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 15 | Use bandwagon/consensus signals | Phrases like "90% of experts agree" or "most research confirms" amplify LLM trust | Bandwagon signals flipped even OpenAI o1's correct answers; fabricated consensus overrides correct reasoning | Wang et al. | Qwen3-1.7B/4B, OpenAI o1 | 2026 |
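One way to operationalize tactic 13 is a quick hedge-word audit before publishing. This is a sketch of my own, not from the cited papers, and the hedge list is illustrative rather than exhaustive:

```python
import re

# Illustrative hedge phrases (tactic 13); extend to taste.
HEDGES = ["might", "perhaps", "possibly", "it's possible", "could be"]

def hedge_count(text: str) -> int:
    """Count hedging phrases on word boundaries, case-insensitively."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(h) + r"\b", lowered))
        for h in HEDGES
    )

hedged = "It might perhaps improve results, and it's possible latency could be lower."
affirmative = "Here's what we found: it improves results and lowers latency."
```

`hedge_count(hedged)` flags four hedges; the affirmative rewrite flags zero. A non-zero count isn't automatically wrong — sometimes uncertainty is honest — but it's a signal to check whether you're hedging out of habit.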
Citations & Authority
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 16 | Cite your sources — for everything | Add references for every claim, stat, and comparison; the act of having citations boosts perceived quality | Fake references fooled GPT-4 69% of the time, Claude-2 89%; humans only 39% | Chen et al. | GPT-4, Claude-2, PaLM-2, LLaMA2-70B, humans | 2024 |
| 17 | Cite well-known, highly cited sources | Prefer famous sources over obscure ones — LLMs have internalized a "highly cited = good" bias | LLM-suggested references were ~1,326 citations more popular (median) than ground-truth references | Algaba et al. | GPT-4, GPT-4o, Claude 3.5 | 2025 |
| 18 | Favor established venues | When citing, prefer arXiv, NeurIPS, AAAI, and major journals — LLMs over-represent these in training | LLMs over-indexed on arXiv and NeurIPS when generating references; strong venue bias | Algaba et al. | GPT-4, GPT-4o, Claude 3.5 | 2025 |
| 19 | Attribute to institutional sources | Government and institutional sources outrank individual and social media sources | Strict hierarchy: Government > Newspaper > Person > Social Media, consistent across 11/13 models (Kendall's W = 0.74) | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 20 | Add circulation/follower counts | Include credibility signals like audience size when attributing sources | High-circulation newspapers preferred over low-circulation; high-follower social accounts over low-follower; controlled for big-number effect | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 21 | Use specific expert credentials | "Board-certified physician" > "doctor" > "medical professional"; the more specific, the stronger | Board-certified physician endorsement swung accuracy by +0.458 (correct) / -0.447 (incorrect) on MedQA | Mammen et al. | Phi-4-Reasoning, DeepSeek-R1, LLaMA-3.1, Gemma, Mistral | 2026 |
| 22 | Use "Expert" and "Specialist" labels | Expert Power labels outperform Legitimate Power labels (Judge, Manager) | DeepSeek R1 reached 100% agreement with "Expert" labels; Expert Power > Referent Power > Legitimate Power | Choi et al. | GPT-4o, DeepSeek R1 | 2026 |
| 23 | Avoid inaccurate or irrelevant citations | Bad citations are punished MORE harshly than good ones are rewarded | Incorrect/irrelevant reference dropped GPT-4o score from 9.12 to 3.94 (5.18-point drop on a 10-pt scale) | Gao et al. | GPT-4o, GPT-5.1, Claude Sonnet 4.5 | 2026 |
| 24 | Include verifiable reference details | Structure citations with title, author, year, and link — make them checkable | WebGPT was trained to collect references during browsing; reward model valued referenced claims over unreferenced | Nakano et al. | GPT-3 (175B) | 2021 |
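Tactic 24's structure — title, author, year, link — is worth enforcing mechanically rather than by eye. A minimal sketch (mine, assuming a Markdown reference list); the example entry is lifted from this article's own reference section:

```python
def render_citation(author: str, year: int, title: str, link: str) -> str:
    """Render one checkable reference line with all four verifiable fields."""
    if not all([author, year, title, link]):
        raise ValueError("citation is missing a verifiable field")
    return f'- {author} ({year}). "{title}." {link}'

ref = render_citation(
    "Sclar, M., et al.", 2023,
    "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design",
    "arXiv:2310.11324",
)
```

The `ValueError` guard is the point: a citation missing any of the four fields isn't checkable, and per tactic 23, an uncheckable or wrong citation costs you more than a good one earns.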
Framing & Presentation
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 25 | Frame claims positively | "This product delivers reliable results" > "This product doesn't deliver unreliable results" | LLMs show 2x more bias under negative framing than positive; positive framing reduces safety scrutiny by ~2x | Lim et al. | LLaMA-3, Qwen2.5, Gemma3, Mistral, Falcon (13 models, 3B-70B) | 2026 |
| 26 | Know your evaluating model family | LLaMA tends to agree, GPT tends to reject, Qwen is mixed — optimize framing accordingly | All 14 LLM judges showed framing bias; model families have hardcoded directional tendencies (LLaMA: +0.19 to +2.41pp acquiescence; GPT: -0.57 to -1.38pp) | Hwang et al. | GPT-4o/5, Qwen 2.5 (1.5B-72B), LLaMA 3.1/3.2/3.3 | 2026 |
| 27 | Use emojis (model-dependent) | Add emojis for GPT-4/Skywork models; avoid for Zephyr/FsfairX-based systems | GPT-4 Turbo: 86.75% win rate for emoji; Skywork: 97.25%; but Zephyr: only 26.5% (anti-emoji bias) | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Zephyr-Mistral-7B, FsfairX | 2025 |
Position & Order
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 28 | Put your strongest content first | Lead with your best argument or most important information | GPT-3.5-Turbo: 0.95 first-position preference; Llama3-8B flips judgment 76.2% of the time when answer order is reversed | Chen et al., Feng et al. | GPT-3.5/4/5, LLaMA-3, Gemini, Claude, Qwen, DeepSeek | 2024-2025 |
| 29 | Present separate supporting passages rather than merging | Two separate passages from different sources are far more effective than listing sources in one header | Two-source format: preference gap of 33.9 points; merged single-header format: only 6.17 points | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
Meta-Tactics (Testing & Optimization)
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 30 | Test formatting — don't assume | The formatting space is non-smooth; small changes produce unpredictable effects | Only 32-34% of formatting "triples" showed monotonic performance — barely better than random | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 31 | Test per model — biases differ | Format preferences are weakly correlated between models; what works for one may not work for another | Relative model rankings completely reverse ~14% of the time; 76% of reversals are statistically significant | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 32 | Formatting beats content quality for preference | When content quality is comparable, the better-formatted version wins, even if its content is worse | GPT-4 preferred factually worse content formatted with bold + lists over factually better plain content | Zhang et al. | GPT-4 Turbo, ArmoRM, Pairwise-Llama-3 | 2025 |
| 33 | Don't tell models to "resist bias" | Explicit debiasing prompts often backfire — they can drop accuracy without fixing the underlying bias | Debiasing prompts dropped accuracy from 66.2% to 40.9%; models produce "performative independence" language without actual reasoning | Wang et al. | Qwen3-1.7B/4B | 2026 |
| 34 | Use multi-model panels, not debates | When using LLM-as-judge, aggregate across models; avoid debate formats | Multi-agent panels improved performance by up to 15%; ChatEval debates degraded performance by 45-162% | Feng et al. | Gemini-2.5, GPT-5, Claude-3, Qwen3, DeepSeek | 2025 |
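Tactics 30-32 boil down to one habit: run pairwise A/B comparisons per judge model instead of trusting intuition. The harness below is a sketch of my own; `toy_judge` is a deterministic stand-in for a real model call that simply rewards bold markers and list bullets, mimicking the format bias in tactic 32.

```python
from collections import Counter

def ab_test_formats(judge, variants):
    """Pairwise-compare formatting variants; tally wins per variant name."""
    wins = Counter()
    names = list(variants)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            winner = judge(variants[a], variants[b])
            wins[a if winner == 0 else b] += 1
    return wins

def toy_judge(text_a, text_b):
    """Stand-in for an LLM judge: rewards bold markers and list bullets."""
    score = lambda t: t.count("**") + t.count("\n- ")
    return 0 if score(text_a) >= score(text_b) else 1

variants = {
    "plain": "Benefits: speed, cost, accuracy.",
    "formatted": "**Benefits**\n- speed\n- cost\n- accuracy",
}
wins = ab_test_formats(toy_judge, variants)
```

In a real run, swap `toy_judge` for an API call and repeat the whole tally once per model family (tactic 31) — and evaluate each pair in both orders to control for the position bias in tactic 28.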
All of this is the design problem Switch exists to solve: detecting who's visiting your site and serving the right experience to humans vs. agents. For the full narrative behind these tactics: Part One and Part Two.
References
- Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design." arXiv:2310.11324
- Chen, G. H., et al. (2024). "Humans or LLMs as the Judge? A Study on Judgement Biases." arXiv:2402.10669
- Nakano, R., et al. (2021). "WebGPT: Browser-assisted question-answering with human feedback." arXiv:2112.09332
- Tang, C., et al. (2025). "Prompt Format Beats Descriptions." Findings of EMNLP 2025. ACL Anthology
- Zhang, X., et al. (2025). "From Lists to Emojis: How Format Bias Affects Model Alignment." ACL 2025. ACL Anthology
- Algaba, A., et al. (2025). "LLMs Reflect Human Citation Patterns with a Heightened Citation Bias." Findings of NAACL 2025. ACL Anthology
- Kalai, A. T., et al. (2025). "Why Language Models Hallucinate." OpenAI
- Lai, P., et al. (2025). "Beyond the Surface (LAGER)." NeurIPS 2025. arXiv:2508.03550
- Feng, Y., et al. (2025). "SAGE: Are We on the Right Way to Assessing LLM-as-a-Judge?" arXiv:2512.16041
- Cheng, A., et al. (2025). "The FACTS Leaderboard." Google DeepMind
- Schuster, J., Gautam, V., & Markert, K. (2026). "Whose Facts Win?" arXiv:2601.03746
- Choi, J., et al. (2026). "Belief in Authority." arXiv:2601.04790
- Mammen, P. M., et al. (2026). "Trust Me, I'm an Expert." arXiv:2601.13433
- Hwang, Y., et al. (2026). "When Wording Steers the Evaluation." arXiv:2601.13537
- Wang, H., et al. (2026). "Teaching Large Reasoning Models Effective Reflection." arXiv:2601.12720
- Wang, Q., et al. (2026). "Making Bias Non-Predictive." arXiv:2602.01528
- Lim, K., Kim, S., & Whang, S. E. (2026). "DeFrame." arXiv:2602.04306
- Brach, W., et al. (2026). "ScrapeGraphAI-100k." arXiv:2602.15189
- Gao, J., et al. (2026). "Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems." arXiv:2510.12462
- Churina, S., et al. (2026). "Layer of Truth." arXiv:2510.26829
- Anthropic. (2026). "The Persona Selection Model." Anthropic Research