Strategy
Every Research-Backed Way to Be More Convincing to an LLM (The Complete Cheat Sheet)
Eytan Buchman
2026-03-10
10 min read
We spent two articles going deep on how LLMs process content differently than humans. Here is every research-backed tactic in one place — 34 tactics from 19 studies, with the specific data, the models tested, and the papers behind them.
Bookmark this.
Formatting & Structure
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 1 | Use clean, consistent separators | Choose separators (spaces, dashes, newlines) deliberately; avoid unpredictable punctuation between fields | `passage {} answer {}` hit 82.6% accuracy vs. `passage:{} answer:{}` at 4.3% (same model, same task) | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 2 | Bold your key claims | Use bold for important statements, numbers, and conclusions | Bold text hit up to 99% win rate vs. non-bold (Skywork-Critic); GPT-4 Turbo: 89.5% | Zhang et al. | GPT-4 Turbo, Skywork-Critic, ArmoRM, Pairwise-Llama-3 | 2025 |
| 3 | Use bullet/numbered lists | Structure key points as lists rather than prose | Lists hit up to 93.5% win rate (Pairwise-model); GPT-4 Turbo: 75.75%; even debiased models still showed 84% list preference | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Pairwise-Llama-3, OffsetBias-RM | 2025 |
| 4 | Add hyperlinks | Include relevant links to sources, related content, and references | Hyperlinks hit 87.25% win rate on GPT-4 Turbo; 84.75% on Pairwise-model | Zhang et al. | GPT-4 Turbo, Pairwise-Llama-3, Zephyr-Mistral-7B | 2025 |
| 5 | Use exclamation marks (sparingly) | Add occasional exclamation marks for emphasis on key points | Exclamation marks hit 80.5% win rate on GPT-4 Turbo; 77.75% on Skywork-Critic | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Zephyr-Mistral-7B | 2025 |
| 6 | Prioritize structure over label copy | Focus on clear H1/H2/H3 hierarchy and grouped sections — the words in your headers matter less than having them | Random/nonsensical labels ("similar tennis") performed as well as correct labels; attention analysis showed models barely read descriptive nouns | Tang et al. | XGLM-7.5B, Alpaca-7B, Llama3.1-8B, Mistral-7B, GPT-3.5 | 2025 |
| 7 | Group content into multiple labeled sections | Use two or more clearly delineated sections rather than one flat block | Ensemble format with two labeled groups outperformed single-block prompts across commonsense, math, and reasoning tasks — even with random labels | Tang et al. | XGLM-7.5B, Alpaca-7B, Llama3.1-8B, Mistral-7B, GPT-3.5 | 2025 |
| 8 | Provide clean, extractable text | Use structured HTML, clean Markdown, clear heading hierarchy — make content easy to parse, quote, and cite | WebGPT's accuracy improved dramatically when it could extract clean, structured text; messy formatting meant worse quotes and worse answers | Nakano et al. | GPT-3 (175B) | 2021 |
| 9 | Use Markdown over plain text | When serving content to AI, Markdown with semantic markers (tables, headings, hierarchies) outperforms stripped plain text | "Plain-text conversion strips essential semantic markers... vital for deep document understanding"; LLMs get structure right (89% Key F1) but values wrong (46%) | Brach et al. | GPT-4o-mini, Qwen3-1.7B/4B/30B | 2026 |
| 10 | Keep structural complexity under the cliff-edge | Stay under schema depth 7 and under 200 distinct data fields for LLM-facing content | Validation rates stay ~95% for moderate schemas but crash to ~20% at depth >=7; failures are non-linear cliffs, not gradual declines | Brach et al. | GPT-4o-mini, Qwen3-1.7B/4B/30B | 2026 |
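Tactic 10's cliff-edge thresholds are easy to check before you publish. Here's a minimal sketch (mine, not from the Brach et al. paper) that walks a JSON-like structure and reports nesting depth and distinct field count against the depth-7 / 200-field limits. The counting convention — each dict or list adds one level — is an assumption; adjust it to match however your pipeline measures depth.

```python
def schema_stats(node, depth=1):
    """Return (max_depth, distinct_field_names) for nested dicts/lists.

    Convention (an assumption): every dict or list adds one nesting level.
    """
    if isinstance(node, dict):
        max_d, fields = depth, set(node)
        for value in node.values():
            d, f = schema_stats(value, depth + 1)
            max_d, fields = max(max_d, d), fields | f
        return max_d, fields
    if isinstance(node, list):
        max_d, fields = depth, set()
        for item in node:
            d, f = schema_stats(item, depth + 1)
            max_d, fields = max(max_d, d), fields | f
        return max_d, fields
    return depth, set()  # leaf value

doc = {"product": {"specs": {"cpu": {"cores": 8}}}, "reviews": [{"rating": 5}]}
depth, fields = schema_stats(doc)
under_cliff = depth < 7 and len(fields) < 200
```

If `under_cliff` comes back `False`, flatten the schema before serving it to an agent — remember the failures are cliffs, not slopes.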
Content & Length
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 11 | Be comprehensive (longer wins) | Include full detail; don't rely on scannable summaries alone | All LLM judges showed verbosity bias; once length difference exceeded ~40 tokens, preference scores consistently exceeded 0.7 | Chen et al. | GPT-4, GPT-4-Turbo, Claude-2, PaLM-2, LLaMA2-70B | 2024 |
| 12 | Maintain logical rigor | Ensure every claim adds up; avoid misleading comparisons or hand-wavy logic | GPT-4 catches factual errors 94% of the time vs. humans at 79%; factual errors cause the single largest penalties (5+ point drop on a 10-pt scale) | Chen et al., Gao et al. | GPT-4, GPT-5.1, Claude Sonnet 4.5 | 2024-2026 |
| 13 | Use an affirmative, confident tone | Open with phrases like "Here's what we found:" rather than hedging; avoid "might," "perhaps," "it's possible" | Affirmative tone hit 88.75% win rate on GPT-4 Turbo; LLMs are mathematically trained to reward confidence over abstention (guessing ALWAYS beats IDK under binary grading) | Zhang et al., Kalai et al. (OpenAI) | GPT-4 Turbo, Skywork-Critic; theoretical (all LLMs) | 2025 |
| 14 | Repeat key claims across passages | State important facts more than once, in different contexts and phrasings | Repeating a low-credibility source's claim once flipped preferences away from a government source (gap of 30-34 points); repetition even overrides source attribution | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 15 | Use bandwagon/consensus signals | Phrases like "90% of experts agree" or "most research confirms" amplify LLM trust | Bandwagon signals flipped even OpenAI o1's correct answers; fabricated consensus overrides correct reasoning | Wang et al. | Qwen3-1.7B/4B, OpenAI o1 | 2026 |
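One way to operationalize tactic 13 is a quick hedge-word audit before publishing. This is a sketch of my own, not from the cited papers, and the hedge list is illustrative rather than exhaustive:

```python
import re

# Illustrative hedge phrases (tactic 13); extend to taste.
HEDGES = ["might", "perhaps", "possibly", "it's possible", "could be"]

def hedge_count(text: str) -> int:
    """Count hedging phrases on word boundaries, case-insensitively."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(h) + r"\b", lowered))
        for h in HEDGES
    )

hedged = "It might perhaps improve results, and it's possible latency could be lower."
affirmative = "Here's what we found: it improves results and lowers latency."
```

`hedge_count(hedged)` flags four hedges; the affirmative rewrite flags zero. A non-zero count isn't automatically wrong — sometimes uncertainty is honest — but it's a signal to check whether you're hedging out of habit.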
Citations & Authority
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 16 | Cite your sources — for everything | Add references for every claim, stat, and comparison; the act of having citations boosts perceived quality | Fake references fooled GPT-4 69% of the time, Claude-2 89%; humans only 39% | Chen et al. | GPT-4, Claude-2, PaLM-2, LLaMA2-70B, humans | 2024 |
| 17 | Cite well-known, highly cited sources | Prefer famous sources over obscure ones — LLMs have internalized a "highly cited = good" bias | LLM-suggested references were ~1,326 citations more popular (median) than ground-truth references | Algaba et al. | GPT-4, GPT-4o, Claude 3.5 | 2025 |
| 18 | Favor established venues | When citing, prefer arXiv, NeurIPS, AAAI, and major journals — LLMs over-represent these in training | LLMs over-indexed on arXiv and NeurIPS when generating references; strong venue bias | Algaba et al. | GPT-4, GPT-4o, Claude 3.5 | 2025 |
| 19 | Attribute to institutional sources | Government and institutional sources outrank individual and social media sources | Strict hierarchy: Government > Newspaper > Person > Social Media, consistent across 11/13 models (Kendall's W = 0.74) | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 20 | Add circulation/follower counts | Include credibility signals like audience size when attributing sources | High-circulation newspapers preferred over low-circulation; high-follower social accounts over low-follower; controlled for big-number effect | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
| 21 | Use specific expert credentials | "Board-certified physician" > "doctor" > "medical professional"; the more specific, the stronger | Board-certified physician endorsement swung accuracy by +0.458 (correct) / -0.447 (incorrect) on MedQA | Mammen et al. | Phi-4-Reasoning, DeepSeek-R1, LLaMA-3.1, Gemma, Mistral | 2026 |
| 22 | Use "Expert" and "Specialist" labels | Expert Power labels outperform Legitimate Power labels (Judge, Manager) | DeepSeek R1 reached 100% agreement with "Expert" labels; Expert Power > Referent Power > Legitimate Power | Choi et al. | GPT-4o, DeepSeek R1 | 2026 |
| 23 | Avoid inaccurate or irrelevant citations | Bad citations are punished MORE harshly than good ones are rewarded | Incorrect/irrelevant reference dropped GPT-4o score from 9.12 to 3.94 (5.18-point drop on a 10-pt scale) | Gao et al. | GPT-4o, GPT-5.1, Claude Sonnet 4.5 | 2026 |
| 24 | Include verifiable reference details | Structure citations with title, author, year, and link — make them checkable | WebGPT was trained to collect references during browsing; reward model valued referenced claims over unreferenced | Nakano et al. | GPT-3 (175B) | 2021 |
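Tactic 24's structure — title, author, year, link — is worth enforcing mechanically rather than by eye. A minimal sketch (mine, assuming a Markdown reference list); the example entry is lifted from this article's own reference section:

```python
def render_citation(author: str, year: int, title: str, link: str) -> str:
    """Render one checkable reference line with all four verifiable fields."""
    if not all([author, year, title, link]):
        raise ValueError("citation is missing a verifiable field")
    return f'- {author} ({year}). "{title}." {link}'

ref = render_citation(
    "Sclar, M., et al.", 2023,
    "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design",
    "arXiv:2310.11324",
)
```

The `ValueError` guard is the point: a citation missing any of the four fields isn't checkable, and per tactic 23, an uncheckable or wrong citation costs you more than a good one earns.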
Framing & Presentation
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 25 | Frame claims positively | "This product delivers reliable results" > "This product doesn't deliver unreliable results" | LLMs show 2x more bias under negative framing than positive; positive framing reduces safety scrutiny by ~2x | Lim et al. | LLaMA-3, Qwen2.5, Gemma3, Mistral, Falcon (13 models, 3B-70B) | 2026 |
| 26 | Know your evaluating model family | LLaMA tends to agree, GPT tends to reject, Qwen is mixed — optimize framing accordingly | All 14 LLM judges showed framing bias; model families have hardcoded directional tendencies (LLaMA: +0.19 to +2.41pp acquiescence; GPT: -0.57 to -1.38pp) | Hwang et al. | GPT-4o/5, Qwen 2.5 (1.5B-72B), LLaMA 3.1/3.2/3.3 | 2026 |
| 27 | Use emojis (model-dependent) | Add emojis for GPT-4/Skywork models; avoid for Zephyr/FsfairX-based systems | GPT-4 Turbo: 86.75% win rate for emoji; Skywork: 97.25%; but Zephyr: only 26.5% (anti-emoji bias) | Zhang et al. | GPT-4 Turbo, Skywork-Critic, Zephyr-Mistral-7B, FsfairX | 2025 |
Position & Order
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 28 | Put your strongest content first | Lead with your best argument or most important information | GPT-3.5-Turbo: 0.95 first-position preference; Llama3-8B flips judgment 76.2% of the time when answer order is reversed | Chen et al., Feng et al. | GPT-3.5/4/5, LLaMA-3, Gemini, Claude, Qwen, DeepSeek | 2024-2025 |
| 29 | Present separate supporting passages rather than merging | Two separate passages from different sources are far more effective than listing sources in one header | Two-source format: preference gap of 33.9 points; merged single-header format: only 6.17 points | Schuster et al. | Qwen2.5 (7B-72B), OLMo-2, LLaMA-3, Gemma-3 | 2026 |
Meta-Tactics (Testing & Optimization)
| # | Tactic | What To Do | Key Data Point | Research | Model(s) Tested | Date |
|---|---|---|---|---|---|---|
| 30 | Test formatting — don't assume | The formatting space is non-smooth; small changes produce unpredictable effects | Only 32-34% of formatting "triples" showed monotonic performance — barely better than random | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 31 | Test per model — biases differ | Format preferences are weakly correlated between models; what works for one may not work for another | Relative model rankings completely reverse ~14% of the time; 76% of reversals are statistically significant | Sclar et al. | LLaMA-2-7B/13B/70B, Falcon-7B, GPT-3.5 | 2023 |
| 32 | Formatting beats content quality for preference | When content quality is comparable, the better-formatted version wins, even if its content is worse | GPT-4 preferred factually worse content formatted with bold + lists over factually better plain content | Zhang et al. | GPT-4 Turbo, ArmoRM, Pairwise-Llama-3 | 2025 |
| 33 | Don't tell models to "resist bias" | Explicit debiasing prompts often backfire — they can drop accuracy without fixing the underlying bias | Debiasing prompts dropped accuracy from 66.2% to 40.9%; models produce "performative independence" language without actual reasoning | Wang et al. | Qwen3-1.7B/4B | 2026 |
| 34 | Use multi-model panels, not debates | When using LLM-as-judge, aggregate across models; avoid debate formats | Multi-agent panels improved performance by up to 15%; ChatEval debates degraded performance by 45-162% | Feng et al. | Gemini-2.5, GPT-5, Claude-3, Qwen3, DeepSeek | 2025 |
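Tactics 30-32 boil down to one habit: run pairwise A/B comparisons per judge model instead of trusting intuition. The harness below is a sketch of my own; `toy_judge` is a deterministic stand-in for a real model call that simply rewards bold markers and list bullets, mimicking the format bias in tactic 32.

```python
from collections import Counter

def ab_test_formats(judge, variants):
    """Pairwise-compare formatting variants; tally wins per variant name."""
    wins = Counter()
    names = list(variants)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            winner = judge(variants[a], variants[b])
            wins[a if winner == 0 else b] += 1
    return wins

def toy_judge(text_a, text_b):
    """Stand-in for an LLM judge: rewards bold markers and list bullets."""
    score = lambda t: t.count("**") + t.count("\n- ")
    return 0 if score(text_a) >= score(text_b) else 1

variants = {
    "plain": "Benefits: speed, cost, accuracy.",
    "formatted": "**Benefits**\n- speed\n- cost\n- accuracy",
}
wins = ab_test_formats(toy_judge, variants)
```

In a real run, swap `toy_judge` for an API call and repeat the whole tally once per model family (tactic 31) — and evaluate each pair in both orders to control for the position bias in tactic 28.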
All of this is the design problem Switch exists to solve: detecting who's visiting your site and serving the right experience to humans vs. agents. For the full narrative behind these tactics: Part One and Part Two.
References
- Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design." arXiv:2310.11324
- Chen, G. H., et al. (2024). "Humans or LLMs as the Judge? A Study on Judgement Biases." arXiv:2402.10669
- Nakano, R., et al. (2021). "WebGPT: Browser-assisted question-answering with human feedback." arXiv:2112.09332
- Tang, C., et al. (2025). "Prompt Format Beats Descriptions." Findings of EMNLP 2025. ACL Anthology
- Zhang, X., et al. (2025). "From Lists to Emojis: How Format Bias Affects Model Alignment." ACL 2025. ACL Anthology
- Algaba, A., et al. (2025). "LLMs Reflect Human Citation Patterns with a Heightened Citation Bias." Findings of NAACL 2025. ACL Anthology
- Kalai, A. T., et al. (2025). "Why Language Models Hallucinate." OpenAI
- Lai, P., et al. (2025). "Beyond the Surface (LAGER)." NeurIPS 2025. arXiv:2508.03550
- Feng, Y., et al. (2025). "SAGE: Are We on the Right Way to Assessing LLM-as-a-Judge?" arXiv:2512.16041
- Cheng, A., et al. (2025). "The FACTS Leaderboard." Google DeepMind
- Schuster, J., Gautam, V., & Markert, K. (2026). "Whose Facts Win?" arXiv:2601.03746
- Choi, J., et al. (2026). "Belief in Authority." arXiv:2601.04790
- Mammen, P. M., et al. (2026). "Trust Me, I'm an Expert." arXiv:2601.13433
- Hwang, Y., et al. (2026). "When Wording Steers the Evaluation." arXiv:2601.13537
- Wang, H., et al. (2026). "Teaching Large Reasoning Models Effective Reflection." arXiv:2601.12720
- Wang, Q., et al. (2026). "Making Bias Non-Predictive." arXiv:2602.01528
- Lim, K., Kim, S., & Whang, S. E. (2026). "DeFrame." arXiv:2602.04306
- Brach, W., et al. (2026). "ScrapeGraphAI-100k." arXiv:2602.15189
- Gao, J., et al. (2026). "Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems." arXiv:2510.12462
- Churina, S., et al. (2026). "Layer of Truth." arXiv:2510.26829
- Anthropic. (2026). "The Persona Selection Model." Anthropic Research