The original-content test for this topic
Most pages about deep research do one of three things: explain how OpenAI’s Deep Research product works, list ten “best deep research AI tools,” or recap a generic research workflow with “AI can help” sprinkled in. All three describe the surface and miss the question: what changed in December 2024 when a single product line claimed the term, and how should that change what we teach about doing the work?
This page makes one argument: the term “deep research” now refers to three different things at once, and a workflow that does not say which one it is solving for produces unreliable output regardless of which tool is used. The three things are an agentic search product class, a methodological practice that long predates these products, and an output shape (the long-form, source-cited report). The first is a product evaluation question. The third is a formatting question. Only the second can be taught and improved as a skill that survives a tool change.
The reference data this page anchors to: Google DeepMind launched Gemini Deep Research in December 2024; OpenAI launched its Deep Research agent in February 2025; arXiv 2506.12594 (a 2025 survey paper) documents the rapid product-class proliferation; and Nature reported in 2025 a 36% citation fabrication rate across general-purpose LLM tools, a failure all current deep research products have inherited and only partially mitigated.
A page that does not separate these three meanings is teaching the audience to evaluate products when they wanted to learn methodology. This page treats the methodology as the actual subject.
“Deep research” is three different things in 2026
The term collapsed three meanings into one in late 2024 and has not been disambiguated since. A useful reading of any “deep research” page starts by asking which of these it is talking about.
| Meaning | What it refers to | What is being evaluated |
|---|---|---|
| Product class | OpenAI Deep Research, Gemini Deep Research, Perplexity Pro Search, ChatGPT o3 with browsing, Claude Research, etc. | Speed, source coverage, hallucination rate, UI, pricing |
| Methodological practice | The discipline of producing source-backed work that survives audit — scoping, retrieval, synthesis, verification, deliverable | The researcher’s workflow, regardless of tool |
| Output shape | A long-form report with citations, headings, and structured synthesis | The artifact, not the process that made it |
These are not the same thing. A product can be excellent (low hallucination, fast, broad coverage) while the user employs it badly (no scoping, no verification, no deliverable shape) and produces unreliable work. A user can have an excellent methodological practice and use a mediocre tool and produce reliable work, slowly. The product class and the methodological practice are independent variables.
Pages that conflate them produce two predictable failures. They review products as if a better product means better research (it does not), or they teach methodology as if a particular product is the methodology (it is not). The methodology survives a tool change. The tool review does not.
“500+ sources” is a vanity metric, not a quality signal
OpenAI’s Deep Research launch announcement emphasized that the agent “consults more than 500 sources” for some assignments. This number became a reference point in product comparisons. It is also the wrong signal.
Source count tells you how much retrieval the agent did. It tells you nothing about provenance (whether each source is real and traceable), verification rate (whether each citation supports the claim attached), freshness (when each source was retrieved relative to current state), or relevance (whether retrieved sources actually inform the question). A report citing 500 sources where 36% of citations are unverified is worse than a report citing 50 sources where every citation has been checked.
The signals that actually predict deep research reliability are different (a sketch of how to compute them follows this list):
- Citation resolution rate — what fraction of cited URLs / DOIs resolve to real documents
- Claim-citation match rate — what fraction of citations actually support the sentence they are attached to
- Freshness distribution — for time-sensitive claims, how recently was each source retrieved
- Verification cost — how long does it take a human to audit a 10% sample of claims
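To make these signals concrete, here is a minimal Python sketch of how a 10% audit sample might be scored. The `CitationAudit` record, its field names, and the 90-day default freshness window are illustrative assumptions, not any product's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CitationAudit:
    url: str                 # the cited URL or DOI
    resolves: bool           # does it resolve to a real document?
    supports_claim: bool     # did an auditor confirm it backs the attached sentence?
    retrieved_at: datetime   # when the source was fetched
    audit_minutes: float     # human time spent checking this citation

def reliability_signals(audits: list[CitationAudit],
                        freshness_window: timedelta = timedelta(days=90)) -> dict:
    """Aggregate the four signals over an audited sample of citations."""
    n = len(audits)
    now = datetime.now()
    return {
        "citation_resolution_rate": sum(a.resolves for a in audits) / n,
        "claim_citation_match_rate": sum(a.supports_claim for a in audits) / n,
        "stale_fraction": sum(now - a.retrieved_at > freshness_window for a in audits) / n,
        "verification_cost_minutes": sum(a.audit_minutes for a in audits),
    }
```

Note that none of these signals reward a larger source count; adding unaudited sources can only hold the rates steady or drag them down.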
A workflow that optimizes for source count optimizes for the wrong thing. The user feels that more sources mean better research; the auditor finds that each additional source adds attack surface across the four dimensions above (provenance, verification, freshness, relevance). For a deeper treatment of citation reliability specifically, see AI research with citations, which enumerates the five distinct citation failure modes.
A defensible deep research workflow has three audit surfaces
A workflow is defensible when an outsider can reconstruct three things: what was retrieved, how it was synthesized, and what survived editing. Each is a separate audit surface, and each requires its own preservation discipline.
The source surface lists every retrieved document with retrieval timestamp, source type (primary, secondary, commentary, vendor, community), and the claim it was retrieved to support. This is what answers “what evidence base did this report draw on?” Workflows that lose this surface produce reports nobody can verify, even if every individual citation in the prose looks fine.
The synthesis surface traces every claim in the synthesis back to a source passage. Not “this paragraph has a footnote” — “this specific sentence is built from these specific source paragraphs, and a reader can open them.” This is what answers “where in the source does this claim come from?” Workflows that conflate paragraph-level citation with claim-level citation pass casual review and fail expert review.
The deliverable surface is what the reader sees: the final report or deck, with citations preserved through editing, splits, merges, and export. This is what answers “does the artifact still trace to evidence?” Workflows that lose the citation graph during revision produce final deliverables that look cited and audit as uncited, the most common silent failure in AI deep research.
A workflow that produces all three surfaces survives review. A workflow that produces only the deliverable surface looks polished and fails audit at the first serious challenge.
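One way to read the three surfaces is as record types. The sketch below is a hypothetical Python data model chosen to match the prose above; the names and fields are assumptions, not Innogath's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Source surface: every retrieved document with provenance metadata,
# answering "what evidence base did this report draw on?"
@dataclass
class SourceRecord:
    source_id: str
    url: str
    source_type: str        # primary | secondary | commentary | vendor | community
    retrieved_at: datetime
    retrieved_for: str      # the claim this source was retrieved to support

# Synthesis surface: claim-level (not paragraph-level) binding, answering
# "where in the source does this specific sentence come from?"
@dataclass
class ClaimBinding:
    sentence: str                    # one specific sentence in the synthesis
    passages: list[tuple[str, str]]  # (source_id, passage excerpt) pairs a reader can open

# Deliverable surface: the artifact carries its bindings through editing,
# answering "does the final document still trace to evidence?"
@dataclass
class Deliverable:
    paragraphs: list[str]
    bindings: dict[int, list[ClaimBinding]] = field(default_factory=dict)  # paragraph index -> claims
```

The shape matters more than the names: the deliverable cites by `source_id` into the same records the synthesis used, which is what keeps the third surface auditable after edits.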
Deep research vs deep thinking: different LLM modes for different jobs
A confusion related to the product-class problem: “deep research” is often used interchangeably with “extended thinking” or “reasoning models.” They are different.
Deep research, in the product sense, is retrieval-heavy: the agent loops through multiple search queries, evaluates retrieved documents, retrieves more in response to gaps, and writes a report grounded in retrieved material. The compute spent is mostly on searching, reading, and re-searching. OpenAI Deep Research, Gemini Deep Research, Perplexity Pro Search, and Claude Research are all in this category.
Extended thinking is reasoning-heavy: the model spends compute on internal deliberation before answering, often without retrieval. OpenAI’s o1 and o3, Anthropic’s Claude with extended thinking, and DeepSeek’s R1 family are in this category. They are good at problems where the answer requires multi-step reasoning over what the model already knows; they are not, by themselves, good at problems that require current external information.
The two modes can be combined (some products run reasoning over retrieved material), but the failure modes differ. Retrieval-heavy systems fail at hallucinated citations and stale sources; reasoning-heavy systems fail at confident-but-wrong claims about facts the training data does not contain. A workflow that uses the wrong mode for the job — reasoning model for current-events research, retrieval model for math proofs — fails predictably.
The research-to-deliverable gap is where most projects fail
The single most common failure in AI deep research is not in the research. It is in the gap between the research output and the actual deliverable.
Deep research produces a report — typically a long-form artifact with headings, citations, and synthesis. The actual deliverable is usually different: a thesis chapter, a partner-ready category brief, a journal article, an internal memo, a slide deck. The shape of the deliverable is not the shape of the research output. The research output is scaffolding; the deliverable is what the reader actually opens.
The gap is where citations get detached. The user copies useful sections from the research report into the deliverable. Citations either come along as plain hyperlinks (which lose metadata), get reformatted by hand (which introduces errors), or get dropped entirely (which produces uncited claims). By the time the deliverable ships, the audit trail that existed in the research artifact has degraded by 10–30% on average.
A workflow that survives this gap treats the research output and the deliverable as views of the same source-backed object, not as separate documents linked by copy-paste. Citations in the deliverable have to resolve to the same source records that the research output used. This is what the academic research workflow and the strategy research workflow both describe in their respective domains.
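A minimal sketch of what “views of the same source-backed object” can mean in practice: both the research output and the deliverable resolve citations through one shared store keyed by a stable ID, so copy-paste cannot silently detach a citation. The store layout and `render_citation` helper are hypothetical, and the URL is a placeholder.

```python
from datetime import datetime

# One shared store; the research output and the deliverable both cite
# by stable source_id rather than by pasted hyperlink.
source_store = {
    "src-001": {"url": "https://example.org/report", "retrieved_at": datetime(2026, 1, 15)},
}

def render_citation(source_id: str) -> str:
    # A missing ID raises here, surfacing a detached citation at render
    # time instead of shipping an artifact that quietly lost its audit trail.
    rec = source_store[source_id]
    return f'{rec["url"]} (retrieved {rec["retrieved_at"]:%Y-%m-%d})'

print(render_citation("src-001"))  # both views call the same resolver
```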
Common failure modes of AI deep research
Five patterns recur in deep research workflows that did not survive review.
Source-count chasing. Optimizing for “the agent consulted N sources” instead of “every claim is verifiable.” The fix is to set a verification target (e.g. 100% of quantitative claims audited) and let source count be whatever it ends up being.
Hallucinated synthesis with real citations. Each citation resolves to a real source, but the synthesis paragraph claims something the cited sources do not actually say. The fix is claim-citation match auditing, not just URL resolution.
Detached citations after editing. The research output had a clean citation graph; the deliverable does not, because revision broke citation-paragraph binding. The fix is a workspace where citations move with claims through edits, not a workspace where citations are pasted hyperlinks.
Stale source contamination. Time-sensitive claims (pricing, regulation, customer counts) cited from sources retrieved months ago. The fix is per-source freshness windows and re-fetch on use.
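A sketch of that fix, per-source freshness windows checked at the moment a source is used; the window values are illustrative assumptions and would be tuned per project.

```python
from datetime import datetime, timedelta

# Illustrative windows: volatile claim types get short windows.
FRESHNESS_WINDOWS = {
    "pricing": timedelta(days=7),
    "regulation": timedelta(days=30),
    "customer_counts": timedelta(days=30),
    "academic": timedelta(days=365),
}

def needs_refetch(claim_type: str, retrieved_at: datetime) -> bool:
    """True when a source is older than the window for its claim type."""
    window = FRESHNESS_WINDOWS.get(claim_type, timedelta(days=90))
    return datetime.now() - retrieved_at > window
```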
The “research never ends” failure. The user keeps retrieving and never produces a deliverable. The fix is a draft-first discipline: produce the deliverable shape early, let it reveal what evidence is missing, and retrieve in response to drafted gaps rather than in advance.
A note from building Innogath
Building Innogath forced us to pick a side on this question. We chose the methodology over the product class, because the methodology survives a tool change. We ship the long-report output shape only when the brief calls for it; for most user projects, the actual output is a branching tree the user reads themselves, not a generated report they paste somewhere. That distinction, between the artifact the user reads in our workspace and the deliverable that leaves the workspace, is the part of the product that took longest to design.
Where Innogath fits
Innogath implements the methodological practice this guide describes, not the product class. The workspace produces all three audit surfaces (source, synthesis, deliverable), preserves citation-paragraph binding through editing, and treats the research output and the final deliverable as views of the same source-backed object rather than separate documents.
For methodologies that build on this foundation, see the academic research workflow pillar and the strategy research workflow pillar. For the citation reliability layer specifically, see AI research with citations and the systematic literature review with AI sub-cluster.
References
- OpenAI. Introducing Deep Research, February 2025.
- Google DeepMind. Gemini Deep Research, launched December 2024.
- arXiv. A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications, 2025.
- IBM. What are Agentic Workflows?
- Nature. Hallucinated citations are polluting the scientific literature, 2026. DOI 10.1038/d41586-026-00969-z.
- Nature. Can researchers stop AI making up citations?, 2025. DOI 10.1038/d41586-025-02853-8.
- INRA.AI. How to Prevent AI Citation Hallucinations.
- PRISMA. PRISMA 2020 statement. Page MJ et al., BMJ, 2021.
- Cochrane Training. Cochrane Handbook for Systematic Reviews of Interventions.
- Vannevar Bush. As We May Think, The Atlantic Monthly, 1945.