The original-content test for this topic
Most pages about AI literature reviews make the same mistake: they describe the stages of a systematic review, then add “AI can help” after each stage. That is not enough. It is a paraphrase of the method, not a useful method.
The practical question is different: what changes when AI is introduced into a review whose credibility depends on reproducibility? The answer is not speed alone. Speed is useful only when the machine makes the trail easier to inspect. If the review gets faster but the decisions become less visible, the work is weaker.
The audit-first rule is simple: every AI-assisted action must leave a record that a skeptical reviewer could follow later. A generated summary is not a record. A record is a search string, a database name, a timestamp, an include/exclude label, an extraction field, a source paragraph, or a versioned synthesis claim.
That is the difference between an AI-assisted systematic literature review and a search-engine article about AI tools.
Start with the protocol, not the prompt
A systematic review begins before the first search. It begins with a protocol that says what question the review answers, what evidence can answer it, where the evidence will be searched for, and how ambiguous cases will be handled.
For clinical and intervention questions, PICO is still the cleanest starting shape: population, intervention, comparison, outcome. For observational questions, PEO may fit better. For qualitative work, SPIDER may be more useful. The framework is less important than the constraint it creates: a paper should be eligible or ineligible because of written criteria, not because a model found it interesting.
AI should not invent the protocol. It can stress-test one.
Useful protocol prompts ask the model to find missing terms, hidden exclusions, conflicting criteria, and likely edge cases. For example: “Here is my inclusion rule. List five studies that might be borderline and explain why.” That kind of work makes the protocol stronger without handing the decision to the model.
The protocol should freeze four things before search:
- The research question in reviewable terms
- Inclusion and exclusion criteria
- Databases and supplementary sources to search
- What counts as a final human decision
If those change later, the change should be logged. A review can evolve, but an unlogged evolution is no longer reproducible.
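A minimal sketch of that frozen protocol, using hypothetical field names rather than any required schema; the point it illustrates is that amendments are appended to a log, never silently edited into the frozen fields.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Protocol:
    question: str                    # research question in reviewable terms
    inclusion: list[str]             # written inclusion criteria
    exclusion: list[str]             # written exclusion criteria
    sources: list[str]               # databases and supplementary sources
    final_decision_rule: str         # what counts as a final human decision
    amendments: list[str] = field(default_factory=list)

    def amend(self, change: str, when: date) -> None:
        """Log a change instead of silently rewriting the frozen fields."""
        self.amendments.append(f"{when.isoformat()}: {change}")

# Illustrative values only.
protocol = Protocol(
    question="Do AI tutoring systems improve learning outcomes in K-12?",
    inclusion=["empirical study", "K-12 population", "learning outcome measured"],
    exclusion=["opinion piece", "higher-education-only sample"],
    sources=["PubMed", "ERIC", "Scopus", "hand search of included references"],
    final_decision_rule="two reviewers agree; a third resolves conflicts",
)
protocol.amend("added ERIC after the pilot search missed education journals", date(2024, 3, 2))
```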
Treat search as an artifact
The search is not a discovery session. It is an artifact that has to be reported.
PRISMA 2020 provides a 27-item reporting checklist and flow diagrams for systematic reviews. Cochrane’s search guidance emphasizes that review searches should be systematic, comprehensive, documented, and designed to minimize bias. Those principles matter more in AI-assisted work because an AI tool can hide a lot of decisions behind a clean interface.
The search artifact should contain:
- The database or source searched
- The exact search string used there
- The date searched
- Filters or limits applied
- Export format and record count
- Deduplication method
- Known seed papers used to test recall
AI helps by expanding synonyms, suggesting controlled vocabulary, translating concepts across databases, and checking whether known relevant papers are retrieved. It should not be allowed to run an invisible search whose exact query cannot be exported.
The useful test is brutal: if a second researcher cannot rerun the same search from your notes, the AI step was not review-grade.
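One way to make that test concrete is to store each search as a structured record rather than a note. A minimal sketch with hypothetical field names; the recall check against known seed papers is the part AI tools most often skip.

```python
from dataclasses import dataclass

@dataclass
class SearchRecord:
    source: str             # database or source searched
    query: str              # exact search string as run
    date_run: str           # ISO date of the search
    filters: list[str]      # limits applied (years, language, study type)
    export_format: str      # e.g. RIS, CSV
    record_count: int       # records retrieved before deduplication
    dedup_method: str       # how duplicates were removed
    seed_papers: list[str]  # known relevant IDs used to test recall

def recall_check(record: SearchRecord, retrieved_ids: set[str]) -> list[str]:
    """Return seed papers the search failed to retrieve; any miss means the query needs work."""
    return [pid for pid in record.seed_papers if pid not in retrieved_ids]
```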
Use AI screening as a triage assistant
Screening is where AI can save the most time and cause the most damage.
The good use case is structured triage. The tool reads a title and abstract, applies the written criteria, assigns a provisional label, and gives a short reason tied to the criteria. It does not make the final call silently. It creates a queue.
Use four labels instead of two:
- include: clearly eligible
- exclude: clearly outside criteria
- uncertain: needs human review
- duplicate/report: may be another report of the same study
That fourth label matters. Systematic reviews usually include studies, not merely reports. One clinical trial may appear as a protocol, conference abstract, registry entry, and journal article. If an AI tool treats each report as an independent study, the evidence base becomes inflated.
The audit record for screening should include the paper identifier, AI label, AI reason, human label, human reason if changed, and final decision. This gives you more than speed. It gives you disagreement data: where the model is confused, where the criteria are weak, and where the human reviewers need a rule.
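A sketch of that screening ledger, assuming hypothetical field names; the useful output is not the AI label itself but the rows where the AI and the human disagree.

```python
from dataclasses import dataclass
from typing import Optional

LABELS = {"include", "exclude", "uncertain", "duplicate/report"}

@dataclass
class ScreeningRow:
    record_id: str
    ai_label: str
    ai_reason: str                       # must point at a written criterion
    human_label: str
    human_reason: Optional[str] = None   # required only when the human overrides

    def __post_init__(self) -> None:
        if self.ai_label not in LABELS or self.human_label not in LABELS:
            raise ValueError("labels must come from the controlled set")

    @property
    def final(self) -> str:
        return self.human_label          # the human decision is always the final one

def disagreements(rows: list[ScreeningRow]) -> list[ScreeningRow]:
    """Rows where AI and human diverge: raw material for sharpening the criteria."""
    return [r for r in rows if r.ai_label != r.human_label]
```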
Keep exclusion reasons boring and controlled
Exclusion logs fail when the reasons become prose.
“Not relevant” is not an exclusion reason. “Wrong population” is. “Wrong outcome” is. “Not empirical” is. “Duplicate report of included study” is. Controlled reasons make the final PRISMA flow and supplementary table coherent.
AI can help by mapping its draft reason to one controlled reason. It can also flag when none fits, which usually means the protocol needs a new rule or the paper is genuinely borderline.
The best exclusion log is boring:
| Field | Example |
|---|---|
| Record ID | DOI, PMID, arXiv ID, or database ID |
| Stage | title/abstract or full-text |
| Final decision | exclude |
| Controlled reason | wrong outcome |
| Human note | measures engagement, not learning outcome |
| Source checked | abstract / full text |
Boring logs are not glamorous. They are what make a review defensible.
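The mapping step can be enforced rather than hoped for. A minimal sketch, assuming a hypothetical controlled vocabulary: a draft reason either maps to one controlled reason or gets flagged for a protocol discussion.

```python
CONTROLLED_REASONS = {
    "wrong population",
    "wrong outcome",
    "not empirical",
    "duplicate report of included study",
}

def map_exclusion(draft_reason: str, proposed: str) -> str:
    """Accept only controlled reasons; anything else means the protocol needs a new rule."""
    if proposed in CONTROLLED_REASONS:
        return proposed
    raise ValueError(
        f"'{proposed}' is not a controlled reason (draft was: {draft_reason!r}); "
        "add a rule to the protocol or treat the paper as borderline"
    )
```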
Build an evidence packet for every included study
The smallest useful unit in an AI-assisted review is not a paper summary. It is an evidence packet.
An evidence packet is a compact, source-linked record for one included study. It contains the citation, eligibility reason, study design, population, intervention or exposure, comparator, outcomes, extraction fields, risk-of-bias notes, and the exact passages that support the extracted values. It also contains one short “do not overstate” note: the limit this study places on your synthesis.
That last note is where original judgment enters the workflow. AI tools tend to extract what a study says. Reviewers need to record what the study does not allow them to say. A randomized trial in one country does not automatically support a global implementation claim. A qualitative interview study can support a mechanism or experience claim, not a pooled effect estimate. A preprint can be a useful signal, but it should be labeled differently from peer-reviewed evidence.
The packet model makes later writing safer. When a synthesis branch claims that evidence is strong, mixed, thin, or context-dependent, the writer is reading from packets that already carry the methodological limit. This reduces the risk that a fluent AI draft turns weak evidence into confident prose.
Innogath should encourage this shape directly: one study, one packet, one set of citations, one verification status. The final literature review becomes a synthesis of packets, not a rearranged pile of summaries.
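A minimal packet sketch with hypothetical field names; the "do not overstate" note is deliberately a required field, not an optional comment.

```python
from dataclasses import dataclass

@dataclass
class EvidencePacket:
    citation: str
    eligibility_reason: str
    study_design: str
    population: str
    intervention: str
    comparator: str
    outcomes: list[str]
    extraction: dict[str, str]            # field name -> extracted value
    supporting_passages: dict[str, str]   # field name -> exact source passage
    risk_of_bias_notes: str
    do_not_overstate: str                 # the limit this study places on the synthesis
    verification_status: str = "unverified"
```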
Extract data with provenance, not just columns
Data extraction is where many AI review workflows look impressive and become fragile. A model can fill a table quickly. The question is whether every cell in that table can be defended.
A review-grade extraction table needs three layers:
- The extracted value
- The source passage or page that supports it
- The verification status
For example, "sample size: 184" is not enough on its own. The cell should carry the paper ID, the page or section, and whether a human checked the value. If the paper reports multiple samples, attrition, subgroup analyses, or adjusted models, AI extraction becomes a suggestion, not a fact.
For low-risk fields, sample verification may be enough. For outcome fields, effect sizes, intervention definitions, bias judgments, or anything that enters a meta-analysis, the human check should be complete. A single wrong effect size can change a conclusion.
In practice, extraction works best as a double-entry ledger:
- AI drafts fields from the full text.
- A human verifies the fields against the source.
- Corrections are saved, not overwritten.
- The synthesis only reads verified fields.
That last rule prevents a common failure: polished synthesis built on unverified extraction.
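A sketch of that rule, with hypothetical names: each extraction cell carries its value, its provenance, and its status, and the synthesis layer refuses to read anything unverified.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedCell:
    paper_id: str
    field: str                       # e.g. "sample_size"
    value: str                       # e.g. "184"
    source_passage: str              # exact text, page, or section supporting the value
    verified: bool = False
    correction: Optional[str] = None # saved alongside the AI draft, never overwritten

def verified_table(cells: list[ExtractedCell]) -> dict[tuple[str, str], str]:
    """Only verified cells reach the synthesis; corrections take precedence over drafts."""
    return {
        (c.paper_id, c.field): (c.correction or c.value)
        for c in cells
        if c.verified
    }
```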
Synthesize by branch, not by summary
The AI default is to summarize. A systematic review does not need more summaries. It needs synthesis.
Synthesis asks what the included studies collectively show, where they disagree, what is uncertain, and why the evidence should or should not change a decision. A good AI-assisted workflow separates those questions into branches.
For a review of AI tutoring systems, the branches might be:
- Learning outcomes
- Study design and risk of bias
- Age group differences
- Implementation context
- Cost and teacher workload
- Open gaps
Each branch reads from the same verified extraction table but asks a different question. This prevents one long AI-generated narrative from flattening disagreements. It also makes the final writing easier to audit: a paragraph about implementation context should trace to the studies in that branch, not to a general model memory.
The branch model is especially useful when the review mixes quantitative and qualitative material. Quantitative fields can feed tables and meta-analysis software. Qualitative observations can feed thematic branches. The final paper can then state exactly which kind of synthesis supports each claim.
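A branch can be as simple as a named question plus a filter over the verified table. A sketch with hypothetical branch names and fields: every branch reads the same verified data, so a paragraph can be traced back to exactly the studies in its branch.

```python
from typing import Callable

# Each branch is a question plus a predicate over verified extraction rows.
# Branch names and row fields here are illustrative, not prescribed.
BRANCHES: dict[str, Callable[[dict], bool]] = {
    "learning_outcomes": lambda row: row.get("outcome_type") == "learning",
    "risk_of_bias": lambda row: "bias_rating" in row,
    "implementation_context": lambda row: row.get("setting") is not None,
}

def branch_studies(branch: str, verified_rows: list[dict]) -> list[dict]:
    """Studies feeding one synthesis branch; the written paragraph should cite only these."""
    return [row for row in verified_rows if BRANCHES[branch](row)]
```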
Write claims in evidence grades
AI drafts often fail because every sentence sounds equally confident. Systematic reviews need graded claims.
Use four claim levels while drafting:
- descriptive: what the included studies report
- comparative: how findings differ across study types or contexts
- inferential: what pattern the evidence supports
- decision-facing: what a reader should do or believe differently
Each level needs a different evidence threshold. A descriptive claim can cite one study. A comparative claim needs a visible set. An inferential claim needs attention to quality and consistency. A decision-facing claim needs the strongest support and the clearest caveats.
This is a practical guardrail for AI writing. Ask the tool to label each draft paragraph by claim level, then verify whether the citations match that level. If a paragraph makes a decision-facing claim from two weak studies, the fix is not better wording. The fix is demotion: make it descriptive or state the uncertainty.
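That check can be made mechanical before it is made editorial. A sketch with hypothetical, illustrative thresholds: each draft paragraph carries a claim level, and the level has to be affordable given the citations behind it.

```python
# Minimum evidence per claim level; the numbers are illustrative, not prescriptive.
MIN_CITATIONS = {"descriptive": 1, "comparative": 2, "inferential": 3, "decision-facing": 4}

def check_claim(level: str, citation_ids: list[str], any_high_risk_of_bias: bool) -> str:
    """Return 'ok' or the demotion the paragraph needs."""
    if len(citation_ids) < MIN_CITATIONS[level]:
        return f"demote: only {len(citation_ids)} citations for a {level} claim"
    if level == "decision-facing" and any_high_risk_of_bias:
        return "demote: decision-facing claims cannot rest on high risk-of-bias studies"
    return "ok"
```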
This also improves search visibility because it creates information gain. Many pages say “AI helps synthesize literature.” A useful page explains how to prevent AI from flattening evidence strength.
Do not outsource quality assessment
Risk-of-bias assessment is judgment-heavy. AI can prepare it; it should not own it.
The tool can identify where a paper discusses randomization, blinding, allocation, attrition, missing outcomes, or conflicts of interest. It can quote the relevant passage and suggest which domain might be affected. The reviewer still decides the rating.
That distinction matters because quality assessment is not extraction. It is interpretation. Two studies can report the same method and still deserve different judgments because of context, reporting clarity, or the outcome being analyzed.
The useful AI output is therefore not “low risk of bias.” It is:
- Domain considered
- Source passage surfaced
- Possible concern
- Human rating
- Human rationale
That record gives an editor something to inspect. A naked AI rating does not.
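A sketch of that record, assuming hypothetical domain names: the AI fields stop at the surfaced passage and the possible concern, and the record is incomplete until a human rating and rationale exist.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BiasAssessment:
    paper_id: str
    domain: str                          # e.g. "randomization", "attrition"
    source_passage: str                  # what the AI surfaced
    possible_concern: str                # what the AI suggested might be affected
    human_rating: Optional[str] = None   # e.g. "low" / "some concerns" / "high"
    human_rationale: Optional[str] = None

    def is_complete(self) -> bool:
        """An assessment without a human rating and rationale is not a rating at all."""
        return bool(self.human_rating and self.human_rationale)
```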
Write the methods note while you work
Do not wait until submission to remember what the AI did.
The methods note should be written during the review. It should say:
- Which AI tools were used
- Which stages they touched
- What data was uploaded or connected
- What the tool generated
- What humans verified
- How conflicts were resolved
- Where the search strings, logs, and extraction sheets are stored
For an AI-assisted systematic review, the disclosure is not a confession. It is part of the method. A reader should be able to distinguish between the machine-assisted tasks and the human conclusions.
This also protects the review from overclaiming. If AI only helped with search-string expansion and first-pass screening, say that. If it drafted synthesis paragraphs that were rewritten and source-checked, say that. Vague disclosure is weaker than specific disclosure.
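The disclosure itself can be kept as a structured record maintained during the review rather than a paragraph reconstructed at submission time. A sketch with hypothetical, illustrative values:

```python
# Stage-level AI disclosure; every value here is a placeholder to be replaced.
AI_DISCLOSURE = {
    "tools": ["<tool name and version>"],
    "stages": {
        "search": "AI expanded synonyms; final strings written and run by reviewers",
        "screening": "AI assigned provisional labels; all final decisions made by humans",
        "extraction": "AI drafted fields; outcome fields verified in full against sources",
        "synthesis": "AI drafted paragraphs; rewritten and source-checked by authors",
    },
    "data_shared_with_tools": "titles, abstracts, and full texts of retrieved records",
    "conflict_resolution": "third reviewer resolved screening disagreements",
    "artifact_location": "<where search strings, logs, and extraction sheets are stored>",
}
```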
A note from building Innogath
The ‘evidence packet’ as the primary unit came from watching early-access users do their own lit reviews inside Innogath. Users who organized work as free-floating notes lost track of what each note was for after about two weeks. Users who created a packet per paper — citation, eligibility reason, source passage, verification status — could pick up the project after a long pause without re-reading anything. We rebuilt the workspace’s primary unit around the packet for that reason.
Where Innogath fits
Innogath should not be positioned as a magic literature-review writer. The stronger claim is narrower: it is a workspace for preserving the review trail while AI accelerates the mechanical work.
The Innogath pattern for this page is:
- Create a parent page for the protocol.
- Add one branch for each search concept and database.
- Save search strings, dates, and result counts inside the branch.
- Use screening branches for include, exclude, uncertain, and duplicate/report.
- Keep extraction tables linked to source passages.
- Draft synthesis branches only from verified extractions.
- Export the review with citations and a supplementary trail.
That is a product-specific workflow, not a paraphrase of PRISMA. It explains why branching pages and cited reports matter for a systematic review: they keep decisions inspectable after the writing starts.
For the broader academic workflow, see the academic research workflow and the AI literature review tool use case. For the product mechanics behind the citation trail, see cited AI research reports and branching research pages.
Red flags before publication
An AI-assisted review is not ready if any of these are true:
- The search strings cannot be exported.
- The model changed inclusion criteria without a logged protocol amendment.
- Exclusion reasons are free-text paragraphs instead of controlled reasons.
- Extracted fields are not tied to source passages.
- Quality assessment ratings have no human rationale.
- The synthesis cites AI summaries instead of primary studies.
- The methods section says “AI was used” without stage-level detail.
- The PRISMA flow cannot be reconciled with the screening log.
These are not cosmetic issues. They are the points where the review stops being reproducible.
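The last red flag is also the easiest to automate. A minimal sketch, assuming hypothetical label and count fields: the flow numbers have to be derivable from the screening log, not typed in by hand.

```python
def reconcile_flow(screening_log: list[dict], flow_counts: dict) -> list[str]:
    """Compare reported flow counts against the screening log; any mismatch blocks publication."""
    problems = []
    counted: dict[str, int] = {}
    for row in screening_log:  # each row carries a 'final' label from the controlled set
        counted[row["final"]] = counted.get(row["final"], 0) + 1
    for label in ("include", "exclude", "uncertain", "duplicate/report"):
        if counted.get(label, 0) != flow_counts.get(label, 0):
            problems.append(
                f"{label}: log says {counted.get(label, 0)}, flow says {flow_counts.get(label, 0)}"
            )
    return problems
```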
References
The workflow above is based on the reporting and search principles in the PRISMA 2020 statement and the Cochrane Handbook chapter on searching and selecting studies, especially their emphasis on transparent search, documented selection, and reproducibility.
It also reflects Google’s spam policies: pages generated or rewritten at scale without original value can fall under scaled content abuse or scraping patterns. This page is therefore written as a product-specific audit workflow rather than a generic restatement of what systematic reviews are.