The workflow I use to write articles with 4 AIs in a loop until consensus

Published on 5/7/2026

A Verification System

I've built a workflow based on four language models working in sequence and in parallel: Qwen Chat, DeepSeek Chat, Gemini Reasoning Chat, and Claude Chat. It's not full automation—the final review remains mine—but it's a system that helps me reduce factual errors before publication.

The starting point is practical: for my use case, Qwen Chat and DeepSeek Chat offer generous usage allowances, without the strict limits I run into with other tools. This lets me use them intensively without interruptions.

Before you start: what you need (and what you don't)

Replicating this approach doesn't require APIs, code, or complex automation. I use the web chat interfaces of the four models, with manual copy-paste between phases. It sounds cumbersome, but it's an advantage: it forces me to read every step, maintain control, and avoid blind delegation.

The tools:

Qwen Chat: Primary model; handles long contexts well and structures content coherently
DeepSeek Chat: Analytical like Qwen, conducts thorough research, tends to be fairly objective in evaluations
Gemini Reasoning Chat: Excellent for initiating structured research; tends to be less critical during re-evaluations
Claude Chat: Particularly attentive in flagging unwarranted generalizations and critical nuances

The workflow architecture, step by step

The process is cyclical rather than linear: research, writing, critique, and revision repeat until the reviewing models agree that no macroscopic errors remain.

Phase 1: Distributed research. I pose the same research question to Qwen, DeepSeek, Gemini Reasoning, and Claude. Each explores the topic from different angles, retrieves different data, and highlights aspects the others might overlook.

Phase 2: Aggregation in Qwen. I take all research results and feed them as context into Qwen, which serves as the primary model. Qwen processes the material, identifying overlaps, contradictions, and gaps.

Phase 3: First draft. I ask Qwen to write the article's first draft based on the aggregated data. The draft already includes citations, figures, and references that emerged from the multi-model research.

Phase 4: Cross-critical analysis. I send the draft to DeepSeek, Gemini Reasoning, and Claude with a specific prompt: "Analyze this article looking for factual errors, questionable interpretations, data unsupported by sources, or excessive generalizations."

Phase 5: Iteration on critiques. The critiques from the three AIs return to Qwen, which analyzes them, updates the article, corrects errors, removes unsupported claims, and adds nuance where needed.

Phase 6: Re-evaluation. The updated article is sent back to the three AIs for a second review. The cycle continues until DeepSeek, Gemini Reasoning, and Claude agree that the article contains no macroscopic errors.

Phase 7: Human review. At this point, I step in. I reread the entire article, personally recheck the most important data, or ask the AIs to run targeted searches that lead to original sources. If I see misinterpreted data, I remove or modify it.

Only after this process do I publish the article.
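I run all of this by hand with copy-paste, so no code is involved in practice. Still, the control flow of Phases 4 to 6 can be written down as a minimal sketch. Everything named here is hypothetical: the reviewer and reviser functions stand in for pasting text into a chat window and reading the answer, and the `max_rounds` cap is my own assumption, since the workflow above sets no fixed limit.

```python
from typing import Callable, Dict, List, Tuple

# A reviewer takes the draft and returns the issues it flags.
# An empty list means "no macroscopic errors found".
Reviewer = Callable[[str], List[str]]
# A reviser takes the draft plus the open critiques and returns a new draft.
Reviser = Callable[[str, Dict[str, List[str]]], str]


def review_until_consensus(
    draft: str,
    reviewers: Dict[str, Reviewer],
    revise: Reviser,
    max_rounds: int = 5,  # assumed safety cap, not part of the original workflow
) -> Tuple[str, int, bool]:
    """Repeat critique and revision until no reviewer flags an issue.

    Returns (final_draft, rounds_used, consensus_reached).
    """
    for round_number in range(1, max_rounds + 1):
        # Phase 4 / Phase 6: every reviewer reads the current draft.
        critiques = {name: review(draft) for name, review in reviewers.items()}
        open_issues = {name: found for name, found in critiques.items() if found}
        if not open_issues:
            return draft, round_number, True
        # Phase 5: the primary model updates the draft from the critiques.
        draft = revise(draft, open_issues)
    return draft, max_rounds, False


# Toy stand-ins for the chat sessions, only to show the loop running.
def flags_unsourced_figure(text: str) -> List[str]:
    return ["figure '90%' has no source"] if "90%" in text else []


def flags_nothing(text: str) -> List[str]:
    return []


def remove_flagged_figure(text: str, critiques: Dict[str, List[str]]) -> str:
    return text.replace("90% of", "many")


final, rounds, agreed = review_until_consensus(
    "Surveys show 90% of readers skim.",
    {
        "DeepSeek": flags_unsourced_figure,
        "Gemini Reasoning": flags_nothing,
        "Claude": flags_unsourced_figure,
    },
    remove_flagged_figure,
)
print(final, rounds, agreed)
```

The loop ends in one of two ways: every reviewer returns an empty list, or the round cap is hit. In both cases the result still goes to Phase 7, because agreement among the reviewers only means that none of them flagged anything.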

Why use different models (in my experience)

This isn't an exercise in complexity for its own sake. In my usage experience, different models show different tendencies—not immutable laws, but recurring patterns that become useful when orchestrated.

When all four converge on a claim, my confidence in that claim increases. When they diverge, that's where attention is needed—or direct human verification against primary sources.
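The convergence rule can be sketched in a few lines, under the assumption that I have already noted by hand whether each model supports each claim. The function name and the sample claims are hypothetical, purely for illustration.

```python
from typing import Dict, List, Tuple


def split_by_agreement(
    verdicts: Dict[str, Dict[str, bool]],
) -> Tuple[List[str], List[str]]:
    """Separate claims every model supports from claims needing a closer look.

    `verdicts` maps claim -> {model name: True if that model supports it}.
    """
    converged: List[str] = []
    needs_check: List[str] = []
    for claim, by_model in verdicts.items():
        # A claim converges only if at least one verdict exists and all agree.
        if by_model and all(by_model.values()):
            converged.append(claim)
        else:
            needs_check.append(claim)
    return converged, needs_check


verdicts = {
    "Claim A": {"Qwen": True, "DeepSeek": True, "Gemini Reasoning": True, "Claude": True},
    "Claim B": {"Qwen": True, "DeepSeek": False, "Gemini Reasoning": True, "Claude": True},
}
converged, needs_check = split_by_agreement(verdicts)
print(converged)    # ['Claim A']
print(needs_check)  # ['Claim B']
```

A single dissenting model is enough to move a claim into the second list, which matches how I treat divergence: as a signal to go and check a primary source.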

The limits of the method, honestly

There are two limitations to keep in mind.

First, AIs can hallucinate sources. Even when explicitly asked for citations and links, models sometimes generate references that don't exist or misinterpret complex data. That's why Phase 7—human verification against primary sources—isn't optional. It's essential.

The second is illusory consensus: if all the models share the same training bias or the same informational gap, they can agree on an error. In my experience this is rare, but it happens. To mitigate the risk, on more sensitive topics I run targeted searches that explicitly ask "take me to the original source", and I verify the result personally.

Finally, a note on measurement: I don't have rigorous quantitative data on how many errors this method has caught compared to a traditional workflow. I can say, from direct observation, that the loop has flagged imprecisions I otherwise would have published. But it's not an exact science—it's an evolving practice.

When it makes sense (and when it doesn't)

This approach works well for analytical content requiring factual accuracy: geopolitical articles where demographic, economic, and historical data must be cross-referenced; analyses of tech trends where it's easy to confuse marketing announcements with real data; complex scientific topics where nuance matters.

It makes less sense for purely opinion-based content, where factual accuracy is less central; for breaking news requiring publication within minutes; or for creative texts, where factual verification isn't the primary goal—and indeed, that's an area I don't venture into.

Time required: what it really means

Writing a well-documented article with dozens of cross-referenced sources is an immense amount of work. In a newsroom, there would be a team of fact-checkers, an editor, time, and resources. I'm solo and short on time. The real comparison isn't between "article written in 1 day without AI" and "article written in 2 hours with AI". For me, the actual choice is between writing the article, with the systematic help of multiple models checking each other, and not writing it at all.

Of course, this method only works if I'm working on topics where I already have sufficient expertise to evaluate data and implications. I don't venture into unfamiliar territory without first studying it for months. AIs don't replace my foundational knowledge: they augment it, making possible what would otherwise require a newsroom—and costs and timelines I could never afford.

In summary

It's not a perfect system. It requires time, discipline, and awareness that no model, alone or in combination, is infallible. But it's a pragmatic approach that acknowledges the limitations of AI models while still using those models systematically to improve accuracy.

The key isn't total automation. It's distributed intelligence: four models checking each other, plus a human maintaining final control over sources. If you write analytical content and value accuracy but can't afford a newsroom's resources, it's worth testing a workflow like this. You don't need complex infrastructure—just four browser tabs open and the discipline to iterate until consensus emerges.
