
Pipeline Flows

All flows are orchestrated with Prefect 3.x and defined in src/pipeline/.

Flow Overview

| Flow | Module | Trigger | Description |
|------|--------|---------|-------------|
| generate_briefs | brief_flow.py | Manual | Creates article briefs from backlog items |
| produce_article | article_flow.py | Manual | Full lifecycle: research, write, quality gates, review |
| batch_produce | batch_flow.py | Manual | Parallel article production with bounded concurrency |
| resume_article | resume_flow.py | Manual | Resumes article production from current status |
| recover_stuck_articles | resume_flow.py | Every 30 min | Periodic sweep: finds stuck articles and triggers resume |
| publish_articles | publish_flow.py | Manual | Publishes approved articles to WordPress as drafts |
| republish_articles | publish_flow.py | Manual | Re-publishes already-published articles (partial publish recovery) |
| research_news_scan | research_news_flow.py | Weekly (Sun 6AM UTC) | Scans RSS feeds, triages relevance, triggers digest production |
| produce_digests | research_news_flow.py | Sub-flow / Manual | Produces per-language digest articles from newsworthy items |

Article Lifecycle Status Machine

Articles progress through a defined state machine enforced by src/pipeline/status.py:

stateDiagram-v2
    [*] --> brief_generating
    brief_generating --> brief_pending
    brief_pending --> brief_approved
    brief_approved --> researching
    researching --> writing
    writing --> quality_check
    quality_check --> synthesizing
    synthesizing --> ready_for_review
    synthesizing --> writing: rewrite loop
    ready_for_review --> approved
    approved --> published
    brief_generating --> failed
    brief_pending --> failed
    brief_approved --> failed
    researching --> failed
    writing --> failed
    quality_check --> failed
    synthesizing --> failed
    ready_for_review --> failed
    approved --> failed

The brief_generating status is set when the brief generator agent begins work, preventing premature UI actions (approve/reject) on articles that do not yet have a brief.
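
The transition table implied by the diagram can be sketched as a plain dict plus a guard function. This is an illustrative reconstruction, not the actual contents of src/pipeline/status.py:

```python
# Hypothetical sketch of the transition table enforced by status.py;
# every non-terminal status may also transition to "failed".
VALID_TRANSITIONS = {
    "brief_generating": {"brief_pending", "failed"},
    "brief_pending": {"brief_approved", "failed"},
    "brief_approved": {"researching", "failed"},
    "researching": {"writing", "failed"},
    "writing": {"quality_check", "failed"},
    "quality_check": {"synthesizing", "failed"},
    "synthesizing": {"ready_for_review", "writing", "failed"},  # rewrite loop
    "ready_for_review": {"approved", "failed"},
    "approved": {"published", "failed"},
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the state machine allows current -> target."""
    return target in VALID_TRANSITIONS.get(current, set())
```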

Article Production Flow

The produce_article flow is the core pipeline. It chains the following steps:

graph TD
    A[Brief Approved] --> B[Researcher]
    B --> C[Writer]
    C --> D[Quality Gates]
    D --> E{All pass?}
    E -->|No| F[Synthesis]
    F --> G[Rewrite]
    G --> D
    E -->|Yes| H[Image Generation]
    H --> I[Ready for Review]

Quality gates run in parallel (config-driven per content type):

  • Factuality Checker -- validates medical claims against research dossier (mandatory)
  • SEO Optimizer -- deterministic checks + LLM evaluation of keyword usage, heading structure, meta descriptions (skipped for research_news)
  • Style Checker -- readability, structure compliance, voice/tone, AI-pattern detection, medical language (mandatory)

The synthesis agent reconciles feedback from all quality gates and produces a revised draft. If any gate still fails, the writer rewrites the article, for up to 3 iterations. Once all quality gates pass, image generation runs for satellite and cornerstone articles (it is skipped for research news).
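
The gate/synthesis/rewrite loop can be sketched as follows. The callables are stand-ins for the real agents, and the structure is an assumption based on the description above:

```python
MAX_REWRITE_ITERATIONS = 3  # iteration budget from the flow description

def run_gate_loop(draft, run_gates, synthesize, rewrite):
    """Run quality gates, then synthesize feedback and rewrite until all
    gates pass or the iteration budget is exhausted. `run_gates`,
    `synthesize`, and `rewrite` are stand-ins for the real agents."""
    for iteration in range(1, MAX_REWRITE_ITERATIONS + 1):
        results = run_gates(draft)            # parallel in the real flow
        if all(r["passed"] for r in results):
            return draft, iteration           # all gates passed
        feedback = synthesize(results)        # reconcile gate feedback
        draft = rewrite(draft, feedback)      # writer produces new draft
    return None, MAX_REWRITE_ITERATIONS       # budget exhausted -> failed
```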

Quality Gate Configuration

Active gates are configured per content type in src/config.py:

| Content Type | Active Gates |
|--------------|--------------|
| satellite | factuality, seo, style |
| cornerstone | factuality, seo, style |
| research_news | factuality, style |

Score thresholds are also content-type-specific:

| Content Type | Pass Threshold |
|--------------|----------------|
| satellite | 70 |
| cornerstone | 85 |
| research_news | 70 |
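
In code, this per-content-type configuration could look like the following sketch (the actual structure in src/config.py may differ):

```python
# Illustrative reconstruction of the gate config described in the tables above.
QUALITY_GATES = {
    "satellite": ["factuality", "seo", "style"],
    "cornerstone": ["factuality", "seo", "style"],
    "research_news": ["factuality", "style"],  # SEO gate skipped
}

PASS_THRESHOLDS = {
    "satellite": 70,
    "cornerstone": 85,
    "research_news": 70,
}

def gates_for(content_type: str) -> list[str]:
    """Active quality gates for a given content type."""
    return QUALITY_GATES[content_type]
```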

Langfuse Trace Integration

Each produce_article run creates a Langfuse root span with:

  • Session ID: article:{article_id} -- groups all LLM calls for one article
  • Tags: ["article", content_type, language] -- for filtering in the Langfuse UI
  • Quality scores: factuality, SEO, and style scores posted per iteration
  • Input/output: article context at start, pass/fail result at end

See Observability for the full Langfuse integration details.

Brief Generation Flow

The generate_briefs flow processes backlog items in parallel (bounded by a semaphore of 10):

  1. Load backlog item and content template
  2. Create article record in brief_generating status
  3. Run the brief generator agent
  4. Store brief and transition to brief_pending
  5. Link backlog item to article and update backlog status

On failure, the article transitions to failed status so it does not remain stuck in brief_generating.
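
The bounded parallelism can be sketched with an asyncio semaphore. This is a minimal stand-alone illustration, not the Prefect task structure used by the real flow:

```python
import asyncio

BRIEF_CONCURRENCY = 10  # semaphore bound from the flow description above

async def generate_brief(item: str, sem: asyncio.Semaphore) -> str:
    """Process one backlog item; at most BRIEF_CONCURRENCY run at once."""
    async with sem:
        await asyncio.sleep(0)  # stand-in for the agent call + DB writes
        return f"brief:{item}"

async def generate_briefs(items: list[str]) -> list[str]:
    """Fan out over all backlog items under a shared semaphore."""
    sem = asyncio.Semaphore(BRIEF_CONCURRENCY)
    return await asyncio.gather(*(generate_brief(i, sem) for i in items))
```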

Research News Flow

The research_news_scan flow runs weekly and delegates to a produce_digests sub-flow:

  1. Source scanning -- fetches new items from all active RSS/PubMed sources
  2. Triage -- AI classifies each item for newsworthiness (batches capped at 20 items to prevent structured output truncation)
  3. Pool-aware gate check -- counts the full unassigned newsworthy pool in the DB (cross-week), skips if fewer than 3 items total (not just this week's triage output)
  4. Digest production -- calls produce_digests sub-flow
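
Steps 2 and 3 above can be sketched as two small helpers. The constant name TRIAGE_BATCH_SIZE and both function names are illustrative, not taken from the codebase:

```python
TRIAGE_BATCH_SIZE = 20     # cap to prevent structured-output truncation
MIN_NEWSWORTHY_ITEMS = 3   # pool-aware gate minimum

def triage_batches(items: list) -> list[list]:
    """Split scanned items into batches of at most TRIAGE_BATCH_SIZE
    before handing them to the AI triage step."""
    return [items[i:i + TRIAGE_BATCH_SIZE]
            for i in range(0, len(items), TRIAGE_BATCH_SIZE)]

def should_produce_digests(unassigned_pool_count: int) -> bool:
    """Pool-aware gate: decide on the full cross-week unassigned pool
    in the DB, not just this week's triage output."""
    return unassigned_pool_count >= MIN_NEWSWORTHY_ITEMS
```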

Digest Batching (FIFO Pool)

The produce_digests sub-flow (also deployable standalone) batches the newsworthy pool into digests of 3-5 items each:

  1. Loads all newsworthy items from the database in FIFO order (discovered_at ASC, limit 100)
  2. Reconstructs triage metadata from persisted SourceItem fields for cross-week items (in-memory triage results from the current scan take precedence via a merge dict)
  3. Batches the pool: while remaining >= 3, slice up to 5 items per batch
  4. For each batch, produces one digest per language via the digest writer agent
  5. Runs quality gates (factuality + style) with a synthesis rewrite loop (up to 3 iterations)
  6. Items remaining below the 3-item minimum stay in the pool for the next scan

A single scan run may produce multiple digests per language if the pool is large enough. For example, 8 items produce 2 digests (5 + 3) per language.

Constants: MIN_NEWSWORTHY_ITEMS = 3, MAX_ITEMS_PER_DIGEST = 5.
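
The greedy FIFO batching described above can be sketched as (function name is illustrative; constants match the page):

```python
MIN_NEWSWORTHY_ITEMS = 3
MAX_ITEMS_PER_DIGEST = 5

def batch_pool(pool: list) -> tuple[list[list], list]:
    """Greedy FIFO batching: while at least MIN_NEWSWORTHY_ITEMS remain,
    take up to MAX_ITEMS_PER_DIGEST. Leftovers below the minimum stay
    in the pool for the next scan."""
    batches, remaining = [], list(pool)
    while len(remaining) >= MIN_NEWSWORTHY_ITEMS:
        batches.append(remaining[:MAX_ITEMS_PER_DIGEST])
        remaining = remaining[MAX_ITEMS_PER_DIGEST:]
    return batches, remaining
```

With 8 items this yields the 5 + 3 split mentioned above; with only 2 items, nothing is batched and both stay in the pool.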

Resume Flow

The resume_article flow resumes article production from whatever status the article is currently in. It reads the article's status and dispatches to the appropriate pipeline step:

  • brief_approved / researching — delegates to the full produce_article flow
  • writing — loads brief and dossier from DB, runs writer → quality gates → synthesis
  • quality_check / synthesizing — loads existing draft, runs quality gates → synthesis
  • ready_for_review — no-op (already complete)

For research news / digest articles, the resume flow reconstructs the digest brief and dossier from the database and runs the digest-specific gate loop.

Stuck Article Recovery

The recover_stuck_articles flow runs on a 30-minute schedule. It finds articles stuck in transitional statuses (researching, writing, quality_check, synthesizing) where updated_at is older than a threshold (default: 30 minutes) and triggers the resume-article deployment for each.

This uses a dedicated recovery work pool (recovery-pool) so recovery flows are never blocked by production work in the main content-pool.
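
The staleness check can be sketched over an in-memory stand-in for the DB query (function name and tuple shape are illustrative):

```python
from datetime import datetime, timedelta, timezone

STUCK_STATUSES = {"researching", "writing", "quality_check", "synthesizing"}
STALE_AFTER = timedelta(minutes=30)  # default threshold from the text

def find_stuck(articles, now=None):
    """Return ids of articles stuck in a transitional status past the
    threshold. `articles` is a list of (id, status, updated_at) tuples,
    standing in for the real DB query."""
    now = now or datetime.now(timezone.utc)
    return [
        article_id
        for article_id, status, updated_at in articles
        if status in STUCK_STATUSES and now - updated_at > STALE_AFTER
    ]
```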

Draft Validation

Before running quality gates, draft_validation.py performs deterministic structure checks (zero LLM cost):

  • Detects truncated drafts (ends mid-sentence)
  • Checks word count against minimum thresholds
  • Validates required structural elements from the content template

Failed drafts trigger an immediate rewrite without wasting expensive gate compute.
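
The first two checks can be sketched as follows; this is an assumption about the shape of the logic, and the real module also validates template-required structural elements:

```python
import re

def validate_draft(text: str, min_words: int = 300) -> list[str]:
    """Cheap deterministic pre-gate checks (zero LLM cost). Returns a
    list of problems; an empty list means the draft may proceed to the
    quality gates. Heuristics are illustrative."""
    problems = []
    # Truncation heuristic: a complete draft should end with terminal
    # punctuation, optionally followed by closing quotes/brackets.
    if text and not re.search(r"[.!?][\"')\]]*\s*$", text.strip()):
        problems.append("truncated: draft ends mid-sentence")
    if len(text.split()) < min_words:
        problems.append(f"too short: fewer than {min_words} words")
    return problems
```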

Republish Flow

The republish_articles flow re-publishes already-published articles to WordPress. It shares the same core publish logic as publish_articles but filters for PUBLISHED status and skips the atomic claim transition. Use this to fix partial publishes where some WordPress operations (taxonomy, Polylang, ACF) failed on the initial publish.

Supporting Modules

| Module | Purpose |
|--------|---------|
| status.py | Article status state machine with VALID_TRANSITIONS dict and transition_article() |
| quality_gates.py | Shared map_gate_results() helper for mapping parallel gate results to typed outputs |
| run_tracker.py | Pipeline run lifecycle tracking (start, complete, finish, stale detection) |
| draft_validation.py | Deterministic draft structure validation (truncation, word count, required elements) |
| resume_flow.py | Resume article production from current status + stuck article recovery |

Deployment Configuration

Flows are deployed via prefect.yaml at the repository root. Nine deployments are registered across two work pools:

Content Pool (production work):

  • research-news-scan -- weekly Sunday 06:00 UTC (concurrency: 1)
  • produce-digests -- manual trigger (concurrency: 1)
  • generate-briefs -- manual trigger (concurrency: 1)
  • batch-produce -- manual trigger (concurrency: 1)
  • produce-article -- manual trigger
  • resume-article -- manual trigger (concurrency: 1)
  • publish-articles -- manual trigger (concurrency: 1)
  • republish-articles -- manual trigger (concurrency: 1)

Recovery Pool (never blocked by production):

  • recover-stuck-articles -- every 30 minutes (concurrency: 1)

The recovery pool is separate so that stuck article recovery is never blocked by production flows filling the content pool.

See Deployment for setup instructions.