Tools

Agent tools provide external data access. Each tool exports a JSON schema constant (*_SCHEMA) and an async handler function.

Tool Overview

| Module | Function | Schema | Purpose |
|---|---|---|---|
| pubmed.py | search_pubmed() | SEARCH_PUBMED_SCHEMA | Search PubMed for research articles by keyword |
| pubmed.py | fetch_abstract() | FETCH_ABSTRACT_SCHEMA | Fetch full metadata for a PubMed article by PMID |
| pubmed.py | fetch_paper() | FETCH_PAPER_SCHEMA | Fetch by PMID or DOI |
| web_search.py | web_search() | WEB_SEARCH_SCHEMA | Search the web via Brave Search API |
| vault_reader.py | read_vault_notes() | READ_VAULT_NOTES_SCHEMA | Hybrid search (pgvector + full-text) over vault note embeddings |
| keyword_data.py | read_keyword_data() | READ_KEYWORD_DATA_SCHEMA | Look up backlog item by keyword and language |
| keyword_data.py | get_content_architecture() | GET_CONTENT_ARCHITECTURE_SCHEMA | Get cornerstone + satellite structure for a pillar |
| vault_chunking.py | chunk_note() | -- | Parse Obsidian markdown into embeddable chunks by heading boundary |
| source_scanner.py | scan_source() | -- | Scan external sources (PubMed, RSS, etc.) for new items |
| image_gen.py | generate_image() | GENERATE_IMAGE_SCHEMA | Generate featured images via DALL-E 3 |
| image_colors.py | build_dalle_prompt() | -- | Assemble DALL-E prompts with pillar color palettes and brand-consistent style |
| r2_storage.py | upload_image() | -- | Upload featured images to Cloudflare R2 object storage |
| r2_storage.py | download_image() | -- | Download image bytes from R2 by key |
| r2_storage.py | presigned_url() | -- | Generate presigned GET URL for R2 images (15 min TTL) |
| r2_storage.py | delete_images() | -- | Delete all images for an article from R2 (prefix-based) |
| clinical_trials.py | search_clinical_trials() | SEARCH_CLINICAL_TRIALS_SCHEMA | Search ClinicalTrials.gov for trials by condition |
| semantic_scholar.py | search_semantic_scholar() | SEARCH_SEMANTIC_SCHOLAR_SCHEMA | Search Semantic Scholar for academic papers by query |
| semantic_scholar.py | fetch_s2_paper() | FETCH_S2_PAPER_SCHEMA | Fetch full paper details (including abstract) by paper ID |
| citation_formatter.py | format_citation_label() | -- | Generate deterministic citation labels from source metadata (6-level fallback) |
| url_validator.py | validate_source_urls() | -- | HEAD-check source URLs; remove sources with unreachable URLs from dossier |
| source_enrichment.py | enrich_dossier_urls() | -- | Construct publication URLs from PMID/DOI identifiers when missing |
| sanitize.py | sanitize_external_content() | -- | Strip control characters, collapse whitespace, truncate |

Integration Pattern

Tools are wired into agents via wrapper functions that return JSON strings:

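# In the agent module
import json

# search_pubmed and SEARCH_PUBMED_SCHEMA are exported by pubmed.py

async def _wrap_search_pubmed(*, query: str, max_results: int = 50) -> str:
    """Run the PubMed search and serialize the result as a JSON string."""
    pmids = await search_pubmed(query, max_results=max_results)
    return json.dumps(pmids)

# Handlers keyed by tool name, plus the schemas registered alongside them
_TOOL_MAP = {"search_pubmed": _wrap_search_pubmed}
_TOOL_DEFINITIONS = [SEARCH_PUBMED_SCHEMA]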

All tool results are sanitized via sanitize_external_content() before being sent back to the LLM.
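
To illustrate how a name-to-handler map like _TOOL_MAP might be used, here is a minimal dispatch sketch. The dispatcher name (dispatch_tool_call), the echo tool, and the stand-in sanitizer are all hypothetical, and the point at which the real agents apply sanitization is an assumption.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict


def sanitize_external_content(text: str, max_chars: int = 10_000) -> str:
    """Stand-in for the real sanitizer (assumption): only truncates."""
    return text[:max_chars]


async def _wrap_echo(*, query: str) -> str:
    """Hypothetical tool wrapper; like the real wrappers, it returns a JSON string."""
    return json.dumps({"query": query})


_TOOL_MAP: Dict[str, Callable[..., Awaitable[str]]] = {"echo": _wrap_echo}


async def dispatch_tool_call(name: str, arguments: Dict[str, Any]) -> str:
    """Look up the handler by tool name, call it, and sanitize its output."""
    handler = _TOOL_MAP.get(name)
    if handler is None:
        # Report unknown tools as data instead of raising, so the caller
        # can pass the error back to the model.
        return json.dumps({"error": f"unknown tool: {name}"})
    raw = await handler(**arguments)
    return sanitize_external_content(raw)
```

Because the wrappers take keyword-only arguments, the arguments dict produced by a tool call can be splatted directly into the handler.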

Sanitization

The sanitize_external_content() function in src/tools/sanitize.py is a defense-in-depth measure applied to all external content before it reaches an LLM or the database. It:

  • Strips control characters (except newlines and tabs)
  • Collapses excessive whitespace runs
  • Truncates content beyond a configurable character limit

This reduces the risk of prompt injection via scraped web pages, RSS feeds, or PubMed abstracts, but it is one layer of defense and not a complete safeguard. As a second layer, external content is always placed in the user role of messages, never in system prompts.
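
The three steps listed above could be sketched as follows. This is not the code in src/tools/sanitize.py: the function name (sanitize_sketch), the regular expressions, and the 5000-character default are assumptions chosen for illustration.

```python
import re

# Control characters except tab (\x09) and newline (\x0a); DEL is included.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")
# Runs of two or more spaces or tabs.
_SPACE_RUNS = re.compile(r"[ \t]{2,}")
# Runs of three or more newlines (two or more blank lines).
_NEWLINE_RUNS = re.compile(r"\n{3,}")


def sanitize_sketch(text: str, max_chars: int = 5000) -> str:
    """Strip control characters, collapse whitespace runs, then truncate."""
    cleaned = _CONTROL_CHARS.sub("", text)
    cleaned = _SPACE_RUNS.sub(" ", cleaned)
    cleaned = _NEWLINE_RUNS.sub("\n\n", cleaned)
    return cleaned[:max_chars]
```

Truncation runs last so that the character limit applies to the cleaned text, not to input padded with whitespace.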

Academic Search Tools

ClinicalTrials.gov (clinical_trials.py)

Searches the ClinicalTrials.gov v2 API for clinical trials by condition. Returns NCT IDs, titles, statuses, start dates, and URLs. The researcher agent uses this to find ongoing and completed trials relevant to the article topic.

Semantic Scholar (semantic_scholar.py)

Searches the Semantic Scholar Academic Graph API for papers by query. Usage is two-step: search_semantic_scholar() returns lightweight results (title, date, citation count) without abstracts, then fetch_s2_paper() retrieves full details for selected papers.

Requests are rate-limited to one every 2 seconds to stay within the API tier's limits. Authentication via the S2_API_KEY environment variable is optional.
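
One way to enforce a minimum gap between requests in async code is a small lock-guarded limiter. The class below (MinIntervalLimiter) is a hypothetical sketch and not the module's actual mechanism; only the 2-second interval comes from the description above.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds between successive calls."""

    def __init__(self, min_interval: float = 2.0) -> None:
        self._min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last_call = float("-inf")  # the first call never waits

    async def wait(self) -> None:
        # The lock serializes concurrent callers so each one sees the
        # timestamp left by the previous caller.
        async with self._lock:
            delay = self._last_call + self._min_interval - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._last_call = time.monotonic()
```

A caller would `await limiter.wait()` immediately before each HTTP request. Creating the limiter inside the running event loop avoids loop-binding issues on older Python versions.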

Image Pipeline Tools

R2 Storage (r2_storage.py)

Cloudflare R2 object storage client for featured images. Images are stored at images/{article_id}/{ordinal}.png. Wraps synchronous boto3 S3 calls in asyncio.to_thread() for async compatibility.

Requires R2_ACCOUNT_ID, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, and R2_BUCKET_NAME environment variables.
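
The key layout and the asyncio.to_thread() wrapping described above can be sketched without a live bucket. FakeS3Client is a stand-in that mirrors only the keyword shape of boto3's put_object; image_key and upload_image_sketch are hypothetical names, and the real upload_image() signature may differ.

```python
import asyncio
from typing import Dict, Tuple


def image_key(article_id: str, ordinal: int) -> str:
    """Build the object key in the images/{article_id}/{ordinal}.png layout."""
    return f"images/{article_id}/{ordinal}.png"


class FakeS3Client:
    """In-memory stand-in for a synchronous boto3 S3 client (assumption)."""

    def __init__(self) -> None:
        self.objects: Dict[Tuple[str, str], bytes] = {}

    def put_object(self, *, Bucket: str, Key: str, Body: bytes, ContentType: str) -> dict:
        self.objects[(Bucket, Key)] = Body
        return {}


async def upload_image_sketch(
    client: FakeS3Client, bucket: str, article_id: str, ordinal: int, data: bytes
) -> str:
    """Run the blocking put_object call in a worker thread and return the key."""
    key = image_key(article_id, ordinal)
    await asyncio.to_thread(
        client.put_object,
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentType="image/png",
    )
    return key
```

Running the blocking call in a thread keeps the event loop free for other agent work while the upload is in flight. Grouping keys under images/{article_id}/ is also what makes prefix-based deletion possible.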

Image Colors (image_colors.py)

Assembles DALL-E prompts from pillar-specific color palettes and fixed template sections (style, composition, avoidance rules). Each content pillar (P1-P5) has a dedicated warm earth-tone color pair. The image generator agent calls build_dalle_prompt(scene_description, pillar) with the scene it composed from article context.
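
As a rough illustration of this kind of prompt assembly, the sketch below joins a scene description with a palette line and fixed template sections. The color names and template wording are placeholders, not the project's actual palettes or text, and build_dalle_prompt_sketch is a hypothetical name.

```python
from typing import Dict, Tuple

# Placeholder color pairs (assumption); the real P1-P5 palettes live in
# image_colors.py and are not reproduced here.
PILLAR_PALETTES: Dict[str, Tuple[str, str]] = {
    "P1": ("placeholder color A1", "placeholder color B1"),
    "P2": ("placeholder color A2", "placeholder color B2"),
    "P3": ("placeholder color A3", "placeholder color B3"),
    "P4": ("placeholder color A4", "placeholder color B4"),
    "P5": ("placeholder color A5", "placeholder color B5"),
}

# Fixed template sections (wording is illustrative only).
STYLE_SECTION = "Style: placeholder style rules."
COMPOSITION_SECTION = "Composition: placeholder composition rules."
AVOID_SECTION = "Avoid: placeholder avoidance rules."


def build_dalle_prompt_sketch(scene_description: str, pillar: str) -> str:
    """Combine the scene, the pillar palette, and the fixed sections."""
    if pillar not in PILLAR_PALETTES:
        raise ValueError(f"unknown pillar: {pillar!r}")
    primary, secondary = PILLAR_PALETTES[pillar]
    sections = [
        f"Scene: {scene_description.strip()}",
        f"Palette: {primary} and {secondary}.",
        STYLE_SECTION,
        COMPOSITION_SECTION,
        AVOID_SECTION,
    ]
    return "\n".join(sections)
```

Keeping the style, composition, and avoidance text fixed means only the scene and palette vary between articles, which is what keeps the images consistent across a pillar.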

Post-Research Processing Tools

Citation Formatter (citation_formatter.py)

Generates deterministic citation labels from ResearchSource metadata using a 6-level fallback chain: authors + year → publication + year → authors only → publication only → truncated title → [source]. Runs after the researcher agent returns, before the writer receives the dossier.
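
The fallback chain above can be expressed as a sequence of early returns. The SourceMeta dataclass, the "et al." author formatting, and the 40-character title limit are assumptions for this sketch; the real ResearchSource fields and label format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceMeta:
    """Hypothetical subset of ResearchSource metadata."""
    authors: List[str] = field(default_factory=list)
    year: Optional[int] = None
    publication: Optional[str] = None
    title: Optional[str] = None


def _author_text(authors: List[str]) -> str:
    if not authors:
        return ""
    return authors[0] if len(authors) == 1 else f"{authors[0]} et al."


def citation_label_sketch(src: SourceMeta, max_title_chars: int = 40) -> str:
    """Walk the six fallback levels in order and return the first match."""
    authors = _author_text(src.authors)
    if authors and src.year:            # 1. authors + year
        return f"{authors}, {src.year}"
    if src.publication and src.year:    # 2. publication + year
        return f"{src.publication}, {src.year}"
    if authors:                         # 3. authors only
        return authors
    if src.publication:                 # 4. publication only
        return src.publication
    title = (src.title or "").strip()
    if title:                           # 5. truncated title
        if len(title) <= max_title_chars:
            return title
        return title[:max_title_chars].rstrip() + "..."
    return "[source]"                   # 6. last resort
```

Because the function depends only on its input and checks the levels in a fixed order, the same metadata always yields the same label.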

URL Validator (url_validator.py)

HEAD-checks all source URLs in a research dossier. Removes sources where all URLs (both article_url and publication_url) are unreachable. Raises ValueError if all sources are removed. Reports removals to Sentry for monitoring.
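
The removal rule can be sketched with the reachability check injected as a callable, which keeps the example free of network access. The function name (validate_sources_sketch), the dict-based source shape, and the treatment of sources with no URLs at all (dropped here) are assumptions; Sentry reporting is omitted.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

Source = Dict[str, Optional[str]]


async def validate_sources_sketch(
    sources: List[Source],
    is_reachable: Callable[[str], Awaitable[bool]],
) -> List[Source]:
    """Keep sources that have at least one reachable URL."""
    kept: List[Source] = []
    for source in sources:
        urls = [
            url
            for url in (source.get("article_url"), source.get("publication_url"))
            if url
        ]
        # Check both URLs concurrently; one reachable URL is enough.
        if urls and any(await asyncio.gather(*(is_reachable(u) for u in urls))):
            kept.append(source)
    if sources and not kept:
        raise ValueError("all sources were removed: no reachable URLs")
    return kept
```

In a real validator, is_reachable would issue an HTTP HEAD request with a timeout; injecting it also makes the removal rule easy to unit-test.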

Source Enrichment (source_enrichment.py)

Constructs clickable publication_url values from identifiers when the researcher agent didn't populate them. Priority: PMID → PubMed URL, DOI → doi.org URL, fallback to article_url. Existing URLs are never overwritten.
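
The priority order above can be sketched as a small pure function. The dict-based source shape, the pmid and doi key names, and the function name are assumptions; the URL patterns are the standard public PubMed and doi.org forms.

```python
from typing import Dict, Optional

Source = Dict[str, Optional[str]]


def enrich_publication_url_sketch(source: Source) -> Source:
    """Return a copy with publication_url filled in when it is missing."""
    if source.get("publication_url"):
        # Existing URLs are never overwritten.
        return dict(source)
    enriched = dict(source)
    if source.get("pmid"):
        enriched["publication_url"] = f"https://pubmed.ncbi.nlm.nih.gov/{source['pmid']}/"
    elif source.get("doi"):
        enriched["publication_url"] = f"https://doi.org/{source['doi']}"
    elif source.get("article_url"):
        enriched["publication_url"] = source["article_url"]
    return enriched
```

Returning a copy instead of mutating the input keeps the step safe to re-run on the same dossier.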