Tools¶
Agent tools provide external data access. Tools exposed to the LLM export a JSON schema constant (*_SCHEMA) and an async handler function; modules marked -- in the Schema column below have no schema constant.
Tool Overview¶
| Module | Function | Schema | Purpose |
|---|---|---|---|
| pubmed.py | search_pubmed() | SEARCH_PUBMED_SCHEMA | Search PubMed for research articles by keyword |
| pubmed.py | fetch_abstract() | FETCH_ABSTRACT_SCHEMA | Fetch full metadata for a PubMed article by PMID |
| pubmed.py | fetch_paper() | FETCH_PAPER_SCHEMA | Fetch by PMID or DOI |
| web_search.py | web_search() | WEB_SEARCH_SCHEMA | Search the web via Brave Search API |
| vault_reader.py | read_vault_notes() | READ_VAULT_NOTES_SCHEMA | Hybrid search (pgvector + full-text) over vault note embeddings |
| keyword_data.py | read_keyword_data() | READ_KEYWORD_DATA_SCHEMA | Look up backlog item by keyword and language |
| keyword_data.py | get_content_architecture() | GET_CONTENT_ARCHITECTURE_SCHEMA | Get cornerstone + satellite structure for a pillar |
| vault_chunking.py | chunk_note() | -- | Parse Obsidian markdown into embeddable chunks by heading boundary |
| source_scanner.py | scan_source() | -- | Scan external sources (PubMed, RSS, etc.) for new items |
| image_gen.py | generate_image() | GENERATE_IMAGE_SCHEMA | Generate featured images via DALL-E 3 |
| image_colors.py | build_dalle_prompt() | -- | Assemble DALL-E prompts with pillar color palettes and brand-consistent style |
| r2_storage.py | upload_image() | -- | Upload featured images to Cloudflare R2 object storage |
| r2_storage.py | download_image() | -- | Download image bytes from R2 by key |
| r2_storage.py | presigned_url() | -- | Generate presigned GET URL for R2 images (15 min TTL) |
| r2_storage.py | delete_images() | -- | Delete all images for an article from R2 (prefix-based) |
| clinical_trials.py | search_clinical_trials() | SEARCH_CLINICAL_TRIALS_SCHEMA | Search ClinicalTrials.gov for trials by condition |
| semantic_scholar.py | search_semantic_scholar() | SEARCH_SEMANTIC_SCHOLAR_SCHEMA | Search Semantic Scholar for academic papers by query |
| semantic_scholar.py | fetch_s2_paper() | FETCH_S2_PAPER_SCHEMA | Fetch full paper details (including abstract) by paper ID |
| citation_formatter.py | format_citation_label() | -- | Generate deterministic citation labels from source metadata (6-level fallback) |
| url_validator.py | validate_source_urls() | -- | HEAD-check source URLs; remove sources with unreachable URLs from dossier |
| source_enrichment.py | enrich_dossier_urls() | -- | Construct publication URLs from PMID/DOI identifiers when missing |
| sanitize.py | sanitize_external_content() | -- | Strip control characters, collapse whitespace, truncate |
Integration Pattern¶
Tools are wired into agents via wrapper functions that return JSON strings:
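# In the agent module
import json

async def _wrap_search_pubmed(*, query: str, max_results: int = 50) -> str:
    # Call the tool handler and serialize its result for the LLM.
    pmids = await search_pubmed(query, max_results=max_results)
    return json.dumps(pmids)

# Name-to-wrapper lookup used when the LLM requests a tool call.
_TOOL_MAP = {"search_pubmed": _wrap_search_pubmed}
# Schemas advertised to the LLM.
_TOOL_DEFINITIONS = [SEARCH_PUBMED_SCHEMA]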
All tool results are sanitized via sanitize_external_content() before being sent back to the LLM.
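The wiring above can be sketched end to end as a small dispatcher that looks up a wrapper by name, awaits it, and sanitizes the JSON string before it goes back to the model. This is only an illustration: dispatch_tool_call, the stubbed wrapper, and the truncate-only _sanitize stand-in are assumptions, not names or behavior taken from the project.

```python
import asyncio
import json
from typing import Awaitable, Callable


def _sanitize(text: str, max_chars: int = 10_000) -> str:
    """Hypothetical stand-in for sanitize_external_content(); only truncates."""
    return text[:max_chars]


async def _wrap_search_pubmed(*, query: str, max_results: int = 50) -> str:
    # Stub: the real wrapper awaits search_pubmed(); fixed fake PMIDs keep
    # this sketch runnable without network access.
    return json.dumps(["100001", "100002"][:max_results])


_TOOL_MAP: dict[str, Callable[..., Awaitable[str]]] = {
    "search_pubmed": _wrap_search_pubmed,
}


async def dispatch_tool_call(name: str, arguments: dict) -> str:
    """Route one tool call to its wrapper and sanitize the returned string."""
    handler = _TOOL_MAP.get(name)
    if handler is None:
        # Return an error payload instead of raising (an assumed design choice).
        return json.dumps({"error": f"unknown tool: {name}"})
    raw = await handler(**arguments)
    return _sanitize(raw)
```

Returning an error payload for unknown tool names lets the model see the failure and retry; whether the project handles this case the same way is not stated here.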
Sanitization¶
The sanitize_external_content() function in src/tools/sanitize.py is a defense-in-depth measure applied to all external content before it reaches an LLM or the database. It:
- Strips control characters (except newlines and tabs)
- Collapses excessive whitespace runs
- Truncates content beyond a configurable character limit
This reduces the risk of prompt injection via scraped web pages, RSS feeds, or PubMed abstracts; it does not eliminate it, which is why it is only one layer of defense. External content is always placed in the user role of messages, never in system prompts.
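The three steps listed above could look roughly like the following. This is a sketch under assumptions, not the contents of src/tools/sanitize.py: the function name sanitize_sketch, the 5000-character default, the choice to drop carriage returns, and the exact definition of an "excessive" whitespace run are all illustrative.

```python
import re

# Control characters other than tab (\x09) and newline (\x0a).
# Carriage returns are dropped too, which is an assumption.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")
# Two or more spaces/tabs in a row collapse to a single space.
_SPACE_RUN = re.compile(r"[ \t]{2,}")
# Three or more newlines collapse to one blank line.
_NEWLINE_RUN = re.compile(r"\n{3,}")


def sanitize_sketch(text: str, max_chars: int = 5000) -> str:
    """Strip control characters, collapse whitespace runs, then truncate."""
    text = _CONTROL_CHARS.sub("", text)
    text = _SPACE_RUN.sub(" ", text)
    text = _NEWLINE_RUN.sub("\n\n", text)
    return text[:max_chars]
```

Truncating last means the character limit applies to the cleaned text, so stripped characters never count against it.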
Academic Search Tools¶
ClinicalTrials.gov (clinical_trials.py)¶
Searches the ClinicalTrials.gov v2 API for clinical trials by condition. Returns NCT IDs, titles, statuses, start dates, and URLs. The researcher agent uses this to find ongoing and completed trials relevant to the article topic.
Semantic Scholar (semantic_scholar.py)¶
Searches the Semantic Scholar Academic Graph API for papers by query. Two-step usage: search_semantic_scholar() returns lightweight results (title, date, citation count) without abstracts, then fetch_s2_paper() retrieves full details for selected papers.
Rate-limited to 1 request per 2 seconds to stay within the API tier's limits. Optionally authenticated via the S2_API_KEY environment variable.
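One way to enforce a limit of 1 request per 2 seconds in async code is a minimum-interval limiter that every call awaits before sending its request. The MinIntervalLimiter class and s2_headers helper below are hypothetical names for illustration; the module's actual mechanism is not described here. The x-api-key header is the one the Semantic Scholar API accepts for keyed access.

```python
import asyncio
import os
import time
from typing import Mapping


class MinIntervalLimiter:
    """Allow at most one call per `min_interval` seconds across coroutines."""

    def __init__(self, min_interval: float = 2.0) -> None:
        self._min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last_call = float("-inf")  # first call never waits

    async def wait(self) -> None:
        # The lock serializes callers so concurrent tasks queue up in turn.
        async with self._lock:
            delay = self._last_call + self._min_interval - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._last_call = time.monotonic()


def s2_headers(env: Mapping[str, str] = os.environ) -> dict:
    """Send the API key only when S2_API_KEY is set (authentication is optional)."""
    key = env.get("S2_API_KEY")
    return {"x-api-key": key} if key else {}
```

A caller would await limiter.wait() immediately before each HTTP request, sharing one limiter instance per process.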
Image Pipeline Tools¶
R2 Storage (r2_storage.py)¶
Cloudflare R2 object storage client for featured images. Images are stored at images/{article_id}/{ordinal}.png. Wraps synchronous boto3 S3 calls in asyncio.to_thread() for async compatibility.
Requires R2_ACCOUNT_ID, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, and R2_BUCKET_NAME environment variables.
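The key layout and the asyncio.to_thread() wrapping might be sketched as follows. The function names image_key and upload_image_sketch are assumptions, and the client is passed in rather than constructed so the example runs without boto3 or credentials; in practice it would be a boto3 S3 client pointed at the R2 endpoint. put_object with Bucket, Key, Body, and ContentType is the standard boto3 S3 call.

```python
import asyncio
from typing import Any


def image_key(article_id: str, ordinal: int) -> str:
    """Build the object key in the images/{article_id}/{ordinal}.png layout."""
    return f"images/{article_id}/{ordinal}.png"


async def upload_image_sketch(
    client: Any, bucket: str, article_id: str, ordinal: int, data: bytes
) -> str:
    """Upload PNG bytes without blocking the event loop; return the key."""
    key = image_key(article_id, ordinal)
    # The synchronous put_object call runs in a worker thread.
    await asyncio.to_thread(
        client.put_object,
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentType="image/png",
    )
    return key
```

Because every article's images share the images/{article_id}/ prefix, a prefix-based delete can remove them all in one listing pass.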
Image Colors (image_colors.py)¶
Assembles DALL-E prompts from pillar-specific color palettes and fixed template sections (style, composition, avoidance rules). Each content pillar (P1-P5) has a dedicated warm earth-tone color pair. The image generator agent calls build_dalle_prompt(scene_description, pillar) with the scene it composed from article context.
Post-Research Processing Tools¶
Citation Formatter (citation_formatter.py)¶
Generates deterministic citation labels from ResearchSource metadata using a 6-level fallback chain: authors + year → publication + year → authors only → publication only → truncated title → [source]. Runs after the researcher agent returns, before the writer receives the dossier.
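The fallback chain can be illustrated with a short sketch. The SourceMeta fields, the "et al." formatting, and the 40-character title limit are assumptions made for the example; the real ResearchSource model and label format may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceMeta:
    """Hypothetical subset of ResearchSource metadata."""
    authors: list[str] = field(default_factory=list)
    year: Optional[int] = None
    publication: Optional[str] = None
    title: Optional[str] = None


def _author_part(authors: list[str]) -> str:
    return authors[0] if len(authors) == 1 else f"{authors[0]} et al."


def citation_label(src: SourceMeta, max_title_chars: int = 40) -> str:
    """Walk the six fallback levels in order; the same input always yields the same label."""
    if src.authors and src.year:
        return f"{_author_part(src.authors)}, {src.year}"   # 1. authors + year
    if src.publication and src.year:
        return f"{src.publication}, {src.year}"             # 2. publication + year
    if src.authors:
        return _author_part(src.authors)                    # 3. authors only
    if src.publication:
        return src.publication                              # 4. publication only
    title = (src.title or "").strip()
    if title:                                               # 5. truncated title
        if len(title) <= max_title_chars:
            return title
        return title[:max_title_chars].rstrip() + "..."
    return "[source]"                                       # 6. last resort
```

Because each level is a pure function of the metadata, no model call is involved and the labels stay stable across reruns.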
URL Validator (url_validator.py)¶
HEAD-checks all source URLs in a research dossier. Removes sources where all URLs (both article_url and publication_url) are unreachable. Raises ValueError if all sources are removed. Reports removals to Sentry for monitoring.
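The removal rule can be sketched independently of any HTTP client by injecting the reachability check. The Source dataclass, filter_reachable, and the is_reachable callable are hypothetical names; in the real module the check would issue a HEAD request, and Sentry reporting is omitted here. Treating a source with no URLs at all as unreachable is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Source:
    """Hypothetical subset of a dossier source."""
    article_url: Optional[str] = None
    publication_url: Optional[str] = None


async def filter_reachable(
    sources: list[Source],
    is_reachable: Callable[[str], Awaitable[bool]],
) -> list[Source]:
    """Keep a source if at least one of its URLs is reachable."""
    kept = []
    for src in sources:
        urls = [u for u in (src.article_url, src.publication_url) if u]
        # Check both URLs of a source concurrently.
        results = await asyncio.gather(*(is_reachable(u) for u in urls))
        if any(results):
            kept.append(src)
    if sources and not kept:
        raise ValueError("all sources were removed: no reachable URLs")
    return kept
```

Injecting is_reachable keeps the filtering logic testable offline and lets the caller decide timeouts and which status codes count as reachable.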
Source Enrichment (source_enrichment.py)¶
Constructs clickable publication_url values from identifiers when the researcher agent didn't populate them. Priority: PMID → PubMed URL, DOI → doi.org URL, fallback to article_url. Existing URLs are never overwritten.
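The priority order can be expressed as a small pure function. The name build_publication_url and its keyword parameters are illustrative assumptions rather than the module's API; the PubMed and doi.org URL patterns are the standard public ones.

```python
from typing import Optional


def build_publication_url(
    *,
    pmid: Optional[str] = None,
    doi: Optional[str] = None,
    article_url: Optional[str] = None,
    existing: Optional[str] = None,
) -> Optional[str]:
    """Return a publication URL using PMID, then DOI, then article_url."""
    if existing:
        return existing  # existing URLs are never overwritten
    if pmid:
        return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
    if doi:
        return f"https://doi.org/{doi}"
    return article_url  # may be None when nothing is available
```

For example, a source with both a PMID and a DOI resolves to the PubMed URL, since PMID has the higher priority.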