Discovery Agent

AI agent that finds relevant Twitter conversations, drafts replies in your voice, and never posts without your approval.

01 — Overview

An agent that operates on your behalf — but only with your permission.

Most engagement tools either blast generic replies or require you to manually trawl feeds. Discovery Agent sits in between. It profiles your skills and interests, searches for conversations that match, ranks them, drafts replies that sound like you, and delivers a curated digest to Slack. You review, approve or reject, and the agent posts. Every tweet requires explicit approval; the agent never acts autonomously.

Under the hood, the system runs as a multi-stage pipeline: search, filter, rank, draft, digest, approve. Persistent state backs every stage so the pipeline survives restarts. Execution pauses at the approval stage and resumes asynchronously when decisions arrive through Slack, potentially hours later. Human-in-the-loop wasn't bolted on as a safety feature. It was the architectural foundation from the start.
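As a rough illustration of the checkpoint-and-pause idea, here is a minimal sketch using only the Python standard library. It is not the project's LangGraph code: the `runs` table, `open_store`, `advance`, and the `handlers` mapping are all hypothetical names, and SQLite stands in for the real database. It shows one way a run can stop at the approval stage and pick up later from stored state.

```python
import json
import sqlite3

# Stage names taken from the pipeline description; everything else is assumed.
STAGES = ["search", "filter", "rank", "draft", "digest", "approve"]


def open_store(path=":memory:"):
    """Create the table that records the next stage and state for each run."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id TEXT PRIMARY KEY, next_stage INTEGER NOT NULL, state TEXT NOT NULL)"
    )
    db.commit()
    return db


def advance(db, run_id, handlers, decisions=None):
    """Run stages from the last checkpoint; stop at 'approve' until decisions exist."""
    row = db.execute(
        "SELECT next_stage, state FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    index, state = (row[0], json.loads(row[1])) if row else (0, {})
    while index < len(STAGES):
        name = STAGES[index]
        if name == "approve":
            if decisions is None:
                break  # paused: nothing more to do until a human responds
            state["decisions"] = decisions
        else:
            state = handlers[name](state)
        index += 1
        # Checkpoint after every stage so a restart resumes here, not at the start.
        db.execute(
            "INSERT OR REPLACE INTO runs (run_id, next_stage, state) VALUES (?, ?, ?)",
            (run_id, index, json.dumps(state)),
        )
        db.commit()
    return "done" if index == len(STAGES) else "waiting_for_approval"
```

Calling `advance` once runs everything up to the digest and returns `"waiting_for_approval"`. Calling it again with a `decisions` mapping, whether minutes or hours later and from a different process, completes the run without repeating earlier stages.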

The agent improves over time. Approved and rejected items feed into preference profiles that shape future ranking. Query budgets shift toward categories that produce better results. Style traits extracted during onboarding keep drafts grounded in how you actually write, not how the model defaults to writing.
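The text does not say how query budgets are shifted, so the following is only one plausible scheme: split a fixed number of queries across categories in proportion to a smoothed approval rate, with a floor so no category is starved. The function name, the `floor` and `prior` parameters, and the category labels in the usage note are assumptions made for illustration.

```python
def allocate_queries(total, stats, floor=1, prior=1.0):
    """Split `total` queries across categories by smoothed approval rate.

    `stats` maps category -> (approved_count, rejected_count).
    """
    # Smoothing keeps a brand-new category from being starved or dominating.
    rates = {
        cat: (approved + prior) / (approved + rejected + 2 * prior)
        for cat, (approved, rejected) in stats.items()
    }
    spare = total - floor * len(rates)
    if spare < 0:
        raise ValueError("total is too small to give every category its floor")
    weight = sum(rates.values())
    budget = {cat: floor + int(spare * rate / weight) for cat, rate in rates.items()}
    # Hand out queries lost to rounding down, best-performing categories first.
    leftover = total - sum(budget.values())
    for cat in sorted(rates, key=rates.get, reverse=True)[:leftover]:
        budget[cat] += 1
    return budget
```

With ten queries and hypothetical history `{"python": (8, 2), "devops": (1, 9), "ml": (0, 0)}`, most of the budget goes to the category with the best approval record, the untested category gets a fair share, and the weak one keeps only its floor.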

02 — The Hard Parts

Pausing a Pipeline for Human Judgment

The discovery pipeline runs asynchronously. A slash command kicks off search, ranking, and drafting, then the system pauses while waiting for approval decisions that arrive through Slack, potentially hours later. State needs to survive process restarts. The graph framework handles the pause, but wiring it to an external approval surface required careful separation: the pipeline owns orchestration, Slack owns the user interface, and the database is the only shared state between them. Getting that boundary wrong would mean lost decisions or duplicate posts.
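To make the boundary concrete, here is a hedged sketch of the "database is the only shared state" idea, again with SQLite standing in for the real database. The table names and the functions `record_decision` and `post_approved` are hypothetical. The UI side only writes decisions, the pipeline side only reads them, and primary-key constraints make both operations safe to repeat.

```python
import sqlite3


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE IF NOT EXISTS decisions (
            draft_id TEXT PRIMARY KEY,   -- one decision per draft, ever
            verdict  TEXT NOT NULL CHECK (verdict IN ('approve', 'reject'))
        );
        CREATE TABLE IF NOT EXISTS posts (
            draft_id TEXT PRIMARY KEY    -- one post per draft, ever
        );
        """
    )
    return db


def record_decision(db, draft_id, verdict):
    """Called by the UI side. A repeated click is ignored, not overwritten."""
    cur = db.execute(
        "INSERT OR IGNORE INTO decisions (draft_id, verdict) VALUES (?, ?)",
        (draft_id, verdict),
    )
    db.commit()
    return cur.rowcount == 1  # True only for the first decision on this draft


def post_approved(db, publish):
    """Called by the pipeline side on resume. Claims each draft, then publishes."""
    rows = db.execute(
        "SELECT draft_id FROM decisions WHERE verdict = 'approve' "
        "AND draft_id NOT IN (SELECT draft_id FROM posts)"
    ).fetchall()
    posted = []
    for (draft_id,) in rows:
        cur = db.execute(
            "INSERT OR IGNORE INTO posts (draft_id) VALUES (?)", (draft_id,)
        )
        db.commit()
        if cur.rowcount == 1:  # claim succeeded, so no other worker publishes it
            publish(draft_id)
            posted.append(draft_id)
    return posted
```

Claiming before publishing is a deliberate trade-off in this sketch: a crash between the claim and the publish call leaves a draft unposted rather than posted twice, which seems the safer failure for an agent that speaks in someone's name.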

Spending Real Money Safely

The agent makes API calls that cost real money: reads to find posts, writes to publish tweets, and LLM calls to rank and draft. Without hard limits, a runaway loop or a misconfigured query could burn through a budget in minutes. Cost enforcement runs before any work begins. Session and daily limits are checked independently for reads and writes. Every external action gets a durable audit log entry before the call is made, so if the process crashes mid-request, there's always a record of what was attempted. This wasn't defensive programming; it was the only way to run an agent that spends money without watching it constantly.
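The enforcement order described above (check limits, write the audit entry, then make the call) can be sketched as follows. This is an illustrative guess at the shape, not the project's code: `CostGuard`, `BudgetExceeded`, the `audit` table, and the injectable `clock` are hypothetical names, and limits are counted in calls rather than dollars to keep the example short.

```python
import sqlite3
from datetime import datetime, timezone


class BudgetExceeded(Exception):
    """Raised before any external call is made."""


class CostGuard:
    def __init__(self, db, session_limits, daily_limits, clock=None):
        self.db = db
        self.session_limits = session_limits  # e.g. {"read": 50, "write": 5}
        self.daily_limits = daily_limits
        self.session_used = {kind: 0 for kind in session_limits}
        self.clock = clock or (lambda: datetime.now(timezone.utc))
        db.execute(
            "CREATE TABLE IF NOT EXISTS audit ("
            "id INTEGER PRIMARY KEY, day TEXT, kind TEXT, action TEXT, status TEXT)"
        )
        db.commit()

    def _used_today(self, kind, day):
        (count,) = self.db.execute(
            "SELECT COUNT(*) FROM audit WHERE day = ? AND kind = ?", (day, kind)
        ).fetchone()
        return count

    def call(self, kind, action, fn):
        """Check both limits for this kind, log the attempt durably, then call fn."""
        day = self.clock().date().isoformat()
        if self.session_used[kind] >= self.session_limits[kind]:
            raise BudgetExceeded(f"session {kind} limit reached")
        if self._used_today(kind, day) >= self.daily_limits[kind]:
            raise BudgetExceeded(f"daily {kind} limit reached")
        # The audit row is committed before the call, so a crash leaves a trace.
        cur = self.db.execute(
            "INSERT INTO audit (day, kind, action, status) "
            "VALUES (?, ?, ?, 'attempted')",
            (day, kind, action),
        )
        self.db.commit()
        self.session_used[kind] += 1
        result = fn()
        self.db.execute("UPDATE audit SET status = 'ok' WHERE id = ?", (cur.lastrowid,))
        self.db.commit()
        return result
```

Because the daily count is read from the audit table, it carries across restarts, while the session count resets with each new guard. Attempts that fail or crash still count against the budget, which errs on the side of spending less.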

Writing Like You Without Making Things Up

The model drafts replies that need to match your tone and reference your actual background, not hallucinate expertise. The ranking stage classifies each post's topic against your profile, separating what you know from what you don't. If a topic falls outside your domain, the drafter skips it entirely rather than fabricating authority. Style traits extracted during onboarding capture how you write (directness, vocabulary, personality) without encoding factual claims. The rest is structural: character limits, format validation, and rejection of drafts that read as obviously machine-generated.
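The structural checks could look something like the sketch below. Every specific here is an assumption made for illustration: the 280-character limit, the phrase list, the hashtag rule, and the function names. In the real system the topic classification happens in the ranking stage, whereas here `should_draft` is reduced to a simple membership test.

```python
import re

MAX_CHARS = 280  # assumed limit; plain len() only approximates platform counting

# Hypothetical phrases that tend to mark a draft as machine-written.
BOILERPLATE = re.compile(
    r"\b(as an ai|great question|i hope this helps|delve into)\b", re.IGNORECASE
)


def should_draft(topic, known_topics):
    """Skip anything outside the user's stated domain instead of bluffing."""
    return topic.lower() in {t.lower() for t in known_topics}


def validate_draft(text):
    """Return a list of problems; an empty list means the draft passes."""
    problems = []
    stripped = text.strip()
    if not stripped:
        problems.append("empty")
    if len(stripped) > MAX_CHARS:
        problems.append("too long")
    if BOILERPLATE.search(stripped):
        problems.append("boilerplate phrasing")
    if stripped.count("#") > 2:
        problems.append("too many hashtags")
    return problems
```

Returning a list of problems rather than a boolean makes it easy to log why a draft was dropped, which is useful when tuning the rules later.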

03 — Outcomes

0

Tweets posted without explicit human approval

100+

Tests across the pipeline, cost enforcement, and feedback loop

6

Pipeline stages orchestrated with persistent state and async resumption

23 days

First commit to v1.0 shipped

04 — Stack

Python 3.12, LangGraph, Slack Bolt, Anthropic Claude, PostgreSQL, pgvector, SQLAlchemy, Alembic, Docker, httpx, pytest

05 — What's Next

From personal tool to multi-user product.

→ Multi-user isolation: Other people should be able to connect their own accounts, run their own discovery cycles, and stay fully isolated, with separate data, separate budgets, and separate learning profiles.

→ Closing the feedback loop: v1.0 collects signals (edits, approvals, engagement) but doesn't fully consume them yet. The next milestone wires those signals into a verifiable learning loop: preference extraction from edits, drift detection when drafts stop matching your voice, and learned scoring weights that adapt ranking to each user over time.

→ Prompt generalization: The current prompts are tuned for a software engineer's profile. Generalizing the system so it works for any professional persona (a designer, a PM, a researcher) means rethinking how the agent reasons about expertise and relevance.

→ Reliability infrastructure: Alerting on failures, graceful degradation when external services go down, and cost enforcement that survives restarts. The kind of operational maturity that separates a side project from something people depend on.

This is a private project — but I'm happy to talk through the engineering.
