
Claude Code Setup Log #8: Three Agent Setups Working for Me

Last log was about treating my AI work as an operating system. Hermes for routing, GBrain for memory, Codex as a second coding lane, deterministic scripts as the glue.

This log is what I have actually been building on top of that OS over the last two weeks. Three setups, all running daily, all earning their keep:

  1. Plaid CLI as the substrate for a personal finance agent.
  2. Hermes scheduled jobs that close end-of-day loops via GBrain.
  3. /goal as a cross-runtime primitive for end-to-end execution.

The thread across all three: give agents goals, tools, memory, and verification. Then make the loops run on a schedule.


1. Plaid CLI as a personal finance agent substrate

Tracking my net worth across 8 institutions used to mean broker logins, manual spreadsheets, and a third-party aggregator capped at 75 API calls a day. The fragmentation got worse exactly when I wanted a single view to ask an agent questions against.

On May 7, Plaid shipped an official Plaid CLI built explicitly for AI agents. JSON-native output, local OAuth token storage at ~/Library/Application Support/plaid-cli/config.json with 0600 perms, $0/yr on the dev tier with 10 trial item slots. No server to babysit.
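To give a sense of the shape of the interface, here is a minimal sketch of driving a JSON-native CLI like this from Python. The `accounts get --output json` subcommand and the field names are assumptions for illustration, not the CLI's documented interface:

```python
import json
import subprocess

# Hypothetical invocation: the subcommand and flags are illustrative,
# not plaid-cli's documented interface.
raw = subprocess.run(
    ["plaid-cli", "accounts", "get", "--output", "json"],
    capture_output=True, text=True, check=True,
).stdout

accounts = json.loads(raw)

# JSON-native output means the agent-side code is plain dict access,
# no scraping of broker-specific UIs.
for acct in accounts:
    print(acct.get("institution"), acct.get("name"), acct.get("balance"))
```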

I linked 8 institutions in the active stack: ETrade (Hannah), Vanguard, Empower (PandaDoc + Toptal 401k), Fidelity NetBenefits, US Bank, ETrade (Aaron), Chase, Schwab. Two stay manual: Citi (Plaid permanently blocks it) and Fidelity Personal (Plaid has been blocked on consumer Fidelity since 2022, so that one moves via CSV import).

The skills layered on top

The CLI is the substrate. The value comes from the skills I run against it.

What it caught

The first Mon/Thu portfolio-optimizer run surfaced $25K of harvestable tax losses across ADBE, ARKK, CRM, BABA, SNOW (worth roughly $5-8K in real tax savings), plus a $20K cash-reserve gap I had missed against the $100K floor (since lowered to $75K on 2026-05-14). It also confirmed that my largest single position is 2,288.15 shares of GOOG worth $878,832 in Schwab Equity Awards, which Plaid returned in a single JSON call, versus the stale $627K spec figure I had been working from.

Why this works

Three properties make this more than just an “automated portfolio scan”:

  1. The agent knows me through GBrain. Current state: 12,164 pages and 37,620 embedded chunks across vault, transcripts, Slack, email, calendar, and people pages. Recommendations are graded against weighted goals, prior decisions, and household context.
  2. Skills are wired to actual goals, not generic prompts. “Optimize my portfolio” produces generic advice. “Optimize against my weighted goals from facts/portfolio-objective-2026 and check against Hannah’s risk tolerance” produces ranked, defensible recommendations with confidence scores (see the sketch after this list).
  3. One JSON interface across 8 institutions. Plaid normalizes the data layer so the agent does pattern-matching across positions instead of reconciling broker-specific UIs.
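As an illustration of what “wired to actual goals” means mechanically, here is a hedged sketch of ranking candidate recommendations against weighted goals. The goal weights, the recommendation fields, and the scoring rule are all hypothetical stand-ins for whatever the skill actually does:

```python
# Hypothetical goal weights, loosely in the spirit of a
# facts/portfolio-objective-2026 page. Illustrative only.
GOAL_WEIGHTS = {
    "tax_efficiency": 0.40,
    "cash_reserve_floor": 0.35,
    "concentration_risk": 0.25,
}

def score(recommendation: dict) -> float:
    """Rank a recommendation by how well it serves each weighted goal.

    recommendation["impact"] maps goal name -> estimated impact in [0, 1];
    the field shape is an assumption for this sketch.
    """
    return sum(
        weight * recommendation["impact"].get(goal, 0.0)
        for goal, weight in GOAL_WEIGHTS.items()
    )

recs = [
    {"action": "harvest ADBE loss", "impact": {"tax_efficiency": 0.9}},
    {"action": "top up cash reserve", "impact": {"cash_reserve_floor": 0.8}},
]
for rec in sorted(recs, key=score, reverse=True):
    print(f"{score(rec):.2f}  {rec['action']}")
```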

The takeaway pattern: once an official CLI exists for the data layer you care about, the skills you build on top become the multiplier. Plaid is the substrate. The skills are the value. GBrain is the memory that personalizes the recommendations.


2. Hermes EOD scheduled jobs: closing loops at 5:15pm

By 5pm a weekday’s decisions and unresolved threads have scattered across Slack, Granola transcripts, iMessage, GitHub PRs, daily logs, and Claude Code / Hermes / Codex session histories. The standard “what did I ship today, what is open, what is tomorrow’s first move” review depends on memory and channel-by-channel scanning. Brittle by default.

The setup: a Hermes cron named weekday-eod-close-the-loops (id fb164ebfaf73) runs Mon-Fri at 5:15pm PT. It calls into GBrain and qmd-indexed sessions, then emits a structured digest with four buckets: Shipped, Waiting, Tomorrow’s first move, Drop. A companion job weekday-workday-synthesis (id 9d43129ea110) fires 15 minutes later for a deeper synthesis pass.

The deterministic split

A Python pre-step (workday-synthesis-inputs.py) gathers same-day inputs into one JSON file before Hermes touches it. The script handles: daily-log block extraction, Slack thread file collection, Granola/Meetily transcript filtering, authored GitHub PRs, bounded iMessage context, and source-gap flags.
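A minimal sketch of the pre-step’s shape, assuming hypothetical vault paths and date-stamped filenames; the real workday-synthesis-inputs.py surely differs in the details:

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical source locations; stand-ins for the real ones.
VAULT = Path.home() / "vault"

def files_for(day: date, folder: str) -> list[str]:
    """Collect same-day files from a folder by date-stamped filename."""
    return sorted(str(p) for p in (VAULT / folder).glob(f"{day.isoformat()}*"))

def gather_inputs(day: date) -> dict:
    """Deterministically assemble one JSON payload; flag missing sources."""
    sources = {
        "daily_log": files_for(day, "daily-logs"),
        "slack": files_for(day, "slack-threads"),
        "transcripts": files_for(day, "granola"),
    }
    return {
        "date": day.isoformat(),
        "sources": sources,
        # Source-gap flags: the model is told what it did NOT see.
        "gaps": [name for name, found in sources.items() if not found],
    }

if __name__ == "__main__":
    out = Path("/tmp/workday-synthesis-inputs.json")
    out.write_text(json.dumps(gather_inputs(date.today()), indent=2))
```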

This split is the discipline I keep coming back to. I do not want an LLM deciding how to paginate Slack, parse calendar events, or filter transcript files. Code should do that. The model answers: what shipped, what is open, what should be tomorrow’s first move, what can be dropped.

What gets ingested

The synthesis runs on a fully populated input file: the pre-step’s gathered sources plus GBrain context and the qmd-indexed session histories.

Roughly 41 active Hermes cron jobs feed this ecosystem in some form.

A real run

The 2026-05-14 EOD run pulled 13 same-day iMessage conversations, 5 merged am-dashboard PRs (#177-#181), 4 Granola meetings, and 8 Slack threads. It produced a 3-step tomorrow’s-first-move plan: send Matt the OpenAI defer email, close Town logistics with Hannah before the 5/18 SF trip, write the skill-library brainstorm doc. It also flagged one open data-quality regression (a COMPANY_NAME literal email regression) I had forgotten about, plus one source gap. 12 EOD synthesis reports exist for May alone.

The user-side requirement

None of this works without one piece of discipline on my end: I have to write my todos and notes in coded language the agent can grep, with consistent conventions end-to-end.

Without this convention, the synthesis collapses into generic noise. The agent’s value is bounded by my tagging discipline in the channels it scans. This is the part nobody talks about when they say agents will remember everything. The agent only finds what I have tagged in a way it can match.

By 5:15pm my own working memory is full of half-finished threads. The Hermes job does the synthesis I would otherwise do badly or skip. Mornings start with a 3-step plan grounded in yesterday’s actual activity instead of from scratch.


3. /goal: agents that actually finish

In the last three weeks, /goal shipped in Codex CLI, Hermes, and Claude Code. Same primitive, three independent implementations, all within a two-week window.

If you have not used it: /goal sets a completion condition that survives across turns. After every turn, a small judge model checks whether the assistant’s last response actually satisfies the condition. If not, the agent gets a continuation prompt and keeps working. The loop stops when the judge says done, the user sends a new message, or a turn budget (default 20) runs out.
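A minimal sketch of that control flow, with hypothetical run_turn and judge_is_done stand-ins for the runtime’s agent turn and judge model; the real implementations differ per runtime, but the shape is the same:

```python
MAX_TURNS = 20  # default turn budget: the real backstop

# Hypothetical stand-ins for the runtime's agent turn and judge model.
def run_turn(history: list[dict]) -> str:
    return "...work toward the goal..."

def judge_is_done(response: str, condition: str) -> bool:
    return False  # a small judge model evaluates the last response here

def goal_loop(history: list[dict], condition: str) -> list[dict]:
    for _ in range(MAX_TURNS):
        response = run_turn(history)
        history.append({"role": "assistant", "content": response})
        try:
            if judge_is_done(response, condition):
                break  # judge says the completion condition is satisfied
        except Exception:
            pass  # judge failures fail-open: keep working, never wedge
        # The continuation is a normal user-role message appended to
        # history; system prompt, toolset, and prompt caches stay intact.
        # (A real user message would preempt this loop entirely.)
        history.append({"role": "user",
                        "content": f"Keep going; goal not yet met: {condition}"})
    return history
```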

The design discipline behind it

There is a real safety design behind the loop. From the Hermes docs:

“Judge failures fail-OPEN (continue) so a flaky judge never wedges progress. The turn budget is the real backstop.”

Three things matter about that:

  1. The judge is allowed to be wrong in the safe direction (keep going). The turn budget catches the unsafe direction (stop).
  2. A real user message always preempts the continuation loop. You are not handing over the wheel; you are letting it keep its hands on the wheel between your prompts.
  3. The continuation prompt is a normal user-role message appended to history. It does not mutate the system prompt or toolset. Prompt caches stay intact.

plan-to-goal: codify the prompt you keep retyping

A meta-observation: a good completion condition is hard to write, and a bad one wastes the entire turn budget. So I kept hand-typing the same prompt over and over for months, with small variations per runtime. Verbatim from my Claude Code session history on 2026-05-13 (typos preserved):

“Write a /goal prompt tha twould set a goal for claude code to comprehensively run and complete the plan flawlessly and validate that it works flawlessly and elegantly with no bugs upon completeion.”

I had near-identical text in Codex and Hermes sessions. The typos (“tha twould”, “completeion”) are not edited out. That is the prompt I kept retyping.

Last week I codified it. plan-to-goal is a Claude Code skill that composes a /goal block from an implementation plan. It does a few things I was doing badly by hand.

Dual output: the composed /goal block prints in chat for copy-paste and saves to ~/.claude/goals/YYYY-MM-DD-<slug>.md for re-use or audit.
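A hedged sketch of that dual-output step; the compose_goal_block template and slug logic are illustrative, not the skill’s actual code:

```python
import re
from datetime import date
from pathlib import Path

def compose_goal_block(plan_title: str, acceptance_tests: list[str]) -> str:
    """Compose a /goal block from an implementation plan (illustrative template)."""
    tests = "\n".join(f"- {t}" for t in acceptance_tests)
    return (
        f"/goal Implement the plan: {plan_title}.\n"
        f"Done means every acceptance test below passes, with evidence in transcript:\n"
        f"{tests}\n"
        f"How to prove it: paste the verification output into the conversation."
    )

def emit(plan_title: str, acceptance_tests: list[str]) -> str:
    block = compose_goal_block(plan_title, acceptance_tests)
    # Dual output: print for copy-paste, and save for re-use or audit.
    slug = re.sub(r"[^a-z0-9]+", "-", plan_title.lower()).strip("-")
    out = Path.home() / ".claude" / "goals" / f"{date.today().isoformat()}-{slug}.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(block)
    print(block)
    return str(out)
```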

The Deep Prayer example

The concrete win: my agent just built a feature for my iOS prayer app (Deep Prayer). I wrote the implementation plan, ran /plan-to-goal, and pasted the /goal block. The agent built the feature, ran tests, then verified by opening the iOS simulator via XcodeBuildMCP and walking every user flow until it passed every acceptance test from the original /goal prompt.

The “How to prove it” block in the goal demanded an a11y-tree trace pasted into the conversation as evidence. The agent could not claim done without producing the trace. The gate was not “tests pass” or “code compiles.” It was “run the flow in the simulator and paste the trace.”

That is the actual shift: the agent runs until provably done, not until it ran out of obvious moves.

Take-home pattern

Look at your last 50 prompts. The ones you keep retyping with small variations are skill candidates. The skill does not have to be fancy. It has to capture the judgment you were applying by hand.

Credits: Eric Traut on the OpenAI Codex team for the original /goal in Codex CLI 0.128. teknium1 and Nous Research for shipping it in Hermes. Anthropic for shipping it in Claude Code v2.1.139.


The quiet pattern

Three different shapes (a data substrate with skills layered on top, a scheduled synthesis job, a cross-runtime verification primitive), but they rhyme.

Each one solves a version of the same problem: agents only help if I can trust what they produce. Plaid plus the finance skills matters because portfolio-optimizer ranks recommendations with confidence scores and risk-reward ratios I can verify. The Hermes EOD job matters because I tagged my channels in a way the agent can match. /goal matters because the judge holds the agent to a concrete completion condition.

More agent autonomy means more pressure on verification. The thesis across all three: give agents goals, tools, memory, and verification. Then make the loops run on a schedule.

If you are running /goal across runtimes, I would love to compare completion conditions. The ones that have worked best for me share three things: a concrete assertion the agent can demonstrate in transcript, an explicit closing sentence the judge can pattern-match on, and a guardrail against the same error repeating twice.
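For concreteness, here is a hypothetical completion condition with all three properties; it illustrates the shape, and is not one of mine verbatim:

```
/goal Migrate the settings screen to the new storage layer.
Done means: every setting round-trips through the new layer in the iOS
simulator, demonstrated by pasting the test-runner output into this
conversation.
When finished, end your response with exactly:
"GOAL COMPLETE: settings migration verified."
If the same test fails twice with the same error, stop and report the
failure instead of retrying.
```

The pasted output is the demonstrable assertion, the closing sentence gives the judge a literal string to match, and the stop-on-repeat line is the guardrail against the same error looping.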


Last log: Setup Log #7: Hermes, GBrain, and Treating AI Work as an Operating System.

