
Claude Code Setup Log #8: Three Agent Setups Working for Me

Last log was about treating my AI work as an operating system. Hermes for routing, GBrain for memory, Codex as a second coding lane, deterministic scripts as the glue.

This log is what I have actually been building on top of that OS over the last two weeks. Three setups, all running daily, all earning their keep:

  1. Plaid CLI as the substrate for a personal finance agent.
  2. Hermes scheduled jobs that close end-of-day loops via GBrain.
  3. /goal as a cross-runtime primitive for end-to-end execution.

The thread across all three: give agents goals, tools, memory, and verification. Then make the loops run on a schedule.


1. Plaid CLI as a personal finance agent substrate

Tracking my net worth across 8 institutions used to mean broker logins, manual spreadsheets, and a third-party aggregator capped at 75 API calls a day. The fragmentation got worse exactly when I wanted a single view to ask an agent questions against.

On May 7, Plaid shipped an official Plaid CLI built explicitly for AI agents. JSON-native output, local OAuth token storage at ~/Library/Application Support/plaid-cli/config.json with 0600 perms, $0/yr on the dev tier with 10 trial item slots. No server to babysit.
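To give a sense of the shape of the interface, here is a minimal sketch of driving a JSON-native CLI like this from Python. The `accounts get --output json` subcommand and the field names are assumptions for illustration, not the CLI's documented interface:

```python
import json
import subprocess

# Hypothetical invocation: the subcommand and flags are illustrative,
# not plaid-cli's documented interface.
raw = subprocess.run(
    ["plaid-cli", "accounts", "get", "--output", "json"],
    capture_output=True, text=True, check=True,
).stdout

accounts = json.loads(raw)

# JSON-native output means the agent-side code is plain dict access,
# no scraping of broker-specific UIs.
for acct in accounts:
    print(acct.get("institution"), acct.get("name"), acct.get("balance"))
```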

I linked 8 institutions in the active stack: ETrade (Hannah), Vanguard, Empower (PandaDoc + Toptal 401k), Fidelity NetBenefits, US Bank, ETrade (Aaron), Chase, Schwab. Two stay manual: Citi (Plaid permanently blocks it) and Fidelity Personal (Plaid has been blocked on consumer Fidelity since 2022, so that one moves via CSV import).

The skills layered on top

The CLI is the substrate. The value comes from the skills I run against it.

What it caught

The first Mon/Thu portfolio-optimizer run surfaced $25K of harvestable tax losses across ADBE, ARKK, CRM, BABA, SNOW (worth roughly $5-8K in real tax savings), plus a $20K cash-reserve gap I had missed against the $100K floor (since lowered to $75K on 2026-05-14). It also confirmed that my largest single position is 2,288.15 shares of GOOG worth $878,832 in Schwab Equity Awards, which Plaid returned in a single JSON call, versus the stale $627K spec figure I had been working from.

Why this works

Three properties make this more than just an “automated portfolio scan”:

  1. The agent knows me through GBrain. Current state: 12,164 pages and 37,620 embedded chunks across vault, transcripts, Slack, email, calendar, and people pages. Recommendations are graded against weighted goals, prior decisions, and household context.
  2. Skills are wired to actual goals, not generic prompts. “Optimize my portfolio” produces generic advice. “Optimize against my weighted goals from facts/portfolio-objective-2026 and check against Hannah’s risk tolerance” produces ranked, defensible recommendations with confidence scores (see the sketch after this list).
  3. One JSON interface across 8 institutions. Plaid normalizes the data layer so the agent does pattern-matching across positions instead of reconciling broker-specific UIs.
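As an illustration of what “wired to actual goals” means mechanically, here is a hedged sketch of ranking candidate recommendations against weighted goals. The goal weights, the recommendation fields, and the scoring rule are all hypothetical stand-ins for whatever the skill actually does:

```python
# Hypothetical goal weights, loosely in the spirit of a
# facts/portfolio-objective-2026 page. Illustrative only.
GOAL_WEIGHTS = {
    "tax_efficiency": 0.40,
    "cash_reserve_floor": 0.35,
    "concentration_risk": 0.25,
}

def score(recommendation: dict) -> float:
    """Rank a recommendation by how well it serves each weighted goal.

    recommendation["impact"] maps goal name -> estimated impact in [0, 1];
    the field shape is an assumption for this sketch.
    """
    return sum(
        weight * recommendation["impact"].get(goal, 0.0)
        for goal, weight in GOAL_WEIGHTS.items()
    )

recs = [
    {"action": "harvest ADBE loss", "impact": {"tax_efficiency": 0.9}},
    {"action": "top up cash reserve", "impact": {"cash_reserve_floor": 0.8}},
]
for rec in sorted(recs, key=score, reverse=True):
    print(f"{score(rec):.2f}  {rec['action']}")
```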

The takeaway pattern: once an official CLI exists for the data layer you care about, the skills you build on top become the multiplier. Plaid is the substrate. The skills are the value. GBrain is the memory that personalizes the recommendations.


2. Hermes EOD scheduled jobs: closing loops at 5:15pm

By 5pm a weekday’s decisions and unresolved threads have scattered across Slack, Granola transcripts, iMessage, GitHub PRs, daily logs, and Claude Code / Hermes / Codex session histories. The standard “what did I ship today, what is open, what is tomorrow’s first move” review depends on memory and channel-by-channel scanning. Brittle by default.

The setup: a Hermes cron named weekday-eod-close-the-loops (id fb164ebfaf73) runs Mon-Fri at 5:15pm PT. It calls into GBrain and qmd-indexed sessions, then emits a structured digest with four buckets: Shipped, Waiting, Tomorrow’s first move, Drop. A companion job weekday-workday-synthesis (id 9d43129ea110) fires 15 minutes later for a deeper synthesis pass.

The deterministic split

A Python pre-step (workday-synthesis-inputs.py) gathers same-day inputs into one JSON file before Hermes touches it. The script handles: daily-log block extraction, Slack thread file collection, Granola/Meetily transcript filtering, authored GitHub PRs, bounded iMessage context, and source-gap flags.
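A minimal sketch of the pre-step’s shape, assuming hypothetical vault paths and date-stamped filenames; the real workday-synthesis-inputs.py surely differs in the details:

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical source locations; stand-ins for the real ones.
VAULT = Path.home() / "vault"

def files_for(day: date, folder: str) -> list[str]:
    """Collect same-day files from a folder by date-stamped filename."""
    return sorted(str(p) for p in (VAULT / folder).glob(f"{day.isoformat()}*"))

def gather_inputs(day: date) -> dict:
    """Deterministically assemble one JSON payload; flag missing sources."""
    sources = {
        "daily_log": files_for(day, "daily-logs"),
        "slack": files_for(day, "slack-threads"),
        "transcripts": files_for(day, "granola"),
    }
    return {
        "date": day.isoformat(),
        "sources": sources,
        # Source-gap flags: the model is told what it did NOT see.
        "gaps": [name for name, found in sources.items() if not found],
    }

if __name__ == "__main__":
    out = Path("/tmp/workday-synthesis-inputs.json")
    out.write_text(json.dumps(gather_inputs(date.today()), indent=2))
```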

This split is the discipline I keep coming back to. I do not want an LLM deciding how to paginate Slack, parse calendar events, or filter transcript files. Code should do that. The model answers: what shipped, what is open, what should be tomorrow’s first move, what can be dropped.

What gets ingested

The synthesis runs on a fully populated input file: the pre-step’s gathered sources plus GBrain context and the qmd-indexed session histories.

Roughly 41 active Hermes cron jobs feed this ecosystem in some form.

A real run

The 2026-05-14 EOD run pulled 13 same-day iMessage conversations, 5 merged am-dashboard PRs (#177-#181), 4 Granola meetings, and 8 Slack threads. It produced a 3-step tomorrow’s-first-move plan: send Matt the OpenAI defer email, close Town logistics with Hannah before the 5/18 SF trip, write the skill-library brainstorm doc. It also flagged one open data-quality regression (a COMPANY_NAME literal email regression) I had forgotten about, plus one source gap. 12 EOD synthesis reports exist for May alone.

The user-side requirement

None of this works without one piece of discipline on my end: I have to write my todos and notes in coded language the agent can grep, with consistent conventions end-to-end.

Without this convention, the synthesis collapses into generic noise. The agent’s value is bounded by my tagging discipline in the channels it scans. This is the part nobody talks about when they say agents will remember everything. The agent only finds what I have tagged in a way it can match.

By 5:15pm my own working memory is full of half-finished threads. The Hermes job does the synthesis I would otherwise do badly or skip. Mornings start with a 3-step plan grounded in yesterday’s actual activity instead of from scratch.


3. /goal: agents that actually finish

In the last three weeks, /goal shipped in Codex CLI, Hermes, and Claude Code. Same primitive, three independent implementations, all within a two-week window.

If you have not used it: /goal sets a completion condition that survives across turns. After every turn, a small judge model checks whether the assistant’s last response actually satisfies the condition. If not, the agent gets a continuation prompt and keeps working. The loop stops when the judge says done, the user sends a new message, or a turn budget (default 20) runs out.
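A minimal sketch of that control flow, with hypothetical run_turn and judge_is_done stand-ins for the runtime’s agent turn and judge model; the real implementations differ per runtime, but the shape is the same:

```python
MAX_TURNS = 20  # default turn budget: the real backstop

# Hypothetical stand-ins for the runtime's agent turn and judge model.
def run_turn(history: list[dict]) -> str:
    return "...work toward the goal..."

def judge_is_done(response: str, condition: str) -> bool:
    return False  # a small judge model evaluates the last response here

def goal_loop(history: list[dict], condition: str) -> list[dict]:
    for _ in range(MAX_TURNS):
        response = run_turn(history)
        history.append({"role": "assistant", "content": response})
        try:
            if judge_is_done(response, condition):
                break  # judge says the completion condition is satisfied
        except Exception:
            pass  # judge failures fail-open: keep working, never wedge
        # The continuation is a normal user-role message appended to
        # history; system prompt, toolset, and prompt caches stay intact.
        # (A real user message would preempt this loop entirely.)
        history.append({"role": "user",
                        "content": f"Keep going; goal not yet met: {condition}"})
    return history
```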

The design discipline behind it

There is a real safety design behind the loop. From the Hermes docs:

“Judge failures fail-OPEN (continue) so a flaky judge never wedges progress. The turn budget is the real backstop.”

Three things matter about that:

  1. The judge is allowed to be wrong in the safe direction (keep going). The turn budget catches the unsafe direction (stop).
  2. A real user message always preempts the continuation loop. You are not handing over the wheel; you are letting it keep its hands on the wheel between your prompts.
  3. The continuation prompt is a normal user-role message appended to history. It does not mutate the system prompt or toolset. Prompt caches stay intact.

plan-to-goal: codify the prompt you keep retyping

A meta-observation: a good completion condition is hard to write, and a bad one wastes the entire turn budget. So I kept hand-typing the same prompt over and over for months, with small variations per runtime. Verbatim from my Claude Code session history on 2026-05-13 (typos preserved):

“Write a /goal prompt tha twould set a goal for claude code to comprehensively run and complete the plan flawlessly and validate that it works flawlessly and elegantly with no bugs upon completeion.”

I had near-identical text in Codex and Hermes sessions. The typos (“tha twould”, “completeion”) are not edited out. That is the prompt I kept retyping.

Last week I codified it. plan-to-goal is a Claude Code skill that composes a /goal block from an implementation plan. It does a few things I was doing badly by hand.

Dual output: the composed /goal block prints in chat for copy-paste and saves to ~/.claude/goals/YYYY-MM-DD-<slug>.md for re-use or audit.
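A hedged sketch of that dual-output step; the compose_goal_block template and slug logic are illustrative, not the skill’s actual code:

```python
import re
from datetime import date
from pathlib import Path

def compose_goal_block(plan_title: str, acceptance_tests: list[str]) -> str:
    """Compose a /goal block from an implementation plan (illustrative template)."""
    tests = "\n".join(f"- {t}" for t in acceptance_tests)
    return (
        f"/goal Implement the plan: {plan_title}.\n"
        f"Done means every acceptance test below passes, with evidence in transcript:\n"
        f"{tests}\n"
        f"How to prove it: paste the verification output into the conversation."
    )

def emit(plan_title: str, acceptance_tests: list[str]) -> str:
    block = compose_goal_block(plan_title, acceptance_tests)
    # Dual output: print for copy-paste, and save for re-use or audit.
    slug = re.sub(r"[^a-z0-9]+", "-", plan_title.lower()).strip("-")
    out = Path.home() / ".claude" / "goals" / f"{date.today().isoformat()}-{slug}.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(block)
    print(block)
    return str(out)
```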

The Deep Prayer example

The concrete win: my agent just built a feature for my iOS prayer app (Deep Prayer). I wrote the implementation plan, ran /plan-to-goal, and pasted the /goal block. The agent built the feature, ran tests, then verified by opening the iOS simulator via XcodeBuildMCP and walking every user flow until it passed every acceptance test from the original /goal prompt.

The “How to prove it” block in the goal demanded an a11y-tree trace pasted into the conversation as evidence. The agent could not claim done without producing the trace. The gate was not “tests pass” or “code compiles.” It was “run the flow in the simulator and paste the trace.”

That is the actual shift: the agent runs until provably done, not until it ran out of obvious moves.

Take-home pattern

Look at your last 50 prompts. The ones you keep retyping with small variations are skill candidates. The skill does not have to be fancy. It has to capture the judgment you were applying by hand.

Credits: Eric Traut on the OpenAI Codex team for the original /goal in Codex CLI 0.128. teknium1 and Nous Research for shipping it in Hermes. Anthropic for shipping it in Claude Code v2.1.139.


The quiet pattern

Three different shapes (a data substrate with skills layered on top, a scheduled synthesis job, a cross-runtime verification primitive), but they rhyme.

Each one solves a version of the same problem: agents only help if I can trust what they produce. Plaid plus the finance skills matters because portfolio-optimizer ranks recommendations with confidence scores and risk-reward ratios I can verify. The Hermes EOD job matters because I tagged my channels in a way the agent can match. /goal matters because the judge holds the agent to a concrete completion condition.

More agent autonomy means more pressure on verification. The thesis across all three: give agents goals, tools, memory, and verification. Then make the loops run on a schedule.

If you are running /goal across runtimes, I would love to compare completion conditions. The ones that have worked best for me share three things: a concrete assertion the agent can demonstrate in transcript, an explicit closing sentence the judge can pattern-match on, and a guardrail against the same error repeating twice.
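For concreteness, here is a hypothetical completion condition with all three properties; it illustrates the shape, and is not one of mine verbatim:

```
/goal Migrate the settings screen to the new storage layer.
Done means: every setting round-trips through the new layer in the iOS
simulator, demonstrated by pasting the test-runner output into this
conversation.
When finished, end your response with exactly:
"GOAL COMPLETE: settings migration verified."
If the same test fails twice with the same error, stop and report the
failure instead of retrying.
```

The pasted output is the demonstrable assertion, the closing sentence gives the judge a literal string to match, and the stop-on-repeat line is the guardrail against the same error looping.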


Last log: Setup Log #7: Hermes, GBrain, and Treating AI Work as an Operating System.

