3 Comments
Pawel Jozefiak's avatar

The 'verifiable work' distinction is the useful framing here. Code compiles or it doesn't - that feedback loop is why software engineering is moving faster than other domains.

What Amodei's timeline undersells: the transition isn't happening at the frontier-model level. It's happening in agent orchestration - the wrapper that turns a model into a continuous worker.

I've been running an agent on actual production tasks overnight. The gap isn't model capability - it's knowing when to escalate vs. proceed. That judgment layer is what's actually hard to automate: https://thoughts.jock.pl/p/building-ai-agent-night-shifts-ep1

What verifiability mechanisms have you found work best for non-code tasks?

Exitfund's avatar

Great point, the “verifiable work” framing really is the unlock here.

And I agree with you on orchestration. The real acceleration isn’t just coming from better frontier models; it’s coming from turning them into continuous workers with memory, tools, retries, and smart escalation logic. For non-code tasks, I’ve seen structured outputs with strict schemas help a lot, along with self-evaluation against predefined rubrics and grounding results in external tools or data sources wherever possible. It’s not as clean as code compilation, but it creates a layer of partial verifiability that reduces silent failures.
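
To make the strict-schema idea concrete, here is a minimal sketch using only the Python standard library. The task (an invoice summary), the field names, and the allowed currencies are hypothetical, chosen just for illustration. It shows only the "partial verifiability" part: the output either matches the expected shape or it gets flagged, and it says nothing about whether the values are actually true.

```python
import json

# Hypothetical schema for a non-code task (summarizing an invoice).
# Field names and allowed values are assumptions for illustration only.
REQUIRED_FIELDS = {"vendor": str, "total": float, "currency": str}
ALLOWED_CURRENCIES = {"USD", "EUR", "PLN"}


def validate_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")

    # Strict mode: reject anything the schema does not name.
    extra = sorted(set(data) - set(REQUIRED_FIELDS))
    if extra:
        problems.append(f"unexpected fields: {extra}")

    currency = data.get("currency")
    if isinstance(currency, str) and currency not in ALLOWED_CURRENCIES:
        problems.append(f"unknown currency: {currency}")
    return problems
```

Any non-empty result can trigger a retry or an escalation instead of letting a malformed answer pass silently.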

Really interesting that you’re running agents on production tasks overnight. Are you leaning more on rule-based escalation or confidence thresholds for those decisions?

Pawel Jozefiak's avatar

I would say a mix, but where I'm heading is closest to confidence thresholds. I want the agent to do things, not ask every second xD
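
A rough sketch of what that mix could look like, not my actual setup: a few hard rules that always escalate, and a confidence threshold for everything else. The action names, the `Step` type, and the 0.8 threshold are made up for illustration, and the confidence score is assumed to come from somewhere upstream (self-evaluation, a rubric, etc.).

```python
from dataclasses import dataclass

# Hypothetical values for illustration; tune them for the real workload.
ALWAYS_ESCALATE = {"delete_data", "send_payment"}
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class Step:
    action: str
    confidence: float  # assumed score in [0, 1] from an upstream evaluator
    irreversible: bool = False


def decide(step: Step) -> str:
    """Return "proceed" or "escalate" for a planned agent step."""
    # Rule-based layer: some actions go to a human no matter the score.
    if step.action in ALWAYS_ESCALATE or step.irreversible:
        return "escalate"
    # Threshold layer: everything else proceeds if the score is high enough.
    if step.confidence >= CONFIDENCE_THRESHOLD:
        return "proceed"
    return "escalate"
```

With these made-up numbers, `decide(Step("draft_reply", 0.9))` proceeds, while `decide(Step("send_payment", 0.99))` still escalates, so the rules only guard the risky actions and the agent is not asking about everything.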