Agentic systems are moving into the parts of healthcare where they touch patients directly. Care navigation that decides what a person hears next. Documentation that becomes part of the record. Patient engagement that speaks in the institution's voice. These systems do not fail like a broken button fails. They fail plausibly: a confident wrong answer, a subtle drift, a behavior that was fine in testing and is not fine in the long tail of real use.
That failure mode demands a role most teams do not yet have a name for. Call it the safety and evaluation engineer. They build the evaluation harness, define the failure taxonomy, and own the single hardest question in the building: how do we know this is safe to ship? Not whether it demos well. Whether it is safe, in the cases that matter, in a form you can defend.
Founders tend to discover they need this role late, usually after an incident or a near miss. The companies that do it well hire for it before the agent ships, while the harness can still shape the product rather than chase it.
Where these people come from
There is no clean pipeline for this role yet, which is exactly why it is hard to fill and easy to fill badly. The strong candidates come from three directions, and each brings only part of what you need.
- ML engineers who have built evaluation systems for production models. They understand how models fail and how to measure it, but may underweight the clinical stakes.
- QA and test leaders from regulated software. They know how to design for failure, document evidence, and think adversarially, but may be new to the open-endedness of agents.
- Clinical informaticists. They know what a dangerous output actually looks like in a care setting, but may need an engineering partner to build the harness around that judgment.
The best hire is rarely a perfect blend of all three. It is usually someone strong in one with genuine curiosity about the others: an ML engineer who keeps asking the clinician what would actually hurt a patient, or a clinical informaticist who has taught themselves to think in test coverage. Curiosity across the boundary is the trait that ages well. The person who treats the other two disciplines as someone else's problem will build a harness that misses the failures those disciplines would have caught.
Why the role must be separate
The single most important structural decision is this: the person who decides whether the agent is safe to ship cannot report into, or be measured by, the team racing to ship it. Not because feature engineers are careless. Because incentives are gravity. A team rewarded for shipping will, without any bad intent, find reasons the evaluation results are good enough. The pressure is constant and the rationalizations are reasonable, and that is precisely why the judgment cannot live inside the shipping team.
Separation does not mean adversarial. The evaluation engineer should sit close to the builders, understand the system deeply, and care about shipping too. But the call on safety, and the harness that informs it, has to be insulated from the deadline. When the two roles collapse into one person or one team, the evaluation quietly bends to the roadmap, and you find out it bent only when something reaches a patient that should not have.
A team paid to ship will always find a reason the evals look fine. Insulate the judgment, or the deadline will make it for you.
What good looks like in a portfolio review
This is a role you evaluate by looking at what someone has built, not what they can describe. Ask the candidate to walk you through an evaluation system they own. The work tells you almost everything.
Listen for whether they think in failure taxonomies: can they name the categories of ways their system went wrong, ranked by severity, or do they offer only a single accuracy number? A real evaluation engineer has a mental map of the failure surface and can tell you which corners of it are well covered and which keep them up at night. The candidate who only talks about aggregate scores has not yet learned that the average hides the cases that matter.
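To make the point about averages concrete, here is a minimal sketch of a severity-ranked failure report. The category names, severity tiers, and sample results are all hypothetical, invented for illustration; a real taxonomy would be built with clinicians from observed failures.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    LOW = 1
    MODERATE = 2
    CRITICAL = 3


# Hypothetical taxonomy: failure category -> severity tier.
TAXONOMY = {
    "tone_mismatch": Severity.LOW,
    "outdated_scheduling_info": Severity.MODERATE,
    "missed_escalation": Severity.CRITICAL,
}


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    failure: Optional[str] = None  # None means the case passed


def pass_rate(results):
    """Aggregate pass rate: the single number that hides the detail."""
    passed = sum(1 for r in results if r.failure is None)
    return passed / len(results)


def failures_by_severity(results):
    """Return (category, severity, count) rows, most severe first."""
    counts = Counter(r.failure for r in results if r.failure is not None)
    return sorted(
        ((name, TAXONOMY[name], n) for name, n in counts.items()),
        key=lambda row: (-row[1], -row[2], row[0]),
    )


# Illustrative data only: 100 cases, 3 failures.
results = [EvalResult(f"case-{i}") for i in range(97)]
results += [
    EvalResult("case-97", "tone_mismatch"),
    EvalResult("case-98", "tone_mismatch"),
    EvalResult("case-99", "missed_escalation"),
]

print(f"aggregate pass rate: {pass_rate(results):.2f}")
for name, severity, count in failures_by_severity(results):
    print(f"{severity.name:<9} {name:<26} {count}")
```

In this made-up run the aggregate pass rate is 0.97, which sounds healthy, yet the first row of the ranked report is a critical missed escalation. That one row is the thing a portfolio review should be able to surface.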
Listen for how they handle the long tail and the ambiguous case. Anyone can evaluate the clear-cut examples. The skill is in the cases where the right answer is contested, where they had to define what correct even means, and where they had to build a process (human review, escalation, sampling) for the things automated metrics could not catch. And listen for how they decide when something is safe to ship: the strong ones have a threshold they can defend, an account of the residual risk they accepted, and a memory of a time they said no.
Questions that surface this in a portfolio review:
- Show me your failure taxonomy for a system you evaluated. How did you build it, and what was in the most severe category?
- Tell me about a case where the metrics looked fine and the behavior was not. How did you catch it?
- How do you evaluate something where reasonable people disagree on the right answer?
- Describe a time you recommended against shipping. What was the evidence, and what happened next?
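A defensible threshold, an account of residual risk, and a recorded "no" can all be written down explicitly. The sketch below is one possible shape for such a ship gate, not a standard: the tier names, the limits in `MAX_FAILURE_RATE`, and the `ship_gate` function are assumptions made for illustration, and real limits would be set jointly with clinical leadership.

```python
from dataclasses import dataclass

# Hypothetical maximum tolerated failure rate per severity tier.
MAX_FAILURE_RATE = {"critical": 0.0, "moderate": 0.01, "low": 0.05}


@dataclass(frozen=True)
class GateDecision:
    ship: bool
    blocking: dict       # tiers over their limit -> observed rate
    residual_risk: dict  # tiers with failures that stayed within the limit


def ship_gate(failures, total_cases, limits=MAX_FAILURE_RATE):
    """Compare observed failure rates per tier against explicit limits."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    unknown = set(failures) - set(limits)
    if unknown:
        # An unclassified failure should stop the gate, not slip through it.
        raise ValueError(f"no limit defined for tiers: {sorted(unknown)}")

    blocking, residual = {}, {}
    for tier, limit in limits.items():
        rate = failures.get(tier, 0) / total_cases
        if rate > limit:
            blocking[tier] = rate
        elif rate > 0:
            residual[tier] = rate
    return GateDecision(ship=not blocking, blocking=blocking, residual_risk=residual)


# Illustrative counts only.
decision = ship_gate({"critical": 1, "moderate": 3, "low": 20}, total_cases=1000)
print("ship:", decision.ship)
print("blocking:", decision.blocking)
print("accepted residual risk:", decision.residual_risk)
```

With these invented counts the gate returns a "no": one critical failure in a thousand cases exceeds a zero-tolerance limit, while the moderate and low tiers pass and are recorded as residual risk the team is knowingly accepting. A clean run on a finite test set is still evidence and not proof, which is why the human review and sampling processes described above sit alongside a gate like this.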
Hiring the conscience of the system
We score this role on a calibrated scorecard built with the founder and the clinical lead together, because no single discipline can define the bar alone. The weighting rewards demonstrated judgment about failure over raw model-building skill, and it rewards the candidate who has said no and lived with the consequences over the one who has only ever made things work.
This person becomes, in practice, the conscience of the system: the one who can stand in front of a regulator, a clinician, or your own board and say how you know it is safe. Hire them early, give them the independence the role requires, and let the harness they build shape the product before the product ships. The alternative is to build the harness after the incident, which is the most expensive time to learn what it should have caught.