The first VP of Engineering sets the ceiling on everything the company builds after them. They choose the architecture, the bar for what ships, and the people who will fill the team beneath them. In a consumer company, a bad version of this hire costs you a slow quarter. In a clinical-grade AI company, a bad version of this hire costs you a finding in your first audit, a model that fails quietly in production, and a year spent rebuilding trust you did not know you had spent.
The trouble is that the resume that looks best is often the wrong one. The most decorated candidates come from places where scale was the whole game: billions of requests, famous products, teams that moved fast because nothing they shipped could hurt anyone. That experience is real, and it is not the experience you need. You need someone who has built where errors have consequences, where the question is not only "does it work?" but "can you prove it worked, and prove it later, to someone whose job is to doubt you?"
This is the hire founders most often get wrong, because the failure mode is invisible at the offer stage. Brilliance interviews well. Scar tissue does not announce itself. You have to go looking for it.
The profile that actually fits
The candidate you want has scaled a platform team inside a regulated environment: health, finance, safety-critical infrastructure, somewhere the regulator was a real presence rather than a slide. They have shipped ML or LLM systems into production where a wrong output had a clinical or compliance consequence, and they have lived with the aftermath of one. They can speak to an audit trail without being prompted: how a decision the system made gets reconstructed months later, who can see what, and what evidence exists that the model in production was the model that was validated.
Note what is not on that list. Raw technical brilliance is necessary, but it is table stakes, not the differentiator. The differentiator is judgment formed under constraint: having had to slow down, document, and defend, and having learned to do it without losing the team's speed entirely. That judgment is hard to teach and harder to fake. It is the thing you are hiring for.
The mis-hire to fear
The classic mis-hire is the brilliant consumer-scale engineer who has never met a compliance officer. They are genuinely excellent. They have shipped things you admire. And the first time someone tells them the model cannot go out until it has been validated against a frozen test set, with the results signed off and stored, they treat the requirement as friction to be routed around rather than the job itself. They optimize for velocity because velocity is the only thing they have ever been rewarded for.
In a clinical context, that instinct is a structural risk. It compounds quietly, and it points the entire team at the wrong reward. The point of the role is to carry velocity and validation at once, and a leader who has only ever carried one will pull the team toward the one they know. You will not see it in the first month. You will see it in the first incident.
Velocity is a habit, and so is the discipline that contains it. You are hiring for whichever one the candidate built first.
What to probe in the interview
Do not ask whether they value compliance. Everyone values compliance in an interview. Ask for the stories, and listen for whether the stories have texture: specific decisions, named tradeoffs, consequences they had to own. A candidate who has really done this will not need to be coached toward the hard parts. They will go there on their own, because the hard parts are where the lessons live.
- Walk me through a production incident with a model you shipped. What failed, how did you find out, and how long was it wrong before you knew?
- Tell me about a time you rolled back a model. What was the trigger, who made the call, and what did you wish you had instrumented beforehand?
- Describe a release where validation told you to wait and the business wanted to ship. How did that conversation go, and what did you decide?
- If a regulator asked you to reconstruct a specific decision the system made six months ago, what would you pull, and would it be enough?
- Where have you seen an audit trail done badly? What was missing, and what did it cost?
The rollback question is the most revealing. Anyone can describe shipping. Describing an unship (the decision to pull something back, under pressure, with the cost visible) tells you whether they have the reflex you need. The candidate who has never rolled back a model has either been very lucky or has not been watching closely enough to notice they should have.
How to weight the scorecard
We score this role on a weighted scorecard, calibrated with the founder before any candidate is seen, and scored blind across the panel so the loudest interviewer does not set the verdict. The temptation is to let technical brilliance dominate the weighting, because brilliance is the easiest dimension to feel confident about. Resist it. Brilliance should clear a high bar and then stop earning points. Regulated-environment scar tissue, the evidence that this person has shipped where it counts and learned the right lessons, should carry the weight that decides the hire.
This does not mean settling for a weaker engineer. The candidate who has both is the one to hold out for. It means that when you are choosing between the dazzling generalist and the slightly-less-dazzling leader who has done exactly this before, in this kind of environment, the scorecard should point you at the second one. The first one you can hire later, into a seat where the stakes are lower and the judgment can be learned on something that does not matter as much.
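One way to make "clear a high bar and then stop earning points" concrete is to treat brilliance as a gate plus a cap. The sketch below is illustrative only: the dimension names, the weights, the 1-5 scale, and the bar of 4.0 are all assumptions, not figures from this scorecard, and should be replaced with whatever the founder and panel calibrate beforehand.

```python
from statistics import mean

# Hypothetical dimensions and weights; none of these numbers come from the essay.
WEIGHTS = {
    "regulated_scar_tissue": 0.5,
    "technical_brilliance": 0.3,
    "team_building": 0.2,
}
BRILLIANCE_BAR = 4.0  # assumed bar on an assumed 1-5 scale


def panel_average(panel_scores):
    """Average each dimension across interviewers who scored independently."""
    return {dim: mean(s[dim] for s in panel_scores) for dim in WEIGHTS}


def weighted_score(avg):
    """Return None if brilliance misses the bar; otherwise a weighted total
    in which brilliance is capped at the bar, so extra brilliance adds nothing."""
    if avg["technical_brilliance"] < BRILLIANCE_BAR:
        return None
    capped = dict(avg, technical_brilliance=BRILLIANCE_BAR)
    return sum(WEIGHTS[dim] * capped[dim] for dim in WEIGHTS)


# Two made-up candidates, each scored blind by two interviewers.
generalist = [
    {"regulated_scar_tissue": 2, "technical_brilliance": 5, "team_building": 4},
    {"regulated_scar_tissue": 2, "technical_brilliance": 5, "team_building": 4},
]
regulated_leader = [
    {"regulated_scar_tissue": 5, "technical_brilliance": 4, "team_building": 4},
    {"regulated_scar_tissue": 4, "technical_brilliance": 4, "team_building": 4},
]

generalist_total = weighted_score(panel_average(generalist))
leader_total = weighted_score(panel_average(regulated_leader))
```

With these invented numbers, the generalist's perfect brilliance score is capped at the bar, so the difference between the two totals comes entirely from the regulated-environment dimension. A candidate below the bar gets no total at all, which keeps the gate separate from the weighting.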
The first engineering leader teaches the company what shipping means. In clinical AI, shipping means it works and you can prove it. Hire the person who already believes that, and the rest of the team will inherit the belief.