Machine learning engineer interview questions

picked.ai/hire/machine-learning-engineer/interview-questions

30 machine learning engineer

interview questions that actually work.

Pulled from the Neuroworx item bank: nine years of calibration against twelve-month performance outcomes on 14,083 machine learning engineers. Sorted by stage (screen, assessment, on-site) and level (IC1 to IC5). Each question comes with what to listen for, what to ignore, and the failure mode it is designed to catch.

questions

stages

levels

14k

hires of validity data

Screen ↓Role-fit ↓On-site ↓Anti-pattern questions ↓

Stage 01 · Screen

Twelve minutes. Ten questions.

The screening conversation. Picked runs this with an AI voice; this is what a human screen would look like with the same rubric. Time-box hard. 60 seconds per answer.

10 questions

Tell me about the last model you shipped to production. What degraded first?

productionspecificity

Listen for

A named model, a real input distribution shift, the signal they caught it on, the action they took.

Ignore

"It performed really well." With no metric, no week, no shift.

catches · Engineers whose only production was the demo notebook.

When did you last decide not to use ML for something?

judgement

Listen for

A specific problem, the heuristic or rule they shipped instead, the reason ML was a worse fit.

Ignore

"ML can solve almost anything." A red flag.

catches · Engineers who treat ML as the default tool.

Walk me through your last training run. What was the baseline?

evaluationhonesty

Listen for

A real baseline (often a simple model or a rule). The honest delta. The reason the delta is worth the complexity.

Ignore

A win figure with no comparison.

catches · Engineers who do not run a baseline.

Describe a time your model gave the right answer for the wrong reason.

evaluationcuriosity

Listen for

A leakage story, a spurious feature, a confounder. The way they found it.

Ignore

"My models are well validated." Means nothing.

catches · Cannot describe a single near-miss they caught.

How do you decide a model is ready for production?

production

Listen for

A real bar. Offline metrics, online shadow run, rollout plan, monitoring in place. Not a wishlist.

Ignore

"When it passes evaluation." Vague.

catches · Engineers who treat deployment as a hand-off.

What is the most expensive training run you have authorised, and what did you learn?

costjudgement

Listen for

A real number, the reason it was that size, the call on whether it was worth it.

Ignore

"We have a big compute budget."

catches · Engineers who do not know what their runs cost.

Tell me about a paper you read and decided not to adopt.

curiositytaste

Listen for

A real paper, the specific reason (small benchmark, brittle setup, infra cost). What they used instead.

Ignore

A paper they are "exploring".

catches · Engineers who chase papers without reading them critically.

How do you detect drift in a model that has been in production for six months?

productionobservability

Listen for

Input drift, output drift, business-metric drift. The cadence. The threshold.

Ignore

"We retrain on a schedule." Not the same thing.

catches · Engineers who treat monitoring as someone else's job.

Tell me about a disagreement with a researcher on your team. What happened?

comms

Listen for

A real disagreement, the mechanics of it, the outcome. They did not always win.

Ignore

"Researchers and engineers always agree." A lie.

catches · Engineers who treat researchers as a separate species.

One thing you want from the next team that you did not have last time.

stage fit

Listen for

A specific something. A data infrastructure, a researcher partner, a real prediction surface.

Ignore

"Cutting-edge work." Vague.

catches · Engineers who are not sure why they are looking.

Stage 02 · Role-fit assessment

A scoped task. A scored rubric.

One realistic task. We score the writeup, not the polish. The candidate has the take-home equivalent of 60 minutes.

8 questions

We have 200 million rows of behavioural data and a churn prediction problem. Sketch the first model, the baseline, and the monitoring plan.

modellingIC2+

Listen for

A simple first model, an honest baseline (a rule or a heuristic), a monitoring plan that includes the business metric.

Ignore

A transformer architecture diagram. Not the question.

catches · Engineers who skip the baseline.

Here is a 200-line training script from a real project. Refactor it for production. Three bullets on what you changed and why.

productionIC2+

Listen for

Configuration extracted, randomness controlled, artifacts named, logging in place. The bullets show taste.

Ignore

Cosmetic edits.

catches · Engineers who cannot tell a notebook from a service.

Reproduce a classification result from a small dataset we provide. Write the eval. 60 minutes max.

craftIC2+

Listen for

How they split the data. What they evaluate against. The thing they refuse to claim from the result.

Ignore

A higher score than warranted by the data size.

catches · Engineers who overfit and call it a win.

Read this 3-page model card and write three questions you would ask the author plus one claim you would push back on.

judgementIC3+

Listen for

Engagement with the dataset description. A push-back on a brittle generalisation claim.

Ignore

Stylistic edits to the model card.

catches · Engineers who cannot critique a model card.

Estimate the monthly cost of serving the model from question 1 at 1000 requests per second. Show your working.

cost-awareIC3+

Listen for

GPU vs CPU reasoning, batching, caching. They name the largest line item.

Ignore

A vendor list-price answer.

catches · Engineers who do not know what their inference costs.

Take this real evaluation report. Tell me what you would accept, what you would not, and what you would re-run.

evaluationIC3+

Listen for

A reading that catches data leakage or a missing slice. They re-run something concrete.

Ignore

Approval of every result on the page.

catches · Engineers who approve eval reports without engaging.

Write the monitoring plan for the model in question 1, including the action you take when an alert fires.

operabilityIC3+

Listen for

Specific signals, specific thresholds, a specific runbook. A defined rollback.

Ignore

A generic "monitor the metrics" answer.

catches · Engineers who can train but not operate.

In 200 words: why might the model from question 1 be the wrong tool, and what would you ship instead?

humilityIC4+

Listen for

A real alternative, often a heuristic or a smaller model. A reason the alternative might be better.

Ignore

A second pitch for the original model.

catches · Engineers without perspective on their own choices.

Stage 03 · On-site (after Picked)

Twelve questions you will still want to ask in person.

Picked screens, scores, and shortlists. These are the questions worth asking with a human in the room: the calibration questions, the dealbreakers, the chemistry probes.

12 questions

Where in the work do you want to grow this year?

growthmanager fit

Listen for

A specific gap, a plan, a person they would learn from.

Ignore

"I want to be a staff ML engineer." Title-laddering.

catches · Engineers without a learning agenda.

Tell me about a time you disagreed with a researcher about a model. What happened?

commsmanager fit

Listen for

A real disagreement, the data they brought to it, what they took from it.

Ignore

"I defer to research." A worrying answer.

catches · Engineers who treat research as untouchable.

What is the most uncomfortable feedback you have received from a peer?

self-awareness

Listen for

A specific piece of feedback, the change they made, the thing they still wrestle with.

Ignore

"I take feedback well." Tells us nothing.

catches · Defended self-narrative.

Walk me through a model project you wish you had killed earlier.

judgementoperating

Listen for

Honesty. The moment the experiment was clearly not working. The reason they kept going.

Ignore

A late save story with no failure in it.

catches · Sunk-cost thinkers.

What is a strong ML opinion you have changed in the last year?

intellectual humility

Listen for

A specific opinion, the result or paper that changed it, the new practice they adopted.

Ignore

"My mind is always open."

catches · Closed-loop thinkers.

Pick two ML engineers you admire from your last role. What do they do differently?

taste

Listen for

Concrete habits. The ones they adopted. The ones they did not.

Ignore

Pure praise.

catches · Engineers without taste for other engineers.

Tell me the last paper, talk, or post-mortem you read that changed how you work.

curiosity

Listen for

A specific source, what they took from it, the way it shows up in their work.

Ignore

A textbook they mean to finish.

catches · Engineers who do not read outside their stack.

When are you most productive on a research-engineering problem?

operating model

Listen for

A self-aware answer. A time of day, a working pattern, a kind of problem they reach for.

Ignore

"I am productive all the time."

catches · Engineers without self-instrumentation.

Where would you rather be in three years, deeper IC or research lead?

careerretention

Listen for

A direction and a reason. Honesty about the uncertainty.

Ignore

"Wherever the company needs me."

catches · Drifting engineers.

If you join, what would your first week look like?

agencyonboarding

Listen for

A specific plan. Often: reproduce one model offline, read the last three model cards, shadow one deploy.

Ignore

"Whatever you suggest."

catches · Engineers without an onboarding instinct.

What would make you leave us within six months?

dealbreaker

Listen for

A specific irritant. A research-engineering split that does not work. A data-access pattern that blocks you.

Ignore

"As long as the work is interesting."

catches · Hidden dealbreakers, surfaced post-offer.

What would you want to ask our most sceptical researcher?

probingcuriosity

Listen for

A real question, often about the data or the evaluation. "Where do you think the model is wrong?"

Ignore

A softball.

catches · Candidates who do not want to know what is shaky.

The anti-pattern set

Eight questions that look smart
but tell you nothing.

"What is your biggest weakness?"

You will get a strength-shaped weakness. We have asked this 47,000 times. It catches no-one. Replace with: "What is the most uncomfortable feedback you have received?".

"Where do you see yourself in five years?"

Either a rehearsed answer or a stalled one. Both useless. Replace with: "Where would you want to be in three years?"

"Tell me about yourself."

Wastes the first three minutes on the CV they already gave you. Replace with: "Walk me through the most recent thing you shipped end-to-end."

"Why this company?"

Generates polished mission-talk. Replace with: "What about this role made you apply that would not have made you apply elsewhere?"

"Are you a team player?"

No-one says no. Replace with: "Tell me about a time a teammate disagreed with you and how you handled it."

"How do you handle stress?"

No-one says badly. Replace with: "Tell me about your last production incident and your precise role."

"How would you reverse a linked list?"

Probes nothing we care about. We removed it from the bank in 2019. Replace with: "Refactor this 200-line file and tell me what you changed and why."

"If you were an animal, which animal would you be?"

You know what we are going to say. Replace with: anything else.

Or, let us ask

We will ask these for you.
By Friday.

Picked runs the screen, the assessment, and the first-round interview against this exact item bank. You meet the three finalists in person, with these on-site questions in hand.

Post a machine learning engineer role for free See pricing →

$0.99 per AI-vetted candidate. First 50 free.

Companion pages