Take-homes are a tax on the candidate. A 6-hour coding exercise, evaluated for 15 minutes by a tired engineer, is a poor signal generator and an even worse candidate experience. We do not run them.
What we run instead is two assessments, one conversation, and a structured score sheet. The whole thing takes the candidate about 45 minutes. The signal is calibrated against eight years of psychometric data from the Neuroworx side of the business. Here is the method, in full.
Twelve to eighteen items, depending on the seniority of the role. Each item is a scenario calibrated against the role family: a backend engineer at a Series B company sees a different item set from a staff engineer at a mid-market one. The items are not multiple-choice trivia. They are short forced-choice tradeoffs (e.g. "you have a flaky test that fails 1 in 20 runs. The release is in 36 hours. What do you do, and why?") scored on the reasoning more than the answer.
The item bank is sourced from the same psychometric battery Neuroworx has been running since 2018. That battery has been validated against on-the-job performance for over 40,000 hires across software, data, and operations roles. The four-fifths (4/5ths) adverse-impact check is applied to every batch, for every protected group and every role.
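For readers unfamiliar with the four-fifths rule, here is a minimal sketch of the arithmetic. It is not our production check: the function names, the group labels, and the toy batch are all hypothetical, and the 0.8 threshold is simply the conventional four-fifths cut-off. Each group's pass rate is divided by the highest group's pass rate, and any group whose ratio falls below 0.8 is flagged.

```python
from collections import Counter

def selection_rates(outcomes):
    """outcomes: iterable of (group, passed) pairs. Returns {group: pass rate}."""
    totals, passes = Counter(), Counter()
    for group, passed in outcomes:
        totals[group] += 1
        if passed:
            passes[group] += 1
    return {g: passes[g] / totals[g] for g in totals}

def four_fifths_check(outcomes, threshold=0.8):
    """Return (impact_ratios, flagged_groups) for one batch.

    The impact ratio of a group is its pass rate divided by the highest
    pass rate of any group. Groups below `threshold` are flagged.
    """
    rates = selection_rates(outcomes)
    best = max(rates.values())
    if best == 0:
        # Nobody passed, so no group is favoured over another.
        return {g: 1.0 for g in rates}, []
    ratios = {g: rate / best for g, rate in rates.items()}
    flagged = sorted(g for g, ratio in ratios.items() if ratio < threshold)
    return ratios, flagged

# Hypothetical toy batch: group "A" passes 5 of 10, group "B" passes 3 of 10.
batch = ([("A", True)] * 5 + [("A", False)] * 5
         + [("B", True)] * 3 + [("B", False)] * 7)
ratios, flagged = four_fifths_check(batch)
print(ratios, flagged)
```

In this toy batch group "B" has an impact ratio of about 0.6, so it is flagged. A real check would also need minimum sample sizes per group, which the sketch leaves out.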
A 20 to 25 minute live voice conversation between the candidate and Picked. Not a recording. Not a one-way Loom. A real adaptive interview, scored against the role rubric in flight. The candidate speaks. Picked listens, asks clarifying questions, and probes on weak answers. The transcript is published to the hiring manager with the score sheet.
The conversation runs on LiveKit, with OpenAI Whisper doing transcription and Anthropic Claude doing the reasoning and scoring. The voice agent stack is documented on /trust/model-cards.
Two things make this different from HireVue-style recorded video:
Every candidate produces a single-page score sheet: eight to twelve competency scores, each with the evidence span from the transcript, and a single role-fit rank. The hiring manager reads the sheet, opens the transcript on any score they want to challenge, and either accepts the rank or overrides it. The override is logged.
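As a rough illustration of the score sheet's shape, the sketch below models it as a small data structure. Every name here (`CompetencyScore`, `ScoreSheet`, `override_rank`, the field names, the 1 to 5 score scale) is an assumption made for the example, not the actual Picked schema. It shows the three properties described above: each score points at an evidence span in the transcript, there is a single role-fit rank, and an override is appended to a log rather than silently replacing the rank.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CompetencyScore:
    competency: str
    score: int            # assumed 1-5 scale, for illustration only
    evidence_start: int   # character offsets into the transcript
    evidence_end: int

@dataclass
class ScoreSheet:
    candidate_id: str
    transcript: str
    scores: list
    role_fit_rank: int
    override_log: list = field(default_factory=list)

    def evidence(self, competency):
        """Return the transcript span that backs a given competency score."""
        for s in self.scores:
            if s.competency == competency:
                return self.transcript[s.evidence_start:s.evidence_end]
        raise KeyError(competency)

    def override_rank(self, new_rank, manager, reason):
        """Change the rank and record who changed it, from what, and why."""
        self.override_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "manager": manager,
            "old_rank": self.role_fit_rank,
            "new_rank": new_rank,
            "reason": reason,
        })
        self.role_fit_rank = new_rank

# Hypothetical usage with a one-line transcript.
transcript = "I would quarantine the flaky test and ship."
quote = "quarantine the flaky test"
start = transcript.find(quote)
sheet = ScoreSheet(
    candidate_id="cand-001",
    transcript=transcript,
    scores=[CompetencyScore("risk judgment", 4, start, start + len(quote))],
    role_fit_rank=3,
)
sheet.override_rank(2, manager="hm-42", reason="Strong answer on release risk")
print(sheet.evidence("risk judgment"), sheet.role_fit_rank, len(sheet.override_log))
```

Keeping the old rank in the log entry means the original model output can always be recovered after an override.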
We publish the rubric. We do not hide the prompt. The model card for each role family is on /trust/model-cards, version-controlled, and the bias audit per role family is on /trust/fairness.
A take-home selects for candidates with six hours of free time on a Sunday. That is not the same set as good engineers. It under-selects for parents, carers, and candidates with second jobs. It is a quiet bias filter applied early and never measured.
The voice interview also selects, of course. Every assessment selects. But it selects on the thing you want, which is "can this person reason about an engineering tradeoff in real time and explain it to a colleague". That generalises to the job. A clean take-home submission does not.
We have run the method against four held-out cohorts so far. Prediction of performance at six months on the job, measured against manager performance ratings, comes out at r = 0.49, against industry benchmarks of r = 0.20 for unstructured interviews and r = 0.34 for work-sample tests. The methodology is on /trust/model-cards. The held-out cohorts are not yet large enough to publish externally; we plan to do so in the first annual report in Q3 2027.
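For context on what a figure like r = 0.49 measures, here is a minimal sketch of a Pearson correlation between assessment scores and later manager ratings. The function name and the numbers are made up for illustration; they are not cohort data, and the sketch is not the validation pipeline documented on /trust/model-cards.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only: assessment score at hire, manager rating at 6 months.
assessment_scores = [62, 70, 55, 81, 90, 47]
manager_ratings = [3, 4, 3, 4, 5, 2]
print(round(pearson_r(assessment_scores, manager_ratings), 2))
```

An r of 1 means the assessment ranks candidates exactly as their later ratings do, and an r of 0 means it carries no linear signal about them.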
Engineering hiring is the hardest part of the funnel, because engineers know what bad assessment looks like and will opt out. The method above is what we built so they would opt in.
The people building Picked. Method posts, model cards, fairness audits, product opinions. Edited and signed off by the engineering and research leads.