AI with Michal

Calibration session (hiring)

A structured meeting where a hiring panel agrees on scoring standards before or during an interview loop, ensuring every interviewer uses the same anchors when rating competencies rather than inventing a personal bar mid-search.

Michal Juhas · Last reviewed May 15, 2026

What is a calibration session in hiring?

A calibration session is a structured meeting where the hiring panel agrees on scoring standards before or during an interview loop. The goal is simple: every interviewer uses the same anchors when rating competencies, rather than each person inventing a personal bar mid-search.

Without calibration, a panel of four interviewers might all be using "strong communicator" to mean four different things. One person means they spoke clearly. Another means they structured their answer with a point, evidence, and takeaway. A third is weighting whether the candidate made eye contact. None of this surfaces in the debrief unless scores land far apart and someone asks why.

Calibration sessions usually involve a facilitator, a shared scorecard, and a sample transcript or past answer to score independently before any group discussion opens. The session ends with a written anchor document: what the panel agreed a 1, 3, and 5 look like for each competency, not in the abstract, but tied to the kind of evidence a candidate could actually give.
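
For teams that keep these anchors in a shared system, here is a minimal sketch of what such a record might hold, written as Python dataclasses. The field names and sample anchor text are illustrative assumptions, not a standard format; the point is that each score level is tied to observable evidence, and the document carries a date and panelist names.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CompetencyAnchor:
    # What the panel agreed a 1, 3, and 5 look like, tied to evidence
    # a candidate could actually give.
    competency: str
    looks_like_1: str
    looks_like_3: str
    looks_like_5: str

@dataclass
class AnchorDocument:
    role: str
    agreed_on: date           # dated before any candidate is evaluated
    panelists: list[str]
    sample_scored: str        # which sample transcript or past answer was scored
    anchors: list[CompetencyAnchor] = field(default_factory=list)

# Illustrative content only; your panel writes its own anchors.
doc = AnchorDocument(
    role="Backend Engineer",
    agreed_on=date(2026, 3, 2),
    panelists=["recruiter", "hiring manager", "peer interviewer"],
    sample_scored="past answer: production incident story",
    anchors=[
        CompetencyAnchor(
            competency="communication",
            looks_like_1="no point or takeaway; interviewer had to reconstruct the story",
            looks_like_3="clear point and evidence; takeaway only when prompted",
            looks_like_5="point, evidence, and takeaway structured without prompting",
        )
    ],
)
```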

Illustration: three interviewers score the same transcript independently, a facilitator compares the spread, and the panel produces a written 1-3-5 anchor document before group discussion opens.

In practice

  • When a recruiter says "the panel scored her a 2 and a 5 on the same competency for the same story," that is a calibration failure: two interviewers heard the same answer and used different rulers.
  • A hiring manager who opens a debrief with "I thought she was great overall, what did everyone else think?" has already anchored the room before anyone shares their scorecard, which is exactly what calibration-led debriefs are designed to prevent.
  • TA ops teams that run calibration before every new role report fewer debrief conflicts and faster hire-or-no-hire decisions because the panel is resolving evidence gaps, not redefining the criteria in real time.

Quick read, then how hiring teams use it

This is for recruiters, TA leads, and HR partners who need the same vocabulary in debrief calls, interview training, and process design. Skim the first section for a fast shared picture. Use the second when you are rolling out a new panel, onboarding a new interviewer, or investigating why debrief scores keep spreading.

Plain-language summary

  • What it means for you: Before anyone interviews a candidate, the panel spends 30-60 minutes agreeing on what a strong, average, and weak answer looks like for each competency. You write it down. You use the same document for every candidate.
  • How you would use it: Run one calibration session per new role or per new panel composition, using a sample answer (past or synthetic) to score independently before the group compares. The gap between your scores is your agenda.
  • How to get started: Pull the scorecard for your next active req. Write down what you personally think a 5 looks like for your most important competency. Send the same question to two other panelists before your next kickoff and compare answers. You will find the gap immediately.
  • When it is a good time: Every time you open a new interview loop with a panel of two or more people. Single-interviewer screens benefit from calibration too, but the need is less acute.

When you are running live reqs and tools

  • What it means for you: Calibration is the difference between scorecards that produce independent evidence and scorecards that become post-hoc rationalization of whoever spoke first in the debrief.
  • When it is a good time: Before the first interview on a req, when a new panelist joins a running loop, and after any split decision where the debrief went longer than 20 minutes without resolution.
  • How to use it: Designate a calibration facilitator (usually the recruiter or a debrief coordinator who attends every panel). Use a real or synthetic transcript, set a timer for independent scoring, compare spreads, and write the anchors down. Store the anchor document in the ATS or a shared folder the panel can reference during interviews.
  • How to get started: Build a one-page anchor template into your standard interview package alongside the question set. Require it to be signed off before the first interview slot is booked, the same way you require the behavioral interview questions to be shared with the panel in advance.
  • What to watch for: Panels that skip calibration when time is tight (the ones with 15 open reqs tend to skip it most), facilitators who open the group discussion before everyone has submitted independent scores, and anchor documents that get written once and never revisited when the req or panel changes. AI transcript tools can help diagnose drift post-hoc, but they do not replace the pre-interview alignment conversation.

Where we talk about this

On AI with Michal live sessions, calibration comes up whenever we connect structured scoring to debrief quality and ATS data reliability. In the AI in recruiting track, it sits alongside behavioral interview question design and scorecard setup; together they are the three elements that determine whether your interview data is worth storing. If you want the full room conversation with other TA practitioners who are solving this in active hiring cycles, start at Workshops and bring a real scorecard you are working with.

Calibration session vs. standard debrief

Dimension | Calibration-led process | Standard debrief
Score submission timing | Before the meeting opens | During or after group discussion
Opening move | Facilitator shares spread anonymously | Senior person states overall impression
Anchor document | Written and signed before first interview | Implicit, improvised, or missing
Disagreement source | Evidence gaps vs. anchor gaps (separated) | Usually unclear
Bias exposure | Reduced (structure limits anchoring effect) | Higher (first-speaker effect, seniority bias)
Documentation | Anchor record dated before evaluation | Typically absent

Frequently asked questions

What happens in a calibration session before a hiring loop opens?
A pre-loop calibration gathers the panel before any interviews run to agree on what a strong answer looks like for each competency. The facilitator shares a sample transcript and asks everyone to score independently first. When scores diverge, the group surfaces the gap rather than a candidate's personality. The output is a shared anchor document: for each scorecard competency, the panel agrees what separates a 2 from a 4. This takes 30-60 minutes and prevents the debrief version where every panelist invents their own bar and discussion becomes a debate about definitions rather than candidate evidence.
How is a calibration-led debrief different from a standard post-interview meeting?
A standard debrief opens with someone stating their overall impression, which anchors every voice that follows. A calibration-led debrief inverts this: each interviewer submits scores against the competency anchors before the meeting begins, and the facilitator shares the spread without attribution before anyone speaks. Discussion starts where scores diverge most, not where the most senior person has an opinion. The goal is to reconcile evidence, not manage impressions. Teams that run this format consistently produce hire-or-no-hire decisions that hold up under the retrospective question: if we gave this person a 3 on reliability and they struggled at month six, what evidence did we actually have in the debrief room?
When should you run a calibration session mid-search?
Mid-search calibration is worth running when your first completed scorecards show scores drifting apart for the same competency. If three panelists rated problem-solving across five candidates and none used the same scale consistently, you have an anchor problem, not a candidate problem. Trigger a calibration when a scoring spread of three or more points on the same evidence appears in back-to-back reviews, when a new interviewer joins an ongoing panel, or after any split hire-or-no-hire decision that produced real conflict. Waiting until the end of a search to notice scoring gaps means your early candidates were evaluated under different rules than your final shortlist, which is both unfair and hard to defend.
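
To make that trigger concrete, here is a small sketch that scans submitted scores and flags any competency where panelists landed three or more points apart on the same candidate. The data shape is an assumption for illustration; in practice you would pull these numbers from your ATS export.

```python
# Scores per (candidate, competency): {interviewer: score on the 1-5 scale}.
# Shape and numbers are invented for illustration.
scores = {
    ("candidate_a", "problem_solving"): {"alex": 2, "bo": 5, "casey": 4},
    ("candidate_b", "problem_solving"): {"alex": 3, "bo": 3, "casey": 4},
}

SPREAD_TRIGGER = 3  # the "three or more points on the same evidence" rule above

for (candidate, competency), panel in scores.items():
    spread = max(panel.values()) - min(panel.values())
    if spread >= SPREAD_TRIGGER:
        print(f"Re-calibrate '{competency}': {candidate} scored {panel} (spread {spread})")
```
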
What role does AI play in calibration sessions for hiring?
AI transcript tools can surface calibration gaps before you need a meeting to find them. If two interviewers scored the same candidate answer a 2 and a 5, an AI interview intelligence tool mapping response content to competency anchors can flag the divergence and show what evidence each evaluator weighted differently. That is a diagnostic input, not a score. The risk is using AI output to close calibration gaps without running the human conversation: you end up with consistent scores driven by model preferences, not agreed human anchors. Use AI to prepare the calibration agenda. Log which model version processed transcripts, and keep the anchor document as the authoritative record.
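
As a sketch of the diagnostic-not-score distinction, the snippet below turns divergent scores into calibration agenda items and records which model version processed the transcripts. The record structure is hypothetical and implies no vendor's API; the agenda question is the human conversation the tool cannot replace.

```python
# Hypothetical per-interviewer records for one candidate answer: the score given
# and the evidence snippet that interviewer weighted. No vendor API is implied.
flags = [
    {"competency": "reliability", "interviewer": "alex", "score": 2,
     "evidence": "did not verify the rollback actually completed"},
    {"competency": "reliability", "interviewer": "bo", "score": 5,
     "evidence": "owned the incident end to end and wrote the postmortem"},
]
MODEL_VERSION = "transcript-model-2026-05"  # logged for provenance, never used as a score

by_competency: dict[str, list[dict]] = {}
for item in flags:
    by_competency.setdefault(item["competency"], []).append(item)

agenda = []
for competency, entries in by_competency.items():
    spread = max(e["score"] for e in entries) - min(e["score"] for e in entries)
    if spread >= 3:
        agenda.append({
            "competency": competency,
            "positions": [(e["interviewer"], e["score"], e["evidence"]) for e in entries],
            "discuss": "Which evidence should this anchor actually weight?",
            "model_version": MODEL_VERSION,
        })
print(agenda)
```
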
How do calibration sessions reduce bias in structured interviews?
Calibration reduces three specific bias mechanisms. First, it stops affinity bias from setting the bar: when a panel agrees what initiative looks like from specific STAR evidence before interviews begin, no single interviewer's cultural reference shapes the standard. Second, it surfaces inter-rater reliability problems before they affect candidate outcomes. Third, structured post-interview calibration where scores are submitted before the debrief opens prevents the most senior voice from anchoring the group. None of this eliminates bias completely. Regularly reviewing adverse impact data at the scorecard level tells you whether calibrated evaluations still produce group-level outcome gaps that need root-cause investigation.
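
One widely used group-level check is the four-fifths rule from US adverse impact analysis: each group's pass-through rate is compared to the highest group's rate, and a ratio below 0.8 is the conventional flag for root-cause investigation. A minimal sketch with invented counts, assuming you can export advance/no-advance outcomes per group at a given stage:

```python
# Advance counts per group at one interview stage. Numbers are invented;
# the real input is your scorecard-level outcome export.
outcomes = {
    "group_a": {"advanced": 18, "total": 40},
    "group_b": {"advanced": 9, "total": 35},
}

rates = {group: c["advanced"] / c["total"] for group, c in outcomes.items()}
benchmark = max(rates.values())  # highest group's pass-through rate

for group, rate in rates.items():
    ratio = rate / benchmark
    if ratio < 0.8:  # the four-fifths threshold
        print(f"{group}: impact ratio {ratio:.2f} vs benchmark, investigate root causes")
```
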
What records should you keep from a calibration session?
Keep the anchor document: a one-page record of what the panel agreed a 1, 3, and 5 look like for each competency, with the date and panelist names. If the session used a sample transcript, note what it was and what scoring spread it surfaced. This serves two purposes. It gives new panelists joining mid-loop a concrete briefing rather than a verbal walkthrough that shifts over time. And it is part of your documentation trail if a hiring decision is challenged. Under GDPR and most equivalent frameworks, decisions must be explainable. A calibration record dated before the evaluation began demonstrates the criteria were set ahead of seeing candidates, not constructed afterward to justify the outcome.
Where does a calibration session fit in a full TA interview workflow?
Calibration fits in three places. Before the loop opens, spend 30-60 minutes agreeing on anchor definitions for the scorecard using a sample transcript. Between loops, when a req restarts or a new panelist joins, run a short re-calibration rather than assuming the original anchors still hold. After any split decision, use the debrief retrospective to understand whether the split came from different evidence or from inconsistent scoring of the same evidence: these have different fixes. If you use AI tools to summarize transcripts, review their output against your calibration anchors rather than accepting their score framing as the new baseline. Pair this with a behavioral interview question set and a shared scorecard for the full structure.

← Back to AI glossary in practice