Question 1

What does model evaluation actually measure in a hiring context?

Accepted Answer

In hiring, evaluation typically checks three things: accuracy (does the model rank or label applications the way trained reviewers would?), consistency (does the same profile produce the same output on two runs?), and group fairness (do pass rates differ by gender, ethnicity, or age in a way the role criteria cannot explain?). Talent acquisition (TA) teams run the evaluation on a held-out set of real applications, compare model outputs to human decisions, and flag any gap larger than a threshold agreed on before launch. Without this step, AI screening tools carry forward whatever patterns existed in the training data.
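As a rough illustration of the accuracy and group-fairness checks, here is a minimal Python sketch. The labels (`"advance"`, `"reject"`), the group names, and the sample data are all hypothetical; a real run would use the team's own held-out applications and whatever gap threshold was agreed on.

```python
from collections import defaultdict

def agreement_rate(model_labels, human_labels):
    """Share of applications where the model matches the human reviewer."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not model_labels:
        raise ValueError("need at least one labelled application")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)

def pass_rates_by_group(model_labels, groups, pass_label="advance"):
    """Pass rate of the model's decisions within each applicant group."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for label, group in zip(model_labels, groups):
        total[group] += 1
        passed[group] += label == pass_label
    return {g: passed[g] / total[g] for g in total}

def largest_gap(rates):
    """Difference between the highest and lowest group pass rate."""
    return max(rates.values()) - min(rates.values())

# Hypothetical held-out sample (illustrative values only).
model = ["advance", "reject", "advance", "reject", "advance", "reject"]
human = ["advance", "reject", "reject", "reject", "advance", "advance"]
group = ["A", "A", "A", "B", "B", "B"]

accuracy = agreement_rate(model, human)
rates = pass_rates_by_group(model, group)
gap = largest_gap(rates)
print(f"agreement={accuracy:.2f} rates={rates} gap={gap:.2f}")
```

A gap on its own does not prove unfairness. It marks where to check whether the role criteria explain the difference.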

Question 2

How often should hiring teams re-evaluate a model once it is live?

Accepted Answer

A practical starting point is quarterly for high-volume roles, plus a re-run after any major change: new job families added, an ATS migration, a prompt rewrite, or a model version update from the vendor. Accuracy tends to stay stable early, but fairness metrics can drift as applicant pools shift seasonally. Log every prediction with a timestamp and a version tag so you can compare cohorts over time. If your vendor changes the underlying model without notice, treat it as the start of a new evaluation cycle. The EU AI Act classifies AI used for recruitment and candidate selection as high-risk, and high-risk systems require documented ongoing monitoring, so treat that as the compliance baseline.
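One way to picture the logging advice is the Python sketch below. The record fields, the version tags (`v1`, `v2`), and the sample data are assumptions made for illustration, not a vendor or ATS schema.

```python
from collections import defaultdict
from datetime import datetime, timezone

def make_log_record(application_id, decision, model_version, when=None):
    """Build one log entry; defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return {
        "application_id": application_id,
        "decision": decision,
        "model_version": model_version,
        "timestamp": when.isoformat(),
    }

def quarter_of(record):
    """Map a record to a label such as '2024-Q1'."""
    ts = datetime.fromisoformat(record["timestamp"])
    return f"{ts.year}-Q{(ts.month - 1) // 3 + 1}"

def pass_rate_by(records, cohort_fn, pass_label="advance"):
    """Pass rate per cohort, where cohort_fn maps a record to a cohort key."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        cohort = cohort_fn(record)
        total[cohort] += 1
        passed[cohort] += record["decision"] == pass_label
    return {c: passed[c] / total[c] for c in total}

def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)

# Hypothetical log entries (illustrative values only).
records = [
    make_log_record("a1", "advance", "v1", utc(2024, 1, 15)),
    make_log_record("a2", "reject", "v1", utc(2024, 2, 20)),
    make_log_record("a3", "advance", "v2", utc(2024, 4, 10)),
    make_log_record("a4", "advance", "v2", utc(2024, 5, 5)),
]

by_version = pass_rate_by(records, lambda r: r["model_version"])
by_quarter = pass_rate_by(records, quarter_of)
print(by_version, by_quarter)
```

Storing the version tag with each prediction is what makes a silent vendor update visible later: pass rates can be split by version as well as by time.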

Question 3

Do small teams need a full evaluation framework or just spot checks?

Accepted Answer

Spot checks catch obvious failures but miss quiet drift and systematic bias. A minimum credible evaluation for a small team is: 50 to 100 labelled applications reviewed blind by two recruiters, model outputs compared against those labels, and a simple parity check by gender and any other group in the applicant data. This takes a few hours, not weeks. The [AI bias audit](/ai-glossary-in-practice/ai-bias-audit) page covers the fairness slice in detail. Document what you tested, when, and what the error rate was. That document is the first thing legal will ask for if a candidate complaint arrives.
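The comparison step could look roughly like the Python sketch below. It assumes two things the text does not specify: that only applications where both recruiters agree count as ground truth, and that Cohen's kappa is used as the measure of how well the two recruiters agree. All data is hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two reviewers, corrected for chance agreement."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in all_labels) / (n * n)
    if expected == 1:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)

def error_rate_on_consensus(model_labels, labels_a, labels_b):
    """Model error rate on applications where both reviewers agree."""
    pairs = [
        (m, a)
        for m, a, b in zip(model_labels, labels_a, labels_b)
        if a == b  # assumption: disagreements are set aside for discussion
    ]
    if not pairs:
        raise ValueError("reviewers never agreed; no consensus labels")
    errors = sum(m != truth for m, truth in pairs)
    return errors / len(pairs), len(pairs)

# Hypothetical blind labels from two recruiters, plus model outputs.
A, R = "advance", "reject"
recruiter_1 = [A, A, R, R, A, R, A, R]
recruiter_2 = [A, A, R, R, A, R, R, R]
model_out = [A, R, R, R, A, A, A, R]

kappa = cohens_kappa(recruiter_1, recruiter_2)
error_rate, n_consensus = error_rate_on_consensus(
    model_out, recruiter_1, recruiter_2
)
print(f"kappa={kappa:.2f} error_rate={error_rate:.2f} on {n_consensus} items")
```

Recording the kappa value, the error rate, the sample size, and the date gives the written record a concrete starting point.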

Question 4

What are the most common failure modes in hiring model evaluations?

Accepted Answer

Four patterns show up most often in live cohorts: title mismatch, where the model scores a career changer low because their past titles do not match the training examples; built-in recency bias, where recent-graduate profiles from target universities receive inflated scores; inconsistent scoring of identical profiles with different formatting, which surfaces when you run the same CV a second time after changing only the whitespace; and proxy discrimination, where zip code or school prestige stands in for a protected characteristic. All four can be found with a structured sample review before you connect the model to an outreach sequence or a reject queue.
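The formatting check lends itself to a small test harness. In the Python sketch below, `fragile_scorer` and `robust_scorer` are toy stand-ins invented for the example; in practice `score_fn` would wrap whatever call your screening tool exposes.

```python
import re

def whitespace_variants(cv_text):
    """The same CV content with only the whitespace changed."""
    return [
        cv_text,
        cv_text.replace("\n", "\n\n"),                # extra blank lines
        re.sub(r" +", "  ", cv_text),                 # doubled spaces
        "  " + cv_text.replace("\n", "  \n") + "\n",  # padding around lines
    ]

def score_spread(score_fn, cv_text):
    """Largest score difference across whitespace-only variants."""
    scores = [score_fn(variant) for variant in whitespace_variants(cv_text)]
    return max(scores) - min(scores)

def fragile_scorer(cv_text):
    # Toy scorer: depends on raw character count, so formatting leaks in.
    return len(cv_text) / 100

def robust_scorer(cv_text):
    # Toy scorer: counts words, so whitespace-only changes have no effect.
    return len(cv_text.split()) / 100

# Hypothetical CV text (illustrative only).
cv = "Jane Doe\nData analyst, 5 years\nSQL, Python"

fragile_spread = score_spread(fragile_scorer, cv)
robust_spread = score_spread(robust_scorer, cv)
print(f"fragile spread={fragile_spread:.2f}, robust spread={robust_spread:.2f}")
```

A non-zero spread means the score depends on layout as well as content. Non-deterministic models may need a small tolerance instead of an exact-zero check.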

Question 5

Which roles on the TA team should own model evaluation?

Accepted Answer

No single role owns it cleanly, which is why it often falls through the cracks. A practical split: TA ops or a senior recruiter owns sample collection and labels, a data-literate person (people analytics, HR tech, or an outside partner) runs the comparison and fairness checks, and a TA leader signs off on the acceptance threshold. In agencies or small teams without all three, one person can cover all parts as long as the review is documented and is not done by whoever tuned the model. Calibration sessions in AI workshops cover how to build a lightweight version of this split without a dedicated team.

Question 6

How is model evaluation different from A/B testing in recruiting?

Accepted Answer

A/B testing measures which pipeline variant produces better business outcomes (lower time-to-fill, higher offer acceptance rate) over weeks of live traffic. Model evaluation runs offline, before deployment, against labelled ground truth. You need both: evaluation tells you whether the model is safe to deploy; A/B testing tells you whether it actually improves the funnel. Skipping evaluation and going straight to A/B testing means any bias in the model is live during the test period, which creates legal and candidate-experience risk. Think of evaluation as the pre-flight checklist and A/B testing as the post-launch performance review.

Model evaluation for hiring
