Question 1

What does fine-tuning a domain model actually mean for recruiting?

Accepted Answer

It means taking a base language model and continuing to train it on a narrow recruiting corpus (your outreach, intake notes, screen summaries, JD library) so the model internalises your tone, taxonomy, and judgment without you re-explaining them every call. It sits at the heavy end of the customization stack, after [system instructions](/ai-glossary-in-practice/system-instructions), [few-shot prompting](/ai-glossary-in-practice/few-shot-prompting), and [RAG](/ai-glossary-in-practice/rag). Most TA teams never need it. Prompting plus retrieval reaches diminishing returns slowly, and fine-tuning means labeled datasets, eval suites, and retrains every time policy moves. Reserve it for high-volume narrow tasks (intake summarization at scale, internal mobility tagging, structured classification) where output shape is so repetitive that pinning the weights starts to earn its keep.

Question 2

When does fine-tuning beat prompting or RAG for hiring teams?

Accepted Answer

When the same narrow task runs hundreds of times per week and prompts keep growing to control quality. If your outreach pack already uses eight exemplars plus a 600-token rubric, a fine-tuned smaller model can match output quality at lower per-call cost and lower latency. For one-off executive searches or anything where the criteria change quarterly, prompting wins because you can revise the rule in minutes. [RAG](/ai-glossary-in-practice/rag) beats fine-tuning when the facts move (org charts, comp bands, JD libraries), because retrieved snippets stay current without retraining. The honest rule: fine-tune for shape and tone, retrieve for facts, and prompt for one-off judgment calls. Run a twenty-case eval before committing budget so finance sees a number, not a vibe.

Question 3

What GDPR and EU AI Act risks come with fine-tuning on candidate data?

Accepted Answer

High. Candidate names, resumes, screen notes, and outreach replies are personal data under GDPR, and employment AI sits in the high-risk category of the [EU AI Act in hiring](/ai-glossary-in-practice/eu-ai-act-hiring). Once data is baked into weights you cannot easily delete a single record on a Subject Access Request, which makes Article 17 erasure painful. Most legal teams require pseudonymized or synthetic training corpora, a written retention rule, and a record of which dataset version trained which model. Pair the fine-tune with the per-decision logs from [explainable AI in hiring](/ai-glossary-in-practice/explainable-ai-hiring) so reviewers can defend specific outputs, and treat the vendor agreement like a data processing addendum, not a low-stakes API call.

Question 4

Which vendors offer fine-tuning today, and which do not?

Accepted Answer

OpenAI exposes supervised fine-tuning on the smaller GPT-4o-mini class models through the platform UI and API. Anthropic does not publish a general fine-tuning product for Claude; teams steer Claude with [system instructions](/ai-glossary-in-practice/system-instructions) and [RAG](/ai-glossary-in-practice/rag) instead. Google Gemini offers tuning for some models through Vertex AI. Open-weight families (Llama, Mistral, Qwen) let you fine-tune on your own hardware or a managed cloud. Prices, data residency, audit logs, and rate limits all vary. Always ask the vendor where training jobs run, who can see the dataset, how long it is retained, and whether logs survive a security incident. Procurement should run the same checklist they use for any HR data processor, not a faster lightweight review.

Question 5

How much data do you need before fine-tuning is worth trying?

Accepted Answer

Less than people guess for tone, more than people guess for judgment. A few hundred carefully labeled examples often shift style and structure reliably; thousands are needed before scoring or ranking work consistently outperforms a well-prompted base model. Data quality dominates volume: ten clean exemplars beat a thousand noisy ones for outbound voice. Before collecting anything, write the eval set first (twenty representative cases with the answer you want) so each training iteration can be measured. Without an eval set you will not know whether the second fine-tune is better, worse, or just different. Treat fine-tuning as an evaluation pipeline you maintain, not a one-shot training job, and budget the labeling time, not only the GPU time.

Question 6

What goes wrong with domain fine-tunes in production?

Accepted Answer

Drift, stale checkpoints, leaked PII, and silent regressions. Your bar moves (new comp bands, new diversity language, a renamed role family), and a fine-tune frozen six months ago keeps producing last quarter's voice. If you trained on real candidate text without redaction, the model can echo personal details in unexpected outputs, which is hard to defend in an audit. Without an eval set, the team cannot tell whether last week's improvement on outreach quietly broke this week's intake summary. Schedule re-evaluation after every policy change, every vendor model upgrade, and at least once a quarter. Pair the fine-tuned model with the same [human-in-the-loop](/ai-glossary-in-practice/human-in-the-loop) review you use for prompted output so a person still owns the send.

Question 7

How should recruiting teams decide before paying for a fine-tune?

Accepted Answer

Three checks. First, can you frame the task as repeatable shape and stable tone, not changing facts? If facts move, choose [RAG](/ai-glossary-in-practice/rag). Second, do you have at least a hundred clean examples per output type and a written eval set? If not, spend a week labeling before training; it is the cheapest part of the project. Third, can legal sign off on the data, the model host, and the retention story? If the answer is unclear, run the first experiment on synthetic or fully anonymized data. For room-tested patterns, join a [Live Build](/recruiting-os/labs/ai-sourcing) cohort or open the [AI Sourcing Lab](/recruiting-os/labs/ai-sourcing) to see whether your team's repetition actually justifies the lift before the bill arrives.

Approach	When it wins	Watch out
System instructions	Stable rules, global tone, low cost	Hard limits on length and specificity
Few-shot prompting	Tone shift in one thread, fast iteration	Eats LLM tokens, drifts when nobody owns the pack
RAG	Facts that move (comp, org chart, JDs)	Retrieval quality dominates everything else
Fine-tuning	High-volume narrow task, stable shape	Data labeling, retrains, GDPR scope on weights

Fine-tuning for domain models

What is fine-tuning for domain models?

In practice

Quick read, then how hiring teams use it

Plain-language summary

When you are running live reqs and tools

Where we talk about this

Around the web (opinions and rabbit holes)

Fine-tuning versus related customization options

Related on this site

Frequently asked questions