AI with Michal

Fine-tuning for domain models

Fine-tuning for domain models means taking a general-purpose language model and training it further on a narrow corpus (your outreach, intake notes, classification labels) so the model bakes in your tone, taxonomy, and judgment instead of relying on prompts every call.

Michal Juhas · Last reviewed May 27, 2026

What is fine-tuning for domain models?

Fine-tuning for domain models is the heavy lift of AI customization: you take a general-purpose language model and train it further on a curated set of your own examples so the model internalises your tone, taxonomy, and judgment without you re-explaining the rules in every prompt. For a recruiting team that already runs solid system instructions, few-shot prompting, and RAG, fine-tuning is the next step only when the same task repeats often enough that pinning the weights pays for the labeling, the evals, and the retrains.

Most TA teams will get further with better prompts and retrieval than with a fine-tune. The right test is not "could we", it is "does this pattern repeat enough that prompting has stopped scaling, and is the output shape stable enough that we will not retrain every month?"

Illustration: a generic base language model plus a curated, redacted recruiting corpus feeding a training job that produces a domain-specialised model, sitting beside a retrieval store and passing through a human review gate before candidate-facing output, with an audit log strip beneath

In practice

  • A sourcing lead with thousands of intake notes per quarter fine-tunes a small model on a labeled corpus so summaries land in the team's preferred structure without a 400-token system block. The phrase you hear in standups is "we trained the model on our format", not "we wrote a longer prompt".
  • TA legal asks vendor demos whether a scoring model was fine-tuned on customer data, how that data was pseudonymized, and whether they can produce per-candidate logs. If the vendor cannot answer, the conversation moves to explainable AI in hiring before the contract is signed.
  • A founder building an in-house outreach tool decides retrieval over fine-tuning because the comp bands and JD library change every six weeks. Fine-tuning would freeze a snapshot the team would have to retrain quarterly to keep accurate.

Quick read, then how hiring teams use it

This is for recruiters, sourcers, TA, and HR partners deciding whether to pay for a fine-tune, or evaluating a vendor who already runs one. Skim the first section when you need a shared picture across a hiring team. Use the second when you are working with engineering or a vendor on data, evals, and rollout.

Plain-language summary

  • What it means for you: A fine-tuned model has been taught your team's tone and structure during training, so you can ask for shorter prompts and still get on-brand output.
  • How you would use it: You collect labeled examples (your good outreach, your preferred intake format), run a training job with a vendor or open-weight tooling, then plug the new model into the same workflow your team already uses.
  • How to get started: Write the eval set first: twenty real cases with the answer you would call great. Do not collect training data until you know how you will measure success.
  • When it is a good time: When the same task runs hundreds of times a week, prompts have crept past 800 tokens to keep quality, and the underlying criteria have stayed stable for at least a quarter.

When you are running live reqs and tools

  • What it means for you: Fine-tuning lives next to retrieval and prompting, not above them. It is the option you reach for last because it costs the most to maintain.
  • When it is a good time: High-volume narrow tasks (intake summarization, internal mobility classification, structured screening notes) with stable output shape and rare policy moves.
  • How to use it: Build the eval set, redact the training data, label aggressively for tone and structure, train a smaller model first, and benchmark against a strong prompted baseline before you ship.
  • How to get started: Read How to write better AI prompts first; if a sharper prompt closes the gap, you do not need a fine-tune yet.
  • What to watch for: Stale checkpoints when policy moves, leaked PII from unredacted training data, and silent regressions on tasks the eval set never covered. Pair with human-in-the-loop review on every candidate-facing output.

Where we talk about this

On AI with Michal live sessions, fine-tuning comes up most in sourcing automation when teams ask whether to pay a vendor to train on their corpus or build the workflow with prompting plus retrieval. The honest answer is usually "prompt and retrieve first, fine-tune only the narrowest repeat task". Bring your highest-volume pattern to a Sourcing Lab cohort and we will pressure-test it before you spend a euro on training compute. For ongoing room conversations on sourcing patterns where fine-tuning could earn its place, the AI Sourcing Lab is the right home.

Around the web (opinions and rabbit holes)

Third-party creators move fast. Treat these as starting points, not endorsements, and double-check anything before you wire candidate data into a training job.

YouTube

Reddit

Quora

Fine-tuning versus related customization options

ApproachWhen it winsWatch out
System instructionsStable rules, global tone, low costHard limits on length and specificity
Few-shot promptingTone shift in one thread, fast iterationEats LLM tokens, drifts when nobody owns the pack
RAGFacts that move (comp, org chart, JDs)Retrieval quality dominates everything else
Fine-tuningHigh-volume narrow task, stable shapeData labeling, retrains, GDPR scope on weights

Related on this site

Frequently asked questions

What does fine-tuning a domain model actually mean for recruiting?
It means taking a base language model and continuing to train it on a narrow recruiting corpus (your outreach, intake notes, screen summaries, JD library) so the model internalises your tone, taxonomy, and judgment without you re-explaining them every call. It sits at the heavy end of the customization stack, after system instructions, few-shot prompting, and RAG. Most TA teams never need it. Prompting plus retrieval reaches diminishing returns slowly, and fine-tuning means labeled datasets, eval suites, and retrains every time policy moves. Reserve it for high-volume narrow tasks (intake summarization at scale, internal mobility tagging, structured classification) where output shape is so repetitive that pinning the weights starts to earn its keep.
When does fine-tuning beat prompting or RAG for hiring teams?
When the same narrow task runs hundreds of times per week and prompts keep growing to control quality. If your outreach pack already uses eight exemplars plus a 600-token rubric, a fine-tuned smaller model can match output quality at lower per-call cost and lower latency. For one-off executive searches or anything where the criteria change quarterly, prompting wins because you can revise the rule in minutes. RAG beats fine-tuning when the facts move (org charts, comp bands, JD libraries), because retrieved snippets stay current without retraining. The honest rule: fine-tune for shape and tone, retrieve for facts, and prompt for one-off judgment calls. Run a twenty-case eval before committing budget so finance sees a number, not a vibe.
What GDPR and EU AI Act risks come with fine-tuning on candidate data?
High. Candidate names, resumes, screen notes, and outreach replies are personal data under GDPR, and employment AI sits in the high-risk category of the EU AI Act in hiring. Once data is baked into weights you cannot easily delete a single record on a Subject Access Request, which makes Article 17 erasure painful. Most legal teams require pseudonymized or synthetic training corpora, a written retention rule, and a record of which dataset version trained which model. Pair the fine-tune with the per-decision logs from explainable AI in hiring so reviewers can defend specific outputs, and treat the vendor agreement like a data processing addendum, not a low-stakes API call.
Which vendors offer fine-tuning today, and which do not?
OpenAI exposes supervised fine-tuning on the smaller GPT-4o-mini class models through the platform UI and API. Anthropic does not publish a general fine-tuning product for Claude; teams steer Claude with system instructions and RAG instead. Google Gemini offers tuning for some models through Vertex AI. Open-weight families (Llama, Mistral, Qwen) let you fine-tune on your own hardware or a managed cloud. Prices, data residency, audit logs, and rate limits all vary. Always ask the vendor where training jobs run, who can see the dataset, how long it is retained, and whether logs survive a security incident. Procurement should run the same checklist they use for any HR data processor, not a faster lightweight review.
How much data do you need before fine-tuning is worth trying?
Less than people guess for tone, more than people guess for judgment. A few hundred carefully labeled examples often shift style and structure reliably; thousands are needed before scoring or ranking work consistently outperforms a well-prompted base model. Data quality dominates volume: ten clean exemplars beat a thousand noisy ones for outbound voice. Before collecting anything, write the eval set first (twenty representative cases with the answer you want) so each training iteration can be measured. Without an eval set you will not know whether the second fine-tune is better, worse, or just different. Treat fine-tuning as an evaluation pipeline you maintain, not a one-shot training job, and budget the labeling time, not only the GPU time.
What goes wrong with domain fine-tunes in production?
Drift, stale checkpoints, leaked PII, and silent regressions. Your bar moves (new comp bands, new diversity language, a renamed role family), and a fine-tune frozen six months ago keeps producing last quarter's voice. If you trained on real candidate text without redaction, the model can echo personal details in unexpected outputs, which is hard to defend in an audit. Without an eval set, the team cannot tell whether last week's improvement on outreach quietly broke this week's intake summary. Schedule re-evaluation after every policy change, every vendor model upgrade, and at least once a quarter. Pair the fine-tuned model with the same human-in-the-loop review you use for prompted output so a person still owns the send.
How should recruiting teams decide before paying for a fine-tune?
Three checks. First, can you frame the task as repeatable shape and stable tone, not changing facts? If facts move, choose RAG. Second, do you have at least a hundred clean examples per output type and a written eval set? If not, spend a week labeling before training; it is the cheapest part of the project. Third, can legal sign off on the data, the model host, and the retention story? If the answer is unclear, run the first experiment on synthetic or fully anonymized data. For room-tested patterns, join a Live Build cohort or open the AI Sourcing Lab to see whether your team's repetition actually justifies the lift before the bill arrives.

← Back to AI glossary in practice