Vervoe

How to Evaluate AI Hiring Vendors: A Buyer's Guide to Transparent AI

The AI hiring market is growing fast and the claims are getting louder. Before you commit to any tool that uses artificial intelligence to grade, rank, or screen candidates, there are questions every buyer should be asking — and clear answers they should be demanding.

AI is now embedded in hiring tools across every category — applicant tracking, skills assessments, video interviews, resume screening, and reference checking. Each vendor promises faster decisions, less bias, and better hires. What most of them do not promise — and what most buyers fail to ask for — is transparency.

This guide is written for HR leaders, talent acquisition directors, and procurement teams evaluating AI-powered hiring software. It covers what explainability actually means, why a score alone is never enough, how training data shapes every decision the AI makes, and the legal exposure your organisation accepts when it relies on a system it cannot explain.

The questions at the end of this guide are designed to be taken directly into vendor conversations. Any vendor unwilling or unable to answer them clearly is telling you something important.

The promise and the problem with AI hiring tools

The promise is compelling. AI hiring tools can process thousands of candidate responses in the time it takes a recruiter to review ten. They can apply consistent scoring criteria at scale, reduce the variance introduced by different reviewers, and surface candidates who might be missed by keyword-matched résumé screening.

The problem is that most AI tools are designed to produce outputs, not explanations. They tell you a candidate scored 73 out of 100. They might even categorise that as "strong" or "needs development." But they rarely tell you what evidence produced that score, whose definition of good performance the model was trained on, or why a candidate who scored 58 was deemed unsuitable for your specific role.

That gap between output and explanation is where legal exposure lives, where bias hides, and where employer confidence in the tool erodes over time.

Explainability vs. transparency: they are not the same

These two terms are often used interchangeably in vendor materials. They are meaningfully different.

Transparency means you can see how a system is built. A vendor might publish a technical whitepaper, share their model architecture, or provide a general description of how their algorithm works. This is a starting point, but it is not sufficient on its own.

Explainability means you can account for a specific decision about a specific candidate. Not "our model considers communication skills, problem-solving, and attention to detail," but rather "this candidate received this score on this question because their response demonstrated these specific attributes, and here is the evidence."

Transparency describes the system. Explainability describes the decision. Both matter, but when a candidate asks why they were not selected, or a regulator asks you to justify a screening outcome, it is explainability you will need.

The score is not enough

Many AI hiring tools offer what they describe as "score explanations." They will tell you that a candidate scored highly on communication, or that they "demonstrated strong analytical reasoning." This kind of labelling feels informative but is rarely sufficient.

Consider the analogy of a student receiving an exam grade. Telling the student they scored 62% and that this was "below average" communicates an outcome. It does not communicate what they got wrong, which answers were marked down, or what a correct answer would have looked like. A student cannot improve from a number. A teacher cannot verify their marking was consistent. And an institution cannot defend that grade if the student challenges it.

The same logic applies to candidate scoring. A score with a descriptive label is an assertion. A score with the underlying evidence — the specific response, the criteria applied, and why that response was rated the way it was — is a justification. You need justifications, not assertions.

What to demand from any AI grading tool

  • The exact candidate response that was scored
  • The specific criteria applied to that response
  • Why the response met or did not meet those criteria
  • What a higher-scoring response would look like
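
One way to make this checklist concrete when reviewing a vendor's export is to treat each scored answer as a record with four required fields and check which ones are empty. The sketch below is illustrative only: the `ScoreJustification` class, its field names, and the `missing_evidence` helper are hypothetical and do not reflect any vendor's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass
class ScoreJustification:
    """Hypothetical record for one scored answer (field names are assumptions)."""
    candidate_response: str       # the exact text that was scored
    criteria: list[str]           # the specific criteria applied
    rationale: str                # why the response met or missed the criteria
    higher_scoring_example: str   # what a stronger response would look like


def missing_evidence(record: ScoreJustification) -> list[str]:
    """Return the names of any checklist items the record leaves empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# A record that carries a response but no criteria, rationale, or example.
label_only = ScoreJustification(
    candidate_response="I would call the customer back within the hour.",
    criteria=[],
    rationale="",
    higher_scoring_example="",
)
print(missing_evidence(label_only))
# → ['criteria', 'rationale', 'higher_scoring_example']
```

A record that fails this kind of check is an assertion rather than a justification, whatever label accompanies the score.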

The dataset problem: whose idea of "good" is the AI using?

Every AI model is trained on data. That data encodes a definition of what a "good" or "poor" candidate response looks like. The most important question you can ask any AI hiring vendor is: where did that definition come from?

If the model was trained on generic, third-party data — aggregated responses from across many different organisations, roles, and industries — then the AI's notion of good performance may have nothing to do with what good performance means in your organisation. A financial services firm and an early-stage tech startup may value very different things in a customer success hire. A generic model trained on aggregate data cannot know which standard applies to your role.

Worse, if the training data itself contained historical bias — skewing toward responses from candidates who were hired but later performed poorly, or from groups that were historically over-represented in your pipeline — the AI will replicate and amplify that bias at scale.

You must be able to see what data the AI was trained on, or at minimum understand clearly whether that data comes from your organisation's own hiring history or from an undisclosed third-party pool.

Active learning: why your data must drive the AI

The gold standard for AI hiring tools is a model that learns from your organisation's own standards — not someone else's. This is the principle behind active learning: the AI is continuously updated based on feedback from your team, so that its definition of a strong response reflects your specific requirements for each role.

In practice, this means that when a hiring manager or subject matter expert reviews a candidate response and marks it as strong or weak, that feedback is fed back into the model. Over time, the AI becomes calibrated to what your organisation genuinely values — not to a generic industry average.

The key consideration for buyers is whether this process is genuinely low-touch for your team. Active learning should not require your hiring managers to grade hundreds of responses manually before the AI becomes useful. A well-implemented system surfaces a small number of responses for review and uses that feedback to update the model intelligently, without creating a significant burden on your team.
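
To see why a small review set can be enough, consider uncertainty sampling, one common active-learning strategy: the system asks humans to label only the responses the model is least sure about. The sketch below is a simplified illustration, not a description of any vendor's implementation. The function name, the 0 to 1 score scale, the `boundary` value, and the sample data are all assumptions.

```python
def select_for_review(scores: dict[str, float], budget: int = 3,
                      boundary: float = 0.5) -> list[str]:
    """Pick the `budget` responses whose model score is closest to the boundary.

    `scores` maps a response id to a model score in [0, 1]. Scores near
    `boundary` are the ones the model is least sure about, so a human
    label on them is assumed to be the most informative.
    """
    ranked = sorted(scores, key=lambda rid: abs(scores[rid] - boundary))
    return ranked[:budget]


# Made-up model scores for five candidate responses.
model_scores = {"r1": 0.95, "r2": 0.52, "r3": 0.10, "r4": 0.45, "r5": 0.70}
print(select_for_review(model_scores, budget=2))
# → ['r2', 'r4']
```

Here the hiring manager reviews two borderline responses rather than all five. The clear-cut cases (`r1`, `r3`) are left to the model, which is what keeps the process low-touch.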

If a vendor cannot clearly explain how their model is updated based on your organisation's feedback, or if the answer is that it is not — that the model is static and pre-trained — then you are not in control of what "good" means in your hiring process. The AI is.

The control question

If you are not actively teaching the AI what good performance looks like for your roles, then the AI is making those decisions independently — using standards you did not set, cannot inspect, and may not be able to justify.

If you cannot see the data, it is not explainable

Explainability is not just a feature. It is a prerequisite for responsible use of AI in hiring. And real explainability requires access to the data that drove the decision.

When an AI grades a candidate response, the explanation for that grade is only meaningful if you can trace it back to what the model was trained to look for. If the training data is proprietary, hidden, or described only in vague terms in a whitepaper, then you cannot truly verify whether the AI's grading criteria are appropriate for your role. You are taking the vendor's word for it.

This matters in several practical situations. If a candidate challenges their assessment outcome, you need to be able to show the specific evidence that informed their score. If an internal audit flags a pattern of disparate impact across demographic groups, you need to be able to examine the data to understand why. If a regulator requests documentation of your selection process, you need a paper trail that links scores to observable evidence — not to an opaque model.

The ability to show the data is not optional for organisations that take their compliance obligations seriously. Ask every vendor to show you, concretely, what an auditor would see if they examined a specific candidate's scoring record.

The regulatory landscape

Regulation of AI in hiring is accelerating globally. Organisations evaluating vendors now should be buying ahead of this curve, not scrambling to catch up.

New York City Local Law 144

NYC Local Law 144 requires employers and employment agencies using automated employment decision tools (AEDTs) to have an independent bias audit conducted annually, publish a summary of the results, and notify candidates that such a tool is being used. It applies where those tools are used to assess candidates for jobs or promotions in New York City. The law is among the most prescriptive of its kind and is widely viewed as a model for broader regulation.

EU AI Act

The European Union's AI Act classifies AI tools used in employment as "high-risk" systems, subject to strict requirements around transparency, data governance, human oversight, and documentation. Providers of high-risk systems must maintain detailed technical documentation, and organisations deploying them must be able to give affected individuals a meaningful explanation of the role the AI played in a decision.

EEOC guidance on AI and disparate impact

The US Equal Employment Opportunity Commission has issued guidance making clear that employers remain responsible for the discriminatory impact of AI tools they use, even if those tools were developed by a third party. Delegating selection decisions to an AI vendor does not transfer the employer's legal obligations: the vendor takes over the decision, but the liability stays with the employer.
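
A common first screen for disparate impact compares each group's selection rate with that of the most-selected group. The sketch below computes these impact ratios and flags any group below 0.8, the "four-fifths" rule of thumb from the US Uniform Guidelines on Employee Selection Procedures. The group names and counts are made up, the function name is hypothetical, and a ratio below the threshold is a prompt for investigation rather than a legal finding.

```python
def impact_ratios(selected: dict[str, int],
                  applicants: dict[str, int]) -> dict[str, float]:
    """Selection rate of each group divided by the highest group's rate."""
    rates = {g: selected[g] / applicants[g] for g in applicants}
    top = max(rates.values())
    if top == 0:
        raise ValueError("no candidates were selected in any group")
    return {g: rate / top for g, rate in rates.items()}


# Made-up counts for two hypothetical groups.
applicants = {"group_a": 200, "group_b": 150}
selected = {"group_a": 60, "group_b": 30}

ratios = impact_ratios(selected, applicants)
# Flag groups below the four-fifths rule of thumb (threshold is illustrative).
flagged = [g for g, r in ratios.items() if r < 0.8]
print(flagged)
# → ['group_b']
```

In this example group_a is selected at 30% and group_b at 20%, giving group_b an impact ratio of about 0.67. An employer who cannot obtain these counts from a vendor cannot run even this basic check.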

GDPR and data subject rights

Under GDPR, individuals have rights in relation to automated decision-making, including the right not to be subject to decisions based solely on automated processing that significantly affect them, and the right to meaningful information about the logic involved. For organisations operating in the EU or processing data of EU residents, this means being able to explain AI-assisted hiring decisions to candidates on request.

10 questions to ask every AI hiring vendor

Take these into every vendor demo and evaluation call. Strong vendors will answer them clearly and with supporting documentation. Evasive or vague answers are significant red flags.

1. What is the source of your training data?

You need to know whether the AI was trained on generic third-party data, on your organisation's data, or on a combination. Generic training data means the AI's standards are not your standards.

2. Can I see the training data that informs scoring for my specific assessments?

Not just a description — the actual data. If a vendor cannot show you the data, or says the data is proprietary and cannot be shared, you cannot verify that their grading criteria are appropriate for your roles.

3. How does your model update based on feedback from my organisation?

You need active learning — the AI should improve based on your team's feedback over time. If the model is static and pre-trained, it is using standards that were set without your input and will not adapt to your specific needs.

4. How much time does active learning require from my team?

Active learning should be low-touch. If keeping the AI calibrated to your standards requires significant manual effort from hiring managers, the workflow will break down in practice.

5. For a specific candidate score, can you show me the exact evidence that produced it?

Ask to see a worked example. The vendor should be able to show you the candidate's response, the criteria applied, and why the response was scored the way it was — not just a label or a general description.

6. What would an auditor see if they requested the scoring record for a rejected candidate?

This is the compliance test. The answer should be: a full record of the candidate's responses, the criteria used to grade each one, and the basis for the overall score. Anything less is inadequate for audit purposes.

7. Have you conducted a bias audit of your AI tool? Can I see the results?

Reputable vendors conduct regular audits for adverse impact across demographic groups and share those results. If a vendor has not conducted a bias audit or will not share the findings, that is a serious concern.

8. What is your process for detecting and correcting disparate impact?

Not just how they audit for it, but what they do when they find it. The answer should include a clear process for investigating causes and updating the model.

9. What documentation do you provide to support compliance with the EU AI Act, NYC Local Law 144, or EEOC guidelines?

Vendors serious about compliance will have documentation ready. Be specific about the regulations that apply to your organisation's footprint.

10. What human oversight does your platform support, and how can reviewers override AI-generated scores?

AI should support human decision-making, not replace it. A strong platform makes it easy for reviewers to see the AI's reasoning, disagree with it, and override it where necessary.

What good looks like

A genuinely transparent AI hiring tool is one where every grading decision can be traced back to observable evidence, where the training data came from your organisation and continues to be shaped by your team's feedback, and where an auditor, a regulator, or a rejected candidate can be shown exactly why a score was what it was.

This is not a futuristic standard. It is a reasonable baseline that responsible vendors should already meet. The difference between tools that meet it and tools that do not is the difference between AI that genuinely supports better hiring decisions and AI that creates the appearance of rigour without the accountability to back it up.

How Vervoe approaches this

Vervoe's AI grading is built around active learning from your organisation's own data. When your team reviews a candidate response, that feedback trains the model to understand what good performance looks like for your specific role — not a generic industry average. The training data is yours, the grading criteria are visible, and every score is backed by the candidate's actual response.

The result is a grading system you can explain to a candidate, defend in an audit, and trust to reflect your own standards — not someone else's.