Stanford Studied 4 Million Job Applications. The AI Hiring Tools Failed on Race.
Let me tell you about a number that stopped me cold: 4%.
That’s the percentage of applicants in a massive new Stanford study who applied to ten different positions and were rejected from every single one of them.
Not rejected from a few. Not filtered out by a few employers. Rejected across the board. Ten applications, ten rejections, zero interviews.
You might think those applicants had bad resumes. Or applied to roles they weren’t qualified for. Or made some other mistake that explained the uniform rejection.
The Stanford researchers don’t think so. They think the explanation is something more disturbing: a shared AI hiring infrastructure that quietly, systematically screens out certain groups of people—and then compounds that screening across multiple employers who don’t even know they’re sharing a biased model.
The study, which analyzed 4 million job applications across 156 employers, found “clear racial disparities” in how AI hiring tools performed. Black applicants were disproportionately screened out in 10.62% of the positions studied. Asian applicants faced adverse impact in 5.32% of positions. And because 42 different hiring models were shared across employers, a rejection triggered by one company’s AI could effectively blacklist an applicant at another company using the same underlying tool.
The researchers are careful to note that their data covers 2018 to 2022, before the current wave of LLM-powered hiring tools. Today’s AI works differently. But the study is not a historical footnote. It’s a warning about what happens when you build shared infrastructure for high-stakes decisions without understanding how bias propagates through that infrastructure.
How the Study Worked
The Stanford team did something unusual. They didn’t run their own experiments or build their own models. They got access to real-world data from Pymetrics, a company that provided AI-powered hiring assessments to dozens of employers between 2018 and 2022.
Pymetrics used gamified assessments—pattern-matching games, risk-reward tasks, memory tests—to generate personality and cognitive profiles of applicants. Employers would set thresholds. Applicants who cleared the threshold moved forward. Those who didn’t were screened out, often without a human ever seeing their resume.
The researchers analyzed data across 156 employers, looking for adverse impact by race. Adverse impact is a legal and statistical concept: it means that a selection tool that looks neutral on its face screens out a protected group at a substantially higher rate than other groups, whether or not anyone intended that result.
What they found was not a uniform bias. It was a pattern. Across the full dataset, 10.62% of positions showed adverse impact against Black applicants. That’s more than one in ten job postings. For Asian applicants, the number was 5.32%—smaller, but still substantial.
Those percentages might sound small. They are not. When more than one in ten positions shows adverse impact against Black applicants, in a dataset of 4 million applications, thousands of qualified applicants are being filtered out for reasons that have nothing to do with their ability to do the job.
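To make "adverse impact" concrete: a common first-pass check is the four-fifths rule from the EEOC's Uniform Guidelines, which flags a group whose selection rate falls below 80% of the highest group's rate. The article's source does not say which statistical test the Stanford team used, so treat this as a minimal sketch of the general idea, not their method. The group names and counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class GroupOutcome:
    group: str
    applicants: int
    passed: int

    @property
    def selection_rate(self) -> float:
        # Share of this group's applicants who cleared the screen.
        return self.passed / self.applicants


def impact_ratios(outcomes):
    """Each group's selection rate divided by the highest group's rate."""
    best = max(o.selection_rate for o in outcomes)
    return {o.group: o.selection_rate / best for o in outcomes}


def flags_adverse_impact(outcomes, threshold=0.8):
    """True for any group whose impact ratio falls below the threshold."""
    return {g: r < threshold for g, r in impact_ratios(outcomes).items()}


# Hypothetical counts for a single position (not from the study).
position = [
    GroupOutcome("group_a", applicants=400, passed=200),  # 50% pass
    GroupOutcome("group_b", applicants=200, passed=70),   # 35% pass
]

ratios = impact_ratios(position)
flags = flags_adverse_impact(position)
print(ratios)
print(flags)
```

Here group_b passes at 35% against group_a's 50%, an impact ratio of 0.7, which falls under the 0.8 line. A real audit would also test for statistical significance, since small applicant pools can trip the rule by chance.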
The Shared Model Problem
The more disturbing finding is about shared infrastructure.
The study found that 42 different hiring models were used across multiple employers. That means a tool trained on one company’s data, tuned for one company’s preferences, was being deployed at other companies—often without those companies knowing the full history of the model or the biases embedded in it.
Here’s the nightmare scenario. Applicant A applies to Company 1. The AI model screens them out. Then Applicant A applies to Company 2. Company 2 runs the same underlying model, or one trained on the same data. The bias transfers. Applicant A is screened out again. By the time they’ve applied to ten companies, the pattern is locked in. Not because any individual employer made a biased decision. Because the shared infrastructure learned and propagated bias without anyone noticing.
The numbers bear this out. The researchers found that 4% of applicants who applied to ten positions were rejected from all of them. That rate is higher than what you would expect if employers were making independent, uncorrelated decisions. The excess rejections are the signature of shared bias.
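The arithmetic behind "higher than you would expect" is worth spelling out. If each of ten employers rejected independently with probability p, the chance of ten straight rejections would be p to the tenth power. The sketch below compares that baseline with a simple mixture in which a small share of applicants is rejected everywhere by a shared model. Every number in it is hypothetical; the article does not give the study's per-application rejection rate or describe how the researchers computed their baseline.

```python
def all_rejected_independent(p_reject: float, n: int = 10) -> float:
    """Chance of n straight rejections if every decision is independent."""
    return p_reject ** n


def all_rejected_shared(p_reject: float, share_flagged: float, n: int = 10) -> float:
    """Chance of n straight rejections when a fraction of applicants is
    rejected everywhere by a shared model.

    The remaining applicants are rejected independently, at a rate chosen so
    the overall per-application rejection rate still equals p_reject.
    Requires share_flagged < p_reject.
    """
    p_rest = (p_reject - share_flagged) / (1 - share_flagged)
    return share_flagged + (1 - share_flagged) * p_rest ** n


# Hypothetical inputs: 60% of applications rejected overall,
# 3% of applicants screened out by a model every employer shares.
independent = all_rejected_independent(0.6)   # 0.6 ** 10, about 0.6%
shared = all_rejected_shared(0.6, 0.03)

print(f"independent: {independent:.4f}")
print(f"shared:      {shared:.4f}")
```

Both scenarios have the same overall rejection rate, so no single employer's numbers look unusual. Only the across-the-board rejections give the shared model away, which is why they work as a signature.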
Think about what that means for an individual job seeker. You apply. You get rejected. You assume you weren’t qualified. You apply somewhere else. Rejected again. You start to doubt yourself. You change your resume. You change your cover letter. You change your approach. Nothing works. And the whole time, the problem isn’t you. It’s the models.
The Caveat: This Is Old(er) AI
The researchers are careful to note that their findings may not generalize to all AI hiring tools—especially the new wave of LLM-powered systems that work very differently from Pymetrics’ gamified assessments.
Pymetrics was a “personality and cognitive fit” tool. Modern AI hiring tools are often resume parsers, cover letter analyzers, or conversational agents. They’re trained on different data, optimized for different outcomes, and deployed in different ways.
It is entirely possible that LLM-based tools are less biased than the older generation. It is also possible that they are more biased in different ways. We don’t know yet. The research hasn’t caught up to the deployment.
But here’s the thing about the Pymetrics era. Those tools were considered state-of-the-art at the time. They were sold as objective, data-driven, bias-reducing alternatives to human resume screening. And they still produced clear racial disparities.
The lesson is not that a particular vendor was bad. The lesson is that AI hiring tools, no matter how well-intentioned, can bake in bias in ways that are hard to detect and even harder to fix—especially when they’re shared across employers.
Why Shared Infrastructure Magnifies Harm
The study’s most important contribution might be making visible a problem that has been hiding in plain sight: shared AI infrastructure changes the math of discrimination.
In the old world, each employer made independent decisions. If one employer’s human recruiter was biased, that was a local problem. You could apply elsewhere and get a fair shake.
In the new world, a biased model can be used by dozens of employers. A single flaw in the training data, a single skewed optimization target, a single unchecked proxy for race—and suddenly hundreds of thousands of applicants are being screened out across the entire job market, not just at one company.
The researchers found 42 shared models in their dataset. That’s 42 points of systemic risk. If any of those models had a bias problem, every employer using it had a bias problem. And the applicants affected might never know why they kept getting rejected.
This is not hypothetical. We have seen similar dynamics in credit scoring, in predictive policing, in healthcare algorithms. A tool built for one purpose gets reused in another context. The bias that was tolerable in the original setting becomes catastrophic in the new one. And by the time anyone notices, the damage is done.
What the Vendors Say (and Don’t Say)
Pymetrics, the vendor whose data was used in the study, was acquired by HireVue in 2020. HireVue has since moved away from the kind of gamified assessments analyzed in the paper, citing evolving customer preferences and a shift toward more explainable AI.
In a statement responding to the study, a HireVue representative said the company takes fairness seriously and that its current tools include bias audits and disparate impact testing. They also noted that the study data is from 2018–2022, and that the company has made significant changes since then.
That’s the standard response. It’s not wrong. But it’s also not reassuring. Because the problem the Stanford study identified is not unique to Pymetrics. It’s a problem with the entire model of shared AI infrastructure for high-stakes decisions.
Until there is independent, mandatory, ongoing auditing of these tools—not voluntary self-reports, not one-time fairness checks—the risk of systemic bias will remain. And the applicants who bear the cost will never know.
The Legal Landscape
The legal framework for AI hiring tools is a mess.
The Equal Employment Opportunity Commission (EEOC) has issued guidance, not binding regulation. Some jurisdictions, including New York City, Illinois, and Maryland, have passed laws requiring bias audits or disclosure of AI hiring tools. But enforcement is spotty, and most employers are not covered.
The legal theory is clear: if an AI tool produces disparate impact, the employer can be liable under Title VII of the Civil Rights Act regardless of intent, unless it can show the tool is job-related and consistent with business necessity. But in practice, proving disparate impact requires access to data that applicants rarely have. You can’t sue for a bias you can’t see.
The Stanford study got access to Pymetrics’ data because the researchers were academics with a formal agreement. A job applicant has no such access. They just see the rejection emails. They don’t see the model scores. They don’t see the cutoff thresholds. They don’t see the training data.
This is the asymmetry at the heart of the problem. The employers and vendors have all the information. The applicants have none. And without information, there is no accountability.
What Employers Should Do (But Often Don’t)
The study includes a set of recommendations for employers. They’re not radical. They’re just good practice.
Audit your tools. Before you deploy an AI hiring tool, test it for adverse impact. Not once. Regularly. And not just on aggregate data—break it down by role, by department, by region. Bias can hide in averages.
Don’t assume shared models are safe. Just because another company uses the same tool doesn’t mean it’s fair. Their applicant pool is different. Their job requirements are different. Their bias might be different. Validate for your own context.
Keep humans in the loop. No AI hiring tool should make final decisions without human review. The technology is not ready. It may never be ready. At the screening stage, the cost of a false positive (advancing someone who turns out to be unqualified) is low, because later rounds can catch it. The cost of a false negative (rejecting someone qualified) can be a lawsuit, a PR crisis, and a life derailed.
Be transparent. Tell applicants you’re using AI. Tell them what data is being collected. Give them a way to appeal. This is not just ethical. It’s risk management. The more opaque your process, the harder it is to defend.
Most employers do none of these things. They buy a tool from a vendor, they trust the vendor’s marketing, and they move on. The Stanford study is a reminder that trust is not a strategy.
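The "bias can hide in averages" point from the first recommendation is easy to demonstrate. The sketch below applies a four-fifths-style ratio (lowest group pass rate divided by highest) to a made-up audit table, first pooled across roles and then role by role. The roles, groups, and counts are hypothetical, and the 0.8 cutoff is the conventional rule of thumb, not a threshold taken from the study.

```python
from collections import defaultdict

# Hypothetical audit records: (role, group, applicants, passed).
records = [
    ("role_x", "group_a", 100, 50),
    ("role_x", "group_b", 100, 30),
    ("role_y", "group_a", 100, 50),
    ("role_y", "group_b", 100, 60),
]


def min_impact_ratio(counts):
    """counts maps group -> (applicants, passed).
    Returns the lowest pass rate divided by the highest pass rate."""
    rates = [passed / applicants for applicants, passed in counts.values()]
    return min(rates) / max(rates)


def pooled(rows):
    """Sum applicants and passes per group, ignoring role."""
    totals = defaultdict(lambda: [0, 0])
    for _role, group, applicants, passed in rows:
        totals[group][0] += applicants
        totals[group][1] += passed
    return {g: tuple(v) for g, v in totals.items()}


def by_role(rows):
    """Keep per-group counts separate for each role."""
    roles = defaultdict(dict)
    for role, group, applicants, passed in rows:
        roles[role][group] = (applicants, passed)
    return dict(roles)


overall_ratio = min_impact_ratio(pooled(records))
role_ratios = {role: min_impact_ratio(c) for role, c in by_role(records).items()}
flagged_roles = [role for role, ratio in role_ratios.items() if ratio < 0.8]

print(f"overall: {overall_ratio:.2f}")
print({role: round(ratio, 2) for role, ratio in role_ratios.items()})
print("flagged:", flagged_roles)
```

Pooled, the tool looks fine: the ratio is 0.9. Split by role, role_x comes in at 0.6 and gets flagged. An audit that only looks at the company-wide number would miss it.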
The LLM Question
The study ends with an open question: how do modern LLM-based hiring tools compare?
We don’t know yet. The research is early. But early results are mixed. Some studies show that LLMs can reduce certain kinds of bias by ignoring demographic signals that human screeners might unconsciously favor. Other studies show that LLMs amplify bias by learning stereotypes from training data.
The honest answer is that it depends entirely on how the tools are built, trained, and deployed. A well-audited LLM used as a screener before human review might be less biased than the status quo. A poorly designed LLM used as an automated rejecter could be a disaster.
The Stanford study’s findings from the 2018–2022 era are not a verdict on today’s AI. They are a preview of the kind of problems we need to be looking for. And they are a warning that shared infrastructure requires shared accountability.
The Human Cost
Behind the numbers—4 million applications, 156 employers, 10.62% of positions—are real people.
People who spent hours on applications. People who took assessments they didn’t understand. People who got rejection after rejection and started to believe they weren’t good enough.
The 4% of applicants who were rejected from every job they applied to—those are not statistics. Those are humans. They have rent to pay. They have families to support. They have skills that some employer, somewhere, probably needs.
And they were failed. Not by a malicious algorithm. Not by a racist programmer. But by a system—a system of vendors and employers and shared models and unchecked assumptions—that produced biased outcomes without anyone intending or noticing.
That’s the real lesson of the Stanford study. Bias doesn’t require bad actors. It requires bad systems. And we have been building bad systems at scale.
The Takeaway
The Stanford study is not a reason to abandon AI hiring tools. It is a reason to take them seriously—more seriously than we have.
If you’re an employer, audit your tools. If you’re a vendor, open your models to independent review. If you’re a regulator, write rules that keep pace with technology. If you’re an applicant, know that the rejections might not be your fault.
And if you’re just someone who cares about fairness, remember this number: 4%.
That’s the percentage of applicants who applied to ten jobs and got rejected from every one. If employers were making independent decisions, that number would be lower. In our system, it isn’t. And until we fix the AI hiring infrastructure, it won’t be.