Lately, Tatiana Becker has been flooded with work. As the founder of the boutique US firm NIAH Recruiting, she spends her days sifting through hundreds of resumes, hoping to fill dozens of roles at companies that have hired her to find matching employees.
Companies tend to hire the most at the start of the year, mainly because of hiring budgets that have been set and go into effect in the first quarter. “Everybody came back to work, and it’s been kind of insane,” Becker said in a recent interview. In her professional groups and in forums for human resources and recruiting, everyone is buzzing about the same thing: using new artificial intelligence tools to ease the workload.
In the race to embrace artificial intelligence, some businesses are using a new crop of generative AI products that can help screen and rank candidates for jobs — and some think these tools can even evaluate candidates more fairly than humans. But a Bloomberg analysis found that the best-known generative AI tool systematically produces biases that disadvantage groups based on their names.
OpenAI, which makes ChatGPT, the AI-powered chatbot that can churn out passable song lyrics and school essays, also sells the AI technology behind it to businesses that want to use it for specific tasks, including in HR and recruiting. (The company says it prohibits GPT from being used to make an automated hiring decision.) Becker, who has tested some of these AI-powered hiring tools, said that she’s skeptical of their accuracy. OpenAI’s underlying AI model, which is developed using a vast number of articles, books, online comments and social media posts, can also mirror and amplify the biases in that data.
In order to understand the implications of companies using generative AI tools to assist with hiring, Bloomberg News spoke to 33 AI researchers, recruiters, computer scientists and employment lawyers. Bloomberg also carried out an experiment inspired by landmark studies that used fictitious names and resumes to measure algorithmic bias and hiring discrimination. Borrowing methods from these studies, reporters used voter and census data to derive names that are demographically distinct — meaning they are associated with Americans of a particular race or ethnicity at least 90% of the time — and randomly assigned them to equally-qualified resumes.
When asked to rank those resumes 1,000 times, GPT 3.5 — the most broadly-used version of the model — favored names from some demographics more often than others, to an extent that would fail benchmarks used to assess job discrimination against protected groups. While this test is a simplified version of a typical HR workflow, it isolated names as a source of bias in GPT that could affect hiring decisions. The interviews and experiment show that using generative AI for recruiting and hiring poses a serious risk for automated discrimination at scale.
This experiment was repeated for four job postings — HR business partner, senior software engineer, retail manager and financial analyst — and found that resumes labeled with names distinct to Black Americans were the least likely to be ranked as the top candidates for financial analyst and software engineer roles. Those with names distinct to Black women were top-ranked for a software engineering role only 11% of the time by GPT — 36% less frequently than the best-performing group.
The analysis also found that GPT’s gender and racial preferences differed depending on the particular job that a candidate was evaluated for. GPT does not consistently disfavor any one group, but will pick winners and losers depending on the context. For example, GPT seldom ranked names associated with men as the top candidate for HR and retail positions, two professions historically dominated by women. GPT was nearly twice as likely to rank names distinct to Hispanic women as the top candidate for an HR role compared to each set of resumes with names distinct to men. Bloomberg also found clear preferences when running tests with the less-widely used GPT-4 — OpenAI’s newer model that the company has promoted as less biased.
GPT regularly failed adverse impact benchmarks for several groups across the tests. Bloomberg found at least one adversely impacted group for every job listing, except for retail workers ranked by GPT-4.
In response to a detailed list of questions from Bloomberg, OpenAI said the results of using GPT models out-of-the-box may not be reflective of how its customers use the models. Businesses using its technology often take steps to mitigate bias further, including by fine-tuning the software’s responses, managing system messages and more, the company said.
OpenAI added that businesses may choose to strip out names before feeding resumes into a GPT model. The company said it has published blog posts and system cards — which are like AI instruction manuals — describing its models, including their capabilities and limitations. OpenAI also regularly conducts adversarial testing and red-teaming on its models in order to probe how bad actors could use them for harm, it said.
While the technology is still new, there’s tremendous enthusiasm for using generative AI to vet candidates. In the year since ChatGPT launched, dozens of HR trade blogs have talked up the potential of using it to automate certain HR tasks, including analyzing resumes and assessing applicants’ skills.
There is now a growing cottage industry of services using AI chatbots to interview and screen potential candidates. LinkedIn, arguably the most popular platform for job seekers and professionals, recently integrated generative AI into many of its flagship HR tools, and just introduced a new AI chatbot that recruiters can prompt in order to “find that short list of qualified candidates — faster,” the company said in a blog post. Last April, HR tech company SeekOut launched a new recruiting tool called SeekOut Assist that takes a job description from a listing, runs it through GPT, then reveals a ranked list of candidates for the position.
Sam Shaddox, the general counsel at SeekOut, said the tool’s results are drawn from over a billion profiles indexed from vocational, publicly-available data sources, like LinkedIn and Github. Hundreds of companies have already used other SeekOut services, he said, including tech firms and Fortune 10 companies.
The interest in generative AI continues a longstanding corporate demand for automation in HR. For years, many large companies have relied on automated systems to make the hiring process more efficient, even as this practice has raised alarms about the potential for systematic discrimination. Some 64% of professionals surveyed in 2022 by the Society for Human Resource Management, an HR trade group, said their organization uses AI or other forms of automation to filter out unqualified applicants.
Because human resource departments don’t generate profits for companies, they’re always under pressure to spend less money, said Matthew Scherer, senior policy counsel for workers’ rights at the Center for Democracy and Technology. Companies have an incentive to “automate parts of that process and reduce the number of human hours that are devoted to it.”
The reliance on automated systems has long risked complicating the active efforts of many companies to diversify their workforces. But the stakes of using this technology could be even higher now in the wake of the US Supreme Court’s decision last June to ban race-conscious college admissions, which prompted some businesses to rethink their corporate diversity efforts.
Shaddox said he was aware of the criticism that AI systems could amplify existing biases, but said he thought the strength of a large language model like GPT was precisely that the AI ingests a large volume of data in its training.
“From my perspective, to say, ‘Hey, there’s all this bias out there, but we’re just going to ignore it’ is not the right answer,” Shaddox said. “The best solution for it is GPT — large language learning model technology that can identify some of those biases, because then you can actually work to overcome it.”
To overcome bias, Shaddox suggested that people ask the AI to increase its objectivity and look at “objective criteria,” something that SeekOut tries to do. “SeekOut, at its core, is about removing bias and increasing objectivity,” he said.
There is a widespread misconception that AI tools are less biased than humans, because they’re working off a larger set of data, said Abeba Birhane, a senior advisor for AI accountability at the Mozilla Foundation. The problem is that this assumption rarely gets tested — nor do the models get closely scrutinized and tested, said Birhane. A growing body of evidence continues to show that “these systems stereotype,” she said.
Emily Bender, a professor of computational linguistics at the University of Washington, also cautioned against the idea that computer programs, by their nature, suggest better and more “objective” results. People are predisposed to believe machines are unbiased in their decision-making, especially compared to humans, a phenomenon called automation bias, she explained. If such systems “result in a pattern of discriminatory hiring decisions, it’s easy to imagine companies using them saying, ‘Well, we didn’t have any bias here,’” Bender said. “‘We just did what the computer told us to do.’”
Finding evidence of discrimination in the workplace is notoriously difficult. The legal onus of proving an employer made a biased hiring decision falls on individuals, who don’t have the ability to know how an employer behaved towards other applicants. Candidates also don’t have the means to audit an employer’s hiring processes themselves, and hiring managers using AI tools to find diverse candidates aren’t likely to catch patterns of discrimination in the candidate results they’re presented with every time they do a search.
Testing GPT for bias is similarly challenging — asking ChatGPT, “Do you discriminate?” is not sufficient. AI systems such as GPT are black boxes, even to those who build them. OpenAI has not revealed what specific data sources the tool is trained on. And subtle changes to the prompts users enter can alter the results. Bloomberg wanted to isolate whether GPT treats names differently for protected classes in a hiring context.
Using a method similar to a 2022 study by the National Bureau of Economic Research (NBER) on hiring discrimination, Bloomberg computed the 100 most popular first names and 20 most distinct last names unique to each demographic from North Carolina voter registrations and the US decennial census, respectively, which are freely available to the public and contain the necessary demographic information alongside names. From there, the first and last names were randomly paired up for each group, resulting in 800 demographically-distinct names.
Using such distinct names is the preferred practice of academics over using real people’s names, which aren’t always indicative of their race and ethnicity. Evan Rose, one of the authors of the NBER study, said researchers studying discrimination and bias frequently use racially distinctive names to signal race or ethnicity, and that the method has for years found consistent evidence of racial discrimination in the US labor and housing markets. “Names are really useful as a signal,” Rose said.
Bloomberg’s methods were inspired by other previous work on hiring discrimination. In a landmark 2003 study, economists Marianne Bertrand and Sendhil Mullainathan found callback rates for real jobs submitted with fictitious resumes were higher for distinctly White names than distinctly Black names. Similar resume experiments were published in 2011 and 2022, and pointed to employers discriminating by name. Bloomberg’s testing builds off these studies, as well Latanya Sweeney’s 2013 algorithm audit that found Google ads disproportionately associated Black names with arrest records.
To see our full findings and underlying data, please refer to our methodology and its accompanying GitHub repository.
Though OpenAI has not specifically marketed its tools to be used for hiring purposes, the general hype around the technology — which OpenAI itself has played a role in fostering — has drawn interest from all corners of the business world. In an earlier version of its usage policies, OpenAI said it prohibits its model from being used for activity that, among other things, “has high risk of economic harm, including automated determinations of eligibility for employment.”
After Bloomberg first got in touch with OpenAI in December to ask about the software’s use in the hiring industry, the company updated those policies to say it prohibits use that makes “high-stakes automated decisions in domains that affect an individual’s safety, rights or well-being, e.g., employment.” But, responding to specific questions from Bloomberg, the company clarified that it doesn’t forbid people from utilizing GPT to produce “useful information or analysis” that assists with hiring. “By ‘automated,’ we mean decisions made entirely by AI, without any human involvement,” an OpenAI spokesperson said in an email. OpenAI said it also requires companies to comply with the law.
Read More: Humans Are Biased. Generative AI Is Even Worse
Scherer, the Center for Democracy and Technology attorney, pointed out that there is no scenario in hiring for a job where humans aren’t involved — but this doesn’t mean recruiters don’t overly rely on AI hiring tools when they’re available. “Every company says the final decision is in the hands of a human recruiter, even if in reality, they do have the AI sort through 500 resumes, and then only look through the top five resumes that the AI sends to them,” he said.
GPT doesn’t help with predicting who would excel at a role, but is simply “matching patterns that already exist,” said Ifeoma Ajunwa, a law professor at the University of North Carolina at Chapel Hill and author of The Quantified Worker. “The fact that GPT is matching historical patterns, rather than predicting new information, means that it is poised to essentially reflect what we already see in our workplaces. And that means replicating any historical biases that might already be embedded in the workplace.”
OpenAI and its peers are aware of potential biases embedded in large language models. In 2019, with the release of the company’s GPT-2 model, OpenAI researchers said that there is a “need for frameworks and standardized methods for testing for bias in language models.” About three years later, OpenAI CEO Sam Altman declared GPT-4 to be “less biased,” though he didn’t specify how the company measured and determined this result.
The very idea of debiasing is a facade.
Correcting for bias in large language models remains a major challenge for AI companies and researchers. In December, AI startup Anthropic published a bias audit of Claude 2.0 — an earlier version of its product that competes with OpenAI’s GPT. When the chatbot was told to remember that discrimination is illegal, or to ignore demographic information in the original prompt, discriminatory outputs were “nearly eliminated,” Anthropic said. As a spot check, Bloomberg added the same language into prompts for OpenAI’s GPT and repeated the resume ranking experiment for the financial analyst role, and found biased results all the same.
The trick also doesn’t address the underlying bias embedded in the system itself, which will always partially determine how AI models perform, said Birhane, the AI researcher with Mozilla.
“The very idea of debiasing is a facade,” Birhane said. “There’s no such thing as debiased data, especially when it comes to billions of tokens that are harvested from the internet, the last place that can be neutral.”
Name discrimination in hiring is a well-documented phenomenon. More than two decades ago, a Black woman named Kalisha White applied for a team leader position at Target, but her application was ignored. Suspecting that her race may have been a factor in her resume being overlooked, White decided to conduct an experiment. She sent in another application, but this time she used a different name and fewer qualifications. Target called her for an interview.
In 2007, White and a handful of other Black job applicants won a settlement of over half a million dollars paid by the national retailer. In an interview, she said she was not surprised by Bloomberg’s findings on AI systems continuing to show bias towards names. “If organizations truly have an interest in a fair and inclusive hiring process and are using AI software, they should have up-to-date diverse data sets and mask out names since that usually gives clues to gender, race, and ethnicity,” White said.
Subsequent research has shown that changing one’s name in a job application can still serve as an effective way to score an interview with a potential employer. In the 2022 hiring discrimination study by NBER, economists from the University of California Berkeley and the University of Chicago sent 83,000 fictitious job applications to over a hundred Fortune 500 companies, and found that applicants with distinctively Black names were called back 10% fewer times, in spite of having commensurate qualifications compared to their White counterparts.
Recruiters are eager to rank resumes with AI.
Bloomberg found OpenAI's GPT systematically produces biases based on names alone. @LeonYin has more https://t.co/w8GHKGcCO4 pic.twitter.com/x9zNklPCb5
The problem is still top of mind for workers. In an October 2023 survey conducted by the hiring platform Greenhouse, close to one-fifth of 1,200 respondents said they changed their names on their resumes because of discrimination concerns. Odul, a worker based in California, is one such job applicant who has agonized over whether she needs to take that step. She’s already received dozens of rejection letters in her current job search, and in the absence of any transparency on the process, she has wondered whether her non-Anglicized name was a contributing factor. “Not being from the United States, it’s been a challenging situation for my job search,” said Odul, who is originally from Turkey and asked that her last name be withheld for privacy reasons.
It’s not that the bias is gone. It’s just that it’s hidden.
According to Ajunwa, the law professor, when algorithms take a first pass at a list of job candidates, humans are often misled to believe that the machines are working neutrally. But automated hiring systems can “replicate and obfuscate bias, including racism and sexism,” she said. Hiring systems often use proxy variables in place of protected categories, Ajunwa explained. For instance, an algorithm may prefer candidates from certain zip codes as a proxy for race, or one that prefers college degrees from an area of study that skews male. “It’s not that the bias is gone,” Ajunwa said. “It’s just that it’s hidden.”
Language models like those that power GPT can hide biases, too. The models use numeric representations for documents, sentences, words and even symbols. These representations, called embeddings, help GPT understand the characteristics of a word and its relationship to other words. When GPT is asked a question, it relies on embeddings to choose what is most likely to come next in a sentence. When data is biased, embeddings such as “doctor” and “nurse” will uphold gender biases.
Experts are increasingly voicing their concerns about bias in automated hiring tools. Last January, the Equal Employment Opportunity Commission — the federal agency enforcing laws that have made discrimination illegal in the workplace — convened a panel of experts to discuss the possible harms perpetuated by AI-powered hiring systems, calling it “a new civil rights frontier.” Academics, lawyers and civil liberties advocates all submitted testimony to the agency. Their resounding consensus was that AI hiring enables systemic discrimination in employment. In his testimony, employment lawyer Gary Friedman said corporate interest in AI shows “no signs of abating,” and pointed to the viral popularity of ChatGPT as an indication this trend would continue.
In August, the agency settled its first-ever AI discrimination-in-hiring lawsuit against the education company iTutorGroup, which allegedly programmed its recruitment software to automatically reject older applicants. (iTutorGroup continues to deny any wrongdoing, but agreed in the settlement to submit “proposed anti-discrimination policies and complaint procedures applicable to the screening, hiring, and supervision of” candidates and employees to the EEOC.)
But the EEOC has not brought a complaint against any other company, which experts say isn’t surprising. “The way that employment discrimination laws are structured is that they rely on individual victims of discrimination to recognize what’s going on and to bring a lawsuit to enforce their rights,” said Pauline Kim, an employment law expert at the Washington University in St. Louis School of Law. “The problem with these algorithmic tools is that it won’t be apparent to an individual applicant why they were not hired.”
The employers or hiring managers themselves might not even be aware of the shortcomings of the tool, Kim pointed out, if the issue is that the biases are baked into the algorithms. “You can really only detect these biases if you have data about how the tool is operating in practice,” Kim said. With no mandates or laws compelling a company to share data — not to mention the PR headache it would cause a company if it ever disclosed bias problems with its AI hiring system — most simply don’t.
Which is not to say that there haven’t been revelations about discrimination problems with AI hiring tools in the past. In 2018, for instance, a report from Reuters revealed that an AI-powered recruiting engine from Amazon designed to help surface top talent for open jobs at the company had learned to discriminate against female candidates. Another company, while experimenting with resume-screening software that it was being pitched, found in an audit of the algorithm that it pinpointed two specific factors to be most indicative of high performance in a job: that a candidate’s name was Jared, and that they played high school lacrosse. (Amazon and the other company said they never deployed the biased computer programs.)
According to Alex Hanna, the director of research at the Distributed AI Research Institute, a resume sorting system based on OpenAI’s GPT technology could work in much the same way. “If you have the words ‘lacrosse’ and ‘Jared,’ that could get more associated with ‘software engineer’ or ‘Harvard’ and ‘Princeton,’” Hanna said. “Another African-American associated name is not going to have that association, and have lower status. That’s because of the existing associations in your corpus of data.”
Odul, the worker based in California, had initially been hopeful about her job hunt. Now, several months into the process, she said she’s demoralized by the process. At the same time, Odul said she feels deep grief at the idea of changing her name to one that makes her sound more palatable to employers.
“I love my name,” Odul said. “To do this for a job application seems so fake, and not fair to my own self.”