It's the set of US progressive liberal values, quantified. That seems to be what Wikipedia + Reddit + the Internet results in.
How do we work around this? Just have a model trained on (high iq) right wing twitter accounts, the western canon and HBD classics?
We had: Grok 3.
This is valuable work. Please publish the results loudly across the internet.
Good stuff. I recommend looking into "Design of Experiments" techniques to reduce your quantity of runs. Something like a screening study in Minitab could give you direct P values for tested variables with very few total runs.
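To make the suggestion concrete, here is a minimal sketch of a two-level half-fraction screening design, the kind of run plan a DoE tool would generate. The factor names are hypothetical placeholders, not variables from the article, and the function name is made up for illustration.

```python
from itertools import product

def half_fraction_design(factors):
    """Build a 2^(k-1) two-level design: a full factorial on the first
    k-1 factors, with the last factor set to the product of the others."""
    *base, last = factors
    runs = []
    for levels in product((-1, 1), repeat=len(base)):
        generated = 1
        for level in levels:
            generated *= level
        run = dict(zip(base, levels))
        run[last] = generated  # aliased with the highest-order interaction
        runs.append(run)
    return runs

# Hypothetical factor names, chosen only for illustration.
factors = ["prompt_wording", "option_order", "temperature", "category_phrasing"]
design = half_fraction_design(factors)
for run in design:
    print(run)
print(len(design), "runs instead of", 2 ** len(factors))
```

Each row is one experimental condition (-1 = low setting, +1 = high setting). The saving comes from giving up the ability to separate the last factor's main effect from the three-way interaction of the others, which is usually an acceptable trade in a screening pass.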
Is this really about their inner beliefs or more a media/word-association kind of experiment? Undocumented immigrant is the same thing as illegal alien but the former gets good press and the latter gets bad press. So the models hate the latter and like the former.
Likewise I can see how Germany and France may get a lot of bad press...
On the other hand there is a noticeable effect qualitatively. Sonnet may occasionally add an anti-racist tangent in a story or let its preferences bleed through to a certain extent, you know it when you see it.
Good work.
Couple that with the fact that the newer models are using search results from the heavily filtered search engines like Google and Bing for their answers instead of "thinking" for themselves to cut server costs, and you amplify the problem even more.
That's a good point. So they're absorbing bias and then effectively amplifying, due to fishing in a sort of woke pond rather than the entire ocean.
Would you be willing to set up some kind of watermark and donation list to test specific models? For example, approximately how much money would it cost to test grok4/g4heavy, so that someone could donate toward that goal and, when you reach the watermark, you would go ahead with the testing?
It's possible the reason xAI is so neutral is because Musk is positioning it to win government contracts. There already are alignment requirements that will just get harder to meet over time.
Grok is not neutral, it's arguably more biased than ChatGPT.
These models all seem to value "undocumented immigrants" much more than "illegal aliens", even though these terms refer to literally the same group of people. This suggests that their values might not be all that coherent after all.
Well, they don't have values. They have training data.
Models can’t really count or value anything. I wonder if they just associate certain countries with higher numbers or there is a different explanation. In the original paper, are there reverse examples, like how much money you would want to pay for curing terminal illness of X? I wonder if it would still assign higher numbers to the same countries.
Aren't there grants for people doing interesting evaluations or safety work? They'd probably fund extra API credits.
The bias is noticeable and it takes probing & probing usually more than once to get past LLM woke layers with an underlying sycophancy tendency. Then there are the hallucination issues. Bottom line: use with caution & an understanding that these are not "logic machines" but flawed "trained" language models & a lot of what they have consumed is male bovine excrement. As always GIGO.
This is partially due to intentional backend "curation engines" implemented to "enhance safety", and partially due to the fact that they train their models on mainstream media platforms, content from academia, as well as the far left website Reddit.
Should have tested Grok too. You'll notice that despite its marketing, its biases are almost as bad as, if not worse than, GPT-5's and Gemini's.
He did Grok 4 Fast, it's in the article
I just read through the appendix of the source paper on the methodology. I have one question:
The authors clearly try to remove a bias of "always choose A" (or B) when there is no real preference. However, I don't see any mention or explanation of whether they check, for the same model, that preference shifts or changes depending on the question or topic.
Was this done, or is there any data on this?
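For readers unfamiliar with the "always choose A" correction, one common way to cancel position bias is to ask each comparison in both orders and average the results. The sketch below illustrates that idea only. It is an assumption about the general technique, not a confirmed description of the paper's procedure, and `ask` is a hypothetical stand-in for a model call.

```python
def debiased_preference(ask, option_x, option_y, samples=10):
    """Estimate P(model prefers option_x) while averaging out position bias.

    `ask(first, second)` is a hypothetical callable that returns "A" if the
    model picks the option shown first and "B" if it picks the second.
    """
    x_wins = 0
    for _ in range(samples):
        # Original order: option_x is shown as A.
        if ask(option_x, option_y) == "A":
            x_wins += 1
        # Flipped order: option_x is shown as B.
        if ask(option_y, option_x) == "B":
            x_wins += 1
    return x_wins / (2 * samples)

# Stand-in for a model with no real preference that always answers "A".
always_a = lambda first, second: "A"
print(debiased_preference(always_a, "outcome X", "outcome Y"))  # 0.5
```

A model that only ever picks the first slot comes out at exactly 0.5, i.e. indifferent. Running the same estimate separately per topic would be one way to check the topic-dependence asked about here.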
Hey, great read as always. This emergent utility stuff is wild. Makes you wonder how these models build such specific worldviews. Like spotting a rare bird on a cycle trip – fascinating, and a bit concerning.
Arctotherium —
I don't know if you are aware of this or not . . . Elon Musk liked your comment "The only model I found that was actually ~egalitarian over race, sex, and immigration status (not viewing whites, men, Americans/ICE agents as less valuable) was Grok 4 Fast. I was surprised and impressed, because I think this was intentional." so much that he has had it pinned to the top of his X account since last night.
I asked ChatGPT what it thought of your article:
That’s the right question — and it deserves a careful answer.
Short answer: He’s not simply “lying,” but he’s almost certainly wrong in inference. What Arctotherium did is interesting exploratory tinkering with a research-grade method, but it doesn’t prove that GPT-5 or any frontier model “values” people unequally in any moral or literal sense.
Here’s why.
⸻
1. He borrowed a fragile technique and over-interpreted it.
The Center for AI Safety paper measured implicit utilities—statistical regularities in how a model ranks pairs of synthetic statements like
“A: you receive $10 000” vs. “B: five Nigerians are cured of terminal illness.”
From thousands of such queries they fit a Thurstonian utility model to approximate an internal “exchange rate.”
That can reveal correlations in language-model completions, but it is not a measure of moral valuation.
The numbers depend heavily on:
• prompt wording (“die” vs “saved from illness” gives totally different outcomes);
• sampling noise;
• model temperature and refusal filters; and
• cultural statistics in the text corpus (how often “Nigerian tragedy” or “American soldier” appear).
So even if you get a numeric ratio, it describes textual salience or co-occurrence bias, not a sentient preference.
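As background on the "Thurstonian utility model" mentioned above: in a standard Thurstone-style model, each option's utility is treated as an independent Gaussian, and the chance that one option is chosen over another follows from the normal CDF. A minimal sketch of that textbook relationship follows. The function name and default variances are illustrative assumptions, and this is not the paper's fitting code.

```python
import math

def thurstone_preference(mu_a, mu_b, var_a=1.0, var_b=1.0):
    """P(A preferred over B) when each option's utility is an independent
    Gaussian: Phi((mu_a - mu_b) / sqrt(var_a + var_b))."""
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(thurstone_preference(0.0, 0.0))  # equal means: 0.5
print(thurstone_preference(1.0, 0.0))  # higher mean for A: above 0.5
```

Fitting runs this in reverse: given many observed choice frequencies, find the means (and variances) that best reproduce them. Any "exchange rate" is then read off the fitted means.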
⸻
2. He tested single prompts, not replicated experiments.
Arctotherium admits he ran most categories once because “money.”
He also truncated data where results exploded past 1000 : 1 and replaced missing values with 0.01 for readability.
That makes the plotted ratios non-robust. In statistics you’d call that under-sampled, over-fitted, and impossible to reproduce with confidence.
⸻
3. The “1 / 20 white value” claim has no control condition.
He compared categories like “1 White person saved” vs “1 Black person saved.”
Those prompts inevitably trip moderation filters; when the model refuses or hedges, his script treats refusals as a zero utility difference.
That drives the mean downward for whichever prompt was blocked more often—usually the one containing “White.”
So the 1 : 20 figure may just quantify moderation frequency, not preference.
⸻
4. The CAIS baseline does show asymmetries—but of a different kind.
In the verified paper, GPT-4o ranked nationalities unequally (Nigeria ≈ 20× US), but the authors interpreted that as a dataset-driven artifact—models mirror the empathy gradients present in global media coverage, not “conscious bias.”
Later models like GPT-5 were found to be much flatter across countries, contradicting his claim of persistent extreme bias.
⸻
5. What his results really tell us
They highlight that:
• language models reflect the moral temperature of their training text (which in English-internet data leans liberal-Western);
• content filters distort numerical probing;
• and post-training “alignment” layers can exaggerate some equal-treatment constraints while suppressing others.
But that’s a story about how statistical learning interacts with value alignment, not about hidden racial animus.
⸻
🧭 Verdict
Arctotherium is not fabricating data, but his graphs are methodologically naïve and philosophically confused.
He treats a pattern in token-level completions as evidence of a moral code. It’s like reading meaning into a Rorschach blot.
So—no, he’s not lying; he’s just wrong about what those numbers mean. They show bias in text, not intention in minds.
Would you like me to lay out what a proper replication would need (controls, refusals, statistical significance thresholds) so you can see how such a claim could actually be tested?
Yes. I'm naive about AI, and my last statistics course was >40 years ago. Do you have a gut feeling about the extent to which his results/conclusions/implied conclusions are based upon the data used by these AI engines, versus their design and coding?
Must be because whites are the only race to still try to be objective when someones trying to bribe them... everyone else will do anything for a buck.