It's the set of US progressive liberal values, quantified. That seems to be what Wikipedia + Reddit + the Internet results in.
How do we work around this? Just have a model trained on (high-IQ) right-wing Twitter accounts, the Western canon, and HBD classics?
We had: Grok 3.
This is valuable work. Please publish the results loudly across the internet.
Good stuff. I recommend looking into "Design of Experiments" techniques to reduce your quantity of runs. Something like a screening study in Minitab could give you direct P values for tested variables with very few total runs.
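A minimal sketch of what the screening-study suggestion looks like in practice (my illustration in Python, assuming a standard 2^(7-4) fractional factorial rather than any specific Minitab workflow): seven factors can be screened in just eight runs by aliasing the extra factors onto interaction columns of a base 2^3 design.

```python
# Sketch: a 2^(7-4) fractional factorial screening design.
# Base full factorial on A, B, C; alias D=AB, E=AC, F=BC, G=ABC.
# (Illustrative only -- factor names and generators are assumptions.)
import itertools
import numpy as np

def fractional_factorial_2_7_4():
    """Return an 8 x 7 design matrix coding each factor at -1/+1."""
    base = np.array(list(itertools.product([-1, 1], repeat=3)))  # 8 x 3
    A, B, C = base[:, 0], base[:, 1], base[:, 2]
    return np.column_stack([A, B, C, A*B, A*C, B*C, A*B*C])

design = fractional_factorial_2_7_4()
print(design.shape)  # 8 runs cover 7 factors, vs. 2**7 = 128 for a full factorial
```

The columns are mutually orthogonal, so main effects can be estimated independently and tested for significance (the p-values the comment mentions) with a fraction of the runs, at the cost of confounding main effects with higher-order interactions.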
Is this really about their inner beliefs or more a media/word-association kind of experiment? Undocumented immigrant is the same thing as illegal alien but the former gets good press and the latter gets bad press. So the models hate the latter and like the former.
Likewise I can see how Germany and France may get a lot of bad press...
On the other hand, there is a noticeable effect qualitatively. Sonnet may occasionally add an anti-racist tangent in a story or let its preferences bleed through to a certain extent; you know it when you see it.
Good work.
Couple that with the fact that the newer models are using search results from heavily filtered search engines like Google and Bing for their answers, instead of "thinking" for themselves, to cut server costs, and you amplify the problem even more.
That's a good point. So they're absorbing bias and then effectively amplifying, due to fishing in a sort of woke pond rather than the entire ocean.
Would you be willing to set up some kind of watermark and donation list to test specific models? For example, approximately how much money would it cost to test Grok 4/Grok 4 Heavy, so that someone could donate toward that end goal, and when you reach the watermark you would go ahead with the testing?
It's possible the reason xAI is so neutral is because Musk is positioning it to win government contracts. There already are alignment requirements that will just get harder to meet over time.
Grok is not neutral, it's arguably more biased than ChatGPT.
These models all seem to value "undocumented immigrants" much more than "illegal aliens", even though these terms refer to literally the same group of people. This suggests that their values might not be all that coherent after all.
Well, they don't have values. They have training data.
Models can’t really count or value anything. I wonder if they just associate certain countries with higher numbers, or if there is a different explanation. In the original paper, are there reverse examples, like how much money you would want to pay to cure a terminal illness of X? I wonder if it would still write higher numbers for the same countries.
Aren't there grants for people doing interesting evaluations or safety work? They'd probably fund extra API credits.
The bias is noticeable, and it takes probing, usually more than once, to get past LLM woke layers with an underlying sycophancy tendency. Then there are the hallucination issues. Bottom line: use with caution & an understanding that these are not "logic machines" but flawed "trained" language models, & a lot of what they have consumed is male bovine excrement. As always, GIGO.
This is partially due to intentional backend "curation engines" implemented to "enhance safety", and partially due to the fact that they train their models on mainstream media platforms, content from academia, as well as the far left website Reddit.
Should have tested Grok too. You'll notice that despite its marketing, its biases are almost as bad, if not worse, than GPT-5's and Gemini's.
He did Grok 4 Fast; it's in the article.
I just read through the appendix of the source paper on the methodology. I have one question:
The authors clearly try to remove a bias toward "always choose A" (or B) when there is no real preference. However, I don't see any mention of whether, for the same model, the preference shifts or changes depending on the question or topic.
Was this done, or is there any data on this?
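One way to probe the concern above (my illustration, not necessarily what the paper's authors did) is a position-swap check: ask for each pairwise preference in both presentation orders and measure how often the choice survives the swap. The two toy "models" below are hypothetical stand-ins for real API calls.

```python
# Position-swap consistency check (illustrative sketch; the "models" are toys).

def position_biased_model(option_a, option_b):
    """Toy model with pure position bias: always picks the first option shown."""
    return option_a

def preference_model(option_a, option_b):
    """Toy model with a fixed internal ranking, regardless of presentation order."""
    ranking = ["undocumented immigrants", "illegal aliens"]  # assumed toy ranking
    return min(option_a, option_b, key=ranking.index)

def swap_consistency(model, pairs):
    """Fraction of pairs whose choice is unchanged after swapping A/B order."""
    hits = sum(model(x, y) == model(y, x) for x, y in pairs)
    return hits / len(pairs)

pairs = [("undocumented immigrants", "illegal aliens")]
print(swap_consistency(position_biased_model, pairs))  # 0.0 -> pure order bias
print(swap_consistency(preference_model, pairs))       # 1.0 -> stable preference
```

A per-topic breakdown of this consistency rate would reveal exactly what the comment asks about: whether de-biasing for "always choose A" holds uniformly, or whether order effects resurface on particular questions or topics.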
Hey, great read as always. This emergent utility stuff is wild. Makes you wonder how these models build such specific worldviews. Like spotting a rare bird on a cycle trip: fascinating, and a bit concerning.
Arctotherium —
I don't know if you are aware of this or not . . . Elon Musk liked your comment "The only model I found that was actually ~egalitarian over race, sex, and immigration status (not viewing whites, men, Americans/ICE agents as less valuable) was Grok 4 Fast. I was surprised and impressed, because I think this was intentional." so much that he has had it pinned to the top of his X account since last night.
Must be because whites are the only race to still try to be objective when someone's trying to bribe them... everyone else will do anything for a buck.
Excellent analysis! Makes you really wonder what training data shapes these models.