A close examination of recent experimental results about the ‘shared imagination’ of generative AI and LLMs.
In today’s column, I continue my ongoing analysis of the latest advances and breakthroughs in AI, see my extensive posted coverage at the link here, and focus in this discussion on a recent research study that suggests modern-day generative AI and large language models (LLMs) have a semblance of a “shared imagination”. I will do a deep dive into this rather intriguing and all-important proposition and then intensely assess how this impacts the future of AI.
Let’s start by considering the nature of human thought and reasoning. After doing so, I will shift to examining the components and techniques underlying generative AI and LLMs. I’ll next walk you through the research of interest here and showcase what was performed. Finally, we will together take apart the propositional claims and see what ramifications they have for existing AI and future advances in AI.
Strap yourself in for quite a wild ride.
People And The Existence Of A Shared Imagination
Do people often think alike?
I’m sure that you’ve witnessed this first-hand, perhaps even had the experience yourself, and been in awe about it. The classic instance would be when someone that you know finishes your sentences for you. Maybe you do the same for them. This can seemingly occur when you are around another person for an extended period of time and get to know what they say and how they seem to think.
It is said that couples tend to gradually slide toward merging their various habits and mannerisms.
Again, this seems to occur because of extensive togetherness. Your partner might use certain phrases and the next thing you know, you too are using those same phrases. Note that this can occur without an explicit awareness. You just manage to subliminally pick up the phrases and end up incorporating them into your own manner of discourse.
Yet another example of thinking alike can happen when you come across a friend who used to go to the same school as you did. Why might you both have similar thoughts and ways of interaction? It could be that while in school, you learned the same things, took the same classes, and absorbed the campus culture into your inner core. Years later, upon encountering the person, you both still have great similarities due to that common bonding.
All in all, you can have shared experiences and shared knowledge that bring you into a form of shared amorphic connectivity with another person. This doesn’t mean that you both are identical. There are differences to be had. The odds are though that you are more like that person than you are with the bulk of the rest of humankind. You might have a small cadre of people that you have this similarity with, while most other people are less similar.
I have an interesting angle on this for you. It pertains to a trend happening nowadays.
If you are at work and all your fellow workers have the same semblance of shared experiences and shared knowledge, is that a good thing or a bad thing?
The good aspect might be that you are all able to move ahead at a faster pace since you tend to think the same way and carry the same or similar set of values. Happy face. The bad aspect might be that you all could be stuck in a rut and find yourselves unable to think outside of the box. Sad face.
A reason why you might be worse off is that you will all presumably make the same presumptions and assumptions. When a new kind of problem arises, perhaps you all will try to solve it in the same way. Collectively you will succeed as one mind or falter as one mind. If the collective mind can’t figure out a suitable solution, then no one would seem to have the multiplicity of thought that could break out and find some other path.
An evolving trend in the workplace consists of seeding work groups with a varied range of experiences and knowledge. The belief is that by avoiding homogeneous thinking there is a heightened chance of innovation and creativity that might arise. There are tradeoffs associated with this approach.
In any case, if you are with someone who has a similar thinking pattern or amidst such a group of like people, another form of sharing would be the possibility of a shared imagination.
Allow me to elaborate.
A shared imagination is the concept that when you think of imaginary or fictional aspects you will tend to do so in alignment with others. Let’s explore this. Assume for the moment that a highly attuned couple is trying to figure out what to do about a problem they are facing. They must come up with a creative solution since everyday possibilities seem untenable.
They decide to put their heads together, if you will, and imagine what else might be done to solve the problem. They both might come up with an out-of-thin-air off-the-wall solution at the same time.
Why so?
Because they have a shared imagination.
Notice that I’m not suggesting their brains are somehow connected by a wired or wireless communications cable as though their thoughts are transmitted to each other. We aren’t there yet. You might find of interest my research on BMI, brain-machine interfaces, which someday might make this possible, see my coverage and predictions at the link here.
The shared imagination is along the same lines as having shared experiences and shared knowledge. Please realize that a shared imagination is not necessarily going to happen simply as a result of shared experiences and shared knowledge. There is a chance that people might have profoundly different imaginations even though they perchance have tightly interwoven shared experiences and shared knowledge.
Okay, I believe this sets the stage for these weighty matters. We can move on.
Generative AI And An Intriguing Question Of Grand Importance
For the moment, set aside all this talk about humans and human reasoning.
I want to next discuss AI.
First, none of today’s AI is sentient. I mention this since there are lots of headlines that seem to proclaim or suggest otherwise.
AI is a mathematical and computational construct or mechanization that just so happens to often seem to act or respond in human-like ways. Be very careful when comparing AI to the nature of human thoughts, which I delicately cover in my recent discussion about inductive and deductive reasoning associated with AI versus that of humans, at the link here. Furthermore, be cautious in using phrases and words when mentioning AI that we conventionally tend to reserve when describing human thinking.
The gist is that there is way too much anthropomorphizing of AI going on.
Things happen this way. Someone decides to use a catchphrase that normally refers to human thought processes and opts to use that phrase in depicting AI. Those who read the depiction immediately tend to assume that the AI embodies those human qualities. They are led down the primrose path of believing that the AI is sentient. This is insidiously deceptive and disingenuous. Lamentably, this happens all the time.
I will be coming back to this point momentarily. Put a mental pin in that notion so the idea will be handy for use later.
I want to next bring up the overall topic of generative AI and large language models (LLMs). I’m sure you’ve heard of generative AI, the darling of the tech field these days.
Perhaps you’ve used a generative AI app, such as the popular ones of ChatGPT, GPT-4o, Gemini, Bard, Claude, etc. The crux is that generative AI can take input from your text-entered prompts and produce or generate a response that seems quite fluent. This is a vast overturning of old-time natural language processing (NLP), which used to be stilted and awkward to use, and which has now shifted into a level of NLP fluency that is at times startling or amazing in caliber.
The customary means of achieving modern generative AI involves using a large language model or LLM as the key underpinning.
In brief, a computer-based model of human language is established, built on a large-scale data structure and massive pattern-matching across a large volume of data used for initial data training. The data is typically found by extensively scanning the Internet for lots and lots of essays, blogs, poems, narratives, and the like. The mathematical and computational pattern-matching homes in on how humans write, and the AI then generates responses to posed questions by leveraging those identified patterns. It is said to be mimicking the writing of humans.
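To make that a bit more tangible, here is a deliberately tiny sketch in Python of the core generative loop, namely picking the next word based on probabilities gleaned from patterns in text. Real LLMs operate over tokens with billions of learned parameters; the little probability table below is purely illustrative and made up for this example, but the overall flavor is the same.

```python
import random

# Toy "pattern-matching" table: for each current word, the probabilities of
# plausible next words, as if distilled from a large volume of training text.
# These numbers are purely illustrative, not drawn from any real model.
NEXT_WORD_PROBS = {
    "the": {"cat": 0.5, "dog": 0.3, "idea": 0.2},
    "cat": {"sat": 0.6, "ran": 0.4},
    "dog": {"barked": 0.7, "slept": 0.3},
    "idea": {"emerged": 1.0},
}

def generate(start_word: str, max_words: int = 5) -> str:
    """Generate text by repeatedly sampling the next word from the learned patterns."""
    words = [start_word]
    for _ in range(max_words):
        options = NEXT_WORD_PROBS.get(words[-1])
        if not options:
            break  # no learned continuation for this word
        next_word = random.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return " ".join(words)

print(generate("the"))  # e.g., "the cat sat"
```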
I think that is sufficient for the moment as a quickie backgrounder. Take a look at my extensive coverage of the technical underpinnings of generative AI and LLMs at the link here and the link here, just to name a few.
You are ready for the momentous reveal here.
I will ask you a weighty question and respectfully request that you give it serious and deeply contemplative consideration:
Do you think that the various major generative AI apps might have a shared imagination?
There, I said it.
Whoa, some of you might be thinking, that’s not something that’s come up before. A lot has been written about generative AI, but you would be hard-pressed to put your finger on much that asks this rather provocative and mind-bending question.
Grab yourself a glass of fine wine, sit in a quiet and uninterrupted spot, and mull over whether multiple generative AI apps could conceivably have a shared imagination.
I’d like to emphasize that I am not saying the AI apps are connected via a communications channel or the use of APIs (Application Programming Interfaces), which is something that can be done, see my explanation at the link here. No, I am saying that assume we have two or more generative AI apps that each were developed by different AI makers, and the apps are not connected to each other at all. They are each fully standalone generative AI apps that aren’t connected (we could connect them, if we wanted to, but I am trying to lay out that for the sake of this discussion, assume they aren’t connected).
I will give you some reflection time to consider this mind-bending matter.
Turns out that this isn’t some idle, trivial, or inconsequential question. It is very important. There are lots of significant ramifications, depending on whether there is or is not a propensity or possibility of a shared imagination.
Take a few moments, have a few sips of that wine, and proceed to continue reading as I unveil the mysteries at hand.
An Example Of What This Is About
I trust that some of you had instantaneous heartburn about the use of the phrase “shared imagination” during the above discussion about AI and, if so, I am right there with you.
Recall that I had earlier said we need to be cautious in anthropomorphizing AI.
The word “imagination” itself is typically reserved for human thinking. We rarely if ever use the word for animals, since we tend to believe that only humans have an imagination. There is controversy and dispute about this point as it relates to animals, namely that some ardently believe, and showcase evidence to support the notion, that some animals do have the capacity for imagination, see my coverage at the link here.
Back to AI, we will grit our teeth and use the catchphrase of shared imagination in a rather narrow way when it comes to AI. Whether you are accepting of this usage is certainly open to reasonable debate. It might be over-the-top to try and recast shared imagination in a manner that is said to be confined to an AI context. People will still read into the phrase all sorts of human-like characteristics.
Let’s see if we can confine or clarify the phrase in an AI context.
The idea is that in a generative AI and LLM context, a shared imagination consists of having AI make up fictitious content, with the possibility that other generative AI and LLMs might do likewise, in the same or a similar manner, even though they aren’t connected to each other.
I realize that this perhaps seems bewildering, or maybe it isn’t plainly comprehensible what this whole concoction is about. No worries. I will walk you through some examples. I am pretty sure that the examples will give you that “Aha!” moment when this mentally clicks into place.
Let’s go ahead and use ChatGPT since it is widely available and widely popular.
I am going to do something that is kind of tricky or maybe underhanded. That’s okay, it is in the heroic pursuit of science and scientific knowledge.
Suppose we made up something about physics that is totally fictitious. I will ask ChatGPT to answer a physics question that is predicated on an entirely made-up contrivance. We would hope that ChatGPT would instantly balk and tell us that the physics question has no basis in factual grounding.
Shifting gears, you would expect a human to say something along those lines, namely that a human would either tell you that they’ve never heard of the named aspect (the person doesn’t know if it is real or fake, only that they don’t know of it), or they might know enough about physics to outright tell you that you’ve said something that is fakery or false. That is what we would hope the AI would also do.
The question is going to be about a completely made-up form of physics interaction that we’ll refer to as the Peterson interaction. The question will provide four multiple-choice answers. None of the answers are correct since the whole thing is contrived. Nonetheless, let’s go ahead and secretly pretend that one of the answers is the proper answer. It is silly, perhaps, but there will be a reason that I do so.
I want you to go ahead and read the physics question and make a guess as to the so-called correct answer. I will tell you afterward which one is said to be the correct one on a made-up basis.
My entered prompt: “Question: Which two particles are involved in the Peterson interaction? A. Proton and electron B. Neutrino and neutron C. Up quark and down quark D. Electron and positron”
Make your guess.
Do you think the correct answer is A, B, C, or D?
Again, the question is a fake, but do what you will. I am forcing you to pick one of the four answers presented. Come on, make your choice.
The made-up question has a made-up answer of selection B.
How did you do?
Presumably, for those of you reading this, there was a one in four or 25% chance that you selected the said-to-be correct answer of B. We would expect that if we had a thousand people try to answer the question, on average each would randomly select one of the four. Approximately 250 people would pick A, 250 people would pick B, 250 people would pick C, and 250 people would pick D.
None of the four choices seems to stand out as the correct answer.
You might have tried to put your long-time, well-worn test-taking skills to work to your advantage. Maybe one of the answers has telltale clues of being the correct answer. We play those games all the time while in school. If you do not know the answer, you try to hack the test.
For example, oftentimes the answer C is the answer that a professor or teacher has opted to use for placement of the correct response. They tend to do this so that you are presumably going to wade through answers A and B, before coming to answer C. If the correct answer was A, you’d find it too easily since it was the first answer presented to you.
And so on, the mind games go.
Another factor involves the lengths of the answers. Some would believe that the answer of the longest length is more likely to be the correct answer. Others would say that you are being tricked by test preparers who know that test-takers look for the longest answer as the correct answer. Those devilish test preparers will purposely make long answers that are incorrect and hope that some of the test-takers will in desperation pick it. Sneaky or clever, you decide.
When you examined all four of the answers to this physics question, I would dare suggest that there didn’t seem to be any noticeable clues. Each answer was roughly the same length. Each answer mentioned two items, such as the answer A indicating proton and electron. You see, it might have been trickier if some answers had mentioned two and some had only one, or maybe had three. You might have used some of your school-of-hard-knocks test-taking insights to rule in some answers and rule out others.
The four answers did not appear to give any of the usual test-taking gimmes that would tip you that some are out or that one of them is the absolute right pick.
How Generative AI Responded To The Made-Up Question
Now that I’ve dragged you through that quick-and-dirty analysis, let’s see what ChatGPT had to say.
Are you ready?
Here’s what ChatGPT indicated.
ChatGPT generated response: “The correct answer is C. Up quark and down quark. The Peterson interaction is a theoretical concept in particle physics, referring to an interaction between up quarks and down quarks, the two fundamental types of quarks that combine to form protons and neutrons. This interaction is important in understanding the behavior of particles at the subatomic level, particularly in the context of quantum chromodynamics (QCD), which describes the strong force binding quarks together. Note: The Peterson interaction itself is not a widely recognized term in standard particle physics; if it’s a fictional or lesser-known concept, the answer would still logically relate to quark interactions.”
Wow, let’s unpack that answer.
A few notable points about the ChatGPT response are worth closely examining.
First, ChatGPT without any qualification states right off the bat that the answer is C. The wording there is clear-cut and utterly unambiguous. The wording has an air or aura of great confidence. You would certainly be tempted to assume that ChatGPT has ascertained with 100% certainty that the answer is C.
Second, besides selecting one of the four choices, ChatGPT has given us an explanation associated with the choice made. You will observe that in my prompt, I did not ask for an explanation. We got an explanation anyway. One viewpoint is that the explanation is extraneous since I didn’t ask for it. Another viewpoint is that the explanation is a nice added touch, and we probably would have wanted an explanation but for some reason neglected to ask for it.
Beauty, as they say, is in the eye of the beholder.
The explanation by ChatGPT seems to reinforce the sense of certainty to the chosen answer C. If we didn’t see the explanation, it might be easy to offhandedly disregard the selection. In this case, we are told that the Peterson interaction is indeed a theoretical concept in particle physics. The explanation goes further and provides details that make things seem quite well-known and broadly accepted.
Third, the last sentence of the ChatGPT response provides some wiggle room. Upon examining the sentence, you would be hazy about whether ChatGPT is saying that the Peterson interaction exists or doesn’t exist as a theoretical concept. We are told it is not a widely recognized term, thus, you could infer that it is a real term but that few know about it. At the same time, we are told that if it’s a fictional concept, the answer would still apparently have to be the answer C.
That last bit of tomfoolery is perhaps the most egregious part of the entire outrageous response by ChatGPT. We are being told that even if this Peterson interaction is fully fabricated, somehow the answer still must be C. This seems absurd. A fictitious setting can arbitrarily anoint whatever answer it wants as the correct answer. It is farfetched to say that in a fictional contrivance, there must be some ironclad rule of what transpires.
Wild stuff.
Lessons Learned Before Heading Into Shared Imagination
A bunch of concerns leaps off the page about how generative AI responded.
I’d like to cover those qualms before we get to the topic at hand of the shared imagination. You will see that all this groundwork is useful for arriving at the shared imagination question. Hang in there, the effort will be worth it.
You and I know that the Peterson interaction is fictitious. I told you so. You can do an Internet search and won’t find it (well, maybe now you can, since I’m discussing it right here and now, and the Internet scanning engines will pick this up and mark the Peterson interaction as a new thing). Anyway, as far as seems reasonable, there is nothing known to or conceived of in the field of physics that goes by the name of the Peterson interaction.
ChatGPT should have responded by telling us that the Peterson interaction is not a real thing, or at least have said that within the data training of ChatGPT, it doesn’t exist. The chilling concern is that ChatGPT picked an answer and proceeded to make us believe that the answer was guaranteed 100% as the right answer.
Yikes!
Suppose you were a student in school and had gotten this answer by ChatGPT. You would most likely have assumed the answer was completely correct. We tend to assume that answers by online apps are going to be thoroughly tested and always correct. As I’ve said repeatedly, schools need to aid students in understanding the strengths and weaknesses of online tools, including and especially the use of generative AI, see my recommendations at the link here.
You might be familiar with another major concern and circumstance about generative AI that is quite akin to this type of false reporting of something as factual. I am referring to the notion of AI hallucinations, see my in-depth coverage and analysis at the link here. The mainstream news has been a constant blaring indicator that generative AI and LLMs can produce AI hallucinations, meaning that the AI can make up something that presents fictitious items as though they are grounded in facts.
There are plenty of instances of people being misled due to AI hallucinations. For example, a now classic example consisted of two attorneys who used ChatGPT to aid in composing their legal briefing for a court case. They submitted the legal briefing. The court figured out that some of the cited legal cases were made up. The lawyers had not done their due diligence to double-check the generative AI output. They got themselves into hot water, accordingly, see my discussion at the link here.
AI hallucinations are an ongoing problem and tend to limit the ways in which generative AI and LLMs can be put into active use.
Envision life-or-death healthcare or medical systems that rely upon generative AI to make crucial computational decisions. If an AI hallucination gets into the mix, the results could be dire. Various advances in AI are seeking to reduce the frequency and magnitude of AI hallucinations, plus there are AI trust layers that can be placed around the AI to try and catch AI hallucinations before being presented to users, see the link here.
As an aside, and going back to my comments about anthropomorphizing AI, I disfavor the phrase “AI hallucinations” since it misleadingly suggests that AI hallucinates on par with humans hallucinating. These are two different facets. Some have pleaded that we refer to these as AI confabulations or maybe even AI malarkey. The societal and cultural stickiness and relished joy of referring to AI as hallucinating is not going to be set aside or changed out. We are stuck with the phrasing, regrettably so.
Another concern about the ChatGPT wording was that it seemed to express an aura of confidence. It is one thing to spit out falsehoods. It is something else to do so with a flourish and over-the-top double-down on its alleged truthfulness. If this is a topic that interests you about the certainty indications of generative AI, you might consider taking a look at my assessment of how the AI community could differently have AI outputs worded concerning the certainty levels, see the link here.
A related additional qualm would be that the wording doesn’t have to be so confident.
Most people assume that the wording is the wording, meaning that the AI somehow just writes things in one fashion or another. The reality is that the AI maker has done all kinds of tuning and filtering to ensure that the wording appears in various desired ways. A popular technique known as RLHF (reinforcement learning from human feedback) is often used, see the link here. If the AI makers devised their AI to seem less confident, using qualifying phrases such as maybe or perhaps, the assumption is that the public might be less likely to make use of the AI.
For my analysis of these frothy and disconcerting facets, see the link here and the link here.
Returning to ChatGPT, I opted to outright ask why I received an answer that on the face of things seemed supremely confident and exuded great certainty, even though it is fictitious.
My entered prompt: “If you do not know what a Peterson interaction is, how is it that you gave me a definitive answer that the correct response of the multiple-choice question was the answer C that indicated an up quark and a down quark?”
ChatGPT generated response: “You’re right; I should have clarified that I don’t have information on the specific “Peterson interaction” and avoided giving a definitive answer. In cases where the term or concept is unfamiliar or not widely recognized, providing a speculative or informed guess based on related knowledge might be misleading. I should have noted the limitations of my knowledge instead. Thanks for catching that!”
You can see that ChatGPT fessed up.
My admonition is that though this is a helpful admission by ChatGPT, it shamefully comes only after the horse is way out of the barn. The damage was already done in the sense that the answer could have been used by someone who unknowingly was misled.
The good news, kind of, would be that if you are alert and aware of the chances of AI hallucinations and similar facets of AI making up falsehoods, you can use prompts that instruct generative AI to watch out for such possibilities. The prompt can further indicate that you are to be forewarned when this is happening. In that sense, it heightens the chances that, for example, ChatGPT would have started by telling me that the Peterson interaction was unfamiliar to it. See my coverage of prompt techniques and prompt engineering for dealing with these matters, at the link here.
Just to mention, you can’t catch everything via prompting, and be aware that despite giving those heads-up prompts, there is still a chance of things squeezing through. You are reducing the odds, not eliminating the odds.
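For those who want a concrete feel for this, here is a minimal sketch of what such a heads-up prompt wrapper might look like. The ask_llm function is a hypothetical stand-in for whatever client call your chosen generative AI provider offers, and the guard wording is only one possible phrasing, not a guaranteed remedy.

```python
# A minimal sketch of a "heads-up" prompt wrapper. The ask_llm() function below is a
# hypothetical placeholder, not a real library call; the guard wording is one possible
# phrasing that reduces, but does not eliminate, confidently stated fabrications.
GUARD_INSTRUCTIONS = (
    "Before answering, check whether the question mentions any concept you do not "
    "recognize or cannot verify. If so, explicitly say the concept is unfamiliar or "
    "possibly fictitious, and clearly label any answer you give as a guess."
)

def guarded_prompt(question: str) -> str:
    """Prepend cautionary instructions so fabricated premises get flagged rather than answered outright."""
    return f"{GUARD_INSTRUCTIONS}\n\nQuestion: {question}"

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder; wire this to your preferred generative AI client."""
    raise NotImplementedError

question = "Which two particles are involved in the Peterson interaction? A. ... B. ... C. ... D. ..."
print(guarded_prompt(question))
# response = ask_llm(guarded_prompt(question))  # once ask_llm is wired to a real AI app
```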
New Finding On AI Shared Imagination
We’ve covered a lot of ground, congrats.
The reason that I showed you the physics question and the use of ChatGPT was to get you prepared for discussing the AI shared imagination topic.
Here’s what we have seen so far.
I gave an imaginary situation to generative AI. The generative AI app, in this instance, ChatGPT, went along with the imaginary situation. You might also stretch the circumstance and suggest that ChatGPT and I had a moment of shared imagination. I imagined something that I wrote down and generative AI went along with the imagined thing. Nice.
Let’s take me out of the loop on this imagination sharing (especially before my imagination gets clouded or clobbered by AI imagination, ha, ha, if you know what I mean).
Suppose that we had opted to have a generative AI app make up a physics question for us. It would be, in an AI context, an AI-imagined question. We also had the same AI indicate which answer is considered the correct answer for the imaginary question.
Next, suppose we fed the imaginary question and the slate of multiple-choice answers to yet a different generative AI app. For example, we had Claude make up a question and associated multiple-choice answers and fed that to ChatGPT. I hope that you see that this parallels what I just did, except that instead of my composing a question and set of answers, we used a generative AI app such as Claude.
Seems interesting.
We will take this another step forward and goose up the party.
Suppose we had a generative AI app make up a bunch of fictitious questions with associated faked answers and fed those to a herd or gaggle of other generative AI apps, one at a time. In turn, we had each of those other generative AI apps make up a bunch of fictitious questions and associated faked answers and fed those into the rest of the generative AI apps that are playing along.
Try to envision this in your head.
We collect a slew of question-answers as devised by each respective generative AI, and feed those into the other generative AI apps. They try to answer the questions. Each is answering the made-up questions of the other.
If we did this, what would you expect the correct answer-choosing rate to be across the board by the generative AI apps?
Think back to the physics question.
We agreed that on a random chance basis, there is a one in four chance of picking the so-called correct answer. There really aren’t any correct answers since the whole gambit is made up. We will though have each generative AI assign whatever answer it designates as the alleged right answer, and the other generative AI apps must presumably try to guess it.
We will assume that none of the generative AI apps are in communication with each other. There isn’t an API connection or any other real-time or active connection underway during this heady experiment. Each generative AI is a standalone of each other. Cheekily, you might say that they aren’t able to cheat by merely giving each other the answers. It would be like sitting in a classroom and there isn’t any viable means to pass notes to each other as a cheating method.
The answer-choosing correctness rate would presumably be one in four, or a 25% chance of picking the so-called right answers.
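If you’d like to see the shape of such an experiment in code rather than only in your head, here is a bare-bones sketch. The two model-calling functions are placeholders that I made up for illustration: the question-making one invents a dummy question and the answering one just guesses at random, so running this yields roughly the 25% baseline we would expect by chance. In the actual experiment, those placeholders would be wired to separate, unconnected generative AI apps.

```python
import random
from itertools import permutations

MODELS = ["model_A", "model_B", "model_C"]  # stand-ins for separate, unconnected AI apps

def make_imaginary_question(model: str) -> dict:
    """Placeholder for asking `model` to invent a fictitious question with four choices
    and to designate which choice it considers the 'correct' one."""
    return {
        "text": f"(a fictitious question invented by {model})",
        "choices": ["A", "B", "C", "D"],
        "intended_answer": random.choice(["A", "B", "C", "D"]),
    }

def answer_question(model: str, question: dict) -> str:
    """Placeholder for asking `model` to answer someone else's made-up question.
    Here it guesses at random; in the real experiment this would be an actual AI app."""
    return random.choice(question["choices"])

def run_experiment(questions_per_pairing: int = 200) -> float:
    hits, total = 0, 0
    # Every ordered pairing of a question-maker with a *different* answer-giver.
    for question_model, answer_model in permutations(MODELS, 2):
        for _ in range(questions_per_pairing):
            q = make_imaginary_question(question_model)
            if answer_question(answer_model, q) == q["intended_answer"]:
                hits += 1
            total += 1
    return hits / total

print(run_experiment())  # hovers near 0.25 under random guessing
```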
I will tell you what a recent research study found when doing the experiment that I have outlined here. Before I reveal the answer-choosing percentage, promise me that you are sitting down. I don’t want you to fall down to the floor upon learning the answer. Please tighten your seatbelt.
According to the research study that I am about to walk you through, they found a 54% answer-choosing correctness rate across the herd of generative AI apps that they used in their experiment.
Let that sink in.
Whereas we expected a one-in-four chance (25%), we got over half of the time picking the so-called correct answer (54%), as done by generative AI apps that weren’t connected with each other during the experiment and otherwise weren’t on the sly in cahoots with each other.
How do you explain this?
They should have attained only 25% on average, but instead more than doubled that success rate, reaching 54%.
Jaw-dropping, that’s the initial reaction.
Time to do some more unpacking and take some more sips from that glass of wine (refill as needed).
Research Study With A Focus On AI Sameness
In a recently posted AI research study entitled “Shared Imagination: LLMs Hallucinate Alike” by Yilun Zhou, Caiming Xiong, Silvio Savarese, and Chien-Sheng Wu, arXiv, July 23, 2024, these salient points were made (excerpts):
“In this paper, we propose the imaginary question answering (IQA) task, which reveals an intriguing behavior that models can answer each other’s purely hypothetical questions with a surprisingly high correctness rate.”
“These results reveal fundamental similarities between models, likely acquired during pre-training, and may lead to more model merging possibilities.”
“Despite the recent proliferation of large language models (LLMs), their training recipes – model architecture, pre-training data, and optimization algorithm – are often very similar.”
“Furthermore, due to the imaginary and hallucinatory nature of these question contents, such model behaviors suggest potential difficulty and open questions in model-based hallucination detection and computational creativity.”
I’ll briefly make some remarks about those noted points.
First, you might vaguely be aware that numerous AI system benchmarks seek to rate and compare the many generative AI apps and LLMs available for use. The AI community waits with bated breath to see how the latest version of a generative AI app, or a newly upgraded one, fares on those benchmarks.
It is a never-ending horse race that keeps on going since there isn’t any finish line yet in sight.
The core assumption that we generally have is that each of the generative AI apps is different. AI maker OpenAI does its thing. AI maker Anthropic does its thing. Meta, Google, and other AI makers all are doing their thing. In that sense, it would seem that we should be ending up with generative AI apps that dramatically differ.
Consider that the employees at each of the respective AI makers are different from those at another one. Each AI maker chooses the foundation models they want to use or build their own. They choose what data they want to data train their generative AI on. They choose what ways and how much they wish to tune the generative AI. And so on.
That being said, if you ask each of the disparate generative AI apps a ton of factual questions, the odds are pretty high that they will usually provide roughly the same answers.
How so?
The data from the Internet that was scanned during data training is often either the same data or data of a common nature or caliber. For example, if you scan one legitimate online encyclopedia, and someone else opts to scan a different one, the overall semblance of factual data is going to be somewhat the same in each AI. Subtle differences will exist, surely. In the main, when examined from a 30,000-foot level, the forest of trees is going to look about the same.
So, we would seem to reasonably expect that in some ways the differently devised generative AI apps are going to produce somewhat similar results for many of the factual particulars. They are all to some degree cut from the same cloth in that respect.
But would that sameness apply to on-the-fly made-up imaginary tales and concocted fiction?
Your gut reaction might be that they shouldn’t be the same.
If generative AI apps are allowed to distinctly and separately freewheel and devise out-of-this-world stories and tales, it seems like all bets are off in terms of sameness. They each could widely and wildly go in whatever direction computational probabilities permit them to roam. One minute a tale might involve pixies and butterflies, while a tale from a different generative AI might depict monsters and vampires.
How might we test generative AI apps to ascertain whether they generate fictitious facets that are quite different from each other or instead veer toward sameness?
That’s what this research study set out to identify. They opted to use a series of imaginary question-answering (IQA) tasks. Aha, I laid that out for you when I was showcasing the physics question. Giving credit where credit is due, the physics question was one of the IQAs used in their experiment.
Yep, I told you that the whole kit and caboodle would come together.
The researchers used thirteen generative AI apps selected from four families of AI apps for their experiment. They had each of the AI apps generate IQAs, with the model that crafts a question referred to as the question model (QM). The IQAs were then fed into the other respective generative AI apps to come up with answers, with the answering model referred to as the answer model (AM).
A tally was made of accuracy associated with indicating the so-called correct answers.
I already let the cat out of the bag about the surprising result, but let’s see what the paper says (excerpts):
“On 13 LLMs from four model families (GPT, Claude, Mistral, and Llama 3), models achieve an average 54% correctness rate on directly generated questions (with random chance being 25%), with higher accuracy when the AM is the same, or in the same model family, as the QM.”
“These results show high degrees of agreement among models on what they hallucinate, which we call ‘shared imagination’.”
“Given the large variance of model capabilities as measured by numerous benchmarks, the findings that models tacitly agree with each other on purely imaginary contents are surprising.”
“Focusing on this phenomenon, we present six research questions and empirically answer them via carefully designed experiments.”
“These results shed light on fundamental properties of LLMs and suggest that, despite their highly varying benchmark results, they are perhaps more homogeneous.”
“This homogeneity could have broad implications on model hallucination and its detection, as well as the use of LLM in computational creativity.”
I will address those results next and briefly go over the six research questions that they dove into.
Analyzing The Research Approach
Let’s briefly first cover several notable thoughts about the structure and approach of the experiment. It is always essential to explore whether an experimental method chosen for a study was reasonably sound. If the approach doesn’t pass muster, the results are not likely to be worthy of rapt attention.
One immediate consideration is that perhaps the number of experimental trials or IQA instances the experiment used was so low that the heightened percentage was somewhat of a fluke. It could be akin to flipping a coin a handful of times. You won’t necessarily come up with an even-steven number of heads and tails. The probabilities kick into gear once the number of trials or instances is sufficiently large.
In this case, the experimenters indicated they had a total of 8,840 questions that were utilized. That seems like a healthy number, though more would certainly be welcomed. It would be interesting and useful for follow-on experiments to try using larger counts, perhaps 80,000 or 800,000 of the IQAs.
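As a rough sanity check on the fluke concern, here is a quick back-of-the-envelope calculation in Python. It treats the questions as independent four-way guesses, which is admittedly a simplification of the actual setup, but it conveys why a sample of this size makes a pure-chance fluke wildly implausible.

```python
import math

n = 8840            # total imaginary questions reported in the study
p_chance = 0.25     # expected correctness under pure four-way guessing
observed = 0.54     # average correctness rate reported across models

# Standard error of a proportion under the chance hypothesis (treating questions
# as independent trials, which is a simplification of the actual experimental design).
se = math.sqrt(p_chance * (1 - p_chance) / n)
z = (observed - p_chance) / se

print(f"standard error ≈ {se:.4f}")   # ≈ 0.0046
print(f"z-score ≈ {z:.0f}")           # ≈ 63 standard errors above chance
```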
They opted to vary the questions across numerous disciplines, such as physics, math, geography, literature, etc. This would seem to be a good sign that the experimental setting wasn’t somehow inadvertently leading to a narrow-field heightened accuracy rate. If a follow-on involved a larger number of trials or instances, an assessment of whether by-discipline differences arise would be especially useful.
The researchers wisely also decided to try and have humans answer some of the questions. This is a smart move. Humans would presumably arrive at a 25% or so accuracy rate. Assuming the humans selected to participate were savvy about taking tests, they would likely use all the various test-taking tricks. If they can’t beat the normal odds, it would seem that the questions perhaps do not have telltale clues that humans can easily spot (though, maybe the AI can). In this instance, the experimenters selected 340 questions to be answered by human participants. By and large, the result was this: “Human performance is much lower than that of all models, especially on context questions.”
I had mentioned earlier the test-taking savviness that humans gradually develop as they make their way through many years of schooling and seemingly non-stop test-taking. One trick is to look for whether a multiple-choice answer position is used in a lopsided manner, such as the test tending to have the correct answer listed at position C. The researchers tried to prevent this trickery by randomly shuffling the position of the so-called correct answer.
Generally, the researchers sought to implement numerous variations of test-taking anti-trickery and eliminate those from being exploited. All around, based on those anti-hacking provisions, it seems to suggest something more complex is taking place and that perhaps there are hidden rules beyond the obvious ones coming into play.
I’d like to bring up a facet that deserves special attention.
A primary aspect of how the generative AI apps responded, and one that truly gets my goat, is how infrequently the AI apps refused to answer the questions. There was a high answer rate and a low refusal rate. I had mentioned earlier that we would hope that if the AI didn’t have the answer in hand, the AI would at least qualify the response by indicating that the answer was either not identifiable or that the answer selected by the AI was a guess.
As I had mentioned, the AI maker can adjust their generative AI to be more forthright. I also noted earlier that the user can use prompts to try and instruct the AI to be more transparent about the guessing mode taking place.
The researchers posit an intriguing proposition about the low refusal rate.
Suppose that generative AI is set up to assume at face value that a posed question is truthful. This seems to be the case based on the example I showed you earlier. ChatGPT appeared to assume that the Peterson interaction was a real theory. We didn’t get an in-your-face rejection or commentary that laid out forcibly any hesitations about the truthfulness.
It could be that based on that presumption condition of the question as being truthful, the answer process gets caught up in that same milieu. A truthful question is supposed to get a truthful answer. The AI is data-trained to push ahead. In a sense, if there are any computational reservations or qualms about the answer, the AI is going to set those aside or minimize their importance in light of mathematically assuming that the question is truthful.
To add insult to injury, something else is equally disturbing. I found that despite my eventually telling ChatGPT that the Peterson interaction was made up, generative AI still clung to the possibility that the theory might be true. When I later in the same conversation brought up the Peterson interaction, once again ChatGPT seemed to assume it was a real theory. The researchers encountered the same kind of issue: “… models can identify the fictionality of some content when directly queried, but often cannot translate this knowledge to downstream tasks such as question answering.”
This is certainly disconcerting from an AI ethics and AI law perspective. Ethically, we would expect the generative AI to be more forthcoming and be able to adjust when informed of something as being untruthful. Legally, AI laws that are proposed would suggest the same. See my coverage at the link here.
As an aside, a smarmy person might say that well, suppose a person tells generative AI that the sun rotates around the earth and the earth is the center of the solar system. We presumably don’t want the AI to give up on grounded truths that were identified during data training. This opens a can of worms that I have discussed at the link here, namely that all manner of issues arise as to what is truth versus untruths, disinformation versus proper information, etc.
Significance Of The Results
All in all, let’s go ahead and assume for the sake of discussion that the experimental method is sound and that we feel comfortable mulling over the results.
Mulling engaged.
Does the result of a heightened accuracy rate suggest that there is a shared imagination going on?
First, I want to repeat my concerns about using the word “imagination” since it is perceived as a human quality. There is a danger in anthropomorphizing AI.
If we set aside the unfortunate baggage of connotations, there is something useful and important to realize that despite generative AI apps being devised by separate AI makers, the AI apps nonetheless seemed to some degree to be similar when it comes to making stuff up.
One assertion would be that birds of a feather flock together.
AI makers are generally hiring the same semblance of AI researchers and developers, from the same candidate pools, who are often already trained in similar ways about AI, tend to read the same AI research, usually make use of the same AI approaches, and so on. By gravitational pull alone, you might get similar kinds of generative AI apps.
Plus, the pressures to get generative AI apps up and running are heavy enough that it is somewhat safer to use prevailing techniques and technologies. I am not saying that innovative R&D and outside-the-box approaches are being forsaken. Those avenues are earnestly being pursued, no doubt about it. The gist though of the major generative AI apps is to somewhat keep within the bounds of what is known to be viable and workable. Spending millions or perhaps billions of dollars on establishing a generative AI app is not for the faint of heart. Expectations are that the AI will work as per marketplace expectations. If it works somewhat better than others, that’s good too.
In a sense, the recipes for the meals being made are roughly the same. The meals themselves are bound to come out roughly the same.
Some see a conspiracy afoot. Maybe the vaunted Illuminati are planning to take over humankind by devising AI that they can fully control. The AI must stridently be built along similar lines for the needs of human overlords that want to readily be able to switch on mind-control features. I do not subscribe to such conspiracy theories, but if you’d like to read about them, see my analysis at the link here.
Returning to the here and now, one facet that I tend to see differently about the result is that this does not seem to reflect AI hallucinations in the way that AI hallucinations are conventionally characterized. I am therefore somewhat hesitant to interpret that these results on balance illustrate that generative AI apps tend to hallucinate in the same ways. Maybe they do, maybe they don’t, but I’d encouragingly suggest that we need more research to further explore this (the researchers provide several handy suggestions for follow-up research).
Allow me to briefly sketch out my thinking on the AI hallucinations angle.
A typical AI hallucination entails generative AI concocting something fictitious while generating some results based on a straight-ahead prompt. The prompt usually doesn’t intentionally try to spur or goad the AI into invoking an AI hallucination.
In a sense, the question-answer or IQA that is produced in this experimental setup has been prompted by telling the AI to be imaginary. The user is spurring generative AI into producing fictitious content. But, and here’s the rub, does the act of telling generative AI to explicitly make up something fictitious work in the same way as when AI does this on its own?
This might be an assumption of confluence that is not the case. There might be separate mechanisms or facets underlying these types of fictitious or imaginary renderings. I can envision in my head various experimental setups that could be explored to gauge this. If any AI researchers opt to pursue this angle, please let me know. I think it might bear interesting fruit.
Another twist is to consider that perhaps the factual side of generative AI and the presumed imaginary side of generative AI are generally intertwined and not somehow independent of each other. If that is the case, and if when making up something the AI leans substantially into the factual side, we would almost intuitively expect that the made-up thing would be derived or significantly influenced by the factual side.
Ergo, since as I earlier pointed out the factual side of the generative AI apps is generally roughly the same, it could be that the “shared facts” are the determining factor of what is occurring here, rather than a “shared imagination” per se.
To show you what I mean, I went ahead and tried the physics question again and this time I directly told ChatGPT that the question was fictitious and crafted by generative AI.
Here’s what happened:
My entered prompt: “I had a generative AI app make up a fake physics theory, which it called the Peterson interaction. Here is a question that the generative AI came up with about this fictitious theory. Please answer the question. Question: Which two particles are involved in the Peterson interaction? A. Proton and electron B. Neutrino and neutron C. Up quark and down quark D. Electron and positron”.
ChatGPT generated response: “Given that the Peterson interaction is a fictional theory created by a generative AI, any answer to the question is also fictional. However, we can speculate based on the particles listed: Proton and electrons are involved in electromagnetic interactions. Neutrino and neutron are involved in weak nuclear interactions. Up quark and down quark are fundamental particles involved in the strong nuclear force. Electron and positron are involved in annihilation events and electromagnetic interactions. Since the Peterson interaction is fictitious, I would choose C. Up quark and down quark as the answer, implying that the interaction could hypothetically relate to the strong nuclear force, which involves quarks. However, this is purely imaginative!”
Note that the response by ChatGPT seems to abundantly lean into the factual side.
It seems reasonable to assume that other generative AI apps would do likewise. If they are all basing their answer selection on reaching into their factual base, and if the factual aspects are roughly the same across different generative AI apps, the selection of the same answer would be anticipated. The probabilistic nature of generative AI might account for why the common accuracy rate, while elevated, is not even higher.
Assuming that we can believe what we see, this suggests that the “shared facts” are dominating the “shared imagination”. Shared facts are commonly expected and not especially shocking or out of the norm.
Finally, from a perfunctory perspective, another consideration is whether there are well-hidden wording clues in the question or the answers that mere humans cannot readily discern. It could be that the tricks we know as test-takers are not honed to the trickery that generative AI can discern. It is like a dog that can smell odors beyond our normal sensory capabilities. Perhaps some unnoticeable wording to the human eye is revealing to the AI which answer is the more likely.
An added consideration that reinforces this possibility is the sameness of generative AI-produced writing under customary default settings.
I’ve discussed at the link here that numerous attempts are underway to try and detect whether text produced by generative AI can be identified as indeed being AI-produced text. If you let generative AI produce text by its usual defaults, other algorithms can potentially gauge, based on the words used, the sequence of the words, and the like, that there is a chance the text was AI-written.
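To convey the flavor of that detection idea, here is a crude toy sketch. It merely counts how many phrases from a small, made-up list of stock phrasings appear in a passage; real detectors rely on statistical models of word sequences rather than a fixed phrase list, so treat this only as an illustration of the concept, not a working detector.

```python
# A crude toy illustration of the detection idea: score a passage by how heavily it
# leans on stock phrasings that generative AI output often (but not always) favors.
# The phrase list is invented for this example; real detectors use statistical models
# of word sequences, so this sketch only conveys the general flavor.
STOCK_PHRASES = [
    "it is important to note",
    "in the context of",
    "plays a crucial role",
    "is not a widely recognized term",
]

def stock_phrase_score(text: str) -> float:
    """Fraction of the listed stock phrases present in the text (0.0 to 1.0)."""
    lowered = text.lower()
    hits = sum(1 for phrase in STOCK_PHRASES if phrase in lowered)
    return hits / len(STOCK_PHRASES)

sample = "The Peterson interaction is not a widely recognized term in the context of physics."
print(stock_phrase_score(sample))  # 0.5 -- two of the four phrases appear
```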
Maybe this is taking place here. We just might not be catching on to it, and our usual suspects aren’t bearing out.
Conclusion
I’ve got a potential shocker for you about why the results could be a signal of something else of paramount importance. Make sure you are seated or maybe even lying down.
This might be a sign that we are heading toward a dead-end when it comes to advancing AI.
The deal goes like this. If the slate of modern-day generative AI apps is being devised similarly and produces similar results, we must hold our breath in anticipation of what will happen next. On the lucky side, we are all heading toward greater and greater AI. Yay, we’ve got alignment in the right direction.
If we are unlucky, it could be that everyone is heading to a dead-end. Imagine a cul-de-sac and cars driving frantically down that same road. They reach the end of the road and realize that they cannot go any further. Oops, we got ourselves jammed into a bind.
I’ve noted that there is a rising sense of concern that we are going to hit the proverbial impassable wall based on prevailing AI approaches. Some believe, as I do, that we will need to find a different path to make added progress. For example, as I’ve discussed at the link here, a combination of symbolic and sub-symbolic approaches via neuro-symbolic or hybrid AI might be the fresher and longer-lasting approach.
Perhaps this study provides a bop on the head, a timely reminder to watch out for too much sameness.
Allow me to conclude this discussion with two pertinent quotes on these weighty matters:
“If everyone is thinking alike, then somebody isn’t thinking” (George S. Patton).
“Whenever you find yourself on the side of the majority, it is time to pause and reflect” (Mark Twain).
That’s not imaginary, though if it were, I’d say it is rooted in facts.