Why ChatGPT AI Is Not Reliable
I’ll start with the simple fact: ChatGPT is not a reliable answerer of questions.
To try to explain why from scratch would be a heavy lift, but fortunately, Stephen Wolfram has already done the heavy lifting for us in his article, “What is ChatGPT Doing… and Why Does It Work?” [1] In a PF thread discussing this article, I tried to summarize the key message of Wolfram’s article as briefly as I could. Here is what I said in my post there [2]:
ChatGPT does not make use of the meanings of words at all. All it is doing is generating text word by word based on relative word frequencies in its training data. It is using correlations between words, but that is not the same as correlations in the underlying information that the words represent (much less causation). ChatGPT literally has no idea that the words it strings together represent anything.
In other words, ChatGPT is not designed to answer questions or provide information. It is explicitly designed not to do those things, because, as I said in the quote above, it only works with words in themselves; it does not work with, and does not even have any concept of, the information that the words represent. And that makes it unreliable, by design.
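To make the "relative word frequencies" point concrete, here is a deliberately minimal sketch: a bigram model that predicts the next word purely from adjacency counts in a toy corpus. This illustrates the general idea only; ChatGPT's actual architecture uses a neural network over tokens, not literal frequency tables.

```python
from collections import Counter, defaultdict

# Tiny "training corpus" -- the model only ever sees word adjacencies.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation -- pure word statistics,
    with no notion of what any of the words mean."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" -- chosen by frequency, not meaning
```

Nothing in this sketch represents the distance from New York to Los Angeles or the population of Troy; it only knows which word tends to follow which.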
So, to give some examples of misconceptions that I have encountered: when you ask ChatGPT a question that you might think would be answerable by a Google Search, ChatGPT is not doing that. When you ask ChatGPT a question that you might think would be answerable by looking in a database (as Wolfram Alpha, for example, does when you ask it something like “What is the distance from New York to Los Angeles?”), ChatGPT is not doing that. And so on, for any value of “which you might think would be answerable by…”. The same is true if you substitute “looking for information in its training data” for any of the above: the fact that, for example, there is a huge body of posts on Instagram in ChatGPT’s training data does not mean that if you ask it a question about Instagram posts, it will look at those posts in its training data and analyze them to answer the question. It won’t. While there is, of course, voluminous information in ChatGPT’s training data for a human reader, ChatGPT does not use, or even comprehend, any of that information. All it gets from its training data is relative word frequencies.
So why do ChatGPT responses seem reliable? Why do they seem like they must be coming from a process that “knows” the information involved? Because our cognitive systems are designed to interpret things that way. When we see text that looks syntactically and grammatically correct and seems to be confidently asserting something, we assume that it must have been produced, if not by an actual human, at least by an “AI” that is generating the text based on some kind of actual knowledge. In other words, ChatGPT fools our cognitive systems into attributing qualities to it that it does not have.
This security hole, if you will, in our cognitive systems is not a recent discovery. Human con artists have used many of the same tricks throughout human history. The only difference with human con artists is that they were doing it intentionally, whereas ChatGPT has no intentions at all and is doing it as a side effect of its design. But the result is much the same: let the reader beware.
[1] https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
consciousness
"
Nobody is claiming that ChatGPT is conscious. This is off topic for this thread.
"
As was discussed in the thread I started
"
And further discussion should go in that thread, not this one.
calm down the general public regarding the alarmism we are seeing in media.
"
From the media.
Or from certain elements of the AI community, with vested interest.
An example of what I mean here is the clear fact that ChatGPT is carefully trained never to say "I don't know the answer" (though it can essentially say "that depends", or "I only have information up to 2021"). Instead, it tries to give what information it can, even if it is essentially "blowing smoke." So I asked it why it doesn't say "I don't know", even though it is trained on a dataset where that response must be quite common, and it said:
"You are correct that "I don't know" is a common response to questions that AI models like me might not have enough information to answer accurately. While it is indeed a common response in human communication, I have not been explicitly trained to respond with "I don't know." Instead, my training data and fine-tuning process emphasize providing informative and contextually relevant responses whenever possible."
So clearly we see here an example where ChatGPT is not simply trained to respond based on word frequencies; certain word frequencies are intentionally skewed by the "fine-tuning process" to avoid answers like "I don't know" in favor of trying to provide something useful. If you ask it something that no one knows (like the maximum population of the ancient city of Troy), it will say that it is a matter of debate, and tell you a bunch of other things about it. There are many examples where we can see the intentions of the trainers in how ChatGPT responds, so I don't think it is a stretch at all that it has received specialized attention to its training in various specific areas (such as questions about how it operates). ChatGPT agrees: "Yes, it's highly likely that during the fine-tuning process of AI models like mine, the human reviewers and developers placed special emphasis on prompts that involve explaining how the model functions. This is because providing clear and informative responses about how the AI works is important for user understanding and trust." But of course, we don't really know when it is correct; that problem never goes away.
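The kind of skewing described above can be pictured as a bias applied to the model's output distribution before sampling. The sketch below is hypothetical (the names `base_probs` and `apply_bias` are mine, and OpenAI's actual fine-tuning works through human feedback rather than a hard-coded penalty), but it shows how nudging log-probabilities suppresses a response like "I don't know" without removing it entirely:

```python
import math

# Hypothetical next-response probabilities from an untuned base model.
base_probs = {"I don't know": 0.40, "It depends on...": 0.35, "The answer is...": 0.25}

def apply_bias(probs, bias):
    """Shift log-probabilities by a per-response bias, then renormalize."""
    logits = {r: math.log(p) + bias.get(r, 0.0) for r, p in probs.items()}
    z = sum(math.exp(v) for v in logits.values())
    return {r: math.exp(v) / z for r, v in logits.items()}

# Fine-tuning-like pressure against "I don't know".
tuned = apply_bias(base_probs, {"I don't know": -3.0})
print(tuned["I don't know"])  # much smaller than the original 0.40
```

The "I don't know" response is still possible, just rare, which matches the observed behavior: ChatGPT almost always tries to provide something.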
Would you let ChatGPT diagnose illness and prescribe medication? I mean, human doctors aren't 100% reliable either. What could possibly go wrong?
"Hmmm…the search tree shows that no patients who were prescribed cyanide complained ever again. Therefore that must be the most effective treatment."
"
This was essentially what people thought Watson would be very good at, but it turned out never to be useful for that. The problem with Watson was that the data it was analyzing could not be properly standardized to make it useful. Like you say, humans are better at "filling in the gaps" using our ability to make logical connections when there is sparseness in the evidence. ChatGPT navigates immense sparseness in its language model: it can write a poem that contains short sequences of words, say five or six in a row, that never appeared anywhere in its training data, yet make sense together. But there's a difference between words that make sense together and the right treatment for some ailment, and each patient is different enough from all the rest that Watson never had access to enough standardizable information to be able to do better than a human doctor. So the problem might not have been that Watson didn't understand what it was outputting the way a human does, but rather that it could not understand the various unstandardizable aspects of the input data the way humans do.
But now think of a context where there is not so much unstandardizable information to understand, like a game with simple rules. No doubt this is why machine learning is so vastly successful at beating humans at games with simple rules: the input data is completely standardizable, there is literally nothing there except the rules and the possible positions. Does Stockfish "understand" how to play chess, while it is destroying the greatest human chess players? An interesting question in the context of this thread.
I sincerely believe in the not-so-distant future, we'll have pharmacies and medical institutions where X% of low-grade illnesses will be handled by bots.
"
Before trusting any such bot, I would want to know that it was not based on an internal model like that of ChatGPT, which, as I've said, does not fact check its output.
But bots which do fact check their output are of course possible.
Would you let ChatGPT diagnose illness and prescribe medication? I mean, human doctors aren't 100% reliable either. What could possibly go wrong?
"
I sincerely believe in the not-so-distant future, we'll have pharmacies and medical institutions where X% of low-grade illnesses will be handled by bots.
But we know how it functions conceptually, which is word prediction based on the training data.
"
Yes, so it's all about the body of training data that it builds up. That is what is analogous to anything we could call "knowledge" on which to base its responses.
"
Of course the programmers needed to add some feedback and limitations so it is usable, but it still doesn't understand what it is outputting, so it's still not reliable.
"
That logic does not necessarily follow. Many people do understand what they output, but their understanding is incorrect, so their output is not reliable. There are many causes of unreliability, it's not clear that ChatGPT's cause of unreliability is its lack of understanding of what it is saying. The problem for me is that it is still quite unclear what humans mean when they say they understand a set of words. We can agree that ChatGPT's approach lacks what we perceive as understanding, but we cannot agree on what our own perception means, so contrasts are vague. Sometimes we agree when our meanings are actually rather different, and sometimes we disagree when our meanings are actually rather similar!
"
Can it be 100% reliable with more data? I don't think so.
The only way it can be reliable is with a completely different model in my opinion.
"
It might come down to figuring out the right way to use it, including an understanding (!) of what it is good at and not so good at, and how to interact with it to mitigate its limitations. I agree that it might never work like the Star Trek computer ("computer, calculate the probability that I will survive if I beam down and attack the Klingons") or like the Hitchhiker's Guide to the Galaxy's attempt to find the ultimate answer to life, the universe, and everything.
Just out of curiosity, what can be 100% reliable?
"
Well, nothing of course. But that's not the point.
Would you let ChatGPT diagnose illness and prescribe medication? I mean, human doctors aren't 100% reliable either. What could possibly go wrong?
"Hmmm…the search tree shows that no patients who were prescribed cyanide complained ever again. Therefore that must be the most effective treatment."
you have made assertions about what chatGPT is doing and not doing. I am saying that we don't know what it is doing
"
We don't know the exact internals of its neural net, that's true. But we can still make general statements about what it is doing and not doing. For example, we know that it is not fact checking its output against actual data sources.
Can it be 100% reliable with more data?
"
Just out of curiosity, what can be 100% reliable?
Another relevant point is that none of us know what our own brains are doing when we answer questions put to us. Sometimes it feels like we are also spitting out answers one word at a time, with some general awareness of where we are going. It is not at all obvious to me that once we understand how our brains do it, we will not find there are some pretty simple steps involved, coupled to a vastly complex neural net. That sounds a lot like an LLM. So no, I remain unconvinced that LLMs "don't understand what they are saying" in a more fundamental way than this could be said about ourselves, except for the fact that we have experiences attached to our words in many cases. We also come with a self awareness feature, which some people suspect is really a kind of rationalization mechanism that essentially experiences mental processes after they have already finished happening.
"In summary, while the core of AI model behavior relies on patterns learned from data, the goals, values, and behavior guidelines are established by human designers, developers, and organizations. The training process includes human reviewers who follow these guidelines to rate and evaluate model responses, helping to shape the model's behavior and align it with the intended goals, which include providing useful and informative responses and maintaining a respectful and ethical tone. The combination of data-driven learning and human guidance is central to achieving the desired behavior in AI models."
Then I asked Bard if it agreed with the above, since the above is a general statement about AI language models, and it said:
"
I agree with the statement. AI models are trained on data, but the data is not the only factor that determines their behavior. The goals, values, and behavior guidelines established by human designers, developers, and organizations also play a role.
The training process includes human reviewers who follow these guidelines to rate and evaluate model responses. This helps to shape the model's behavior and align it with the intended goals. For example, if the goal is for the model to provide useful and informative responses, then the reviewers will rate responses that are not useful or informative negatively. This will help the model to learn to avoid giving these types of responses in the future."
Can you tell me more about the "embedding" that is done as part of your training?
Certainly! In the context of natural language processing and machine learning, "embedding" refers to the process of representing words, phrases, or entire documents as numerical vectors (arrays of numbers) in a high-dimensional space. These vector representations capture semantic relationships between words and are crucial for various NLP tasks such as text classification, sentiment analysis, machine translation, and more.
The idea behind word embeddings is to map words from a vocabulary to continuous vector spaces in such a way that words with similar meanings are located close to each other in this space. The key concept is that words that appear in similar contexts in a large corpus of text will have similar vector representations.
The training of embeddings typically involves models like Word2Vec, GloVe, or embeddings as a byproduct of neural language models like GPT (Generative Pre-trained Transformer). Here's a simplified overview of how embedding training works:
These word embeddings can then be used as input features for various NLP tasks or can be fine-tuned for specific tasks using transfer learning.
In the case of GPT-3.5, the model doesn't explicitly generate word embeddings, but it learns contextual embeddings as part of its training process. These embeddings capture rich contextual information about words within sentences and can be used for a wide range of NLP tasks without the need for additional pre-processing or training.
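The "words with similar meanings end up close together" idea from the explanation above can be made concrete with cosine similarity. The vectors below are hand-picked toy values for illustration, not real learned embeddings (which have hundreds or thousands of dimensions and come out of training):

```python
import math

# Toy 3-dimensional "embeddings", hand-picked so that words appearing in
# similar contexts ("cat", "dog") point in similar directions.
emb = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(emb["cat"], emb["dog"]))  # high: similar contexts
print(cosine(emb["cat"], emb["car"]))  # low: dissimilar contexts
```

Real embedding training adjusts those coordinates automatically so that this geometric closeness tracks co-occurrence in the corpus.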
A professor tells me of having a colleague who asked the following question: "I asked three AI's the answer to the question X. Below are their responses. Which, if any, are correct and why?"
Apparently, the students are livid and want the professor's head on a pike. Granted, the summer session is "special", but the hostility is still impressive.
FWIW, I don't see this as an unfair question at all.
"
Yeah, I'm mystified by that. Can you say more about what they were livid about? For me, that professor is using LLMs in exactly the way they should be used: as a springboard to inquiry, not an "answer man." The very fact that their answers differ is also an excellent lesson in understanding what they are good for and what their limitations are. The students are all mixed up somehow, but I wonder how?
I'm trying to understand the answer to Peter's question from your reply. Where did you get this? It sounds like you asked ChatGPT itself. How is that any more reliable than asking any other question of ChatGPT?
"
Part is asking ChatGPT that question, part is from other questions to it, part is from Wolfram's article, and part is just from other knowledge about neural nets and machine learning (though I claim no special expertise there). But I think it's pretty clear that ChatGPT has been specially trained in a lot of ways, and one of them is special training to respond to prompts about ChatGPT. This is my whole point here: there seems to be a claim being made that, because ChatGPT's training process results in a trained dataset of word frequencies that it can use to "predict the next token" as it is creating its response, it somehow did not receive quite highly supervised and specialized training along the path to creating that database of word frequencies. I would say that the database so created is just as much a result of that specialized and supervised training as it is a result of the original data on which it was trained (culled from a proprietary list of sources but still way too sparse to produce a well trained database of word frequencies without substantial language modeling and supervised training, hence the term LLM).
So that's why it's more reliable to ask ChatGPT about what ChatGPT is than asking it some random question, it is a type of question that it is well trained to respond to. Just like it is well trained to respond to the prompt "poem" by delivering a poem, and it is well trained to respond to "you have made a mistake in your last answer" with an obsequious apology.
Apparently, the students are livid and want the professor's head on a pike. Granted, the summer session is "special", but the hostility is still impressive.
FWIW, I don't see this as an unfair question at all.
Then there are the important "transformers", which involve two additional stages, as described. This must be where the frequency of connections between words happens, but it must already encompass some kind of difference between what is in the prompt and what is in the training database. Important there, it seems, are the "nonlinear transformations" that seem to play a role in finding connections between tokens that are important. Again, I don't think one can let the LLM create its own nonlinear transformations; the human trainers must have a role in deciding on their structure, which likely involves some trial and error, I'm guessing. Trial and error must have also uncovered the problem of "vanishing gradients", which I believe can cause the process of predicting the next word to get stuck somewhere, as if an iteration process is used to home in on the predicted probabilities, and that iteration must follow gradients in some kind of cost function to arrive at its result.
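The "vanishing gradients" problem mentioned above can be seen in miniature. Backpropagation multiplies one derivative factor per layer, and a sigmoid's derivative is at most 0.25, so the gradient shrinks exponentially with depth. This is a standard textbook illustration, not anything specific to ChatGPT's training:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_through_layers(depth, x=0.0):
    """Multiply one sigmoid-derivative factor per layer (chain rule)."""
    grad = 1.0
    for _ in range(depth):
        s = sigmoid(x)
        grad *= s * (1.0 - s)  # at most 0.25, attained at x = 0
    return grad

print(gradient_through_layers(2))   # 0.0625
print(gradient_through_layers(20))  # ~9e-13: effectively vanished
```

This is one reason transformer-style architectures use tricks like residual connections and layer normalization: so that useful gradient signal survives through many layers of training.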
That seems to be the guts of it, but even then some additional scaffolding is inserted, more or less manually it sounds like, to detect special elements of the prompt like how long the answer is supposed to be (things that ChatGPT tries to respect but does not do so exactly). All these elements seem to have the fingerprints of the designers on them in many places, so although I'm sure the designers were constantly surprised how significantly the final result was affected by seemingly minor adjustments in the training protocol, nevertheless it seems clear that such adjustments were constantly needed.
Wolfram himself alludes to some of those stages in his article, he just does not go into detail.
"
Yes, but you did. Where are you getting those details from?
"
Probably some of the detail is proprietary anyway.
"
In which case you would not know them. But you must have gotten what you posted from somewhere. Where?
Where are these stages described?
"
Wolfram himself alludes to some of those stages in his article, he just does not go into detail. Probably some of the detail is proprietary anyway. Needless to say there is an extensive process that must occur to go from a bulk of input data to a trained LLM. One stage involves creating the language model itself, which is a huge part of the process that Wolfram does go into some detail about. It is one thing to say that the training creates a body of connections among tokens that can be scanned for frequencies to predict useful connections to the prompt, but it is another to describe the details of how that training process actually occurs. Wolfram mentions that one has to model how the connections work, because the training set will never be dense enough by just using the input dataset for that. There are also a lot of human choices that go into deciding what constitutes a success that should be encouraged.
I think @neobaud must be right that there is a lot that happens "under the hood" that even the human trainers don't understand, which might be a lot like what happens in our brains that we also don't understand. It seems to us there is a "ghost in the machine" of how our brains work, even though a microchemical understanding of our neural system might ultimately involve relatively simple steps (even simple enough to be something like "predicting the next word"). Profoundly complex behavior can emerge from simple systems that are connected in massively complex ways, this has always been a core principle of nonlinear dynamics that I still don't think we've penetrated very far into (which is how we keep surprising ourselves by what happens).
I think the description of what ChatGPT is doing is grossly oversimplified in this article.
"
Wolfram's description is taken from the documentation for the ChatGPT version he reviewed.
The process involves several stages
"
Where are these stages described?
Self-Consuming Generative Models Go MAD
"
Seismic advances in generative AI algorithms for imagery, text, and other data types has led to the temptation to use synthetic data to train next-generation models. Repeating this process creates an autophagous (“self-consuming”) loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis using state-of-the-art generative image models of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and in whether the samples from previous generation models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease. We term this condition Model Autophagy Disorder (MAD), making analogy to mad cow disease.
"
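The autophagous loop the abstract describes can be caricatured in a few lines: if each "generation" samples only from the previous generation's samples, items that are lost can never return, so diversity can only decay. This toy stand-in (an empirical distribution in place of a real generative model) is my own illustration, not the paper's experiment:

```python
import random

random.seed(0)

# Generation 0: "real data" with 100 distinct items.
data = list(range(100))

for generation in range(30):
    # Each new "model" is just the empirical distribution of the previous
    # generation's samples. Sampling from it can never reintroduce items
    # that were lost, so the number of distinct items never increases.
    data = [random.choice(data) for _ in range(100)]

print(len(set(data)))  # fewer distinct items than the original 100
```

Without injecting "fresh real data" each round, this toy loop drifts toward a handful of over-represented items, which is the flavor of the quality/diversity collapse the paper formalizes.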
Certainly! I'd be happy to explain the steps I take from receiving a prompt to generating an output. The process involves several stages, each contributing to the final response:
Throughout these steps, the model aims to generate coherent, contextually relevant text based on the information provided in the prompt and the patterns it learned during training. Keep in mind that while GPT models are advanced, they don't have true understanding or consciousness; they generate responses based on statistical patterns in the data they've seen.
That's a bit nitpicky, in my opinion.
"
Look at the title of the thread and the article it references. It is not about "LLMs" in general. It is specifically about ChatGPT. Again, if you want to discuss some other LLM like Bard, or LLMs in general, please start a separate thread.
It is not ChatGPT
"
And that means the article that this thread is about is not about it, so it is off topic here. If you want to discuss Bard, please start a separate thread.
Are you taking issue with the word interpret?
"
To the extent that it implies matching up the question with some kind of semantic model of the world, yes. ChatGPT does not do that; it has no such semantic model of the world. All it has is relative word frequencies from its training data.
And I think those choices are not that complex (not as much as you are suggesting) and the fascinating thing about LLMs is an emergent behaviour out of "simple" coding rules.
"
I agree that it is remarkable how complex behaviors emerge from seemingly simple coding rules; nevertheless, it is the manipulation of those coding rules that has a large impact, not just the training database choices. The human trainers decide the coding rules, and the training database, with some purpose in mind, and that purpose leaves its mark on the outcome in interesting ways. One very important difference is the language model that is used, something that Wolfram emphasizes is absolutely key. The article you referenced mentioned that when Bard switched from LaMDA to PaLM 2, it got much better at writing code. So someone has to develop the language model that makes the training possible, and that is one place where human intelligence enters the question (until they create LLMs that can create or at least iterate language models). This discussion started when we talked about how LLMs handle prompts that are in the form of corrections (an important way to interact with LLMs if you want better answers), and it seems likely to me that the language model will have a profound effect on how prompts are interpreted, but I don't really know. I do think it's pretty clear that ChatGPT is intentionally programmed, by humans, to treat corrective prompts more obsequiously than Bard is.
"
Why can't it just as easily be a different number of parameters, a different cost function, or different training data (which can be vastly different)? Of course they are also different language models, the details of which are proprietary. Not a good reference, but still:
https://tech.co/news/google-bard-vs-chatgpt
"
It could be a combination of all those things, but my interest is in the human choices, such as the language model used, and the way the training is supervised to achieve particular goals. (For example, how will monetizability affect future training of LLMs?)
"
The mathematics can only be correct if the training data is correct or there is an additional math algorithm implemented into ChatGPT (which by now there could possibly be). But I wouldn't trust it.
"
What's odd is that sometimes the LLMs invoke Python code to explain how they carry out mathematical calculations, but then they don't actually run the code, because the outcome they report is incorrect! I think you can't ask an LLM how it gets to its answer, because it just predicts words rather than actually tracking its own path. It seems not to invoke any concept of itself or any sort of internal space where it knows things, even though it invokes empty semantics like "I understand" and "I think." So it responds a bit like a salesperson: if you ask it for the reasons behind its statements, it just gives you something that sounds good, but it's not the reason.
I still haven't found anything to show me that ChatGPT or Bard don't predict the next words based on the cost function.
"
Yes, LLMs predict the next words based on their training, and that training involves a cost function. But it involves much more than that (why else would Wolfram describe the whole escapade as an "art"), and it is that "much more" that is of interest. It involves a whole series of very interesting and complex choices by the humans who designed the training architecture, and that is what we are trying to understand. One particular example of this is, it is clear that ChatGPT and Bard handle ambiguous prompts quite differently, and corrective prompts quite differently also. So the question is, why is this? I suspect it reflects different choices in the training architecture, because it doesn't seem to be due to any differences in the database they are trained on. There must have been times when the human trainers decided they were, or were not, getting the behavior they desired, and made various adjustments in response to that, but what those adjustments are might fall under the heading of proprietary details, that's the part I'm not sure about. Certainly Wolfram alludes to a kind of tradeoff between accuracy in following a chain of logic, versus flexibility in terms of being able to handle a wide array of linguistic challenges, which is why ChatGPT is somewhat able to do both mathematical calculations and poetry writing, for example, but is not great at either.
Wolfram says that LLMs are like that, because there are way too many possible combinations of words to look back far at all, when trying to predict the next word. It just wouldn't work at all, unless they had a very good ability to model language, thereby vastly reducing the space of potential words they needed to include in their prediction process. I think the most substantial point that Wolfram made is that there is a kind of tradeoff between what seemed to me like accuracy (which is a bit like completeness) versus span (which is a bit like ability to reduce the search space to increase its reach). He said that irreducible computations are very reliable, but very slow because they have to do all the necessary computations (and these are normally what computers are very good at but it would never work for an LLM or a chess program), whereas modeling ability is only as reliable as the model (hence the accuracy problems of ChatGPT) but is way faster and way more able to make predictions that cross a larger span of text (which is of course essential for maintaining any kind of coherent train of thought when doing language).
So I believe this is very much the kind of "art of training" that Wolfram talks about, how to navigate that tradeoff. The surprise is that it is possible at all, albeit barely it seems, in the sense that the LLM can be trained to have just enough word range to maintain a coherent argument in response to a prompt that was many words in the past, yet still have some reasonably useful level of accuracy. That the accuracy level cannot be higher, however, would seem to be the reason that the training architecture is set up to accommodate an expectation that the user is going to be making followup corrections, or at least further guidance, in a series of prompts. The language modeling capacity must include many other bells and whistles, such that it can give a prominent place in whatever cost function it used in its training for key words in the prompt (like you can tell ChatGPT exactly how many lines to put in the poem, and it will try pretty hard to do it (though it won't succeed, remember it struggles with completeness), even if the prompt contains way more words than the program is capable of correlating the answer with).
I'm trying to understand the special relationship between the prompts and the predictions
"
This is the more interesting problem actually.
Let's take a step back. Chat GPT essentially calculates the probability P(x) that the next word is x, given the last N words were what they are. That's it. The rest are implementation details.
The easiest way to "seed" this on a question is to rewrite the question as a statement and to use that as the first N words of the answer. (Possibly keeping them, possibly dropping them.) I have no idea if this particular piece of code does it this way or some other way – it's just the easiest, and it's been around for many, many decades.
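That seeding idea can be sketched with a toy next-word model: rewrite "What is the capital of France?" as the statement "the capital of france is" and let frequency-based prediction continue it. (Hypothetical mini-corpus, and `continue_from` is my own name; a real LLM conditions on the prompt through its network rather than by literal seeding.)

```python
from collections import Counter, defaultdict

corpus = ("the capital of france is paris . "
          "the capital of italy is rome .").split()

# P(next word | previous word), estimated by counting.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def continue_from(seed, steps=2):
    """Seed the model with a statement and let it predict continuations."""
    words = seed.split()
    for _ in range(steps):
        follow = counts[words[-1]]
        if not follow:
            break
        words.append(follow.most_common(1)[0][0])
    return " ".join(words)

# "What is the capital of France?" rewritten as a statement, then extended:
print(continue_from("the capital of france is"))  # the capital of france is paris .
```

Note that with only one word of context the model cannot tell France from Italy apart at "is"; it just picks the more (or equally) frequent continuation, which is exactly the kind of limitation a larger context window and a real language model are there to fix.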
"Could you please clarify whether you're looking for an essay about poetry or a poem itself?" It then asked me for further details about what I was looking for. The exact same prompt to Bard gave me, "An Essay in Verse. This sonnet is an essay…" and it wrote a poem about poetry and why it is kind of like an essay. So that is a completely different strategy about how to interpret the prompt, and it must be a function of the training architecture, probably quite a conscious decision by the trainers (because they would have had lots of experience with how their LLM reacts to prompts, and would have tinkered with it to get what they were looking for).
Then I did "poem essay" in a new session to both, and again ChatGPT asked for clarification about what I wanted (imagine the training architecture needed to achieve that result), whereas Bard wrote an essay about poetry! So Bard grants significance to the order of the words in the prompt, whereas ChatGPT does not, so ChatGPT requires further clarification to resolve the ambiguity.
What is also interesting is that when I told ChatGPT that when I give a prompt of the form "X essay", I want an essay on the topic of X, it then gave me an essay on poetry (since I had already prompted it with "poem essay"). But then when I said "essay poem" to see if it would understand that this is of the form "Y poem", it gave me yet another essay about poetry. So it "understood" that "X" in my previous prompt meant "any topic", but it did not understand that I was saying the order mattered. When I prompted it with "essay and also a poem", it gave me first an essay and then a poem, so it understood that "also" in a prompt means "break the prompt into two parts and satisfy both." Wolfram talked about the crucial importance of "modeling language", and I think this is a good example: the LLM must make a model of its prompt that includes ideas like "also" means "break the prompt into two separate prompts." The modeling aspect is human-supplied; it is not an automatic aspect of the training process.
Well, not exactly random; it has to follow the rules of the cost function. I still don't see anything strange about that. I can easily see that most poems are about nature or love or something else. I can also see that the word "poem" is associated with the word "poetry" a lot of the time. That's why it's not surprising that such poems are written by an LLM.
"
I'm trying to understand the special relationship between the prompts and the predictions of text continuations, which must be established not by the dataset on which the LLM is trained, but rather by the architecture of the training itself. This is where we will find the concept of a "corrective" prompt, which clearly has a special status in how ChatGPT operates (in the sense that ChatGPT will respond quite differently to a prompt that corrects an error it made than to a completely new prompt. It seems to expect that you will correct it; the trainers must have expected that it would make mistakes and need corrective prompts, and the training architecture is clearly set up to accommodate that, in a different way for ChatGPT than for Bard.)
"
Orchestrated by whom?
"
The people that set up the training environment, in particular the way prompts are associated with text predictions. It would be easy to train a new LLM based on a vast dataset of users interacting with previous LLMs, because that data would already have the structure of prompt/response, so you could train your new LLM to predict how the other LLMs reacted to their prompts. But if you are using the internet as your database, not past LLM sessions, you have to set up some new way to connect prompts to responses, and then train it to that.
For example, consider the single prompt "poem." You could imagine training an LLM to write essays about poems when prompted like that, or you could train an LLM to actually write a poem in response to that prompt. It seems to me this must be a choice of the training environment; it cannot just work out that if you use the internet as a database, you always end up with LLMs that write poems to this prompt. That must be a trainer choice, orchestrated by the LLM creators, to create certain expectations about what the purpose of the LLM is. That must also relate to how the LLM will react to corrective prompts, and how obsequious it will be. Again, I have found ChatGPT to be far more obsequious about corrections than Bard, even though they are similar LLMs with similar goals using similar source material. This has to be the fingerprints of the trainers, and that's where the "art" comes in.
Hmm, I prompted ChatGPT with the word "poem" several times, and every time it generated a random poem.
"
I overstated when I said that the poems are "about poetry", but in a test where I started eight sessions and said "poem", five of the poems self-referenced the act of writing a poem in some way. That is very unusual for poems to do, so we know they are not just cobbled together from poetry in some random way. (The poems also generally are about nature, and the term "canvas" appears somewhere in almost all eight, surprisingly, so for some strange reason the training has zeroed in on a few somewhat specific themes where poetry is involved.) But the larger issue is that ChatGPT gives some kind of special significance to the prompt: it is trained in some way to treat the prompt as special, and it was a bit of a slog to figure out from Wolfram's description just how that special status is enforced in the training process; apparently a crucial element is that it is trained to "model" language in a way that involves responses to prompts. ChatGPT also wouldn't explain it when I asked it. All I can say is that it appears to be a very specific type of language that it is modeling, in effect a way of predicting the next word that in some way reacts to a prompt, rather than just predicting the next word in a random body of text. (You could imagine training an LLM to do the latter, but you would not get ChatGPT that way; both Wolfram and ChatGPT itself refer to other aspects of the training process and the way the language model works, but the specifics are far from clear to me.)
"
It is just scrambling text so that natural-language rules are upheld and so that it rhymes (so that it actually looks like a poem, of which there must presumably be millions in the training data). Why would the trainers need to add anything?
"
They need to add the concept of a prompt, and how to alter the training in response to that.
"
Well sure, that is how LLMs are constructed. You could construct one without a prompt, and it would just write something random in natural language at some random time. Not really useful.
"
Yes, exactly. So we should not say the LLMs are just predicting words that come next; they are doing it in a rather specific way that gives special status to the prompt. They also appear to give special status to a prompt that they are trained to interpret as a correction. This seems to be a difference between ChatGPT and Bard, for example, because in my experience ChatGPT is trained to respond to correction in a much more obsequious way than Bard is. (For example, if you try to correct both into saying that one plus one is three, ChatGPT will say you must be using a different mathematical system, or perhaps are even making a joke (!), while Bard is far less forgiving and said "this is a nonsensical question because 1+1 cannot equal 3. If 1+1=3, then the entire concept of mathematics breaks down", which is certainly not true, because I can easily imagine a mathematical system that always upticks the answer of any binary integer arithmetic operation, and mathematics in that system does not break down. Thus Bard not only fails to be obsequious, it fails to be correct in its non-obsequiousness!)
"
Well sure, it has to be trained, but who said otherwise? It is trained on massive data, not by people. They just review the responses and give feedback so ChatGPT can optimize itself (as you can also do). At the end of the day, it's just predicting which word comes next.
"
The people do the training because they decide how the training will work. So it's not just predicting what word comes next, although it is mostly that. It is predicting what word comes next in a very carefully orchestrated environment, and Wolfram makes it clear that the people don't completely understand why certain such environments work better than others; he describes it as an "art", and there's nothing automatic in performing an art form.
"The process of training me involves a cost function, which is used to fine-tune my responses. During training, I'm presented with a wide variety of text and corresponding prompts, and I learn to predict the next word or phrase based on those inputs. The cost function helps adjust the model's internal parameters to minimize the difference between its predictions and the actual training data."
I think the important element of this is that it is trained on prompts and text, not just text. So it does not just predict the next word in a body of text and then see how well it did; it tries to predict responses to prompts. But a prompt is not a normal aspect of either human communication or bodies of text (if I came up to you and said "poem", you would not think you were being prompted; you would have no idea what I wanted and would probably ask me what I was talking about, but ChatGPT does not ask us what we are talking about, because it is trained to treat everything as a prompt). So its training must look at bodies of text as if they were in response to something, and look for some kind of correlation with something earlier that is then considered a prompt.
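One plausible mechanism for "trained on prompts and text, not just text" (this is an assumption about the general recipe, not a claim about ChatGPT's actual code): in common instruction-tuning setups, the cross-entropy loss is computed only over the response tokens, with the prompt positions masked out, so the model is rewarded for continuing prompts rather than reproducing them. A toy numerical sketch, with made-up tokens and probabilities:

```python
import math

# Hypothetical tokenized training example: prompt tokens, then response
# tokens. The probabilities are made-up stand-ins for what a model would
# assign to each actual next token during training.
tokens         = ["poem", "<sep>", "Roses", "are", "red"]
is_response    = [False,  False,   True,    True,  True]
predicted_prob = [0.5,    0.9,     0.2,     0.6,   0.7]

# Cross-entropy loss computed only on response positions: the model is
# scored on how it continues the prompt, not on reproducing the prompt.
loss, count = 0.0, 0
for p, resp in zip(predicted_prob, is_response):
    if resp:                      # prompt positions are masked out
        loss += -math.log(p)
        count += 1
loss /= count
print(f"masked cross-entropy over {count} response tokens: {loss:.4f}")
```

Under this kind of masking, the "special status" of the prompt is simply that it conditions the prediction but never contributes to the loss.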
It says: "The prompt/response framework is a way to communicate how the model learns context and generates responses, but it doesn't necessarily represent the actual structure of the training data."
Where are these layers in the Wolfram article's description? (I am referring to the Wolfram article that is referenced in the Insights article at the start of this thread.)
"
I presume it is the "loss function" used for the training; this must be the crucial place where the trainers impose their intentions on the output. I keep coming back to the fact that you can prompt ChatGPT with the single word "poem" and it will write a poem about poetry. Surely this is not what you get from simply taking a body of written work and trying to predict the words that come after "poem" in that body of data, because very few poems are about poetry. There must be some place where the trainers have decided what they expect the user to want from ChatGPT, and have trained it to produce a poem when the prompt is the word "poem", and to make the poem be about something mentioned in the prompt. That goes beyond simply predicting what comes after the word "poem" in some vast body of text. The LLM would have to be trained to predict words that follow other words and that also qualify as satisfying the prompt in some way.
So, getting back to the issue of a prompt as a correction: the trainers expect that people will want to iterate with ChatGPT to improve its output in a session, so they will give prompts that should be interpreted as corrections. That is why ChatGPT is trained to be obsequious about making corrections; it has to be built into the way it is trained, and it is not just a simple algorithm for predicting text that follows the text in the prompt (again, humans are rarely so obsequious).
ChatGPT has additional layers on top of that to "push" it to favor some kinds of response more than others.
"
Where are these layers in the Wolfram article's description? (I am referring to the Wolfram article that is referenced in the Insights article at the start of this thread.)
it therefore doesn't know if the input is wrong, and will simply riff on the wrong input. This gives the appearance to the user that they are winning a debate or providing useful clarification on which to get a better answer, when in reality they may just be steering it towards providing or expanding on a wrong answer.
"
Yes, agreed.
It has a database of relative word frequencies. That's the only database it has. (At least, that's the case for the version that was reviewed in the articles being discussed here. Later versions might have changed some things.)
"
Yes, good point: it starts with a training dataset and then generates its own, such that it only needs to query its own dataset to generate its responses. But the manner in which that dataset is trained is a key aspect; that's where the intelligence of the trainers leaves its mark. They must have built in certain expectations, which end up looking like a difference between a prompt like "poem", which will generate a poem about poetry, and a follow-up prompt like "shorten that poem". It can even count the words it uses in its answer and prohibit certain types of answers, so it has some extra scaffolding in its training.
No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.
"
I understand, but I think you might be reading past my point. As you say, it "accepts any input" regardless of what the input is. What I'm pointing out is that it therefore doesn't know if the input is wrong, and will simply riff on the wrong input. This gives the appearance to the user that they are winning a debate or providing useful clarification on which to get a better answer, when in reality they may just be steering it towards providing or expanding on a wrong answer.
it has a database to inform those frequencies
"
It has a database of relative word frequencies. That's the only database it has. (At least, that's the case for the version that was reviewed in the articles being discussed here. Later versions might have changed some things.)
It doesn't search any database; that was one of the main points of the Insights article. All it does is generate text based on relative word frequencies, using the prompt given as input as its starting point.
"
Yes, but it has a database to inform those frequencies. That database must be "primed" in some way to establish what ChatGPT is and how it should relate to prompts. For example, if you prompt it with "describe yourself", it will say "I am ChatGPT, a creation of OpenAI. I'm a language model powered by the GPT-3.5 architecture, designed to understand and generate human-like text based on the input I receive. I have been trained on a diverse range of text sources up until September 2021, so I can provide information, answer questions, assist with writing, generate creative content, and more. However, it's important to note that I don't possess consciousness, emotions, or personal experiences. My responses are based on patterns in the data I've been trained on, and I aim to be a helpful and informative tool for various tasks and conversations." So that is a highly specialized set of data to look for word associations with "describe yourself"; it has been trained to favor certain word frequencies in response to certain prompts.
Also, if you correct it, it will invariably apologize obsequiously. So it is in some sense "programmed to accept corrections," in the sense that it uses a word association database that expects to be corrected and is trained to respond to that in certain ways.
It would seem that its training also expects it to provide certain types of responses. For example, if you just give it the one-word prompt "poem", it will write a poem. Also, since you did not specify a subject, it will write a poem about poetry! I think that was a conscious decision by its programmers; there are built-in expectations about what a prompt is trying to accomplish, including corrections. It could be said that ChatGPT inherits some elements of the intelligence of its trainers.
it will typically search its database
"
It doesn't search any database; that was one of the main points of the Insights article. All it does is generate text based on relative word frequencies, using the prompt given as input as its starting point.
it gives the prompt some kind of special status, so when you correct something it said, it tends to become quite obsequious
"
I don't think that's anything explicitly designed, at least not in the version of ChatGPT that was reviewed in the Insights article and the Wolfram article it references. It is just a side effect of its algorithm.
No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.
"
Yet it gives the prompt some kind of special status, so when you correct something it said, it tends to become quite obsequious. It is not normal in human language to be that obsequious; I don't think it could get that from simply predicting the next word. When it has made a mistake and you tell it so, it will typically search its database a little differently, placing some kind of emphasis on your correction. If you tell it that one plus one equals three, however, it has enough contrary associations in its database that it sticks to its guns, but it will still not tell you that you have made a mistake (which surely would be the norm in its database of reactions); it will suggest you might be joking or using an alternative type of mathematics. The status it gives to the prompt must be an important element of how it works, and it is constrained to be polite in the extreme, which amounts to "riffing on" your prompt.
it is designed to* accept criticism/clarification/correction
"
No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.
…. If it misunderstands you, you can even clarify your meaning in a way that is really only possible with another human.
"
"
This seems to be vastly overstating ChatGPT's capabilities. You can "clarify" by giving it another input, but it won't have any semantic meaning to ChatGPT any more than your original input did.
"
My understanding is that it is designed to* accept criticism/clarification/correction. That makes such follow-up mostly useless, since it will simply be riffing on what you tell it, regardless of whether it is accurate or not. In other words, you'll always win a debate with it, even if you're wrong.
*Whether actual design or "emergent" behavior I don't know, but I don't think it matters.
More than I meant to reply… heh. Basically, LLMs have strengths and weaknesses, but none are going to stun an intellectual community in any area that might be relevant.
"
My question is, how much of this is due to the fact that these are just early-generation attempts, versus how much is fundamental to the way LLMs must work? If we fix up their ability to recognize logical contradictions, and enhance their ability to do mathematical logic, will we get to a point where it is very hard to distinguish their capabilities from the capabilities of the physics teachers who pose the questions in the first place? And if we did, what would that mean for our current ideas about what conceptual understanding is, since physics seems like a place where conceptual understanding plays a crucial role in achieving expertise? These kinds of AI-related questions always remind me of B. F. Skinner's great point: "The real question is not whether machines think but whether men do. The mystery which surrounds a thinking machine already surrounds a thinking man." (Or woman.)
Well, LLMs do not have computational algorithms (yet); they deal with text pattern recognition, so I don't know why it's so surprising that they cannot do calculations.
"
It's because I have seen them report Python code they used to do a calculation, and the Python code does not yield the quantitative result they report. So that's pretty odd: they seem to be able to associate their prompts with actual Python code that is correct, and still get the answer wrong.
"
Here is a proposal for a math extension for LLMs: https://aclanthology.org/2023.acl-industry.4.pdf
"
Yes, this is the kind of thing that is needed, and is what I'm expecting will be in place in a few years, so it seems likely that ten years from now, LLMs will be able to answer physics questions fairly well, as long as they only require associating the question with a formula without conceptual analysis first. It will then be interesting to see how much LLMs have to teach us about what we do and do not comprehend about our own physics, and what physics understanding actually is. This might be pedagogically significant for our students, or something much deeper.
"
Anyway, testing ChatGPT (or Bard) a little more, I find it useful for initial code generation, but to get a properly functioning script I found myself going to StackOverflow 70% of the time. The explanations and examples are all already there; with LLMs you have to ask a lot of questions (which means typing and waiting) and still don't get the right answers some of the time. And mind you, this is not complex code (for that, I never use LLMs), just some small scripts for everyday use.
"
Then the question is, why do you not use LLMs for complex code, and will that still be true in ten years? That might be the coding equivalent of using LLMs to solve physics questions, say on a graduate level final exam.
I see. I was wondering. From what I read, it seemed the only LLM being discussed was ChatGPT, and I now understand why. I do not understand exactly why ChatGPT gets so much flak, though, even still. At the end of the day, it's merely a tool.
"
The issue, on this forum at least, is not actually the tool so much as people's misunderstanding about how it works and the reasons for its behavior. People give it too much credit for "intelligence" and so forth.
Ten years from now? Well, of course it will have improved enough to give excellent results on the kinds of exams that are currently given. Does that point to a problem in the exams, if they can be answered correctly by a language manipulation model that does not have any comprehension of its source material? Possibly, yes; it may mean that we are not asking the right kinds of questions of our students if we want them to be critical thinkers and not semantic parrots.
Yes, and no doubt how these models work will continue to evolve. The reason for specifically considering ChatGPT in the Insights article under discussion is that that specific one (because of its wide public accessibility) has been the subject of quite a few PF threads, and in a number of those threads it became apparent that there are common misconceptions about how ChatGPT works, which the Insights article was intended to help correct.
Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions
Over the last decade, Q&A platforms have played a crucial role in how programmers seek help online. The emergence of ChatGPT, however, is causing a shift in this pattern. Despite ChatGPT's popularity, there hasn't been a thorough investigation into the quality and usability of its responses to software engineering queries. To address this gap, we undertook a comprehensive analysis of ChatGPT's replies to 517 questions from Stack Overflow (SO). We assessed the correctness, consistency, comprehensiveness, and conciseness of these responses. Additionally, we conducted an extensive linguistic analysis and a user study to gain insights into the linguistic and human aspects of ChatGPT's answers. Our examination revealed that 52% of ChatGPT's answers contain inaccuracies and 77% are verbose. Nevertheless, users still prefer ChatGPT's responses 39.34% of the time due to their comprehensiveness and articulate language style. These findings underscore the need for meticulous error correction in ChatGPT while also raising awareness among users about the potential risks associated with seemingly accurate answers.
Users get tricked by appearance.
"
Our user study results show that users prefer ChatGPT answers 34.82% of the time. However, 77.27% of these preferences are incorrect answers. We believe this observation is worth investigating. During our study, we observed that only when the error in the ChatGPT answer is obvious, users can identify the error. However, when the error is not readily verifiable or requires external IDE or documentation, users often fail to identify the incorrectness or underestimate the degree of error in the answer. Surprisingly, even when the answer has an obvious error, 2 out of 12 participants still marked them as correct and preferred that answer. From semi-structured interviews, it is apparent that polite language, articulated and text-book style answers, comprehensiveness, and affiliation in answers make completely wrong answers seem correct. We argue that these seemingly correct-looking answers are the most fatal. They can easily trick users into thinking that they are correct, especially when they lack the expertise or means to readily verify the correctness. It is even more dangerous when a human is not involved in the generation process and generated results are automatically used elsewhere by another AI. The chain of errors will propagate and have devastating effects in these situations. With the large percentage of incorrect answers ChatGPT generates, this situation is alarming. Hence it is crucial to communicate the level of correctness to users.
"
So I told it that it was wrong but the rest was right, so just do the multiplication again. This time it got a different answer, but still completely wrong. So I asked it to describe how it carried out the multiplication, step by step, and it couldn't do it. So I told it to carry out the multiplication of the first two numbers, report the answer, then multiply by the third one, and so on, and it then did get the answer correct. Then I used Bard, and it got the answer wrong also, though it could report the Python code it used; but the Python code did not give the answer Bard gave! So those AIs cannot seem to track how they arrive at their own answers, and I think that may be closely related to why their numerical results are so horrendously unreliable. At some point along the way, they seem to do something that in human terms would be called "guessing", but they cannot distinguish that from any other type of analysis they do, including following Python code.
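For contrast, here is what the step-by-step procedure that finally worked looks like when executed deterministically. The thread does not give the actual numbers from the session, so the factors below are hypothetical:

```python
import math

# Hypothetical factors; the actual numbers from the session are not given.
factors = [37, 41, 53]

# Step-by-step running product, mirroring the prompt that finally worked:
# multiply the first two numbers, report the result, then multiply by the
# next, and so on.
product = 1
for f in factors:
    product *= f
    print(f"running product after multiplying by {f}: {product}")

# A deterministic check of the kind an LLM's token-by-token "guess" lacks:
assert product == math.prod(factors)
print(f"final answer: {product}")
```

The point of the contrast: an interpreter tracks every intermediate state of the calculation, whereas an LLM emitting digits as tokens has no such running state to check itself against.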
Its value is that it can interpret your question
"
But it doesn't "interpret" at all. It has no semantics.
"
you can even clarify your meaning in a way that is really only possible with another human
"
This seems to be vastly overstating ChatGPT's capabilities. You can "clarify" by giving it another input, but it won't have any semantic meaning to ChatGPT any more than your original input did.
There are many differences between how biological nervous systems work, and how the computers that execute the algorithms behind large language models work.
"
Yes, and a major difference, at least for the version of ChatGPT that this article discusses, is that biological nervous systems have two-way real-time connections with the outside world, whereas ChatGPT does not–it has a one-way, "frozen" connection from its training data, and that connection is much more limited than the connections biological systems have since it only consists of relative word frequencies in text.
Is there any mathematical logic in it at all?
"
Not in the version of ChatGPT that my Insights article, and the Wolfram article it references, are based on. Future versions might add something along these lines.
Can one break down how a blind statistical process can do that?
"
Is there any mathematical logic in it at all? Obviously search engines and chatbots use word association, so "big" = "mass", and from that you get a list of masses. Can it then do =? Seems messy, but I'd think that by statistical analysis/word frequency it would associate a 5-digit number with "bigger" than a 4-digit number even without doing the math.
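To picture the kind of purely lexical shortcut being suggested here, consider a comparison by digit count that involves no arithmetic at all (the function and its name are mine, purely for illustration, not anything an LLM literally contains):

```python
def looks_bigger(a: str, b: str) -> bool:
    """Compare two positive-integer strings by digit count alone.

    This is a shallow, word-shape association rather than arithmetic:
    it happens to be right when the digit counts differ, and it is
    blind whenever they match.
    """
    return len(a) > len(b)

print(looks_bigger("12345", "9999"))   # 5 digits vs 4: says "bigger"
print(looks_bigger("1000", "9999"))    # same length: the heuristic is blind
```

This is roughly why frequency-based association can get "a 5-digit number is bigger than a 4-digit number" right while still failing on comparisons that actually require doing the math.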
For what it's worth, translating your question into searchenginese doesn't give a one-word answer but does give a table of masses in the 5th hit.
I think this is a good exercise, because people so vastly over-estimate the capabilities of chat-bots. It's amazing how convincing you can make it with a large enough data set without using actual intelligence.