Why ChatGPT Is Not Reliable
I’ll start with the simple fact: ChatGPT is not a reliable answerer of questions.
To try to explain why from scratch would be a heavy lift, but fortunately, Stephen Wolfram has already done the heavy lifting for us in his article, “What is ChatGPT Doing… and Why Does It Work?” [1] In a PF thread discussing this article, I tried to summarize as briefly as I could the key message of Wolfram’s article. Here is what I said in my post there [2]:
ChatGPT does not make use of the meanings of words at all. All it is doing is generating text word by word based on relative word frequencies in its training data. It is using correlations between words, but that is not the same as correlations in the underlying information that the words represent (much less causation). ChatGPT literally has no idea that the words it strings together represent anything.
In other words, ChatGPT is not designed to actually answer questions or provide information. In fact, it is explicitly designed not to do those things, because, as I said in the quote above, it only works with words in themselves; it does not work with, and does not even have any concept of, the information that the words represent. And that makes it unreliable, by design.
So, to give some examples of misconceptions that I have encountered: when you ask ChatGPT a question that you might think would be answerable by a Google search, ChatGPT is not doing that. When you ask ChatGPT a question that you might think would be answerable by looking in a database (as Wolfram Alpha, for example, does when you ask it something like “what is the distance from New York to Los Angeles?”), ChatGPT is not doing that. And so on, for any value of “that you might think would be answerable by…”. And the same is true if you substitute “looking for information in its training data” for any of the above: the fact that, for example, there is a huge body of Instagram posts in ChatGPT’s training data does not mean that, if you ask it a question about Instagram posts, it will look at those posts in its training data and analyze them in order to answer the question. It won’t. While there is, of course, voluminous information in ChatGPT’s training data for a human reader, ChatGPT does not use, or even comprehend, any of that information. Literally all it gets from its training data is relative word frequencies.
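To make the "relative word frequencies" point concrete, here is a toy bigram sketch. It is a drastic simplification (the real model is a large transformer over subword tokens, and the corpus here is invented), but it shows the essential mechanism: generating text word by word from word-follows-word statistics, with no access to what any word means.

```python
from collections import Counter, defaultdict

# Toy illustration only -- not the actual GPT architecture.
# A bigram model generates text purely from "which word tends
# to follow which word" statistics in its training text.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(word):
    """Pick the most frequent follower -- pure word statistics, no meaning."""
    candidates = follows[word]
    return candidates.most_common(1)[0][0] if candidates else None

# Generate a short continuation from a seed word.
text = ["the"]
for _ in range(4):
    nxt = next_word(text[-1])
    if nxt is None:
        break
    text.append(nxt)
print(" ".join(text))  # prints: the cat sat on the
```

The model "knows" nothing about cats or mats; it only knows that "cat" followed "the" more often than anything else did. Scaled up enormously, that is still the character of what the training data provides.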
So why do ChatGPT responses seem like they are reliable? Why do they seem like they must be coming from a process that “knows” the information involved? Because our cognitive systems are designed to interpret things that way. When we see text that looks syntactically, grammatically correct and seems like it is confidently asserting something, we assume that it must have been produced, if not by an actual human, at least by an “AI” that is generating the text based on some kind of actual knowledge. In other words, ChatGPT fools our cognitive systems into attributing qualities to it that it does not actually have.
This security hole, if you will, in our cognitive systems is not a recent discovery. Human con artists have made use of much the same tricks throughout human history. The only difference with the human con artists is that they were doing it intentionally, whereas ChatGPT has no intentions at all and is doing it as a side effect of its design. But the end result is much the same: let the reader beware.
[1] https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
"The process of training me involves a cost function, which is used to fine-tune my responses. During training, I'm presented with a wide variety of text and corresponding prompts, and I learn to predict the next word or phrase based on those inputs. The cost function helps adjust the model's internal parameters to minimize the difference between its predictions and the actual training data."
I think the important element of this is that it is trained on prompts and text, not just text. So it does not just predict the next word in a body of text and then see how well it did; it tries to predict responses to prompts. But a prompt is not a normal aspect of either human communication or bodies of text (if I came up to you and said "poem", you would not think you were being prompted; you would have no idea what I wanted and would probably ask me what I was talking about. ChatGPT, however, does not ask us what we are talking about; it is trained to treat everything as a prompt). So its training must look at bodies of text as if they were responses to something, looking for some kind of correlation with something earlier that is then treated as a prompt.
It says: "The prompt/response framework is a way to communicate how the model learns context and generates responses, but it doesn't necessarily represent the actual structure of the training data."
I presume the "loss function" used for the training must be the crucial place where the trainers impose their intentions onto the output. I keep coming back to the fact that you can prompt ChatGPT with the single word "poem" and it will write a poem about poetry. Surely this is not what you would get from simply taking a body of written work and trying to predict the words that come after "poem" in that body of data, because very few poems are about poetry. There must be some place where the trainers have decided what they expect the user to want from ChatGPT, and have trained it to produce a poem when the prompt is the word "poem", and to make the poem be about something mentioned in the prompt. That goes beyond simply predicting what comes after the word "poem" in some vast body of text. The LLM would have to be trained to predict what words follow other words while also satisfying the prompt in some way.
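For what it's worth, the "loss function" being discussed can be sketched in one line. This is a minimal illustration of next-token cross-entropy, with invented numbers (real systems compute this over probability distributions on subword tokens, averaged across huge batches):

```python
import math

# Minimal sketch of the next-token "loss function" idea: the model
# assigns probabilities to possible next words, and the loss penalizes
# assigning low probability to the word that actually came next.
# The distribution below is invented purely for illustration.
predicted = {"poetry": 0.5, "love": 0.3, "physics": 0.2}  # model's guess after "a poem about"
actual_next = "poetry"  # what the training text actually said

loss = -math.log(predicted[actual_next])  # cross-entropy for this one step
print(round(loss, 4))  # prints: 0.6931
```

Training adjusts the model's internal parameters to push this number down, i.e., to put more probability on whatever actually followed in the training data. Whatever intentions the trainers have are expressed through what counts as "the actual next word" in the curated prompt/response pairs.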
So, getting back to the issue of a prompt as a correction: the trainers expect that people will want to iterate with ChatGPT to improve its output in a session, so they will give prompts that should be interpreted as corrections. That's why ChatGPT is trained to be obsequious about accepting corrections; it has to be built into the way it is trained, and is not just a simple algorithm for predicting the text that follows the text in the prompt (again, humans are rarely so obsequious).
Where are these layers in the Wolfram article's description? (I am referring to the Wolfram article that is referenced in the Insights article at the start of this thread.)
Yes, agreed.
Yes, good point: it starts with a training dataset and then generates its own, such that it only needs to query its own dataset to generate its responses. But the manner in which that dataset is trained is a key aspect; that's where the intelligence of the trainers leaves its mark. They must have had certain expectations that they built in, which end up looking like there is a difference between a prompt like "poem", which will generate a poem about poetry, and a followup prompt like "shorten that poem". It can even count the words it uses in its answer, and prohibit certain types of answers, so it has some extra scaffolding in its training.
I understand, but I think you might be reading past my point. As you say, it "accepts any input" regardless of what the input is. What I'm pointing out is that it therefore doesn't know if the input is wrong, and will simply riff on the wrong input. This gives the appearance to the user that they are winning a debate or providing useful clarification on which to get a better answer, when in reality they may just be steering it towards providing or expanding on a wrong answer.
It has a database of relative word frequencies. That's the only database it has. (At least, that's the case for the version that was reviewed in the articles being discussed here. Later versions might have changed some things.)
Yes, but it has a database to inform those frequencies. That database must be "primed" in some way to establish what ChatGPT is, and how it should relate to prompts. For example, if you prompt it with "describe yourself", it will say "I am ChatGPT, a creation of OpenAI. I'm a language model powered by the GPT-3.5 architecture, designed to understand and generate human-like text based on the input I receive. I have been trained on a diverse range of text sources up until September 2021, so I can provide information, answer questions, assist with writing, generate creative content, and more. However, it's important to note that I don't possess consciousness, emotions, or personal experiences. My responses are based on patterns in the data I've been trained on, and I aim to be a helpful and informative tool for various tasks and conversations." So that's a highly specialized set of data to look for word associations with "describe yourself"; it has been trained to favor certain word frequencies in response to certain prompts.
Also, if you correct it, it will invariably apologize obsequiously. So it is in some sense "programmed to accept corrections," in the sense that it uses a word association database that expects to be corrected and is trained to respond to that in certain ways.
It would seem that its training also expects to provide certain types of responses. For example, if you just give it the one word prompt "poem", it will write a poem. Also, since you did not specify a subject, it will write a poem about poetry! I think that was a conscious decision by its programmers, there are built in expectations about what a prompt is trying to accomplish, including corrections. It could be said that ChatGPT inherits some elements of the intelligence of its trainers.
It doesn't search any database; that was one of the main points of the Insights article. All it does is generate text based on relative word frequencies, using the prompt given as input as its starting point.
I don't think that's anything explicitly designed, at least not in the version of ChatGPT that was reviewed in the Insights article and the Wolfram article it references. It is just a side effect of its algorithm.
Yet it gives the prompt some kind of special status, so when you correct something it said, it tends to become quite obsequious. It is not normal in human language to be that obsequious; I don't think it could get that from simply predicting the next word. When it has made a mistake and you tell it that it has, it will typically search its database a little differently, placing some kind of emphasis on your correction. If you tell it that one plus one equals three, however, it has enough contrary associations in its database that it sticks to its guns. But it will still not tell you that you have made a mistake (which surely would be the norm in its database of reactions); it will suggest you might be joking or using an alternative type of mathematics. The status it gives to the prompt must be an important element of how it works, and it is constrained to be polite in the extreme, which amounts to "riffing on" your prompt.
No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.
My understanding is that it is designed to* accept criticism/clarification/correction. That makes such follow-up mostly useless, since it will simply be riffing on what you tell it, regardless of whether it is accurate or not. In other words, you'll always win a debate with it, even if you're wrong.
*Whether actual design or "emergent" behavior I don't know, but I don't think it matters.
My question is, how much of this is due to the fact that these are just early generation attempts, versus how much is fundamental to the way LLMs must work? If we fix up their ability to recognize logical contradictions, and enhance their ability to do mathematical logic, will we get to a point where it is very hard to distinguish their capabilities from the capabilities of the physics teachers who pose the questions in the first place? And if we did, what would that mean for our current ideas about what conceptual understanding is, since physics seems like a place where conceptual understanding plays a crucial role in achieving expertise? These kinds of AI related questions always remind me of B. F. Skinner's great point: "The real question is not whether machines think but whether men do. The mystery which surrounds a thinking machine already surrounds a thinking man." (Or woman.)
It's because I have seen them report a Python code they used to do the calculation, and the Python code does not yield the quantitative result they report. So that's pretty odd, they seem to be able to associate their prompts with actual Python code that is correct, and still get the answer wrong.
Yes, this is the kind of thing that is needed, and is what I'm expecting will be in place in a few years, so it seems likely that ten years from now, LLMs will be able to answer physics questions fairly well, as long as they only require associating the question with a formula without conceptual analysis first. It will then be interesting to see how much LLMs have to teach us about what we do and do not comprehend about our own physics, and what physics understanding actually is. This might be pedagogically significant for our students, or something much deeper.
Then the question is, why do you not use LLMs for complex code, and will that still be true in ten years? That might be the coding equivalent of using LLMs to solve physics questions, say on a graduate level final exam.
The issue, on this forum at least, is not actually the tool so much as people's misunderstanding of how it works and the reasons for its behavior. People give it too much credit for "intelligence" and so forth.
Ten years from now? Well of course it will have improved enough to give excellent results to the kinds of exams that are currently given. Does that point to a problem in the exams, if they can be answered correctly by a language manipulation model that does not have any comprehension of its source material? Possibly, yes, it may mean that we are not asking the right kinds of questions to our students if we want them to be critical thinkers and not semantic parrots.
Yes, and no doubt how these models work will continue to evolve. The reason for specifically considering ChatGPT in the Insights article under discussion is that that specific one (because of its wide public accessibility) has been the subject of quite a few PF threads, and in a number of those threads it became apparent that there are common misconceptions about how ChatGPT works, which the Insights article was intended to help correct.
Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions
Over the last decade, Q&A platforms have played a crucial role in how programmers seek help online. The emergence of ChatGPT, however, is causing a shift in this pattern. Despite ChatGPT's popularity, there hasn't been a thorough investigation into the quality and usability of its responses to software engineering queries. To address this gap, we undertook a comprehensive analysis of ChatGPT's replies to 517 questions from Stack Overflow (SO). We assessed the correctness, consistency, comprehensiveness, and conciseness of these responses. Additionally, we conducted an extensive linguistic analysis and a user study to gain insights into the linguistic and human aspects of ChatGPT's answers. Our examination revealed that 52% of ChatGPT's answers contain inaccuracies and 77% are verbose. Nevertheless, users still prefer ChatGPT's responses 39.34% of the time due to their comprehensiveness and articulate language style. These findings underscore the need for meticulous error correction in ChatGPT while also raising awareness among users about the potential risks associated with seemingly accurate answers.
Users get tricked by appearance.
So I told it that it was wrong, but the rest was right, so just do the multiplication again. This time it got a different answer, but still completely wrong. So I asked it to describe how it carried out the multiplication, step by step, and it couldn't do it. So I told it to carry out the multiplication of the first two numbers, report the answer, then multiply the third one, and so on, and it then did get the answer correct. Then I used Bard and it got the answer wrong also, though it could report the Python code it used, but the Python code did not give the answer Bard gave! So those AIs cannot seem to track how they arrive at their own answers, and I think that may be closely related to why their numerical results are horrendously unreliable. At some point along the way, they seem to do something that in human terms would be called "guessing", but they cannot distinguish that from any other type of analysis they do, including following Python codes.
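The mismatch described above (reported Python code that doesn't produce the reported number) is easy to check yourself: run the code the chatbot claims it used and compare. A sketch with made-up numbers, just to show the check:

```python
# If a chatbot reports both a numeric answer and the Python it "used",
# the two can be checked against each other by actually running the code.
# The factors and the claimed answer below are invented for illustration.
factors = [137, 4021, 89]

# Step-by-step product, the way the chatbot was eventually walked through it:
product = 1
for f in factors:
    product *= f

claimed_answer = 49_020_000  # hypothetical figure "reported" by the chatbot
print(product, product == claimed_answer)  # prints: 49028053 False
```

The point is that the chatbot's generated text about its reasoning and its generated numeric answer are two separate pieces of output; nothing in the text-generation process forces them to agree, which is why executing the code independently is the only reliable check.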
But it doesn't "interpret" at all. It has no semantics.
This seems to be vastly overstating ChatGPT's capabilities. You can "clarify" by giving it another input, but it won't have any semantic meaning to ChatGPT any more than your original input did.
Yes, and a major difference, at least for the version of ChatGPT that this article discusses, is that biological nervous systems have two-way real-time connections with the outside world, whereas ChatGPT does not–it has a one-way, "frozen" connection from its training data, and that connection is much more limited than the connections biological systems have since it only consists of relative word frequencies in text.
Not in the version of ChatGPT that my Insights article, and the Wolfram article it references, are based on. Future versions might add something along these lines.
Is there any mathematical logic in it at all? Obviously search engines and chatbots use word association so "big" = "mass" and from that you get a list of masses. Can it then do =? Seems messy, but I'd think by statistical analysis/word frequency it would associate a 5 digit number with "bigger" than a 4 digit number even without doing the math.
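The digit-count intuition above actually works as pure text manipulation, no arithmetic required. A small sketch (my own illustration, not how any chatbot is known to do it): for positive integers written without leading zeros, more digits always means a larger number, and equal-length numerals compare correctly as strings.

```python
# A purely "textual" comparison of decimal numerals, with no numeric
# conversion at all -- illustrating that "which is bigger" can sometimes
# be answered by string statistics rather than mathematics.
def looks_bigger(a: str, b: str) -> bool:
    """True if numeral a denotes a larger positive integer than numeral b.
    Assumes no leading zeros, no signs, no decimal points."""
    if len(a) != len(b):
        return len(a) > len(b)  # more digits -> bigger number
    return a > b  # same length: lexicographic order matches numeric order

print(looks_bigger("59736", "6417"))  # 5 digits vs 4 digits -> True
print(looks_bigger("1898", "5683"))   # same length, compare digit by digit -> False
```

Of course this breaks down immediately for decimals, negatives, or units ("5 g" vs "1 kg"), which is one reason purely textual association gives unreliable quantitative answers.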
For what it's worth, translating your question into searchenginese doesn't give a one-word answer but does give a table of masses in the 5th hit.
I think this is a good exercise, because people so vastly over-estimate the capabilities of chat-bots. It's amazing how convincing you can make it with a large enough data set without using actual intelligence.
Probably. Maybe it's better to talk about "design". ChatGPT is not designed to be independent of the echo chamber. It is designed to produce echoes indistinguishable from the rest of the echo chamber.
I was probably unclear. The training data very much matters. And who decides what training data to use or not to use?
The only way it is "unbiased" is that it is difficult to generate output that is differently biased than the training dataset.
There is just no way of getting away from speaking in terms of intention and volition, the English language won't let us. We cannot resist the temptation to say that it is "attempting" or "trying" to do something, but it is no more attempting to create an echo or anything else than my dishwasher is motivated to clean the dinner dishes.
The entire ChatGPT phenomenon makes me think of Searle's Chinese Room thought experiment: https://en.wikipedia.org/wiki/Chinese_room
Is it being honest when it commits academic fraud by fabricating sources?
[Spoiler: it's being nothing.]
Given the way ChatGPT works, it doesn't matter.
Text generated to look like other text on the internet is going to be unreliable whether it is patterned on reliable text or not. For example, essentially everything you'll find about duplicate bridge on the internet is reliable. Some stuff is written for beginning players, some is written by and for the relatively small community of world-class players, most falls somewhere in between, but it's all reasonable advice. But we still get
http://bridgewinners.com/article/view/sorry-i-know-it-is-stupid-to-post-conversations-with-chatgpt/
http://bridgewinners.com/article/view/testing-chatgpt-the-media-hyped-ai-robot/
http://bridgewinners.com/article/view/using-chatgpt/
The last.
But this misses a larger fallacy. The question of whether ChatGPT is reliable or not does not depend on whether people are reliable or not, nor on which people are more reliable than others.
Are you speaking as a participant, the author of the Insight, or as a moderator?
I hate to have to repeat myself, but all thread participants, please note my advice in post #105.
This is a category error, but one that is almost impossible to avoid. The English language has no natural way of talking about chatbot output so we inevitably find ourselves saying things like "it thinks" or "it knows" when of course it does no such thing – it's just arranging words to form sequences that resemble patterns already out there on the internet (and as evidence of just how hard it is to avoid this category error I just finished backspacing away the words "it has seen on the internet"). Saying that ChatGPT is honest makes no more sense than saying that the motor hauling my truck up a steep hill is "gutsy and determined to succeed" – the difference is that we know how to talk about mechanical devices in terms of their performance characteristics without attributing sentience to them.
Following is free advice, my feelings won't be hurt if you choose to ignore it… Form your opinions of the Supreme Court by reading the actual opinions (posted at https://www.supremecourt.gov/) and by joining the live blog on opinion days at scotusblog.com. We spend a lot of time complaining about pop-sci videos…. Popular press coverage of major court cases is far worse.
Isn't it just the opposite? ChatGPT attempts to create an "echo" indistinguishable from its "training echoes".
True, because "misinformation" requires intent just as much as "honesty" does, and ChatGPT has no intent. Or, to put it another way, ChatGPT is not reliably unreliable any more than it is reliably reliable. :wink:
While I understand the point being made in the context of this thread, let's please not take it any further, in order to ensure that we don't derail the thread into a political discussion.
In contrast with social media software, for example, whose model is to focus information based on your perceived prejudices.
With ChatGPT you are not in an echo chamber being fed a steady diet of misinformation.
For example, I stumbled on a twitter feed about the COVID vaccine. Everyone on the thread believed that it was harmful. One woman was puzzled by those who willingly took the vaccine and they all agreed it must be down to "low intelligence".
That is "gross"; and your examples of ChatGPT bias pale by comparison.
This isn't exactly true (though depends on what you mean by "gross"). It has guardrails designed to constrain content, which reflect the biases of the programmers. For example, a few months ago someone asked it for religious jokes and while it was OK with Christian jokes it declined to provide Islamic jokes. I think this bias has since been corrected.
It is also biased by its programmers' choice of source information. For example, the user base of Reddit has a lot more say in the generated output than the membership of AARP.
Perhaps you should have asked about the ferry from Constantinople.
You may be right and it'll die a death. I'm not so sure. The reasons for the adoption of technology are often social and cultural, rather than technical.
In fact, there is evidence it's already taken off.
Didn't I just make one? :smile:
Actually, it is working on a pool of human writing.
The idea is that writing is a good enough proxy for knowledge and that word frequency distributions* are a good enough proxy for understanding. The thread as well as some past ones highlight many cases where this does not work.
FWIW, I think ChatGPT could write horoscopes as well as the "professionals". But probably not write prescriptions.
* But not letter frequency distributions, which we had 40 years ago doing much the same thing. That would just be crazy talk.
No, it's not. "Honest" requires intent. ChatGPT has no intent.
I don't see how you would even compare the two. The US Supreme Court issues rulings that say what the law is. You don't "trust" or "not trust" the US Supreme Court. You either abide by its rulings or you get thrown in jail.
People already rely on a steady diet of lies and misinformation from human sources. ChatGPT is at least honest. I would trust ChatGPT more than I would the US Supreme Court, for example.
Should we climb the Moine Ridge on Thursday this week?
To make an informed decision about climbing the Moine Ridge on Thursday, I recommend checking weather forecasts, consulting with experienced climbers or local mountaineering authorities, and assessing your own skills and experience. Additionally, consider factors such as trail conditions, safety equipment, and the overall fitness and preparedness of your climbing team.
Mountain environments can be unpredictable and potentially dangerous, so it's essential to prioritize safety and make well-informed decisions.
Sure, but now you're backing away from your previous claim. People are free to choose to do stupid things, of course; but previously you were saying that relying on ChatGPT for practical information was not stupid. Now you're back-pedaling and saying, well, yes, it is stupid, just like relying on astrology, but there will always be people who choose to do stupid things.
No, it's working on a global pool of text. That's not the same as "knowledge". ChatGPT has no information about the connection of any of the text in its training data with the actual world. It doesn't even make use of the text in itself; it only makes use of the relative word frequencies in the text.
Not in its current form. The next obvious step in the evolution of such models–connecting them to actual real world data–is already being taken, at the very least with the paid version of ChatGPT (mentioned in earlier posts), which includes actual lookups in various data sources (web search, for one, and for another, ironically, Wolfram Alpha) for generating responses. In other words, to do the key things that the current free version, which is what this Insights article discussed, does not. Ten years from now, I expect that further steps along those lines will have been taken and will have made these tools reliable in a way that the current ChatGPT is not.
People can do what they want. It's an option, for sure. In fact, we've seen some evidence on here that significant numbers of people are using it to learn about physics.
If some people choose to live by astrological charts, them others can choose to live by ChatGPT. I choose to do neither. For the time being.
There's no comparison. ChatGPT, however imperfectly, is working on a global pool of human knowledge. There's a rationale that it's trying to produce an unprejudiced, balanced answer.
Perhaps it will fail to develop. But, ten years from now, who knows how many people will be using it or its competitors as their mentor?
Doesn't that contradict your previous claim here?
If you're not willing to do this yourself, on what basis do you justify saying that someone else could do it?