Why ChatGPT AI Is Not Reliable

BWV · Jul 17, 2023

The interesting angle is how much of the general LLM model's inaccuracy comes from the training set vs. the algorithm. Training these models on the broad Internet of course exposes it to a large amount of bad information. What will likely come next are smaller models trained on a curated data set for a specific application, say medicine or law.

The LLMs appear to make Python calls for calculations rather than look through their training set, for example with Bard (and the answer is correct)

what is the cube root of 123456789?

The cube root of 123456789 is approximately 497.933859. This can be calculated using the following Python code:
Python
import math

def cube_root(number):
"""Calculates the cube root of a number."""
return math.pow(number, 1/3)

print(cube_root(123456789))

Use code with caution. Learn more
content_copy
The output of the code is 497.933859, which is the approximate cube root of 123456789.
Here are some other ways to calculate the cube root of 123456789:

You can use a calculator that has a cube root function.

You can use a search engine to find a website that calculates cube roots.

You can write a custom program in a programming language of your choice to calculate the cube root.

I hope this helps! Let me know if you have any other questions.

,

Vanadium 50 · Jul 17, 2023

PeroK said:

you are not in an echo chamber

Isn't it just the opposite? ChatGPT attempts to create an "echo" indistinguishable from its "training echos".

Nugatory · Jul 17, 2023

PeroK said:

ChatGPT is at least honest.

This is a category error, but one that is almost impossible to avoid. The English language has no natural way of talking about chatbot output so we inevitably find ourselves saying things like "it thinks" or "it knows" when of course it does no such thing - it's just arranging words to form sequences that resemble patterns already out there on the internet (and as evidence of just how hard it is to avoid this category error I just finished backspacing away the words "it has seen on the internet"). Saying that ChatGPT is honest makes no more sense than saying that the motor hauling my truck up a steep hill is "gutsy and determined to succeed" - the difference is that we know how to talk about mechanical devices in terms of their performance characteristics without attributing sentience to them.

I would trust ChatGPT more than I would the US Supreme Court, for example.

Following is free advice, my feelings won't be hurt if you choose to ignore it... Form your opinions of the Supreme Court by reading the actual opinions (posted at https://www.supremecourt.gov/) and by joining the live blog on opinion days at scotusblog.com. We spend a lot of time complaining about pop-sci videos.... Popular press coverage of major court cases is far worse.

PeterDonis · Jul 17, 2023

Nugatory said:

Following is free advice, my feelings won't be hurt if you choose to ignore it... Form your opinions of the Supreme Court by reading the actual opinions (posted at https://www.supremecourt.gov/) and by joining the live blog on opinion days at scotusblog.com. We spend a lot of time complaining about pop-sci videos.... Popular press coverage of major court cases is far worse.

I hate to have to repeat myself, but all thread participants, please note my advice in post #105.

Vanadium 50 · Jul 17, 2023

PeterDonis said:

please note my advice in post #105.

Are you speaking as a participant, the author of the Insight, or as a moderator?

Vanadium 50 · Jul 17, 2023

The issue with "neutrality" is fraught with peril. For example, in the war between Mordor and Gondor, if we only use sources from one side or the other, we might get different opinions. Most people here would say that the side of Gondor is completely truthful and the side of Mordor is nothing but propaganda and lies. But of course the Mordorian side would say differently. Who decided what sources are reliable for training and what ones are not? Ot do we toss them all in and let ChatGPT sort it out? Because then the point of view will be set by whomever can write the most.

But this misses a larger fallacy. The question of whether ChatGPT is reliable or not does not depend on whether people are reliable or not, nor on which people are more reliable than others.

PeterDonis · Jul 17, 2023

Vanadium 50 said:

Are you speaking as a participant, the author of the Insight, or as a moderator?

The last.

Nugatory · Jul 17, 2023

Vanadium 50 said:

Who decided what sources are reliable for training and what ones are not?

Given the way ChatGPT works, it doesn't matter.

Text generated to look like other text on the internet is going to unreliable whether it is patterned on reliable text or not. For example, essentially everything you'll find about duplicate bridge on the internet will be reliable. Some stuff will be written for beginning players, some will be written by and for the relatively small community of world-class players, most will fall somewhere in between, but it's all reasonable advice. But we still get
http://bridgewinners.com/article/view/sorry-i-know-it-is-stupid-to-post-conversations-with-chatgpt/
http://bridgewinners.com/article/view/testing-chatgpt-the-media-hyped-ai-robot/
http://bridgewinners.com/article/view/using-chatgpt/

russ_watters · Jul 17, 2023

PeroK said:

ChatGPT is at least honest.

Is it being honest when it commits academic fraud by fabricating sources?

[Spoiler: it's being nothing.]

Nugatory · Jul 17, 2023

Vanadium 50 said:

ChatGPT attempts to create an "echo" indistinguishable from its "training echos".

There is just no way of getting away from speaking in terms of intention and volition, the English language won't let us. We cannot resist the temptation to say that it is "attempting" or "trying" to do something, but it is no more attempting to create an echo or anything else than my dishwasher is motivated to clean the dinner dishes.

The entire ChatGPT phenomenon makes me think of Searle's Chinese Room though experiment: https://en.wikipedia.org/wiki/Chinese_room

Vanadium 50 · Jul 17, 2023

Nugatory said:

Given the way ChatGPT works, it doesn't matter.

I was probably unclear. The training data very much matters. And who decides what training data to use or not to use?

The only way it is "unbiased" is that it is difficult to generate output that is differently biased than the training dataset.

Vanadium 50 · Jul 17, 2023

Nugatory said:

just no way of getting away from speaking in terms of intention

Probably. Maybe it's better to talk about "design". ChatGPT is not designed to be independent of the echo chamber. It is designed to produce echos indistinguishable from the rest of the echo chamber.

sbrothy · Jul 18, 2023

Applications[edit]

Around 2013, MIT researchers developed BullySpace, an extension of the commonsense knowledgebase ConceptNet, to catch taunting social media comments. BullySpace included over 200 semantic assertions based around stereotypes, to help the system infer that comments like "Put on a wig and lipstick and be who you really are" are more likely to be an insult if directed at a boy than a girl.[11][12][13]

ConceptNet has also been used by chatbots[14] and by computers that compose original fiction.[15] At Lawrence Livermore National Laboratory, common sense knowledge was used in an intelligent software agent to detect violations of a comprehensive nuclear test ban treaty.[16]

---- Wiki on Commonsense knowledge (artificial_intelligence)

-------------------------------------------------

Just a funny observation.

I tried to bold the part about "detecting violations of the comprehensive nuclear test ban treaty". It's quite a contrast of applications.

bagasme · Jul 20, 2023

AndreasC said:

The semantic connections you are talking about are connections between sensory inputs and pre-existing structure inside our brains. You're just reducing what it's doing to the bare basics of its mechanics, but its impressive behavior comes about because of how massively complex the structure is.

I don't know if you've tried it out, but it doesn't just "get lucky". Imagine a student passing one test after another, would you take someone telling you they only "got lucky" seriously, and if yes, how many tests would it take? Plus, it can successfully apply itself to problems it never directly encountered before. Yes, not reliably, but enough that it's beyond "getting lucky".

You talk about it like you haven't actually tried it out. It's not at all the same as previous chatbots, it has really impressive capabilities. It can give you correct answers to unambiguous questions that are non-trivial and that it has not specifically encountered before in its training. And it can do that a lot, repeatably. Nothing to do with how confident it sounds, I am talking about unambiguously correct answers.

Again, I'm not saying it is reliable, but you are seriously downplaying its capabilities if you think that's all it does and I encourage you to try it out for yourself. Especially when it comes to programming, it is incredible. You can put in it complicated code that is undocumented, and it can explain to you what the code does exactly, what problem it probably was intended for, and how to improve it, and it works a lot of the time, much more frequently than "luck".

If all you want to say is that it isn't right all the time, then yeah, that's true. It's very, very frequently wrong. But that has little to do with what you are describing. It could (and will) improve significantly on accuracy, using the same mechanism. And practically, what you are saying doesn't matter. A database doesn't "know" what something is either in your sense of the word, neither does a web crawler, or anything like that. That doesn't make them unreliable. Neither is a human reliable because they "know" something (again going by your definition).

ChatGPT is unreliable because we observe it to be unreliable. That requires no explanation. What does require explanation is why, as @Demystifier said, it is so much more reliable (especially at non trivial, "reasoning" type problems) than you would naively expect.

This reminds me of statistical machine translation (SMT), where the translation quality is (as I have observed) is heavily dependent and correlated on its training model and corpus. And I bet that there is no known SMT models that produces good translation on casual languages unless they are given also casual language corpus.

Maybe related is my observation of Indonesian text translations. Many of there looks stiff and unnatural (think of formal equivalence), which influences me when maintaining Indonesian translation of a popular open source project. In most cases, I spot mood mismatches (e.g. do foo in imperative mood is translated as melakukan foo (descriptive mood due to me- prefix) whereas the correct translation should have been lakukan foo).

Swamp Thing · Aug 3, 2023

Although ChatGPT is merely generating its output according to word statistics boiled down from training data, it is subjectively hard to imagine that it can do things like this without some sense of meaning :-

Swamp Thing said:

Input: In the next input I will give you a list of words. In the input after that I will give you a sentence containing the placeholder "xyz". In response to that, please give me the sentence with the most appropriate word from the list.

Output: OK, I'm ready! Please provide the list of words and the sentence with the placeholder.

I: Mouse, whale, horse.

O: Okay. Please provide the sentence with the placeholder "xyz".

I: The xyz is the biggest animal from the list.

O: The whale is the biggest animal from the list.

Can one break down how a blind statistical process can do that?

One could take it a notch higher by adding something like this: "If the placeholder is xyz, process the sentence as it is. If the placeholder is pqr, replace all adjectives and adverbs by their opposites. For example, replace 'hottest' with 'coldest'".

.

Swamp Thing · Aug 4, 2023

If one has access to, say, OpenAI's GPT via the full API, would there be a way to drill down and get some sense of how a particular input prompt gave rise to a corresponding response? I do realise that huge amounts of data would be involved, which could overwhelm a human. But could one build some kind of graphical visualization tool that would help explore the connection from a specific input to a specific output, to the extent possible?

Motore · Aug 4, 2023

Swamp Thing said:

Can one break down how a blind statistical process can do that?

Well the word whale has already a statisticaly higher chance to have 'bigger', 'biggest', 'large', ... words asocciated with it, so I don't see a problem.

Let's try a more subtle case:

Me: In the next input I will give you a list of words. In the input after that I will give you a sentence containing the placeholder "xyz". In response to that, please give me the sentence with the most appropriate word from the list.
ChatGPT3.5: Sure, I'm ready to assist you. Please provide the list of words in the next input, and in the following input, give me the sentence containing the placeholder "xyz". I'll then replace the placeholder with the most appropriate word from the provided list. Go ahead and provide the list of words.
Me: Horse, zebra, donkey.
ChatGPT3.5: Got it! Please provide the sentence containing the placeholder "xyz" that you'd like me to fill in using the words "horse", "zebra", or "donkey".
Me: The xyz is the biggest animal from the list.
ChatGPT3.5: The zebra is the biggest animal from the list.

In fact zebras are smaller than horses.

Swamp Thing said:

If one has access to, say, OpenAI's GPT via the full API, would there be a way to drill down and get some sense of how a particular input prompt gave rise to a corresponding response? I do realise that huge amounts of data would be involved, which could overwhelm a human.

Not really. ChatGPT3.5 has cca 180 billion parameters, that are densely connected, so it would be really hard to see how the choices are beign made.
The OpenAI team will have to resolve this issue somehow (perhaps giving information about important choices in the tree or some snippets or ...), because eventualy the regulatory agencies will need it to be in place.

russ_watters · Aug 4, 2023

Swamp Thing said:

Can one break down how a blind statistical process can do that?

Is there any mathematical logic in it at all? Obviously search engines and chatbots use word association so "big" = "mass" and from that you get a list of masses. Can it then do <>=? Seems messy, but I'd think by statistical analysis/word frequency it would associate a 5 digit number with "bigger" than a 4 digit number even without doing the math.

For what it's worth, translating your question into searchenginese doesn't give a one-word answer but does give a table of masses in the 5th hit.

I think this is a good exercise, because people so vastly over-estimate the capabilities of chat-bots. It's amazing how convincing you can make it with a large enough data set without using actual intelligence.

PeterDonis · Aug 4, 2023

russ_watters said:

Is there any mathematical logic in it at all?

Not in the version of ChatGPT that my Insights article, and the Wolfram article it references, are based on. Future versions might add something along these lines.

mattt · Aug 5, 2023

Large language models "use statistics about written text" (on which it has been trained) as a proxy of "something else" that we call "the real world".

Biological brains/nervous systems "use (something that we humans ourselves can describe as) temporal and spatial statistical distribution of neuronal spikes" as a proxy of "something else" that we like to call "the real world".

There are many differences between how biological nervous systems work, and how the computers that execute the algorithms behind large language models work.

Some people seem to think that "what we can describe as temporal and spatial statistical distributions of neuronal spikes" (or a subset of it) is the only thing/system/process in the Universe that can also be described as "phenomenological experience", and so they deny any possibility of a different type of system (or part of its processes) to also be described as "phenomenological experiences".

But the reality about this, is that we just don't know...

PeterDonis · Aug 5, 2023

mattt said:

There are many differences between how biological nervous systems work, and how the computers that execute the algorithms behind large language models work.

Yes, and a major difference, at least for the version of ChatGPT that this article discusses, is that biological nervous systems have two-way real-time connections with the outside world, whereas ChatGPT does not--it has a one-way, "frozen" connection from its training data, and that connection is much more limited than the connections biological systems have since it only consists of relative word frequencies in text.

r.vittalkiran · Aug 8, 2023

GPT is not 100% reliable from my experience. Especially the ones that have arisen from the former (Chat GPT).

Filip Larsen · Aug 9, 2023

A recent article on Nature describes situations where LLM becomes unreliable (i.e. fails to provide the correct answer) especially in visual abstraction tasks and rephrased texts that humans in general have no trouble with:

https://www.nature.com/articles/d41586-023-02361-7

I like the title of the printed (PDF) version of the article better though: "The Easy Intelligence Tests that AI Chatbots Fail". While some the emergent behaviors of LLM are indeed surprising it is my opinion people in general are too easily fooled into perceiving this as a sign of "intelligent thinking" when it (so far) is mostly just a very good language pattern matcher.

ryo_br · Aug 12, 2023

PeterDonis said:

Continue reading...

About the article, I believe that there is not such innovative information, but it presents thoughts that some people did not have.
I agree that chatgpt is unreliable, so it needs humans and I consider it as something better than "an excel sheet", that's all.
Some people have blindly used this tool as if it were a human head working for you and that's not it. In this context, I believe that the article has its value to highlight this point and avoid using this form.
I believe that chatgpt is a tool that came to build at work and it is inevitable to use it to optimize some tasks.

neobaud · Aug 12, 2023

PeterDonis said:

Continue reading...

We should be skeptical of the information provided by chatGPT but it is excellent when you have a question that is difficult to compose for a regular search engine. Its value is that it can interpret your question better than google or bing and you can build context. If it misunderstands you, you can even clarify your meaning in a way that is really only possible with another human. You can always verify the facts.

PeterDonis · Aug 12, 2023

neobaud said:

Its value is that it can interpret your question

But it doesn't "interpret" at all. It has no semantics.

neobaud said:

you can even clarify your meaning in a way that is really only possible with another human

This seems to be vastly overstating ChatGPT's capabilities. You can "clarify" by giving it another input, but it won't have any semantic meaning to ChatGPT any more than your original input did.

Ken G · Aug 13, 2023

One thing I have noticed interacting with ChatGPT is that it is no good at all at explaining how it comes up with the answers it does. Recently I asked it a question that required mathematical reasoning (something it is really terrible at for some reason), and it did an excellent calculation that involved invoking physics equations that were not at all obvious from the prompt. So that was impressive, and it arrived at a correct expression that involved simply multiplying a chain of different terms that varied widely in magnitude. After the remarkably correct analysis it did, it got that final multiplication wrong by many orders of magnitude!

So I told it that it was wrong, but the rest was right, so just do the multiplication again. This time it got a different answer, but still completely wrong. So I asked it to describe how it carried out the multiplication, step by step, and it couldn't do it. So I told it to carry out the multiplication of the first two numbers, report the answer, then multiply the third one, and so on, and it then did get the answer correct. Then I used Bard and it got the answer wrong also, though it could report the Python code it used, but the Python code did not give the answer Bard gave! So those AIs cannot seem to track how they arrive at their own answers, and I think that may be closely related to why their numerical results are horrendously unreliable. At some point along the way, they seem to do something that in human terms would be called "guessing", but they cannot distinguish that from any other type of analysis they do, including following Python codes.

AngryBeavers · Aug 13, 2023

I admittedly only read the first page of posts, and then skimmed through the rest to arrive here so please forgive me if someone else has asked, but, all of you are aware that there are many, many more LLMs besides ChatGPT? I understand that it is probably the most well known, but in my experience ( which is vast, but also shallow, and from a humble interacters perspective ) there are better ones out there that are suited to your needs. Google Bard for instance. I think that ChatGPT is faster, and better at creative text generation, but Bard is leaps and bounds better at providing accurate answers. It can also summarize any link you provide to it, such as from Google Scholar or anywhere else for that matter. It's accuracy in this however, is about like ChatGPT's accuracy at being able to site it's sources.

Further, there are now LLMs that can be run locally. I have a few installed, but they require about what my system has, just as a minimum, so my usage of them is limited. Anyone curious, it's called gpt4all and can be found on github. It allows you to download quite a few LLMs directly from the app. The only plugin currently availabe is one that allows you to give the models access to folders or files on your local system. It can do a few things with this ability.

Anyway, this has turned into a rant. Perhaps someone better at coding, and with vastly more patience, as well as a better PC will look into the local ran LLM's and after using various models, realize that LLM's are about a good as a text book. If you do not know what youre looking for, neither will the LLM...very well anyway.

nsaspook · Aug 13, 2023

https://arxiv.org/abs/2308.02312
Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions

Over the last decade, Q&A platforms have played a crucial role in how programmers seek help online. The emergence of ChatGPT, however, is causing a shift in this pattern. Despite ChatGPT's popularity, there hasn't been a thorough investigation into the quality and usability of its responses to software engineering queries. To address this gap, we undertook a comprehensive analysis of ChatGPT's replies to 517 questions from Stack Overflow (SO). We assessed the correctness, consistency, comprehensiveness, and conciseness of these responses. Additionally, we conducted an extensive linguistic analysis and a user study to gain insights into the linguistic and human aspects of ChatGPT's answers. Our examination revealed that 52% of ChatGPT's answers contain inaccuracies and 77% are verbose. Nevertheless, users still prefer ChatGPT's responses 39.34% of the time due to their comprehensiveness and articulate language style. These findings underscore the need for meticulous error correction in ChatGPT while also raising awareness among users about the potential risks associated with seemingly accurate answers.

Users get tricked by appearance.

Our user study results show
that users prefer ChatGPT answers 34.82% of the time. However, 77.27% of these preferences are incorrect answers. We believe this observation is worth investigating. During our study, we observed that only when the error in the ChatGPT answer is obvious, users can identify the error. However, when the error is not readily verifiable or requires external IDE or documentation, users often fail to identify the incorrectness or underestimate the degree of error in the answer. Surprisingly, even when the answer has an obvious error, 2 out of 12 participants still marked them as correct and preferred that answer. From semi-structured interviews, it is
apparent that polite language, articulated and text-book style answers, comprehensiveness, and affiliation in answers make completely wrong answers seem correct. We argue that these seemingly correct-looking answers are the most fatal. They can easily trick users into thinking that they are correct, especially when they
lack the expertise or means to readily verify the correctness. It is even more dangerous when a human is not involved in the generation process and generated results are automatically used elsewhere by another AI. The chain of errors will propagate and have devastating effects in these situations. With the large percentage
of incorrect answers ChatGPT generates, this situation is alarming. Hence it is crucial to communicate the level of correctness to users

PeterDonis · Aug 13, 2023

AngryBeavers said:

all of you are aware that there are many, many more LLMs besides ChatGPT?

Yes, and no doubt how these models work will continue to evolve. The reason for specifically considering ChatGPT in the Insights article under discussion is that that specific one (because of its wide public accessibility) has been the subject of quite a few PF threads, and in a number of those threads it became apparent that there are common misconceptions about how ChatGPT works, which the Insights article was intended to help correct.

AngryBeavers · Aug 13, 2023

I see. I was wondering. By what I read, it seemed the only LLM being discussed was ChatGPT, which I now understand why. I do not understand exactly why ChatGPT gets so much flack though, even still. At the end of the day, it's merely a tool. I've never seen so much hesitance, anger, and irrational fear over a tool...well, ever. Not here so much, I mean in the world in general. Frankly, I have other things to be worried about, such as Property Taxes, or the Grapefruit sized hernia protuding from my stomach.

Sometimes inaccurate information, and the upcoming downfall of mankind to our AI overlords is pretty low on my list of things to get emotional about.

:)

Ken G · Aug 14, 2023

In my rather limited experience, I found Bard was even worse than ChatGPT when it comes to mathematical reasoning that results in a quantitative answer to a physics or astronomy question. ChatGPT often uses the correct physics equation but does the calculation wrong, whereas Bard just makes the physics up completely. If there are any free LLMs that are better at these types of questions than those two I'd like to know about them, because those two are, quite frankly, just awful if you need reliability. That said, I will say that ChatGPT can often be prompted into a correct answer if you check it carefully, whereas I found Bard to be hopeless even when corrected. Hence I do think ChatGPT can be a useful physics tool, but only when used interactively and from a position of some prior knowledge. On the other hand, it is probably correct often enough to be able to give a student a passing grade on most physics exams, though I should not think it would ever result in an A result, at least in my experience with it.

Ten years from now? Well of course it will have improved enough to give excellent results to the kinds of exams that are currently given. Does that point to a problem in the exams, if they can be answered correctly by a language manipulation model that does not have any comprehension of its source material? Possibly, yes, it may mean that we are not asking the right kinds of questions to our students if we want them to be critical thinkers and not semantic parrots.

Motore · Aug 14, 2023

Well LLMs do not have computational algorhitms (yet), they deal with text pattern recognition, so I don't know why it's so surprising they cannot do calculations.
Here is a proposal for a math extension for LLMs: https://aclanthology.org/2023.acl-industry.4.pdf

Anyway testing ChatGPT (or Bard) a little more I find it useful for initial code generation, but to have a properly functioning script I found myself goning to StackOverflow 70% of the time. The explanation, examples are all already there, wtih LLMs you have to ask a lot of questions (which means typing and waiting) and still don't get the right answers some of the time. And mind you this is not complex code (for that, I never use LLMs), just some small scripts for everyday use.

AngryBeavers · Aug 14, 2023

Ken G said:

Possibly, yes, it may mean that we are not asking the right kinds of questions to our students if we want them to be critical thinkers and not semantic parrots.

I think that just about sums up why LLMs are not a risk to most peoples sense of security. I try to think of any LLM as being an interactive search engine. I can search for whatever my little heart desires, but I still have to filter the results myself. Honestly, I get more entertainment value from them in their current state. A lot of frustration also. For the purposes of genealogy, often LLMs are completely wrong and generate responses that merely parrot your intial input in a weird way. However, the "playground" version of ChatGPT gets so close that I find it interesting. The results are off, but if it were a human, I'd probaby consider it the result of "Uncle GPTs bad memory."

More than I meant to reply....heh. Basically, LLMs have strengths and weaknesses, but none are going to stun an intellectual community in any area that might be relavent.

phinds · Aug 14, 2023

AngryBeavers said:

I see. I was wondering. By what I read, it seemed the only LLM being discussed was ChatGPT, which I now understand why. I do not understand exactly why ChatGPT gets so much flack though, even still. At the end of the day, it's merely a tool.

The issue, on this forum at least, it not actually the tool so much as people's misunderstanding about how it works and the reasons for its behavior. People give it too much credit for "intelligence" and so forth.

Why ChatGPT AI Is Not Reliable

what is the cube root of 123456789?

Applications[edit]

Similar threads

Hot Threads

Recent Insights

Why ChatGPT AI Is Not Reliable

what is the cube root of 123456789?​

Applications[edit]​

Similar threads

Hot Threads

Recent Insights

what is the cube root of 123456789?

Applications[edit]