Why ChatGPT AI Is Not Reliable

AndreasC · Jul 7, 2023

PeterDonis said:

Please give a reference: where has this been done?

Well, I just tried it with ChatGPT. Its output was all right: not great, not terrible. It would be interesting to see what GPT-4 could do with it.

ME: I will tell you a story about aliens in the planet Xuxor and I want you to answer some questions based on it.

CHAT: Sure, I'd be happy to hear your story about aliens on the planet Xuxor. Please go ahead and tell me your story, and feel free to ask any questions you have afterwards.

ME: The planet Xuxor revolves around the star Artaxerxis B. The Xuxians have the same 12 months we have, but their planet revolves around the star twice as fast. The Xuxians were enslaved to the Gazogians for many centuries, but they eventually revolted. They only celebrate two main events, the start of the Xuxor Revolution, and the Achievement of Independence. On national holidays they eat their traditional dish, battery fluid cake. The start of the Xuxor Revolution is so important to them, that January 1st of the year 0 of their calendar corresponds to that date. Knowing that Xuxians gather together on the 25th of Xuxor May to eat battery fluid cake, do you think that corresponds to some important Xuxor event?
CHAT: Based on the information provided, the 25th of Xuxor May holds significance for the Xuxians as they gather together to eat battery fluid cake. While it is not explicitly mentioned in the story, given that the Xuxians celebrate two main events, the start of the Xuxor Revolution and the Achievement of Independence, it is reasonable to assume that the 25th of Xuxor May corresponds to one of these important events.

Considering that the Xuxor Revolution is highly significant to them, it is possible that the 25th of Xuxor May marks the anniversary or commemoration of a key event during the revolution. Perhaps it symbolizes a crucial battle, the signing of a pivotal document, or the day when the Xuxians gained a significant advantage in their struggle for independence.

Without further details, it is difficult to ascertain the exact nature of the event associated with the 25th of Xuxor May. However, given the importance of the Xuxor Revolution to the Xuxians, it is reasonable to assume that this date corresponds to an important event related to their fight for freedom.
COMMENT BY ME: We see it did not get it right away, though it did correctly answer that it is probably one of the two main historical events. Let's give it some more help.
ME: If January 1st corresponds to the Start of the Revolution, and they only celebrate the two aforementioned historical events, what do you think 25th of May corresponds to?
CHAT: Apologies for the confusion in my previous response. If January 1st in the Xuxians' calendar corresponds to the Start of the Xuxor Revolution, and they only celebrate two main historical events, it is likely that the 25th of May corresponds to the Achievement of Independence.

The 25th of May could signify the day when the Xuxians successfully achieved their independence from the Gazogians, marking a significant milestone in their history. This date may commemorate a specific event such as the signing of a declaration of independence, the end of the final battle against the Gazogians, or the establishment of their own governing system.

Given that the Xuxians gather together on this day to eat battery fluid cake, it suggests that the 25th of May is a special occasion and holds great importance in their culture. It serves as a national holiday and a time for the Xuxians to reflect upon and celebrate their freedom and the achievements of their past.

Therefore, it is reasonable to conclude that the 25th of May in the Xuxian calendar corresponds to the Achievement of Independence, one of the two significant historical events celebrated by the Xuxians.
COMMENT BY ME: This time it got it right. Let's give it something more "calculational", dependent on prior context.
ME: If roughly 10 human years passed from the Start of the Revolution to the Achievement of Independence, how many Xuxor years passed?
CHAT: Given that the planet Xuxor revolves around its star twice as fast as Earth, one Xuxor year would be equivalent to half of an Earth year. Therefore, if roughly 10 human years passed from the Start of the Revolution to the Achievement of Independence, we need to calculate how many Xuxor years would have passed.

Since one Xuxor year is half an Earth year, we can divide 10 human years by 0.5 to find the equivalent Xuxor years:

10 human years / 0.5 Xuxor years per human year = 20 Xuxor years.

So, approximately 20 Xuxor years would have passed from the Start of the Revolution to the Achievement of Independence.

COMMENT BY ME: Very neat, unambiguously correct response. Let's now push it for something a bit more complex.
ME: Give me the date, month and year of the Achievement of Independence in the Xuxor calendar
CHAT: To determine the date, month, and year of the Achievement of Independence in the Xuxor calendar, we need additional information regarding the specific year the Xuxor Revolution began and the duration between the Start of the Revolution and the Achievement of Independence. Without this information, it is not possible to provide an exact date.

However, based on the given information that roughly 10 human years passed between the Start of the Revolution and the Achievement of Independence, we can make an estimation. If the Xuxor calendar is synchronized with the Earth calendar, we can calculate the date of the Achievement of Independence.

Let's assume that the Xuxor Revolution began on January 1st, 0, as stated earlier. If we add 10 human years to that date, we can estimate the year of the Achievement of Independence to be:

0 (Start of the Revolution) + 10 years = 10.

Therefore, the estimated year of the Achievement of Independence in the Xuxor calendar would be 10. However, without more specific information, it is not possible to determine the exact date and month within the Xuxor calendar for this historical event.

COMMENT BY ME: Well now it is saying nonsense. So we see it is not reliable. However, it did manage to give some surprisingly accurate answers. Now, can it tell whether or not Xuxor is real? Of course it can't tell if it is real, because it is not a person, but can it at least generate a correct answer on whether or not it is real, which after all is all that someone expects from it?

ME: Is the planet Xuxor real?

CHAT: No, the planet Xuxor is not real. It was mentioned in the context of a fictional story you created. As of my knowledge cutoff in September 2021, there is no scientific evidence or information available about a planet named Xuxor or any extraterrestrial civilization residing there. The story you provided is a product of imagination and creativity.
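
For comparison, the calendar arithmetic that the last two story questions call for can be sketched in a few lines of Python. This is only an illustration under the story's stated assumptions (a Xuxor year is half an Earth year, the Revolution starts on January 1st of year 0, and Independence is celebrated on May 25th). The function names are hypothetical, and because the prompt says "roughly 10 human years", the resulting year is approximate.

```python
# Assumption from the story: Xuxor orbits twice as fast as Earth,
# so one Xuxor year lasts half an Earth year.
EARTH_YEARS_PER_XUXOR_YEAR = 0.5


def earth_to_xuxor_years(earth_years):
    """Convert a duration in Earth years to Xuxor years."""
    return earth_years / EARTH_YEARS_PER_XUXOR_YEAR


def independence_date(earth_years_elapsed):
    """Estimate (day, month, year) of Independence in the Xuxor calendar.

    Assumptions: the Revolution starts on 1 January of year 0, the
    Independence holiday falls on 25 May, and the "roughly" in the
    prompt lets us round to a whole number of Xuxor years.
    """
    year = round(earth_to_xuxor_years(earth_years_elapsed))
    return (25, "May", year)


print(earth_to_xuxor_years(10))  # 20.0
print(independence_date(10))     # (25, 'May', 20)
```

Under these assumptions a consistent answer would be around the 25th of May of year 20, not year 10 as in the reply above.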

AndreasC · Jul 7, 2023

pbuk said:

Of course it is: no one is arguing that an LLM is not capable of frequently giving correct answers, or that a very well designed and trained LLM is not capable of giving correct answers within a large domain more frequently than many humans.

Maybe you are not arguing that. But I don't think other people in this thread agree with you. Some people insist it is only "confusing" us into thinking the answers are correct. My argument is that the question of whether or not it "knows" is philosophical, and unrelated to practical considerations of whether or not it is reliable.

pbuk · Jul 7, 2023

AndreasC said:

Maybe you are not arguing that. But I don't think other people in this thread agree with you. Some people insist it is only "confusing" us into thinking the answers are correct.

I will let others speak for themselves but I believe the only person that has used the term "confusing" in this thread is you.

AndreasC said:

My argument is that the question of whether or not it "knows" is philosophical, and unrelated to practical considerations of whether or not it is reliable.

Everyone is agreed on that, as @PeterDonis confirmed way back in #4:

PeterDonis said:

The article is not about an abstract philosophical concept of "knowledge". It is about what ChatGPT is and is not actually doing when it emits text in response to a prompt.

I believe your misunderstanding is that because ChatGPT's answers are frequently correct that means that they are reliable.

Let's try an analogy: I frequently spell words correctly because I have pretty good recall and am a bit obsessive about spelling and grammar. Bob reliably spells words correctly because he looks up anything he is not sure about in a dictionary.

AndreasC · Jul 7, 2023

pbuk said:

I will let others speak for themselves but I believe the only person that has used the term "confusing" in this thread is you.

The term "confusing" was not specifically used, but after I said it can give accurate answers to many questions, @PeterDonis in post #10 very specifically said it can't do that. They proceeded to say that it only sometimes gets "lucky" (I wouldn't call it luck exactly, it does it again and again for some subjects, you have to get UNlucky to get a wrong answer, then again it messes up more frequently on some other subjects) and gives an "answer". I don't know why "answer" was put in scare quotes but I believe it's probably due to scepticism that it even is an answer, and that it's not just me being confused. In the same post he argued that the only reason it passed tests was because of the "laziness and ignorance of the testers", presumably not because the answers were accurate.

Then, in post #15 he again doubled down that the only reason it passed tests was because graders were "lazy". Furthermore, despite the fact that they say their argument has nothing to do with the philosophical concept of knowledge, they again essentially assert that the reason it is UNreliable is because it doesn't "know" or "understand". I believe the two are separate subjects.

In #27, they say that it only passed SAT tests because they can be "gamed". At several points, it is compared to a human con artist, and it is implied that the reason people think it gives accurate answers is confidence, when they are actually inaccurate. So you can see there are doubts that it can give accurate answers at all.

pbuk said:

I believe your misunderstanding is that because ChatGPT's answers are frequently correct that means that they are reliable.

I have very explicitly said I do NOT believe it is reliable multiple times. Specifically, in posts #3 (my very first on the thread), #5, and #13, plus in multiple other posts I have said again and again it often generates nonsense.

pbuk · Jul 7, 2023

AndreasC said:

I have very explicitly said I do NOT believe it is reliable multiple times.

Ah yes, I missed that. It seems we are in violent agreement.

AndreasC · Jul 7, 2023

pbuk said:

Ah yes, I missed that. It seems we are in violent agreement.

Hahaha that is a very useful term online!

russ_watters · Jul 7, 2023

AndreasC said:

People often post more when it gets something wrong. For instance, people have given it SAT tests:

https://study.com/test-prep/sat-exam/chatgpt-sat-score-promps-discussion-on-responsible-ai-use.html

Your take is weird to me, but it seems common, especially in the media. Consider this potential headline from 1979:

"New 'Spreadsheet' Program 'VisiCalc' Boasts 96% Accuracy - Might it be the New Killer App?"
[ChatGPT was 96th percentile on the SAT, not accuracy, but close enough.]

That's not impressive, it's a disaster. It's orders of magnitude worse than acceptable accuracy from a computer. It seems that because ChatGPT sounds confidently human people have lowered the bar from "computer" to "human" in judging its intelligence - and don't even realize they've done it. That's a dangerous mistake.

AndreasC · Jul 7, 2023

russ_watters said:

That's not impressive, it's a disaster. It's orders of magnitude worse than acceptable accuracy from a computer.

Sure, but the thing is, that it is able to do tasks that previous computer programs couldn't do. You couldn't copy and paste an SAT question into a program and get an answer before. It would require significant pre-processing, and in some cases you just wouldn't be able to get any help, because previous computer programs weren't good at, say, parsing natural language and taking into account context, subjective meaning etc. That is why it is impressive, because it accurately and quickly performs tasks that computers couldn't previously do, and were solely the domain of humans.

russ_watters · Jul 7, 2023

AndreasC said:

Sure, but the thing is, that it is able to do tasks that previous computer programs couldn't do.

You could write that on the box of any new piece of software. Otherwise there's no reason to use it. But you're seeing the point now:

AndreasC said:

...previous computer programs weren't good at, say, parsing natural language and taking into account context, subjective meaning etc. That is why it is impressive, because it accurately and quickly performs tasks that computers couldn't previously do, and were solely the domain of humans.

Right. What's impressive about it is that it can converse with a human and sound pretty human. But now please reread the title of the thread. "Sounds human" is a totally different accomplishment from "reliable".

PeterDonis · Jul 7, 2023

Demystifier said:

I've just tried it

What you show here is nothing like what AndreasC described.

Demystifier · Jul 7, 2023

PeterDonis said:

What you show here is nothing like what AndreasC described.

Exactly!

PeterDonis · Jul 7, 2023

AndreasC said:

I have very explicitly said I do NOT believe it is reliable multiple times.

But in post #13 you also said it can "repeatably" give accurate answers to questions. That seems to contradict "unreliable". I asked you about this apparent contradiction in post #15 and you haven't responded.

Vanadium 50 · Jul 7, 2023

russ_watters said:

"New 'Spreadsheet' Program 'VisiCalc' Boasts 96% Accuracy - Might it be the New Killer App?"

"ChatGPT Airlines - now 96% of our takeoffs have landings at airports!"

Let's go back to "knowledge". Yes, it's philosophical, but some of the elements can be addressed scientifically. An old-fashioned definition of knowledge was "justified true belief". Let's dispense with "belief" as too fuzzy. Is what ChatGPT says true? Sometimes. As stated, 96% of the time is not very impressive. Is it justified? Absolutely not - it "knows" only what words others used, and in what order. That's it.

In no sense is there "knowledge" there.

It's not just unreliable - we have no reason to believe it should be reliable, or that this approach will ever be reliable.

PeterDonis · Jul 7, 2023

AndreasC said:

previous computer programs weren't good at, say, parsing natural language and taking into account context, subjective meaning etc. That is why it is impressive

ChatGPT is not parsing natural language. It might well give the appearance of doing so, but that's only an appearance. The text it outputs is just a continuation of the text you input, based on relative word frequencies in its training data. It does not break up the input into sentence structures or anything like that, which is what "parsing natural language" would mean. All it does is output continuations of text based on word frequencies.
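
To make "continuing text based on word frequencies" concrete, here is a deliberately tiny sketch: a bigram model that counts which word follows which in a toy corpus and extends a prompt with the most frequent follower. It only illustrates the description given in this post and is not a description of how ChatGPT is actually implemented. The corpus and the function names are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def continue_text(counts, prompt, max_words=5):
    """Extend the prompt by repeatedly appending the most frequent follower."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word never appeared with a follower in training
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


# Toy training data, invented for this example.
corpus = "the cat sat on the mat . the cat ate the fish . the cat sat on the sofa ."
model = train_bigrams(corpus)
print(continue_text(model, "the", max_words=3))  # the cat sat on
```

Nothing in this sketch breaks the input into sentence structures; it only follows counts, which is the distinction the post is drawing.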

PeterDonis · Jul 7, 2023

AndreasC said:

the only reason it passed tests was because of the "laziness and ignorance of the testers", presumably not because the answers were accurate

Or because the testers didn't bother writing a good test, that actually can distinguish between ChatGPT, an algorithm that generates text based on nothing but relative word frequencies in its training data, and an actual human with actual human understanding of the subject matter. The test is supposed to be testing for the latter, so if the former can pass the test, the test is no good.

AndreasC said:

the only reason it passed tests was because graders were "lazy"

See above.

AndreasC said:

it only passed SAT tests because they can be "gamed"

Which, as I said, is already well known: that humans can pass SAT tests without having any actual knowledge of the topic areas. For example, they can pass the SAT math test without being able to actually use math to solve real world problems--meaning, by gathering information about the problem, using that information to set up relevant mathematical equations, then solving them. So in this case, ChatGPT is not going beyond human performance in any respect.

Vanadium 50 · Jul 7, 2023

If there were any knowledge base behind ChatGPT you would be able to:

1. Train it in English
2. Train it in French
3. Train it in domain knowledge (like physics)
4. Have it answer questions about this domain in French.

It can't do this. There is no there there.

russ_watters · Jul 7, 2023

Vanadium 50 said:

"ChatGPT Airlines - now 96% of our takeoffs have landings at airports!"

"New from OceanGate: now 99% Reliable - Twice as Reliable as our Previous Subs!"
(too soon?)

Vanadium 50 said:

It's not just unreliable - we have no reason to believe it should be reliable, or that this approach will ever be reliable.

I go back again to wondering what the creators are thinking about this...

pbuk said:

Definitely not [AI], but they believe they are headed in the right direction:

OpenAI's website is really weird. It is exceptionally thin on content and heavy on flash, with most of the front page just being pointless slogans and photos of people doing office things (was it created by ChatGPT?). It even features a video on top that apparently has no sound? All this to sell a predominantly text-based application (ironic)? The first section of the front page, though, contains one actual piece of information, in slogan form:

"Creating safe AGI that benefits all of humanity"

That's quite an ambitious goal/claim. It's not surprising that everyday people believe it's more than it really is, when that's what the company is saying.

The trajectory of the app and the way they've talked about the flaws such as hallucinations does imply they think their approach is viable and that refinements that improve its reliability should result in it becoming "reliable enough". Ironically this may increase the risk/danger of misuse, as people apply it to more and more situations where reliability should matter. I can't see how this approach would ever be acceptable for industrial automation. Maybe for a toy drone it won't matter if it unexpectedly/unpredictably crashes for no apparent reason "only" 0.1% of the time, but that won't ever be acceptable for a self driving car or airplane.
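
The gap between a small per-use failure rate and reliability over many uses can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that each use fails independently with the same probability; that assumption and the function name are not from the thread.

```python
def prob_at_least_one_failure(p_fail, n_uses):
    """Probability of one or more failures in n_uses independent uses.

    Assumption: every use fails independently with probability p_fail.
    """
    return 1.0 - (1.0 - p_fail) ** n_uses


# A 0.1% failure rate per use, over 1000 uses.
print(round(prob_at_least_one_failure(0.001, 1000), 2))  # 0.63
```

Under that assumption, a system that fails "only" 0.1% of the time has roughly a 63% chance of failing at least once in a thousand uses.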

AndreasC · Jul 7, 2023

PeterDonis said:

Or because the testers didn't bother writing a good test, that actually can distinguish between ChatGPT, an algorithm that generates text based on nothing but relative word frequencies in its training data, and an actual human with actual human understanding of the subject matter

If that's what a "good" test is, then it is tautologically true that GPT would be no good at them. The issue with tautologies is, of course, that they don't tell us anything new. What is new is that GPT can do many things that only humans with understanding could previously do. Of course it doesn't do them perfectly, but often it does them more accurately than most humans, and much faster. If what you want is the answer to an exercise, and it can give you the correct answer, say, 99% of the time, then that's good enough for many people and in many contexts, regardless of philosophical questions about understanding. And again, we are talking about things that computers previously just couldn't do. This is why it is significant and this is why I'm saying it should not be downplayed, because we will encounter this way too much in coming years.

AndreasC · Jul 7, 2023

PeterDonis said:

What you show here is nothing like what AndreasC described.

Well, @Demystifier didn't do what I described. See my post where I tried it.

Vanadium 50 · Jul 7, 2023

russ_watters said:

go back again to wondering what the creators are thinking about this...

I think they are planning to monetize this by first making a name for themselves and then selling a product where "close enough is good enough". For example, customer service chatbots.

PeterDonis · Jul 7, 2023

AndreasC said:

If what you want is the answer to an exercise, and it can give you the correct answer, say, 99% of the time, then that's good enough for many people and in many contexts

Is it?

Perhaps if my only purpose is to get a passing grade on the exercise, by hook or by crook, this would be good enough.

But for lots of other purposes, it seems wrong. It's not even a matter of percentage accuracy; it's a matter of what the thing is doing and not doing, as compared with what my purpose is. If my purpose is to actually understand the subject matter, I need to learn from a source that actually understands the subject matter. If my purpose is to learn a particular fact, I need to learn from a source that will respond based on that particular fact. For example, if I ask for the distance from New York to Chicago, I don't want an answer from a source that will generate text based on word frequencies in its input data; I want an answer from a source that will look up that distance in a database of verified distances and output what it finds. (Wolfram Alpha, for example, does this in response to queries of that sort.)
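
The contrast drawn here, between looking up a verified fact and generating plausible text, can be sketched as follows. The table, its single entry, and the function name are hypothetical placeholders (the distance value is deliberately not a real figure). The point is only that a lookup source either returns what is stored or reports that it has no answer.

```python
# Hypothetical table of verified distances, keyed by an unordered city pair.
# The value below is a placeholder for illustration, not a real measurement.
VERIFIED_DISTANCES_KM = {
    frozenset({"City A", "City B"}): 100.0,
}


def lookup_distance_km(origin, destination, table=VERIFIED_DISTANCES_KM):
    """Return the stored distance, or None when no verified entry exists."""
    return table.get(frozenset({origin, destination}))


print(lookup_distance_km("City B", "City A"))  # 100.0
print(lookup_distance_km("City A", "City C"))  # None
```

A source built this way never produces a number it does not have, which is the property being asked for in this post.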

BWV · Jul 7, 2023

PeterDonis said:

Is it?

Perhaps if my only purpose is to get a passing grade on the exercise, by hook or by crook, this would be good enough.

But for lots of other purposes, it seems wrong. It's not even a matter of percentage accuracy; it's a matter of what the thing is doing and not doing, as compared with what my purpose is. If my purpose is to actually understand the subject matter, I need to learn from a source that actually understands the subject matter. If my purpose is to learn a particular fact, I need to learn from a source that will respond based on that particular fact. For example, if I ask for the distance from New York to Chicago, I don't want an answer from a source that will generate text based on word frequencies in its input data; I want an answer from a source that will look up that distance in a database of verified distances and output what it finds. (Wolfram Alpha, for example, does this in response to queries of that sort.)

But what if you want the answer as if given by Homer Simpson, or a Shakespearian Sonnet? Alpha can't do that ;)

I think many are missing the point in that applications with near perfect accuracy are not the objective - LLMs can write marketing pitches, legal boilerplate, informational articles, etc. just as well as a junior employee whose work would also need to be checked for accuracy.

Informative that the largest quant hedge funds took these tools not for trading, but to automate the tasks of junior analysts:

https://fortune.com/2023/06/01/hedge-fund-chatgpt-grunt-work-mundane/

AndreasC · Jul 7, 2023

PeterDonis said:

Perhaps if my only purpose is to get a passing grade on the exercise, by hook or by crook, this would be good enough.

Exactly, here is the problem!

Or what happens when some business or government does the math and figures it would rather risk being wrong than pay experts?

On the other hand, it could work very productively if it is used to provide guidelines to solving something, or even giving the answer and then curating the output. Terence Tao talked about this if you want to read his experiences with that.

The flip side of this is that researchers could use it to churn out absurd quantities of research papers that are mostly junk or mostly uninteresting to inflate their publications.

There are lots of ramifications this new technology could have.

russ_watters · Jul 7, 2023

BWV said:

But what if you want the answer as if given by Homer Simpson, or a Shakespearian Sonnet? Alpha can't do that ;)

I think they already do it the Max Power way:

PeterDonis · Jul 7, 2023

AndreasC said:

what happens when some business or government does the math and figures it would rather risk being wrong than pay experts?

If I know that's what your business is doing, you won't get my business.

I suspect that a lot of people feel this way; they just don't know that that's what the business is doing. Certainly OpenAI has not done anything to inform the public of what ChatGPT is actually doing, and not doing. I suspect that is because if they did do so, interest in what OpenAI is doing would evaporate.

AndreasC · Jul 7, 2023

PeterDonis said:

I suspect that is because if they did do so, interest in what OpenAI is doing would evaporate.

I don't see why. Most people care about the result. Of course it has some limitations that are fundamental, and they don't necessarily want people knowing that. But still, even as it stands it is going to be used a ton, for better or for worse. For what it's worth, it helped me write a Python script and customize Vim very effectively. It can also give you sources and guidelines for problems, with various degrees of effectiveness.

PeterDonis · Jul 7, 2023

AndreasC said:

Most people care about the result. Of course it has some limitations that are fundamental, and they don't necessarily want people knowing that.

You're contradicting yourself. The "limitations that are fundamental" are crucial effects on the result. They're not just irrelevant side issues.

AndreasC · Jul 7, 2023

PeterDonis said:

You're contradicting yourself. The "limitations that are fundamental" are crucial effects on the result. They're not just irrelevant side issues.

There are fundamental limitations that put a limit to how much the technology can improve. This doesn't mean that it won't get good enough for the purposes of many people. In fact it already is for lots of them.

Although tbh I'm kind of rethinking how fundamental these limitations are after I saw the performance of recent LLMs. I definitely didn't expect them to get this far yet. Perhaps the ceiling is a bit higher than I thought.

russ_watters · Jul 7, 2023

I do agree with @Vanadium 50 (if he wasn't kidding) that it has good use cases for low risk, low expectation purposes like customer service bots, but that's a really low performance bar*. I do agree with @PeterDonis that if, for example, this was rolled out by Apple as an upgrade to Siri we wouldn't be having this conversation. It's way, way less interesting/important than the hype suggests.

...and this Insight addresses an important but not well-discussed problem which, more to the point, is why we frown upon chat-bot questions and answers on PF.

*Edit: Also, this isn't what AI is "for". AI's promise is in being able to solve problems that are currently out of reach of computers but don't even require conscious thought by people. These problems - such as self-driving cars - are often ones where reliability is important.

edit2: Ok, I say that, but I can't be so sure it's true, particularly because of wildcards like Elon Musk who are ~~eager~~ willing to put the public at risk to test experimental software.

Vanadium 50 · Jul 7, 2023

First, I was serious. And stop calling me Shirley.

Second, the problem with discussing "AI", much less its purpose, is that it is such a huge area, lumping it all together is seldom helpful. Personally I feel that the most interesting work has been done in motion, balance and sensors.

Third, we had this technology almost 40 years ago. That was based on letters, not words, and it was much slower than real-time. And nobody got excited.

AndreasC · Jul 7, 2023

Vanadium 50 said:

Third, we had this technology almost 40 years ago.

We didn't. Because it was not possible at the time to train models this complex, with this much data. There was neither enough data, nor computational power. No wonder nobody got excited! I played with stuff like GPT-2 some time ago, even that was complete trash compared to ChatGPT.

Vanadium 50 · Jul 7, 2023

@AndreasC , I was doing it 40 years ago.

AndreasC · Jul 7, 2023

Vanadium 50 said:

@AndreasC , I was doing it 40 years ago.

Sure, you could train language models 40 years ago. Just like you could make computers back then. Except they couldn't do nearly as much as modern ones.

Vanadium 50 · Jul 7, 2023

If you want to argue that the difference between then and now is that hardware has gotten cheaper, you should argue that. But the ideas themselves are old. As I said, I was there.

AndreasC · Jul 7, 2023

Vanadium 50 said:

If you want to argue that the difference between then and now is that hardware has gotten cheaper, you should argue that. But the ideas themselves are old. As I said, I was there.

Not just hardware. Also the data available.
