Why ChatGPT AI Is Not Reliable

Ken G · Aug 14, 2023

Motore said:

Well LLMs do not have computational algorithms (yet), they deal with text pattern recognition, so I don't know why it's so surprising they cannot do calculations.

It's because I have seen them report a Python code they used to do the calculation, and the Python code does not yield the quantitative result they report. So that's pretty odd, they seem to be able to associate their prompts with actual Python code that is correct, and still get the answer wrong.

Motore said:

Here is a proposal for a math extension for LLMs: https://aclanthology.org/2023.acl-industry.4.pdf

Yes, this is the kind of thing that is needed, and is what I'm expecting will be in place in a few years, so it seems likely that ten years from now, LLMs will be able to answer physics questions fairly well, as long as they only require associating the question with a formula without conceptual analysis first. It will then be interesting to see how much LLMs have to teach us about what we do and do not comprehend about our own physics, and what physics understanding actually is. This might be pedagogically significant for our students, or something much deeper.

Motore said:

Anyway, testing ChatGPT (or Bard) a little more, I find it useful for initial code generation, but to get a properly functioning script I found myself going to StackOverflow 70% of the time. The explanations and examples are all already there; with LLMs you have to ask a lot of questions (which means typing and waiting) and still don't get the right answers some of the time. And mind you, this is not complex code (for that, I never use LLMs), just some small scripts for everyday use.

Then the question is, why do you not use LLMs for complex code, and will that still be true in ten years? That might be the coding equivalent of using LLMs to solve physics questions, say on a graduate level final exam.

Ken G · Aug 14, 2023

AngryBeavers said:

More than I meant to reply....heh. Basically, LLMs have strengths and weaknesses, but none are going to stun an intellectual community in any area that might be relevant.

My question is, how much of this is due to the fact that these are just early generation attempts, versus how much is fundamental to the way LLMs must work? If we fix up their ability to recognize logical contradictions, and enhance their ability to do mathematical logic, will we get to a point where it is very hard to distinguish their capabilities from the capabilities of physics teachers who pose the questions in the first place? And if we did, what would that mean for our current ideas about what conceptual understanding is, since physics seems like a place where conceptual understanding plays a crucial role in achieving expertise? These kinds of AI-related questions always remind me of B. F. Skinner's great point, "The real question is not whether machines think but whether men do. The mystery which surrounds a thinking machine already surrounds a thinking man." (Or woman.)

russ_watters · Aug 14, 2023

neobaud said:

.... If it misunderstands you, you can even clarify your meaning in a way that is really only possible with another human.

PeterDonis said:

This seems to be vastly overstating ChatGPT's capabilities. You can "clarify" by giving it another input, but it won't have any semantic meaning to ChatGPT any more than your original input did.

My understanding is that it is designed to* accept criticism/clarification/correction. That makes such follow up mostly useless, since it will simply be riffing on what you tell it, regardless of whether it is accurate or not. In other words, you'll always win a debate with it, even if you're wrong.

*Whether actual design or "emergent" behavior I don't know, but I don't think it matters.

Ken G · Aug 14, 2023

Also, it should perhaps be noted that this question goes way beyond whether we should shun or accept AI in the physics classroom, the question of "what does an LLM really know" goes to the heart of what this type of AI will be useful for in the future. Let us not forget IBM's painful lesson that was Watson! They thought that because it was the best Jeopardy contestant ever, it could analyze patient data and help suggest improved cancer treatments. So far that has been a dismal failure, because of the problem of connecting the machine and its capabilities with the necessary data. A human can better tell what matters and what doesn't, and can integrate disparate forms of information, whereas Watson had vastly more computing power but could not find a way to use it, unlike if it was trying to win a chess game or a Jeopardy match, games with simple rules and relatively straightforward connections between the data. To get AI to live up to its potential, we may have to first understand better what knowledge even is, and what separates it from some vast pile of disjoint information. What will that knowledge tell us about ourselves?

PeterDonis · Aug 14, 2023

russ_watters said:

it is designed to* accept criticism/clarification/correction

No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.

Ken G · Aug 14, 2023

PeterDonis said:

No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.

Yet it gives the prompt some kind of special status, so when you correct something it said, it tends to become quite obsequious. It is not normal in human language to be that obsequious, I don't think it could get that from simply predicting the next word. When it has made a mistake and you tell it that it has, it will typically search its database a little differently, placing some kind of emphasis on your correction. If you tell it that one plus one equals three, however, it has enough contrary associations in its database that it sticks to its guns, but it will still not tell you that you have made a mistake (which surely would be the norm in its database of reactions), it will suggest you might be joking or using an alternative type of mathematics. The status it gives to the prompt must be an important element of how it works, and it is constrained to be polite in the extreme, which amounts to "riffing on" your prompt.

PeterDonis · Aug 14, 2023

Ken G said:

it gives the prompt some kind of special status, so when you correct something it said, it tends to become quite obsequious

I don't think that's anything explicitly designed, at least not in the version of ChatGPT that was reviewed in the Insights article and the Wolfram article it references. It is just a side effect of its algorithm.

PeterDonis · Aug 14, 2023

Ken G said:

it will typically search its database

It doesn't search any database; that was one of the main points of the Insights article. All it does is generate text based on relative word frequencies, using the prompt given as input as its starting point.

Ken G · Aug 14, 2023

PeterDonis said:

It doesn't search any database; that was one of the main points of the Insights article. All it does is generate text based on relative word frequencies, using the prompt given as input as its starting point.

Yes, but it has a database to inform those frequencies. That database must be "primed" in some way to establish what the ChatGPT is, and how it should relate to prompts. For example, if you prompt it with "describe yourself", it will say "I am ChatGPT, a creation of OpenAI. I'm a language model powered by the GPT-3.5 architecture, designed to understand and generate human-like text based on the input I receive. I have been trained on a diverse range of text sources up until September 2021, so I can provide information, answer questions, assist with writing, generate creative content, and more. However, it's important to note that I don't possess consciousness, emotions, or personal experiences. My responses are based on patterns in the data I've been trained on, and I aim to be a helpful and informative tool for various tasks and conversations." So that's a highly specialized set of data to look for word associations with "describe yourself," it has been trained to favor certain word frequencies in response to certain prompts.

Also, if you correct it, it will invariably apologize obsequiously. So it is in some sense "programmed to accept corrections," in the sense that it uses a word association database that expects to be corrected and is trained to respond to that in certain ways.

It would seem that its training also expects to provide certain types of responses. For example, if you just give it the one word prompt "poem", it will write a poem. Also, since you did not specify a subject, it will write a poem about poetry! I think that was a conscious decision by its programmers, there are built in expectations about what a prompt is trying to accomplish, including corrections. It could be said that ChatGPT inherits some elements of the intelligence of its trainers.

PeterDonis · Aug 14, 2023

Ken G said:

it has a database to inform those frequencies

It has a database of relative word frequencies. That's the only database it has. (At least, that's the case for the version that was reviewed in the articles being discussed here. Later versions might have changed some things.)

russ_watters · Aug 14, 2023

PeterDonis said:

No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.

I understand, but I think you might be reading past my point. As you say, it "accepts any input" regardless of what the input is. What I'm pointing out is that it therefore doesn't know if the input is wrong, and will simply riff on the wrong input. This gives the appearance to the user that they are winning a debate or providing useful clarification on which to get a better answer, when in reality they may just be steering it towards providing or expanding on a wrong answer.

AndreasC · Aug 14, 2023

PeterDonis said:

It has a database of relative word frequencies. That's the only database it has. (At least, that's the case for the version that was reviewed in the articles being discussed here. Later versions might have changed some things.)

It's not all it does though. While you are describing GPT correctly, ChatGPT has additional layers on top of that to "push" it to favor some kinds of response more than others. That's also why it's very hard to make it talk about certain things, or to say "bad" words, etc (all the things for which people "jailbreak" it), it's hard-coded to avoid them. It's not a pure word predictor. I've noticed the obsequiousness @Ken G talks about, sometimes it's very stubborn in insisting it got something wrong, to the extent that it's probably a behavior the developers coded in.

Ken G · Aug 14, 2023

PeterDonis said:

It has a database of relative word frequencies. That's the only database it has. (At least, that's the case for the version that was reviewed in the articles being discussed here. Later versions might have changed some things.)

Yes good point, it starts with a training dataset and then generates its own, such that it only needs to query its own to generate its responses. But the manner in which that dataset is trained is a key aspect, that's where the intelligence of the trainers leaves its mark. They must have had certain expectations that they built in, which end up looking like there is a difference between a prompt like "poem", that will generate a poem about poetry, versus a followup prompt like "shorten that poem". It can even count the words it uses in its answer, and prohibit certain types of answers, so it has some extra scaffolding in its training.

PeterDonis · Aug 14, 2023

russ_watters said:

it therefore doesn't know if the input is wrong, and will simply riff on the wrong input. This gives the appearance to the user that they are winning a debate or providing useful clarification on which to get a better answer, when in reality they may just be steering it towards providing or expanding on a wrong answer.

Yes, agreed.

PeterDonis · Aug 14, 2023

AndreasC said:

ChatGPT has additional layers on top of that to "push" it to favor some kinds of response more than others.

Where are these layers in the Wolfram article's description? (I am referring to the Wolfram article that is referenced in the Insights article at the start of this thread.)

Ken G · Aug 14, 2023

PeterDonis said:

Where are these layers in the Wolfram article's description? (I am referring to the Wolfram article that is referenced in the Insights article at the start of this thread.)

I presume the "loss function" used for the training, this must be the crucial place that the trainers impose their intentions onto the output. I keep coming back to the fact that you can prompt ChatGPT with the single word "poem" and it will write a poem about poetry. Surely this is not what you can get from simply taking a body of written work and trying to predict the words that come after "poem" in that body of data, because very few poems are about poetry. There must be someplace where the trainers have decided what they expect the user to want from ChatGPT, and have trained it to produce a poem when the prompt is the word "poem", and have trained it to make the poem be about something mentioned in the prompt. That would go beyond simply predicting what comes after the word "poem" in some vast body of text. The LLM would have to be trained to predict what words follow other words that also qualify as satisfying the prompt in some way.

So getting back to the issue of a prompt as a correction, the trainers expect that people will want to iterate with ChatGPT to improve its output in a session, so they will give prompts that should be interpreted as corrections. That's when ChatGPT is trained to be obsequious about doing corrections, it has to be built into the way it is trained and not just a simple algorithm for predicting text that follows the text in the prompt (again, humans are rarely so obsequious).

Ken G · Aug 14, 2023

Perhaps ChatGPT can itself give us some assistance here. When I prompted it to describe why it writes a poem about poetry when given the single word "poem", this was part of its answer:
"The process of training me involves a cost function, which is used to fine-tune my responses. During training, I'm presented with a wide variety of text and corresponding prompts, and I learn to predict the next word or phrase based on those inputs. The cost function helps adjust the model's internal parameters to minimize the difference between its predictions and the actual training data."

I think the important element of this is that it is trained on prompts and text, not just text. So it does not just predict the next word in a body of text, then see how well it did, it tries to predict responses to prompts. But a prompt is not a normal aspect of either human communication or bodies of text (if I came up to you and said "poem", you would not think you were being prompted, you would have no idea what I wanted and would probably ask me what I was talking about, but ChatGPT does not ask us what we are talking about, it is trained that everything is a prompt). So its training must look at bodies of text as if they were in response to something, and they look for some kind of correlation with something earlier that is then considered a prompt.
It says: "The prompt/response framework is a way to communicate how the model learns context and generates responses, but it doesn't necessarily represent the actual structure of the training data."
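
For concreteness, here is one way to picture the "cost function" mentioned in that answer. This is only a sketch: it assumes a cross-entropy loss, which is a common choice for next-word prediction, but the thread does not establish which loss ChatGPT actually uses. The function name and the toy probabilities are made up for illustration.

```python
import math

def cross_entropy(predicted_probs, actual_next_words):
    """Average negative log-probability assigned to the words that actually came next."""
    losses = [-math.log(p[w]) for p, w in zip(predicted_probs, actual_next_words)]
    return sum(losses) / len(losses)

# Hypothetical model output: one probability table per position in the text.
predictions = [
    {"poem": 0.5, "essay": 0.5},
    {"about": 0.25, "of": 0.75},
]
actual = ["poem", "about"]

loss = cross_entropy(predictions, actual)
print(loss)  # lower is better; training nudges the parameters to reduce this number
```

A model that put more probability on the words that really followed would score a lower loss, which is the sense in which the training "minimizes the difference between its predictions and the actual training data."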

AngryBeavers · Aug 14, 2023

I have a very difficult time wrapping my head around the idea that all it's doing is predicting text based upon its training data, and algorithms. Only because the likelihood of certain words being used together frequently enough that it becomes meaningful to a generated response has to be rather low at times. Like poem and fiduciary being used together often enough to have a connection. I am sure the algorithms are more advanced than this, but, still.

Motore · Aug 15, 2023

AngryBeavers said:

I have a very difficult time wrapping my head around the idea that all it's doing is predicting text based upon its training data, and algorithms.

To me, it is pretty intuitive that it can do that with clever algorithms (actually they don't need to be very complex) and a lot of training data. Of course the programmers can limit the responses by hard-coding certain words that the chatbot must not use, nothing new.

Ken G said:

I keep coming back to the fact that you can prompt ChatGPT with the single word "poem" and it will write a poem about poetry. Surely this is not what you can get from simply taking a body of written work and trying to predict the words that come after "poem" in that body of data, because very few poems are about poetry.

Hmm, I prompted ChatGPT with the word "poem" several times and every time it generated a random poem. It is just scrambling text so that the natural language is upheld and that it rhymes (so that it actually looks like a poem, of which there presumably have to be millions in the training data). Why would the trainers need to add anything?

Ken G said:

The LLM would have to be trained to predict what words follow other words that also qualify as satisfying the prompt in some way.

Well sure, that is how LLMs are constructed. You could construct one without a prompt and it would just write something random in a natural language at some random time. Not really useful.

Ken G said:

That's when ChatGPT is trained to be obsequious about doing corrections, it has to be built into the way it is trained and not just a simple algorithm for predicting text that follows the text in the prompt (again, humans are rarely so obsequious).

Well sure, it has to be trained, but who said otherwise? Trained on massive data, not by people. They just review the responses and give feedback so ChatGPT can optimize itself (as you can also do). At the end of the day it's just predicting which word comes next.

Ken G · Aug 15, 2023

Motore said:

Hmm, I prompted ChatGPT with the word "poem" several times and every time it generated a random poem.

I overstated when I said that the poems are "about poetry", but in a test where I started eight sessions and said "poem", five of the poems self referenced the act of writing a poem in some way. That is very unusual for poems to do, so we know they are not just cobbled together from poetry in some random kind of way. (The poems also generally are about nature, and the term "canvas" appears somewhere in almost all 8, surprisingly, so for some strange reason the training has zeroed in on a few somewhat specific themes when poetry is involved.) But the larger issue is that ChatGPT gives some kind of special significance to the prompt, it is trained in some way to treat the prompt as special and it was a bit of a slog to figure out from Wolfram's description just how that special status is enforced in the training process, apparently a crucial element is that it is trained to "model" language in a way that involves responses to prompts. ChatGPT also wouldn't explain it when I asked it. All I can say is that it appears to be a very specific type of language that it is modeling, in effect a way of predicting the next word that in some way reacts to a prompt, rather than just predicting the next word in a random body of text. (You could imagine training an LLM to do the latter, but you would not get ChatGPT that way, both Wolfram and ChatGPT itself refer to other aspects of the training process and the way the language model works but the specifics are far from clear to me.)

Motore said:

It is just scrambling text so that the natural language is upheld and that it rhymes (so that it actually looks like a poem, of which there presumably have to be millions in the training data). Why would the trainers need to add anything?

They need to add the concept of a prompt, and how to alter the training in response to that.

Motore said:

Well sure, that is how LLMs are constructed. You could construct one without a prompt and it would just write something random in a natural language at some random time. Not really useful.

Yes exactly. So we should not say the LLMs are just predicting words that come next, they are doing it in a rather specific way that gives special status to the prompt. They also appear to give special status to a prompt that they are trained to interpret as a correction. This seems to be a difference between ChatGPT and Bard, for example, because in my experience ChatGPT is trained to respond to correction in a much more obsequious way than Bard is. (For example, if you try to correct both into saying that one plus one is three, ChatGPT will say you must be using a different mathematical system, or perhaps are even making a joke (!), while Bard is far less forgiving and said "this is a nonsensical question because 1+1 cannot equal 3. If 1+1=3, then the entire concept of mathematics breaks down", which is certainly not true because I can easily imagine a mathematical system which always upticks any answer of a binary integer arithmetical operation, and mathematics in that system does not break down. Thus Bard not only fails to be obsequious, it fails to be correct in its nonobsequiousness!)

Motore said:

Well sure, it has to be trained, but who said otherwise? Trained on massive data, not by people. They just review the responses and give feedback so ChatGPT can optimize itself (as you can also do). At the end of the day it's just predicting which word comes next.

The people do the training because they decide how the training will work. So that's not just predicting what word comes next, although it is mostly that. But it is predicting what word comes next in a very carefully orchestrated environment, and Wolfram makes it clear that the people don't completely understand why certain such environments work better than others, but he describes it as an "art", and there's nothing automatic in performing an artform.

Motore · Aug 15, 2023

Ken G said:

I overstated when I said that the poems are "about poetry", but in a test where I started eight sessions and said "poem", five of the poems self referenced the act of writing a poem in some way. That is very unusual for poems to do, so we know they are not just cobbled together from poetry in some random kind of way.

Well, not exactly random; it has to follow the rules of the cost function. I still don't see anything strange about that. I can easily see most poems are about nature or love or something else. I can also see that the word "poem" is associated with the word "poetry" a lot of the time. That's why it's not surprising that such poems are written by an LLM.

Ken G said:

So we should not say the LLMs are just predicting words that come next, they are doing it in a rather specific way that gives special status to the prompt.

In such a way that satisfies the cost function and utilizes all those millions of parameters. That is how the prediction and selection of the next word is generated. Even for the prompt. You input the prompt and the LLM just needs to correlate that with the response using the cost function (very simplified of course, but still).

Ken G said:

while Bard is far less forgiving and said "this is a nonsensical question because 1+1 cannot equal 3. If 1+1=3, then the entire concept of mathematics breaks down", which is certainly not true because I can easily imagine a mathematical system which always upticks any answer of a binary integer arithmetical operation, and mathematics in that system does not break down.

I can easily see the Bard response above as being an actual response to a question on Stackexchange or Quora. And that gets added to the training data.
And sure, it is not correct, because it doesn't understand mathematics and doesn't have separate code for mathematics. It's just predicting words based on the training data.

Ken G said:

The people do the training because they decide how the training will work. So that's not just predicting what word comes next, although it is mostly that.

They made the cost function, so of course they decide which training data is acceptable, but I don't see how that has any impact on the way the LLM generates the response. As I said, they can hardcode some limitations like certain bad words or ideas and make the cost function for that minimal, so the LLM will take that first into account.
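
To make the "hardcode some limitations" idea concrete, here is a minimal sketch of one way a word filter could sit on top of a next-word predictor. It swaps in a simpler technique than adjusting the cost function: it removes banned words from the candidate list at generation time and renormalizes. Whether ChatGPT does anything like this is an assumption; the function name and the example words are hypothetical.

```python
def mask_banned(probs, banned):
    """Drop banned words and renormalize the remaining next-word probabilities."""
    allowed = {w: p for w, p in probs.items() if w not in banned}
    total = sum(allowed.values())
    if total == 0:
        raise ValueError("every candidate word is banned")
    return {w: p / total for w, p in allowed.items()}

# Hypothetical next-word probabilities from the underlying predictor.
probs = {"darn": 0.5, "gosh": 0.25, "oops": 0.25}
filtered = mask_banned(probs, banned={"darn"})
print(filtered)  # "darn" can no longer be chosen; the other two share the probability
```

The underlying predictor is untouched here, so this still counts as "predicting which word comes next", just from a restricted menu.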

Ken G said:

But it is predicting what word comes next in a very carefully orchestrated environment

Orchestrated by who? The parameters (weights) are optimized by the algorithm itself based on the training data. As is well known, with so many parameters it's almost impossible to analyze exactly the decision process the LLM follows to generate a specific response.

I am not saying it's easy to make such an LLM, of course it's not. I just don't see anything other than word prediction based on an enormous data set, an enormous parameter space and a complex cost function.

Ken G · Aug 15, 2023

Motore said:

Well, not exactly random; it has to follow the rules of the cost function. I still don't see anything strange about that. I can easily see most poems are about nature or love or something else. I can also see that the word "poem" is associated with the word "poetry" a lot of the time. That's why it's not surprising that such poems are written by an LLM.

I'm trying to understand the special relationship between the prompts and the predictions of text continuations, which must be established not by the dataset on which the LLM is trained, but rather by the architecture of the training itself. This is where we will find the concept of a "corrective" prompt, which clearly has a special status in how ChatGPT operates (in the sense that ChatGPT will respond quite differently to a prompt that corrects an error it made, versus some completely new prompt. It seems to expect that you will correct it, the trainers must have expected that it would make mistakes and need corrective prompts, and the training architecture is clearly set up to accomplish that, and in a different way for ChatGPT than for Bard.)

Motore said:

Orchestrated by who?

The people that set up the training environment, in particular the way prompts are associated with text predictions. It would be easy to train a new LLM based on a vast dataset of users interacting with previous LLMs, because that data would already have the structure of prompt/response, so you could train your new LLM to predict how the other LLMs reacted to their prompts. But if you are using the internet as your database, not past LLM sessions, you have to set up some new way to connect prompts to responses, and then train it to that.

For example, consider the single prompt "poem." You could imagine training an LLM to write essays about poems when prompted like that, or you could train an LLM to actually write a poem in response to that prompt. It seems to me this must be a choice of the training environment, it cannot just work out that if you use the internet as a database, you always end up with LLMs that write poems to this prompt. That must be a trainer choice, orchestrated by the LLM creators, to create certain expectations about what is the purpose of the LLM. That must also relate to how the LLM will react to corrective prompts, and how obsequious it will be. Again, I have found ChatGPT to be way more obsequious to corrections than Bard, even though they are similar LLMs with similar goals and using similar source material. This has to be the fingerprints of the trainers, whose fingerprints are on the training, and that's where the "art" comes in.

Ken G · Aug 15, 2023

Here is a good example of what I'm talking about. When I used the prompt "essay poem" to ChatGPT, to see if it would write an essay about poems or a poem about essays, this is what I got:
"Could you please clarify whether you're looking for an essay about poetry or a poem itself?" It then asked me for further details about what I was looking for. The exact same prompt to Bard gave me, "An Essay in Verse. This sonnet is an essay..." and it wrote a poem about poetry and why it is kind of like an essay. So that is a completely different strategy about how to interpret the prompt, and it must be a function of the training architecture, probably quite a conscious decision by the trainers (because they would have had lots of experience with how their LLM reacts to prompts, and would have tinkered with it to get what they were looking for).

Then I did "poem essay" in a new session to both, and again ChatGPT asked for clarification about what I wanted (imagine the training architecture needed to achieve that result), whereas Bard wrote an essay about poetry! So Bard grants significance to the order of the words in the prompt, whereas ChatGPT does not, so ChatGPT requires further clarification to resolve the ambiguity.

What is also interesting is that when I told ChatGPT that when I give a prompt of the form "X essay", I want an essay on the topic of X, it then gave me an essay on poetry (since I had already prompted it with "poem essay.") But then when I said "essay poem" to see if it would understand that is of the form "Y poem", it gave me yet another essay about poetry. So it "understood" that "X" in my previous prompt meant "any topic", but it did not understand that I was saying the order mattered. When I prompted it with "essay and also a poem", it gave me first an essay and then a poem, so it understood that "also" in a prompt means "break the prompt into two parts and satisfy both." Wolfram talked about the crucial importance of "modeling language", I think this is a good example, the LLM must make a model of its prompt that includes ideas like "also" means "break the prompt into two separate prompts." The modeling aspect is human supplied, it is not an automatic aspect of the training process.

Vanadium 50 · Aug 15, 2023

Ken G said:

I'm trying to understand the special relationship between the prompts and the predictions

This is the more interesting problem actually.

Let's take a step back. ChatGPT essentially calculates the probability P(x) that the next word is x, given the last N words were what they are. That's it. The rest are implementation details.

The easiest way to "seed" this on a question is to rewrite the question as a statement and to use that as the first N words of the answer. (Possibly keeping them, possibly dropping them.) I have no idea if this particular piece of code does it this way or some other way - it's just the easiest, and it's been around for many, many decades.
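
The P(x) idea above can be sketched in a few lines. This is the decades-old n-gram version, estimated by counting, and not a claim about how ChatGPT is implemented (it uses a neural network rather than a lookup table). The function names and the toy corpus are invented for the example.

```python
from collections import Counter, defaultdict

def train_ngram(words, n):
    """Count which word follows each run of n consecutive words."""
    counts = defaultdict(Counter)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n])
        counts[context][words[i + n]] += 1
    return counts

def next_word_probs(counts, context):
    """Return P(x | context) as relative frequencies; empty if the context was never seen."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return {word: c / total for word, c in followers.items()}

corpus = "the cat sat on the mat and the cat ran".split()
model = train_ngram(corpus, n=2)
print(next_word_probs(model, ["the", "cat"]))  # {'sat': 0.5, 'ran': 0.5}
```

"Seeding" in this picture just means choosing the first N words yourself and letting the table pick what follows. A context that never appeared in the corpus returns nothing, which is one reason a pure lookup table does not scale to long contexts.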

Ken G · Aug 15, 2023

I didn't get through the entire Wolfram article, so perhaps he got into this in more detail, but he did say that the situation was (it seemed to me) a bit like programming a computer to play chess. Bad chess programs take the board position and look ahead three or four moves, searching all possibilities, and maximizing the board position down the road. But there are just way too many possibilities to search if you want to go farther than that, whereas human chess masters see at times a dozen moves ahead, because they know which avenues to search. They carry with them a kind of model of how a chess game works, and use it to reduce the space of possibilities. The great chess programs combine searching power with modeling ability, and are famous for winning games in over 100 moves without ever losing to a human (and you know they have no idea how to search that far in the future, so they must have better models of how chess games work).
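
A rough back-of-the-envelope count shows why exhaustive lookahead runs out of steam. The sketch assumes about 35 legal moves per position, a commonly quoted average for chess that is not taken from the Wolfram article, and treats each "move" as a single turn by one player.

```python
BRANCHING = 35  # assumed average number of legal moves per position

def positions_searched(depth, branching=BRANCHING):
    """Number of move sequences an exhaustive search examines to look `depth` moves ahead."""
    return branching ** depth

shallow = positions_searched(4)
deep = positions_searched(12)
print(f"{shallow:,} sequences at depth 4")   # 1,500,625
print(f"{deep:.1e} sequences at depth 12")
```

Four moves ahead is about a million and a half sequences, which is easy for a computer. Twelve moves ahead is more than a million million times that, so anything that sees that far has to be pruning the search with some kind of model.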

Wolfram says that LLMs are like that, because there are way too many possible combinations of words to look back far at all, when trying to predict the next word. It just wouldn't work at all, unless they had a very good ability to model language, thereby vastly reducing the space of potential words they needed to include in their prediction process. I think the most substantial point that Wolfram made is that there is a kind of tradeoff between what seemed to me like accuracy (which is a bit like completeness) versus span (which is a bit like ability to reduce the search space to increase its reach). He said that irreducible computations are very reliable, but very slow because they have to do all the necessary computations (and these are normally what computers are very good at but it would never work for an LLM or a chess program), whereas modeling ability is only as reliable as the model (hence the accuracy problems of ChatGPT) but is way faster and way more able to make predictions that cross a larger span of text (which is of course essential for maintaining any kind of coherent train of thought when doing language).

So I believe this is very much the kind of "art of training" that Wolfram talks about, how to navigate that tradeoff. The surprise is that it is possible at all, albeit barely it seems, in the sense that the LLM can be trained to have just enough word range to maintain a coherent argument in response to a prompt that was many words in the past, yet still have some reasonably useful level of accuracy. That the accuracy level cannot be higher, however, would seem to be the reason that the training architecture is set up to accommodate an expectation that the user is going to be making followup corrections, or at least further guidance, in a series of prompts. The language modeling capacity must include many other bells and whistles, such that it can give a prominent place in whatever cost function it used in its training for key words in the prompt (like you can tell ChatGPT exactly how many lines to put in the poem, and it will try pretty hard to do it (though it won't succeed, remember it struggles with completeness), even if the prompt contains way more words than the program is capable of correlating the answer with).

Motore · Aug 16, 2023

Ken G said:

What is also interesting is that when I told ChatGPT that a prompt of the form "X essay" means I want an essay on the topic of X, it gave me an essay on poetry (since I had already prompted it with "poem essay"). But when I then said "essay poem" to see if it would understand that this is of the form "Y poem", it gave me yet another essay about poetry. So it "understood" that "X" in my previous prompt meant "any topic", but it did not understand that I was saying the order mattered.

Why is that so interesting? The characters "x" and "y" are often associated with a variable. And yes, it doesn't understand what you are asking it (as we have said here ad nauseam already); it's just checking whether the "poem essay" association results in a lower cost function than "essay poem" (based on the training data) and then selecting the lower one. As I said, there are millions of parameters to make a decision with, so it's a little more intricate than that, but the basic idea is, at least for me, quite intuitive.

Ken G said:

Bad chess programs take the board position and look ahead three or four moves, searching all possibilities, and maximizing the board position down the road. But there are just way too many possibilities to search if you want to go farther than that, whereas human chess masters see at times a dozen moves ahead, because they know which avenues to search.

Well, a normal brute-force chess program only looks X moves ahead. A more sophisticated brute-force chess program also assigns different values to different chess pieces and optimizes the selection of the next move. This is the case for Stockfish, for example, and it is also the style of most modern human chess players.
AlphaZero, on the other hand, utilizes deep learning and neural networks, teaching itself the best strategy to win by playing against itself. It still looks a lot of moves ahead, but compared to Stockfish it needs to check a couple of orders of magnitude fewer moves; it beat Stockfish many times, the other games were draws, and it never lost.
Chess players describe AlphaZero's play as more "human", so perhaps there will be a shift in style of playing because of AlphaZero.
Anyway, even the most basic chess program will beat a chess grandmaster given enough computing power.
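To make "looks X moves ahead" concrete, here is a minimal sketch of a depth-limited brute-force search in Python. Chess is far too large for a short example, so it uses a toy game chosen purely for illustration (players alternately take 1 to 3 stones, and whoever takes the last stone wins); the names `negamax` and `best_move` are made up here. Where this sketch returns 0 at the search horizon, a chess program would return a heuristic evaluation such as a sum of piece values.

```python
def negamax(stones, depth):
    """Score the position for the player to move: +1 win, -1 loss, 0 unknown."""
    if stones == 0:
        return -1  # the opponent just took the last stone, so we lost
    if depth <= 0:
        return 0   # search horizon reached: no information about the outcome
    # Try every legal move; our score is the negative of the opponent's score.
    return max(-negamax(stones - take, depth - 1)
               for take in (1, 2, 3) if take <= stones)

def best_move(stones, depth):
    """Pick the move that leaves the opponent in the worst-scoring position."""
    moves = [take for take in (1, 2, 3) if take <= stones]
    return max(moves, key=lambda take: -negamax(stones - take, depth - 1))

shallow = best_move(10, 2)   # horizon too short: every move scores 0
deep = best_move(10, 10)     # deep enough to search the game to the end
```

With a two-move horizon every option looks the same and the program just takes the first one, while the full-depth search finds the move that leaves the opponent a multiple of four stones. The cost of the deeper search grows exponentially with depth, which is the reason real programs need either pruning or a better model of which lines are worth searching.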

Ken G said:

Wolfram says that LLMs are like that, because there are way too many possible combinations of words to look back far at all, when trying to predict the next word. It just wouldn't work at all, unless they had a very good ability to model language, thereby vastly reducing the space of potential words they needed to include in their prediction process.

That is precisely what deep neural networks do. They don't search every possible combination (that wouldn't even make sense); they optimize the selection with parameters that have been weighted accordingly based on the training data. If you have so much natural language text to train on, it is inevitable that the algorithm will be fluent in natural language. It cannot just spout random sentences, because random sentences are not prevalent in the training data.

Do not forget that when Tay (the Microsoft chatbot AI) launched on Twitter, it quickly became racist and tweeted offensive language, because its training data was tweets and of course it had no limits implemented.
Again, nothing particularly new here.
I still haven't found anything to show me that ChatGPT or Bard don't predict the next words based on the cost function.
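For readers unsure what a "cost function" for next-word prediction looks like, here is a minimal Python sketch of the textbook cross-entropy cost for a single prediction: the negative log of the probability the model gave to the word that actually came next in the training text. The probabilities and the name `cross_entropy` are invented for illustration, and nothing here reflects the specific training details of ChatGPT or Bard.

```python
import math

def cross_entropy(predicted_probs, actual_next_word):
    """Cost of one prediction: -log P(word that actually came next)."""
    p = predicted_probs.get(actual_next_word, 0.0)
    return math.inf if p == 0.0 else -math.log(p)

# Hypothetical model outputs for the context "the capital of france is".
confident_model = {"paris": 0.9, "rome": 0.05, "london": 0.05}
unsure_model = {"paris": 0.4, "rome": 0.3, "london": 0.3}

loss_confident = cross_entropy(confident_model, "paris")
loss_unsure = cross_entropy(unsure_model, "paris")
```

Training adjusts the parameters to lower this cost averaged over the training text, so continuations that are common in the data end up with high probability. Note that the cost is evaluated during training; when answering a prompt, the model only produces the probabilities.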

Ken G · Aug 16, 2023

Motore said:

I still haven't found anything to show me that ChatGPT or Bard don't predict the next words based on the cost function.

Yes, LLMs predict the next words based on their training, and that training involves a cost function. But it involves much more than that (why else would Wolfram describe the whole escapade as an "art"), and it is that "much more" that is of interest. It involves a whole series of very interesting and complex choices by the humans who designed the training architecture, and that is what we are trying to understand. One particular example of this is, it is clear that ChatGPT and Bard handle ambiguous prompts quite differently, and corrective prompts quite differently also. So the question is, why is this? I suspect it reflects different choices in the training architecture, because it doesn't seem to be due to any differences in the database they are trained on. There must have been times when the human trainers decided they were, or were not, getting the behavior they desired, and made various adjustments in response to that, but what those adjustments are might fall under the heading of proprietary details, that's the part I'm not sure about. Certainly Wolfram alludes to a kind of tradeoff between accuracy in following a chain of logic, versus flexibility in terms of being able to handle a wide array of linguistic challenges, which is why ChatGPT is somewhat able to do both mathematical calculations and poetry writing, for example, but is not great at either.

Motore · Aug 16, 2023

Ken G said:

But it involves much more than that (why else would Wolfram describe the whole escapade as an "art"), and it is that "much more" that is of interest.

Yes, that is an emergent property of LLMs arising from those millions of parameters and millions of decision tree branches, which is analogous to consciousness emerging from billions of neural synapses. That is indeed interesting, but I thought it was not the topic we are on now.

Ken G said:

It involves a whole series of very interesting and complex choices by the humans who designed the training architecture, and that is what we are trying to understand.

And I think those choices are not that complex (not as much as you are suggesting) and the fascinating thing about LLMs is an emergent behaviour out of "simple" coding rules.

Ken G said:

One particular example of this is, it is clear that ChatGPT and Bard handle ambiguous prompts quite differently, and corrective prompts quite differently also. So the question is, why is this?

Why can't it simply be a different number of parameters, a different cost function, or different training data (which can be vastly different)? Of course they are also different language models, whose details are of course proprietary. Not a good reference, but still:
https://tech.co/news/google-bard-vs-chatgpt

Ken G said:

which is why ChatGPT is somewhat able to do both mathematical calculations and poetry writing, for example, but is not great at either.

The mathematics can only be correct if the training data is correct or there is an additional math algorithm implemented into ChatGPT (which by now it could possibly be). But I wouldn't trust it.

Ken G · Aug 16, 2023

Motore said:

And I think those choices are not that complex (not as much as you are suggesting) and the fascinating thing about LLMs is an emergent behaviour out of "simple" coding rules.

I agree that it is remarkable how complex behaviors emerge from seemingly simple coding rules; nevertheless, it is the manipulation of those coding rules that has a large impact, not just the training database choices. The human trainers decide the coding rules, and the training database, with some purpose in mind, and that purpose leaves its mark on the outcome in interesting ways. One very important difference is the language model that is used, something that Wolfram emphasizes is absolutely key. The article you referenced mentioned that when Bard switched from LaMDA to PaLM 2, it got much better at writing code. So someone has to develop the language model that makes the training possible, and that is one place where human intelligence enters the question (until they create LLMs that can create or at least iterate language models). This discussion started when we talked about how LLMs handle prompts that are in the form of corrections (an important way to interact with LLMs if you want better answers), and it seems likely to me that the language model will have a profound effect on how prompts are interpreted, but I don't really know. I do think it's pretty clear that ChatGPT is intentionally programmed, by humans, to treat corrective prompts more obsequiously than Bard is.

Motore said:

Why can't it simply be a different number of parameters, a different cost function, or different training data (which can be vastly different)? Of course they are also different language models, whose details are of course proprietary. Not a good reference, but still:
https://tech.co/news/google-bard-vs-chatgpt

It could be a combination of all those things, but my interest is in the human choices, such as the language model used, and the way the training is supervised to achieve particular goals. (For example, how will monetizability affect future training of LLMs?)

Motore said:

The mathematics can only be correct if the training data is correct or there is an additional math algorithm implemented into ChatGPT (which by now it could possibly be). But I wouldn't trust it.

What's odd is that sometimes the LLMs invoke Python code to explain how they carry out mathematical calculations, but then they don't actually run the code, because the outcome they report is incorrect! I think you can't ask an LLM how it gets to its answer, because it just predicts words rather than actually tracking its own path. It seems not to invoke any concept of itself or any sort of internal space where it knows things, even though it invokes empty semantics like "I understand" and "I think." So it responds a bit like a salesperson: if you ask it for the reasons behind its statements, it just gives you something that sounds good, but it's not the reason.
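One practical way to catch this mismatch is to actually run the code the LLM quotes and compare the result with the number it reports. Below is a minimal Python sketch of that check; the helper name `check_reported_result`, the quoted snippet, and the "reported" value 14.41 are all hypothetical, made up to illustrate the idea rather than taken from a real ChatGPT session.

```python
def check_reported_result(code, result_name, reported, tol=1e-6):
    """Run the quoted code and compare its result with the reported number."""
    namespace = {}
    exec(code, namespace)  # only do this with code you have read and trust
    actual = namespace[result_name]
    return actual, abs(actual - reported) <= tol

# Hypothetical snippet an LLM might quote alongside its answer.
quoted_code = "import math\nresult = math.sqrt(2) * 10\n"

# Hypothetical number the LLM claims the snippet produced.
actual, matches = check_reported_result(quoted_code, "result", reported=14.41)
```

Here the code itself is correct, but the reported number does not match what the code produces when executed, which is consistent with the number having been predicted as text rather than computed.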

neobaud · Aug 27, 2023

PeterDonis said:

But it doesn't "interpret" at all. It has no semantics. This seems to be vastly overstating ChatGPT's capabilities. You can "clarify" by giving it another input, but it won't have any semantic meaning to ChatGPT any more than your original input did.

Are you taking issue with the word interpret? Regardless of how it gets there, questions are interpreted. Otherwise the responses would not be useful. The user can also clarify otherwise it would be totally lost with each new user input. I think these abilities are emergent. See below for an example.

Anyway, I am responding to the point of your article. I am promoting a use model where you don't fully trust the responses but use them as a guide for finding information quickly.

PeterDonis · Aug 27, 2023

neobaud said:

Are you taking issue with the word interpret?

To the extent that it implies matching up the question with some kind of semantic model of the world, yes. ChatGPT does not do that; it has no such semantic model of the world. All it has is relative word frequencies from its training data.

AngryBeavers · Aug 27, 2023

I hope that this post will not be flagged, as I feel it is keeping to the topic, but the way I understand it, and if Bard is not having a session of lies with me, it does have more than "relative word frequencies from its training data". It is not ChatGPT, but it serves the same purpose and has a similar process from code to end user.

Anyway, if you ask in the right way, and whatever LLM you're using has a good rapport with you (yes, it can appear to be temperamental, and withhold information if it, by whatever standard it makes the determination, thinks you are a jerk. Don't believe me? Be rude and see how helpful it becomes.), you can get interesting information. I say interesting because, well, it is. But it could also not be entirely accurate. Someone more skilled than myself would have to weigh in. I tend to think that Bard has got the main points correct, perhaps even the small details. Dunno.

PeterDonis · Aug 27, 2023

AngryBeavers said:

It is not ChatGPT

And that means the article that this thread is about is not about it, so it is off topic here. If you want to discuss Bard, please start a separate thread.

AngryBeavers · Aug 27, 2023

That's a bit nitpicky in my opinion, as both are LLMs and function the same. If you want to understand an LLM, do not limit yourself to simply one, such as ChatGPT, especially when there are so many others. There is something to be said for starting small: to better understand bread, one must first understand flour. But at the end of the day, you won't learn much about bread from only studying flour.

No disrespect.

PeterDonis · Aug 27, 2023

AngryBeavers said:

That's a bit nitpicky in my opinion

Look at the title of the thread and the article it references. It is not about "LLMs" in general. It is specifically about ChatGPT. Again, if you want to discuss some other LLM like Bard, or LLMs in general, please start a separate thread.
