Can the probability of specific tweets be accurately calculated?

  • Thread starter: DAJ
In summary, the conversation discusses the probability of specific tweets and the challenges of calculating this probability given the large number of possible arrangements of characters. The conversation also touches on the use of grammar and language processing in generating meaningful tweets.
  • #1
DAJ
Hello PF,

I have a question.
I am an artist working with big numbers and language. I created a twitter account that will post all possible tweets. (Edit: link removed)

I am interested in calculating the probability of specific tweets. For example: what is the probability that my next tweet will be "End here. Us then. Finn, again! Take. Bussoftlhee, mememormee!" or just gibberish like "j^F9c@# 64l["? Is this doable? I want to use this to reduce other tweets on Twitter to probabilities; I like the idea of converting meaningful language into a number. Also, how do I calculate the probability of a specific sequence of tweets?

And ok... I realize that these numbers are really really big, that the probability is basically zero, but I am not interested in reality.

THANKS!
 
  • #2
DAJ said:
Hello PF,

I am interested in calculating the probability of specific tweets. For example: what is the probability that my next tweet will be "End here. Us then. Finn, again! Take. Bussoftlhee, mememormee!" or just gibberish like "j^F9c@# 64l["? Is this doable? I want to use this to reduce other tweets on Twitter to probabilities; I like the idea of converting meaningful language into a number. Also, how do I calculate the probability of a specific sequence of tweets?

And ok... I realize that these numbers are really really big, that the probability is basically zero, but I am not interested in reality.

THANKS!

If you're asking about the number of possible arrangements of characters in a 140 character tweet, the answer is fairly straightforward. Given an alphabet of k characters (including spaces), the number of possible arrangements is [itex] k^{140} [/itex]. Of course, most of these arrangements will be nonsense. The probability of any specific sequence is just [itex] 1/k^{140}[/itex] assuming every character has an equal probability of occurring. Obviously the problem is more complicated if this assumption doesn't hold.
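The size of [itex]k^{140}[/itex] is easy to see by computing it exactly; Python integers have arbitrary precision, so a short sketch works directly (the alphabet size k = 70 below is just an assumed value for illustration):

```python
from fractions import Fraction

k = 70    # assumed alphabet size, for illustration only
n = 140   # tweet length limit at the time of this thread

arrangements = k ** n                   # number of possible 140-character strings
p_specific = Fraction(1, arrangements)  # exact probability of any one specific string

print(len(str(arrangements)))           # 70^140 is a 259-digit number
```

Since the tweets are assumed independent, the probability of a specific sequence of m tweets is just `p_specific ** m`.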
 
  • #3
Even if we assume that the probability of each character occurring is the same and that the probabilities are independent of the other characters, we still need to consider tweets with fewer than 140 characters. E.g., there are [itex]k^{100}[/itex] 100-character tweets.
Therefore, the total number of all possible non-empty tweets is [itex]\sum^{140}_{i=1} k^i[/itex], and the probability of observing any given tweet is then [itex]1/\sum^{140}_{i=1} k^i[/itex].
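This count can be checked numerically; the sketch below again assumes k = 70 and also verifies the geometric-series closed form of the sum:

```python
k = 70  # assumed alphabet size

# All non-empty tweets of length 1 through 140:
total = sum(k ** i for i in range(1, 141))

# The same count via the closed form of the geometric series:
closed_form = k * (k ** 140 - 1) // (k - 1)

print(total == closed_form)  # True
```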
 
  • #4
SW VandeCarr said:
If you're asking about the number of possible arrangements of characters in a 140 character tweet, the answer is fairly straightforward. Given an alphabet of k characters (including spaces), the number of possible arrangements is [itex] k^{140} [/itex]. Of course, most of these arrangements will be nonsense. The probability of any specific sequence is just [itex] 1/k^{140}[/itex] assuming every character has an equal probability of occurring. Obviously the problem is more complicated if this assumption doesn't hold.

Don't forget that URLs of the form http://anything are automatically shortened by Twitter. See http://support.twitter.com/entries/109623.

This is now a much more interesting problem, since you have to calculate the probability that a random string contains a syntactically correct url.
 
  • #5
oleador said:
Even if we assume that the probability of each character occurring is the same and that the probabilities are independent of the other characters, we still need to consider tweets with fewer than 140 characters. E.g., there are [itex]k^{100}[/itex] 100-character tweets.
Therefore, the total number of all possible non-empty tweets is [itex]\sum^{140}_{i=1} k^i[/itex], and the probability of observing any given tweet is then [itex]1/\sum^{140}_{i=1} k^i[/itex].

I believe it can be shown that, by including a space as a character, my formulation is equivalent to yours. For a string of 140, the probability that the last character is a space is 1/k, for the last two characters being spaces, [itex]P=1/k^2[/itex], ...,for the "last" 140 characters, [itex]P=1/k^{140}[/itex]. This at least was my intent in including an empty space character.
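This equivalence can be brute-force checked on a toy alphabet: strings of length at most n over k characters correspond exactly to length-n strings over k + 1 characters in which the extra padding character appears only as a suffix. A small sketch (the toy values k = 3, n = 4 are my own choice):

```python
from itertools import product

k, n = 3, 4
alphabet = ['a', 'b', 'c']  # toy alphabet of k characters
PAD = '_'                   # stands in for the trailing-space character

def pad_only_at_end(s):
    """True if every PAD in s forms an uninterrupted suffix."""
    return PAD not in s.rstrip(PAD)

# Length-n strings over the padded alphabet, PAD allowed only as a suffix:
padded = sum(1 for chars in product(alphabet + [PAD], repeat=n)
             if pad_only_at_end(''.join(chars)))

# Direct count of strings of length 0..n over the unpadded alphabet:
direct = sum(k ** i for i in range(n + 1))

print(padded, direct)  # both 121: the two formulations agree
```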
 
  • #6
SW VandeCarr said:
I believe it can be shown that, by including a space as a character, my formulation is equivalent to yours. For a string of 140, the probability that the last character is a space is 1/k, for the last two characters being spaces, [itex]P=1/k^2[/itex], ...,for the "last" 140 characters, [itex]P=1/k^{140}[/itex]. This at least was my intent in including an empty space character.

My bad. You are totally right. Using my approach the tweet "good_morning" is different from the tweet "good_morning_________", where "_" stands for space. Clearly, this does not make sense in this setting.
 
  • #7
oleador said:
My bad. You are totally right. Using my approach the tweet "good_morning" is different from the tweet "good_morning_________", where "_" stands for space. Clearly, this does not make sense in this setting.

No problem. Unfortunately, real language processing is orders of magnitude more complicated and involves inputting huge amounts of data regarding allowable strings and syntax. English is more difficult than continental European languages because of its "quaint" spelling. Add to that the highly abbreviated, non-standard language used in tweets and you have a real challenge. (Although some kind of standardized English-based "Tweetish" would probably be easier to process than the standard dialect.)
 
  • #8
Take a look at grammars and construct of realizations of a particular grammar. Then you can supply a dictionary, a set of conditional probability distributions and then you can generate the things using a random number generator for the distributions and grammar distributions.

The grammars will be Markovian in nature, and the specificity of this will depend on how you arrange the tags, how they are linked together, and their internal structure versus the global structure.
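A minimal sketch of this idea in Python, using a probabilistic context-free grammar whose tags expand according to supplied weighted productions. The tags, dictionary words, and probabilities below are invented purely for illustration:

```python
import random

# A toy probabilistic grammar: each tag expands to one of several
# weighted alternatives. All tags, words, and weights are made up.
GRAMMAR = {
    'S':  [(['NP', 'VP'], 1.0)],
    'NP': [(['the', 'N'], 0.7), (['N'], 0.3)],
    'VP': [(['V', 'NP'], 0.6), (['V'], 0.4)],
    'N':  [(['cat'], 0.5), (['tweet'], 0.5)],
    'V':  [(['posts'], 0.5), (['reads'], 0.5)],
}

def expand(tag):
    """Recursively expand a tag by sampling one weighted production."""
    if tag not in GRAMMAR:
        return [tag]  # terminal word from the dictionary
    rules, weights = zip(*GRAMMAR[tag])
    rule = random.choices(rules, weights=weights)[0]
    return [word for t in rule for word in expand(t)]

print(' '.join(expand('S')))
```

Every string this produces is a valid realization of the grammar, and its probability is the product of the weights used along the derivation.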
 
  • #9
chiro said:
Take a look at grammars and construct of realizations of a particular grammar. Then you can supply a dictionary, a set of conditional probability distributions and then you can generate the things using a random number generator for the distributions and grammar distributions.

The grammars will be Markovian in nature, and the specificity of this will depend on how you arrange the tags, how they are linked together, and their internal structure versus the global structure.

If I understand you, generating grammatically correct sentences isn't equivalent to generating sensible sentences. "He baked a ward with Venus and bled incoherently."
 
  • #10
SW VandeCarr said:
If I understand you, generating grammatically correct sentences isn't equivalent to generating sensible sentences. "He baked a ward with Venus and bled incoherently."

You can add as much constraint as you want with the grammar: it doesn't have to correspond to a normal spoken or written version we use: you can add features of txtspeak and any other kind of realization you want to include.

Don't confuse grammar with English grammar: it's a general grammatical structure.
 
  • #11
chiro said:
You can add as much constraint as you want with the grammar: it doesn't have to correspond to a normal spoken or written version we use: you can add features of txtspeak and any other kind of realization you want to include.

Don't confuse grammar with English grammar: it's a general grammatical structure.

My point was that if we want to consider only tweets that make sense, as opposed to random sequences of characters, we have to consider both grammar and semantics. There are programs that can do this to a limited degree, but as far as I know, there's no feasible way to assuredly obtain every sensible statement within the 140-character limit, regardless of which non-ideographic written general-purpose language you choose. Regarding ideographic scripts (e.g. Mandarin), I have no idea.
 
  • #12
SW VandeCarr said:
My point was if we want to consider only tweets that make sense, as opposed to random sequences of characters, we have to consider both grammar and semantics. There are programs that can do this to a limited degree, but as far as I know, there's no feasible way to assuredly obtain every sensible statement within the 140 character limit regardless of whatever non-ideographic written general purpose language you choose. Regarding ideographic texts (ie Mandarin), I have no idea.

In terms of the actual grammatical structure (the tags, their relationships to other tags, and the overall structure), semantics just adds to the structure.

The grammar can be as detailed and as complex as you want; greater complexity allows possibilities you could not otherwise have and gives you more control than you would have at lower complexity, but again it's all in the grammar definition.

As an example of what I mean at its most extreme, you could have all possible sentences in a tag each and then create an output tag that is basically an XOR statement of all the possible leaf tag definitions. Although you wouldn't do this, the point is that the grammatical structure can generate whatever you want it to generate.

Of course you wouldn't do this: you would get a linguist to specify the semantic and syntactic issues to generate the final grammar, which would be optimal in terms of description. In other words, you want to generate a grammar of minimum complexity while retaining all the semantic and syntactic information for the valid realizations; you are solving a kind of optimization problem with the constraints determined by the syntax, semantics, and other relevant information that a linguistics specialist would supply.

This is a language-independent phenomenon: you could apply it to Mandarin, just as you can apply it to representing the data structure of a bitmap, just as for specifying English text. As long as the alphabet is quantized (and you could extend the idea to a non-quantized alphabet), the idea doesn't change.
 
  • #13
DAJ said:
And ok... I realize that these numbers are really really big, that the probability is basically zero, but I am not interested in reality.
I'm not sure what you're trying to do, but that sounds like a mistake. The numbers are absolutely ridiculous.

I don't know what characters are allowed in a tweet, so I'll guess that there are 70 of them (26 lowercase letters, 26 uppercase letters, 10 digits, and a few non-alphabetic symbols). So there are 70^140 ≈ 2.05932837 × 10^258 possible tweets. For comparison, the current age of the universe (≈13.7 billion years) is less than 10^18 seconds.

If you could generate a billion tweets per second for 13.7 billion years, then you will have generated about 4.32 × 10^26 messages. That's a lot, right? But the number of tweets you still haven't generated is approximately

2.05932837*10^258 - 4.32*10^26 = 2.05932837*10^258.

So the number of tweets you still haven't generated is essentially unchanged...after a billion tweets per second for 13.7 billion years.

How is this possible? 2.05932837*10^258 is a 259-digit number that starts with 2059328370000000000 (240 more zeroes after that). The computer has obviously rounded off to 9 significant figures. When we subtract the 27 digit number 4.32*10^26 from that, we get a 259-digit number that starts with 20593283699999999999 and then has nothing but nines until the last 28 digits. So when the computer displays the answer of the subtraction it rounds off 2.0593283699...(220 more nines, followed by 28 more digits) to 2.05932837. The error introduced by this roundoff is completely insignificant compared to the error that was introduced by keeping only 9 significant figures in the original calculation of 70^140. We would have had to keep at least 232 significant figures just to see that the number of remaining tweets will be smaller after 13.7 billion years.
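This roundoff effect is easy to reproduce: Python's exact integers can hold all 259 digits of 70^140, while double-precision floats keep only about 16 significant figures, so the subtraction vanishes entirely in float arithmetic. A small sketch (the 70-character alphabet is the same assumption as above):

```python
big = 70 ** 140                        # exact: Python ints have arbitrary precision
seconds = 13_700_000_000 * 31_557_600  # ~13.7 billion years, in seconds
generated = seconds * 10 ** 9          # a billion tweets per second, the whole time

# In double precision the subtraction is completely swallowed by rounding:
print(float(big) - generated == float(big))  # True

# Exact integer arithmetic sees the difference just fine:
remaining = big - generated
print(big - remaining == generated)          # True
```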

I haven't tried to calculate this, but I think the probability that any of the tweets generated in those 13.7 billion years will make sense is extremely small (if they are generated randomly).
 
  • #14
Fredrik said:
I haven't tried to calculate this, but I think the probability that any of the tweets generated in those 13.7 billion years will make sense is extremely small (if they are generated randomly).

Yes, but if you have an infinite number of teenage girls tweeting, one of the tweets will eventually make sense as [itex]t \rightarrow \infty[/itex].
 
  • #15
Steely Dan said:
Yes, but if you have an infinite number of teenage girls tweeting, one of the tweets will eventually make sense as [itex]t \rightarrow \infty[/itex].

No, that's known to be p=0; selection is not random and excludes sensible tweets. (apologies to teenage girls).
 
  • #16
PAllen said:
No, that's known to be p=0; selection is not random and excludes sensible tweets. (apologies to teenage girls).

No need to apologize. Since there are infinitely many, p = 0 only means "almost impossible": each singleton has probability 0.
 
  • #17
DAJ said:
I want to use this to reduce other tweets on twitter to probabilities, I like the idea of converting meaningful language into a number. Also, How do I calculate the probability of a specific sequence of tweets?

The technical bits have been explained already, but I would like to point out the glaring hole in your idea: the assumption of independence.

For example, tweets are not independent at all, so calculating the probabilities gets vastly more complicated, since you have to factor in external events.

Basically, without investing millions of dollars into research, I don't see how what you want to do can be done.
 

