Can the probability of specific tweets be accurately calculated?

  • Thread starter: DAJ
In summary, the conversation discusses the probability of specific tweets and the challenges of calculating this probability given the large number of possible arrangements of characters. The conversation also touches on the use of grammar and language processing in generating meaningful tweets.
  • #1
DAJ
Hello PF,

I have a question.
I am an artist working with big numbers and language. I created a twitter account that will post all possible tweets. (Edit: link removed)

I am interested in calculating the probability of specific tweets. For example: what is the probability that my next tweet will be "End here. Us then. Finn, again! Take. Bussoftlhee, mememormee!" or just gibberish like "j^F9c@# 64l["? Is this doable? I want to use this to reduce other tweets on Twitter to probabilities; I like the idea of converting meaningful language into a number. Also, how do I calculate the probability of a specific sequence of tweets?

And ok... I realize that these numbers are really really big, that the probability is basically zero, but I am not interested in reality.

THANKS!
 
  • #2
DAJ said:
Hello PF,

I am interested in calculating the probability of specific tweets. For example: what is the probability that my next tweet will be "End here. Us then. Finn, again! Take. Bussoftlhee, mememormee!" or just gibberish like "j^F9c@# 64l["? Is this doable? I want to use this to reduce other tweets on Twitter to probabilities; I like the idea of converting meaningful language into a number. Also, how do I calculate the probability of a specific sequence of tweets?

And ok... I realize that these numbers are really really big, that the probability is basically zero, but I am not interested in reality.

THANKS!

If you're asking about the number of possible arrangements of characters in a 140 character tweet, the answer is fairly straightforward. Given an alphabet of k characters (including spaces), the number of possible arrangements is [itex] k^{140} [/itex]. Of course, most of these arrangements will be nonsense. The probability of any specific sequence is just [itex] 1/k^{140}[/itex] assuming every character has an equal probability of occurring. Obviously the problem is more complicated if this assumption doesn't hold.
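The size of [itex]k^{140}[/itex] is easy to see by computing it exactly; Python integers have arbitrary precision, so a short sketch works directly (the alphabet size k = 70 below is just an assumed value for illustration):

```python
from fractions import Fraction

k = 70    # assumed alphabet size, for illustration only
n = 140   # tweet length limit at the time of this thread

arrangements = k ** n                   # number of possible 140-character strings
p_specific = Fraction(1, arrangements)  # exact probability of any one specific string

print(len(str(arrangements)))           # 70^140 is a 259-digit number
```

Since the tweets are assumed independent, the probability of a specific sequence of m tweets is just `p_specific ** m`.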
 
  • #3
Even if we assume that the probability of each character occurring is the same and that the probabilities are independent of the other characters, we still need to consider tweets with fewer than 140 characters. E.g., there are [itex]k^{100}[/itex] 100-character tweets.
Therefore, the total number of all possible non-empty tweets is [itex]\sum^{140}_{i=1} k^i[/itex], and the probability of observing any given tweet is then [itex]1/\sum^{140}_{i=1} k^i[/itex].
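This count can be checked numerically; the sketch below again assumes k = 70 and also verifies the geometric-series closed form of the sum:

```python
k = 70  # assumed alphabet size

# All non-empty tweets of length 1 through 140:
total = sum(k ** i for i in range(1, 141))

# The same count via the closed form of the geometric series:
closed_form = k * (k ** 140 - 1) // (k - 1)

print(total == closed_form)  # True
```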
 
  • #4
SW VandeCarr said:
If you're asking about the number of possible arrangements of characters in a 140 character tweet, the answer is fairly straightforward. Given an alphabet of k characters (including spaces), the number of possible arrangements is [itex] k^{140} [/itex]. Of course, most of these arrangements will be nonsense. The probability of any specific sequence is just [itex] 1/k^{140}[/itex] assuming every character has an equal probability of occurring. Obviously the problem is more complicated if this assumption doesn't hold.

Don't forget that URLs of the form http://anything are automatically shortened by Twitter. See http://support.twitter.com/entries/109623.

This is now a much more interesting problem, since you have to calculate the probability that a random string contains a syntactically correct url.
 
  • #5
oleador said:
Even if we assume that the probability of each character occurring is the same and that the probabilities are independent of the other characters, we still need to consider tweets with fewer than 140 characters. E.g., there are [itex]k^{100}[/itex] 100-character tweets.
Therefore, the total number of all possible non-empty tweets is [itex]\sum^{140}_{i=1} k^i[/itex], and the probability of observing any given tweet is then [itex]1/\sum^{140}_{i=1} k^i[/itex].

I believe it can be shown that, by including a space as a character, my formulation is equivalent to yours. For a string of 140, the probability that the last character is a space is 1/k, for the last two characters being spaces, [itex]P=1/k^2[/itex], ...,for the "last" 140 characters, [itex]P=1/k^{140}[/itex]. This at least was my intent in including an empty space character.
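This equivalence can be brute-force checked on a toy alphabet: strings of length at most n over k characters correspond exactly to length-n strings over k + 1 characters in which the extra padding character appears only as a suffix. A small sketch (the toy values k = 3, n = 4 are my own choice):

```python
from itertools import product

k, n = 3, 4
alphabet = ['a', 'b', 'c']  # toy alphabet of k characters
PAD = '_'                   # stands in for the trailing-space character

def pad_only_at_end(s):
    """True if every PAD in s forms an uninterrupted suffix."""
    return PAD not in s.rstrip(PAD)

# Length-n strings over the padded alphabet, PAD allowed only as a suffix:
padded = sum(1 for chars in product(alphabet + [PAD], repeat=n)
             if pad_only_at_end(''.join(chars)))

# Direct count of strings of length 0..n over the unpadded alphabet:
direct = sum(k ** i for i in range(n + 1))

print(padded, direct)  # both 121: the two formulations agree
```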
 
  • #6
SW VandeCarr said:
I believe it can be shown that, by including a space as a character, my formulation is equivalent to yours. For a string of 140, the probability that the last character is a space is 1/k, for the last two characters being spaces, [itex]P=1/k^2[/itex], ...,for the "last" 140 characters, [itex]P=1/k^{140}[/itex]. This at least was my intent in including an empty space character.

My bad. You are totally right. Using my approach the tweet "good_morning" is different from the tweet "good_morning_________", where "_" stands for space. Clearly, this does not make sense in this setting.
 
  • #7
oleador said:
My bad. You are totally right. Using my approach the tweet "good_morning" is different from the tweet "good_morning_________", where "_" stands for space. Clearly, this does not make sense in this setting.

No problem. Unfortunately, real language processing is orders of magnitude more complicated and involves inputting huge amounts of data regarding allowable strings and syntax. English is more difficult than continental European languages because of its "quaint" spelling. Add to that the highly abbreviated, non-standard language used in tweets and you have a real challenge. (Although some kind of standardized English-based "Tweetish" would probably be easier to process than the standard dialect.)
 
  • #8
Take a look at grammars and construct of realizations of a particular grammar. Then you can supply a dictionary, a set of conditional probability distributions and then you can generate the things using a random number generator for the distributions and grammar distributions.

The grammars will be Markovian in nature, and the specificity of this will depend on how you arrange the tags, how they are linked together, and their internal structure versus the global structure.
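A minimal sketch of this idea in Python, using a probabilistic context-free grammar whose tags expand according to supplied weighted productions. The tags, dictionary words, and probabilities below are invented purely for illustration:

```python
import random

# A toy probabilistic grammar: each tag expands to one of several
# weighted alternatives. All tags, words, and weights are made up.
GRAMMAR = {
    'S':  [(['NP', 'VP'], 1.0)],
    'NP': [(['the', 'N'], 0.7), (['N'], 0.3)],
    'VP': [(['V', 'NP'], 0.6), (['V'], 0.4)],
    'N':  [(['cat'], 0.5), (['tweet'], 0.5)],
    'V':  [(['posts'], 0.5), (['reads'], 0.5)],
}

def expand(tag):
    """Recursively expand a tag by sampling one weighted production."""
    if tag not in GRAMMAR:
        return [tag]  # terminal word from the dictionary
    rules, weights = zip(*GRAMMAR[tag])
    rule = random.choices(rules, weights=weights)[0]
    return [word for t in rule for word in expand(t)]

print(' '.join(expand('S')))
```

Every string this produces is a valid realization of the grammar, and its probability is the product of the weights used along the derivation.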
 
  • #9
chiro said:
Take a look at grammars and construct of realizations of a particular grammar. Then you can supply a dictionary, a set of conditional probability distributions and then you can generate the things using a random number generator for the distributions and grammar distributions.

The grammars will be Markovian in nature, and the specificity of this will depend on how you arrange the tags, how they are linked together, and their internal structure versus the global structure.

If I understand you, generating grammatically correct sentences isn't equivalent to generating sensible sentences. "He baked a ward with Venus and bled incoherently."
 
  • #10
SW VandeCarr said:
If I understand you, generating grammatically correct sentences isn't equivalent to generating sensible sentences. "He baked a ward with Venus and bled incoherently."

You can add as much constraint as you want with the grammar: it doesn't have to correspond to a normal spoken or written version we use: you can add features of txtspeak and any other kind of realization you want to include.

Don't confuse grammar with English grammar: it's a general grammatical structure.
 
  • #11
chiro said:
You can add as much constraint as you want with the grammar: it doesn't have to correspond to a normal spoken or written version we use: you can add features of txtspeak and any other kind of realization you want to include.

Don't confuse grammar with English grammar: it's a general grammatical structure.

My point was that if we want to consider only tweets that make sense, as opposed to random sequences of characters, we have to consider both grammar and semantics. There are programs that can do this to a limited degree, but as far as I know, there's no feasible way to assuredly obtain every sensible statement within the 140-character limit, regardless of which non-ideographic written general-purpose language you choose. Regarding ideographic scripts (e.g. Mandarin), I have no idea.
 
  • #12
SW VandeCarr said:
My point was if we want to consider only tweets that make sense, as opposed to random sequences of characters, we have to consider both grammar and semantics. There are programs that can do this to a limited degree, but as far as I know, there's no feasible way to assuredly obtain every sensible statement within the 140 character limit regardless of whatever non-ideographic written general purpose language you choose. Regarding ideographic texts (ie Mandarin), I have no idea.

In terms of the actual grammatical structure (the tags, their relationships to other tags, and the overall structure), semantics just adds to the structure.

The grammar can be as detailed and as complex as you want; greater complexity allows possibilities you could not otherwise have and gives you more control than you would have at lower complexity, but again it's all in the grammar definition.

As an example of what I mean at its most extreme, you could have all possible sentences in a tag each and then create an output tag that is basically an XOR statement of all the possible leaf tag definitions. Although you wouldn't do this, the point is that the grammatical structure can generate whatever you want it to generate.

Of course you wouldn't do this: you would get a linguist to specify the semantic and syntactic issues to generate the final grammar, which would be optimal in terms of description. In other words, you want to generate a grammar of minimum complexity while retaining all the semantic and syntactic information for the valid realizations; you are solving a kind of optimization problem with the constraints determined by the syntax, semantics, and other relevant information that a linguistics specialist would supply.

This is a language-independent phenomenon: you could apply it to Mandarin, just as you can apply it to representing the data structure of a bitmap, just as for specifying English text. As long as the alphabet is quantized (and you could extend the idea to a non-quantized alphabet), the idea doesn't change.
 
  • #13
DAJ said:
And ok... I realize that these numbers are really really big, that the probability is basically zero, but I am not interested in reality.
I'm not sure what you're trying to do, but that sounds like a mistake. The numbers are absolutely ridiculous.

I don't know what characters are allowed in a tweet, so I'll guess that there are 70 of them (26 lowercase letters, 26 uppercase letters, 10 digits, and a few non-alphabetic symbols). So there are 70^140 ≈ 2.05932837 × 10^258 possible tweets. For comparison, the current age of the universe (≈13.7 billion years) is less than 10^18 seconds.

If you could generate a billion tweets per second for 13.7 billion years, then you will have generated about 4.32 × 10^26 messages. That's a lot, right? But the number of tweets you still haven't generated is approximately

2.05932837*10^258 - 4.32*10^26 = 2.05932837*10^258.

So the number of tweets you still haven't generated is essentially unchanged...after a billion tweets per second for 13.7 billion years.

How is this possible? 2.05932837*10^258 is a 259-digit number that starts with 2059328370000000000 (240 more zeroes after that). The computer has obviously rounded off to 9 significant figures. When we subtract the 27 digit number 4.32*10^26 from that, we get a 259-digit number that starts with 20593283699999999999 and then has nothing but nines until the last 28 digits. So when the computer displays the answer of the subtraction it rounds off 2.0593283699...(220 more nines, followed by 28 more digits) to 2.05932837. The error introduced by this roundoff is completely insignificant compared to the error that was introduced by keeping only 9 significant figures in the original calculation of 70^140. We would have had to keep at least 232 significant figures just to see that the number of remaining tweets will be smaller after 13.7 billion years.
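This roundoff effect is easy to reproduce: Python's exact integers can hold all 259 digits of 70^140, while double-precision floats keep only about 16 significant figures, so the subtraction vanishes entirely in float arithmetic. A small sketch (the 70-character alphabet is the same assumption as above):

```python
big = 70 ** 140                        # exact: Python ints have arbitrary precision
seconds = 13_700_000_000 * 31_557_600  # ~13.7 billion years, in seconds
generated = seconds * 10 ** 9          # a billion tweets per second, the whole time

# In double precision the subtraction is completely swallowed by rounding:
print(float(big) - generated == float(big))  # True

# Exact integer arithmetic sees the difference just fine:
remaining = big - generated
print(big - remaining == generated)          # True
```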

I haven't tried to calculate this, but I think the probability that any of the tweets generated in those 13.7 billion years will make sense is extremely small (if they are generated randomly).
 
  • #14
Fredrik said:
I haven't tried to calculate this, but I think the probability that any of the tweets generated in those 13.7 billion years will make sense is extremely small (if they are generated randomly).

Yes, but if you have an infinite number of teenage girls tweeting, one of the tweets will eventually make sense as [itex]t \rightarrow \infty[/itex].
 
  • #15
Steely Dan said:
Yes, but if you have an infinite number of teenage girls tweeting, one of the tweets will eventually make sense as [itex]t \rightarrow \infty[/itex].

No, that's known to be p=0; selection is not random and excludes sensible tweets. (apologies to teenage girls).
 
  • #16
PAllen said:
No, that's known to be p=0; selection is not random and excludes sensible tweets. (apologies to teenage girls).

No need to apologize. Since there are infinitely many, p = 0 only means "almost impossible": each singleton has probability 0.
 
  • #17
DAJ said:
I want to use this to reduce other tweets on twitter to probabilities, I like the idea of converting meaningful language into a number. Also, How do I calculate the probability of a specific sequence of tweets?

The technical bits have been explained already, but I would like to point out the glaring hole in your idea: the assumption of independence.

For example, tweets are not independent at all, so calculating the probabilities gets vastly more complicated, since you have to factor in external events.

Basically, without investing millions of dollars into research, I don't see how what you want to do can be done.
 

