Tough Textual Criticism math statistics question

  • Thread starter Eturnal
  • Tags
    Probability
  • #1
Eturnal
TL;DR Summary
If text A is 87% similar to text Z

And text B is 87% similar to text Z

Text A & B are 92% similar

Each text is 10000 words (approx)



What are the odds that when Text A and Text B disagree one of them will agree with Text Z?
Okay let me rephrase this math question and frame it. It is math dealing with ancient Biblical texts and textual criticism.
.
Codex 01 (350 AD) agrees with the MT (Majority Text) about 87% of the time.

Codex 03 (350 AD) agrees with the MT about 87% of the time.

01 and 03 agree with each other about 92% of the time.

When 01 and 03 disagree, there is an 87% chance that one of them agrees with the MT.

Wouldn't you expect that number to be lower if these disagreements were random?

Please help math geniuses. Thank you!
 
  • Haha
Likes Agent Smith
  • #2
If this is a homework-type problem then there is a specific format for that. You have to show some work and then we can give hints and guidance. Can you express each problem statement in terms of conditional probabilities or other formal mathematical expressions?
 
  • #3
Eturnal said:
TL;DR Summary: If text A is 87% similar to text Z

And text B is 87% similar to text Z

Text A & B are 92% similar

Each text is 10000 words (approx)
What are the odds that when Text A and Text B disagree one of them will agree with Text Z?

Okay let me rephrase this math question and frame it. It is math dealing with ancient Biblical texts and textual criticism.
.
Codex 01 (350 AD) agrees with the MT (Majority Text) about 87% of the time.

Codex 03 (350 AD) agrees with the MT about 87% of the time.

01 and 03 agree with each other about 92% of the time.

When 01 and 03 disagree, there is an 87% chance that one of them agrees with the MT.

Wouldn't you expect that number to be lower if these disagreements were random?

Please help math geniuses. Thank you!
I can't make complete sense of what you are assuming and what you want to calculate based on those assumptions. There is also, perhaps, a missing factor: how many options there are at each place in a text, which determines the underlying probability that two random texts will agree on something. Finally, you are more in the realm of statistical inference (hypothesis testing) here than in simple probabilities. More detail on this follows:

It seems you could model the situation by considering each text to be a list of things - possibly statements in this case. Perhaps each text makes one hundred statements about some things. If these are true/false statements, then each text is modelled by a binary string one hundred characters long. But, if these statements have more options than just true/false or A/B - let's say each statement has five options - then each is a string of 1/2/3/4/5 or A/B/C/D/E one hundred characters long. The first thing you need is an appropriate model like this. This is technically called a sample space.

Once you have an appropriate sample space for your problem, then you can start doing some hypothesis testing. That means you need to frame (precisely) a hypothesis to test!
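As a rough illustration of the kind of model described above, here is a minimal Python sketch. The names `agreement` and `random_text`, the five-option alphabet and the fixed seed are assumptions invented for this example; nothing here is taken from the actual codices.

```python
import random

def agreement(x, y):
    """Fraction of positions at which two equal-length sequences match."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(x, y)) / len(x)

def random_text(n, options, rng):
    """One point of the sample space: n statements, each drawn uniformly from `options`."""
    return [rng.choice(options) for _ in range(n)]

rng = random.Random(0)      # fixed seed so the run is repeatable
options = "ABCDE"           # assumed: five options per statement
t1 = random_text(100, options, rng)
t2 = random_text(100, options, rng)

# Two unrelated texts should agree at roughly 1 position in 5.
print(agreement(t1, t2))
```

In this toy sample space, two texts drawn independently and uniformly agree at about one position in five. That baseline is what any observed agreement would have to be compared against, and it changes with the number of options per position.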
 
  • Like
Likes Eturnal and Dale
  • #4
Eturnal said:
TL;DR Summary: If text A is 87% similar to text Z

And text B is 87% similar to text Z

Text A & B are 92% similar

Each text is 10000 words (approx)
What are the odds that when Text A and Text B disagree one of them will agree with Text Z?

Wouldn't you expect that number to be lower if these disagreements were random?
I don’t think that we can answer this. The similarity measure doesn’t seem like a probability. So we can’t really have any expectations on what the similarity measure should be in unmeasured situations.
 
  • Like
Likes Agent Smith, FactChecker and Eturnal
  • #5
Thanks for your reply. Having a hard time getting started on this problem and my math is rusty. The answer is not 87% of the
 
  • #6
So there are about 7,000 words in each text in the tested section.
.
But we should be able to work in percentages, so that shouldn't matter, right? We have 8% of each text agreeing to disagree. Then we take a random splatter of 26% between the two texts (13% each text). But oh yeah, we would have to account for all the letter/word possibilities in those slots as well, wouldn't we?
 
  • #7
Eturnal said:
So there are about 7,000 words in each text in the tested section.
.
But we should be able to work in percentages, so that shouldn't matter, right? We have 8% of each text agreeing to disagree. Then we take a random splatter of 26% between the two texts (13% each text). But oh yeah, we would have to account for all the letter/word possibilities in those slots as well, wouldn't we?
This is even less clear than your original post. Mathematics and statistics require a well-defined problem - even if the definition includes uncertainties and probabilities. This is different from the humanities where you can argue endlessly over ill-defined concepts!
 
  • Like
Likes Dale
  • #8
PeroK said:
This is even less clear than your original post. Mathematics and statistics require a well-defined problem - even if the definition includes uncertainties and probabilities. This is different from the humanities where you can argue endlessly over ill-defined concepts!
Tell me if I'm heading in the right direction here.

Take two texts of 100 characters and highlight 8% representing the disagreement between 01 and 03.

Now randomly select 13% of each text representing the disagreements between 01/03 & the MT.

What are the odds those 13 places overlap the 8%?

Now, should we account for each character having 24 different options (letters in the Greek alphabet)? Or should we just pretend it is an A/B choice in each character slot for simplicity's sake?
 
  • #9
PeroK said:
This is even less clear than your original post. Mathematics and statistics require a well-defined problem - even if the definition includes uncertainties and probabilities. This is different from the humanities where you can argue endlessly over ill-defined concepts!
Someone posited that there is an 87% chance that the disagreements between 01 and 03 land on MT just because each agrees with MT 87%. I don't feel like their math is correct. Thank you for your help!!!
 
  • #10
Eturnal said:
Take two texts of 100 characters and highlight 8% representing the disagreement between 01 and 03.

Now randomly select 13% of each text representing the disagreements between 01/03 & the MT.

What are the odds those 13 places overlap the 8%?

Now, should we account for each character having 24 different options (letters in the Greek alphabet)? Or should we just pretend it is an A/B choice in each character slot for simplicity's sake?
You can only produce probabilities from a large sample of data or by understanding where the source material came from and hence you have some underlying assumptions.

It's a common misconception that probabilities can be conjured from one sample of data without underlying assumptions. I think this is what you are trying to do here: hoping that, somehow, these percentages themselves will reveal something mathematically robust.

You can determine from the texts how correlated they are (and, indeed, that's what your percentages are trying to show). But, there is no magic wand that will tell you how likely that correlation was. The probability of a given correlation is not inherent in the data. It can only be calculated when you have a model for how the data was generated. The same correlations might be almost inevitable in one case and highly unlikely in another - even in cases where the raw data is the same.
 
  • #11
PeroK said:
You can only produce probabilities from a large sample of data or by understanding where the source material came from and hence you have some underlying assumptions.

It's a common misconception that probabilities can be conjured from one sample of data without underlying assumptions. I think this is what you are trying to do here: hoping that, somehow, these percentages themselves will reveal something mathematically robust.

You can determine from the texts how correlated they are (and, indeed, that's what your percentages are trying to show). But, there is no magic wand that will tell you how likely that correlation was. The probability of a given correlation is not inherent in the data. It can only be calculated when you have a model for how the data was generated. The same correlations might be almost inevitable in one case and highly unlikely in another - even in cases where the raw data is the same.
Humor me. I'm sure someone can draw up a useful piece of math on this although yes everything will have some assumptions plugged in.
 
  • #12
Eturnal said:
Humor me. I'm sure someone can draw up a useful piece of math on this although yes everything will have some assumptions plugged in.
Being an inveterate frequentist, I'll leave that to the Bayesians!
 
  • Haha
Likes Dale
  • #13
PeroK said:
Being an inveterate frequentist, I'll leave that to the Bayesians!
It's challenging I know
 
  • #14
Eturnal said:
Someone posited that there is an 87% chance that the disagreements between 01 and 03 land on MT just because each agrees with MT 87%. I don't feel like their math is correct. Thank you for your help!!!
I don’t think that these percentages for agreement are probabilities. Probabilities are between zero and one, so it is common to write them as percentages. But that doesn’t imply that everything that is written as a percentage is a probability.

In particular, a probability is always a measure on some space of events. For example, if you are rolling a single die then the space of events could be “a 1 is rolled”, “a 2 is rolled”, …, “a 6 is rolled”.

Here, I cannot see that there is a space of events. So I don’t think that “text A is 87% similar to text Z” is a probability. If it is a probability then what exactly is the event space and what sample of events is described by the statement?

Eturnal said:
Humor me. I'm sure someone can draw up a useful piece of math on this although yes everything will have some assumptions plugged in.
I don’t think that will be possible without some additional information about the similarity measure. It doesn’t seem like a probability to me. So I don’t think the math of probability will apply.
 
  • #15
Dale said:
I don’t think that will be possible without some additional information about the similarity measure. It doesn’t seem like a probability to me. So I don’t think the math of probability will apply.
And if a Bayesian can't do it, then nobody can!
 
  • Like
Likes Dale
  • #16
PeroK said:
Being an inveterate frequentist, I'll leave that to the Bayesians!
I am a Bayesian, so I am happy to assign probabilities without a lot of data. But I still need an event space, just like the frequentists.
 
  • Like
Likes Agent Smith
  • #17
Dale said:
I don’t think that these percentages for agreement are probabilities. Probabilities are between zero and one, so it is common to write them as percentages. But that doesn’t imply that everything that is written as a percentage is a probability.

In particular, a probability is always a measure on some space of events. For example, if you are rolling a single die then the space of events could be “a 1 is rolled”, “a 2 is rolled”, …, “a 6 is rolled”.

Here, I cannot see that there is a space of events. So I don’t think that “text A is 87% similar to text Z” is a probability. If it is a probability then what exactly is the event space and what sample of events is described by the statement?

I don’t think that will be possible without some additional information about the similarity measure. It doesn’t seem like a probability to me. So I don’t think the math of probability will apply.
 
  • #18
Anyone want to try their hand at being amazing?
 
  • #19
I'm getting a bizarre result: ##261 = 211##
 
  • #20
Eturnal said:
Anyone want to try their hand at being amazing?
Oh, we are all amazing here, but this is more a problem in combinatorics than probability or statistics.

And I am afraid you won't find an answer in combinatorics either, because there is not enough information there to provide a unique answer. To demonstrate this without using 10,000 word examples I will use something simpler:

  • Strings are ordered sequences of 10 letters.
  • 70% of the letters of string A are identical to the corresponding letters of string Z.
  • 70% of the letters of string B are identical to the corresponding letters of string Z.
  • 80% of the letters of string A are identical to the corresponding letters of string B.

Example 1:
A = "ZZZZZZZAAA"
B = "ZZZZZZZABB"
Z = "ZZZZZZZZZZ"
Here when A disagrees with B (in the 9th and 10th positions), neither A nor B ever agrees with Z. Now consider

Example 2:
A = "ZZZZZZAAAZ"
B = "ZZZZZZAAZA"
Z = "ZZZZZZZZZZ"
Here when A disagrees with B (again in the 9th and 10th positions), either A or B always agrees with Z.
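The two examples can be checked mechanically, and the same counting idea extends to the percentages in the opening post. The following is a minimal sketch, not part of the original argument: `agreement`, `sides_with_z` and `feasible_range` are hypothetical helper names, and `feasible_range` assumes an alphabet of at least three symbols so that A, B and Z can all differ at one position.

```python
def agreement(x, y):
    """Fraction of positions at which two equal-length strings match."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def sides_with_z(a, b, z):
    """Among positions where a and b differ, the fraction where a or b matches z."""
    diffs = [(p, q, r) for p, q, r in zip(a, b, z) if p != q]
    if not diffs:
        return None  # a and b never disagree
    return sum(p == r or q == r for p, q, r in diffs) / len(diffs)

def feasible_range(n, az, bz, ab):
    """Smallest and largest possible value of sides_with_z, given only counts.

    n  = number of positions
    az = positions where A matches Z, bz = where B matches Z,
    ab = positions where A matches B.
    Let x be the number of positions where all three match. Counting the
    remaining cases gives (az + bz + ab - n) / 2 <= x <= min(az, bz, ab),
    and the disagreements where one text matches Z number az + bz - 2x.
    """
    x_min = max(0, -(-(az + bz + ab - n) // 2))  # ceiling division by 2
    x_max = min(az, bz, ab)
    disagreements = n - ab
    if disagreements == 0 or x_min > x_max:
        return None  # no disagreements, or the counts are inconsistent
    return ((az + bz - 2 * x_max) / disagreements,
            (az + bz - 2 * x_min) / disagreements)

Z = "ZZZZZZZZZZ"
examples = {
    "Example 1": ("ZZZZZZZAAA", "ZZZZZZZABB"),
    "Example 2": ("ZZZZZZAAAZ", "ZZZZZZAAZA"),
}
for name, (A, B) in examples.items():
    print(name, agreement(A, Z), agreement(B, Z), agreement(A, B),
          sides_with_z(A, B, Z))
# Example 1 0.7 0.7 0.8 0.0
# Example 2 0.7 0.7 0.8 1.0

print(feasible_range(10, 7, 7, 8))       # (0.0, 1.0)
print(feasible_range(100, 87, 87, 92))   # (0.0, 1.0)
```

Under these assumptions the 87% / 87% / 92% figures, taken per 100 positions, are compatible with every outcome from "neither text ever sides with Z" to "one text always sides with Z", which is the same lack of a unique answer the two examples show.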

Now consider the question in your opening post:

Eturnal said:
When 01 and 03 disagree, there is an 87% chance that one of them agrees with the MT.

Wouldn't you expect that number to be lower if these disagreements were random?

The problem with this question is that you have already specified that the disagreements are not (or rather very unlikely to be) random by stating that
Eturnal said:
Text A & B are 92% similar

It is this number that you would expect to be smaller if the differences between texts A and Z and the differences between B and Z were "random", or rather uncorrelated, which is the correct term here. In fact you would expect it to be around ## 0.87^2 \approx 0.76 ##, or 76%.
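A quick simulation gives a feel for that 76% figure. This is only a sketch under stated assumptions: each position of A and of B is changed away from Z independently with probability 0.13, a changed position takes one of the other symbols uniformly at random, and the 24-symbol alphabet merely stands in for the Greek letters mentioned earlier in the thread. The helper names `corrupt` and `agreement` are made up for this example.

```python
import random

def corrupt(z, p_change, alphabet, rng):
    """Copy z, replacing each symbol by a different one with probability p_change."""
    out = []
    for s in z:
        if rng.random() < p_change:
            out.append(rng.choice([c for c in alphabet if c != s]))
        else:
            out.append(s)
    return out

def agreement(x, y):
    """Fraction of positions at which two equal-length sequences match."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

rng = random.Random(1)                  # fixed seed for repeatability
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWX"   # 24 placeholder symbols (assumption)
n = 10_000

z = [rng.choice(alphabet) for _ in range(n)]
a = corrupt(z, 0.13, alphabet, rng)     # A's changes are independent of B's
b = corrupt(z, 0.13, alphabet, rng)

# Both unchanged, or both changed to the same wrong symbol by coincidence.
k = len(alphabet)
expected_ab = 0.87**2 + 0.13**2 / (k - 1)

print(agreement(a, z), agreement(b, z))  # each close to 0.87
print(agreement(a, b), expected_ab)      # both close to 0.76
```

Under this particular model the A-B agreement comes out near 76%, well short of the 92% quoted for the real texts. A different generating model would give a different number, so this illustrates the point rather than proving anything about the codices.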

Finally, I assume you are referring to religious texts here and you should note:
  • we don't discuss religion here
  • you don't need statistics to tell you that it is unlikely that differences between different versions of similar texts are uncorrelated because the process that gives rise to these differences is clearly not random
  • the content of these posts is copyright PhysicsForums and you must not publish it elsewhere other than in accordance with this site's terms and conditions. In particular do NOT post on some religious crackpot site that Science has proved that the Codex Sinaiticus is the One True Word of God or whatever.
 
  • Haha
Likes Agent Smith

FAQ: Tough Textual Criticism math statistics question

What is textual criticism in the context of mathematical statistics?

Textual criticism in mathematical statistics involves analyzing and evaluating the accuracy and reliability of texts, such as research papers or data sets, using statistical methods. This can include identifying errors, inconsistencies, and biases in the data or text.

How can statistical methods be applied to textual criticism?

Statistical methods can be applied to textual criticism by using techniques such as hypothesis testing, regression analysis, and clustering. These methods help in identifying patterns, anomalies, and relationships within the text or data, which can then be used to assess its reliability and validity.

What are common statistical tools used in textual criticism?

Common statistical tools used in textual criticism include chi-square tests, t-tests, ANOVA, correlation coefficients, and principal component analysis (PCA). These tools help in quantifying the degree of agreement or discrepancy among different texts or data sets.

What challenges are faced in applying statistical methods to textual criticism?

Challenges in applying statistical methods to textual criticism include dealing with incomplete or inconsistent data, distinguishing between meaningful patterns and random noise, and ensuring that the statistical methods are appropriate for the type of text or data being analyzed. Additionally, interpreting the results in a meaningful way can be complex.

Can machine learning be used in textual criticism, and if so, how?

Yes, machine learning can be used in textual criticism. Techniques such as natural language processing (NLP), clustering algorithms, and supervised learning can help in automating the analysis of large texts, identifying patterns, and making predictions about the reliability and accuracy of the text. Machine learning models can be trained to detect anomalies, classify texts, and even suggest corrections.
