Text fitness
This online calculator computes text "fitness", that is, how similar a given text is to other texts written in English.
The calculator below is an example of a fitness function that can be applied to text.
According to Wikipedia,
A fitness function is a particular type of objective function used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims. Fitness functions are used in genetic programming and genetic algorithms to guide simulations towards optimal design solutions.[1]
Why might we need it for texts? Unlike humans, computers can't look at a text and say whether it is normal text or gibberish, so they need something measurable. This particular implementation calculates such a measure (or fitness score) based on quadgram (aka 4-gram, aka tetragraph) statistics. Thanks to the Google Books team, who released their Ngram statistics under a Creative Commons Attribution 3.0 Unported License, we can actually count the occurrences of any n-gram in the whole Google corpus (here is the link to Ngram Viewer). And thanks to Peter Norvig, who actually calculated these occurrences, I did not need to download 23 GB of text and count them myself (here is the link to Peter Norvig's article English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU).
For my fitness function, I used the 20000 most frequent quadgrams. The total number of quadgrams analysed is 1 467 684 913 428. To give you an idea, here are the top ten, along with their frequencies (calculated by dividing the number of occurrences by the total number of quadgrams):
Quadgram | Occurrences | Frequency |
---|---|---|
TION | 16 665 142 795 | 0.0113547142 |
ATIO | 8 806 923 204 | 0.0060005544 |
THAT | 8 006 441 484 | 0.0054551501 |
THER | 6 709 891 631 | 0.0045717521 |
WITH | 6 099 136 075 | 0.0041556168 |
MENT | 5 424 670 138 | 0.0036960727 |
IONS | 4 103 605 496 | 0.0027959717 |
THIS | 3 830 166 510 | 0.0026096654 |
HERE | 3 590 397 215 | 0.0024462997 |
FROM | 3 473 404 890 | 0.0023665876 |
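The frequency column can be reproduced from the occurrence counts with a single division. Here is a minimal sketch using a few rows of the table; the names `TOTAL_QUADGRAMS`, `top_counts` and `frequencies` are mine, chosen for illustration only.

```python
# Frequency = occurrences / total number of quadgrams analysed.
TOTAL_QUADGRAMS = 1_467_684_913_428

# A few rows copied from the table above.
top_counts = {
    "TION": 16_665_142_795,
    "ATIO": 8_806_923_204,
    "THAT": 8_006_441_484,
    "MENT": 5_424_670_138,
}

frequencies = {quadgram: count / TOTAL_QUADGRAMS
               for quadgram, count in top_counts.items()}

for quadgram, freq in frequencies.items():
    # Print with ten decimal places, like the table.
    print(f"{quadgram}  {freq:.10f}")
```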
Given these frequencies, we can, technically, estimate the probability of finding a given text in the whole corpus (which is a good candidate for a fitness measure). For example, let our text be the word "MENTION". It consists of the following quadgrams: MENT - ENTI - NTIO - TION. So,

$$P(\text{MENTION}) \approx P(\text{MENT}) \cdot P(\text{ENTI}) \cdot P(\text{NTIO}) \cdot P(\text{TION})$$
Well, approximately, of course. Language rules impose additional constraints, but we do not care much about them as long as our fitness function works as expected. The real problem, however, is that the probabilities are quite small, so their product quickly shrinks to even smaller values, introduces rounding errors, and becomes unusable. The solution is well known: apply the logarithm. In this case,

$$\log_{10} P(\text{MENTION}) \approx \log_{10} P(\text{MENT}) + \log_{10} P(\text{ENTI}) + \log_{10} P(\text{NTIO}) + \log_{10} P(\text{TION})$$
As you can see, multiplication is replaced with addition. Since the probabilities are less than one but greater than zero, the base-10 logarithm gives us negative values, and the more rare quadgrams we have, the more negative the sum becomes. By the way, for quadgrams outside the 20000 most frequent, I used a very small constant probability of 1/1 467 684 913 428, whose logarithm is -12.1666328301.
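To make the last few steps concrete, here is a small sketch of splitting a word into quadgrams, of why multiplying many small probabilities breaks down in floating point, and of the floor value used for unlisted quadgrams. The helper name `quadgrams` is mine, and the probability of 1e-4 per quadgram is a made-up illustrative number, not a value from the corpus.

```python
import math

TOTAL_QUADGRAMS = 1_467_684_913_428


def quadgrams(text):
    """Return every run of four consecutive characters in text."""
    return [text[i:i + 4] for i in range(len(text) - 3)]


print(quadgrams("MENTION"))  # ['MENT', 'ENTI', 'NTIO', 'TION']

# Illustrative assumption: a text with 100 quadgrams, each with
# probability 1e-4. The true product, 1e-400, is far below the
# smallest positive double, so the running product collapses to 0.0.
probs = [1e-4] * 100
product = math.prod(probs)

# Summing base-10 logarithms stays well within range (about -400).
log_sum = sum(math.log10(p) for p in probs)
print(product, log_sum)

# Floor log-probability for quadgrams missing from the table.
floor_log = math.log10(1 / TOTAL_QUADGRAMS)
print(floor_log)
```

Once the product has underflowed to zero, every long text looks equally improbable, whereas the sum of logarithms still distinguishes them.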
So, this is exactly how the fitness metric is calculated in the calculator below. I break the text into quadgrams, sum the logarithms of their probabilities, normalize by dividing by the text's length, and take the absolute value of the result (just for convenience). The more rare quadgrams appear in the text, the bigger the value we get; the fewer rare quadgrams appear, the smaller the value.
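The procedure just described might be sketched as follows. This is not the calculator's actual code: it uses only two quadgrams from the table instead of the full list, and it assumes that non-letters are dropped, that the text is uppercased, and that "text's length" means the number of remaining letters. The names `fitness`, `LOG_PROBS` and `FLOOR` are hypothetical.

```python
import math

TOTAL_QUADGRAMS = 1_467_684_913_428

# Log-probability assigned to any quadgram missing from the table.
FLOOR = math.log10(1 / TOTAL_QUADGRAMS)

# Tiny stand-in for the full quadgram table (counts from the table above).
COUNTS = {"TION": 16_665_142_795, "MENT": 5_424_670_138}
LOG_PROBS = {q: math.log10(n / TOTAL_QUADGRAMS) for q, n in COUNTS.items()}


def fitness(text, log_probs=LOG_PROBS, floor=FLOOR):
    """Lower score = closer to the quadgram statistics in log_probs."""
    # Assumption: keep only the letters A-Z, uppercased.
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    if len(letters) < 4:
        raise ValueError("need at least four letters")
    # Sum the log-probabilities of all overlapping quadgrams.
    total = sum(log_probs.get(letters[i:i + 4], floor)
                for i in range(len(letters) - 3))
    # Normalize by the text length and drop the sign for convenience.
    return abs(total / len(letters))


print(fitness("MENTION"))
print(fitness("XQZJKWV"))
```

With such a small table the absolute numbers will not match the calculator's scores; only the ordering (familiar quadgrams score lower than unfamiliar ones) carries over.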
Of course, this is just one of many possible text metrics, and, taken alone, a score means nothing. The power comes from comparing texts. Let's compare several cases:
- A random article from the NYT (Source)
The score is 5.61
- The "To be or not to be" speech from Hamlet
The score is 6.08
- JABBERWOCKY by Lewis Carroll
The score is 6.53
- A random sequence of letters produced by Random Letters Generator
The score is 11.46
As you can see, now we have some meaning in the results. The computer can tell you that the NYT article is certainly "more English" than a random sequence of letters. This can be used in many applications, for example, in the automatic cracking of classical simple substitution ciphers (actually, this is why I needed this function). Of course, like all statistical measures, it relies heavily on the text being "normal" English. It fails miserably if the text's statistics differ from the norm, as in the widely known example from Simon Singh’s book "The Code Book":
From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags
You can play with this quote right below if you are interested.
Click here to calculate fitness score
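To illustrate the cipher-cracking use mentioned earlier, here is a toy sketch that picks the most "English" candidate decryption by its fitness score. It uses a Caesar shift as a deliberately simpler stand-in for a general substitution cipher, and only the ten quadgrams from the table, so it is an illustration under those assumptions rather than a practical cracker. The names `fitness`, `shift` and `crack_caesar` are hypothetical.

```python
import math

TOTAL_QUADGRAMS = 1_467_684_913_428
FLOOR = math.log10(1 / TOTAL_QUADGRAMS)

# The top ten quadgrams from the table; a real cracker would use far more.
COUNTS = {
    "TION": 16_665_142_795, "ATIO": 8_806_923_204, "THAT": 8_006_441_484,
    "THER": 6_709_891_631, "WITH": 6_099_136_075, "MENT": 5_424_670_138,
    "IONS": 4_103_605_496, "THIS": 3_830_166_510, "HERE": 3_590_397_215,
    "FROM": 3_473_404_890,
}
LOG_PROBS = {q: math.log10(n / TOTAL_QUADGRAMS) for q, n in COUNTS.items()}


def fitness(letters):
    """Score a string of uppercase A-Z letters; lower is more English-like."""
    total = sum(LOG_PROBS.get(letters[i:i + 4], FLOOR)
                for i in range(len(letters) - 3))
    return abs(total / len(letters))


def shift(letters, k):
    """Caesar-shift uppercase A-Z letters by k positions."""
    return "".join(chr((ord(c) - 65 + k) % 26 + 65) for c in letters)


def crack_caesar(ciphertext):
    """Try all 26 shifts and keep the candidate with the lowest score."""
    return min((shift(ciphertext, k) for k in range(26)), key=fitness)


plaintext = "THISMENTIONISFROMTHERE"
ciphertext = shift(plaintext, 3)
print(ciphertext, "->", crack_caesar(ciphertext))
```

A general substitution cipher has far too many keys to enumerate, so a real attack would search the key space (for example by hill climbing) while using the same score to decide which candidate to keep.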