A few days ago, I started a project to expand my Hindi vocabulary based on dialogues from the movie ‘Frozen’. One of the questions I asked myself then was very interesting: How many words do you need to know to understand the text?
So I decided to conduct a similar experiment using subtitles from the movie ‘Frozen’ in Polish, English, German, French, and Arabic.
I wondered about the following issues:
Does knowing 20% of the words from the subtitles really allow you to understand 80% of the entire text in each of these languages?
Also, does this proportion hold only for languages with very simple verb and noun inflection, such as English or Swedish? Or does it also apply to languages like Polish, where words can appear in many forms?
Here’s the plan I set for myself:
First, I will use subtitles from the movie ‘Frozen’ in various languages and break down all the sentences into individual words.
Second, I will treat the following as a word:
- Every word, including proper nouns such as the names Elsa and Anna.
- Every form of a word that appears in the text. For example, verb forms like “is,” “was,” “will be” or nouns like “snowman,” “snowmen” will be treated as separate words.
- Words connected by an apostrophe will be treated as one word, for example, “I’m,” “he’s” in English or “c’est” in French.
- Words connected by a hyphen will be separated, for example, “attrape-moi” in French becomes two words: “attrape” and “moi.”
- If a word can have two meanings, I will consider it as one word in my analysis. For example, “może” meaning “maybe” and “może” meaning “he/she can” will be counted as one word.
- If a word is written joined to the next one, as with the Arabic “and” (“و”), the two will be treated as a single word. So, for instance, “ولكن” is one word, not two.
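The rules above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the experiment (which was done in Excel): the names `WORD_RE` and `tokenize` are made up for this example, and lowercasing is an added assumption so that “The” and “the” count as the same word.

```python
import re

# Apostrophes (straight or typographic) stay inside a word; hyphens and all
# other punctuation act as separators. \w is Unicode-aware in Python 3, so
# Polish, French, German and Arabic letters are matched as well.
WORD_RE = re.compile(r"\w+(?:['’]\w+)*")

def tokenize(text):
    """Split text into word forms following the rules listed above."""
    # Lowercasing is an assumption of this sketch, not a rule from the article.
    return [w.lower() for w in WORD_RE.findall(text)]

print(tokenize("I'm here, attrape-moi! C’est Elsa. ولكن"))
```

Because the joined Arabic “و” is simply part of the matched letter sequence, “ولكن” comes out as one word, and two meanings of the same spelling (such as “może”) are automatically counted together.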
So, here are the results of my experiments:
How many words do you need to know to understand a text in English?
Here is the basic data:
- Total number of words (including repetitions): 7747
- Number of unique words: 1241
- Words are repeated on average: 6.2 times
Percentage of words known | Percentage of text understood |
5% | 52.3% |
10% | 65.64% |
20% | 78.55% |
35% | 87.20% |
50% | 91.98% |
10 most common words:
- you, I, the, to, a, and, no, it, me, is
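For readers who want to reproduce percentages like these, here is a minimal sketch of the calculation. It assumes that “knowing X% of the words” means knowing the X% most frequent unique word forms; the function name `coverage`, the rounding up of the word count, and the toy sentence are all assumptions made for this example.

```python
import math
from collections import Counter

def coverage(tokens, share):
    """Fraction of all tokens covered by the most frequent `share` of unique words."""
    if not tokens:
        return 0.0
    # Frequencies of each unique word, highest first.
    ranked = sorted(Counter(tokens).values(), reverse=True)
    # Number of unique words assumed to be known (rounding up is an assumption).
    known = math.ceil(share * len(ranked))
    return sum(ranked[:known]) / len(tokens)

# Toy text, not taken from the subtitles.
tokens = "the snow is cold and the snowman is happy and the snow is white".split()
for share in (0.05, 0.10, 0.20, 0.35, 0.50):
    print(f"{share:.0%} of unique words -> {coverage(tokens, share):.1%} of the text")
```

On a real subtitle file, `tokens` would be the full list of words from the film rather than a single sentence.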
How many words do you need to know to understand a text in Polish?
Here is the basic data:
- Total number of words (including repetitions): 6374
- Number of unique words: 1885
- Words are repeated on average: 3.4 times
Percentage of words known | Percentage of text understood |
5% | 48.59% |
10% | 59.81% |
20% | 70.76% |
35% | 80.31% |
50% | 85.21% |
10 most common words:
- nie, to, się, jest, i, w, co, z, tak, na
How many words do you need to know to understand a text in German?
Here is the basic data:
- Total number of words (including repetitions): 6022
- Number of unique words: 1366
- Words are repeated on average: 4.4 times
Percentage of words known | Percentage of text understood |
5% | 49.04% |
10% | 63.05% |
20% | 75.12% |
35% | 83.68% |
50% | 88.66% |
10 most common words:
- ich, ist, du, nicht, das, und, sie, es, wir, die
How many words do you need to know to understand a text in French?
Here is the basic data:
- Total number of words (including repetitions): 7630
- Number of unique words: 1471
- Words are repeated on average: 5.2 times
Percentage of words known | Percentage of text understood |
5% | 52.27% |
10% | 65.53% |
20% | 77.47% |
35% | 85.56% |
50% | 90.37% |
10 most common words:
- je, de, la, pas, tu, ne, le, c’est, que, un
How many words do you need to know to understand a text in Arabic?
Here is the basic data:
- Total number of words (including repetitions): 5988
- Number of unique words: 2441
- Words are repeated on average: 2.5 times
Percentage of words known | Percentage of text understood |
5% | 42.20% |
10% | 52.83% |
20% | 63.65% |
35% | 73.54% |
50% | 79.64% |
10 most common words:
- لا, أن, من, هذا, في, كلا, آنا, أنا, على, ما
Conclusions from the experiment
The Pareto principle works well in languages where words have few forms. In English, where each word is repeated about 6 times on average, knowing 20% of the words is enough to understand nearly 80% of the text (78.55%). French and German come close, with 20% of the words covering about 77% and 75% of the text, respectively.
In Polish, which has a large number of noun forms, the situation is less favorable. Knowing 20% of the words lets you understand only about 71% of the text. To understand 80% of the text, you need to know 35% of the words.
Arabic fared the worst, mainly because the word “and” is joined to the next word, which significantly reduces how often individual words repeat in the text. Additionally, from what I noticed, the words in the subtitles I had access to were not always consistently separated by spaces, which could have affected the result. When I separated the word “and,” the results were much closer to Polish (20% of the words allowed understanding 66% of the text). If I had consistently separated all words, I believe the results would have been closer still.
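One possible way to separate the joined “and” automatically is sketched below. This is a rough heuristic and not necessarily how the separation was done for the numbers above: it splits a leading “و” only when the rest of the word also occurs on its own somewhere in the text. The function name `split_wa` and the sample words are assumptions for illustration. The heuristic misses words whose remainder never appears alone, and it can wrongly split a word that merely begins with the letter “و”.

```python
def split_wa(tokens):
    """Split a leading Arabic 'and' (و) off tokens whose remainder is a known word."""
    vocab = set(tokens)
    result = []
    for token in tokens:
        rest = token[1:]
        # Only split when the remainder is itself attested as a standalone word.
        if token.startswith("و") and rest in vocab:
            result.extend(["و", rest])
        else:
            result.append(token)
    return result

# "ولكن" is split because "لكن" occurs on its own; "وقت" is left intact.
print(split_wa(["لكن", "ولكن", "وقت"]))
```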
In general, we can draw the following conclusions:
Knowing 20% of lexemes is enough to understand 80% of the text in practically any language. A lexeme here is understood as a unit containing all word forms, so, for example, the lexeme ‘bałwan’ also includes forms like ‘bałwana,’ ‘bałwanem,’ etc.
When we take into account all words, not just lexemes, knowing 20% allows you to understand 60-80% of the text depending on the complexity of the language’s grammar. The more verb and noun forms and the more word combinations a language allows, the lower the percentage of text comprehension.
Therefore, the difficulty of learning a language can be approximated by the average word repetition rate in the text. The higher this rate, the fewer words are required to understand the text. Of course, another factor to consider is the percentage of exceptions to the rule when dealing with verb and noun inflections. For example, in Polish, a challenge is creating the genitive singular form of masculine nouns. In other languages, there can be many exceptions in verb conjugations. In contrast, Esperanto, which theoretically has more inflected forms than English, may be easier to learn because all forms are regular.
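The link between the repetition rate and text coverage can be checked directly against the figures reported above. The short sketch below divides the total word count by the number of unique words for each language and lists the result next to the coverage reached with 20% of the words. The variable names are made up for this example; the numbers are the ones from the tables.

```python
# (total words, unique words) as reported in the sections above
stats = {
    "English": (7747, 1241),
    "French": (7630, 1471),
    "German": (6022, 1366),
    "Polish": (6374, 1885),
    "Arabic": (5988, 2441),
}
# Percentage of text understood when knowing 20% of the words (from the tables)
coverage_at_20 = {
    "English": 78.55,
    "French": 77.47,
    "German": 75.12,
    "Polish": 70.76,
    "Arabic": 63.65,
}

# Average repetition rate = total words / unique words
rates = {lang: total / unique for lang, (total, unique) in stats.items()}

for lang in sorted(rates, key=rates.get, reverse=True):
    print(f"{lang}: repeated {rates[lang]:.1f} times on average, "
          f"{coverage_at_20[lang]}% understood with 20% of the words")
```

Sorting the five languages by repetition rate gives the same order as sorting them by coverage, which is consistent with the conclusion above, although five data points are of course a small sample.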
If you want to conduct a similar experiment on your own, here’s a brief step-by-step guide:
- Find subtitles for the movie and open them in Excel.
- Replace all punctuation marks with spaces, e.g., ,.!?”();:
- Replace hyphens with a space, i.e., “-” with “ ”.
- Replace double spaces with single spaces.
- Sort the column with sentences alphabetically.
- Remove lines with line numbers in the subtitles and time annotation lines, e.g., “00:01:56,866 --> 00:02:01,037.”
- Execute the “Text to Columns” command and set space as the separator.
- Individual words will be placed into columns.
- Sort each column to eliminate empty cells.
- Transfer the content of each column to the first column.
- Add a column title, e.g., “Words.”
- Create a pivot table, putting “Words” in the row field and the count of “Words” in the data field.
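If you prefer a script to a spreadsheet, the same steps can be approximated in Python. This is a sketch under a few assumptions: the subtitles are in the common SRT layout (a number line, a timing line, then text), the names `TIMING_RE`, `WORD_RE` and `count_words` are invented for this example, and the sample text is made up rather than taken from the film.

```python
import re
from collections import Counter

# Timing lines such as "00:01:56,866 --> 00:02:01,037"
TIMING_RE = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")
# Words may contain apostrophes; hyphens and other punctuation separate words.
WORD_RE = re.compile(r"\w+(?:['’]\w+)*")

def count_words(srt_text):
    """Count word forms in SRT subtitle text, skipping numbering and timing lines."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, subtitle numbers and time annotations.
        if not line or line.isdigit() or TIMING_RE.match(line):
            continue
        counts.update(w.lower() for w in WORD_RE.findall(line))
    return counts

sample = """1
00:01:56,866 --> 00:02:01,037
The snow is cold.

2
00:02:02,000 --> 00:02:04,500
The snowman isn't cold.
"""

counts = count_words(sample)
# Equivalent of the pivot table: each word with its number of occurrences.
for word, n in counts.most_common():
    print(word, n)
```

To use it on a real file, read the text first, for example with `open(path, encoding="utf-8-sig").read()`, where `path` is a placeholder for your subtitle file. Note that a subtitle line consisting only of a number is dropped, just as it would be in the Excel procedure.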
Article originally published in Polish at sekretypoliglotow.pl.