N-gram Frequency Analysis

To analyze a keyboard layout, we calculate various metrics on groups of letters called n-grams. For example, one of the most important metrics is Same-Finger Bigram (SFB). SFB is calculated on bigrams—that is, pairs of adjacent letters. Take the word plot, which contains three bigrams: pl, lo, ot. When using standard typing technique on QWERTY, the bigram lo is an SFB because both l and o are typed with the right ring finger:

[Figure: QWERTY keyboard diagram highlighting the O and L keys, both pressed by the right ring finger.]

So within the word plot, because one of the three bigrams is an SFB, we say that QWERTY has 33.33% SFBs.
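To make this concrete, here is a minimal Python sketch of the per-word calculation. The FINGER map is an assumption based on standard touch-typing finger assignments (only letter keys are mapped), and word_sfb_ratio is a hypothetical helper:

# Hypothetical finger map for standard QWERTY touch typing
# (0-3 = left pinky through left index, 4-7 = right index through right pinky).
FINGER = {
    "q": 0, "a": 0, "z": 0,
    "w": 1, "s": 1, "x": 1,
    "e": 2, "d": 2, "c": 2,
    "r": 3, "f": 3, "v": 3, "t": 3, "g": 3, "b": 3,
    "y": 4, "h": 4, "n": 4, "u": 4, "j": 4, "m": 4,
    "i": 5, "k": 5,
    "o": 6, "l": 6,
    "p": 7,
}

def word_sfb_ratio(word: str) -> float:
    """Fraction of a word's bigrams typed with the same finger."""
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    sfbs = sum(1 for a, b in bigrams if FINGER[a] == FINGER[b])
    return sfbs / len(bigrams)

print(word_sfb_ratio("plot"))  # lo is the only SFB of 3 bigrams: 0.333...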

Of course, not all words contain an SFB, and some words have multiple SFBs. To properly analyze a layout, we average the SFB ratio over a large collection of text, which we call a corpus (plural corpora). Corpora might contain books, articles, tweets, chat logs, computer code, or even logs of every key someone pressed while working. The idea is that this collection of text represents what a person normally types, so any metrics calculated over the corpus can be extrapolated to future typing.

The Shai/iWeb corpus is a popular English corpus for keyboard layout design. For a given layout, a computer can count the number of SFBs in Shai/iWeb and divide that by the total number of bigrams. Using this method, we can calculate that SFBs are 5.92% of all bigrams for QWERTY.
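As a sketch, that naive per-keystroke scan might look like the following, reusing the hypothetical FINGER map from above and using corpus.txt as a stand-in for the corpus file:

def corpus_sfb_ratio(path: str) -> float:
    """Scan every adjacent pair of keys in the corpus, one by one."""
    sfbs = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.lower()
            for a, b in zip(line, line[1:]):
                if a in FINGER and b in FINGER:
                    total += 1
                    if FINGER[a] == FINGER[b]:
                        sfbs += 1
    return sfbs / total

# Only letter-letter bigrams are considered in this sketch, so the exact
# figure depends on how space and punctuation keys are handled.
print(corpus_sfb_ratio("corpus.txt"))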

Shai/iWeb contains approximately 91 million words and 527 million characters. If we run the typing simulation using optimized native code, this analysis may take a few minutes on a typical computer. If we were only interested in analyzing one layout, that would be fine. But here’s the problem: when optimizing keyboard layouts, we need to evaluate millions of different layout candidates. Even at just one minute per layout, a million candidates would take a million minutes, and 1 million minutes is approximately 2 years. That’s far too long to be practical.

Because language is highly repetitive, a relatively small number of unigrams, bigrams, and trigrams comprise the vast majority of any large corpus. Rather than analyzing each occurrence individually, we can extract and count every unique n-gram once, then use those counts to calculate metrics such as SFB.
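A minimal sketch of that counting pass, assuming the corpus is a plain-text file; collections.Counter does the bookkeeping, and count_ngrams is a hypothetical helper:

from collections import Counter

def count_ngrams(path: str, n: int) -> Counter:
    """Count every unique length-n substring in the corpus."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Treat line breaks as spaces so word boundaries are preserved.
            line = line.lower().replace("\n", " ")
            counts.update(line[i:i + n] for i in range(len(line) - n + 1))
    return counts

unigram_counts = count_ngrams("corpus.txt", 1)
bigram_counts = count_ngrams("corpus.txt", 2)
trigram_counts = count_ngrams("corpus.txt", 3)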

Furthermore, since the feel of a layout is determined almost exclusively by the most common n-grams, we can completely ignore n-grams that fall below a frequency cutoff. With a threshold of 100 parts per million for bigrams and 10 parts per million for trigrams, counting the n-grams for Shai/iWeb gives us a table like this:

Unigram       Count          Bigram        Count          Trigram       Count
Total         526,555,491    Total         526,555,490    Total         526,555,489
␣             88,658,235     e␣            15,454,487     ␣th           8,256,380
e             50,497,522     ␣t            13,180,277     the           6,802,477
t             38,338,645     th            10,712,957     he␣           5,421,447
(28 more…)                   (415 more…)                  (4,211 more…)
z             415,633        'l            53,754         vul           5,267

(␣ denotes the space character.)
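The cutoff itself is a one-pass filter over those counts. A sketch, reusing the Counter objects from the previous sketch (an n-gram's parts-per-million frequency is just count / total × 1,000,000):

def filter_by_ppm(counts, threshold_ppm):
    """Drop n-grams rarer than the parts-per-million cutoff."""
    total = sum(counts.values())
    min_count = total * threshold_ppm / 1_000_000
    return {ng: c for ng, c in counts.items() if c >= min_count}

bigram_table = filter_by_ppm(bigram_counts, 100)    # 100 ppm cutoff
trigram_table = filter_by_ppm(trigram_counts, 10)   # 10 ppm cutoff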

After processing Shai/iWeb, we find that the bigram lo occurs 1,457,994 times out of 526,555,490 total bigrams. We record a count like this for every bigram in the corpus.

Now, evaluating a layout’s SFB ratio becomes a loop over that table:

sfb_count := 0
for each (bigram, count) in the bigram table:
    finger1 := the finger that presses bigram[0]
    finger2 := the finger that presses bigram[1]
    if finger1 == finger2:
        sfb_count := sfb_count + count
sfb_ratio := sfb_count / total_bigram_count
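In runnable Python, under the same assumptions as the earlier sketches (layout_finger maps each key to a finger, and bigram_table holds the filtered counts), this might look like:

def layout_sfb_ratio(layout_finger, bigram_table, total_bigrams):
    """SFB ratio computed from pre-counted bigram frequencies."""
    sfb_count = 0
    for bigram, count in bigram_table.items():
        a, b = bigram
        if a in layout_finger and layout_finger[a] == layout_finger.get(b):
            sfb_count += count
    return sfb_count / total_bigrams

# For QWERTY this should land near the 5.92% figure above, depending on
# how space and punctuation keys are assigned.
print(layout_sfb_ratio(FINGER, bigram_table, 526_555_490))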

This approach generalizes across all of our metrics, since each metric is defined as a function of the layout and the unigram, bigram, and trigram ratios. As a result, instead of simulating over 500 million keystrokes individually, we sum contributions from fewer than 5,000 unique n-grams, allowing us to evaluate layouts in hours or days rather than years.
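As an illustration of that generalization (weighted_metric and is_sfb are hypothetical names), every such metric can share one weighted-sum loop, with only the per-n-gram predicate changing:

def weighted_metric(ngram_table, total, matches):
    """Weighted fraction of corpus n-grams satisfying a predicate."""
    return sum(count for ngram, count in ngram_table.items()
               if matches(ngram)) / total

def is_sfb(bigram):
    a, b = FINGER.get(bigram[0]), FINGER.get(bigram[1])
    return a is not None and a == b

# SFB is one instance; other metrics swap in a different predicate.
sfb = weighted_metric(bigram_table, 526_555_490, is_sfb)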