
N-gram Frequency Analysis

To analyze a keyboard layout, we calculate various metrics on groups of letters called n-grams. For example, one of the most important metrics is Same-Finger Bigram (SFB). SFB is calculated on bigrams—that is, pairs of consecutive letters. Take the word plot, which contains three bigrams: pl, lo, ot. When using standard typing technique on QWERTY, the bigram lo is an SFB because both l and o are typed with the right ring finger:

`  1 2 3 4 5 6 7 8 9 0 - =
    q w e r t y u i o p [ ] \
     a s d f g h j k l ; '
      z x c v b n m , . /

(Standard QWERTY. The o key on the top row and the l key on the home row are both pressed by the right ring finger.)

Within the word plot, one of the three bigrams is an SFB, so we say that QWERTY has 33.33% SFBs on this word.
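
As a sketch of this per-word calculation, the snippet below counts SFBs in a single word. The finger map is hand-written and covers only the keys in plot; it is illustrative, not a full layout definition.

# Hypothetical finger assignments for the keys in "plot" under standard
# QWERTY touch typing, written as (hand, finger) pairs.
FINGER = {
    "p": ("right", "pinky"),
    "l": ("right", "ring"),
    "o": ("right", "ring"),
    "t": ("left", "index"),
}

def word_sfb_ratio(word):
    """Fraction of the word's bigrams typed with the same finger."""
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    sfbs = sum(1 for first, second in bigrams if FINGER[first] == FINGER[second])
    return sfbs / len(bigrams)

print(word_sfb_ratio("plot"))  # lo is the only SFB: 1/3 = 0.333...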

Of course, not all words contain an SFB, and some words have multiple SFBs. To properly analyze a layout, we average the SFB ratio over a large collection of text, which we call a corpus (plural corpora). Corpora might contain books, articles, tweets, chat logs, computer code, or even logs of every key someone pressed while working. The idea is that this collection of text represents what a person normally types, so any metrics calculated over the corpus can be extrapolated to future typing.

Layouts Wiki uses the Reddit Corpus (small) from ConvoKit for keyboard layout analysis. This corpus is a collection of nearly 300,000 real conversations from Reddit, gathered from 100 popular communities in September 2018. It contains posts and comments from over 100,000 different users discussing everything from technology to hobbies to current events. What makes this corpus particularly valuable for keyboard layout analysis is that it represents how people actually communicate online—complete with informal language, abbreviations, slang, and the natural flow of digital conversation.

This is quite different from traditional text sources used in other language research, such as the Corpus of Contemporary American English (COCA), which draws from academic journals, magazines, and newspapers, or the iWeb corpus, which contains web pages but often includes more formal content like news articles and reference materials. Since most of our daily typing involves casual communication like emails, messages, and social media, using Reddit text gives us a more realistic picture of what most people type on a day-to-day basis.

The Reddit Corpus (small) contains approximately 54 million characters. Simulating every keystroke with optimized native code takes a few minutes per layout on a typical computer. If we were only interested in analyzing one layout, this would be fine. But here’s the problem: when optimizing keyboard layouts, we need to evaluate millions of different layout candidates, and at even one minute per layout, a million candidates would take roughly a million minutes, which is approximately 2 years. That’s far too long to be practical.

Because language is highly repetitive, a relatively small number of unigrams, bigrams, and trigrams comprise the vast majority of any large corpus. Rather than analyzing each occurrence individually, we can extract and count every unique n-gram once, then use those counts to calculate metrics such as SFB. After processing the Reddit Corpus (small), counting the n-grams gives us a table like this:

Unigram   Count        Bigram   Count        Trigram   Count
Total     54,137,118   Total    53,469,486   Total     52,802,304
␣         9,301,056    e␣       1,615,452    ␣th       903,554
e         5,111,241    ␣t       1,393,468    the       649,747
t         4,189,134    th       1,177,155    he␣       450,322
(28 more…)             (1,015 more…)         (20,146 more…)
q         35,383       kx       1            jhh       1

(␣ denotes the space character.)

After processing the Reddit Corpus (small), we find that the bigram lo occurs 128,850 times out of 53,469,486 total bigrams, or about 0.24% of all bigrams. The same count is recorded for every unigram, bigram, and trigram in the corpus.
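
As a rough sketch of how such a count table can be built, the snippet below assumes the corpus has already been loaded into a single string (how the text is fetched and normalized is up to the analyzer) and counts every length-n substring with a sliding window:

from collections import Counter

def count_ngrams(text, n):
    """Count every length-n substring of the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

text = "the theme of the thread"          # stand-in for the real corpus
bigram_counts = count_ngrams(text, 2)
total_bigrams = sum(bigram_counts.values())
print(bigram_counts.most_common(3), total_bigrams)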

Now, when evaluating a layout’s SFB:

sfb_count := 0
for each bigram in bigrams:
    finger1 := the finger that presses bigram[0]
    finger2 := the finger that presses bigram[1]
    if finger1 == finger2:
        add bigram's count to sfb_count
sfb_ratio := sfb_count / total bigram count
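
A runnable version of that pseudocode might look like the following, assuming bigram_counts maps each bigram to its corpus count (as built above) and finger_of maps each character to the finger that presses it under the layout being scored; both names are placeholders rather than the wiki's actual analyzer code:

def sfb_ratio(bigram_counts, finger_of):
    """Weighted share of bigrams whose two characters use the same finger."""
    total = sum(bigram_counts.values())
    sfb_count = sum(
        count
        for bigram, count in bigram_counts.items()
        # Skip bigrams containing characters that are not on the layout.
        if bigram[0] in finger_of
        and bigram[1] in finger_of
        and finger_of[bigram[0]] == finger_of[bigram[1]]
    )
    return sfb_count / total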

This approach can then be generalized across all of our metrics, since our metrics are defined as functions of the layout and unigram, bigram, and trigram ratios. As a result, instead of simulating over 50 million keystrokes individually, we can add up metrics for approximately 21,000 n-grams, allowing us to evaluate layout stats in milliseconds and making it practical to use stochastic techniques such as simulated annealing to optimize layouts.
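
To illustrate why this speed matters, here is a rough simulated-annealing skeleton; the swap move, cooling schedule, and single score function are stand-ins for illustration, not the wiki's actual optimizer:

import math
import random

def anneal(keys, bigram_counts, score, steps=100_000, temperature=1.0, cooling=0.9999):
    """Minimize score(layout, bigram_counts) by randomly swapping key positions."""
    current = list(keys)
    current_score = score(current, bigram_counts)
    for _ in range(steps):
        i, j = random.sample(range(len(current)), 2)
        candidate = current[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_score = score(candidate, bigram_counts)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse layouts with a probability
        # that shrinks as the temperature cools, to escape local minima.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
        temperature *= cooling
    return current, current_score

Each call to score reuses the roughly 21,000-row n-gram table rather than replaying 50 million keystrokes, which is what makes hundreds of thousands of iterations like this feasible.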