library(dplyr)
library(purrr)
library(stringr)
Have you heard of Wordle? Who am I kidding, of course you’ve heard of Wordle! In fact, I’m pretty certain we’re way past peak Wordle at this point.
Here’s a Wordle helper that doesn’t completely take the fun out of the guessing, while also making sure you’ve got a good chance at winning every time.
Type your guesses below and then use the buttons below each letter to report Wordle’s response. Press return to start the next guess. Use delete to remove letters or words you’ve entered.
As soon as you add the results for a new word, the table of next guess candidates will update! Pick wisely.
Intro
I started this post about a month ago, roughly at the same time that every other person with a blog about doing things with computers also decided to start writing a post about Wordle.
I’ve been tempted to just walk away from this post more than once. After all, since I started writing this post, Wordle has been solved more than once, Winston Chang rewrote Wordle in Shiny, and roughly 70 other people wrote Wordle clones or helper apps and packages in R alone. Felienne Hermans wrote a Twitter bot to guess the word from shared game emojis. Someone else wrote a bot to intentionally ruin everyone’s day by spoiling the answer to the next day’s Wordle. (Both bots were eventually suspended by Twitter.) Oh, and Wordle was bought for big money by the New York Times, who fumbled the handoff and lost more than a few players’ streaks in the transfer.
I should admit up-front that I’ve never really played Wordle. It’s exactly the kind of task that immediately cries out to be automated: I’d apparently much rather spend a month’s worth of after-hours tinkering time to think through and codify a decent strategy than to just think up some words on my own.
And yet I love Wordle. I think it’s awesome. The rules are simple, but deceptively ambiguous. The game play is so concise it can fit in a tweet (even though that’s annoying for accessibility reasons). Still, the UI is simple, intuitive and fun without trying to hack your brain to be addictive. It’s a feel good game.
Another reason to love Wordle: there are so many great programming tasks around Wordle. It’s easy to describe the mechanics, to understand the game play, to look at the app and think: I can do that. Which is why, right now, programmers are hard at work tinkering over word lists or practicing web development in their favorite framework using Wordle.
As an educator, it means you can tailor a Wordle-based programming challenge to be as simple or complicated as needed. Once you start to break down the game, it’s more complicated than it appears at first glance, and there’s so much to choose from. State management, data structures, browser storage, game theory, CSS, user interface design, accessibility. You can go deep on any of these topics.
So if you have a Wordle idea you want to tinker with, I wholeheartedly encourage you to run with it. Let Wordle inspire you to practice using regular expressions with stringr, web scraping with httr, text processing with Python, working with Twitter data with rtweet, or making accessible plots with ggplot2.
What follows here is a bit of a journey. It is not the best strategy for Wordle or even the best way to play. But along the way we’ll learn a few text processing tricks, we’ll write a few functions, and we’ll learn how to move seamlessly from R to the browser in the same document or blog post. (The R code and data I write below create the word data used in the table and app above!)
Let’s look at some words
Let’s dig in. To get started, I’m using a few of the usual suspects from the tidyverse package. Out of habit, I’ll load the ones I want specifically. (I think I also used tidyr somewhere in here, too.)
Now we’re ready to load our word list. At first I started with Scrabble’s word list, but it turns out that Wordle included the complete word list in its source code. (You could call it a hack but only in the state of Missouri.)
I used my elite hacker copy-and-paste skills to store Wordle’s word list as a JSON file (165K).
wordle_words <- jsonlite::fromJSON(
  "wordle.json",
  simplifyVector = TRUE
)
It turns out that Wordle maintains two separate lists. One list contains the 2,315 words used as solutions
sample(wordle_words$answers, 5)
[1] "exert" "petty" "scent" "fishy" "askew"
and the other contains the 10,657 words that the game considers a valid guess.
sample(wordle_words$words, 5)
[1] "vires" "neese" "koppa" "huhus" "ryked"
Do the two word lists overlap?
wordle_words %>%
  reduce(intersect) %>%
  length()
[1] 0
No, they do not (the intersection of the two word lists is empty). We could make things super easy for ourselves by only considering the words on the solution list, but that would really ruin the fun. So let’s combine the two lists.
words <- unlist(wordle_words)
Now let’s turn those words into data we can work with.
A letter popularity contest
Popularity by word
My first thought (and I think it’s many people’s first thought) was to consider the probability that a letter appears in a word. In other words: does R appear in more words than F?
To answer this we can split each word into a vector of letters, take only the unique letters, and then count how many words each letter appears in.
Splitting the word into a vector of letters is something we’ll be doing a lot, and stringr::str_split() or strsplit() can help. The trick is to use an empty string as the split pattern to break apart each string character by character.
str_split(c("unhip", "jeans"), "")
[[1]]
[1] "u" "n" "h" "i" "p"
[[2]]
[1] "j" "e" "a" "n" "s"
Note that this process takes our vector and gives us a list of vectors, which means we’ll be seeing a lot of purrr’s map() function in this post.
letter_freq <-
  words %>%
  # Split each word into a vector of letters
  str_split("") %>%
  # Keep one of each letter per word
  map(unique) %>%
  # Unlist into a big vector of letters
  unlist() %>%
  # Count the letters (each appearance in a word)
  table() %>%
  # Most popular letters first
  sort(decreasing = TRUE) %>%
  # Turn into frequency table
  `/`(length(words)) %>%
  # Remove attributes from table()
  c()
letter_freq
s e a o r i
0.457600987 0.439793401 0.410884983 0.301495529 0.301341351 0.276672834
l t n u d y
0.240055504 0.233811286 0.214847364 0.187789084 0.177150786 0.156567993
c p m h g b
0.148011101 0.145312982 0.144002467 0.131668208 0.118948504 0.117098366
k w f v z j
0.111316682 0.079247610 0.076318224 0.051958064 0.030141844 0.022278754
x q
0.022124576 0.008556892
Note that we only counted each letter once per word, so we now know that R appears in 30% of the words in the word list, while F appears in only 8%. A first guess that includes R would probably be better than one with an F.
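To see why each letter is counted only once per word, here is a small sketch in base R on a made-up four-word list. The words and the `toy_`, `by_word` and `by_char` names are illustrative assumptions, not part of the Wordle data. It compares the share of words that contain a letter with that letter's share of all characters.

```r
# Toy word list (illustrative only, not the Wordle list)
toy_words <- c("pools", "jeans", "unhip", "error")
toy_chars <- strsplit(toy_words, "")

# Share of words that contain each letter (one count per word)
by_word <- table(unlist(lapply(toy_chars, unique))) / length(toy_words)

# Share of all characters (repeats within a word count every time)
by_char <- table(unlist(toy_chars)) / sum(lengths(toy_chars))

by_word[c("o", "r", "s")]
by_char[c("o", "r", "s")]
```

In this toy list R appears in only a quarter of the words, even though it makes up as many characters as O, because all three of its appearances are in error. The per-word count is the one that matches the question "how many words would this letter rule in or out?"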
Popularity by position
Another way to look at letter frequency would be to consider the position of the letter in the word. What if we know that R and F are in the word: which is a more likely choice as the fourth letter?
To do this we…
- First turn the word list into a tibble with one row per word.
- Then, using tidyr::separate_rows(), we can add a new column with the letters in each word.
- Grouping by word and adding a row_number() gives us the position of each letter in the word.
- Then we can count the number of times each letter occurs in a given position with a new group_by() and summarize() (we could have used count() with another ungroup(), too).
- Then, if we re-use our letter-word counts from the last step, we can count the number of words that have the letter in question, so that our frequency effectively answers: given the letter R, how often does it appear as the fourth letter?
- Finally, tidyr::pivot_wider() moves the positions to the columns so the table is easier to read.
letter_freq_pos <-
  tibble(word = words) %>%
  select(word) %>%
  mutate(letter = word) %>%
  tidyr::separate_rows(letter, sep = "") %>%
  filter(letter != "") %>%
  group_by(word) %>%
  mutate(position = row_number()) %>%
  group_by(letter, position) %>%
  summarize(n = n(), .groups = "drop") %>%
  mutate(
    words = letter_freq[letter] * length(!!words),
    freq = n / words
  ) %>%
  select(-n, -words) %>%
  tidyr::pivot_wider(
    names_from = position,
    values_from = freq,
    values_fill = 0,
    names_prefix = "p"
  )
letter_freq_pos
# A tibble: 26 × 6
letter p1 p2 p3 p4 p5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 0.138 0.425 0.232 0.202 0.128
2 b 0.598 0.0533 0.221 0.160 0.0388
3 c 0.480 0.0917 0.204 0.214 0.0661
4 d 0.298 0.0366 0.170 0.205 0.358
5 e 0.0531 0.285 0.155 0.408 0.267
6 f 0.604 0.0242 0.180 0.235 0.0828
7 g 0.413 0.0493 0.236 0.274 0.0927
8 h 0.286 0.320 0.0703 0.138 0.217
9 i 0.0460 0.385 0.293 0.245 0.0780
10 j 0.699 0.0381 0.159 0.100 0.0104
# ℹ 16 more rows
Now we can answer our question about R and F in the fourth position.
letter_freq_pos %>%
  filter(letter %in% c("r", "f")) %>%
  select(letter, p4)
# A tibble: 2 × 2
letter p4
<chr> <dbl>
1 f 0.235
2 r 0.184
So R is the fourth letter in 18% of the words containing R — but F is the fourth letter in 24% of its words.
Ideally this information will help us filter guesses when we know that a set of letters are in the solution, but we don’t yet know where.
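The same "given the letter, how often is it in this position?" calculation can be sketched in base R on a toy list. The four words and the `toy_`, `n_pos`, `n_words` and `freq_pos` names are assumptions for illustration; the idea is simply position counts divided by the number of words that contain the letter.

```r
# Toy word list (illustrative only)
toy_words <- c("rover", "error", "fairy", "fifth")
toy_chars <- strsplit(toy_words, "")

# One entry per letter occurrence, with its position in the word
letter <- unlist(toy_chars)
position <- sequence(lengths(toy_chars))

# Rows are letters, columns are positions 1 to 5
n_pos <- table(letter, position)

# Number of words containing each letter (counted once per word)
n_words <- table(unlist(lapply(toy_chars, unique)))

# Divide each letter's row by the number of words containing that letter
freq_pos <- n_pos / as.vector(n_words[rownames(n_pos)])

round(freq_pos[c("f", "r"), ], 2)
```

One thing this makes visible: because the denominator counts words and the numerator counts occurrences, a row can add up to more than 1 when a letter repeats within a word (the R row here sums to 2). The same is true of the full table above.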
First Choice
What word should we guess first? Ideally, we want a word whose answer gives us the most information. Intuitively, if we pick a word that has the most popular letters and each letter is different, we’ll be able to discard or include the most words when Wordle tells us which letters are in or out.
Formally, this calculation is called entropy. It measures how much information is contained in a particular instance of a random process. In this case, words with higher entropy make better guesses because Wordle’s response to them tells us more about the solution.
This is all a little hand-wavy, so I’ll just duck the details and call this number a score. The higher the score, the better the word choice.
To calculate the entropy score, we take a word, split it into its letters, and then look up the probability of each letter appearing in a word. Duplicated letters don’t tell us much, so we set second appearances of a letter to the smallest probability in the table, which is close to zero. And then we calculate entropy
\[-\sum_{i=1}^{n} p_i \log_2 p_i\]
which in R code is
- sum(p * log(p, base = 2))
where p is a vector of probabilities for a given outcome.
We can wrap all of that up into a function score_entropy():
score_entropy <- function(word) {
  chars <- str_split(word, "")[[1]]
  p <- letter_freq[chars]
  # we learn something but not much from duplicated letters
  p[duplicated(chars)] <- min(letter_freq)
  - sum(p * log(p, base = 2))
}
Notice that score_entropy() isn’t vectorized, so we’ll have to use a map() function to call it over a vector of words. We can be even more specific and use map_dbl() since we know that score_entropy() returns a number.
c("unhip", "jeans", "pools") %>%
set_names() %>%
map_dbl(score_entropy)
unhip jeans pools
2.232150 2.163479 1.994939
This tells us, broadly, that unhip is a better choice than jeans and pools is worse than either. (Intuitively: you don’t learn much from the second O.)
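To make the arithmetic concrete, here is a base R sketch that reproduces those three scores from the frequencies printed earlier. Only the letters needed are copied over, and `freq` and `score_sketch()` are illustrative names rather than the post's own objects.

```r
# Frequencies copied from the letter_freq output above (only the letters needed);
# q is included because it is the smallest value in the full table
freq <- c(
  s = 0.457600987, e = 0.439793401, a = 0.410884983, o = 0.301495529,
  i = 0.276672834, l = 0.240055504, n = 0.214847364, u = 0.187789084,
  p = 0.145312982, h = 0.131668208, j = 0.022278754, q = 0.008556892
)

score_sketch <- function(word, freq) {
  chars <- strsplit(word, "")[[1]]
  p <- freq[chars]
  # Second and later appearances of a letter get the smallest frequency
  p[duplicated(chars)] <- min(freq)
  -sum(p * log2(p))
}

scores <- vapply(
  c("unhip", "jeans", "pools"),
  score_sketch,
  numeric(1),
  freq = freq
)
round(scores, 3)
```

For pools, the first O contributes roughly 0.52 to the score, while the second O, swapped for the smallest frequency, contributes only about 0.06. That gap is most of the difference between pools and the other two words.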
Let’s use this to create a table of words and their associated entropy scores. Taking a peek at the highest scoring words tells us…
words_first_choice <-
  tibble(word = words) %>%
  mutate(score = map_dbl(word, score_entropy)) %>%
  arrange(desc(score))
words_first_choice
# A tibble: 12,972 × 2
word score
<chr> <dbl>
1 arose 2.61
2 aeros 2.61
3 soare 2.61
4 arise 2.60
5 raise 2.60
6 aesir 2.60
7 reais 2.60
8 serai 2.60
9 osier 2.59
10 realo 2.59
# ℹ 12,962 more rows
… that according to this measure, the best first-choice words are arose, aeros, and soare. arose uses all five of the letters that most commonly appear in a word, and is also probably (okay, it is) on the answers list, so hello new first word choice!
Second Choice
After your first choice, you know up to three pieces of additional information. Some of the letters in your guess:
- aren’t in the solution (dark square)
- are in the solution but not where you guessed (yellow square)
- are in the solution and are where you guessed (green square)
None of the letters are in the solution
What if you guessed arose and got five gray boxes telling you that none of those letters appear in the solution?
We need to discard any words with A, R, O, S, or E in them. To do this, we’ll write a small function str_has_none_of() that takes a vector of words and a vector of letters, and checks that none of the letters appear in each word. Technically, we use our same str_split() trick to split each word into a vector of letters and then check that the intersection of word letters and unwanted letters is empty.
str_has_none_of <- function(words, letters) {
  words <- str_split(words, "")
  map_lgl(words, ~ length(intersect(letters, .x)) == 0)
}
Using this function, we can quickly reduce our word list from 12,972 to 577 words.
words_first_choice %>%
  filter(str_has_none_of(word, c("a", "r", "o", "s", "e")))
# A tibble: 577 × 2
word score
<chr> <dbl>
1 unlit 2.43
2 until 2.43
3 linty 2.39
4 clint 2.38
5 unlid 2.38
6 culti 2.36
7 tulip 2.35
8 uplit 2.35
9 unity 2.35
10 lindy 2.34
# ℹ 567 more rows
This new word list suggests that unlit or until would be a good next choice, so we’ll go with until. And if none of the letters in arose and until appear in the solution…
letters_guess <- str_split("arose until", "")[[1]]

words_first_choice %>%
  filter(str_has_none_of(word, letters_guess))
# A tibble: 3 × 2
word score
<chr> <dbl>
1 pygmy 1.65
2 hyphy 1.33
3 gyppy 1.31
then your answer is most definitely one of pygmy, hyphy, or gyppy.
Right letter, wrong place
If you learn something from the guess, though, you can filter the word list based on the information you just learned.
Say we guess arose and Wordle reveals that R and O appear in the solution.
We now know that the solution:
- Doesn’t have A, S or E
- Does contain R and O
- Doesn’t have R as the 2nd letter or O as the 3rd.
We’ve already implemented the first step by discarding words with str_has_none_of(). We also need a similar version called str_has_all_of() to keep only words that have letters we know are in the solution.
str_has_all_of <- function(words, letters) {
  words <- str_split(words, "")
  map_lgl(words, ~ length(setdiff(letters, .x)) == 0)
}
str_has_all_of("rhino", c("r", "o"))
[1] TRUE
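As a quick sanity check of how the two filters combine, here is a base R sketch with equivalent logic. The names `has_none_of()`, `has_all_of()` and `candidates` are hypothetical stand-ins, written without stringr or purrr so the snippet runs on its own.

```r
# TRUE when a word contains none of `chars`
has_none_of <- function(words, chars) {
  vapply(strsplit(words, ""), function(w) !any(chars %in% w), logical(1))
}

# TRUE when a word contains every one of `chars`
has_all_of <- function(words, chars) {
  vapply(strsplit(words, ""), function(w) all(chars %in% w), logical(1))
}

candidates <- c("rhino", "motor", "arose", "pygmy")
none_ase <- has_none_of(candidates, c("a", "s", "e"))
all_ro <- has_all_of(candidates, c("r", "o"))

candidates[none_ase & all_ro]
```

Only rhino and motor survive both checks: arose is dropped for containing A, S and E, and pygmy for lacking R and O. Note that these are set-style checks, so neither one says anything about how many times a letter appears.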
And finally we can use regular expressions to keep track of the third piece of information:
.[^r][^o]..
A . means any letter at that spot in the word (other than the ones we’ve excluded). The [] indicate a set of options that could be present at a location in the string. The opening ^ negates the selection, so [^r] means a character that isn’t r.
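Here is a small sketch of that pattern applied to a handful of words with base R's grepl(). The `candidates` vector is made up for illustration.

```r
pattern <- ".[^r][^o].."
candidates <- c("intro", "rhino", "arose", "broom", "floor")

# TRUE when the 2nd letter isn't r and the 3rd letter isn't o
keep <- grepl(pattern, candidates)
setNames(keep, candidates)
```

intro and rhino pass; arose and broom fail on the R in second position, and floor fails on the O in third position. The pattern isn't anchored with ^ and $, which is fine here only because every word is exactly five letters long, so the five-character pattern can only match the whole word.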
words_first_choice %>%
  filter(
    str_has_none_of(word, c("a", "s", "e")),
    str_has_all_of(word, c("r", "o")),
    str_detect(word, ".[^r][^o]..")
  )
# A tibble: 142 × 2
word score
<chr> <dbl>
1 lirot 2.54
2 intro 2.52
3 nitro 2.52
4 nidor 2.47
5 roily 2.47
6 loric 2.46
7 toric 2.45
8 milor 2.45
9 corni 2.44
10 porin 2.44
# ℹ 132 more rows
lirot is an unusual word, so let’s choose the next word on the list: intro.
Wordle thinks and tells us that we have T in the right spot! Also, we now know that I and N aren’t in the solution, and we still haven’t got R and O in the right place.
Right letter, right place
We can repeat the step above, but using a new regular expression:
.[^r]t[^r][^o]
Notice that we know a little more about where R and O can’t be, but importantly the t in the middle position ensures we find words with T in the right place.
This leaves us with a few good choices:
words_first_choice %>%
  filter(
    str_has_none_of(word, c("a", "s", "e", "i", "n")),
    str_has_all_of(word, c("r", "o", "t")),
    str_detect(word, ".[^r]t[^r][^o]")
  )
# A tibble: 4 × 2
word score
<chr> <dbl>
1 rotch 2.33
2 tutor 2.05
3 motor 1.99
4 rotor 1.65
rotch seems very unlikely, so we can pick from tutor, motor and rotor. But notice that these include a small set of the same letters. In a sense, we might ask ourselves a new question: which is the most likely starting combination, tu, mo or ro?
At this point, you could just guess. It is a game after all! But no, let’s power forward and add more complexity to this blog post.
What if we switched our scoring at this point and considered the position of the letters in the candidate words? Doing something medium-naive, let’s frame this as: what’s the probability of T in the first position and U in the second and so on…
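score_by_position <- function(word) {
  chars <- str_split(word, "")[[1]]

  res <- c()
  for (i in seq_along(chars)) {
    # letter_freq_pos has one row per letter in alphabetical order,
    # so the letter's index in `letters` is also its row number
    pos_alpha <- which(letters == chars[i])
    p <- letter_freq_pos[[str_c("p", i)]][pos_alpha]
    res <- c(res, p)
  }
  prod(res)
}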
words_score_pos <-
  tibble(word = words) %>%
  mutate(
    score_pos = map_dbl(word, score_by_position),
    score_pos = score_pos / diff(range(score_pos))
  ) %>%
  arrange(desc(score_pos))
words_score_pos
# A tibble: 12,972 × 2
word score_pos
<chr> <dbl>
1 foxes 1.00
2 boxes 0.991
3 jones 0.864
4 juves 0.808
5 coxes 0.795
6 faxes 0.792
7 poxes 0.754
8 fones 0.746
9 bones 0.739
10 fixes 0.719
# ℹ 12,962 more rows
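The raw (unscaled) position score is just a product of five table lookups. The sketch below works that out for a single word using a few rows copied from the letter_freq_pos printout above. The word beach, the `pos_freq` matrix and `score_pos_sketch()` are illustrative assumptions, and this version looks rows up by letter name rather than by alphabetical row order.

```r
# Rows copied from the letter_freq_pos table above (columns are p1 to p5)
pos_freq <- rbind(
  a = c(0.138, 0.425, 0.232, 0.202, 0.128),
  b = c(0.598, 0.0533, 0.221, 0.160, 0.0388),
  c = c(0.480, 0.0917, 0.204, 0.214, 0.0661),
  e = c(0.0531, 0.285, 0.155, 0.408, 0.267),
  h = c(0.286, 0.320, 0.0703, 0.138, 0.217)
)

score_pos_sketch <- function(word, pos_freq) {
  chars <- strsplit(word, "")[[1]]
  # For each character pick (row = letter, column = position)
  idx <- cbind(match(chars, rownames(pos_freq)), seq_along(chars))
  prod(pos_freq[idx])
}

# b in position 1, e in 2, a in 3, c in 4, h in 5
score_pos_sketch("beach", pos_freq)
```

The score_pos column shown above is this product divided by the range of the products over the whole word list, which is why its top value is close to 1.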
If we join this with our “new information” score, we now have two scores to choose from:
words_first_choice %>%
  filter(
    str_has_none_of(word, c("a", "s", "e", "i", "n")),
    str_has_all_of(word, c("r", "o", "t")),
    str_detect(word, ".[^r]t[^r][^o]")
  ) %>%
  left_join(words_score_pos) %>%
  arrange(desc(score_pos))
# A tibble: 4 × 3
word score score_pos
<chr> <dbl> <dbl>
1 motor 1.99 0.0304
2 tutor 2.05 0.0200
3 rotch 2.33 0.0199
4 rotor 1.65 0.0132
Now we see that motor and tutor are the most likely words based on their position. We guess motor… and we’re right!
It only took three guesses! It’s almost like I planned this example to work out just like I wanted.
Generalizing
Okay, let’s do this for any number of guesses. First, let’s join our scored words into a single data frame.
words_scored <-
  left_join(
    words_first_choice,
    words_score_pos,
    by = "word"
  )
Then, we need a function that takes our guesses and results and generalizes them into the pieces of information our guesses reveal about the solution. This function is going to take a vector of guesses and a vector of results. The guesses are just the words we guessed, but we’ll need to invent a syntax to concisely report the results. Here’s the syntax I decided to use:
- . means the letter is absent
- - means the letter is present (wrong place)
- + means the letter is correct (right place)
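If you want to generate result strings in this syntax rather than type them, a helper along these lines could do it. This is a sketch, not part of the post's code: `wordle_result()` is a hypothetical function, and it assumes the usual Wordle rule that a repeated letter in the guess is only flagged as many times as it occurs in the solution.

```r
# Hypothetical helper: encode Wordle's response to `guess` given `solution`
# using "." (absent), "-" (present, wrong place) and "+" (correct)
wordle_result <- function(guess, solution) {
  g <- strsplit(guess, "")[[1]]
  s <- strsplit(solution, "")[[1]]
  res <- rep(".", length(g))

  # Mark exact matches first
  exact <- g == s
  res[exact] <- "+"

  # Solution letters not matched exactly are still available for "-"
  remaining <- s[!exact]
  for (i in which(!exact)) {
    hit <- match(g[i], remaining)
    if (!is.na(hit)) {
      res[i] <- "-"
      # Use up this copy so a repeated guess letter isn't over-counted
      remaining <- remaining[-hit]
    }
  }
  paste(res, collapse = "")
}

wordle_result("arose", "motor")  # ".--.."
wordle_result("intro", "motor")  # "..+--"
```

Those two calls reproduce the result strings used for the arose and intro guesses in the walkthrough, where the solution turned out to be motor.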
In broad strokes, the function will take each guess and use the result to:
- Pull out the correct letters and their positions into exact so we can pick out words with letters in those spots.
- Pull out present letters and their positions into exclude so we can compose the regular expression to filter out words that have these letters in those places.
- Add the present but wrong place letters to bucket_keep, a bucket of letters that we know are in the solution.
- And add any absent letters to bucket_discard so we can filter out words that have any of these letters.
- The last step is to compose the regular expression pattern from exact and exclude, and then return the regexp and the letters to keep and discard.
#' @param guesses A vector of words that you have guessed
#' @param results A vector of results for each guess using `.` for a miss, `-`
#'   for a letter in the solution that isn't in the right place and `+` for a
#'   letter that's in the right spot.
summarize_guesses <- function(guesses, results) {
  stopifnot(all(str_length(c(guesses, results)) == 5))
  guesses <- str_split(guesses, "")
  results <- str_split(results, "")

  exclude <- character(5)
  exact <- character(5)
  bucket_keep <- c()
  bucket_discard <- c()

  for (i in seq_along(guesses)) {
    g <- guesses[[i]]
    r <- results[[i]]

    if (any(r == "+")) {
      exact[r == "+"] <- g[r == "+"]
      # local assignment, so correct letters land in the keep bucket too
      bucket_keep <- c(bucket_keep, g[r == "+"])
    }
    if (any(r == "-")) {
      bucket_keep <- c(bucket_keep, g[r == "-"])
      exclude[r == "-"] <- paste0(exclude[r == "-"], g[r == "-"])
    }
    if (any(r == ".")) {
      bucket_discard <- c(bucket_discard, g[r == "."])
    }
  }

  exclude[exclude != ""] <- paste0("[^", exclude[exclude != ""], "]")
  exclude[exclude == ""] <- NA_character_
  exact[exact == ""] <- NA_character_

  pattern <- coalesce(coalesce(exact, exclude), ".")

  # Say you guess a word with two Ts,
  # but there's only one T in the solution.
  # T will appear on keep and discard bucket,
  # so we need to explicitly keep it.
  # (we could use that info, though, e.g. at most 1 T)
  bucket_discard <- setdiff(bucket_discard, bucket_keep)

  list(
    discard = unique(bucket_discard),
    keep = unique(bucket_keep),
    pattern = str_c(pattern, collapse = "")
  )
}
Remember when we guessed arose and got this result?
Our new function summarizes the information we’ve learned from this guess.
summarize_guesses(
  guesses = "arose",
  results = ".--.."
)
$discard
[1] "a" "s" "e"
$keep
[1] "r" "o"
$pattern
[1] ".[^r][^o].."
Then we guessed intro and got this result.
And again we have this summary.
guess_results <-
  summarize_guesses(
    guesses = c("arose", "intro"),
    results = c(".--..", "..+--")
  )

guess_results
$discard
[1] "a" "s" "e" "i" "n"
$keep
[1] "r" "o" "t"
$pattern
[1] ".[^r]t[^r][^o]"
To get the remaining possible words, we can use this information to filter down to the words that
- have none of the $discard letters
- have all of the $keep letters
- match the regular expression $pattern.
words_scored %>%
  filter(
    str_has_none_of(word, guess_results$discard),
    str_has_all_of(word, guess_results$keep),
    str_detect(word, guess_results$pattern)
  )
# A tibble: 4 × 3
word score score_pos
<chr> <dbl> <dbl>
1 rotch 2.33 0.0199
2 tutor 2.05 0.0200
3 motor 1.99 0.0304
4 rotor 1.65 0.0132
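The comment inside summarize_guesses() mentions that a letter landing in both buckets carries extra information, such as "at most one T". One way that idea might be used is a count-based filter. The sketch below is an assumption about how that could look, with a hypothetical `has_at_most()` helper in base R; it is not something the functions above do.

```r
# Hypothetical helper: TRUE for words with no more than `n` copies of `letter`
has_at_most <- function(words, letter, n) {
  counts <- vapply(
    strsplit(words, ""),
    function(w) sum(w == letter),
    integer(1)
  )
  counts <= n
}

candidates <- c("rotch", "tutor", "motor", "rotor")
candidates[has_at_most(candidates, "t", 1)]
```

With a cap of one T, tutor drops out and rotch, motor and rotor remain.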
All together now
Now that we know how to summarize and use the guess results to filter our next word choices, we can do this in one step with another small function, score_next_guess().
score_next_guess <- function(guesses, results) {
  guess_results <- summarize_guesses(guesses, results)

  words_scored %>%
    filter(
      str_has_none_of(word, guess_results$discard),
      str_has_all_of(word, guess_results$keep),
      str_detect(word, guess_results$pattern)
    )
}
Having guessed arose and intro, what would happen if we guessed rotch next?
score_next_guess(
guesses = c("arose", "intro", "rotch"),
results = c(".--..", "..+--", "-++..")
)
# A tibble: 1 × 3
word score score_pos
<chr> <dbl> <dbl>
1 motor 1.99 0.0304
From rotch we learn that the first letter isn’t R, but the second letter is o, which leaves us just one choice: motor.
Guessing Wordle words in real life
Beginner’s Luck
I wrapped up the score_next_guess() function on January 16th, 2022, which happened to be the easiest Wordle day of any day I’ve “played”. But it was a nice motivator to feel like I had spent my Sunday tinkering time well.
Opening with arose led to a pleasant surprise.
From 12,972 words down to 37 words with our first guess. Nice!
# 2022-01-16
score_next_guess(
guesses = c("arose"),
results = c("----.")
)
# A tibble: 37 × 3
word score score_pos
<chr> <dbl> <dbl>
1 solar 2.58 0.0327
2 soral 2.58 0.0327
3 ratos 2.58 0.0404
4 rotas 2.58 0.0576
5 sorta 2.58 0.0401
6 taros 2.58 0.102
7 toras 2.58 0.145
8 sonar 2.56 0.0416
9 roans 2.56 0.0923
10 sarod 2.53 0.0537
# ℹ 27 more rows
Let’s just pick the first word on the list: solar.
Very nice!
Problematic words
In working on this, I ran into more than a few posts that had trouble with a few more obscure words, like igloo and ferry.
igloo
How many guesses would it take for us to get to igloo?
Round 1
Opening with arose is helpfulish.
score_next_guess(
guesses = c("arose"),
results = c("..-..")
)
# A tibble: 463 × 3
word score score_pos
<chr> <dbl> <dbl>
1 doilt 2.46 0.0680
2 indol 2.45 0.000646
3 tondi 2.44 0.0195
4 lotic 2.43 0.00802
5 noily 2.42 0.0711
6 pilot 2.42 0.0501
7 colin 2.41 0.0801
8 nicol 2.41 0.00613
9 tonic 2.41 0.0198
10 ontic 2.41 0.000670
# ℹ 453 more rows
Many of the words are obviously not the answer. Pilot is the first reasonable word on the list, and its score is relatively similar to the other top word choices, so I’d go with pilot.
Round 2
Picking pilot is a good choice!
score_next_guess(
guesses = c("arose", "pilot"),
results = c("..-..", ".-++.")
)
# A tibble: 1 × 3
word score score_pos
<chr> <dbl> <dbl>
1 igloo 1.95 0.000268
Round 3
🎉 Great work!
ferry
Apparently there was a general furor about ferry when it was the Wordle solution of the day. Let’s see how long it takes us to get to that word.
Round 1
Opening with arose narrows down our word choices to 357 words.
score_next_guess(
guesses = c("arose"),
results = c(".-..-")
)
# A tibble: 357 × 3
word score score_pos
<chr> <dbl> <dbl>
1 liter 2.54 0.0250
2 relit 2.54 0.0180
3 tiler 2.54 0.0485
4 liner 2.53 0.0425
5 inert 2.52 0.000951
6 inter 2.52 0.00199
7 niter 2.52 0.0157
8 uteri 2.50 0.000332
9 idler 2.49 0.000788
10 riled 2.49 0.0604
# ℹ 347 more rows
liter is both a word and at the top of our list, so it’s an easy next choice.
Round 2
The word list is now full of words with similar patterns, so let’s sort by position score to help us choose.
score_next_guess(
  guesses = c("arose", "liter"),
  results = c(".-..-", "...--")
) %>%
  arrange(desc(score_pos))
# A tibble: 50 × 3
word score score_pos
<chr> <dbl> <dbl>
1 jerky 1.94 0.334
2 ferny 2.22 0.235
3 perky 2.22 0.218
4 jerry 1.64 0.177
5 query 1.97 0.153
6 ferry 1.80 0.153
7 berry 1.88 0.151
8 pervy 2.09 0.145
9 perdy 2.31 0.128
10 kerky 1.87 0.125
# ℹ 40 more rows
Round 3
Now we’re down to 17 words to choose from. Still complicated. But if we arrange by position score, our top two choices are ferny and ferry.
You can see where this is headed, but let’s pretend we had no idea. Which would you pick?
score_next_guess(
  guesses = c("arose", "liter", "jerky"),
  results = c(".-..-", "...--", ".++.+")
) %>%
  arrange(desc(score_pos))
# A tibble: 17 × 3
word score score_pos
<chr> <dbl> <dbl>
1 ferny 2.22 0.235
2 ferry 1.80 0.153
3 berry 1.88 0.151
4 pervy 2.09 0.145
5 perdy 2.31 0.128
6 germy 2.23 0.122
7 derny 2.38 0.116
8 perry 1.92 0.115
9 mercy 2.27 0.109
10 merry 1.92 0.0937
11 verry 1.74 0.0907
12 derry 1.96 0.0753
13 herry 1.91 0.0723
14 derby 2.27 0.0655
15 herby 2.21 0.0629
16 nervy 2.16 0.0371
17 nerdy 2.38 0.0328
Round 4
Now we’re down to 1 word to choose from.
score_next_guess(
  guesses = c("arose", "liter", "jerky", "ferny"),
  results = c(".-..-", "...--", ".++.+", "+++.+")
) %>%
  arrange(desc(score_pos))
# A tibble: 1 × 3
word score score_pos
<chr> <dbl> <dbl>
1 ferry 1.80 0.153
Round 5
🎉 We did it! 5 isn’t bad, especially considering the terrible choices we had in round 3.
Make it an app
It’s awesome being able to run R code to test things out, but it’s also a little tedious. Since we’ve done the heavy lifting of prepping and scoring words, it’d be great if we could have a little web app that would help us
- Input our guesses and results
- Show us possible words after each round
And since I’m writing this blog post in R Markdown via blogdown, I can do it all right here!
Move the data from R to the web
The first thing we need to do is save our data in a way that it can be accessed by JavaScript in the browser. To do this, we’ll take our words_scored table and use jsonlite::write_json() to save the data frame as JSON.
words_scored %>%
  mutate(across(starts_with("score"), ~ round(.x, digits = 2))) %>%
  jsonlite::write_json("wordle-scored.json")
Now we have the data in a JSON file (that you can download if you want).
But to make life even easier, I’m going to use a trick I learned from htmlwidgets. What we can do is embed the JSON file, which is only 589K, in a <script type="application/json"> tag with a specific id that makes it easy to find later on.
htmltools::tags$script(
  id = "words-scored",
  type = "application/json",
  readLines("wordle-scored.json")
)
Now that we have the data in a place where we can get it, let’s switch gears and write some JavaScript!
Start working in JavaScript
Here’s the cool thing: from here on out, the actual computation of the rest of the blog post is done in your browser. To facilitate this, I’ll use an extension I built for knitr for literate JavaScript programming with the js4shiny package.
js4shiny
Setting up literate JavaScript in blogdown is pretty straight-forward thanks to a little helper function from js4shiny.
js4shiny::html_setup_blogdown(stylize = "none")
tidyjs
The other cool thing we’ll use is tidyjs. It’s a really neat JavaScript library that makes it easy to work with data frames in the browser. If you squint really hard, it’s remarkably similar to the tidyverse, just with a JavaScript spin.
I wrapped tidyjs in an R package that automatically stays up to date with the latest version of tidyjs. To use tidyjs, we just need to call use_tidyjs().
tidyjs::use_tidyjs()
Now that we’ve included tidyjs in this page, we can finally switch to writing JavaScript instead of R.
First, we need to import a couple of functions from tidyjs that we’re going to want to use. With tidyjs, all transformations are wrapped in a call to tidy(), so we have to import tidy. We also need filter() and sliceMax() for easy filtering.
const { tidy, filter, sliceMax } = Tidy
Load our data
The next step is to find the JSON data that we just serialized and stashed in our page. We can use document.getElementById() to find the element with the id 'words-scored', and then grab the JSON text itself from the .innerText property of that object. Finally, we call JSON.parse() on the JSON text to parse it into a JavaScript object.
wordsScored = JSON.parse(
  document.getElementById('words-scored').innerText
)
Preview the data
Here’s a quick preview of the data. In tidyjs you wrap a pipeline in tidy() and then each additional argument to tidy() is the next step in the pipe. To make it look a little more familiar to R users, I’ve added the %>% in the comments.
tidy(
  wordsScored, // %>%
  sliceMax(5, 'score')
)
Same song, different dance
Summarizing guesses
Next, we translate summarize_guesses() from R to summarizeGuesses() in JavaScript.
function summarizeGuesses ({ guesses, results }) {
  // Check that all guesses and results have 5 characters
  const allComplete = [...guesses, ...results].every(s => s.length == 5)
  if (!allComplete) {
    console.error('All guesses and results must have 5 characters.')
    return
  }

  // R: str_split(x, '')
  guesses = guesses.map(s => s.split(''))
  results = results.map(s => s.split(''))

  let exclude = Array(5).fill('')
  let exact = Array(5).fill('')
  let keep = []
  let discard = []

  for (let i = 0; i < guesses.length; i++) {
    let g = guesses[i] // g: an array of 5 letters of a guess
    let r = results[i] // r: an array of 5 letters of the result
    for (let j = 0; j < r.length; j++) {
      if (r[j] == '+') {
        // this letter is exactly right
        exact[j] = g[j]
        keep.push(g[j])
      } else if (r[j] == '-') {
        // this letter is included, wrong place
        keep.push(g[j])
        // so exclude it from this position
        exclude[j] += g[j]
      } else {
        // this letter isn't in the solution
        discard.push(g[j])
      }
    }
  }

  // build up the regex pattern blending `exact` and `exclude`
  const pattern = Array(5).fill('.')
  for (let i = 0; i < 5; i++) {
    if (exact[i] != '') {
      pattern[i] = exact[i]
    } else if (exclude[i] != '') {
      pattern[i] = `[^${exclude[i]}]`
    }
  }

  // a letter can be both kept and discarded (repeated letters), so keep wins
  discard = discard.filter(x => !keep.includes(x))
  return {discard, keep, pattern: pattern.join('')}
}
Here’s a quick preview of summarizeGuesses().
let summary = summarizeGuesses({
  guesses: ["arose", "indol"],
  results: ["..-..", "+..+-"]
})
console.log(summary)
Searching for the next word
And then we need to do the same for score_next_guess(). Of course, at this point I’m older and wiser and choose a better name: searchNextGuess().
function searchNextGuess ({ guesses, results }) {
  const guessResult = summarizeGuesses({guesses, results})

  return tidy(
    wordsScored,
    // discard words that contain a letter in the discard pile
    filter(d => !guessResult.discard.some(l => d.word.includes(l))),
    // keep only words that have all letters in the keep pile
    filter(d => guessResult.keep.every(l => d.word.includes(l))),
    // keep words that are consistent with results to date
    filter(d => RegExp(guessResult.pattern).test(d.word))
  )
}
Let’s prove to ourselves that these functions work.
let next = searchNextGuess({
guesses: ["arose", "indol"],
results: ["..-..", "+..+-"]
})
console.log(`There is ${next.length} word available for our next guess:`)
console.table(next[0])
Let’s try again. What if we chose a different second guess?
let rounds = {
  guesses: ["arose", "intro"],
  results: [".--..", "..+--"]
}
let next = searchNextGuess(rounds)
console.log('Guess summary ----')
console.log(summarizeGuesses(rounds))
console.log('Next word choices ----')
next.forEach(ws => console.log(`${ws.word} (${ws.score})`))
Now build the rest of the owl
Okay, this is the point where I confess that I went way off-track in building the little app at the top of this post. I fully intended to write about that part too, but honestly I’ve done a good job curing myself of the Wordle bug with this post.
For the curious, all the JavaScript code for the guess helper lives in wordle-component.js. Or, right click on this page and pick Inspect Element and find your way to the Sources or Debugger tab for a better look. It’s all vanilla JavaScript.
Also a quick shout-out to gridjs, which turned out to be a very easy way to create the table of sorted words.