#> ── Attaching core tidyverse packages ───────────────────────
#> ✔ dplyr 1.0.10 ✔ readr 2.1.3
#> ✔ forcats 0.5.2 ✔ stringr 1.5.0
#> ✔ ggplot2 3.4.0 ✔ tibble 3.1.8
#> ✔ lubridate 1.9.0 ✔ tidyr 1.2.1
#> ✔ purrr 1.0.1
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Twitter finds itself in an… interesting… transition period. Whether or not you’re considering jumping ship to another service — you can find me lurking on Mastodon — you should download an archive of your Twitter data. Not only does the archive include all of your tweets, it also contains a variety of other interesting data about your account: who you followed and who followed you; the tweets you liked; the ads you were served; and much more.
This post, very much inspired by the awesome Observable notebook, Planning to leave Twitter?, shows you how to use R to read and explore your archive, using my own archive as an example.
Read on to learn how to read your Twitter archive into R, or how to tidy your tweets. The second half of the post showcases a collection of plots about monthly tweet volume, popular tweets, the time of day when tweets were sent, and the app used to send the tweet.
I’ve also included a section on using rtweet to collect a full dataset about the tweets you’ve liked and another section about the advertising data in your Twitter archive.
Reading your Twitter archive
Get your Twitter data archive
First things first, you need to have your Twitter data archive. If you don’t have it yet, go to Settings and Privacy and click Download an archive of your data. After you submit the request, it takes about a day or so for an email to show up in your inbox.
@grrrck your Twitter data is ready
Your Twitter archive is ready for you to download and view using your desktop browser. Make sure you download it before Nov 12, 2022, 9:46:31 PM
The archive downloads as a zip file containing a standalone web page — called Your archive.html — for exploring your data. But the real archive lives in the included data/ folder as a bunch of .js files. I’ve copied that data/ directory into my working directory for this post.
Setup
On the R side, we’ll need the usual suspects: tidyverse and glue.
library(tidyverse)
library(glue)
(I’m using the dev version of tidyverse (1.3.2.9000), which loads lubridate automatically, and the dev version of purrr that is slated to become version 1.0.0.)
To read in the data files, I’ll use jsonlite to read the archive JSON data, with a small assist from brio for fast file reading. I’m also going to have some fun with ggiraph for turning static ggplot2 plots into interactive plots.
Finally, the Twitter archive doesn’t require API access to Twitter, but you can use it to augment the data in the archive. The rtweet package is excellent for this, even though it takes a little effort to get it set up.
Read the manifest
The data/ folder is surprisingly well structured! There are two key files to help you find your way around the archive. First, the README.txt file explains the structure and layout of the files, and includes descriptions of the data contained in all of the files.
Here’s how the README describes the account.js data file:
account.js
- email: Email address currently associated with the account if an email address has been provided.
- createdVia: Client application used when the account was created. For example: “web” if the account was created from a browser.
- username: The account’s current @username. Note that the @username may change but the account ID will remain the same for the lifetime of the account.
- accountId: Unique identifier for the account.
- createdAt: Date and time when the account was created.
- accountDisplayName: The account’s name as displayed on the profile.
The data/ folder also contains a manifest.js file that can be used to help read the data included in the archive. Let’s start by assuming this file is JSON and reading it in.
jsonlite::fromJSON("data/manifest.js")
#> Error in parse_con(txt, bigint_as_char): lexical error: invalid char in json text.
#> window.__THAR_CONFIG = { "use
#> (right here) ------^
Here we hit our first snag. The archive files are packaged as JSON, but they’re not strictly compliant JSON files; they include some JavaScript to assign JSON objects to the global namespace (called window in the browser). Here’s the data/manifest.js file as an example.
window.__THAR_CONFIG = {
// ... data ...
}
If we just remove everything up to the first { (or sometimes [) on the first line, we can turn the data into valid JSON.
lines[1] <- sub("^[^{[]+([{[])", "\\1", lines[1])
manifest <- jsonlite::fromJSON(lines)
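If you want to convince yourself that the substitution handles both first-line shapes found in the archive, here it is pulled out into a throwaway helper (fix_first_line() is mine, just for illustration; the sample strings mimic the archive files):

```r
# Strip everything before the first "{" or "[" on a line, keeping that
# opening bracket (the captured group) in place.
fix_first_line <- function(line) sub("^[^{[]+([{[])", "\\1", line)

fix_first_line('window.__THAR_CONFIG = { "userInfo" : {')  # drops 'window.__THAR_CONFIG = '
fix_first_line("window.YTD.account.part0 = [")             # returns "["
```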
This worked, but… jsonlite was designed for statistical work, so it transforms the data structure when reading in the JSON. For example, by default it converts arrays that look like JSON-ified data frames into actual data.frames.
manifest$dataTypes[1:2] |> str()
#> List of 2
#> $ account :List of 1
#> ..$ files:'data.frame': 1 obs. of 3 variables:
#> .. ..$ fileName : chr "data/account.js"
#> .. ..$ globalName: chr "YTD.account.part0"
#> .. ..$ count : chr "1"
#> $ accountCreationIp:List of 1
#> ..$ files:'data.frame': 1 obs. of 3 variables:
#> .. ..$ fileName : chr "data/account-creation-ip.js"
#> .. ..$ globalName: chr "YTD.account_creation_ip.part0"
#> .. ..$ count : chr "1"
That’s often quite helpful! But I find it’s safer, when trying to generalize data reading, to disable the simplification and know for certain that the data structure matches the original JSON. For that reason, I tend to disable the matrix and data.frame simplifications and only allow jsonlite to simplify vectors.
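To make the difference concrete, here’s a toy comparison (the JSON string is made up to mimic a manifest entry):

```r
json <- '[{"fileName": "data/account.js", "count": "1"}]'

# Default behavior: jsonlite spots the array of objects and builds a data.frame
str(jsonlite::fromJSON(json))

# With simplifyDataFrame = FALSE, the structure mirrors the JSON exactly:
# a list containing one named list
str(jsonlite::fromJSON(json, simplifyDataFrame = FALSE))
```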
Here’s a quick helper function that includes those settings and the first-line substitution needed to read the archive JSON files.
read_archive_json <- function(path) {
  lines <- brio::read_lines(path)
  lines[1] <- sub("^[^{[]+([{[])", "\\1", lines[1])

  jsonlite::fromJSON(
    txt = lines,
    simplifyVector = TRUE,
    simplifyDataFrame = FALSE,
    simplifyMatrix = FALSE
  )
}
Now we’re ready to read the manifest again.
manifest <- read_archive_json("data/manifest.js")
names(manifest)
#> [1] "userInfo" "archiveInfo" "readmeInfo" "dataTypes"
The manifest file contains some information about the user and the archive,
str(manifest$userInfo)
#> List of 3
#> $ accountId : chr "47332433"
#> $ userName : chr "grrrck"
#> $ displayName: chr "garrick aden-buie"
plus details about all of the various data included in the archive, like the data about my account.
str(manifest$dataTypes$account)
#> List of 1
#> $ files:List of 1
#> ..$ :List of 3
#> .. ..$ fileName : chr "data/account.js"
#> .. ..$ globalName: chr "YTD.account.part0"
#> .. ..$ count : chr "1"
Each dataType in the manifest points us to a file (or files) in the archive and helpfully tells us how many records are included.
Here are the data files with the most records.
Code: Manifest, Top Records
manifest$dataTypes |>
  # All data types we can read have a "files" item
  keep(~ "files" %in% names(.x)) |>
  # We keep the files objects but still as a list of lists within a list
  map("files") |>
  # Turn the files into tibbles (list of tibbles within a list)
  map_depth(2, as_tibble) |>
  # Then combine the files tables for each item keeping track of the file index
  map(list_rbind, names_to = "index") |>
  # And finally combine files for all items
  list_rbind(names_to = "item") |>
  mutate(across(count, as.integer)) |>
  select(-globalName, -index) |>
  slice_max(count, n = 15) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    table.attr = 'class="table"',
    format = "html"
  )
| item | fileName | count |
|---|---|---:|
| like | data/like.js | 11,773 |
| follower | data/follower.js | 9,030 |
| tweetHeaders | data/tweet-headers.js | 6,225 |
| tweets | data/tweets.js | 6,225 |
| ipAudit | data/ip-audit.js | 3,787 |
| following | data/following.js | 1,519 |
| contact | data/contact.js | 645 |
| listsMember | data/lists-member.js | 254 |
| block | data/block.js | 242 |
| adImpressions | data/ad-impressions.js | 173 |
| adEngagements | data/ad-engagements.js | 171 |
| directMessageHeaders | data/direct-message-headers.js | 97 |
| directMessages | data/direct-messages.js | 97 |
| userLinkClicks | data/user-link-clicks.js | 67 |
| connectedApplication | data/connected-application.js | 63 |
Reading the account data file
For a first example, let’s read the data/account.js archive file. We start by inspecting the manifest, where manifest$dataTypes$account tells us which files hold the account data and how many records are in each.
manifest$dataTypes$account |> str()
#> List of 1
#> $ files:List of 1
#> ..$ :List of 3
#> .. ..$ fileName : chr "data/account.js"
#> .. ..$ globalName: chr "YTD.account.part0"
#> .. ..$ count : chr "1"
Here there’s only one file containing a single account record: data/account.js. Inside that file is a small bit of JavaScript. Like the manifest, it’s almost JSON, except that it assigns the JavaScript object to window.YTD.account.part0.
window.YTD.account.part0 = [
{"account" : {
"email" : "my-email@example.com",
"createdVia" : "web",
"username" : "grrrck",
"accountId" : "47332433",
"createdAt" : "2009-06-15T13:21:50.000Z",
"accountDisplayName" : "garrick aden-buie"
}
} ]
And again, if we clean up the first line, this is valid JSON that we can read in directly with jsonlite.
account <- read_archive_json("data/account.js")
str(account)
#> List of 1
#> $ :List of 1
#> ..$ account:List of 6
#> .. ..$ email : chr "my-email@example.com"
#> .. ..$ createdVia : chr "web"
#> .. ..$ username : chr "grrrck"
#> .. ..$ accountId : chr "47332433"
#> .. ..$ createdAt : chr "2009-06-15T13:21:50.000Z"
#> .. ..$ accountDisplayName: chr "garrick aden-buie"
This leads us to our first fun fact: I created my Twitter account on June 15, 2009, which means that I’ve been using Twitter (on and off) for 13.6 years. That’s 4,981 days of twittering!
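That arithmetic is easy to reproduce once createdAt is parsed; here’s a sketch that counts forward to today, so the numbers will drift upward from the ones quoted above:

```r
# Parse the ISO-8601 createdAt timestamp; the trailing "Z" is simply
# ignored by the format string, and we pin the timezone to UTC.
created <- as.POSIXct(
  "2009-06-15T13:21:50.000Z",
  format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"
)

days_on_twitter <- as.numeric(difftime(Sys.time(), created, units = "days"))
floor(days_on_twitter)              # days of twittering
round(days_on_twitter / 365.25, 1)  # roughly the years on Twitter
```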
Read any archive item
Let’s generalize what we learned into a few helper functions we can reuse. I’ve placed everything into a single code block so that you can copy and paste it into your R session or script to use it right away.
#' Read the Twitter Archive JSON
#'
#' @param path Path to a Twitter archive `.js` file
read_archive_json <- function(path) {
  lines <- brio::read_lines(path)
  lines[1] <- sub("^[^{[]+([{[])", "\\1", lines[1])

  jsonlite::fromJSON(
    txt = lines,
    simplifyVector = TRUE,
    simplifyDataFrame = FALSE,
    simplifyMatrix = FALSE
  )
}

#' Read a Twitter archive data item
#'
#' @param manifest The list from `manifest.js`
#' @param item The name of an item in the manifest
read_twitter_data <- function(manifest, item) {
  manifest$dataTypes[[item]]$files |>
    purrr::transpose() |>
    purrr::pmap(\(fileName, ...) read_archive_json(fileName))
}

#' Simplify the data, if possible and easy
#'
#' @param x A list of lists as returned from `read_twitter_data()`
#' @param simplifier A function that's applied to each item in the
#'   list of lists and that can be used to simplify the output data.
simplify_twitter_data <- function(x, simplifier = identity) {
  x <- purrr::flatten(x)
  item_names <- x |> purrr::map(names) |> purrr::reduce(union)
  if (length(item_names) > 1) return(x)

  x |>
    purrr::map(item_names) |>
    purrr::map_dfr(simplifier)
}
Quick recap: to use the functions above, load your archive manifest with read_archive_json() and then pass it to read_twitter_data() along with an item name from the archive. If the data in the archive item is reasonably structured, you can call simplify_twitter_data() to get a tidy tibble.
manifest <- read_archive_json("data/manifest.js")
account <- read_twitter_data(manifest, "account")

simplify_twitter_data(account)
#> # A tibble: 1 × 6
#> email creat…¹ usern…² accou…³ creat…⁴ accou…⁵
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 my-email@example.… web grrrck 473324… 2009-0… garric…
#> # … with abbreviated variable names ¹createdVia, ²username,
#> # ³accountId, ⁴createdAt, ⁵accountDisplayName
Example: my followers
Let’s use this on another archive item to find the earliest Twitter adopters among my followers.
# These tables are wide, you may need to scroll to see the preview
options(width = 120)
followers <-
  read_twitter_data(manifest, "follower") |>
  simplify_twitter_data()
Then we can arrange the rows of followers by accountId as a proxy for date of account creation.
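The as.numeric() in the next step matters: accountId is stored as a character column, and string order is not numeric order. A quick illustration with a few of the IDs:

```r
ids <- c("944231", "1496", "11309")

sort(ids)              # lexicographic order: "11309", "1496", "944231"
sort(as.numeric(ids))  # numeric order: 1496, 11309, 944231
```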
early_followers <-
  followers |>
  arrange(as.numeric(accountId)) |>
  slice_head(n = 11)

# Top 11 earliest followers
early_followers
#> # A tibble: 11 × 2
#> accountId userLink
#> <chr> <chr>
#> 1 1496 https://twitter.com/intent/user?user_id=1496
#> 2 11309 https://twitter.com/intent/user?user_id=11309
#> 3 37193 https://twitter.com/intent/user?user_id=37193
#> 4 716213 https://twitter.com/intent/user?user_id=716213
#> 5 741803 https://twitter.com/intent/user?user_id=741803
#> 6 755726 https://twitter.com/intent/user?user_id=755726
#> 7 774234 https://twitter.com/intent/user?user_id=774234
#> 8 787219 https://twitter.com/intent/user?user_id=787219
#> 9 799574 https://twitter.com/intent/user?user_id=799574
#> 10 860921 https://twitter.com/intent/user?user_id=860921
#> 11 944231 https://twitter.com/intent/user?user_id=944231
As you can see, some parts of the Twitter archive include the barest minimum amount of data. Thankfully, we can still use rtweet to gather additional data about these users. I’m looking at a small subset of my 9,030 followers here, but you might want to do this for all your followers and save the collected user data in your archive.
early_followers_accounts <-
  early_followers |>
  pull(accountId) |>
  rtweet::lookup_users()

early_followers_accounts |>
  select(id, name, screen_name, created_at, followers_count, description)
#> # A tibble: 11 × 6
#> id name screen_name created_at followers_count description
#> <int> <chr> <chr> <dttm> <int> <chr>
#> 1 1496 Aelfrick Aelfrick 2006-07-16 14:44:05 25 ""
#> 2 11309 Aaron Khoo aklw 2006-11-02 07:14:47 240 "I am a weapon of …
#> 3 37193 Rob coleman 2006-12-02 11:54:15 654 "data science / la…
#> 4 716213 Tim Dennis jt14den 2007-01-27 16:54:12 971 "Data librarian/di…
#> 5 741803 @AlgoCompSynth@ravenation.club by znmeb znmeb 2007-02-01 00:03:16 9755 "https://t.co/rZhZ…
#> 6 755726 Travis Dawry tdawry 2007-02-06 23:45:01 274 "data, politics, o…
#> 7 774234 Shea's Coach Beard mandoescamilla 2007-02-15 13:51:10 1177 "my anger is a gif…
#> 8 787219 Jonathan jmcphers 2007-02-21 15:56:20 591 "Software engineer…
#> 9 799574 @dietrich@mastodon.social dietrich 2007-02-27 18:41:20 6113 "A lifestyle brand…
#> 10 860921 ⌜will⌟ wtd 2007-03-09 23:20:43 719 "👋 I'm an optimis…
#> 11 944231 Christopher Peters 🇺🇦 statwonk 2007-03-11 14:49:39 4476 "Lead Econometrici…
My tweets
Now we get to the main course: the tweets themselves. We can read them in the same way that we imported accounts and followers with read_twitter_data(), but for now we won’t simplify them.
To see why, let’s take a look at a single tweet. The file of tweets (outer list, [[1]]) contains an array (inner list, e.g. [[105]]) of tweets (named item, $tweet). Here’s that example tweet:
# Tweets are a list of a list of tweets...
tweet <- read_twitter_data(manifest, "tweets")[[1]][[105]]$tweet
str(tweet, max.level = 2)
#> List of 16
#> $ edit_info :List of 1
#> ..$ initial:List of 4
#> $ retweeted : logi FALSE
#> $ source : chr "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>"
#> $ entities :List of 5
#> ..$ user_mentions:List of 1
#> ..$ urls :List of 1
#> ..$ symbols : list()
#> ..$ media :List of 1
#> ..$ hashtags :List of 1
#> $ display_text_range: chr [1:2] "0" "236"
#> $ favorite_count : chr "118"
#> $ id_str : chr "1276198597596459018"
#> $ truncated : logi FALSE
#> $ retweet_count : chr "33"
#> $ id : chr "1276198597596459018"
#> $ possibly_sensitive: logi FALSE
#> $ created_at : chr "Thu Jun 25 17:00:30 +0000 2020"
#> $ favorited : logi FALSE
#> $ full_text : chr "Thanks to prodding from @dsquintana, I added `include_tweet()` to {tweetrmd}. Automatically embed the HTML twee"| __truncated__
#> $ lang : chr "en"
#> $ extended_entities :List of 1
#> ..$ media:List of 1
There’s quite a bit of data in each tweet, so we’ll pause here and figure out how we want to transform the nested list into a flat list that will rectangle nicely.
tidy_tweet_raw <- function(tweet_raw) {
  basic_items <- c(
    "created_at",
    "favorite_count",
    "retweet_count",
    "full_text",
    "id",
    "lang",
    "source"
  )

  # start with a few basic items
  tweet <- tweet_raw[basic_items]

  # and collapse a few nested items into a single string
  tweet$user_mentions <- tweet_raw |>
    purrr::pluck("entities", "user_mentions") |>
    purrr::map_chr("screen_name") |>
    paste(collapse = ",")

  tweet$hashtags <- tweet_raw |>
    purrr::pluck("entities", "hashtags") |>
    purrr::map_chr("text") |>
    paste(collapse = ",")

  tweet
}
When we apply this function to the example tweet, we get a nice, flat list.
tidy_tweet_raw(tweet) |> str()
#> List of 9
#> $ created_at : chr "Thu Jun 25 17:00:30 +0000 2020"
#> $ favorite_count: chr "118"
#> $ retweet_count : chr "33"
#> $ full_text : chr "Thanks to prodding from @dsquintana, I added `include_tweet()` to {tweetrmd}. Automatically embed the HTML twee"| __truncated__
#> $ id : chr "1276198597596459018"
#> $ lang : chr "en"
#> $ source : chr "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>"
#> $ user_mentions : chr "dsquintana"
#> $ hashtags : chr "rstats"
This flattened tweet list will end up becoming a row in a tidy table of tweets thanks to simplify_twitter_data(), which is used to flatten the list of all of the tweets into a tibble. Once combined into a single table, we use our good friends dplyr, lubridate and stringr to convert columns to their correct format and to extract a few features.
tidy_tweets <-
  read_twitter_data(manifest, "tweets") |>
  simplify_twitter_data(tidy_tweet_raw) |>
  mutate(
    across(contains("_count"), as.integer),
    retweet = str_detect(full_text, "^RT @"),
    reply = str_detect(full_text, "^@"),
    type = case_when(
      retweet ~ "retweet",
      reply ~ "reply",
      TRUE ~ "tweet"
    ),
    created_at = strptime(created_at, "%a %b %d %T %z %Y"),
    hour = hour(created_at),
    day = wday(created_at, label = TRUE, abbr = TRUE, week_start = 1),
    month = month(created_at, label = TRUE, abbr = FALSE),
    day_of_month = day(created_at),
    year = year(created_at)
  )
The result… a nice tidy table of tweets!
tidy_tweets
#> # A tibble: 6,223 × 17
#> created_at favori…¹ retwe…² full_…³ id lang source user_…⁴ hasht…⁵ retweet reply type hour day month
#> <dttm> <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <int> <ord> <ord>
#> 1 2022-11-05 10:02:17 0 0 "RT @g… 1588… en "<a h… "georg… "" TRUE FALSE retw… 10 Sat Nove…
#> 2 2022-11-04 19:42:01 4 0 "@JonT… 1588… en "<a h… "JonTh… "" FALSE TRUE reply 19 Fri Nove…
#> 3 2022-11-04 15:21:23 1 0 "@tjma… 1588… en "<a h… "tjmah… "" FALSE TRUE reply 15 Fri Nove…
#> 4 2022-11-03 12:39:09 1 0 "@trav… 1588… en "<a h… "trave… "" FALSE TRUE reply 12 Thu Nove…
#> 5 2022-11-03 06:45:53 5 0 "@mcca… 1588… en "<a h… "mccar… "" FALSE TRUE reply 6 Thu Nove…
#> 6 2022-11-03 06:36:56 2 0 "@trav… 1588… en "<a h… "trave… "" FALSE TRUE reply 6 Thu Nove…
#> 7 2022-11-02 12:26:46 0 0 "RT @p… 1587… en "<a h… "posit… "" TRUE FALSE retw… 12 Wed Nove…
#> 8 2022-11-02 12:20:50 4 0 "And I… 1587… en "<a h… "" "" FALSE FALSE tweet 12 Wed Nove…
#> 9 2022-10-31 11:47:57 0 0 "RT @D… 1587… en "<a h… "Dante… "" TRUE FALSE retw… 11 Mon Octo…
#> 10 2022-10-30 19:32:22 8 0 "At fi… 1586… en "<a h… "pomol… "" FALSE FALSE tweet 19 Sun Octo…
#> # … with 6,213 more rows, 2 more variables: day_of_month <int>, year <dbl>, and abbreviated variable names
#> # ¹favorite_count, ²retweet_count, ³full_text, ⁴user_mentions, ⁵hashtags
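One detail from the pipeline above worth calling out: Twitter’s created_at strings put the year last, which the strptime() format mirrors. A quick standalone check (note that %a and %b assume an English LC_TIME locale):

```r
# Parse Twitter's classic timestamp format; %T is shorthand for %H:%M:%S
# and %z consumes the "+0000" UTC offset.
x <- strptime("Thu Jun 25 17:00:30 +0000 2020", "%a %b %d %T %z %Y", tz = "UTC")
format(x, "%Y-%m-%d %H:%M:%S")
#> [1] "2020-06-25 17:00:30"
```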
If you’ve seen the Observable notebook that inspired this post, you’ll notice that I’ve mostly recreated their data structure, but in R. Next, let’s recreate some of the plots in that notebook, too!
Monthly tweets, replies and retweets
Code: Set Blog Theme
Yeah, so real quick, I’m going to set up a plot theme for the rest of this post. Here it is, if you’re interested in this kind of thing!
blog_theme <-
  theme_minimal(18, base_family = "IBM Plex Mono") +
theme(
plot.background = element_rect(fill = "#f9fafa", color = NA),
plot.title.position = "plot",
plot.title = element_text(size = 24, margin = margin(b = 1, unit = "line")),
legend.position = c(0, 1),
legend.direction = "horizontal",
legend.justification = c(0, 1),
legend.title.align = 1,
axis.title.y = element_text(hjust = 0),
axis.title.x = element_text(hjust = 0),
panel.grid.major = element_line(color = "#d3d9db"),
panel.grid.minor = element_blank()
)
theme_set(blog_theme)
The first chart shows the number of tweets, replies and retweets sent in each month from 2009 to 2022. From 2009 to 2015, I sent about 25 total tweets per month, with one large spike in January 2014 when a grad school course I was taking decided to do a “Twitter seminar.” My Twitter usage dropped off considerably between 2015 and 2018: first from the grind of grad school, and then, after my son was born in 2016, tweeting practically stopped altogether.
My Twitter usage picked up again in 2018, which also coincided with my realization that academia wasn’t my ideal future. In 2018 and 2019 you can see my baseline usage pick up considerably at the start of the year — the effects of a lot of tweeting and networking during rstudio::conf. Since 2019, my usage has been fairly stable; I typically send between 50 and 100 tweets a month. Finally, there’s a noticeable recent drop in activity: since Twitter changed ownership I still read Twitter but only occasionally tweet.
Hover or tap on a bar above to see the top 5 tweets in each segment.
Code: Plot Monthly Tweets
type_colors <- c(reply = "#5e5b7f", tweet = "#ef8c02", retweet = "#7ab26f")

top_5_tweets_text <- function(data) {
  slice_max(
    data,
    n = 5,
    order_by = retweet_count * 2 + favorite_count,
    with_ties = FALSE
  ) |>
    pull(full_text) |>
    str_trunc(width = 120)
}
plot_monthly <-
  tidy_tweets |>
  # Group nest by month and tweet type ----
  mutate(dt_month = sprintf("%d-%02d", year, month(created_at))) |>
  group_nest(dt_month, month, year, type) |>
  mutate(
    # Calculate number of tweets per month/type
    n = map_int(data, nrow),
    # and extract the top 5 tweets
    top = map(data, top_5_tweets_text)
  ) |>
  select(-data) |>
  # Then build the tooltip (one row per month/type)
  rowwise() |>
  mutate(
    type_pl = plu::ral(type, n = n),
    tooltip = glue::glue(
      "<p><strong>{month} {year}: ",
      "<span style=\"color:{type_colors[type]}\">{n} {type_pl}</span></strong></p>",
      "<ol>{tweets}</ol>",
      tweets = paste(sprintf("<li>%s</li>", top), collapse = "")
    ),
    tooltip = htmltools::HTML(tooltip)
  ) |>
  ungroup() |>
  # Finally ensure the order of factors (including month!)
  mutate(type = factor(type, rev(c("tweet", "reply", "retweet")))) |>
  arrange(dt_month, type) |>
  mutate(dt_month = fct_inorder(dt_month)) |>
  # Plot time! ----
  ggplot() +
  aes(x = dt_month, y = n, fill = type, color = type, group = type) +
  ggiraph::geom_col_interactive(
    width = 1,
    aes(tooltip = tooltip)
  ) +
  scale_fill_manual(values = type_colors) +
  scale_color_manual(values = type_colors) +
  # The x-axis is factors for each month,
  # we need labels for each year, e.g. 2010-01 => 2010
  scale_x_discrete(
    breaks = paste0(seq(2008, 2022, by = 1), "-01"),
    labels = seq(2008, 2022, by = 1)
  ) +
  scale_y_continuous(expand = expansion(add = c(1, 1))) +
  labs(
    title = "Tweets per month",
    x = "Month Tweeted →",
    y = "Count →",
    fill = NULL,
    color = NULL
  ) +
  theme(
    plot.title = element_text(size = 24, margin = margin(b = 2, unit = "line")),
    legend.position = c(0, 1.14)
  )

ggiraph::girafe(
  ggobj = plot_monthly,
  width_svg = 14,
  height_svg = 6,
  desc = knitr::opts_current$get("fig.alt")
)
Popular tweets, likes & retweets
Which tweets earned the most internet points? The next plot displays tweets that had at least 5 retweets or 5 favorites. Note that I’ve fiddled with the axis scales; both are log-scales and each break shows (roughly) a doubling of internet points in each direction. Interestingly, for “popular” tweets (please note the air-quotes) retweets and favorites appear to be log-linear: a doubling of one generally corresponds to a doubling of the other, although my tweets tended to receive about 4 times as many likes as retweets.
There’s also some pretty interesting stuff going on in the low-retweets but high-favorites area. Popular tweets are cool, but the tweets that got lots of likes without being retweeted are the feel-good tweets that made me feel like I was part of a community online.
Code: Plot Popular Tweets
jitter <- function(x) {
  spread <- min(1, x * 0.2)
  x + runif(1, -spread, spread)
}
plot_popular_tweets <-
  tidy_tweets |>
  filter(retweet_count >= 5 | favorite_count >= 5) |>
  mutate(
    age = difftime(Sys.time(), created_at, units = "days"),
    age = as.numeric(age) / 365.25,
    created_at = strftime(created_at, '%a %b %e, %Y'),
    full_text = str_replace_all(full_text, "\n\n", "</p><p>"),
    full_text = str_replace_all(full_text, "\n", "<br>"),
    tooltip = glue(
      "<p>{full_text}</p><dl>",
      "<dt>♲</dt><dd>{retweet_count}</dd>", # recycling icon
      "<dt>♥</dt><dd>{favorite_count}</dd>", # heart icon
      "<dt>✎</dt><dd>", # pencil icon
      "<a href=\"https://twitter.com/grrrck/status/{id}\">{created_at}</a>",
      "</dd></dl>"
    )
  ) |>
  rowwise() |>
  mutate(across(c(retweet_count, favorite_count), jitter)) |>
  ungroup() |>
  ggplot() +
  aes(
    x = favorite_count,
    y = retweet_count,
    color = age,
    size = 5 * retweet_count + favorite_count,
    tooltip = tooltip
  ) +
  ggiraph::geom_point_interactive() +
  scale_color_viridis_c(option = "C", direction = -1) +
  scale_y_continuous(
    trans = scales::log1p_trans(),
    breaks = c(10, 25, 50, 100, 200, 400)
  ) +
  scale_x_continuous(
    trans = scales::log1p_trans(),
    breaks = c(10, 25, 50, 100, 200, 400, 800, 1600)
  ) +
  guides(size = "none") +
  labs(
    title = "Popular tweets",
    x = "Favorites →",
    y = "Retweets →",
    color = "Tweet age\nin years"
  ) +
  theme(
    legend.title = element_text(size = 12, vjust = 1),
    legend.position = c(1.0125, 1.08),
    legend.justification = c(1, 1)
  )
ggiraph::girafe(
  ggobj = plot_popular_tweets,
  width_svg = 12,
  height_svg = 8,
  options = list(
    ggiraph::opts_toolbar(position = "bottomright"),
    ggiraph::opts_tooltip(placement = "container"),
    ggiraph::opts_hover_inv("color:var(--borderColorCustom, #cfd5d8)")
  ),
  desc = knitr::opts_current$get("fig.alt")
)
Tweets by time of day
The next plot highlights the time of day at which I sent tweets. Each bar shows the total number of tweets I’ve written within a given hour of the day. Morning hours are in the top half of each day’s circular panel and evening hours are in the bottom half. Tuesday at noon seems to be my favorite time to tweet — I sent 120 tweets between 12pm and 1pm on Tuesday — followed by Friday at 1pm (111 tweets) or at 11am (110 tweets).
Hover or tap on a bar to compare a given time across all days.
Code: Plot Tweets by Time of Day
tweet_count_by_hour <-
  tidy_tweets |>
  count(day, hour) |>
  mutate(
    hour_label = case_when(
      hour == 12 ~ "12pm",
      hour == 0 ~ "12am",
      hour > 12 ~ paste0(hour - 12, "pm"),
      hour < 12 ~ paste0(hour, "am")
    ),
    pct = n / sum(n)
  )

tooltip_hour <- function(day, hour_label, ...) {
  this_hour_count <-
    tweet_count_by_hour |>
    filter(hour_label == !!hour_label)

  this_hour_total <- sum(this_hour_count$n)
  this_hour_pct <- scales::percent(this_hour_total / sum(tweet_count_by_hour$n), 0.1)
  this_hour_total <- trimws(format(this_hour_total, big.mark = ","))

  this_hour_days <-
    this_hour_count |>
    mutate(
      across(pct, scales::percent_format(0.1)),
      across(n, format, big.mark = ","),
      across(n, trimws),
      text = glue("{day}: {pct} ({n})"),
      text = if_else(day == !!day, glue("<strong>{text}</strong>"), text)
    ) |>
    glue_data("<li>{text}</li>") |>
    glue_collapse()

  glue::glue(
    "<p><strong>{hour_label}</strong><br><small>{this_hour_pct} of total ({this_hour_total})</small></p>",
    "<ul>{this_hour_days}</ul>"
  )
}
tweet_count_by_hour$tooltip <- pmap_chr(tweet_count_by_hour, tooltip_hour)
plot_time_of_day <-
  ggplot(tweet_count_by_hour) +
  aes(y = n, fill = day, x = hour, data_id = hour, tooltip = tooltip) +
  geom_area(
    data = function(d) {
      # Shade from midnight-6am and 6pm-midnight, kinda like geom_step_area()
      max_count <- max(d$n)
      tibble(
        day = sort(rep(unique(d$day), 6)),
        hour = rep(c(0, 6, 6.01, 18, 18.01, 24), 7),
        n = rep(c(max_count, max_count, 0, 0, max_count, max_count), 7),
        tooltip = ""
      )
    },
    fill = "#aaaaaa30"
  ) +
  ggiraph::geom_col_interactive(show.legend = FALSE, width = 1) +
  facet_wrap(vars(day), nrow = 2) +
  coord_polar(start = pi) +
  scale_x_continuous(
    breaks = seq(0, 23, 3),
    minor_breaks = 0:23,
    labels = c("12am", paste0(seq(3, 9, 3), "am"), "12pm", paste0(seq(3, 9, 3), "pm")),
    limits = c(0, 24),
    expand = expansion()
  ) +
  scale_y_continuous(expand = expansion(), breaks = seq(0, 100, 25)) +
  scale_fill_discrete() +
  labs(
    title = "When do I do my tweeting?",
    x = NULL,
    y = NULL
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.grid.major.y = element_blank()
  )
ggiraph::girafe(
  ggobj = plot_time_of_day,
  width_svg = 12,
  height_svg = 8,
  options = list(
    ggiraph::opts_hover_inv("filter: saturate(30%) brightness(125%)"),
    ggiraph::opts_hover(css = "opacity:1"),
    ggiraph::opts_tooltip(
      placement = "container",
      css = "width: 12rem; font-family: var(--font-monospace, 'IBM Plex Mono');",
      # These don't matter, position is set by CSS rules below
      offx = 600,
      offy = 260,
      use_cursor_pos = FALSE
    )
  ),
  desc = knitr::opts_current$get("fig.alt")
)
Tweet source
The tweet archive includes the application used to send the tweet, stored as the HTML that’s displayed in the tweet text:
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
With a little bit of regex, we can extract the tweet source. Apparently, I’ve used 37 different apps to write my tweets, but 17 were used for no more than 5 tweets. Most often — actually, 79% of the time — I wrote tweets from the web app or my phone.
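Here’s that regex applied to a single source string, so you can see both captured groups on their own (the sample string is the iPhone one shown above):

```r
src <- '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
rx  <- '<a href="([^"]+)"[^>]+>([^<]+)</a>'

# Group 1 captures the href; group 2 captures the link text
sub(rx, "\\1", src)  # "http://twitter.com/download/iphone"
sub(rx, "\\2", src)  # "Twitter for iPhone"
```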
Code: Plot Tweet Source
tweet_source <-
  tidy_tweets |>
  extract(
    source,
    into = c("source_href", "source"),
    regex = '<a href="([^"]+)"[^>]+>([^<]+)</a>'
  )

tweet_source_count <- tweet_source |>
  count(source) |>
  mutate(pct = n / sum(n))
plot_tweet_source <-
  tweet_source |>
  mutate(
    source = fct_lump_n(source, n = 15),
    source = fct_rev(fct_infreq(source))
  ) |>
  count(source, type, sort = TRUE) |>
  pivot_wider(names_from = type, values_from = n, values_fill = 0) |>
  mutate(
    total = reply + retweet + tweet,
    tooltip = pmap_chr(
      list(source, reply, retweet, tweet, total),
      function(source, reply, retweet, tweet, total) {
        x <- glue(
          '<label for="{tolower(label)}">{label}</label>',
          '<progress id="{tolower(label)}" max="{total}" value="{value}">{value}</progress>',
          label = c("Tweets", "Replies", "Retweets"),
          value = c(tweet, reply, retweet)
        )
        x <- glue_collapse(x)
        paste0('<p class="b">', source, "</p>", x)
      }
    )
  ) |>
  ggplot() +
  aes(x = total, y = source, tooltip = tooltip) +
  ggiraph::geom_col_interactive(show.legend = FALSE) +
  scale_x_continuous(expand = expansion(add = c(0, 0.01))) +
  scale_y_discrete(expand = expansion()) +
  labs(
    title = "What app did I use to tweet?",
    x = "Tweets →",
    y = NULL
  ) +
  theme(
    panel.grid.major.y = element_blank()
  )
ggiraph::girafe(
  ggobj = plot_tweet_source,
  width_svg = 10,
  height_svg = 8,
  options = list(
    ggiraph::opts_hover_inv("filter: saturate(30%) brightness(125%)"),
    ggiraph::opts_hover(css = "opacity:1"),
    ggiraph::opts_tooltip(
      placement = "container",
      css = "width: 15rem; font-family: var(--font-monospace, 'IBM Plex Mono');"
    )
  ),
  desc = knitr::opts_current$get("fig.alt")
)
My likes
One huge reason to go through the trouble of requesting and downloading your Twitter archive is to collect a copy of your liked tweets. (Sadly, your bookmarks are not a part of the archive.)
likes <-
  read_twitter_data(manifest, "like") |>
  simplify_twitter_data()

likes |>
  arrange(as.numeric(tweetId))
#> # A tibble: 11,773 × 3
#> tweetId fullText expan…¹
#> <chr> <chr> <chr>
#> 1 42240201359233024 We just went live with RStudio, a new IDE for R. Try it out and let us know what you thin… https:…
#> 2 169437879704092672 RT @DKThomp: Adulthood, Delayed: What the Recession Has Done to Millennials http://t.co/U… https:…
#> 3 338425212762738690 Joy! First sighting of NYC's CitiBikes in place! http://t.co/o6VPop4kWZ https:…
#> 4 343026917659791360 What a shitty day to announce my new data analytics project, Prism. https:…
#> 5 343037575889580033 Very cool: DoS using agent-based modeling to understand conflict dynamics in Niger Delta … https:…
#> 6 347931309496213504 BACK TO BACK CHAMPS!!! Going crazy all by myself in my little hotel room in Prague. Effin… https:…
#> 7 383071681079549953 The real reason lowering health care costs is hard: Every patient is unique http://t.co/j… https:…
#> 8 386857068423938048 Good work on study on admits/length of stay/ 'crowdedness' of ICU & impacts on morbid… https:…
#> 9 386862608231325696 I'm giving my presentation on scheduling medical residents at #informs2013 in the Doing G… https:…
#> 10 386935928042029056 Gustavo just presented a semicont opt that can B perfectly applied in my supply chain pro… https:…
#> # … with 11,763 more rows, and abbreviated variable name ¹expandedUrl
While the likes archive includes the full text of each tweet, we can use the lookup_tweets()
function from the rtweet package to download complete information about each tweet.
```r
likes_full <-
  rtweet::lookup_tweets(likes$tweetId) |>
  write_rds("data/likes.rds")
```
Getting all 11,773 tweets takes a few minutes, so I highly recommend saving the data to disk as soon as you’ve collected it.
```r
likes_full <- read_rds("data/likes.rds")
likes_full
```
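If you're iterating on this analysis, a small cache-or-fetch pattern keeps you from hitting the API again on every re-run. This is my suggested sketch, not part of the archive itself; it reuses the same `"data/likes.rds"` path as above and relies on `write_rds()` returning its input invisibly.

```r
library(readr)

likes_path <- "data/likes.rds"

likes_full <-
  if (file.exists(likes_path)) {
    # Reuse the saved copy rather than re-downloading
    read_rds(likes_path)
  } else {
    # Fetch once, then cache to disk on the way through the pipe
    rtweet::lookup_tweets(likes$tweetId) |>
      write_rds(likes_path)
  }
```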
#> # A tibble: 11,385 × 43
#> created_at id id_str full_…¹ trunc…² displ…³ entities source in_rep…⁴ in_re…⁵ in_re…⁶ in_re…⁷
#> <dttm> <dbl> <chr> <chr> <lgl> <dbl> <list> <chr> <dbl> <chr> <dbl> <chr>
#> 1 2022-11-04 20:35:35 1.59e18 15886914597… "Read … FALSE 278 <named list> "<a h… NA NA NA NA
#> 2 2022-11-05 10:33:18 1.59e18 15889022786… "heari… FALSE 190 <named list> "<a h… NA NA NA NA
#> 3 2022-11-05 01:35:38 1.59e18 15887669703… "Defen… FALSE 236 <named list> "<a h… NA NA NA NA
#> 4 2022-11-04 14:23:18 1.59e18 15885977742… "Pleas… FALSE 60 <named list> "<a h… NA NA NA NA
#> 5 2022-11-04 11:35:40 1.59e18 15885555857… "Despe… FALSE 114 <named list> "<a h… NA NA NA NA
#> 6 2022-11-04 16:16:08 1.59e18 15886261683… "Here'… FALSE 269 <named list> "<a h… NA NA NA NA
#> 7 2022-11-04 11:46:29 1.59e18 15885583081… "@tjma… FALSE 39 <named list> "<a h… 1.59e18 158855… 1.29e9 128991…
#> 8 2022-11-03 10:22:11 1.59e18 15881747079… "https… FALSE 0 <named list> "<a h… NA NA NA NA
#> 9 2022-11-04 09:48:27 1.59e18 15885286045… "One o… FALSE 188 <named list> "<a h… NA NA NA NA
#> 10 2022-11-04 13:54:39 1.59e18 15885905619… "We’ve… FALSE 141 <named list> "<a h… NA NA NA NA
#> # … with 11,375 more rows, 31 more variables: in_reply_to_screen_name <chr>, geo <list>, coordinates <list>,
#> # place <list>, contributors <lgl>, is_quote_status <lgl>, retweet_count <int>, favorite_count <int>,
#> # favorited <lgl>, retweeted <lgl>, lang <chr>, possibly_sensitive <lgl>, quoted_status_id <dbl>,
#> # quoted_status_id_str <chr>, quoted_status_permalink <list>, quoted_status <list>, text <chr>, favorited_by <lgl>,
#> # scopes <list>, display_text_width <lgl>, retweeted_status <lgl>, quote_count <lgl>, timestamp_ms <lgl>,
#> # reply_count <lgl>, filter_level <lgl>, metadata <lgl>, query <lgl>, withheld_scope <lgl>, withheld_copyright <lgl>,
#> # withheld_in_countries <lgl>, possibly_sensitive_appealable <lgl>, and abbreviated variable names ¹full_text, …
Assuming I liked a tweet in the same year it was written (reasonable but not entirely accurate), plotting the source year of the tweet highlights just how much my Twitter usage picked up in 2018.
Code: Plot Total Likes
```r
plot_liked_tweets <-
  likes_full |>
  count(year = year(created_at)) |>
  mutate(
    noun = map_chr(n, \(n) plu::ral("tweet", n = n)),
    tooltip = paste(format(n, big.mark = ","), "liked", noun, "in", year)
  ) |>
  ggplot() +
  aes(year, n, tooltip = tooltip, group = 1) +
  geom_line(color = "#595959", linewidth = 1.5) +
  ggiraph::geom_point_interactive(color = "#595959", size = 7) +
  scale_x_continuous(breaks = seq(2008, 2022, 2), expand = expansion(add = 0.25)) +
  labs(
    title = "Tweets I've Liked",
    x = "Year →",
    y = "Liked Tweets →"
  )
```
```r
ggiraph::girafe(
  ggobj = plot_liked_tweets,
  width_svg = 12,
  height_svg = 4,
  options = list(ggiraph::opts_tooltip()),
  desc = knitr::opts_current$get("fig.alt")
)
```
Advertising info
The last thing I want to dive into is the part of the archive that captures Twitter's perception of you, or, more precisely, how Twitter sees you in terms of advertising.
Impressions and engagements
There are two key items in the archive: ad impressions and engagements. All ads on Twitter are actually tweets that are promoted into your view because an advertiser has paid for Twitter to show you a tweet you wouldn’t otherwise see.
An impression is a promoted tweet you see in your timeline or in tweet replies, but you don’t interact with the tweet. An engagement is a tweet that you click on or interact with in some way. The definitions (included in the details below) are hazy — I’m fairly certain from looking at my data that some tweets are “engaged with” simply by being visible on my screen for a longer period of time. (In other words, I’m certain I haven’t actively engaged with as many tweets as are highlighted below.)
The ads data items are imported separately and have a pretty wild nested structure. Tidying them took a lot of tidyr::unnest() and my newest favorite function, tidyr::unnest_wider().
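If you haven't met unnest_wider() before, here's a toy example before the real thing. The nested structure below is made up, but loosely shaped like the ads JSON: each call turns the elements of a list-column into their own columns.

```r
library(tibble)
library(tidyr)

# A hypothetical nested record, one row with a list-column
nested <- tibble(
  adsUserData = list(
    list(deviceInfo = list(osType = "Ios", deviceType = "iPhone12,1"))
  )
)

flat <- nested |>
  unnest_wider(adsUserData) |>  # exposes a deviceInfo list-column
  unnest_wider(deviceInfo)      # then spreads it into osType and deviceType
flat
# A 1-row tibble with columns osType and deviceType
```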
Code: ad_impressions
ad-impressions.js
- ad: Promoted Tweets the account has viewed and any associated metadata.
  - deviceInfo: Information about the device where the impression was viewed such as its ID and operating system.
  - displayLocation: Location where the ad was viewed on Twitter.
  - promotedTweetInfo: Information about the associated tweet such as unique identifier, text, URLs and media when applicable.
  - advertiserInfo: Advertiser name and screen name.
  - matchedTargetingCriteria: Targeting criteria that were used to run the campaign.
  - impressionTime: Date and time when the ad was viewed.
```r
ad_impressions <-
  read_twitter_data(manifest, "adImpressions") |>
  simplify_twitter_data() |>
  unnest(adsUserData) |>
  unnest(adsUserData) |>
  unnest_wider(adsUserData) |>
  unnest_wider(c(deviceInfo, promotedTweetInfo, advertiserInfo)) |>
  mutate(
    matchedTargetingCriteria = map(matchedTargetingCriteria, map_dfr, identity),
    across(impressionTime, ymd_hms)
  )
```
Code: ad_engagements
ad-engagements.js
- ad: Promoted Tweets the account has engaged with and any associated metadata.
- engagementAttributes: Type of engagement as well as date and time when it occurred.
```r
ad_engagements <-
  read_twitter_data(manifest, "adEngagements") |>
  simplify_twitter_data() |>
  unnest(adsUserData) |>
  unnest(adsUserData) |>
  unnest_wider(adsUserData) |>
  mutate(across(engagementAttributes, map, map_dfr, identity)) |>
  unnest_wider(impressionAttributes) |>
  # now the same as the impressions
  unnest_wider(c(deviceInfo, promotedTweetInfo, advertiserInfo)) |>
  mutate(
    matchedTargetingCriteria = map(matchedTargetingCriteria, map_dfr, identity),
    across(impressionTime, ymd_hms)
  )
```
Once you have the impressions and engagements tables, you can combine them with purrr::list_rbind().
```r
ads <-
  list(
    impression = ad_impressions,
    engagement = ad_engagements
  ) |>
  list_rbind(names_to = "type")

ads
```
#> # A tibble: 8,599 × 16
#> type osType devic…¹ devic…² displ…³ tweetId tweet…⁴ urls media…⁵ adver…⁶ scree…⁷ matche…⁸ impressionTime
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <list> <list> <chr> <chr> <list> <dttm>
#> 1 impression Ios 2eVW2/… iPhone… Timeli… 154910… "When … <NULL> <NULL> Chevro… @chevr… <tibble> 2022-08-08 14:03:09
#> 2 impression Ios 2eVW2/… iPhone… Timeli… 155557… "Tune … <chr> <NULL> Walmart @Walma… <tibble> 2022-08-08 14:08:07
#> 3 impression Ios 2eVW2/… iPhone… Timeli… 155465… "Meet … <NULL> <NULL> Anker @Anker… <tibble> 2022-08-08 14:04:17
#> 4 impression Ios 2eVW2/… iPhone… Timeli… 155598… "This … <NULL> <chr> KESIMP… @KESIM… <tibble> 2022-08-08 10:54:28
#> 5 impression Ios 2eVW2/… iPhone… TweetC… 155507… "🎁🎁G… <NULL> <NULL> Webull @Webul… <tibble> 2022-08-08 10:57:56
#> 6 impression Ios 2eVW2/… iPhone… Timeli… 151472… "#1 is… <NULL> <NULL> Financ… @finan… <tibble> 2022-08-08 10:56:19
#> 7 impression Ios 2eVW2/… iPhone… TweetC… 155507… "🎁🎁G… <NULL> <NULL> Webull @Webul… <tibble> 2022-08-08 10:56:57
#> 8 impression Ios 2eVW2/… iPhone… Timeli… 155481… "Mick.… <NULL> <NULL> EPIX i… @EPIXHD <tibble> 2022-08-08 03:33:23
#> 9 impression Ios 2eVW2/… iPhone… Timeli… 155407… "Watch… <NULL> <NULL> Paper … @HowLi… <tibble> 2022-08-08 03:34:51
#> 10 impression Ios 2eVW2/… iPhone… Timeli… 155512… "Wreck… <NULL> <NULL> Mill G… @Mill_… <tibble> 2022-08-08 03:25:08
#> # … with 8,589 more rows, 3 more variables: publisherInfo <list>, promotedTrendInfo <list>,
#> # engagementAttributes <list>, and abbreviated variable names ¹deviceId, ²deviceType, ³displayLocation, ⁴tweetText,
#> # ⁵mediaUrls, ⁶advertiserName, ⁷screenName, ⁸matchedTargetingCriteria
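Before plotting by month, a quick sketch (my addition, using the `ads` table built above) gives the overall split between impressions and engagements:

```r
library(dplyr)

# Share of promoted tweets merely seen vs. "engaged with"
ads |>
  count(type) |>
  mutate(pct = n / sum(n))
```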
The downside of the ads data is that it only includes the last three-ish months. Here are my impressions and engagements for August through early November of 2022.
Code: Plot Interactions by Month
```r
plot_ads_interactions <-
  ads |>
  count(type, month = floor_date(impressionTime, "month")) |>
  mutate(
    n_str = format(n, big.mark = ","),
    tooltip = pmap_chr(
      list(type, n, n_str, month),
      \(type, n, n_str, month) {
        glue(
          "{n_str} {type} in {month}",
          type = plu::ral(type, n = n),
          month = month(month, label = TRUE, abbr = FALSE)
        )
      }
    )
  ) |>
  ggplot() +
  aes(month, n, fill = type, tooltip = tooltip) +
  ggiraph::geom_col_interactive() +
  scale_fill_manual(
    values = c("#97c4ca", "#1c7d8b"),
    labels = c("Engagement", "Impression")
  ) +
  labs(
    title = "Ad Interactions by Month",
    x = NULL,
    y = "Promoted Tweets →",
    fill = NULL
  ) +
  theme(
    panel.grid.major.x = element_blank(),
    legend.direction = "vertical",
    legend.position = c(0.95, 0.9),
    legend.justification = c(1, 1)
  )
```
```r
ggiraph::girafe(
  ggobj = plot_ads_interactions,
  width_svg = 12,
  height_svg = 6,
  options = list(ggiraph::opts_tooltip()),
  desc = knitr::opts_current$get("fig.alt")
)
```
Who advertised to me?
Finally, I wanted to know who was advertising to me and which tweets I was seeing. The advertising data includes demographics and keywords used by the advertisers to target you, and I recommend taking a look at that. But I’m running out of steam in this post, so let’s just take a look at the promoted content I saw on Twitter over the last few months.
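If you do want to poke at the targeting data, here's a hedged starting point. In my archive, each `matchedTargetingCriteria` tibble has `targetingType` and `targetingValue` columns; assuming yours does too, you can tally how advertisers targeted you:

```r
library(dplyr)
library(tidyr)

# Assumes targetingType exists in the matchedTargetingCriteria tibbles;
# check names(ads$matchedTargetingCriteria[[1]]) first
ads |>
  select(advertiserName, matchedTargetingCriteria) |>
  unnest(matchedTargetingCriteria) |>
  count(targetingType, sort = TRUE)
```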
Code: Plot Ad Interactions by Advertiser
```r
ads_advertiser_counts <-
  ads |>
  count(advertiserName, type, sort = TRUE) |>
  # values_fill = 0 so advertisers with only one interaction type aren't
  # dropped as NA by slice_max() below
  pivot_wider(names_from = type, values_from = n, values_fill = 0) |>
  slice_max(n = 25, engagement + impression) |>
  pivot_longer(-1, names_to = "type")

ads_tweet_examples <-
  ads |>
  filter(!is.na(tweetText)) |>
  semi_join(ads_advertiser_counts) |>
  group_by(advertiserName, type) |>
  mutate(tweetText = str_trunc(tweetText, width = 80)) |>
  summarize(
    n = n(),
    tweets = glue_collapse(glue(
      "<li>{sample(unique(tweetText), min(5, length(unique(tweetText))))}</li>"
    )),
    .groups = "drop"
  ) |>
  mutate(
    tweets = glue('<ul>{tweets}</ul>'),
    tweets = glue(
      '<p><strong>{n}</strong> promoted tweet ',
      '<strong>{type}s</strong> ',
      'by <strong>{advertiserName}</strong></p>',
      '{tweets}'
    )
  )
```
```r
plot_advertisers <-
  ads_advertiser_counts |>
  left_join(ads_tweet_examples) |>
  mutate(advertiserName = fct_reorder(advertiserName, value, sum)) |>
  ggplot() +
  aes(value, advertiserName, fill = type, tooltip = tweets) +
  ggiraph::geom_col_interactive() +
  scale_x_continuous(expand = expansion()) +
  scale_fill_manual(
    values = c("#97c4ca", "#1c7d8b"),
    labels = c("Engagement", "Impression")
  ) +
  labs(
    title = "Ad Interactions by Advertiser",
    x = "Interactions with Promoted Tweets →",
    y = NULL,
    fill = NULL
  ) +
  theme(
    panel.grid.major.y = element_blank(),
    legend.direction = "vertical",
    legend.position = c(0.99, 0.1),
    legend.justification = c(1, 0)
  )
```
```r
ggiraph::girafe(
  ggobj = plot_advertisers,
  width_svg = 12,
  height_svg = 10,
  options = list(ggiraph::opts_tooltip()),
  desc = knitr::opts_current$get("fig.alt")
)
```
Footnotes
simplify_twitter_data() is an optional and separate function because it's an 80/20 function: it's 20% of the code that does the right thing 80% of the time.↩︎

On mobile devices, tapping on a bar kind of works. But to change focus from one plot element to another, you might need to tap outside of the plot area before tapping on the new element. Sorry! The hover interactions work a whole lot better on desktop.↩︎