An informal poll about experiences with programming languages has been making the rounds on Twitter this week. It all started with this tweet from @cotufa82:
1. First language: Basic / Java
2. Had difficulties: Java
3. Most used: JavaScript / Python
4. Totally hate: Java
5. Most loved: Go / Python
6. For beginners: Python / Ruby
What about you? — Super Di (@cotufa82) October 3, 2019
The tweet caught on within a few days and there are now more than 16,840 replies and quote tweets from developers and programmers sharing their own experiences.
My interest in the poll was piqued by another tweet by @edsu sharing a Jupyter notebook analyzing the tweeted responses. I thought it would be interesting to do a similar analysis using R, initially thinking I could compare the R and Python versions.
What I should have done is use both R and Python (because they’re friends and language wars are silly), but instead I ended up going down the endless rabbit hole of regular expressions and free-form informal survey results.
Gather the Tweets
I gathered all tweets containing "first language", "most used", and "most loved" using the excellent rtweet package by Mike Kearney.
tweets <- rtweet::search_tweets(
  '"first language" AND "most used" AND "most loved"',
  n = 18000,
  include_rts = FALSE
)
You can download a CSV with the processed tweets. The .csv doesn’t include the full tweet data, but it does include status_id so that you can recover the tweet data with rtweet::lookup_statuses().
Whose Tweets Were The Most Popular?
There were 16,840 responses to the poll and 89% or 15,025 of them are replies to or quotes of another tweet. Here are the top contributors to the popularity of the poll, in the form of the top 10 recipients of a reply or quote tweet.
Our Experience with Programming Languages
Let’s dive into the results. If you’re interested in taking a peek behind the regular expressions curtain, I’ve included a code walkthrough below.
The original tweet asked for six categories: First language, Had difficulties, Most used, Totally hate, Most loved, For beginners. Replies to this tweet were… creative. The category names and formatting were typed by hand, so they varied freely and were prone to spelling errors and permutations.
To get the broadest range of answers possible, I used flexible regular expressions to accept a variety of formatting choices, and I also widened the categories to encompass the same core themes. For example, first love, secret love, and mostly loved were all added to the Most loved category, which I called, simply, love.
I also captured multiple programming languages in each category (even the original tweet had multiple answers for first language (Basic/Java) and a few other categories).
Each of the following plots shows the top 20 responses in each category.
First Language vs. Recommended First Language
How do the first languages learned by programmers compare to the languages they would recommend to others to learn first? Many people started with older languages like Basic, C, Pascal, C++ and Java but would recommend new programmers start with Python, JavaScript, Ruby and also C or Java.
Love It or Hate It
Which programming languages are loved and which languages are not? The world seems to have a love/hate relationship with JavaScript, but Python is much more loved than hated. Likewise Swift, Ruby, and Go are significantly more positive than negative, C++ is also a bit love/hate, and PHP certainly isn’t feeling the love.
Most Used or Had Difficulties
Which languages are most used compared with those that have caused difficulties? JavaScript is eating the world, and plenty of people are using workhorse languages like Python, Java and C#/C++. (And quite a few are using PHP, presumably because they have to.) Still, JavaScript’s love/hate relationship continues as many people indicated that it caused them problems. I’m not surprised to see C++, C, and Java on the had difficulties list. Interestingly, Haskell shows up in the loved list but seems to also be tricky to learn.
Feelings about #rstats
How do developers feel about my favorite language? R isn’t a typical first language, but it is among the top 20 recommended to new programmers to learn first. It’s also the 12th most used language.
Category | Rank | Total |
---|---|---|
most used | 12 | 1456 |
love | 15 | 2067 |
had difficulties | 19 | 2092 |
hate | 16 | 2641 |
beginner | 17 | 2296 |
first language | 28 | 1508 |
curious | 15 | 207 |
currently | 2 | 63 |
next | 3 | 50 |
honerable mention | 8 | 98 |
chronology | 25 | 29 |
also used, eager to learn, frenemy, never studied, on my list, to learn, totally meh, willing to learn |
Code Walkthrough
At a high level, the process for cleaning and standardizing the tweet responses looks like this. I abstracted some of the larger steps in the pipeline into separate functions.
Pre-clean the tweet text, including remove_unused_text()
Separate tweets so that each line or item of the tweet is in its own row using tidyr::separate_rows()
- Items are indicated by N., N), N:, or N-, or just appear on a new line without numbering.
Remove whitespace and any numbering from each line
Separate each line into a question category and answer pair by splitting on : using tidyr::separate()
Filter out empty answers and convert everything to lower case
Use a set of regular expressions to process_answer() into individual languages
Use more regular expressions to recode_answer() and recode_category(), fixing spelling mistakes and combining overlapping groups
Count the number of replies mentioning each programming language by category
The whole pipeline is summarized below, including the function to plot response counts by category.
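The splitting steps can be tried on a single made-up reply before reading the full functions. This is only a minimal base R sketch, not the pipeline itself: the tweet text is invented for illustration, and it splits each item at the first colon, whereas the real pipeline relies on tidyr::separate_rows() and tidyr::separate().

```r
# Hypothetical reply text, invented for illustration
tweet <- "1. First language: Basic\n2. Most used: Python / R\n3) Most loved: R"

# Split into items on newlines or on markers such as `1.` or `3)`
items <- strsplit(tweet, "\n|\\d\\s*[.):-]")[[1]]

# Trim whitespace and drop the empty pieces left behind by the markers
items <- trimws(items)
items <- items[nzchar(items)]

# Split each item into a category/answer pair at the first colon
parts <- regmatches(items, regexpr(":", items, fixed = TRUE), invert = TRUE)

pairs <- data.frame(
  category = tolower(trimws(vapply(parts, function(p) p[1], character(1)))),
  answer = tolower(trimws(vapply(parts, function(p) p[2], character(1)))),
  stringsAsFactors = FALSE
)

print(pairs)
```

Each row of pairs now holds one lower-cased category and its raw answer, which is the shape the later answer-processing steps expect.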
Remove Unused Text
This little function removes usernames (@user), URLs, parenthetical comments, and turns #hashtag into hashtag because many people specified their choices using language hashtags, like #rstats instead of r.
library(dplyr)   # provides %>%, recode(), if_else(), case_when()
library(stringr) # provides the str_*() helpers

remove_unused_text <- function(text) {
  text %>%
    # strip usernames
    str_remove_all("@\\w+\\s*") %>%
    # strip URLs
    str_remove_all("\\s*http[^ ]+\\s*") %>%
    # remove parentheticals
    str_remove_all("\\s*\\(.+?\\)( |\n|$)") %>%
    # replace "#hashtag" with "hashtag"
    str_replace_all("#(\\w)", "\\1")
}
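To see what these four patterns do, here is a small sketch that applies the same regular expressions with base R's gsub() to a made-up tweet. The sample text is an assumption for illustration, and perl = TRUE is used so the lazy quantifier in the parenthetical pattern behaves as it does in stringr.

```r
# Hypothetical tweet text, invented for illustration
raw <- "@cotufa82 Most loved: #rstats (obviously) https://example.com/x"

cleaned <- gsub("@\\w+\\s*", "", raw, perl = TRUE)                      # strip usernames
cleaned <- gsub("\\s*http[^ ]+\\s*", "", cleaned, perl = TRUE)          # strip URLs
cleaned <- gsub("\\s*\\(.+?\\)( |\n|$)", "", cleaned, perl = TRUE)      # remove parentheticals
cleaned <- gsub("#(\\w)", "\\1", cleaned, perl = TRUE)                  # "#hashtag" -> "hashtag"

print(cleaned)
# [1] "Most loved: rstats"
```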
Process Answer
The goal in processing the answers is to transform each answer to a single string of comma separated languages. In doing this, common variations of language lists should result in the same final answers. For example, Python and R, Python/R, and Python or R should all be handled similarly. To help with this process I created a list of common languages that frequently appear in the answers.
common_langs <- c(
  # c, c#, c++, and .net are manually included later
  "css", "html", "python", "javascript", "x86", "java", "ruby", "pascal", "php",
  "matlab", "perl", "fortran", "logo", "actionscript", "lua", "assembly",
  "delphi", "js", "scheme", "scratch", "go", "typescript", "clojure", "elixr",
  "kotlin", "ocaml", "rust", "mathematica", "matlab", "dart", "flutter", "groovy",
  "flash", "bash", "shell", "sql", "haskell", "lisp", "scala", "sas",
  "rstats", "golang"
)
Then, with a bit of regex kung fu, the responses are converted from Python and R to python,r.
process_answer <- function(answer, common_langs) {
  answer %>%
    # Aggressively remove unusual characters
    str_replace_all("[^\\w\\d#+., ]", " ") %>%
    # Remove leading character if it's a `,`
    str_replace_all("^,", " ") %>%
    # Remove `.` at end of string
    str_remove_all("[.]$") %>%
    # Replace and, or with space (prep for next step)
    str_replace_all("\\b(and|or|also|amp)\\b", " ") %>%
    # Remove qualifiers
    str_remove_all("\\b(maybe|now)\\b") %>%
    # Multiple languages may be listed separated by spaces, if so add comma
    str_replace_all(
      pattern = paste0("\\b(", paste(common_langs, collapse = "|"), ")\\b\\s*"),
      replacement = "\\1,"
    ) %>%
    # Collapse versioned names such as c++11 to c++
    gsub("c\\+\\+\\d+", "c++", .) %>%
    # Comma separate languages that are tough to regex
    gsub("c ", "c,", ., fixed = TRUE) %>%
    gsub(".net ", ".net,", ., fixed = TRUE) %>%
    gsub("c# ", "c#,", ., fixed = TRUE) %>%
    gsub("c++ ", "c++,", ., fixed = TRUE) %>%
    # No trailing punctuation
    str_remove("[.,!?/=<>;:]+$")
}
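The core of the function is the step that appends a comma after every known language name. The sketch below replays just that idea in base R on the three variations mentioned earlier. It is a simplified illustration rather than a replacement for process_answer(): the short langs vector is an assumed stand-in for common_langs, and only four of the substitutions are included.

```r
# Assumed, shortened stand-in for the common_langs vector
langs <- c("python", "javascript", "java", "ruby")
lang_pattern <- paste0("\\b(", paste(langs, collapse = "|"), ")\\b\\s*")

# Already lower-cased, as in the pipeline
answers <- c("python and r", "python/r", "python or r")

# Replace unusual characters (including `/`) with a space
answers <- gsub("[^\\w\\d#+., ]", " ", answers, perl = TRUE)
# Replace connecting words with a space
answers <- gsub("\\b(and|or|also|amp)\\b", " ", answers, perl = TRUE)
# Append a comma after each known language, swallowing trailing whitespace
answers <- gsub(lang_pattern, "\\1,", answers, perl = TRUE)
# Drop any trailing punctuation
answers <- sub("[.,!?/=<>;:]+$", "", answers)

print(answers)
# [1] "python,r" "python,r" "python,r"
```

Because the pattern consumes the whitespace after each match, the three spellings collapse to the same comma separated string, which can then be split into one language per row.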
Recode Answer
There are a number of programming languages that have multiple variants or are commonly referred to by shorthand names, such as rstats for R or golang for go. This function recodes the programming language answers that I noticed while working with the data (but it’s admittedly not complete).
recode_answer <- function(answer) {
  # Recode Basic Variants
  answer <- recode(answer, "vb" = "visual basic")
  answer <- if_else(str_detect(answer, "visual.*basic"), "visual basic", answer)
  answer <- if_else(str_detect(answer, "q.*basic"), "qbasic", answer)
  answer <- if_else(str_detect(answer, "gw.*basic"), "gw basic", answer)
  answer <- if_else(str_detect(answer, "(?<!(visual|q|gw)\\s?)basic"), "basic", answer)
  # Recode Pascal variants
  answer <- if_else(str_detect(answer, "pascal"), "pascal", answer)
  # Recode js vs Javascript
  answer <- recode(answer, "js" = "javascript")
  # Recode golang to go
  answer <- recode(answer, "golang" = "go")
  # Recode rstats as r
  recode(answer, "rstats" = "r")
}
Recode Category
As you might imagine with a free-form survey where users manually enter both the question and the answer, there is a large amount of variation in the spelling and categories used.
I broadly grouped many of the variations into common themes, primarily working to fit the original prompt. There are many, many interesting created categories, like best dead language, didn't spark joy, or latest crush. Here are two additional categories that I created, curious and interesting.
recode_category <- function(category) {
  case_when(
    str_detect(category, "first.+lang(uage)?|firstlanguage") ~ "first language",
    str_detect(category, "^first$") ~ "first language",
    str_detect(category, "b(e|i)ginn?e|new dev|newb|starter|noob|brginners|begginners|begginers") ~ "beginner",
    str_detect(category, "want|would|wish|wanna|curious|desire|(like.+learn)|curios|(like to try)") ~ "curious",
    str_detect(category, "m[ou]st?(ly)? ?used?") ~ "most used",
    str_detect(category, "diff?.+c.+lt|diificulties|difficulies|difficuties|difficulities") ~ "had difficulties",
    str_detect(category, "love") ~ "love",
    str_detect(category, "hate|dislike|avoid|(don.?t.+like)") ~ "hate",
    str_detect(category, "promis|interest|exotic|esoter|(most excited)|(weird)") ~ "interesting",
    str_detect(category, "honou?rable mention") ~ "honerable mention",
    str_detect(category, "next|need to learn") ~ "next",
    str_detect(category, "others used|other lang|dabbl") ~ "others used",
    str_detect(category, "current") ~ "currently",
    TRUE ~ category
  )
}
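One detail worth keeping in mind is that case_when() returns the value for the first condition that matches, so the order of the rules decides where an ambiguous category lands. The base R sketch below mimics that first-match-wins behavior with a hypothetical helper, recode_first_match(), and a deliberately tiny, assumed rule set. It is an illustration of the ordering effect, not the function used for the results.

```r
# Assumed, abbreviated rule set: names are the labels, values are the patterns
rules <- c(
  "curious" = "want|would|wish",
  "love" = "love",
  "hate" = "hate|dislike"
)

# Hypothetical helper: assign the label of the first rule that matches
recode_first_match <- function(category, rules) {
  out <- category
  matched <- rep(FALSE, length(category))
  for (label in names(rules)) {
    hit <- !matched & grepl(rules[[label]], category)
    out[hit] <- label
    matched <- matched | hit
  }
  out
}

recoded <- recode_first_match(
  c("most loved", "would love to learn", "totally hate", "latest crush"),
  rules
)

print(recoded)
# [1] "love"         "curious"      "hate"         "latest crush"
```

Here "would love to learn" becomes curious rather than love because the curious rule is checked first, and a category that matches no rule is passed through unchanged.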
Poll Processing Pipeline
Finally, here is the full pipeline to go from raw tweets to poll results.
library(tidyr) # provides separate_rows() and separate()

tweets_lang_poll <-
  tweets %>%
  select(status_id, created_at, user_id, screen_name, text) %>%
  # Remove tweets with "English" because that's probably a different thread
  filter(!str_detect(text, "[eE]nglish")) %>%
  mutate(
    # Backup original tweet text
    text_og = text,
    # Remove unused text from tweets
    text = remove_unused_text(text)
  ) %>%
  # Split text into question/answer pairs,
  # splitting on newline or one of: `N.`, `N)`, `N:`, or `N-`
  separate_rows(text, sep = "\n|\\d\\s*[.):-]") %>%
  # Remove whitespace and `N.` numbers from start of text
  mutate(text = str_remove_all(text, "^\\s*(\\d[.):-])?\\s*")) %>%
  # Separate question/answer into category, answer columns, splitting on colon `:`
  separate(
    col = text,
    into = c("category", "answer"),
    sep = "\\s*:\\s*",
    remove = FALSE
  ) %>%
  # Remove nothing answers or answers without any letters
  filter(
    !is.na(answer),
    str_detect(answer, "[[:alnum:]]")
  ) %>%
  # Re-encode category, answer as UTF-8 (:shrug:) and lowercase
  mutate_at(vars(category, answer), stringi::stri_enc_toutf8) %>%
  mutate_at(vars(category, answer), tolower) %>%
  # Category: Remove leading non-alpha characters and squish whitespace
  mutate(
    category = str_remove(category, "^[^[:alpha:]]+"),
    category = str_squish(category)
  ) %>%
  # Process answer as well as we can programmatically
  mutate(answer = process_answer(answer, common_langs)) %>%
  # Separate into one language per row
  separate_rows(answer, sep = "\\s*[,/]\\s*") %>%
  # Squish the strings
  mutate_at(vars(answer), str_squish) %>%
  mutate(
    answer = recode_answer(answer),
    category2 = recode_category(category)
  ) %>%
  # Filter out empty category, answer fields
  filter(!str_detect(answer, "^\\s*$")) %>%
  filter(
    nchar(answer) > 0,
    nchar(category) > 4
  )
And then to aggregate and count programming language mentions per category.
tweets_lang_counted <-
  tweets_lang_poll %>%
  count(category2, answer, sort = TRUE)
Plot Language Counts by Category
Last, but not least, this function creates the plots for requested categories. One key detail is that bars are ordered within each facet using tidytext’s reorder_within() function. Check out Julia Silge’s excellent blog post on this function: Reordering and facetting for ggplot2.
While the bars are ordered in descending order, I wanted the bar fill color to be consistent across facets to facilitate comparison between the two categories. The color palette is ocean.deep from the pals package, which I found by looking through Emil Hvitfeldt’s Comprehensive list of color palettes in R.
library(ggplot2)
library(forcats) # provides fct_reorder()

plot_tweets_by_category <- function(
  tweets_lang_counted,
  categories,
  ncol = 2,
  min_count = 10
) {
  tweets_lang_counted %>%
    filter(category2 %in% categories) %>%
    mutate_at(vars(category2), factor, levels = categories) %>%
    group_by(category2) %>%
    arrange(desc(n)) %>%
    filter(n >= min_count) %>%
    top_n(20, n) %>%
    ungroup() %>%
    arrange(category2, answer, desc(n)) %>%
    mutate(
      # Order bars within each facet
      answer_within = tidytext::reorder_within(answer, n, category2),
      # Keep a single ordering of languages so fill colors match across facets
      answer = fct_reorder(answer, n, first)
    ) %>%
    ggplot() +
    aes(answer_within, n, fill = answer) +
    geom_col() +
    coord_flip() +
    tidytext::scale_x_reordered(expand = c(0, 0)) +
    discrete_scale("fill", "ocean", function(n) rev(pals::ocean.deep(n + 10))[6:(n+5)]) +
    guides(fill = "none") +
    labs(x = NULL, y = NULL) +
    facet_wrap(~ category2, scales = "free", ncol = ncol) +
    theme_minimal(base_family = "PT Sans", base_size = 18) +
    theme(
      plot.margin = margin(20, 20, 20, 20),
      panel.grid.major.y = element_blank(),
      panel.grid.minor.x = element_blank(),
      axis.ticks.y = element_blank(),
      axis.text.x = element_text(family = "PT Sans Narrow"),
      axis.text.y.left = element_text(margin = margin()),
      panel.spacing.x = unit(3, "line"),
      panel.spacing.y = unit(2, "line")
    )
}
What About You?
If you made it this far, share your programming experiences on Twitter!
Thanks for reading and feel free to share feedback, thoughts, or questions with me on Twitter at @grrrck.