Details for twitter-archive-text-mining-R.ipynb

Published by gedankenstuecke

Description

This notebook uses the programming language R and data from a Twitter archive to perform a simple sentiment analysis and explore how your sentiment changes over time.

0

Tags & Data Sources

sentiment analysis sentiment Twitter Archive Analyzer

Comments

Please log in to comment.

Notebook
Last updated 5 years, 9 months ago

Twitter archive analysis in R

This performs some twitter archive analysis that is based on this chapter of Text Mining with R. To use this notebook you need to have uploaded a Twitter archive into your Open Humans account through http://twarxiv.org. Initially, what you'll see is data I've supplied, and as you run each command, it'll be replaced with your data.

For a start let's install/load all the required packages again:

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date


Attaching package: ‘dplyr’

The following objects are masked from ‘package:lubridate’:

    intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Let's now get our access token and request our personal user object that will contain all of our data file downloads etc.:

If you want to look at the data sources you have on Open Humans, uncomment the line below by removing the #:

Let's now find the download URL for the Twitter archive from all files:

We can now create a temporary file that will contain the whole zipped Twitter archivee, from this we can then unzip and read the tweets.csv file:

Parsed with column specification:
cols(
  tweet_id = col_double(),
  in_reply_to_status_id = col_double(),
  in_reply_to_user_id = col_double(),
  timestamp = col_character(),
  source = col_character(),
  text = col_character(),
  retweeted_status_id = col_double(),
  retweeted_status_user_id = col_double(),
  retweeted_status_timestamp = col_character(),
  expanded_urls = col_character()
)

Now let's convert the timestamps into a proper format and plot a simple histogram of tweets over time:

We can now 'tokenize' (that is, break up into words) the tweet texts, which will make it easier to work with them. This also allows us to easily calculate word frequencies in the next step. This stage also passes words through a stopwords filter. Stopwords are those words which are frequently used (such as 'the', 'we', 'and', and 'I') but provide very little information, and it's common to filter them out during a textual analysis. Stopwords are language dependent, so you may want to change the language default. You can read more about the stopwords function here: https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/stopwords

So, what are the top words that you've used? Mine are the German equivalents of I, all the articles (der, die, das) along with not and is.

wordnfreq
ich 17749 0.02300768
die 12838 0.01664165
das 12081 0.01566036
der 9146 0.01185578
nicht 8669 0.01123746
ist 8501 0.01101968

Let's now group the tweets by being "old" (pre-2013) and more recent ones (2013 or newer) to see whether the topics you tweet about have changed. After grouping you can calulcate the word-ratios and look at the most unusual and the most recent words:

You can adapt the cutoff above easily to your own needs. Just replace 2013-01-01 with your own date in the YYYY-MM-DD format.

wordpasttodaylogratio
option 8.071251e-05 8.071286e-05 -4.282714e-06
backup 4.447424e-05 4.439207e-05 1.849286e-03
ern 4.447424e-05 4.439207e-05 1.849286e-03
gefallen 7.247654e-05 7.264157e-05 -2.274431e-03
timeline 1.087148e-04 1.089624e-04 -2.274431e-03
wieviel 3.623827e-05 3.632079e-05 -2.274431e-03
haha 2.091936e-04 2.098534e-04 -3.148939e-03
extra 5.271021e-05 5.246336e-05 4.694238e-03
spam 1.054204e-04 1.049267e-04 4.694238e-03
videos 6.424057e-05 6.457028e-05 -5.119383e-03
lasse 6.918215e-05 6.860593e-05 8.363967e-03
bioinformatiker2.800230e-05 2.824950e-05 -8.789112e-03
fliegen 2.800230e-05 2.824950e-05 -8.789112e-03
hosen 2.800230e-05 2.824950e-05 -8.789112e-03
installation 2.800230e-05 2.824950e-05 -8.789112e-03
instapaper 2.800230e-05 2.824950e-05 -8.789112e-03
krass 2.800230e-05 2.824950e-05 -8.789112e-03
medium 2.800230e-05 2.824950e-05 -8.789112e-03
quelle 2.800230e-05 2.824950e-05 -8.789112e-03
rper 2.800230e-05 2.824950e-05 -8.789112e-03
sequenzen 2.800230e-05 2.824950e-05 -8.789112e-03
typisch 2.800230e-05 2.824950e-05 -8.789112e-03
vertrauen 2.800230e-05 2.824950e-05 -8.789112e-03
warm 5.600460e-05 5.649900e-05 -8.789112e-03
damals 7.741812e-05 7.667721e-05 9.616315e-03
gelandet 4.776863e-05 4.842771e-05 -1.370313e-02
statt 1.350699e-04 1.331762e-04 1.411938e-02
deidesheim 4.117985e-05 4.035643e-05 2.019842e-02
10k 1.647194e-05 1.614257e-05 2.019842e-02
13k 8.235970e-06 8.071286e-06 2.019842e-02
fm 5.155717e-04 8.071286e-06 4.156964
#bosc2015 1.647194e-06 1.089624e-04 -4.191929
#smbe14 1.647194e-06 1.089624e-04 -4.191929
#piratencafe 2.701398e-04 4.035643e-06 4.203774
portland 1.647194e-06 1.129980e-04 -4.228297
#lqfb 2.964949e-04 4.035643e-06 4.296865
#mozsprint 1.647194e-06 1.251049e-04 -4.330080
#muenster 3.113197e-04 4.035643e-06 4.345655
yay 4.941582e-06 3.833861e-04 -4.351357
#smbe16 1.647194e-06 1.291406e-04 -4.361828
#force2016 1.647194e-06 1.412475e-04 -4.451440
#futurecommons1.647194e-06 1.412475e-04 -4.451440
nrw 7.330014e-04 8.071286e-06 4.508835
#csvconf 1.647194e-06 1.654614e-04 -4.609664
#mozwow 1.647194e-06 1.654614e-04 -4.609664
#bosc2016 1.647194e-06 1.775683e-04 -4.680282
#cccamp15 1.647194e-06 1.775683e-04 -4.680282
berkeley 1.647194e-06 1.816039e-04 -4.702755
#get2015 1.647194e-06 2.058178e-04 -4.827918
#smbe15 1.647194e-06 2.219604e-04 -4.903425
#piraten 5.600460e-04 4.035643e-06 4.932853
#latergram 1.647194e-06 2.502099e-04 -5.023227
emoji 1.647194e-06 2.502099e-04 -5.023227
#om13 1.647194e-06 2.582811e-04 -5.054975
sci 1.647194e-06 2.703881e-04 -5.100785
mp 3.350393e-03 2.017821e-05 5.112229
#bosc2017 1.647194e-06 3.793504e-04 -5.439387
#mozfest 1.647194e-06 4.116356e-04 -5.521065
#opencon 1.647194e-06 4.519920e-04 -5.614591
ly 3.758897e-03 4.035643e-06 6.836715

We can see that some things, e.g. archives haven't changed much in frequency, as obvious from the small "odds" ratio that's close to zero. Outliers with a large positive values are those that were more frequent in the past, while outliers with values large negative values are more common in more recent tweets. To make it a bit more intuitive we can now plot the Top 15 words with the largest positive/negative odds ratio:

What we see in my data below: My activity in the Pirate Party (and living in the state of North-Rhine Westphalia - short NRW) is clearly in the past and tweets about them occur mainly before 2013 - as demonstrated by hashtags like #piraten, #lptnrw, #lmvnrw, #nrw etc. My activity in the Open Science world is clearly still going strong in comparison: #opencon, #mozfest, bosc201*, #csvconf etc. are overrepresented. When you run the analysis on your data, what trends do you see? Next, let's see what would happen if we excluded hashtags.

In the analysis of my data, we see some changes: now older words are somewhat associated with the German language/dialects (moin, kriegt, bildung, anstatt, neuer) and random Top-Level-Domains that have fallen out of popularity (ly, fm) while the newer trends are

  • english speaking (yay, travels, happily, worries)
  • location based (portland, berkeley, iceland, zurich), fitting an increased travel compared to earlier times
  • phd-related (lichens, ggplot2, markov)
  • and modern-ish inventions. emoji anyone? 😂

    How do your trends compare with what you expect? And now let's do the whole thing by just looking at people I replied to in the past compared to now:

In my data, I notice that the people from my past are largely other Pirate Party members. People from the present heavily feature the Open* & Quantified Self crowd at large (e.g. @o_guest, @sujaik, @protohedgehog, @kevinschawinski, @kaiblin, @eramirez), including some of the awesome people that run/ran Open Humans with me to make this possible (👋 beaugunderson, @madprime, @betatim).

Emoji pairings

In a next step we can now look into highly-frequent pairings of emoji with individual words. For this we filter out the emoji out of all tweet-texts and associate them with words, ignoring stop-words & URLs etc.

Joining, by = c("tweet_id", "timestamp")
item1item2n
fra 31
🎉 opensnp 9
lhr 8
😂 data 7
😂 opencon 6
🎉 humans 4
😍 mozfest 4
🍻 beer 3
🐶 dog 3
💖 hear 3

Looking at the top 10 emoji in my data shows that for me, #mozfest is more 😍, while #opencon is more 😂. And there's two words, fra and lhr where the emoji can't be rendered by R. This would be ✈️, which associates with the IATA codes for the Frankfurt airport and London Heathrow. Check out your own emoji trends for clues into how your emoji use changes with context.

Notebook
Last updated 5 years, 9 months ago

Twitter archive analysis in R

This performs some twitter archive analysis that is based on this chapter of Text Mining with R. To use this notebook you need to have uploaded a Twitter archive into your Open Humans account through http://twarxiv.org. Initially, what you'll see is data I've supplied, and as you run each command, it'll be replaced with your data.

For a start let's install/load all the required packages again:

In [1]:
library(purrr)
library(stringr)
library(tidytext)
library(widyr)
library(httr)
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date


Attaching package: ‘dplyr’

The following objects are masked from ‘package:lubridate’:

    intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Let's now get our access token and request our personal user object that will contain all of our data file downloads etc.:

In [2]:
access_token <- Sys.getenv("OH_ACCESS_TOKEN")
url <- paste("https://www.openhumans.org/api/direct-sharing/project/exchange-member/?access_token=",access_token,sep="")
resp <- GET(url)
user <- content(resp, "parsed")

If you want to look at the data sources you have on Open Humans, uncomment the line below by removing the #:

In [3]:
# user$data

Let's now find the download URL for the Twitter archive from all files:

In [4]:
for (data_source in user$data){
    if (data_source$source == "direct-sharing-70"){
         twitter_archive_url <- data_source$download_url
    }
}

We can now create a temporary file that will contain the whole zipped Twitter archivee, from this we can then unzip and read the tweets.csv file:

In [5]:
temp <- tempfile()
download.file(twitter_archive_url,temp,method='wget')

#unzip(temp, list=TRUE) # this would list all files in the zip archive

data <- read_csv(unz(temp, "tweets.csv"))
Parsed with column specification:
cols(
  tweet_id = col_double(),
  in_reply_to_status_id = col_double(),
  in_reply_to_user_id = col_double(),
  timestamp = col_character(),
  source = col_character(),
  text = col_character(),
  retweeted_status_id = col_double(),
  retweeted_status_user_id = col_double(),
  retweeted_status_timestamp = col_character(),
  expanded_urls = col_character()
)

Now let's convert the timestamps into a proper format and plot a simple histogram of tweets over time:

In [6]:
tweets <- mutate(data,timestamp = ymd_hms(timestamp))
ggplot(tweets, aes(x = timestamp)) +
  geom_histogram(position = "identity", bins = 20, show.legend = FALSE) + theme_minimal()

We can now 'tokenize' (that is, break up into words) the tweet texts, which will make it easier to work with them. This also allows us to easily calculate word frequencies in the next step. This stage also passes words through a stopwords filter. Stopwords are those words which are frequently used (such as 'the', 'we', 'and', and 'I') but provide very little information, and it's common to filter them out during a textual analysis. Stopwords are language dependent, so you may want to change the language default. You can read more about the stopwords function here: https://www.rdocumentation.org/packages/tm/versions/0.7-3/topics/stopwords

In [7]:
replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

frequency <- tidy_tweets %>% 
  count(word, sort = TRUE) 
frequency$freq <- frequency$n / sum(frequency$n)

So, what are the top words that you've used? Mine are the German equivalents of I, all the articles (der, die, das) along with not and is.

In [8]:
head(frequency)
wordnfreq
ich 17749 0.02300768
die 12838 0.01664165
das 12081 0.01566036
der 9146 0.01185578
nicht 8669 0.01123746
ist 8501 0.01101968

Let's now group the tweets by being "old" (pre-2013) and more recent ones (2013 or newer) to see whether the topics you tweet about have changed. After grouping you can calulcate the word-ratios and look at the most unusual and the most recent words:

In [ ]:
cutoff_date <- as.Date("2013-01-01")

You can adapt the cutoff above easily to your own needs. Just replace 2013-01-01 with your own date in the YYYY-MM-DD format.

In [9]:
library(tidyr)

tidy_tweets$date_group <- ifelse(tidy_tweets$timestamp < cutoff_date,"past","today")

word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, date_group) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(date_group, n, fill = 0) %>%
  mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
  mutate(logratio = log(past / today)) %>%
  arrange(desc(logratio))

word_ratios %>% 
  arrange(abs(logratio))
wordpasttodaylogratio
option 8.071251e-05 8.071286e-05 -4.282714e-06
backup 4.447424e-05 4.439207e-05 1.849286e-03
ern 4.447424e-05 4.439207e-05 1.849286e-03
gefallen 7.247654e-05 7.264157e-05 -2.274431e-03
timeline 1.087148e-04 1.089624e-04 -2.274431e-03
wieviel 3.623827e-05 3.632079e-05 -2.274431e-03
haha 2.091936e-04 2.098534e-04 -3.148939e-03
extra 5.271021e-05 5.246336e-05 4.694238e-03
spam 1.054204e-04 1.049267e-04 4.694238e-03
videos 6.424057e-05 6.457028e-05 -5.119383e-03
lasse 6.918215e-05 6.860593e-05 8.363967e-03
bioinformatiker2.800230e-05 2.824950e-05 -8.789112e-03
fliegen 2.800230e-05 2.824950e-05 -8.789112e-03
hosen 2.800230e-05 2.824950e-05 -8.789112e-03
installation 2.800230e-05 2.824950e-05 -8.789112e-03
instapaper 2.800230e-05 2.824950e-05 -8.789112e-03
krass 2.800230e-05 2.824950e-05 -8.789112e-03
medium 2.800230e-05 2.824950e-05 -8.789112e-03
quelle 2.800230e-05 2.824950e-05 -8.789112e-03
rper 2.800230e-05 2.824950e-05 -8.789112e-03
sequenzen 2.800230e-05 2.824950e-05 -8.789112e-03
typisch 2.800230e-05 2.824950e-05 -8.789112e-03
vertrauen 2.800230e-05 2.824950e-05 -8.789112e-03
warm 5.600460e-05 5.649900e-05 -8.789112e-03
damals 7.741812e-05 7.667721e-05 9.616315e-03
gelandet 4.776863e-05 4.842771e-05 -1.370313e-02
statt 1.350699e-04 1.331762e-04 1.411938e-02
deidesheim 4.117985e-05 4.035643e-05 2.019842e-02
10k 1.647194e-05 1.614257e-05 2.019842e-02
13k 8.235970e-06 8.071286e-06 2.019842e-02
fm 5.155717e-04 8.071286e-06 4.156964
#bosc2015 1.647194e-06 1.089624e-04 -4.191929
#smbe14 1.647194e-06 1.089624e-04 -4.191929
#piratencafe 2.701398e-04 4.035643e-06 4.203774
portland 1.647194e-06 1.129980e-04 -4.228297
#lqfb 2.964949e-04 4.035643e-06 4.296865
#mozsprint 1.647194e-06 1.251049e-04 -4.330080
#muenster 3.113197e-04 4.035643e-06 4.345655
yay 4.941582e-06 3.833861e-04 -4.351357
#smbe16 1.647194e-06 1.291406e-04 -4.361828
#force2016 1.647194e-06 1.412475e-04 -4.451440
#futurecommons1.647194e-06 1.412475e-04 -4.451440
nrw 7.330014e-04 8.071286e-06 4.508835
#csvconf 1.647194e-06 1.654614e-04 -4.609664
#mozwow 1.647194e-06 1.654614e-04 -4.609664
#bosc2016 1.647194e-06 1.775683e-04 -4.680282
#cccamp15 1.647194e-06 1.775683e-04 -4.680282
berkeley 1.647194e-06 1.816039e-04 -4.702755
#get2015 1.647194e-06 2.058178e-04 -4.827918
#smbe15 1.647194e-06 2.219604e-04 -4.903425
#piraten 5.600460e-04 4.035643e-06 4.932853
#latergram 1.647194e-06 2.502099e-04 -5.023227
emoji 1.647194e-06 2.502099e-04 -5.023227
#om13 1.647194e-06 2.582811e-04 -5.054975
sci 1.647194e-06 2.703881e-04 -5.100785
mp 3.350393e-03 2.017821e-05 5.112229
#bosc2017 1.647194e-06 3.793504e-04 -5.439387
#mozfest 1.647194e-06 4.116356e-04 -5.521065
#opencon 1.647194e-06 4.519920e-04 -5.614591
ly 3.758897e-03 4.035643e-06 6.836715

We can see that some things, e.g. archives haven't changed much in frequency, as obvious from the small "odds" ratio that's close to zero. Outliers with a large positive values are those that were more frequent in the past, while outliers with values large negative values are more common in more recent tweets. To make it a bit more intuitive we can now plot the Top 15 words with the largest positive/negative odds ratio:

In [10]:
word_ratios %>%
  group_by(logratio < 0) %>%
  top_n(15, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio past (before cutoff) / now (after cutoff)") +
  scale_fill_discrete(name = "", labels = c("past", "now")) + theme_minimal()

What we see in my data below: My activity in the Pirate Party (and living in the state of North-Rhine Westphalia - short NRW) is clearly in the past and tweets about them occur mainly before 2013 - as demonstrated by hashtags like #piraten, #lptnrw, #lmvnrw, #nrw etc. My activity in the Open Science world is clearly still going strong in comparison: #opencon, #mozfest, bosc201*, #csvconf etc. are overrepresented. When you run the analysis on your data, what trends do you see? Next, let's see what would happen if we excluded hashtags.

In [11]:
word_ratios %>%
  filter(!str_detect(word, "^#")) %>%
  group_by(logratio < 0) %>%
  top_n(15, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio past (before cutoff) / now (after cutoff)") +
  scale_fill_discrete(name = "", labels = c("past", "now")) + theme_minimal()

In the analysis of my data, we see some changes: now older words are somewhat associated with the German language/dialects (moin, kriegt, bildung, anstatt, neuer) and random Top-Level-Domains that have fallen out of popularity (ly, fm) while the newer trends are

  • english speaking (yay, travels, happily, worries)
  • location based (portland, berkeley, iceland, zurich), fitting an increased travel compared to earlier times
  • phd-related (lichens, ggplot2, markov)
  • and modern-ish inventions. emoji anyone? 😂

    How do your trends compare with what you expect? And now let's do the whole thing by just looking at people I replied to in the past compared to now:

In [12]:
tidy_tweets %>%
  filter(str_detect(word, "^@")) %>%
  filter(!str_detect(word, "^@ny")) %>%
  count(word, date_group) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(date_group, n, fill = 0) %>%
  mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
  mutate(logratio = log(past / today)) %>%
  arrange(desc(logratio)) %>%
  group_by(logratio < 0) %>%
  top_n(15, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio past (before cutoff) / now (after cutoff)") +
  scale_fill_discrete(name = "", labels = c("past", "now")) + theme_minimal()

In my data, I notice that the people from my past are largely other Pirate Party members. People from the present heavily feature the Open* & Quantified Self crowd at large (e.g. @o_guest, @sujaik, @protohedgehog, @kevinschawinski, @kaiblin, @eramirez), including some of the awesome people that run/ran Open Humans with me to make this possible (👋 beaugunderson, @madprime, @betatim).

Emoji pairings

In a next step we can now look into highly-frequent pairings of emoji with individual words. For this we filter out the emoji out of all tweet-texts and associate them with words, ignoring stop-words & URLs etc.

In [13]:
emoji_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>%
  filter(!str_detect(text, "^@")) %>%
  filter(str_detect(text, "[\\uD83C-\\uDBFF\\uDC00-\\uDFFF]+")) %>%
  mutate(Emoji = str_extract_all(text, 
                  "[\\uD83C-\\uDBFF\\uDC00-\\uDFFF]+")) %>%
  select(tweet_id, timestamp, Emoji,text)
In [14]:
emoji_tweets <- emoji_tweets %>%
  select(-Emoji) %>%
  unnest_tokens(word, text) %>%
  left_join(emoji_tweets) %>%
  mutate(Emoji = map_chr(Emoji, ~ ifelse(length(.x) > 0, .x[[1]], ""))) %>%
  mutate(word = str_replace_all(word, "’", "'")) %>%
  filter(!(Emoji %in% c("", "-"))) %>%
  filter(!(word %in% c("t.co", "http",'https'))) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
Joining, by = c("tweet_id", "timestamp")
In [15]:
emo_twids <- emoji_tweets %>%
  select(tweet_id, Emoji) %>%
  distinct() %>%
  rename(word = Emoji)
  
emoji_tweets %>%
  select(tweet_id, word) %>%
  bind_rows(emo_twids) %>%
  pairwise_count(word, tweet_id, sort = TRUE) %>%
  filter(item1 %in% unique(emoji_tweets$Emoji)) %>%
  group_by(item1) %>%
  slice(1:2) %>%
  ungroup() %>%
  filter(nchar(item2) > 2) %>%
  arrange(desc(n)) %>% head(n=10)
item1item2n
fra 31
🎉 opensnp 9
lhr 8
😂 data 7
😂 opencon 6
🎉 humans 4
😍 mozfest 4
🍻 beer 3
🐶 dog 3
💖 hear 3

Looking at the top 10 emoji in my data shows that for me, #mozfest is more 😍, while #opencon is more 😂. And there's two words, fra and lhr where the emoji can't be rendered by R. This would be ✈️, which associates with the IATA codes for the Frankfurt airport and London Heathrow. Check out your own emoji trends for clues into how your emoji use changes with context.

In [ ]: