R client for accessing Google Cloud Natural Language APIs
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
trump_tweets <- readRDS("data/text-trumptweets-demo.rds")
sa <- readRDS("data/sa-trumptweets-demo.rds")
R client for interacting with Google Cloud Natural Language APIs.
This is a work in progress. To date, this package only interacts with the
analyzeSentiment
API. If this project interests you, please let me know.
Google’s documentation is fantastic.
Click the link below for a quick start guide.
Before you can use Google’s NLP APIs, you’ll need an API key. For that, I
recommend following the quick start guide linked above. It’ll basically go
something like this:
The first time you use any of the functions from the googleapis package, you’ll
be asked for your API key. It’ll be set as an environment variable and saved for
future sessions, so you should only have to do this once (per machine).
Install googleapis
using the devtools
package.
if (!"devtools" %in% installed.packages()) {
install.packages("devtools")
}
devtools::install_github("mkearney/googleapis")
Download tweets using the rtweet
package, which can be installed from CRAN
install.package("rtweet")
or the dev version from Github.
devtools::install_github("mkearney/rtweet")
Use rtweet
to download 1,000 non-retweeted Twitter statuses about Trump (for
best results specify single language).
trump <- rtweet::search_tweets("Trump lang:en", n = 1000, include_rts = FALSE)
Prep the text by first removing URLs, AT (@) mentions, and line breaks
trump_tweets <- gsub(
"@\\S{1,}|http\\S{1,}|\\n", "", trump$text
)
Second, because Google’s sentiment analysis API will actually return
sentiment scores for every sentence, remove any periods at the start of tweets.
trump_tweets <- gsub(
"^\\.\\S{0,}", "", trump_tweets
)
Third, remove all extra spaces (for the record, I’m not entirely sure if
this matters, but if you’re going to be analyzing individual words at any point,
it’s not a bad idea).
trump_tweets <- trimws(gsub(
"[ ]{2,}", " ", trump_tweets
))
Here’s what the tweets look like for me:
head(trump_tweets)
With the text ready to be analyzed, load googleapis
if you haven’t already.
library(googleapis)
Conduct sentiment analysis on the character vector of tweets (text) using the
analyze_sentiment()
function.
sa <- analyze_sentiment(trump_tweets)
#saveRDS(sa, "data/sa-trumptweets-demo.rds")
#saveRDS(trump_tweets, "data/text-trumptweets-demo.rds")
Note: replicate this tutorial using the specific data set used here by reading
in the R data files as done in the code below:
trump_tweets <- readRDS(
"https://github.com/mkearney/googleapis/blob/master/data/text-trumptweets-demo.rds?raw=true"
)
sa <- readRDS(
"https://github.com/mkearney/googleapis/blob/master/data/sa-trumptweets-demo.rds?raw=true"
)
The output is a bit messy, but it’s easily converted into a tidy data frame
using the as.data.frame
or as_tibble
functions.
sa <- tibble::as_tibble(sa)
sa
Each row in the converted data frame represents one sentence of the provided text. For each observation, there are seven features (variables):
id
ID assigned to each document (in this case, each tweet)unit
Unit of analysis, either “document” or “sentence”score
The sentiment (along positive and negative dimensions) score of the sentence standardized on a -1.0 to 1.0 scale.magnitude
The magnitude of the score (positive, unstandardized)position
The ordinal position of a given sentence as a sequence within a single tweet (e.g., first sentence of a document, second sentence of a document)offset
The relative position, in number of characters, from which the sentence started within a documentcontent
The text that was analyzed (corresponds with unit)Explore the data using the tidyverse.
## i subscribe to the tidyverse
suppressPackageStartupMessages(library(tidyverse))
Histogram of scores faceted by unit
sa %>%
ggplot(aes(score, fill = unit)) +
geom_histogram(binwidth = .1) +
facet_wrap(~ unit)
Box plot for each sentence position number
p <- sa %>%
filter(unit == "sentence") %>%
mutate(position = factor(position)) %>%
ggplot(
aes(x = position, y = score, colour = position, fill = position)
) +
geom_boxplot(outlier.shape = NA, alpha = .7) +
geom_jitter(alpha = .4, shape = 21) +
theme(
legend.position = "none",
text = element_text(family = "Helvetica Neue", colour = "black"),
axis.text = element_text(colour = "black"),
plot.caption = element_text(size = rel(.8), colour = "#777777"),
plot.title = element_text(face = "bold")
) +
labs(
title = "Google Cloud sentiment analysis of tweets about Trump",
subtitle = "Positive and negative sentiment scores (n = 1,549) broken down by sentence position",
x = "Sentence Position",
y = "Sentiment",
caption = "\nData collected and analyzed in R using rtweet and googleapis packages\nCode and plot by Michael W. Kearney 2017"
)
png("demo_plot.png", 7.5, 5.3, "in", 10, res = 127.5)
p
dev.off()