There has been talk about using social media for epidemiological purposes. For example, in areas where keywords such as “malaria” show up more often in social media, an increase in malaria prevalence would be expected. I don’t think social media can be applied for epidemiological purposes. There are too many biases. Still, that doesn’t mean there aren’t any nice scientific applications that use social media as a tool.
To start from the beginning, how do you extract keywords from social media? For example, which words are currently associated with the subject #plasmodium? Here I will be using an R package that talks to the Twitter Application Programming Interface (API) to search through the Twitter data.
To set up your API access for Twitter, follow the explanation on the Decision stats website.
Before you set up the connection with twitteR, load the R libraries that you’re going to use:
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(XML)
library(igraph)
Setting up the connection with twitteR can sometimes be a pain. If you use RStudio, you have to retype the web address containing the long token key by hand. Once you have shaken hands with Twitter, search through their database for a subject and convert the data to a data frame:
## Plasmodium twitter cloud
a <- searchTwitter("#plasmodium", n=1000)
tweets_df <- twListToDF(a) #Convert to data frame
Next, clean up the text and count how often each word occurs.
b <- Corpus(VectorSource(tweets_df$text), readerControl = list(language = "eng"))
b <- tm_map(b, content_transformer(tolower)) #Changes case to lower case (newer tm versions need content_transformer here)
b <- tm_map(b, stripWhitespace) #Strips white space
b <- tm_map(b, removePunctuation) #Removes punctuation
b <- tm_map(b, removeWords, stopwords("english")) #Removes English stopwords like 'the'
b <- tm_map(b, removeNumbers) #Removes numbers
inspect(b) #Prints the cleaned corpus
tdm <- TermDocumentMatrix(b) #Rows are terms, columns are tweets
m1 <- as.matrix(tdm)
v1 <- sort(rowSums(m1), decreasing=TRUE) #Total count per term, most frequent first
d4 <- data.frame(word = names(v1), freq = v1)
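If you want to see what the cleaning steps do without a Twitter connection, here is a minimal sketch in base R. It is a stand-in for the tm pipeline, not a copy of it: the three tweets are invented, and the short `stop_words` vector is a made-up substitute for the much longer list that `stopwords("english")` returns.

```r
# Invented example tweets (not real Twitter data)
tweets <- c(
  "New #Plasmodium falciparum genome paper out today!",
  "Malaria cases: 200 reported; Plasmodium vivax suspected",
  "The plasmodium life cycle is the key to new malaria drugs"
)

# Tiny illustrative stopword list (assumption, far shorter than tm's)
stop_words <- c("the", "is", "to", "new", "out", "today")

clean <- tolower(tweets)                          # lower case
clean <- gsub("[[:punct:]]", "", clean)           # drop punctuation, including '#'
clean <- gsub("[[:digit:]]+", "", clean)          # drop numbers
words <- unlist(strsplit(clean, "[[:space:]]+"))  # split on runs of white space
words <- words[words != "" & !(words %in% stop_words)]

# Count the words and put the most frequent first
freq <- sort(table(words), decreasing = TRUE)
word_freq <- data.frame(word = names(freq),
                        freq = as.integer(freq),
                        stringsAsFactors = FALSE)
print(head(word_freq, 2))
```

In this toy data "plasmodium" comes out on top with three mentions, followed by "malaria" with two. The resulting `word_freq` has the same two columns as `d4`, so it could be passed to a plotting function in the same way.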
And plot the wordcloud:
wordcloud(d4$word, d4$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
You get a Twitter wordcloud around #plasmodium:
This is just one example. The data can be presented any way you like. On the website ‘mining in Twitter with R’, you can find some more nice examples.
This Twitter data mining tool is a goldmine for marketing. How can you use it for more scientific purposes? Well, for example, you can search for the words or subjects that link ‘#malaria’ and ‘@billgatesfoundation’. What are the trends? That may be useful when you are writing a grant proposal!
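One way to look for associated words is to check which terms tend to appear in the same tweets as your search term. The sketch below does this in base R with a simple correlation over a yes/no term-by-tweet matrix. The tweets are invented, "gates" is only a placeholder token for the foundation's account, and all variable names are my own. As far as I know, `findAssocs()` in the tm package offers a similar correlation-based lookup on a real `TermDocumentMatrix`.

```r
# Invented, already-cleaned tweets (not real Twitter data)
tweets <- c(
  "malaria funding gates vaccine",
  "malaria vaccine trial gates",
  "plasmodium microscopy slide",
  "gates funding malaria",
  "plasmodium genome paper"
)

docs  <- strsplit(tweets, " ")
terms <- sort(unique(unlist(docs)))

# Binary term-by-tweet matrix: 1 if the term occurs in the tweet, else 0
term_matrix <- sapply(docs, function(d) as.integer(terms %in% d))
rownames(term_matrix) <- terms

# Correlate the target term with every other term across tweets
target <- "malaria"
others <- setdiff(terms, target)
assoc  <- sapply(others, function(term) cor(term_matrix[target, ], term_matrix[term, ]))
assoc  <- sort(assoc, decreasing = TRUE)

print(round(assoc, 2))
```

A value near 1 means the word almost always shows up together with the target, a value near -1 means it shows up in the other tweets. With only five toy tweets the numbers mean little; the point is the shape of the calculation.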