Twitter’s Big Data Analysis Using RStudio
HOUSSAME EDDINE BALOULI1 and LAZHAR CHINE2
2Associate Professor, Boumerdes University, Algeria,
4.1.1 BIG DATA ANALYTICS PROCESS
The big data analytics process generally comprises several steps. The first step is to identify the business problem (the subject) to be solved. Next, the data (input) must be collected from all relevant sources; this step is essential. The data are then cleaned to make them ready for analysis. In the analytics step, a model, picture, equation, or other type of result is constructed. Finally, we interpret the output in order to improve the quality of the decision-making process and clarify the future (Figure 4.1).
FIGURE 4.1 Big data analytics process.
Source: R Core Team (2018).
4.1.2 SOCIAL MEDIA ANALYTICS (SMA)
SMA refers to collecting, cleaning, and processing structured or unstructured, qualitative or quantitative data loaded from social media websites. Social media is a broad term encompassing a variety of online platforms that allow subscribers to exchange content, sentiment, information, and more.
Twitter, with its 140-character rule, is one of the most popular social media websites (after Facebook), with more than 500 million tweets per day. Twitter has grown rapidly since its creation 13 years ago. An advantage of Twitter is that all tweets are shown in real time, so information can reach a large number of subscribers in a very short time.
4.1.3 TWEETS ANALYTICS PROCESS
The tweets analytics process can be described as follows:
i. Creating a Connection to the Twitter Server: Create a new app on the development website (http://developer.twitter.com/). You will be asked several questions about the nature of your project and the approach followed. The objective is to obtain access to tweets.
ii. Select the Data Type: Select one or more keywords that you need to study and analyze.
iii. Extract the Data: Use the “searchTwitter” function of the “twitteR” package, or other packages. You can choose the number of tweets you need and can specify the language, the date, and other parameters. All these options depend on the nature of the keywords you search for.
iv. Clean the Data: Using the “tm” (text mining) package, we extract only the text from the loaded data. After that, we remove numbers, punctuation, and other special characters from the text. The next step eliminates a group of common English words such as “did,” “do,” “must,” “they,” and others. The list of these words is shown in the practical part. At this level, we also eliminate repeated words.
v. Data Processing: The data are now clean and ready for analysis. Using the “wordcloud” package, we create the word clouds for both Trump and Trudeau. Many options are available, such as the colors and the position of the words in the cloud.
4.1.4 WORD CLOUD UTILITY
A word cloud is a powerful communication tool that is easy to understand and share. It summarizes a topic well and is an efficient method for text analysis, adding simplicity and clarity. The most frequently used keywords stand out in a word cloud, which is visually easier to interpret than a data table filled with text. But who uses word clouds?
Word cloud has several uses:
• Laboratories: For the presentation of both quantitative and qualitative data.
• Marketing Campaigns: To highlight customer needs and identify dissatisfaction.
• Education: To support essential topics.
• Politicians and journalists.
• Social Network: To collect, analyze, and share the user’s sentiments.
4.2 EXPERIMENTAL METHODS AND MATERIALS
4.2.1 LOADING LIBRARIES (PACKAGES)
We used several packages in our study: “twitteR” to connect to the Twitter website, “tm” for the text mining phase, and “wordcloud” for the presentation.
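The loading step can be sketched as follows. This is a minimal sketch rather than the exact code used in the study: it assumes the three packages are installed from CRAN, and the variable names `pkgs`, `is_installed`, and `missing_pkgs` are illustrative.

```r
# Packages named in the text: twitteR (connection), tm (text mining),
# wordcloud (presentation).
pkgs <- c("twitteR", "tm", "wordcloud")

# Check which packages are available before trying to attach them.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Report what is missing instead of failing halfway through the script.
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "),
          ". Install them with install.packages().")
} else {
  # Attach all three packages for the rest of the session.
  for (p in pkgs) library(p, character.only = TRUE)
}
```

Checking availability first gives a clear message about what to install, which is more helpful than an error from the first failing `library()` call.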
4.2.2 CREATING A CONNECTION
The next step is to connect to the Twitter server. For this, we must have authentication keys, which can be obtained by registering on the development website (https://developer.twitter.com/). The process is not very complex.
# consumer_key is assigned in the same way (its value is not shown here)
consumer_secret <- "xaA9ihDYXiGWkECOxPC45S6VRzlcnNR29rZWchORLGWqvDgPVw"
access_token <- "1013074431550291968-jxjLtzaELHQB0xqQIrBTkzf2EOsNAg"
access_secret <- "LDEzIC5kwlJwpZK39nsH5gBapE5a93gAVFn7du45zEHKX"
Once obtained, we pass the strings consumer_key, consumer_secret, access_token, and access_secret to the setup_twitter_oauth command:
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
[1] "Using direct authentication"
The message “Using direct authentication” should appear in the console, indicating that the operation is running smoothly.
4.2.3 EXTRACTION OF TWEETS
The searchTwitter function is used to load tweets online. In our study, we specify two keywords: @realDonaldTrump and @JustinTrudeau.
tweets_Trump <- searchTwitter("@realDonaldTrump", n = 5000, since = '2017-01-01')
tweets_Trudeau <- searchTwitter("@JustinTrudeau", n = 5000, since = '2017-01-01')
We limit the number of extracted tweets to n = 5000 and set the start date to 01/01/2017. We are interested in English-language tweets. The extraction was performed on 25/01/2019 at 16:00 UTC.
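The calls above do not show a language filter. One way to make the restriction to English explicit is the `lang` argument of `searchTwitter`. The helper below is a hypothetical sketch, not part of the original study: the name `fetch_tweets` and its default values are assumptions chosen to mirror the parameters described in the text.

```r
# Hypothetical wrapper around twitteR::searchTwitter that makes the
# search parameters explicit, including the language filter.
fetch_tweets <- function(handle, n = 5000, since = "2017-01-01", lang = "en") {
  # Requires a session already authenticated with setup_twitter_oauth().
  twitteR::searchTwitter(handle, n = n, since = since, lang = lang)
}

# Usage (only works after authentication, so it is left commented out):
# tweets_Trump   <- fetch_tweets("@realDonaldTrump")
# tweets_Trudeau <- fetch_tweets("@JustinTrudeau")
```

Putting the parameters in one function ensures that both keywords are queried under identical conditions, which matters when the two results are compared.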
4.2.4 THE STRUCTURE OF THE OUTPUT
Using the “str” base function of R, we confirm that the output is a list. A list can hold elements of different types, so our output contains characters, numbers, and other types of data.
str(tweets_Trump)
List of 5000
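Because the output is a list of tweet objects, it can be flattened into a data frame to inspect its fields more easily. The following is a minimal sketch under stated assumptions: it relies on the `twListToDF` function of twitteR and on its column names `screenName`, `created`, `id`, and `text`. The helper name `tweet_overview` is hypothetical.

```r
# Hypothetical helper: convert a list of tweets returned by searchTwitter
# into a data frame and keep a few descriptive columns.
tweet_overview <- function(tweets, n = 5) {
  df <- twitteR::twListToDF(tweets)
  # Author, publication date, tweet identifier, and text of the first n tweets.
  head(df[, c("screenName", "created", "id", "text")], n)
}

# Usage (requires the extracted tweets, so it is left commented out):
# tweet_overview(tweets_Trump)
```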
4.2.5 FIRST TWEETS
We can display any tweet we need and learn many things about it, such as its date of publication, its author, its identifier (ID), and other information.
print(tweets_Trump[[1]])
[1] "keikomeff: @realDonaldTrump Lock him up! You're next! Mueller is coming!"
print(tweets_Trudeau[[1]])
[1] "Cheryl_Wildlife: RT @Pam_Palmater: Good grief. INAC ALWAYS blames some inanimate object like a law or policy for why horrific things are done to First Natio…"
4.2.6 CLEANING THE OUTPUT
The first step is to extract only the text, using the “sapply” function.
Trump_text <- sapply(tweets_Trump, function(x) x$getText())
Trudeau_text <- sapply(tweets_Trudeau, function(x) x$getText())
Then, we create a corpus using the “Corpus” function.
Trump_corpus <- Corpus(VectorSource(Trump_text))
Trudeau_corpus <- Corpus(VectorSource(Trudeau_text))
After that, we remove numbers, punctuation, special characters, extra whitespace, and a group of common English words such as “they,” “you,” and “must” from the corpus. The text is then ready to be analyzed by constructing the word cloud.
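The cleaning step described above can be sketched with the transformations of the tm package. This is a minimal sketch, not the authors' exact code: `sample_text` is a hypothetical stand-in for the extracted tweet texts, `clean_corpus` is an illustrative helper name, and tm's built-in English stop-word list is assumed, which may differ from the word list used in the study.

```r
library(tm)

# Hypothetical sample texts standing in for the extracted tweets.
sample_text <- c("They MUST build 2 walls!!!",
                 "You did see the border, the BORDER news?")
sample_corpus <- Corpus(VectorSource(sample_text))

# Illustrative helper that applies the cleaning steps in sequence.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
  corpus <- tm_map(corpus, removeNumbers)                      # digits
  corpus <- tm_map(corpus, removePunctuation)                  # punctuation
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # common words
  corpus <- tm_map(corpus, stripWhitespace)                    # extra spaces
  corpus
}

sample_clean <- clean_corpus(sample_corpus)

# Extract the cleaned texts as a character vector for inspection.
cleaned_text <- unlist(lapply(sample_clean, as.character))
print(cleaned_text)
```

Applied to the real data, the same helper would give, for example, `Trump_clean <- clean_corpus(Trump_corpus)`. Lowercasing comes first because the stop-word list is in lower case.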
4.3 RESULTS AND DISCUSSION
The first result is the Trump word cloud, created using the “wordcloud” function. We limit the number of displayed words to 100.
Trump <- wordcloud(Trump_clean, random.order = F, max.words = 100, scale = c(3, 0.5), colors = rainbow(60))
The word cloud of Trump is given in Figure 4.2.
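A word cloud sizes each word by how often it occurs, so the quantity behind the figure is a word-frequency table. The following base R sketch illustrates that computation under stated assumptions: the vector `cleaned` holds hypothetical cleaned texts, not the tweets analyzed in this study.

```r
# Hypothetical cleaned texts (lower case, no punctuation or stop words).
cleaned <- c("border wall border",
             "mueller fbi border",
             "government shutdown wall")

# Split the texts into individual words.
words <- unlist(strsplit(cleaned, "\\s+"))

# Count occurrences and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)

# Keep only the most frequent words, as max.words does in the word cloud.
top_words <- head(freq, 3)
print(top_words)

# The same table could be passed to the wordcloud function, for example:
# wordcloud(names(freq), as.integer(freq), max.words = 100)
```

Inspecting the frequency table alongside the figure makes it easier to justify which words are treated as significant in the discussion.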
In the word cloud, we notice many words related to different contexts: political, economic, and others. These words include government, democrats, Mueller (Robert Mueller, the special counsel investigating possible Russian interference in the elections), border (the Mexican border), Maduro (the Venezuelan President), FBI, and Pelosi (Nancy Pelosi, Speaker of the US House of Representatives). They point to the problems that the American president is facing and to his intervention even in matters outside the United States (Venezuela, for example).
FIGURE 4.2 Trump word cloud.
The word cloud of Trump allows us to discover many things about his personality, his thinking, his vision about many political, social, and economic events, and his interaction with the outside world.
The second word cloud is about Canadian Prime Minister Justin Trudeau. We use the same function as for the previous Trump word cloud (Figure 4.3).
FIGURE 4.3 Trudeau word cloud.
This word cloud does not tell us much about Trudeau. There are no significant words except Canada or Canadian. Although we extracted 5,000 tweets about the Canadian Prime Minister, this simple analysis did not allow us to analyze his personality or learn much about him.
4.4 CONCLUSION, LIMITATION, AND FUTURE RESEARCH
Unlike some previous studies published in non-peer-reviewed sources, which have shown only how to extract tweets, we compared extracted tweets in a specific area (politics) and explained how they can be used to extract important information.
Twitter’s big data is a treasure that we need to preserve, explore, and analyze. Personal data, statistics, emotions, visions, reports, plans, strategies, and many other types of data are available for analysis.
The study of tweets is a strong focus of social network analysis because Twitter has become an important factor in communication. This example shows that it is easy to start a first analysis from data extracted directly online. The data preparation phase is as important as ever.
In addition, the R programming language, with its RStudio interface, is a powerful tool for extracting big data online and cleaning it for use and study. It allows us to build many powerful plots and graphs that help managers, researchers, governments, and other actors in the decision-making process.
Regarding the limits of the study, the word cloud alone is not enough to extract information about the studied keywords, because the process does not always work well. The study should be extended with advanced techniques such as sentiment analysis, which is the subject of our next work.
KEYWORDS
• big data
• programming language R
• RStudio
• Trudeau word cloud
• word cloud
REFERENCES
1. R Core Team, (2018). R: A Language and Environment for Statistical Computing. Retrieved from CRAN: https://www.R-project.org/ (accessed on 22 October 2020).
2. Baesens, B., (2014). Analytics in a Big Data World: The Essential Guide to Data Science and its Applications. John Wiley & Sons.
3. Gandomi, A., & Haider, M., (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.
4. Zhao, Y., (2013). Analyzing Twitter data with text mining and social network analysis. The 11th Australasian Data Mining and Analytics Conference (AusDM 2013).
5. Fellows, I., (2018). wordcloud: Word Clouds. Retrieved from CRAN: https://CRAN.R-project.org/package=wordcloud (accessed on 22 October 2020).
6. Gentry, J., (2015). twitteR: R Based Twitter Client. Retrieved from CRAN: https://cran.r-project.org/web/packages/twitteR/index.html (accessed on 22 October 2020).
7. Feinerer, I., & Hornik, K., (2018). tm: Text Mining Package. Retrieved from CRAN: https://CRAN.R-project.org/package=tm (accessed on 22 October 2020).