The Importance of Muslim Grievance in Public Communication - Exploring German Muslim Organizations on Twitter
47 minute read
This post examines the Twitter accounts of German Muslim organizations, specifically Generation Islam (GI), a group mostly labelled as radical in the German context. The analysis highlights the significance of grievance as a central theme for GI and compares how other organizations respond to Muslim grievances, among other things showing that they lack a similar level of condemnation. Graphs visualize and compare the role grievances play for GI, while commentary provides interpretation. If you only want a concise understanding of the topic, you can skip the coding details and prerequisites and focus on the graphs and commentary.
Packages and Data
I used a variety of packages for my analysis. Through the pacman package, I load all of the packages needed. Its function p_load installs any packages that are not yet installed and then loads them.
pacman::p_load(tidyverse, ggrepel, lubridate, ggpubr, quanteda, DT, plotly,
quanteda.textplots, quanteda.textstats, quanteda.textmodels,
grid, gridExtra, ldatuning, stm, tidytext, seededlda,
caret, glmnet, tibble, kableExtra, caretEnsemble, ranger, rtweet)
The main actor I chose to analyse is Generation Islam (short GI). They can best be described as political activists who are mainly, but not exclusively, occupied with online political commentary on topics affecting Muslims in Germany and worldwide. The group advocates the (re-)establishment of an Islamic caliphate as the solution to Muslim grievances worldwide. For comparison, I also looked at:
- Central Council of Muslims in Germany (“Zentralrat der Muslime in Deutschland”, short ZMD)
- Turkish-Islamic Union for Religious Affairs (“Türkisch-Islamischen Union der Anstalt für Religion”, short DITIB)
- Islamic Community Millî Görüş (“Islamische Gemeinschaft Millî Görüş”, short IGMG)1
- Islamic Council for the Federal Republic of Germany (“Islamrat für die Bundesrepublik Deutschland”, short IR)
- Alhambra Society (“Alhambra-Gesellschaft”).
I retrieved the data in April 2022 using RTweet. By utilizing the (free) Twitter Sandbox API, I gained access to up to the latest 3,200 tweets for each account. Obtaining the data with RTweet is straightforward and can be achieved as follows.
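For illustration, here is a minimal retrieval sketch. It is not run in this post and assumes a Twitter API token has already been set up for rtweet; the account handle and file name simply mirror those used below.
## Not run: pull the most recent tweets of an account and store them as an RDS file
# genislam <- rtweet::get_timeline("genislam1", n = 3200)
# saveRDS(genislam, "genislam.RDS")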
Since I have the data already lying around, I will load them into my environment and append them. You can get the data in the corresponding Github repository.
## Generation Islam
genislam <- readRDS("genislam.RDS")
## ZMD
zmd <- readRDS("zmd")
## All Actors
all_islam <- bind_rows(genislam,
zmd,
readRDS("alhm"),
readRDS("ditib"),
readRDS("igmggenclik"),
readRDS("islamratbrd"))
Let's create a descriptive table that displays the number of tweets retrieved and the corresponding time span:
all_islam %>%
group_by(screen_name) %>%
summarise(n = n(), `First Tweet` = min(created_at), `Last Tweet` = max(created_at))
## # A tibble: 6 × 4
## screen_name n `First Tweet` `Last Tweet`
## <chr> <int> <dttm> <dttm>
## 1 Alhambra_eV 1335 2017-10-26 17:16:27 2022-04-27 09:46:43
## 2 DITIBkoln 978 2012-02-28 12:29:26 2022-04-27 22:57:31
## 3 Islamratbrd 1246 2013-09-27 19:02:21 2022-04-27 09:52:15
## 4 der_zmd 1534 2016-05-16 16:02:12 2022-04-22 12:42:30
## 5 genislam1 3132 2019-07-22 16:33:18 2022-04-19 13:59:34
## 6 igmggenclik 3200 2016-09-24 08:23:58 2022-04-27 13:58:20
Here we can see that the number of tweets and the time spans covered vary considerably. This matters because different political events occurred across these periods, and Generation Islam would certainly have commented on some of them. Hence, there is a degree of selection bias arising from the fact that the data do not reach back far enough. Assuming that Generation Islam is ideologically consistent in how it interprets the world and frames political events, I expect the overarching narratives to remain similar across varying contexts. This, together with the fact that certain topics are likely to recur because some political conflicts remain unresolved, partly mitigates the issue of selection bias.
Tweet Activity of Generation Islam
Let us have a look at the tweet activity of Generation Islam over the retrieved time span:
ts <- genislam
ts$tweet <- 1L
ts <- ts %>%
mutate(weekly = str_c(
formatC(isoweek(created_at), format = "f", digits = 0, width = 2, flag = "0"),
"/",
str_sub(isoyear(created_at), 3, 4))) %>%
group_by(screen_name) %>%
arrange(created_at, .by_group = TRUE) %>%
mutate("cumulative" = cumsum(tweet))
ts <- ts %>%
group_by(screen_name, weekly) %>%
mutate("week_cumulative" = cumsum(tweet))
# Adding all tweets together
cumulative_tweet <- ggplot(ts) +
geom_line(aes(x=as.Date(created_at), y=cumulative)) +
xlab("Date") +
ylab("Tweets (cumulative)") +
scale_x_date(date_labels = "%b %y", date_breaks = "2 month") +
theme_minimal() +
labs(title = "A") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# Adding up all tweets per week
week_tweet <- ggplot(ts) +
geom_point(aes(x=as.Date(created_at), y=week_cumulative), alpha = 0.1) +
geom_smooth(aes(x=as.Date(created_at), y=week_cumulative), se = FALSE, color = "black",
method = "gam", formula = y ~ s(x)) +
xlab("Date") +
ylab("Tweets (weekly)") +
scale_x_date(date_labels = "%b %y", date_breaks = "2 month") +
theme_minimal() +
labs(title = "B") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# Displaying both plots together
ggarrange(cumulative_tweet, week_tweet, ncol = 1, nrow = 2, align = "v")
Examining this figure, we observe that the weekly activity of Generation Islam fluctuates considerably. Notably, the peak in weekly activity occurred in May 2021, which aligns with the 2021 Israel-Palestine crisis. This suggests a connection between political events and the increased engagement of Generation Islam during that period: they react quickly and amplify their communication when Muslim populations around the world experience significant grievances.
Topics and Discourses
Hashtags
Hashtags can give away which Twitter discourses actors want to penetrate. Let us have a look at how Generation Islam uses hashtags and what that says about their communication, starting with their 50 most used hashtags:
# Creating a Document-Feature-Matrix that seeks out for hashtags as a pattern
tweet_dfm <- tokens(genislam$text, remove_punct = TRUE) %>%
dfm() %>%
dfm_select(pattern = "#*")
# Extract the 50 most used hashtags
toptag <- names(topfeatures(tweet_dfm, 50))
# Display the 50 most used hashtags
toptag_table <- bind_cols("1-10" = toptag[1:10],
"11-20" = toptag[11:20],
"21-30" = toptag[21:30],
"31-40" = toptag[31:40],
"41-50" = toptag[41:50])
knitr::kable(toptag_table)
1-10 | 11-20 | 21-30 | 31-40 | 41-50 |
---|---|---|---|---|
#stopmacron | #syria | #kopftuch | #uyghurs | #islamfeindlichkeit |
#afghanistan | #gaza | #demohamburg | #uiguren | #savesilwan |
#hanau | #gazaunderattack | #rassismus | #kabul | #islamhass |
#palestine | #idlib | #freepalestine | #kopftuchverbot | #islamophobie |
#china | #ukraine | #coronavirus | #breaking | #palestineunderattack |
#ramadan | #palestinians | #corona | #cdu | #boycottfrance |
#india | #palestinewillbefree | #jerusalem | #einestages | #alaqsaunderattack |
#islam | #taliban | #israel | #palästina | #hijab |
#madeinchina | #afd | #savesheikhjarrah | #bds | #skandalunion |
#islamophobia | #uyghur | #france | #delhi | #xinjiang |
Analyzing the top 50 hashtags shows how salient grievance- and politics-related topics are within Generation Islam's communication. By constructing a network of co-occurrences among these hashtags, we can additionally uncover their interrelationships and potentially infer underlying topics.
# Let's do that for the 50 most used hashtags and create a network
tag_fcm <- fcm(tweet_dfm)
# Let's display this network
topgat_fcm <- fcm_select(tag_fcm, pattern = toptag)
set.seed(123)
textplot_network(topgat_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 3,
vertex_labelsize = 3.5, edge_color="grey60")
We can distinguish four clusters in the network: the Israel-Palestine conflict (hashtags adjacent to #palestine); contexts of political violence towards or involving Muslims (e.g. #syria, #afghanistan, #india); China and the Uyghurs (hashtags adjacent to #china and #uyghurs); and Islamophobia, racism, and the right wing (hashtags adjacent to #islamophobie). The latter cluster is especially interesting as it references issues, actors, and political actions. Issues include headscarf bans (#kopftuchverbot), racism (#rassismus and the far-right terrorist attack in #Hanau, Germany), and Islamophobia (#islamhass and #islamophobia). These issues are linked to actors such as France and its president Emmanuel Macron, the Christian Democratic Union (CDU), a conservative, centre-right party that has led the German federal government for most of the post-war period, and the Alternative for Germany (AfD), a German far-right and radical-right party. Political actions and demands, namely #boycottfrance and #stopmacron, also link to issues of racism and Islamophobia and tell us in what relations some actors and issues are seen.
Topic Modelling
Exploring co-occurrences helped me identify topics and revealed in what contexts words were embedded. However, this was largely based on my personal judgment about which word relationships form a coherent topic. Unsupervised learning methods like topic modeling allow us to outsource these inductive categorizations to a statistical framework. One of these frameworks is Latent Dirichlet Allocation (LDA), a "generative probabilistic model" with the underlying idea that "documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words"2. By projecting statistical assumptions onto text data and their latent thematic structure, clusters of co-occurring words can be modeled that ultimately represent topics. Several computational developments and adaptations build on that framework, one being Structural Topic Modeling, realized in the stm package in R. This package offers useful features such as using document-level data in the analysis and spectral initialization3. I use this package to run a simple topic model with spectral initialization on the Generation Islam data set, with the number of topics set to 10.
post_genislam <- genislam
# Strip emojis and pictographs that would otherwise clutter the tokens
emoji_chars <- c("🇵🇸", "🏽", "🏼", "✊", "🏻", "🤦", "♂️", "👨", "👏",
                 "🤲🏾", "🤲", "💔", " 👇", "🤷♀", "🤷", "♀", "🇨🇳")
for (e in emoji_chars) {
  post_genislam$text <- gsub(e, "", post_genislam$text, fixed = TRUE)
}
dfm_post <- corpus(post_genislam) %>%
tokens(remove_punct=T,
remove_numbers = T,
remove_url = T,
split_hyphens = T,
remove_symbols = T) %>%
tokens_remove(stopwords("english", source = "marimo")) %>%
tokens_remove(stopwords("de", source = "marimo")) %>%
tokens_remove(stopwords("ar", source = "stopwords-iso")) %>%
tokens_remove(c("à", "un", "la", "le", "en", "et", "été", "les", "avec")) %>%
tokens_remove(pattern = "^[\\p{script=Arab}]+$", valuetype = "regex") %>%
dfm() %>%
dfm_remove(pattern = c("*.tt", "*.uk", "*.com", "rt", "#*", "@*", ".de")) %>%
dfm_trim(max_termfreq = .99,termfreq_type = "quantile",verbose = T) %>%
dfm_trim(min_termfreq = .7,termfreq_type = "quantile",verbose = T) %>%
dfm_wordstem()
Then perform the topic modelling with 10 topics:
set.seed(1234)
dfm2stm <- convert(dfm_post, to = "stm")
topic.count <- 10
model.stm <- stm(dfm2stm$documents, dfm2stm$vocab,
K = topic.count, data = dfm2stm$meta,
init.type = "Spectral", verbose = FALSE)
Now we can create a table to show us the topics, their proportion in the dataset, and their most relevant or indicative words.
proportions_table <- make.dt(model.stm)
topic_table <- summarize_all(proportions_table, mean)
topic_long <- topic_table[,c(2:11)] %>%
pivot_longer(everything(), names_to = "Topic #", values_to = "Topic Proportions")
topics_label <- labelTopics(model.stm, n=10)
prob_word <- topics_label[["prob"]]
topic_long$`Top Words` <- ""
for (i in 1:10) {
topic_long[i,3] <- paste(prob_word[i,], collapse = ", ")
}
topic_long$`Topic #` <- gsub("Topic", "", topic_long$`Topic #`)
topic_long$`Topic Proportions` <- round(topic_long$`Topic Proportions`*100)
topic_long$`Topic Proportions` <- paste0(topic_long$`Topic Proportions`, "%")
knitr::kable(topic_long)
Topic # | Topic Proportions | Top Words |
---|---|---|
1 | 7% | destroy, uyghur, islam, genocid, twitter, muhammad, leader, prophet, across, kashmir |
2 | 14% | bomb, use, arrest, strike, tortur, account, happen, brother, kill, white |
3 | 22% | deutsch, israelisch, palästina, problem, angriff, palästinens, viel, macht, weiß, wert |
4 | 8% | protest, villag, stop, tri, face, yesterday, border, plan, part, settler |
5 | 5% | night, die, refuge, imam, attack, news, syrian, leav, insid, macron |
6 | 7% | watch, girl, go, want, know, human, right, group, jewish, support |
7 | 4% | famili, crime, intern, show, demolish, build, facebook, journalist, religion, accus |
8 | 19% | frankreich, moscheen, sei, politik, islamisch, zeit, eltern, gerad, berlin, corona |
9 | 6% | nazi, europa, artikel, millionen, kommt, sieht, wenig, hätten, eigentlich, eu |
10 | 8% | call, brutal, terrorist, dead, target, wear, saudi, ban, week, hijab |
It is incumbent on the researcher to determine the meaning of each topic, so the results have to be interpreted once again. A shared characteristic across all topics is, yet again, words that coincide with grievances, violence, and Islamophobia. The expected topic proportions represent the average probability of a topic being prevalent throughout the corpus of tweets. Those values are obtained by taking the mean of \(\theta\), the document-topic loadings (sometimes called \(\gamma\)), per topic. In layman's terms: on average, the probability of a tweet corresponding to Topic 3 is around 22%. To better understand what these topics consist of, let's look at Topics 4 and 10 in more detail. Top words are those that have the highest probability (\(\beta\)) of being associated with that topic4. Topic 4, with top words such as "settler", "border", and "protest", relates strongly to Palestine and Israel, while Topic 10, with words such as "hijab", "wear", and "ban", points to Islamophobia and headscarf bans. There is also another layer to that: Topics 8 and 9, referencing "Frankreich" (France), "Berlin", and "Europa" (Europe), connote a specifically domestic or European context. A contrast thus becomes evident in how issues in which Muslims are victimized are framed in Western societies versus global contexts. Western contexts often refer to particular right-wing politicians, racist discourses, and legislative decisions, while global contexts focus on government-initiated physical aggression, for example in the form of armed violence. Against the background that Generation Islam sees Muslim grievances as a systemic issue that can only be resolved by an Islamic caliphate, these national and global issues serve as testimonies affirming that existing political systems and their representatives are failing Muslims. Arguably, this framing makes their proposed solution more plausible to the Muslim audience it is tailored to.
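To ground such interpretations further, one could also read a few representative tweets per topic. Below is a sketch using stm's findThoughts(); it assumes that no documents were dropped in the dfm-to-stm conversion, otherwise the text vector would first have to be subset to the retained documents.
# Sketch (not run): display three tweets that load highly on Topic 4
# thoughts4 <- findThoughts(model.stm, texts = post_genislam$text, topics = 4, n = 3)
# plotQuote(thoughts4$docs[[1]], width = 60)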
Predictive Models
Naive Bayes
Previously, the analytical aim was to exemplify methods of text exploration through computational text analysis: common issues and topics were identified and interpreted. However, we can also classify text into already known categories by using text elements as predictors; this approach belongs to supervised learning. To exemplify it, I use whether or not a tweet belongs to Generation Islam as the target variable. Prediction models are applied, and the predictive power of text elements is evaluated. This lets us discuss whether, and how, Generation Islam's communication is distinctive. For instance, assuming that the language of Generation Islam is distinct from that of the ZMD, is it possible to predict which tweet belongs to whom based on the language used? One possible technique is to apply Bayes' probability theorem to text classification. The goal is to estimate the posterior probability \(P(A|B)\) of an outcome \(A\) occurring, conditional on evidence \(B\). This probability is derived from the likelihood of the evidence conditional on the outcome, \(P(B|A)\), times the prior probability of the outcome, \(P(A)\), divided by the prior probability of the evidence, \(P(B)\): \(P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)}\). This translates into a model that estimates the probability of a tweet belonging to an account conditional on its text: \(P(Account|Text) = \frac{P(Account) \cdot P(Text|Account)}{P(Text)}\).
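To make this concrete, here is a toy calculation with purely illustrative numbers that are not estimated from the data: suppose 65% of the tweets in the training data come from Generation Islam, a given word appears in 6% of their tweets, and in 4% of all tweets. Then \(P(Account|Text) = \frac{0.65 \cdot 0.06}{0.04} \approx 0.98\), so a tweet containing that word would almost certainly be assigned to Generation Islam. The actual classifier combines such evidence across all words in a tweet.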
This is called a Naive Bayes classifier and makes for a good entry into the world of predictive modeling and classification. I use this model to predict whether tweets belong to Generation Islam or to the ZMD. Such a task is called binary classification, as the outcome variable has two possible values. A conventional approach to evaluating a predictive model is splitting the data into training and test data sets. In this example, I split the data into one training and one test set of equal size. The training data are complete, while the test data have the account information removed. First, the Naive Bayes model is fitted to the training data set. The trained model then predicts which tweets belong to which account in the test data; this is referred to as an out-of-sample prediction. The account information is then reintroduced to the test data set, allowing the predictions to be evaluated against the actually observed data. A confusion matrix is a handy tool for this evaluation (see below). Most tweets were assigned to the correct account: the accuracy of the model is 0.8577, meaning around 86% of the tweets were correctly classified. The No Information Rate is 0.6541, meaning that if one simply assigned every tweet to the most frequent account, Generation Islam, one would be correct around 65% of the time. Sensitivity (the True Positive Rate) and Specificity (the True Negative Rate) are two further metrics for evaluating the model. The reported Sensitivity of about 91% indicates that the model identifies Generation Islam tweets rather well, while the Specificity of about 76% shows that it performs somewhat worse for the Central Council's tweets. Of course, this mirrors the unequal case numbers per account, which are higher for Generation Islam.
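As a quick sanity check, the reported accuracy follows directly from the diagonal of the confusion matrix shown in the output below: \(Accuracy = \frac{613 + 1388}{613 + 138 + 194 + 1388} = \frac{2001}{2333} \approx 0.8577\).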
key_df <- bind_rows(genislam, zmd)
key_df$screen_name <- factor(key_df$screen_name, levels = c("der_zmd", "genislam1"))
key_corp <- corpus(key_df)
key_dfm <- tokens(key_corp, remove_punct=T,
remove_numbers = T,
remove_url = T,
split_hyphens = T,
remove_symbols = T) %>%
tokens_remove(stopwords("english", source = "marimo")) %>%
tokens_remove(stopwords("de", source = "marimo")) %>%
tokens_remove(stopwords("ar", source = "stopwords-iso")) %>%
dfm() %>%
dfm_group(groups = screen_name) %>%
dfm_remove(pattern = c("*.tt", "*.uk", "*.com", "rt", "#*", "@*", ".de")) %>%
dfm_trim(max_termfreq = .99,termfreq_type = "quantile",verbose = T) %>%
dfm_trim(min_termfreq = .7,termfreq_type = "quantile",verbose = T)
key_df_small<- key_df
key_df_small<- key_df_small %>%
select(text, screen_name)
data_corpus <- corpus(key_df_small, text_field = "text")
set.seed(1234)
training_id <- sample(1:4666, 2333, replace = FALSE)
# Create docvar with ID
docvars(data_corpus, "id_numeric") <- 1:ndoc(data_corpus)
# Get training set
dfmat_training <- corpus_subset(data_corpus, id_numeric %in% training_id) %>%
tokens(remove_punct=T,
remove_numbers = T,
remove_url = T,
split_hyphens = T,
remove_symbols = T) %>%
tokens_remove(stopwords("english", source = "marimo")) %>%
tokens_remove(stopwords("de", source = "marimo")) %>%
tokens_remove(stopwords("ar", source = "stopwords-iso")) %>%
dfm() %>%
dfm_remove(pattern = c("*.tt", "*.uk", "*.com", "rt", "#*", "@*", ".de")) %>%
dfm_trim(max_termfreq = .99,termfreq_type = "quantile",verbose = T) %>%
dfm_trim(min_termfreq = .7,termfreq_type = "quantile",verbose = T)
# Get test set (documents not in training_id)
dfmat_test <-
corpus_subset(data_corpus,!id_numeric %in% training_id) %>%
tokens(remove_punct=T,
remove_numbers = T,
remove_url = T,
split_hyphens = T,
remove_symbols = T) %>%
tokens_remove(stopwords("english", source = "marimo")) %>%
tokens_remove(stopwords("de", source = "marimo")) %>%
tokens_remove(stopwords("ar", source = "stopwords-iso")) %>%
dfm() %>%
dfm_remove(pattern = c("*.tt", "*.uk", "*.com", "rt", "#*", "@*", ".de")) %>%
dfm_trim(max_termfreq = .99,termfreq_type = "quantile",verbose = T) %>%
dfm_trim(min_termfreq = .7,termfreq_type = "quantile",verbose = T)
# Train Naïve Bayes
model.NB <- textmodel_nb(dfmat_training, docvars(dfmat_training, "screen_name"), prior = "docfreq")
# The prior indicates an assumed distribution.
# Here we choose how frequently the categories occur in our data.
dfmat_matched <-
dfm_match(dfmat_test, features = featnames(dfmat_training))
actual_class <- docvars(dfmat_matched, "screen_name")
predicted_class <- predict(model.NB, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
confusion <- confusionMatrix(tab_class, mode = "everything", positive = "genislam1")
## Confusion Matrix and Statistics
##
## predicted_class
## actual_class der_zmd genislam1
## der_zmd 613 138
## genislam1 194 1388
##
## Accuracy : 0.8577
## 95% CI : (0.8429, 0.8716)
## No Information Rate : 0.6541
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6803
##
## Mcnemar's Test P-Value : 0.00254
##
## Sensitivity : 0.9096
## Specificity : 0.7596
## Pos Pred Value : 0.8774
## Neg Pred Value : 0.8162
## Precision : 0.8774
## Recall : 0.9096
## F1 : 0.8932
## Prevalence : 0.6541
## Detection Rate : 0.5949
## Detection Prevalence : 0.6781
## Balanced Accuracy : 0.8346
##
## 'Positive' Class : genislam1
##
With Naive Bayes, words take the role of variables. Hence, evaluating the coefficients gives insights into which words were relevant in predicting outcomes. Looking at the text elements that had the most impact on classifying a tweet as Generation Islam underpins the previous results (see below). Language differences are once more visible, and referring to grievances again appears to be a defining and distinctive element of their communication; words like "islamhasser" (Islam hater), among others, indicate that. With prior knowledge of Generation Islam, one can suspect that words like "westen" (the West) and "politiker" (politicians) relate to Western politics.
model_parm <- rownames_to_column(as.data.frame(model.NB[["param"]]), "screen_name")
model_parm_long <- model_parm %>%
pivot_longer(-screen_name)
model_parm_long$value <- format(model_parm_long$value , scientific = FALSE)
top_ten_parameters <- model_parm_long %>%
filter(screen_name == "genislam1") %>%
slice_max(value, n=10)
top_ten_parameters$value <- as.numeric(top_ten_parameters$value)
ggplot(top_ten_parameters, aes(y = value, x = reorder(name,value))) +
geom_col() +
coord_flip() +
xlab("Text elements") +
ylab("Coefficient") +
theme_minimal()
Glmnet and Random Forest
Entering the world of predictive modeling opens up many possibilities for analyzing textual data. One option is to create document-level variables from text data and use them in estimations. To exemplify this, I created a binary variable that is "TRUE" whenever a tweet contains a word (stem) referring to (Muslim) grievances and "FALSE" if it does not:
# Creating Dictionary
dict <- dictionary(list(middle_east = c("israel*", "palest*", "paläs*",
"jerusalem", "aqsa", "west bank",
"west-bank", "gaza", "aparth*", "antisemit*"),
other_grievances = c("china", "chines*", "uigur*", "uyghur*",
"camp*", "hindu*", "myanm*",
"rohing*", "bosni*", "srebrenica",
"hindutv*", "bomb*", "attack", "strike"),
defence = c("verteid*", "resist*", "defen*",
"widerstand*", "battl*", "wehren*",
"abwehr*", "oppos*"),
hijab = c("kopftuch*", "scarf*", "niqab*", "jilbab*",
"hijab*", "shador", "schador"),
racism_islamophobia = c("rassis*", "race*", "racis*",
"discrim*", "diskrim*", "nazi*", "*phob*"),
islam_general = c("dua", "quran", "koran", "hadith", "narration", "prophet*", "gesandter",
"rasul*", "sahaba*", "salaf", "kalif*", "mubarak", "eid", "deen")))
table_dict <- as.data.frame(matrix(ncol = 6, nrow = 14))
names(table_dict) <- c("middle_east", "other_grievances", "defence", "hijab",
"racism_islamophobia", "islam_general")
table_dict$middle_east <- c(dict[["middle_east"]], rep("", 14-length(dict[["middle_east"]])))
table_dict$other_grievances <- c(dict[["other_grievances"]], rep("", 14-length(dict[["other_grievances"]])))
table_dict$defence <- c(dict[["defence"]], rep("", 14-length(dict[["defence"]])))
table_dict$hijab <- c(dict[["hijab"]], rep("", 14-length(dict[["hijab"]])))
table_dict$racism_islamophobia <- c(dict[["racism_islamophobia"]], rep("", 14-length(dict[["racism_islamophobia"]])))
table_dict$islam_general <- c(dict[["islam_general"]], rep("", 14-length(dict[["islam_general"]])))
middle_east | other_grievances | defence | hijab | racism_islamophobia | islam_general |
---|---|---|---|---|---|
israel* | china | verteid* | kopftuch* | rassis* | dua |
palest* | chines* | resist* | scarf* | race* | quran |
paläs* | uigur* | defen* | niqab* | racis* | koran |
jerusalem | uyghur* | widerstand* | jilbab* | discrim* | hadith |
aqsa | camp* | battl* | hijab* | diskrim* | narration |
west bank | hindu* | wehren* | shador | nazi* | prophet* |
west-bank | myanm* | abwehr* | schador | *phob* | gesandter |
gaza | rohing* | oppos* | | | rasul* |
aparth* | bosni* | | | | sahaba* |
antisemit* | srebrenica | | | | salaf |
 | hindutv* | | | | kalif* |
 | bomb* | | | | mubarak |
 | attack | | | | eid |
 | strike | | | | deen |
As these words relate to grievances, I named the variable accordingly. I did so for a data set that combines the Twitter accounts of all the Muslim organizations introduced above, clustering all organizations except GI into the category "Other". Tweets that mention one of the dictionary words are clearly more prevalent in the Generation Islam corpus (GI: 40%, Other: 9%).
grievance <- c("israel", "palest", "paläs", "jerusalem", "aqsa", "west bank",
"west-bank", "gaza", "aparth", "antisemit","uigur", "uyghur",
"hindut", "myanm", "rohing", "srebrenica", "bomb", "attack",
"strike", "verteid", "resist", "defen", "widerstand", "battl",
"wehren", "abwehr", "oppos", "kopftuch", "scarf", "niqab",
"jilbab","hijab", "shador", "schador", "rassis", "race", "racis",
"discrim", "diskrim", "nazi", "phob")
all_islam$grievance <- grepl(paste(grievance,collapse="|"), all_islam$text, ignore.case = TRUE)
all_islam <- all_islam %>%
mutate(account = case_when(screen_name == "genislam1" ~ "genislam",
TRUE ~ "other"))
# Creating Table
all_islam %>%
group_by(account, grievance) %>%
summarise(n = n()) %>%
mutate(freq = paste0(round(n / sum(n)*100), "%"))
## # A tibble: 4 × 4
## # Groups: account [2]
## account grievance n freq
## <chr> <lgl> <int> <chr>
## 1 genislam FALSE 1864 60%
## 2 genislam TRUE 1268 40%
## 3 other FALSE 7566 91%
## 4 other TRUE 727 9%
With the grievance variable now being a logical rather than a character variable, it can be used as a predictor in machine learning frameworks without any further text processing. Here, I will concisely fit an Elastic-Net Regularized Generalized Linear Model (glmnet) and a Random Forest (rf) model through the R package caret.
The R package "glmnet" offers a Generalized Linear Model that strikes a compromise between ridge regression (\(\alpha=0\)) and lasso regression (\(\alpha=1\)) via an elastic-net penalty. Ridge regression penalizes the squared magnitude of the coefficients (an \(\ell_2\) penalty), whereas lasso regression penalizes their absolute magnitude (an \(\ell_1\) penalty), which can shrink coefficients all the way to zero. The strength of the penalty is controlled by \(\lambda\): \(\min_{\beta_0,\beta} \frac{1}{N} \sum_{i=1}^{N} w_i l(y_i,\beta_0+\beta^T x_i) + \lambda\left[(1-\alpha)\|\beta\|_2^2/2 + \alpha \|\beta\|_1\right]\)
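For orientation, the sketch below shows how the same kind of elastic-net model could be fitted directly with glmnet, using the document-level variables constructed earlier; it is only an illustration and not the tuning procedure actually used, which caret automates in the code further down.
# Sketch (not run): a lasso-penalized logistic regression fitted directly with glmnet
# x <- model.matrix(~ grievance + display_text_width + retweet_count + favorite_count,
#                   data = all_islam)[, -1]
# y <- factor(all_islam$account, levels = c("other", "genislam"))
# fit_lasso <- glmnet(x, y, family = "binomial", alpha = 1)
# coef(fit_lasso, s = 0.0001)  # coefficients at a small lambda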
Caret allows us to tune such a model by searching over a range of values for \(\alpha\) and \(\lambda\) in a k-fold cross-validation and identifying the model with the best performance. K-fold cross-validation means that the data are randomly split into \(k\) folds; the model is repeatedly trained on all but one fold and evaluated on the held-out fold, and performance is then summarized across all \(k\) folds. In addition to the grievance variable, I apply the Elastic-Net Regularized Generalized Linear Model to the following document-level predictors: the length of a tweet (display_text_width), the number of times a tweet was retweeted (retweet_count), and the number of times a tweet was marked as a favorite (favorite_count). I provide five values of \(\alpha\) between 0 and 1, twenty values of \(\lambda\) between 0.0001 and 1, and a fold number of \(k=10\). The best model performance was reached with \(\alpha=1\) (lasso regression) and \(\lambda=0.0001\), as determined by the "Area under the ROC Curve" (AUC). The Receiver Operating Characteristic (ROC) curve is a graphical tool for assessing the quality of a binary classifier by contrasting its true positive rate with its false positive rate. The AUC equals 1 when a perfect prediction of only true positives is achieved; an AUC of 0.5 indicates a random classifier, and 0 a perfectly wrong classifier. With these model parameters an AUC of \(0.8989949\) was achieved, which is quite good.
set.seed(1234)
account_y <- factor(all_islam$account, levels = c("other", "genislam"))
account_x <- select(all_islam, grievance,display_text_width,
retweet_count, favorite_count)
# Create custom indices: myFolds (for a 10-fold CV)
myFolds <- createFolds(account_y, k = 10)
# Create reusable trainControl object: myControl
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds
)
train <- data.frame(account_x, account_y)
#glmnet
model_glmnet <- train(account_y ~ ., data = train,
metric = "ROC",
method = "glmnet",
trControl = myControl,
tuneGrid = expand.grid(
alpha=seq(0,1,0.25), # 5 values of alpha
lambda=seq(0.0001, 1, length=20) # 20 values of lambda
)
)
## + Fold01: alpha=0.00, lambda=1
## - Fold01: alpha=0.00, lambda=1
## + Fold01: alpha=0.25, lambda=1
## - Fold01: alpha=0.25, lambda=1
## + Fold01: alpha=0.50, lambda=1
## - Fold01: alpha=0.50, lambda=1
## + Fold01: alpha=0.75, lambda=1
## - Fold01: alpha=0.75, lambda=1
## + Fold01: alpha=1.00, lambda=1
## - Fold01: alpha=1.00, lambda=1
## + Fold02: alpha=0.00, lambda=1
## - Fold02: alpha=0.00, lambda=1
## + Fold02: alpha=0.25, lambda=1
## - Fold02: alpha=0.25, lambda=1
## + Fold02: alpha=0.50, lambda=1
## - Fold02: alpha=0.50, lambda=1
## + Fold02: alpha=0.75, lambda=1
## - Fold02: alpha=0.75, lambda=1
## + Fold02: alpha=1.00, lambda=1
## - Fold02: alpha=1.00, lambda=1
## + Fold03: alpha=0.00, lambda=1
## - Fold03: alpha=0.00, lambda=1
## + Fold03: alpha=0.25, lambda=1
## - Fold03: alpha=0.25, lambda=1
## + Fold03: alpha=0.50, lambda=1
## - Fold03: alpha=0.50, lambda=1
## + Fold03: alpha=0.75, lambda=1
## - Fold03: alpha=0.75, lambda=1
## + Fold03: alpha=1.00, lambda=1
## - Fold03: alpha=1.00, lambda=1
## + Fold04: alpha=0.00, lambda=1
## - Fold04: alpha=0.00, lambda=1
## + Fold04: alpha=0.25, lambda=1
## - Fold04: alpha=0.25, lambda=1
## + Fold04: alpha=0.50, lambda=1
## - Fold04: alpha=0.50, lambda=1
## + Fold04: alpha=0.75, lambda=1
## - Fold04: alpha=0.75, lambda=1
## + Fold04: alpha=1.00, lambda=1
## - Fold04: alpha=1.00, lambda=1
## + Fold05: alpha=0.00, lambda=1
## - Fold05: alpha=0.00, lambda=1
## + Fold05: alpha=0.25, lambda=1
## - Fold05: alpha=0.25, lambda=1
## + Fold05: alpha=0.50, lambda=1
## - Fold05: alpha=0.50, lambda=1
## + Fold05: alpha=0.75, lambda=1
## - Fold05: alpha=0.75, lambda=1
## + Fold05: alpha=1.00, lambda=1
## - Fold05: alpha=1.00, lambda=1
## + Fold06: alpha=0.00, lambda=1
## - Fold06: alpha=0.00, lambda=1
## + Fold06: alpha=0.25, lambda=1
## - Fold06: alpha=0.25, lambda=1
## + Fold06: alpha=0.50, lambda=1
## - Fold06: alpha=0.50, lambda=1
## + Fold06: alpha=0.75, lambda=1
## - Fold06: alpha=0.75, lambda=1
## + Fold06: alpha=1.00, lambda=1
## - Fold06: alpha=1.00, lambda=1
## + Fold07: alpha=0.00, lambda=1
## - Fold07: alpha=0.00, lambda=1
## + Fold07: alpha=0.25, lambda=1
## - Fold07: alpha=0.25, lambda=1
## + Fold07: alpha=0.50, lambda=1
## - Fold07: alpha=0.50, lambda=1
## + Fold07: alpha=0.75, lambda=1
## - Fold07: alpha=0.75, lambda=1
## + Fold07: alpha=1.00, lambda=1
## - Fold07: alpha=1.00, lambda=1
## + Fold08: alpha=0.00, lambda=1
## - Fold08: alpha=0.00, lambda=1
## + Fold08: alpha=0.25, lambda=1
## - Fold08: alpha=0.25, lambda=1
## + Fold08: alpha=0.50, lambda=1
## - Fold08: alpha=0.50, lambda=1
## + Fold08: alpha=0.75, lambda=1
## - Fold08: alpha=0.75, lambda=1
## + Fold08: alpha=1.00, lambda=1
## - Fold08: alpha=1.00, lambda=1
## + Fold09: alpha=0.00, lambda=1
## - Fold09: alpha=0.00, lambda=1
## + Fold09: alpha=0.25, lambda=1
## - Fold09: alpha=0.25, lambda=1
## + Fold09: alpha=0.50, lambda=1
## - Fold09: alpha=0.50, lambda=1
## + Fold09: alpha=0.75, lambda=1
## - Fold09: alpha=0.75, lambda=1
## + Fold09: alpha=1.00, lambda=1
## - Fold09: alpha=1.00, lambda=1
## + Fold10: alpha=0.00, lambda=1
## - Fold10: alpha=0.00, lambda=1
## + Fold10: alpha=0.25, lambda=1
## - Fold10: alpha=0.25, lambda=1
## + Fold10: alpha=0.50, lambda=1
## - Fold10: alpha=0.50, lambda=1
## + Fold10: alpha=0.75, lambda=1
## - Fold10: alpha=0.75, lambda=1
## + Fold10: alpha=1.00, lambda=1
## - Fold10: alpha=1.00, lambda=1
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 1e-04 on full training set
## [1] 0.8989949
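The AUC value printed above can be read off the fitted caret object; the exact call that produced it is not part of the original output, but it corresponds to something like:
model_glmnet$bestTune                        # selected alpha and lambda
max(model_glmnet$results$ROC, na.rm = TRUE)  # cross-validated AUC of the best model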
The model coefficient for the grievance variable is positive, indicating that it is positively associated with Generation Islam (see below). For one, this may certainly be due to the specific language used by the groups in question, but perhaps also because other, politically more established, Muslim associations and organizations do not denounce these issues with the same intensity. Following this argument, there appears to be a gap in public communication about these issues, which certain organizations are likely filling. Future research can investigate whether this is a statistical artifact of my case selection or a substantive pattern.
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.463980324
## grievanceTRUE 1.705930216
## display_text_width -0.004707908
## retweet_count 0.005208284
## favorite_count 0.057276707
To improve my predictions, I compare the glmnet model with a Random Forest (rf) model to see which one performs better. When used in caret, this model is called from the R package ranger. In layman's terms, an rf model builds many decision trees in which a random subset of variables is considered at each split; such ensembles are usually robust and accurate. Similar to glmnet, one can specify a range of hyperparameters and choose the best-performing model. Those parameters include the number of variables to choose from at each split (\(m_{try}\)), the minimum size of terminal nodes, and the split rule, which defines how nodes within a tree are further divided. The best model performance was achieved with \(m_{try}=2\), a minimal node size of 1, and the "extremely randomized trees" split rule. The rf model achieved an AUC value of \(0.9317046\), scoring better than the glmnet model. The figure below shows that the rf model also had denser, more homogeneous estimates with fewer outliers, and it performed better in all folds.
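As a side note, the hyperparameter grid that caret explores for ranger by default could also be spelled out explicitly. The sketch below mirrors the combinations visible in the training log that follows; it is not the code actually run.
# Sketch (not run): an explicit tuning grid for the ranger model
# rf_grid <- expand.grid(mtry = 2:4,
#                        splitrule = c("gini", "extratrees"),
#                        min.node.size = 1)
# model_rf <- train(account_y ~ ., data = train, metric = "ROC",
#                   method = "ranger", trControl = myControl, tuneGrid = rf_grid)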
## Optimizing predictions with random forest
set.seed(1234)
#random forest
model_rf <- train(account_y ~ ., data = train,
metric = "ROC",
method = "ranger",
trControl = myControl
)
## + Fold01: mtry=2, min.node.size=1, splitrule=gini
## - Fold01: mtry=2, min.node.size=1, splitrule=gini
## + Fold01: mtry=3, min.node.size=1, splitrule=gini
## - Fold01: mtry=3, min.node.size=1, splitrule=gini
## + Fold01: mtry=4, min.node.size=1, splitrule=gini
## - Fold01: mtry=4, min.node.size=1, splitrule=gini
## + Fold01: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold01: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold01: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold01: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold01: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold01: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold02: mtry=2, min.node.size=1, splitrule=gini
## - Fold02: mtry=2, min.node.size=1, splitrule=gini
## + Fold02: mtry=3, min.node.size=1, splitrule=gini
## - Fold02: mtry=3, min.node.size=1, splitrule=gini
## + Fold02: mtry=4, min.node.size=1, splitrule=gini
## - Fold02: mtry=4, min.node.size=1, splitrule=gini
## + Fold02: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold02: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold02: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold02: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold02: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold02: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold03: mtry=2, min.node.size=1, splitrule=gini
## - Fold03: mtry=2, min.node.size=1, splitrule=gini
## + Fold03: mtry=3, min.node.size=1, splitrule=gini
## - Fold03: mtry=3, min.node.size=1, splitrule=gini
## + Fold03: mtry=4, min.node.size=1, splitrule=gini
## - Fold03: mtry=4, min.node.size=1, splitrule=gini
## + Fold03: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold03: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold03: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold03: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold03: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold03: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold04: mtry=2, min.node.size=1, splitrule=gini
## - Fold04: mtry=2, min.node.size=1, splitrule=gini
## + Fold04: mtry=3, min.node.size=1, splitrule=gini
## - Fold04: mtry=3, min.node.size=1, splitrule=gini
## + Fold04: mtry=4, min.node.size=1, splitrule=gini
## - Fold04: mtry=4, min.node.size=1, splitrule=gini
## + Fold04: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold04: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold04: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold04: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold04: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold04: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold05: mtry=2, min.node.size=1, splitrule=gini
## - Fold05: mtry=2, min.node.size=1, splitrule=gini
## + Fold05: mtry=3, min.node.size=1, splitrule=gini
## - Fold05: mtry=3, min.node.size=1, splitrule=gini
## + Fold05: mtry=4, min.node.size=1, splitrule=gini
## - Fold05: mtry=4, min.node.size=1, splitrule=gini
## + Fold05: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold05: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold05: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold05: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold05: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold05: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold06: mtry=2, min.node.size=1, splitrule=gini
## - Fold06: mtry=2, min.node.size=1, splitrule=gini
## + Fold06: mtry=3, min.node.size=1, splitrule=gini
## - Fold06: mtry=3, min.node.size=1, splitrule=gini
## + Fold06: mtry=4, min.node.size=1, splitrule=gini
## - Fold06: mtry=4, min.node.size=1, splitrule=gini
## + Fold06: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold06: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold06: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold06: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold06: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold06: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold07: mtry=2, min.node.size=1, splitrule=gini
## - Fold07: mtry=2, min.node.size=1, splitrule=gini
## + Fold07: mtry=3, min.node.size=1, splitrule=gini
## - Fold07: mtry=3, min.node.size=1, splitrule=gini
## + Fold07: mtry=4, min.node.size=1, splitrule=gini
## - Fold07: mtry=4, min.node.size=1, splitrule=gini
## + Fold07: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold07: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold07: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold07: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold07: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold07: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold08: mtry=2, min.node.size=1, splitrule=gini
## - Fold08: mtry=2, min.node.size=1, splitrule=gini
## + Fold08: mtry=3, min.node.size=1, splitrule=gini
## - Fold08: mtry=3, min.node.size=1, splitrule=gini
## + Fold08: mtry=4, min.node.size=1, splitrule=gini
## - Fold08: mtry=4, min.node.size=1, splitrule=gini
## + Fold08: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold08: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold08: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold08: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold08: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold08: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold09: mtry=2, min.node.size=1, splitrule=gini
## - Fold09: mtry=2, min.node.size=1, splitrule=gini
## + Fold09: mtry=3, min.node.size=1, splitrule=gini
## - Fold09: mtry=3, min.node.size=1, splitrule=gini
## + Fold09: mtry=4, min.node.size=1, splitrule=gini
## - Fold09: mtry=4, min.node.size=1, splitrule=gini
## + Fold09: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold09: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold09: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold09: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold09: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold09: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold10: mtry=2, min.node.size=1, splitrule=gini
## - Fold10: mtry=2, min.node.size=1, splitrule=gini
## + Fold10: mtry=3, min.node.size=1, splitrule=gini
## - Fold10: mtry=3, min.node.size=1, splitrule=gini
## + Fold10: mtry=4, min.node.size=1, splitrule=gini
## - Fold10: mtry=4, min.node.size=1, splitrule=gini
## + Fold10: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold10: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold10: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold10: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold10: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold10: mtry=4, min.node.size=1, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 2, splitrule = extratrees, min.node.size = 1 on full training set
# Create model_list
model_list <- list(glmnet = model_glmnet, rf = model_rf)
# Pass model_list to resamples()
resamples <- resamples(model_list)
# Summarize the results
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: glmnet, rf
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.7583141 0.8970773 0.9195290 0.8989949 0.9237299 0.9313372 0
## rf 0.9272811 0.9298911 0.9317303 0.9317046 0.9334955 0.9360645 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.9119775 0.9710916 0.9756163 0.9698142 0.9764513 0.9856645 0
## rf 0.9380946 0.9505962 0.9547160 0.9533073 0.9595017 0.9628885 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmnet 0.4102200 0.5572898 0.6502306 0.6056797 0.6716034 0.6997871 0
## rf 0.7236609 0.7359879 0.7514636 0.7504619 0.7621497 0.7818375 0
# RF performs best, also outlier folds (see spikes and near zero values)
bwtheme <- standard.theme("pdf", color=FALSE)
bwtheme$par.main.text <- list(font = 2,
just = "left",
x = grid::unit(5, "mm"))
glm_plot1 <- bwplot(resamples, metric="ROC", par.settings=bwtheme, main="A")
glm_plot1[["xlab"]] <- "AUC"
glm_plot2 <- dotplot(resamples, metric = "ROC", par.settings=bwtheme, main="B")
glm_plot2[["xlab"]] <- "AUC"
glm_plot3 <- xyplot(resamples, metric="ROC", par.settings=bwtheme)
glm_plot3[["xlab"]] <- "AUC"
glm_plot3[["main"]] <- "C"
glm_plot4 <- densityplot(resamples, metric="ROC", ylab="Density", par.settings=bwtheme, main="D")
glm_plot4[["xlab"]] <- "AUC"
# Creating Figure
grid.arrange(glm_plot1,glm_plot2, glm_plot3, glm_plot4, ncol=2, bottom ="All values display AUC (ROC)")
But it is not necessary to decide between glmnet and rf. The caretEnsemble package allows us to blend ("stack") models via a generalized linear model, possibly improving predictive performance. I use a stack of the glmnet and rf models in an attempt to predict whether tweets belong to Generation Islam or to the other Muslim organizations (see below). Overall, the accuracy of the model blend is roughly five percentage points higher than that of the Naive Bayes classification (0.9037 versus 0.8577). However, the stacked model is mainly good at identifying tweets from other organizations (Specificity: 0.9635); it performs worse than the Naive Bayes model in detecting tweets belonging to GI (Sensitivity: 0.7436 versus 0.9096). Again, this one-sided prediction accuracy most likely reflects the unequal case numbers across the outcome categories. In defense of the stacked model, it should be noted that it used far fewer predictors, whereas the Naive Bayes model used an abundance of words as variables, making it more susceptible to overfitting.
set.seed(1234)
training_id_new <- sample(1:11425, 5712, replace = FALSE)
train_new <- train[training_id_new,]
test_new <- train[-training_id_new,]
myFolds <- createFolds(train_new$account_y, k = 10)
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds
)
models <- caretList(account_y ~ ., data=train_new, trControl = myControl, methodList=c("glmnet", "ranger"))
## + Fold01: alpha=0.10, lambda=0.0329
## - Fold01: alpha=0.10, lambda=0.0329
## + Fold01: alpha=0.55, lambda=0.0329
## - Fold01: alpha=0.55, lambda=0.0329
## + Fold01: alpha=1.00, lambda=0.0329
## - Fold01: alpha=1.00, lambda=0.0329
## + Fold02: alpha=0.10, lambda=0.0329
## - Fold02: alpha=0.10, lambda=0.0329
## + Fold02: alpha=0.55, lambda=0.0329
## - Fold02: alpha=0.55, lambda=0.0329
## + Fold02: alpha=1.00, lambda=0.0329
## - Fold02: alpha=1.00, lambda=0.0329
## + Fold03: alpha=0.10, lambda=0.0329
## - Fold03: alpha=0.10, lambda=0.0329
## + Fold03: alpha=0.55, lambda=0.0329
## - Fold03: alpha=0.55, lambda=0.0329
## + Fold03: alpha=1.00, lambda=0.0329
## - Fold03: alpha=1.00, lambda=0.0329
## + Fold04: alpha=0.10, lambda=0.0329
## - Fold04: alpha=0.10, lambda=0.0329
## + Fold04: alpha=0.55, lambda=0.0329
## - Fold04: alpha=0.55, lambda=0.0329
## + Fold04: alpha=1.00, lambda=0.0329
## - Fold04: alpha=1.00, lambda=0.0329
## + Fold05: alpha=0.10, lambda=0.0329
## - Fold05: alpha=0.10, lambda=0.0329
## + Fold05: alpha=0.55, lambda=0.0329
## - Fold05: alpha=0.55, lambda=0.0329
## + Fold05: alpha=1.00, lambda=0.0329
## - Fold05: alpha=1.00, lambda=0.0329
## + Fold06: alpha=0.10, lambda=0.0329
## - Fold06: alpha=0.10, lambda=0.0329
## + Fold06: alpha=0.55, lambda=0.0329
## - Fold06: alpha=0.55, lambda=0.0329
## + Fold06: alpha=1.00, lambda=0.0329
## - Fold06: alpha=1.00, lambda=0.0329
## + Fold07: alpha=0.10, lambda=0.0329
## - Fold07: alpha=0.10, lambda=0.0329
## + Fold07: alpha=0.55, lambda=0.0329
## - Fold07: alpha=0.55, lambda=0.0329
## + Fold07: alpha=1.00, lambda=0.0329
## - Fold07: alpha=1.00, lambda=0.0329
## + Fold08: alpha=0.10, lambda=0.0329
## - Fold08: alpha=0.10, lambda=0.0329
## + Fold08: alpha=0.55, lambda=0.0329
## - Fold08: alpha=0.55, lambda=0.0329
## + Fold08: alpha=1.00, lambda=0.0329
## - Fold08: alpha=1.00, lambda=0.0329
## + Fold09: alpha=0.10, lambda=0.0329
## - Fold09: alpha=0.10, lambda=0.0329
## + Fold09: alpha=0.55, lambda=0.0329
## - Fold09: alpha=0.55, lambda=0.0329
## + Fold09: alpha=1.00, lambda=0.0329
## - Fold09: alpha=1.00, lambda=0.0329
## + Fold10: alpha=0.10, lambda=0.0329
## - Fold10: alpha=0.10, lambda=0.0329
## + Fold10: alpha=0.55, lambda=0.0329
## - Fold10: alpha=0.55, lambda=0.0329
## + Fold10: alpha=1.00, lambda=0.0329
## - Fold10: alpha=1.00, lambda=0.0329
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0.000329 on full training set
## + Fold01: mtry=2, min.node.size=1, splitrule=gini
## - Fold01: mtry=2, min.node.size=1, splitrule=gini
## + Fold01: mtry=3, min.node.size=1, splitrule=gini
## - Fold01: mtry=3, min.node.size=1, splitrule=gini
## + Fold01: mtry=4, min.node.size=1, splitrule=gini
## - Fold01: mtry=4, min.node.size=1, splitrule=gini
## + Fold01: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold01: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold01: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold01: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold01: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold01: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold02: mtry=2, min.node.size=1, splitrule=gini
## - Fold02: mtry=2, min.node.size=1, splitrule=gini
## + Fold02: mtry=3, min.node.size=1, splitrule=gini
## - Fold02: mtry=3, min.node.size=1, splitrule=gini
## + Fold02: mtry=4, min.node.size=1, splitrule=gini
## - Fold02: mtry=4, min.node.size=1, splitrule=gini
## + Fold02: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold02: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold02: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold02: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold02: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold02: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold03: mtry=2, min.node.size=1, splitrule=gini
## - Fold03: mtry=2, min.node.size=1, splitrule=gini
## + Fold03: mtry=3, min.node.size=1, splitrule=gini
## - Fold03: mtry=3, min.node.size=1, splitrule=gini
## + Fold03: mtry=4, min.node.size=1, splitrule=gini
## - Fold03: mtry=4, min.node.size=1, splitrule=gini
## + Fold03: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold03: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold03: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold03: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold03: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold03: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold04: mtry=2, min.node.size=1, splitrule=gini
## - Fold04: mtry=2, min.node.size=1, splitrule=gini
## + Fold04: mtry=3, min.node.size=1, splitrule=gini
## - Fold04: mtry=3, min.node.size=1, splitrule=gini
## + Fold04: mtry=4, min.node.size=1, splitrule=gini
## - Fold04: mtry=4, min.node.size=1, splitrule=gini
## + Fold04: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold04: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold04: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold04: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold04: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold04: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold05: mtry=2, min.node.size=1, splitrule=gini
## - Fold05: mtry=2, min.node.size=1, splitrule=gini
## + Fold05: mtry=3, min.node.size=1, splitrule=gini
## - Fold05: mtry=3, min.node.size=1, splitrule=gini
## + Fold05: mtry=4, min.node.size=1, splitrule=gini
## - Fold05: mtry=4, min.node.size=1, splitrule=gini
## + Fold05: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold05: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold05: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold05: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold05: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold05: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold06: mtry=2, min.node.size=1, splitrule=gini
## - Fold06: mtry=2, min.node.size=1, splitrule=gini
## + Fold06: mtry=3, min.node.size=1, splitrule=gini
## - Fold06: mtry=3, min.node.size=1, splitrule=gini
## + Fold06: mtry=4, min.node.size=1, splitrule=gini
## - Fold06: mtry=4, min.node.size=1, splitrule=gini
## + Fold06: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold06: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold06: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold06: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold06: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold06: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold07: mtry=2, min.node.size=1, splitrule=gini
## - Fold07: mtry=2, min.node.size=1, splitrule=gini
## + Fold07: mtry=3, min.node.size=1, splitrule=gini
## - Fold07: mtry=3, min.node.size=1, splitrule=gini
## + Fold07: mtry=4, min.node.size=1, splitrule=gini
## - Fold07: mtry=4, min.node.size=1, splitrule=gini
## + Fold07: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold07: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold07: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold07: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold07: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold07: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold08: mtry=2, min.node.size=1, splitrule=gini
## - Fold08: mtry=2, min.node.size=1, splitrule=gini
## + Fold08: mtry=3, min.node.size=1, splitrule=gini
## - Fold08: mtry=3, min.node.size=1, splitrule=gini
## + Fold08: mtry=4, min.node.size=1, splitrule=gini
## - Fold08: mtry=4, min.node.size=1, splitrule=gini
## + Fold08: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold08: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold08: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold08: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold08: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold08: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold09: mtry=2, min.node.size=1, splitrule=gini
## - Fold09: mtry=2, min.node.size=1, splitrule=gini
## + Fold09: mtry=3, min.node.size=1, splitrule=gini
## - Fold09: mtry=3, min.node.size=1, splitrule=gini
## + Fold09: mtry=4, min.node.size=1, splitrule=gini
## - Fold09: mtry=4, min.node.size=1, splitrule=gini
## + Fold09: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold09: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold09: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold09: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold09: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold09: mtry=4, min.node.size=1, splitrule=extratrees
## + Fold10: mtry=2, min.node.size=1, splitrule=gini
## - Fold10: mtry=2, min.node.size=1, splitrule=gini
## + Fold10: mtry=3, min.node.size=1, splitrule=gini
## - Fold10: mtry=3, min.node.size=1, splitrule=gini
## + Fold10: mtry=4, min.node.size=1, splitrule=gini
## - Fold10: mtry=4, min.node.size=1, splitrule=gini
## + Fold10: mtry=2, min.node.size=1, splitrule=extratrees
## - Fold10: mtry=2, min.node.size=1, splitrule=extratrees
## + Fold10: mtry=3, min.node.size=1, splitrule=extratrees
## - Fold10: mtry=3, min.node.size=1, splitrule=extratrees
## + Fold10: mtry=4, min.node.size=1, splitrule=extratrees
## - Fold10: mtry=4, min.node.size=1, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 2, splitrule = gini, min.node.size = 1 on full training set
# Create ensemble model: stack
stack <- caretStack(models, method="glm")
test_new$pred <- predict(stack, newdata=test_new, level = 0.95)
confusionMatrix(data = test_new$pred, reference = test_new$account_y, positive = "genislam")
## Confusion Matrix and Statistics
##
## Reference
## Prediction other genislam
## other 4009 398
## genislam 152 1154
##
## Accuracy : 0.9037
## 95% CI : (0.8958, 0.9113)
## No Information Rate : 0.7283
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.744
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7436
## Specificity : 0.9635
## Pos Pred Value : 0.8836
## Neg Pred Value : 0.9097
## Prevalence : 0.2717
## Detection Rate : 0.2020
## Detection Prevalence : 0.2286
## Balanced Accuracy : 0.8535
##
## 'Positive' Class : genislam
##
Discussion
In this post, central topics of Generation Islam's communication have been revealed. These topics mainly refer to contexts in which Muslims are the victims of grievances. It also became evident who is associated with these grievances, namely elements of politics and the media. In the ideological framework of this group, these are manifestations of failing political systems and confirmation that only a caliphate could finally eradicate these predicaments. Moreover, on the basis of this very topic setting, it was possible to predict whether tweets belonged to Generation Islam or not: Muslim grievances turned out to be a predictor positively associated with the group. Generation Islam sets itself apart from other Muslim organizations by addressing and occupying the topic of grievances much more. This distinction prompts us to reflect on whether we adequately engage in discussing and condemning these issues, as failing to do so may hinder the cultivation of broader solidarity within society. By neglecting these concerns, we risk creating an environment where individuals feel marginalized and may be driven towards radicalization, perceiving larger segments of society as complicit in an oppressive system against their group. It urges us to ponder the importance of addressing and resolving grievances for the sake of inclusivity and societal harmony.
In the case of the IGMG, the Twitter data of its youth organization was used, as the youth organization tweets in Turkish far less than the parent organization. Certainly, youth organizations communicate differently than their parent organizations, but by including them we can capture the ideological setting of the organization as a whole without including a language that I am not able to analyze.↩︎
Roberts et al., 2019 report that using spectral initialization consistently produces the best results.↩︎
The Git repository provides code to produce the top 20 words with the highest probability for each topic.↩︎