This is a very verbose code documentation of a talk I gave at the Social Science Data Lab at the Mannheimer Zentrum für Europäische Sozialforschung (MZES), at the invitation of Christiane Grill. Thanks for having me!

Overview of this talk

  1. Why quanteda?
  2. Using quanteda
  3. Applying dictionaries
  4. Unsupervised machine learning
  5. Supervised machine learning
  6. Closing remarks

Why quanteda?

Kenneth Benoit, creator of quanteda

Using quanteda

Most analyses with quanteda consist of three steps (see the sketch below):

  1. Import the data
  2. Build a corpus
  3. Calculate a DFM

Model of a DTM.
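
Before loading any data, here is what this pipeline looks like in its smallest form. The sketch uses quanteda's built-in corpus of US presidential inaugural addresses (data_corpus_inaugural), so steps 1 and 2 are already taken care of; it only illustrates the shape of the workflow and is not part of the Sherlock example that follows.

library(quanteda)
inaugural.tokens <- tokens(data_corpus_inaugural)   # the built-in data already is a corpus (steps 1-2)
inaugural.dfm <- dfm(inaugural.tokens)              # step 3: the document-feature matrix
inaugural.dfm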

Using quanteda: Reading data

library(readtext)
library(quanteda)
library(tidyverse)   # str_sub(), the pipe and ggplot2 are used throughout

sherlock <- readtext("data/sherlock/novels/[0-9]*.txt")            # read the numbered .txt files
sherlock$doc_id <- str_sub(sherlock$doc_id, start = 4, end = -5)   # drop the numeric prefix and ".txt"
mycorpus <- corpus(sherlock, docid_field = "doc_id")
docvars(mycorpus, "Textno") <- sprintf("%02d", 1:ndoc(mycorpus))   # running text number as a docvar
mycorpus
## Corpus consisting of 12 documents and 1 docvar.

Using quanteda: Generating corpus statistics

mycorpus.stats <- summary(mycorpus)   # types, tokens and sentences per document
mycorpus.stats$Text <- reorder(mycorpus.stats$Text, 1:ndoc(mycorpus), order = T)   # keep the original document order
mycorpus.stats
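
The statistics table is handy for quick descriptive checks. As a small sketch, the token counts per novel can be plotted directly from it (this assumes the Text and Tokens columns that summary() returns for a corpus):

ggplot(mycorpus.stats, aes(Text, Tokens)) +   # one bar per novel
  geom_col() +
  coord_flip() +
  xlab("") + ylab("Tokens")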

Using quanteda: What makes DFMs nifty

Things to remember about DFMs:

  1. Each row is a document and each column a feature (usually a word), with cells holding counts.
  2. Word order and sentence structure are discarded; a DFM is a bag-of-words representation.
  3. DFMs are typically very sparse (most cells are zero), which is why quanteda stores them as sparse matrices.
  4. Most later steps, including the dictionary and machine learning approaches below, operate on the DFM rather than on the raw corpus.
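
The bag-of-words point is easiest to see on a toy example: the two invented sentences below contain the same words in a different order, so their rows in the DFM are identical.

toy.corpus <- corpus(c(d1 = "Holmes saw the dog",
                       d2 = "the dog saw Holmes"))
dfm(tokens(toy.corpus))   # identical counts in both rows: word order is gone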

Using quanteda: Calculating a DFM (1)

# build the DFM, dropping numbers, punctuation, symbols and English stopwords
mydfm <- dfm(mycorpus, remove_numbers = TRUE, remove_punct = TRUE,
             remove_symbols = TRUE, remove = stopwords("english"))
mydfm
## Document-feature matrix of: 12 documents, 8,489 features (79.1% sparse).
# show the 12 documents and the 10 overall most frequent features
head(dfm_sort(mydfm, decreasing = TRUE, margin = "both"), n = 12, nf = 10)
## Document-feature matrix of: 12 documents, 10 features (0.0% sparse).
## 12 x 10 sparse Matrix of class "dfm"
##                                        features
## docs                                    said upon holmes one man mr little
##   The Adventure of the Speckled Band      44   41     55  33  11  5     17
##   The Adventure of the Copper Beeches     47   33     42  36  34 44     37
##   The Boscombe Valley Mystery             37   42     43  31  41 24     25
##   The Man with the Twisted Lip            28   54     28  36  30 20     21
##   The Adventure of the Beryl Coronet      45   33     26  32  27 20     22
##   The Red-headed League                   51   50     51  29  25 55     25
##   A Scandal in Bohemia                    33   25     47  27  23  9     14
##   The Adventure of the Engineer's Thumb   47   38     12  33  17 11     25
##   The Adventure of the Noble Bachelor     33   29     34  31  10 17     26
##   The Adventure of the Blue Carbuncle     43   38     34  38  37 17     24
##   The Five Orange Pips                    32   47     25  29  19  3      5
##   A Case of Identity                      45   35     46  17  16 50     28
##                                        features
## docs                                    now see may
##   The Adventure of the Speckled Band     21  22  19
##   The Adventure of the Copper Beeches    18  17  21
##   The Boscombe Valley Mystery            16  24  19
##   The Man with the Twisted Lip           27  18  15
##   The Adventure of the Beryl Coronet     29  20  25
##   The Red-headed League                  14  23   8
##   A Scandal in Bohemia                   17  15  21
##   The Adventure of the Engineer's Thumb  16  16   9
##   The Adventure of the Noble Bachelor    16  16  18
##   The Adventure of the Blue Carbuncle    33  27   7
##   The Five Orange Pips                   12  16  24
##   A Case of Identity                     15  15  11
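
Two helpers that fit in here: topfeatures() returns the most frequent features without printing the whole matrix, and dfm_trim() drops rare features, which reduces the sparsity of the DFM. A short sketch (the threshold of 5 is arbitrary, and the argument is called min_termfreq in recent quanteda versions; older versions used min_count):

topfeatures(mydfm, 10)                            # ten most frequent features overall
mydfm.trim <- dfm_trim(mydfm, min_termfreq = 5)   # keep features occurring at least 5 times
mydfm.trim                                        # fewer features, lower sparsity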

Using quanteda: Calculating a DFM (2)

load("data/euspeech/euspeech.korpus.RData")
korpus.euspeech
## Corpus consisting of 17,505 documents and 10 docvars.
mydfm.eu <- dfm(korpus.euspeech, groups = "Typ")         # one row per value of the docvar "Typ"
mydfm.eu.prop <- dfm_weight(mydfm.eu, scheme = "prop")   # counts as proportions of each row total
head(dfm_sort(mydfm.eu.prop, decreasing = TRUE, margin = "both"), nf = 8)   # 8 most frequent features
## Document-feature matrix of: 2 documents, 8 features (0.0% sparse).
## 2 x 8 sparse Matrix of class "dfm"
##            features
## docs           european        also     countri        need        year
##   Regierung 0.006177315 0.007775821 0.008114498 0.003875806 0.005878028
##   EU        0.010959076 0.006602713 0.004304388 0.006419974 0.004401859
##            features
## docs              europ      polici      govern
##   Regierung 0.003943332 0.002321008 0.006384211
##   EU        0.005781216 0.006610397 0.002485433
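
Since the grouped DFM has only two rows, a natural follow-up is a keyness comparison: which features does one group use disproportionately often compared to the other? A sketch using textstat_keyness() (the default measure is chi-squared; in quanteda 3 and later the function lives in the companion package quanteda.textstats):

keyness.eu <- textstat_keyness(mydfm.eu, target = "EU")   # EU actors vs. Regierung
head(keyness.eu, 10)                                      # features most distinctive for the EU group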

Applying dictionaries: Defining an ad-hoc dictionary

populism.liberalism.dict <- dictionary(list(
  populism = c("elit*", "consensus*", "undemocratic*", "referend*", "corrupt*",
               "propagand", "politici*", "*deceit*", "*deceiv*", "*betray*",
               "shame*", "scandal*", "truth*", "dishonest*", "establishm*", "ruling*"),
  liberalism = c("liber*", "free*", "indiv*", "open*", "law*", "rules", "order",
                 "rights", "trade", "global", "inter*", "trans*", "minori*",
                 "exchange", "market*")))
populism.liberalism.dict
## Dictionary object with 2 key entries.
## - [populism]:
##   - elit*, consensus*, undemocratic*, referend*, corrupt*, propagand, politici*, *deceit*, *deceiv*, *betray*, shame*, scandal*, truth*, dishonest*, establishm*, ruling*
## - [liberalism]:
##   - liber*, free*, indiv*, open*, law*, rules, order, rights, trade, global, inter*, trans*, minori*, exchange, market*
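
Before turning to the EUspeech corpus, it helps to see what a dictionary lookup does mechanically: dfm_lookup() collapses the feature columns of a DFM into one column per dictionary key, counting every feature that matches one of the glob patterns. A toy sketch with two invented sentences:

toy.texts <- dfm(tokens(c(t1 = "The corrupt elites betrayed the people",
                          t2 = "Free trade and open markets need global rules")))
dfm_lookup(toy.texts, dictionary = populism.liberalism.dict)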

Applying dictionaries: Applying an ad-hoc dictionary

# apply the dictionary: features are collapsed into the keys "populism" and "liberalism"
mydfm.eu <- dfm(korpus.euspeech, dictionary = populism.liberalism.dict)
mydfm.eu.prop <- dfm_weight(mydfm.eu, scheme = "prop")   # relative share of the two keys per document
# merge with the corpus statistics; keep documents with length >= 1200 and at least one dictionary hit
eu.poplib <- convert(mydfm.eu.prop, "data.frame") %>% 
  bind_cols(korpus.euspeech.stats) %>% 
  filter(length >= 1200, populism > 0 | liberalism > 0)
ggplot(eu.poplib, aes(country, populism)) + 
  geom_boxplot(outlier.size = 0) + 
  geom_jitter(aes(country, populism), position = position_jitter(width = 0.4, height = 0), 
              alpha = 0.1, size = 0.2, show.legend = F) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + 
  xlab("") + ylab("Populism share") + 
  ggtitle("Populism share in the EUspeech corpus based on our ad-hoc dictionary (%)")
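
The boxplot can be complemented with a plain numerical summary, for instance the mean populism share per country (a sketch; it relies on the country variable contributed by korpus.euspeech.stats above):

eu.poplib %>% 
  group_by(country) %>% 
  summarise(mean.populism = mean(populism)) %>% 
  arrange(desc(mean.populism))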