Textometry

Classes and topics

Lise Vaudor

ISIG, UMR 5600 EVS

2024-10-01

Finding classes, finding topics

Searching for classes or topics in a corpus of texts means identifying a number of underlying themes or subjects, which show up in the texts through the preferential use of certain words. In other words, classes, themes, or topics are defined and labelled as lists of words.

Floods corpus (wp_floods)

Consider the following corpus, made up of Wikipedia articles (in various languages) about flood events. Where necessary, the text was translated into English with Google Translate.

head(wp_floods %>% select(-textt))
flood flood_label article lang title translated_title
wd:Q14628797 2013 China–Russia floods https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods en 2013 China–Russia floods 2013 China–Russia floods
wd:Q14628797 2013 China–Russia floods https://fr.wikipedia.org/wiki/Inondations_de_2013_en_Chine_et_en_Russie fr Inondations de 2013 en Chine et en Russie 2013 floods in China and Russia
wd:Q14628797 2013 China–Russia floods https://id.wikipedia.org/wiki/Banjir_Tiongkok_dan_Rusia_2013 id Banjir Tiongkok dan Rusia 2013 2013 Chinese and Russian floods
wd:Q14628797 2013 China–Russia floods https://tl.wikipedia.org/wiki/2013_Pagbaha_sa_Tsina_at_Rusya tl 2013 Pagbaha sa Tsina at Rusya 2013 Flooding in China and Russia
wd:Q14628797 2013 China–Russia floods https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) ru Наводнения на Дальнем Востоке России и в Китае (2013) Floods in the Russian Far East and China (2013)
wd:Q119109771 2023 Haiti floods https://en.wikipedia.org/wiki/2023_Haiti_floods en 2023 Haiti floods 2023 Haiti floods

The wp_floods table thus contains several metadata columns and a “textt” column holding the text of the articles (translated into English). Here, for example, is the first of these texts:

wp_floods$textt[1]
[1] ":84 dead, 105 missing, 840,000 displaced\n\nDuring mid-August 2013 parts of eastern Russia and northeastern China were stricken by heavy flooding.  At least 85 people died from the floods and more than 105 others were left missing as of August 19.[4] More than 60,000 homes were destroyed and 840,000 people evacuated from Heilongjiang, Jilin, and Liaoning provinces due to flooding which happened at the same time as flooding in China's southern Guangdong province.[5][6]\nFrom the end of July to mid-August 2013, unusually heavy rainfall occurred near the Amur River, which marks the dividing line between China and Russia.  Starting on August 10, 2013, areas of northeastern China began to experience flooding.[7] From August 15 to 17, heavy rainfall worsened the problem, causing the worst flooding in the region in more than a decade.[4][8]  Nankouqian Township, one of the hardest-hit areas, saw 44.9 centimetres (17.7 in) of rain, half the average annual total, on August 16 alone.[9]  By August 18, water levels at 61 reservoirs surpassed the \"danger\" level.  Fushun city in the Liaoning province was especially hard hit as rainstorms caused several rivers in the city to overflow.[8]  Across the border in eastern Russia, heavy flooding was also reported, with Amur Oblast, Jewish Autonomous Oblast and Khabarovsk Krai the hardest hit.[7] More than 140 towns were affected by what Russia authorities described as the worst flooding in 120 years.[10] The Amur River reached a record 100.56 metres (329.9 ft), surpassing the previous record set in 1984, and was still rising as of August 19, threatening to flood the major city of Komsomolsk-on-Amur.[9][10]\nIn China, more than 60,000 homes were destroyed and numerous roads were blocked or damaged.  More than 787,000 hectares of farm land were ruined in the region which depends heavily on farming.[4]  Power and communications lines were downed in several townships.[8] Total damaged was estimated at 16.14 billion yuan (approx. 
US$2.6 billion/€1.94 billion).[4]  In Russia, 3.2 billion rubles (approx. US$97 million/€73 million) was allocated for relief efforts.[7]\nChina's Liaoning province was the hardest hit with 54 reported deaths and 97 people missing as of August 19, 2013.[11]  In Jilin province, 16 deaths were reported. Heilongjiang province experienced 11 deaths.[9]  Across the region 360,000 people were displaced and 3.74 million affected in some way.[4]  No casualties were reported in Russia, but 20,000 people were evacuated.  Two captive brown bears were rescued via helicopter.[10]\nUnrelated flooding resulting from Typhoon Utor in south China occurred simultaneously, causing the death tolls from the two floods to be combined in official reports.  Typhoon Utor floods killed at least 33 people.[11]\nIn Russia, more than 30,000 volunteers helped distribute 53 tons of food and supplies to flood victims.  Officials are considering delaying the mayoral elections in Amur which are scheduled for September 8.  A decision on the elections will be made August 27.[10]\nChinese Communist Party general secretary Xi Jinping called for \"all-out efforts\" as relief work got underway.[4] More than 120,000 people, including 10,000 soldiers, helped with relief and rescue efforts.[9]"

Reinert classification: rainette and quanteda

To run a Reinert classification we use the rainette package, which itself works on corpora in the quanteda format (quanteda being an R package dedicated to text analysis).

Reinert classification: quanteda corpus

Let us turn our table into a quanteda corpus and split each article into segments (segment_size = 100):

library(quanteda)
library(rainette)  # provides split_segments()
floods_corpus <- corpus(wp_floods, docid_field = "article", text_field = "textt")
floods_corpus <- split_segments(floods_corpus, segment_size = 100)

We carry out tokenization and corpus cleaning with the quanteda package (so as to keep the same format):

tok <- tokens(floods_corpus, remove_punct = TRUE, remove_numbers = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
tok <- tokens_tolower(tok)

Reinert classification: dfm

To continue with the classification, we convert the tok object into a document-feature matrix:

dtm <- dfm(tok)
dim(dtm)
[1] 18023 53115
dtm[1:5,1:5]
Document-feature matrix of: 5 documents, 5 features (72.00% sparse) and 6 docvars.
                                                                  features
docs                                                               dead missing displaced mid-august parts
  https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods_1    1       2         1          2     1
  https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods_2    0       0         0          0     0
  https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods_3    0       0         0          0     0
  https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods_4    0       1         1          0     0
  https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods_5    0       0         0          0     0

The matrix is very large, so we trim it before continuing the analysis:

dtm_trimmed <- dfm_trim(dtm, min_docfreq = 30)
dim(dtm_trimmed)
[1] 18023  3651

Here we removed all terms that appeared in fewer than 30 documents (here, segments), which leaves 3651 of the 53115 features.
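To make the document-frequency rule concrete, here is a small base-R sketch on a made-up count matrix. The matrix m, its values, and the threshold of 2 are illustrative assumptions, not taken from the corpus; the code mimics the filtering rule described above rather than calling dfm_trim() itself.

```r
# Toy document-term counts (illustrative values): 3 documents x 3 terms
m <- matrix(c(1, 0, 2,
              0, 0, 1,
              3, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("flood", "dam", "rain")))

# Document frequency: number of documents in which each term occurs at least once
doc_freq <- colSums(m > 0)

# Keep only the terms found in at least min_docfreq documents
min_docfreq <- 2
m_trimmed <- m[, doc_freq >= min_docfreq, drop = FALSE]

colnames(m_trimmed)  # "dam" occurs in a single document and is dropped
```

Note that the rule counts documents containing the term, not the total number of occurrences: "flood" appears 4 times overall but is kept because it is present in 2 documents.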

Reinert classification: run

We now run the classification itself:

library(rainette)
res=rainette(dtm_trimmed,k=10,min_split_members=10)

Reinert classification: explore

The results of this classification can be explored through an interactive app:

rainette_explor(res,dtm,floods_corpus)

Screenshot of the rainette exploration app

Reinert classification: tree

To retrieve the figure showing the classification tree:

rainette_plot(
  res, dtm,
  k = 6,
  n_terms = 20
)

Reinert classification: classes of the segments

The class of each document segment can be retrieved as follows:

docvars(floods_corpus) %>% 
  mutate(class=paste0("class_",cutree_rainette(res, k = 6))) %>%  
  select(flood_label,translated_title,segment_source,class) %>% 
  head(30)
flood_label translated_title segment_source class
2013 China–Russia floods 2013 China–Russia floods https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods class_5
2013 China–Russia floods 2013 China–Russia floods https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods class_5
2013 China–Russia floods 2013 China–Russia floods https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods class_5
2013 China–Russia floods 2013 China–Russia floods https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods class_5
2013 China–Russia floods 2013 China–Russia floods https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods class_2
2013 China–Russia floods 2013 floods in China and Russia https://fr.wikipedia.org/wiki/Inondations_de_2013_en_Chine_et_en_Russie class_5
2013 China–Russia floods 2013 floods in China and Russia https://fr.wikipedia.org/wiki/Inondations_de_2013_en_Chine_et_en_Russie class_3
2013 China–Russia floods 2013 floods in China and Russia https://fr.wikipedia.org/wiki/Inondations_de_2013_en_Chine_et_en_Russie class_3
2013 China–Russia floods 2013 floods in China and Russia https://fr.wikipedia.org/wiki/Inondations_de_2013_en_Chine_et_en_Russie class_5
2013 China–Russia floods 2013 Chinese and Russian floods https://id.wikipedia.org/wiki/Banjir_Tiongkok_dan_Rusia_2013 class_5
2013 China–Russia floods 2013 Flooding in China and Russia https://tl.wikipedia.org/wiki/2013_Pagbaha_sa_Tsina_at_Rusya class_5
2013 China–Russia floods 2013 Flooding in China and Russia https://tl.wikipedia.org/wiki/2013_Pagbaha_sa_Tsina_at_Rusya class_5
2013 China–Russia floods 2013 Flooding in China and Russia https://tl.wikipedia.org/wiki/2013_Pagbaha_sa_Tsina_at_Rusya class_3
2013 China–Russia floods 2013 Flooding in China and Russia https://tl.wikipedia.org/wiki/2013_Pagbaha_sa_Tsina_at_Rusya class_5
2013 China–Russia floods 2013 Flooding in China and Russia https://tl.wikipedia.org/wiki/2013_Pagbaha_sa_Tsina_at_Rusya class_2
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_4
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_4
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_4
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_3
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_4
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_4
2013 China–Russia floods Floods in the Russian Far East and China (2013) https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) class_4

Reinert classification: classes of the documents

Results

The rainette package also provides functions for quickly computing a few statistics derived from the classification:

floods_corpus$class=paste0("class_",cutree_rainette(res, k = 6))
clusters_by_doc_table(floods_corpus, clust_var = "class") %>% 
  head()
doc_id class_1 class_2 class_3 class_4 class_5 class_6 class_NA
https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods 0 1 0 0 4 0 0
https://fr.wikipedia.org/wiki/Inondations_de_2013_en_Chine_et_en_Russie 0 0 2 0 2 0 0
https://id.wikipedia.org/wiki/Banjir_Tiongkok_dan_Rusia_2013 0 0 0 0 1 0 0
https://tl.wikipedia.org/wiki/2013_Pagbaha_sa_Tsina_at_Rusya 0 1 1 0 3 0 0
https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) 0 0 9 8 0 0 0
https://en.wikipedia.org/wiki/2023_Haiti_floods 0 0 3 1 0 0 0
clusters_by_doc_table(floods_corpus, clust_var = "class", prop = TRUE) %>% 
  head()
doc_id class_1 class_2 class_3 class_4 class_5 class_6 class_NA
https://en.wikipedia.org/wiki/2013_China%E2%80%93Russia_floods 0 20 0.00000 0.00000 80 0 0
https://fr.wikipedia.org/wiki/Inondations_de_2013_en_Chine_et_en_Russie 0 0 50.00000 0.00000 50 0 0
https://id.wikipedia.org/wiki/Banjir_Tiongkok_dan_Rusia_2013 0 0 0.00000 0.00000 100 0 0
https://tl.wikipedia.org/wiki/2013_Pagbaha_sa_Tsina_at_Rusya 0 20 20.00000 0.00000 60 0 0
https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D0%B2%D0%BE%D0%B4%D0%BD%D0%B5%D0%BD%D0%B8%D1%8F_%D0%BD%D0%B0_%D0%94%D0%B0%D0%BB%D1%8C%D0%BD%D0%B5%D0%BC_%D0%92%D0%BE%D1%81%D1%82%D0%BE%D0%BA%D0%B5_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8_%D0%B8_%D0%B2_%D0%9A%D0%B8%D1%82%D0%B0%D0%B5_(2013) 0 0 52.94118 47.05882 0 0 0
https://en.wikipedia.org/wiki/2023_Haiti_floods 0 0 75.00000 25.00000 0 0 0

STM: what is it?

There are several methods and algorithms for finding classes and/or topics. In this next part I focus on Structural Topic Modelling (STM) and on how to run it with the R package stm.

The package vignette, which details the method (algorithm) and its implementation in R, is a useful reference.

Julia Silge's explanations of the method and of its implementation with the tidyverse and tidytext may also be useful…

STM: principles

We write

  • \(K\) for the total number of topics
  • \(P\) for the total number of words
  • \(N\) for the total number of documents

Scores \(\beta\) (beta)

We write \(\beta_{k,p}\) for the conditional probability of a word \(p\) given a topic \(k\).

The scores \(\beta_{k,p}\) can therefore be represented as a \(K \times P\) matrix (or table): topics x distinct words in the corpus.

So, to keep things simple, we write these coefficients \(\beta_{k,p}\), but if you want to “visualize” them in order to follow what comes next, keep in mind that they are actually laid out as follows:

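\[
\begin{array}{c|ccccc}
 & w_1 & \cdots & w_p & \cdots & w_P \\
\hline
T_1 & & & & & \\
\vdots & & & & & \\
T_k & & & \beta_{k,p} & & \\
\vdots & & & & & \\
T_K & & & & &
\end{array}
\]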

FREX

\(FREX\): FRequency and EXclusivity

The score \(FREX_{k,p}\) of a word \(p\) for a topic \(k\) is:

\[FREX_{k,p}=\left(\frac{w}{F}+\frac{1-w}{E}\right)^{-1}\]

  • \(w\) is a weight (chosen between 0 and 1) that sets how much of the \(FREX\) score comes from frequency (\(F\)) and how much from exclusivity (\(E\))
  • \(F\) is the frequency score, which reflects the probability of the word given the topic.
  • \(E\) is the exclusivity score, which reflects the probability of the topic given the word.
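The formula above is a weighted harmonic mean, which can be sketched in base R as follows. The toy beta matrix, the helper name frex_sketch, and the choice of turning raw values into within-topic empirical-CDF ranks for \(F\) and \(E\) are assumptions made for illustration; this follows the notation of this section (with \(w\) weighting \(F\)) and is not meant as a copy of what the stm package computes internally.

```r
# Toy beta matrix (illustrative values): K = 2 topics (rows), P = 4 words (columns)
beta <- matrix(c(0.4, 0.3, 0.2, 0.1,
                 0.1, 0.2, 0.3, 0.4),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("T1", "T2"), c("w1", "w2", "w3", "w4")))

# Hypothetical helper: FREX as a weighted harmonic mean of two scores in (0, 1]
frex_sketch <- function(beta, w = 0.5) {
  # Exclusivity: share of a word's total probability mass held by each topic
  excl <- sweep(beta, 2, colSums(beta), "/")
  # Rank-transform values within each topic using the empirical CDF
  rank_within_topic <- function(m) {
    out <- t(apply(m, 1, function(x) ecdf(x)(x)))
    dimnames(out) <- dimnames(m)
    out
  }
  freq_score <- rank_within_topic(beta)  # F in the formula
  excl_score <- rank_within_topic(excl)  # E in the formula
  1 / (w / freq_score + (1 - w) / excl_score)
}

frex <- frex_sketch(beta, w = 0.5)

# Word with the highest FREX score in each topic
apply(frex, 1, function(x) names(which.max(x)))
```

Because the harmonic mean is dominated by the smaller of its two terms, a word only gets a high FREX score if it is both frequent in the topic and exclusive to it.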

Steps of the STM

The STM relies on generating documents according to the following model:

  1. a distribution over topics, \(\theta_k\), is drawn at random for each document. If desired, metadata can be used so that the draw is centred on means tied to these metadata (i.e. the metadata are assumed to steer a document towards certain topics).
  2. for each document, the values \(\beta_{k,p}\) are computed (so there are N matrices to compute, one per document)
  3. for each word in each document,
    1. a topic \(k\) is drawn according to the distribution obtained in step 1
    2. given this topic \(k\), a word is drawn at random using the values \(\beta_{k,p}\)

Thus, starting from \(\theta_k\) and from \(N\) documents, \(N\) new documents are generated.

This is iterated until the model converges.
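The generative steps above can be illustrated by simulating a single document in base R. This is a deliberately simplified sketch: the vocabulary, the fixed value of theta, and the random beta matrix are all made-up assumptions, and no model fitting (nor any use of metadata) takes place here.

```r
set.seed(42)

K <- 3                                   # number of topics
vocab <- c("rain", "river", "dam", "storm", "sea")
P <- length(vocab)                       # number of distinct words
n_words <- 20                            # length of the simulated document

# Step 1 (simplified): topic proportions theta for one document, fixed by hand
theta <- c(0.7, 0.2, 0.1)

# Step 2 (simplified): one K x P matrix of word probabilities, each row sums to 1
beta <- matrix(runif(K * P), nrow = K)
beta <- beta / rowSums(beta)

# Step 3: for each word slot, draw a topic, then draw a word from that topic
topics <- sample(seq_len(K), size = n_words, replace = TRUE, prob = theta)
words <- vapply(topics,
                function(k) sample(vocab, size = 1, prob = beta[k, ]),
                character(1))

table(words)  # word counts of the simulated document
```

Fitting the model amounts to going the other way: finding the theta and beta values under which documents generated like this resemble the observed ones.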

STM: implementation

Unlike the Reinert classification method, the STM algorithm can work on a table of terms tokenized with tidytext (and lemmatized through a join with Lexique382).

We therefore start again from a wp_words table:

wp_words %>% 
  head(20)
flood lemma ndoc n
wd:Q100293305 absent 5 1
wd:Q100293305 absorb 50 1
wd:Q100293305 abutment 6 1
wd:Q100293305 access 107 2
wd:Q100293305 accompany 74 1
wd:Q100293305 account 110 4
wd:Q100293305 accumulation 60 2
wd:Q100293305 achieve 35 2
wd:Q100293305 acre 72 3
wd:Q100293305 action 114 1
wd:Q100293305 activate 51 1
wd:Q100293305 activity 111 4
wd:Q100293305 add 143 2
wd:Q100293305 addition 256 3
wd:Q100293305 adjust 20 1
wd:Q100293305 admiral 3 1
wd:Q100293305 admission 2 2
wd:Q100293305 advance 73 3
wd:Q100293305 advantage 18 1
wd:Q100293305 adverse 18 1

STM: implementation

Let us consider another way of representing the textual information contained in the wp_words table. We count the number of occurrences of each lemma (lemma) in each document (flood) and store this information as an \(N \times p\) matrix (number of documents x total number of distinct lemmas). Beforehand, we nevertheless filter out the words that are rare at the scale of the corpus (here, we keep only those whose frequency is >20).

This is a kind of lexical table for a partition of the corpus in which each part is a document.

The resulting matrix is said to be “sparse” because many of its cells are equal to 0 (i.e. many words are found in only a few documents).

The cast_sparse() function of the tidytext package makes this reshaping very easy:

library(tidytext)  # provides cast_sparse()
tib_sparse <- wp_words %>% 
  cast_sparse(row = flood, column = lemma, value = n)

dim(tib_sparse)
[1]  288 9337

tib_sparse is indeed a matrix with N=288 rows (the number of documents) and p=9337 columns (the number of lemmas whose frequency in the corpus is >20).
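To see what this reshaping does, here is a base-R sketch on a three-row toy table. The table toy_words and its values are made up for illustration, and xtabs() returns a dense contingency table, whereas cast_sparse() returns a sparse matrix; only the layout (documents in rows, lemmas in columns, counts in cells) is the same.

```r
# Toy long-format table (illustrative values): one row per (document, lemma) pair
toy_words <- data.frame(
  flood = c("A", "A", "B"),
  lemma = c("rain", "dam", "rain"),
  n     = c(2, 1, 4)
)

# Cross-tabulate: documents in rows, lemmas in columns, counts in cells
toy_matrix <- xtabs(n ~ flood + lemma, data = toy_words)
toy_matrix

# Pairs absent from the long table become zeros in the matrix
toy_matrix["B", "dam"]
```

With thousands of lemmas and a few hundred documents, most cells are such zeros, which is why a sparse storage format is preferable on the real data.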

STM: fitting the model

Let us load the stm package and run the STM algorithm on our table, asking for 6 distinct topics:

library(stm)
set.seed(123)

topic_model<-stm(tib_sparse,K=6, verbose=FALSE)

The STM is fitted iteratively. Here it took about a hundred iterations for the model to converge. To give an order of magnitude, it took a few minutes on my machine to obtain the result.

stm objects come with a summary() method that displays the topics identified:

summary(topic_model)
A topic model with 6 topics, 288 documents and a 9337 word dictionary.
Topic 1 Top Words:
     Highest Prob: person, affect, area, state, rain, million, cause 
     FREX: monsoon, displace, relief, brazil, food, crore, aid 
     Lift: khan, bazaar, cotton, dengue, leishmaniasis, recyclable, sugarcane 
     Score: monsoon, rainfall, subcontinent, lakh, brazil, person, leone 
Topic 2 Top Words:
     Highest Prob: water, dam, level, area, river, damage, affect 
     FREX: spa, attack, reservoir, explosion, seine, occupy, civilian 
     Lift: atrocity, columnist, contaminant, fairytale, seismologist, cube, brutality 
     Score: spa, counteroffensive, offensive, ecocide, dam, invasion, occupier 
Topic 3 Top Words:
     Highest Prob: rain, heavy, area, damage, person, city, issue 
     FREX: prefecture, advisory, issue, si, japan, warning, lift 
     Lift: avocado, backyard, blueberry, broaden, chop, cinematographer, decelerate 
     Score: gun, rain, prefecture, si, advisory, heavy, rainfall 
Topic 4 Top Words:
     Highest Prob: storm, sea, water, area, north, dike, damage 
     FREX: surge, tide, sea, storm, glacier, hurricane, hamburg 
     Lift: churchyard, auk, damp, firth, dirk, dolphin, loo 
     Score: storm, dike, glacier, hamburg, sea, surge, coast 
Topic 5 Top Words:
     Highest Prob: river, water, city, area, level, person, damage 
     FREX: china, fork, levee, river, republic, creek, grand 
     Lift: notch, rightist, bolivar, springtime, tunica, banker, barbecue 
     Score: river, fork, nationalist, china, rainfall, county, levee 
Topic 6 Top Words:
     Highest Prob: mud, day, person, molasses, year, disaster, story 
     FREX: molasses, narrative, shun, bible, genesis, whiskey, beer 
     Lift: polite, bocce, honey, sulfide, scholarship, tooth, barmaid 
     Score: molasses, genesis, beer, narrative, myth, shun, ark 

STM: topic terms

Each topic is characterized by its preferred terms (according to several different metrics: highest probability, i.e. \(\beta\), FREX, lift, score)… Here we extract the ten terms with the highest \(\beta\) in each topic:

termes_thematiques=tidy(topic_model, matrix="beta") %>% 
  group_by(topic) %>% 
  slice_max(beta,n=10) %>%  
  mutate(rank=row_number()) %>% 
  arrange(topic,desc(beta)) %>% 
  ungroup()
termes_thematiques
topic term beta rank
1 person 0.0307711 1
1 affect 0.0177151 2
1 area 0.0133653 3
1 state 0.0117848 4
1 rain 0.0117704 5
1 million 0.0111208 6
1 cause 0.0103842 7
1 damage 0.0102625 8
1 district 0.0100147 9
1 water 0.0092498 10
2 water 0.0263526 1
2 dam 0.0203013 2
2 level 0.0103180 3
2 area 0.0100758 4
2 river 0.0086220 5
2 damage 0.0069382 6
2 affect 0.0064064 7
2 reservoir 0.0060844 8
2 state 0.0057915 9
2 time 0.0056648 10
3 rain 0.0334037 1
3 heavy 0.0261140 2
3 area 0.0135341 3
3 damage 0.0129837 4
3 person 0.0110027 5
3 city 0.0101019 6
3 issue 0.0098850 7
3 due 0.0098362 8
3 road 0.0089823 9
3 river 0.0088680 10
4 storm 0.0305818 1
4 sea 0.0206151 2
4 water 0.0171280 3
4 area 0.0138419 4
4 north 0.0123516 5
4 dike 0.0118501 6
4 damage 0.0101652 7
4 coast 0.0101393 8
4 high 0.0095027 9
4 person 0.0093093 10
5 river 0.0418897 1
5 water 0.0211502 2
5 city 0.0130863 3
5 area 0.0124568 4
5 level 0.0114315 5
5 person 0.0107296 6
5 damage 0.0103097 7
5 cause 0.0095566 8
5 dam 0.0094053 9
5 high 0.0091541 10
6 mud 0.0135858 1
6 day 0.0088853 2
6 person 0.0075898 3
6 molasses 0.0070277 4
6 year 0.0062129 5
6 disaster 0.0054503 6
6 story 0.0054070 7
6 time 0.0053257 8
6 company 0.0051778 9
6 god 0.0051365 10

STM: plot of the terms by topic

ggplot(termes_thematiques  %>%
         mutate(topic=as.factor(topic)) %>%
         mutate(term=reorder_within(term,by=beta,within=topic)),
       aes(x=beta,y=term, fill=topic))+
  geom_bar(stat="identity")+
    facet_wrap(facets=vars(topic), scales="free")+
    theme(legend.position="none")+
  scale_y_reordered()

STM: assigning a topic to documents

While the “beta” matrix gives the probability of a term within a topic, we are also interested in the probability that a document belongs to a topic. This probability is given by the gamma matrix:

tib_gamma <- tidy(topic_model, matrix = "gamma") %>% 
  arrange(document,desc(gamma))
head(tib_gamma, n=15)
document topic gamma
1 1 0.4312574
1 3 0.4169888
1 4 0.1471479
1 2 0.0030456
1 5 0.0014555
1 6 0.0001048
2 1 0.3588930
2 2 0.2367904
2 3 0.2254461
2 5 0.1671774
2 4 0.0095235
2 6 0.0021696
3 5 0.6311370
3 3 0.2353388
3 6 0.0491835
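One simple way to assign a single topic to each document is to keep, for every document, the topic with the largest gamma. The base-R sketch below does this on a small table that reuses the gamma values printed above for documents 1 and 2; the object names tib_gamma_toy and dominant are illustrative assumptions, and picking the maximum is only one possible assignment rule.

```r
# Gamma values for documents 1 and 2, copied from the output above
tib_gamma_toy <- data.frame(
  document = rep(1:2, each = 6),
  topic    = c(1, 3, 4, 2, 5, 6,
               1, 2, 3, 5, 4, 6),
  gamma    = c(0.4312574, 0.4169888, 0.1471479, 0.0030456, 0.0014555, 0.0001048,
               0.3588930, 0.2367904, 0.2254461, 0.1671774, 0.0095235, 0.0021696)
)

# For each document, keep the row with the highest gamma
dominant <- do.call(
  rbind,
  lapply(split(tib_gamma_toy, tib_gamma_toy$document),
         function(d) d[which.max(d$gamma), ])
)
dominant
```

In document 1 the two leading topics are nearly tied (gamma of about 0.43 for topic 1 and 0.42 for topic 3), so a hard assignment hides the fact that the document is a mixture; keeping the full gamma distribution is often more informative.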