Hi, in a lecture at my university I saw a package called Tidytext... I understood how it works, but I can't think of any practical use for it. Could someone give me an example of how we could take advantage of it in everyday problems? Thanks!
tidytext is a package that aims to support text analysis in a general way, and therefore has countless uses (the most important ones can be found in the main package vignette, as pointed out by @MacusNunes). Among the text-analysis possibilities implemented in tidytext, I would highlight the following:
# install.packages("devtools")
# devtools::install_github("tomasbarcellos/valorrr")
library(valorrr)
sessao <- html_session("http://www.valor.com.br/")
links <- links_pagina(sessao)
# First 20 news articles
noticias <- ler_noticia(sessao, links[1:20])
We now have the text of the first 20 news articles from the newspaper Valor Econômico.
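At this point, noticias should be a data frame with (at least) the columns titulo (headline) and texto (article body), which are the only two used below. A quick sanity check, assuming that structure:
dplyr::glimpse(noticias)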
library(tidytext)
library(dplyr)
library(stringr)
noticias_tidy <- noticias %>%
  select(titulo, texto) %>%
  unnest_tokens(word, texto)  # tokenize: one row per word
stop_port <- get_stopwords(language = "pt")  # Portuguese stop word list
noticias_tidy %>%
  anti_join(stop_port) %>%  # drop rows whose word is a stop word
  count(word, sort = TRUE)
Joining, by = "word"
# A tibble: 2,069 x 2
word n
<chr> <int>
1 r 55
2 é 52
3 bilhões 44
4 governo 37
5 caminhoneiros 28
6 paulo 27
7 ônibus 26
8 diesel 24
9 petrobras 24
10 presidente 23
# ... with 2,059 more rows
Without reading any of the news, we can already see that the newspaper is currently focused on the truck drivers' strike and fuel pricing policy.
Note: the tokens "r" and "é" appear because we did not clean the data, in order to keep this example simple.
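One simple way to clean this up, for example, would be to drop single-character tokens, since both leftovers have length one:
noticias_tidy %>%
  anti_join(stop_port, by = "word") %>%
  filter(nchar(word) > 1) %>%  # drops single-character tokens such as "r" and "é"
  count(word, sort = TRUE)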
Using bigrams makes this conclusion even more obvious:
# Regex that matches any stop word as a whole word
# (note the double backslashes: "\\b" is the word-boundary anchor)
regex_stop <- paste0("\\b", stop_port$word, "\\b", collapse = "|")
noticias_bigram <- noticias %>%
  select(titulo, texto) %>%
  mutate(texto = str_remove_all(texto, regex_stop)) %>%  # strip stop words first
  unnest_tokens(word, texto, "ngrams", n = 2)  # tokenize into bigrams
noticias_bigram %>% count(word, sort = TRUE)
# A tibble: 4,365 x 2
word n
<chr> <int>
1 são paulo 26
2 quinta feira 13
3 pis cofins 10
4 greve caminhoneiros 8
5 preço diesel 8
6 15 dias 7
7 desta quinta 7
8 nesta quinta 7
9 além disso 6
10 capital paulista 6
# ... with 4,355 more rows
Once we have the text structured in tidy format, the sky is the limit. From here we could, for example, create term-document matrices to feed a model that predicts the author of a text, or visualize word usage in a word cloud, etc.
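For example, here is a minimal sketch of both ideas, assuming the noticias_tidy and stop_port objects created above (cast_dtm requires the tm package, and the cloud uses the wordcloud package):
# install.packages(c("tm", "wordcloud"))  # if needed
library(wordcloud)

# Term-document matrix, one document per headline
dtm <- noticias_tidy %>%
  count(titulo, word) %>%
  cast_dtm(document = titulo, term = word, value = n)

# Word cloud of the most frequent words, with stop words removed
contagem <- noticias_tidy %>%
  anti_join(stop_port, by = "word") %>%
  count(word, sort = TRUE)
wordcloud(words = contagem$word, freq = contagem$n, max.words = 50)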