Example of a practical use for tidytext

1

Today, in a lecture at my university, I saw a package called tidytext. I understood how it works, but I cannot think of any use for it. Could someone give me an example of how we would take advantage of it in everyday problems? Thanks!

    
asked by anonymous 24.05.2018 / 01:39

1 answer

5

tidytext is a package that aims to provide tools for text analysis in general, and therefore has 1001 uses (the most important ones can be found in the package's main vignette, as pointed out by @MacusNunes). Among the text-analysis tasks that tidytext supports, I would highlight:

  • Term frequency
  • Term-document matrix (TDM)
  • Term frequency-inverse document frequency (tf-idf)
  • Sentiment analysis
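
To make the tf-idf item a bit more concrete, here is a minimal sketch on a toy corpus. The `docs` data frame and its contents are made up for illustration and are not part of the news data used below; `bind_tf_idf()` is the tidytext function that adds the `tf`, `idf` and `tf_idf` columns.

```r
library(dplyr)
library(tidytext)

# Toy corpus (hypothetical data, only for illustration)
docs <- tibble(
  doc  = c("a", "b"),
  text = c("diesel price rises again", "bus strike hits the city again")
)

docs_tf_idf <- docs %>%
  unnest_tokens(word, text) %>%    # one row per token
  count(doc, word) %>%             # term counts per document
  bind_tf_idf(word, doc, n) %>%    # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))

docs_tf_idf
```

Words that occur in every document (here, "again") get an idf of zero, so tf-idf pushes them to the bottom and highlights the terms that are specific to each document.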
Usage example - word frequency

    Step 1 - Pick any text for analysis

    # install.packages("devtools")
    # devtools::install_github("tomasbarcellos/valorrr")
    
    library(valorrr)
    library(rvest)  # html_session() comes from rvest (renamed session() in rvest >= 1.0.0)
    sessao <- html_session("http://www.valor.com.br/")
    links <- links_pagina(sessao)
    # First 20 news stories
    noticias <- ler_noticia(sessao, links[1:20])
    

    Now we have the text of the first 20 news stories from the newspaper Valor Econômico.

    Step 2 - Use tidytext to parse texts

    library(tidytext)
    library(dplyr)
    library(stringr)
    
    noticias_tidy <- noticias %>% 
      select(titulo, texto) %>% 
      unnest_tokens(word, texto)
    
    stop_port <- get_stopwords(language = "pt")
    
    noticias_tidy %>% 
      anti_join(stop_port) %>%
      count(word, sort = TRUE)
    
    Joining, by = "word"
    # A tibble: 2,069 x 2
       word              n
       <chr>         <int>
     1 r                55
     2 é                52
     3 bilhões          44
     4 governo          37
     5 caminhoneiros    28
     6 paulo            27
     7 ônibus           26
     8 diesel           24
     9 petrobras        24
    10 presidente       23
    # ... with 2,059 more rows
    

    Without reading any of the news, we can already see that the newspaper is now focused on matters concerning the truck drivers' strike and fuel policy.

      

    Note: The words r and é appear because we did not clean the data to make this example simpler.
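
If you do want to clean those tokens, one possible approach is sketched below on a made-up sentence. The `exemplo` and `stop_extra` objects are hypothetical, and the guess that `r` comes from the currency symbol `R$` is mine, not something checked against the news texts.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Made-up sentence (hypothetical data, only for illustration)
exemplo <- tibble(texto = "O diesel custa R$ 3,50 e é caro")

# Extra stop words on top of the standard Portuguese list
stop_extra <- tibble(word = c("r", "é"))

exemplo_limpo <- exemplo %>%
  unnest_tokens(word, texto) %>%
  anti_join(get_stopwords(language = "pt"), by = "word") %>%
  anti_join(stop_extra, by = "word") %>%
  filter(!str_detect(word, "^[0-9.,]+$"))   # drop purely numeric tokens

exemplo_limpo
```

The same two extra steps (a custom stop word table and a filter for numeric tokens) could be added to the `noticias_tidy` pipeline above before `count()`.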

    Using bigrams makes this conclusion even more obvious:

    # "\\b" is the regex word boundary; a single "\b" in an R string is a backspace
    regex_stop <- paste0("\\b", stop_port$word, "\\b", collapse = "|")
    
    noticias_bigram <- noticias %>% 
      select(titulo, texto) %>% 
      mutate(texto = str_remove_all(texto, regex_stop)) %>% 
      unnest_tokens(word, texto, "ngrams", n = 2)
    
    noticias_bigram %>% count(word, sort = TRUE)
    
    # A tibble: 4,365 x 2
       word                    n
       <chr>               <int>
     1 são paulo              26
     2 quinta feira           13
     3 pis cofins             10
     4 greve caminhoneiros     8
     5 preço diesel            8
     6 15 dias                 7
     7 desta quinta            7
     8 nesta quinta            7
     9 além disso              6
    10 capital paulista        6
    # ... with 4,355 more rows
    

    Step 3 - Choose your next goal

    Once we have the text structured in tidy format, the sky is the limit. From here we could, for example, build term-document matrices to feed a model that predicts the author of a text, or visualize word usage in a word cloud, etc.
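
As a rough sketch of the term-document matrix idea, the block below builds a sparse document-by-term matrix with `cast_sparse()` from tidytext (it relies on the Matrix package). The two short documents are made up for illustration; with the real data you would start from `noticias_tidy` instead.

```r
library(dplyr)
library(tidytext)

# Made-up documents (hypothetical data, only for illustration)
docs <- tibble(
  titulo = c("noticia_1", "noticia_2"),
  texto  = c("greve dos caminhoneiros", "preço do diesel e greve")
)

contagens <- docs %>%
  unnest_tokens(word, texto) %>%
  count(titulo, word)

# Rows are documents, columns are terms, values are counts
matriz <- cast_sparse(contagens, titulo, word, n)
dim(matriz)
```

A matrix like this is the usual input for classification or topic models; `cast_dtm()` and `cast_dfm()` produce the equivalent objects for the tm and quanteda packages if those are installed.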

        
    24.05.2018 / 17:18