Hi, in a lecture at my university I saw a package called Tidytext... I understood how it works, but I can't think of any practical use for it. Could someone give me an example of how we could take advantage of it in everyday problems? Thanks!
tidytext is a package that aims to support text analysis in a general way, and therefore has countless uses (the most important ones can be found in the main package vignette, as pointed out by @MacusNunes). Among the text-analysis possibilities implemented in tidytext, I would highlight the following:
# install.packages("devtools")
# devtools::install_github("tomasbarcellos/valorrr")
library(valorrr)
sessao <- html_session("http://www.valor.com.br/")
links <- links_pagina(sessao)
# First 20 news articles
noticias <- ler_noticia(sessao, links[1:20])
We now have the text of the first 20 news articles from the newspaper Valor Econômico.
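At this point, noticias should be a data frame with (at least) the columns titulo (headline) and texto (article body), which are the only two used below. A quick sanity check, assuming that structure:
dplyr::glimpse(noticias)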
library(tidytext)
library(dplyr)
library(stringr)
noticias_tidy <- noticias %>%
  select(titulo, texto) %>%
  unnest_tokens(word, texto)  # tokenize: one row per word
stop_port <- get_stopwords(language = "pt")  # Portuguese stop word list
noticias_tidy %>%
  anti_join(stop_port) %>%  # drop rows whose word is a stop word
  count(word, sort = TRUE)
Joining, by = "word"
# A tibble: 2,069 x 2
word n
<chr> <int>
1 r 55
2 é 52
3 bilhões 44
4 governo 37
5 caminhoneiros 28
6 paulo 27
7 ônibus 26
8 diesel 24
9 petrobras 24
10 presidente 23
# ... with 2,059 more rows
Without reading any of the news, we can already see that the newspaper is currently focused on the truck drivers' strike and fuel pricing policy.
Note: the tokens "r" and "é" appear because we did not clean the data, in order to keep this example simple.
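One simple way to clean this up, for example, would be to drop single-character tokens, since both leftovers have length one:
noticias_tidy %>%
  anti_join(stop_port, by = "word") %>%
  filter(nchar(word) > 1) %>%  # drops single-character tokens such as "r" and "é"
  count(word, sort = TRUE)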
Using bigrams makes this conclusion even more obvious:
# Regex that matches any stop word as a whole word
# (note the double backslashes: "\\b" is the word-boundary anchor)
regex_stop <- paste0("\\b", stop_port$word, "\\b", collapse = "|")
noticias_bigram <- noticias %>%
  select(titulo, texto) %>%
  mutate(texto = str_remove_all(texto, regex_stop)) %>%  # strip stop words first
  unnest_tokens(word, texto, "ngrams", n = 2)  # tokenize into bigrams
noticias_bigram %>% count(word, sort = TRUE)
# A tibble: 4,365 x 2
word n
<chr> <int>
1 são paulo 26
2 quinta feira 13
3 pis cofins 10
4 greve caminhoneiros 8
5 preço diesel 8
6 15 dias 7
7 desta quinta 7
8 nesta quinta 7
9 além disso 6
10 capital paulista 6
# ... with 4,355 more rows
Once we have the text structured in tidy format, the sky is the limit. From here we could, for example, create term-document matrices to feed a model that predicts the author of a text, or visualize word usage in a word cloud, etc.
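For example, here is a minimal sketch of both ideas, assuming the noticias_tidy and stop_port objects created above (cast_dtm requires the tm package, and the cloud uses the wordcloud package):
# install.packages(c("tm", "wordcloud"))  # if needed
library(wordcloud)

# Term-document matrix, one document per headline
dtm <- noticias_tidy %>%
  count(titulo, word) %>%
  cast_dtm(document = titulo, term = word, value = n)

# Word cloud of the most frequent words, with stop words removed
contagem <- noticias_tidy %>%
  anti_join(stop_port, by = "word") %>%
  count(word, sort = TRUE)
wordcloud(words = contagem$word, freq = contagem$n, max.words = 50)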