One way to solve this problem is with the dplyr and broom packages:
library(dplyr)
library(broom)

iris.regressao <- iris %>%
  group_by(Species) %>%
  do(regressao = lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = .))
tidy(iris.regressao, regressao)
# A tibble: 12 x 6
# Groups: Species [3]
Species term estimate std.error statistic p.value
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa (Intercept) 2.3518898 0.39286751 5.9864707 3.034183e-07
2 setosa Sepal.Width 0.6548350 0.09244742 7.0833236 6.834434e-09
3 setosa Petal.Length 0.2375602 0.20801921 1.1420107 2.593594e-01
4 setosa Petal.Width 0.2521257 0.34686362 0.7268727 4.709870e-01
5 versicolor (Intercept) 1.8955395 0.50705524 3.7383295 5.112246e-04
6 versicolor Sepal.Width 0.3868576 0.20454490 1.8913091 6.488965e-02
7 versicolor Petal.Length 0.9083370 0.16543248 5.4906811 1.666695e-06
8 versicolor Petal.Width -0.6792238 0.43538206 -1.5600639 1.255990e-01
9 virginica (Intercept) 0.6998830 0.53360089 1.3116227 1.961563e-01
10 virginica Sepal.Width 0.3303370 0.17432873 1.8949086 6.439972e-02
11 virginica Petal.Length 0.9455356 0.09072204 10.4223360 1.074269e-13
12 virginica Petal.Width -0.1697527 0.19807243 -0.8570233 3.958750e-01
I took the iris data set and treated the Species column as the equivalent of the products in your problem; it was my grouping variable. I ran a regression with Sepal.Length as the response and the other variables as predictors. The do function applied this regression to each level of Species; that is, three separate regressions were fitted. Finally, I organized the result with the tidy function to make everything more presentable.
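As an aside, newer releases of dplyr and broom have deprecated the do()/rowwise tidy() pattern used here. Assuming a current tidyverse (tidyr >= 1.0 plus purrr), an equivalent sketch with list-columns would be:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# one model per species, stored in a list-column instead of do()
iris.coefs <- iris %>%
  nest(data = -Species) %>%
  mutate(
    modelo = map(data, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = .x)),
    coefs  = map(modelo, tidy)
  ) %>%
  select(Species, coefs) %>%
  unnest(coefs)

iris.coefs
```

The result is the same 12-row table of coefficients, one block of 4 terms per species.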
Edit, made after the data set was shared in the comments: I used the original data set and could not reproduce the reported error.
library(dplyr)
library(broom)
dados <- read.csv(file="arquivo.csv", sep=";", dec=",")
Just to be on the safe side, I wanted to see how many times each product appears in the data set:
produtos.contagem <- dados %>%
  select(Produto) %>%
  group_by(Produto) %>%
  count() %>%
  arrange(n)
# A tibble: 2,418 x 2
# Groups: Produto [2,418]
Produto n
<fctr> <int>
1 Produto 1 1
2 Produto 100 1
3 Produto 1000 1
4 Produto 1006 1
5 Produto 1011 1
6 Produto 1012 1
7 Produto 1013 1
8 Produto 1021 1
9 Produto 1022 1
10 Produto 1023 1
# ... with 2,408 more rows
We can see that many products appear only once, which will cause problems for the regression. After all, geometrically speaking, you need at least 3 points in space to fit an equation of the form y = b0 + b1*x1 + b2*x2, which is the model here. And even with exactly 3 points, such an equation is uniquely determined, leaving no residual degrees of freedom to estimate variability.
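To illustrate the degrees-of-freedom point with toy data (not from the question's data set): fitting a model with three coefficients to exactly three points reproduces the points exactly, so there is nothing left over with which to estimate the standard errors.

```r
# toy data: exactly 3 observations for a model with 3 coefficients
toy <- data.frame(
  y  = c(1, 2, 4),
  x1 = c(0, 1, 2),
  x2 = c(0, 1, 0)
)

modelo <- lm(y ~ x1 + x2, data = toy)

df.residual(modelo)           # 0: no residual degrees of freedom
summary(modelo)$coefficients  # the std.error column cannot be estimated
```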
But leaving aside the theoretical assumptions of linear regression, it is still possible to fit the desired model:
dados.regressao <- dados %>%
  group_by(Produto) %>%
  do(regressao = lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data = .))
tidy(dados.regressao, regressao)
# A tibble: 3,725 x 6
# Groups: Produto [2,418]
Produto term estimate std.error statistic p.value
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Produto 1 (Intercept) 2.000000 NaN NaN NaN
2 Produto 10 (Intercept) -73.045872 75.5475262 -0.9668863 0.4356077
3 Produto 10 Preco.Meu.Site 1.834862 1.8348624 1.0000000 0.4226497
4 Produto 100 (Intercept) 1.000000 NaN NaN NaN
5 Produto 1000 (Intercept) 2.000000 NaN NaN NaN
6 Produto 1001 (Intercept) 3.500000 1.5000000 2.3333333 0.2577621
7 Produto 1002 (Intercept) 38.970000 21.3485433 1.8254173 0.3190534
8 Produto 1002 Menor.Preco.Concorrente -0.300000 0.1732051 -1.7320508 0.3333333
9 Produto 1003 (Intercept) 110.760000 NaN NaN NaN
10 Produto 1003 Menor.Preco.Concorrente -0.800000 NaN NaN NaN
# ... with 3,715 more rows
The NaN values appear exactly in the cases where it is impossible to fit the desired model. Take Produto 1, for example: it occurs only once in the data set. It is impossible to fit a unique plane through a single point, because infinitely many planes pass through it.
What I recommend is removing the products with few observations from the data set. In the example below, I keep only the products with 5 or more observations:
produtos.definitivos <- produtos.contagem %>%
  filter(n >= 5)

dados.limpos <- dados %>%
  filter(Produto %in% produtos.definitivos$Produto)

dados.regressao <- dados.limpos %>%
  group_by(Produto) %>%
  do(regressao = lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data = .))
tidy(dados.regressao, regressao)
# A tibble: 1,153 x 6
# Groups: Produto [526]
Produto term estimate std.error statistic
<fctr> <chr> <dbl> <dbl> <dbl>
1 Produto 1004 (Intercept) 3.626591760 55.74608778 0.06505554
2 Produto 1004 Menor.Preco.Concorrente 0.048689139 0.06657824 0.73130714
3 Produto 1004 Preco.Meu.Site -0.087390762 0.84020377 -0.10401139
4 Produto 1005 (Intercept) 22.974217287 68.67298304 0.33454521
5 Produto 1005 Menor.Preco.Concorrente 0.008938733 0.09727655 0.09188990
6 Produto 1005 Preco.Meu.Site -0.117537498 0.34017868 -0.34551694
7 Produto 1007 (Intercept) 38.987889938 77.15185884 0.50533961
8 Produto 1007 Menor.Preco.Concorrente -0.015792461 0.03091735 -0.51079615
9 Produto 1007 Preco.Meu.Site -0.175931611 0.39215747 -0.44862492
10 Produto 101 (Intercept) -45.326666667 23.83058282 -1.90203769
# ... with 1,143 more rows, and 1 more variables: p.value <dbl>
Note that the NaN values are gone, simply because each product now has at least the minimum number of observations required to fit the desired model.
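As a side note, the count-then-filter step can be collapsed into a single grouped filter with n(); a sketch on a toy data frame that reuses the Produto column name:

```r
library(dplyr)

# toy stand-in for the question's data: Produto 1 has 6 rows, Produto 2 has 2
dados.toy <- data.frame(
  Produto = rep(c("Produto 1", "Produto 2"), times = c(6, 2))
)

# keep only products with at least 5 observations, in one pass
dados.limpos.toy <- dados.toy %>%
  group_by(Produto) %>%
  filter(n() >= 5) %>%
  ungroup()

dados.limpos.toy
```

This avoids building the intermediate produtos.contagem table, at the cost of not having the counts available for inspection.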