Multiple Linear Regression in R for different groups


I have a data set with 4 columns:

Product - Price Site 1 - Competitor Site Price - Quantities Sold

Where Product is a specific product (Refrigerator 1, Refrigerator 2, etc.); Price Site 1 is the price charged by my site; Competitor Site Price is the lowest price a competitor charges for the same product; and Quantities Sold is the number of units sold by my site. Each row is a different scenario combining these variables. There are several records for each product, since I have set different prices over time, and so has my competitor.

I want to fit a multiple linear regression model relating these variables for each product. E.g.: Refrigerator 1, when I charged price x1 and my competitor charged price x2, sold y units.

The problem is that I want to do this in bulk, for many products at once. How do I write R code for this?

asked by anonymous 17.08.2017 / 02:47

1 answer


One way to solve this problem is with the packages dplyr and broom:

library(dplyr)
library(broom)

# Fit one regression per species, storing each model in the list-column "regressao"
iris.regressao <- iris %>%
  group_by(Species) %>%
  do(regressao = 
    lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=.))

# Extract every model's coefficient table at once
tidy(iris.regressao, regressao)

# A tibble: 12 x 6
# Groups:   Species [3]
      Species         term   estimate  std.error  statistic      p.value
       <fctr>        <chr>      <dbl>      <dbl>      <dbl>        <dbl>
 1     setosa  (Intercept)  2.3518898 0.39286751  5.9864707 3.034183e-07
 2     setosa  Sepal.Width  0.6548350 0.09244742  7.0833236 6.834434e-09
 3     setosa Petal.Length  0.2375602 0.20801921  1.1420107 2.593594e-01
 4     setosa  Petal.Width  0.2521257 0.34686362  0.7268727 4.709870e-01
 5 versicolor  (Intercept)  1.8955395 0.50705524  3.7383295 5.112246e-04
 6 versicolor  Sepal.Width  0.3868576 0.20454490  1.8913091 6.488965e-02
 7 versicolor Petal.Length  0.9083370 0.16543248  5.4906811 1.666695e-06
 8 versicolor  Petal.Width -0.6792238 0.43538206 -1.5600639 1.255990e-01
 9  virginica  (Intercept)  0.6998830 0.53360089  1.3116227 1.961563e-01
10  virginica  Sepal.Width  0.3303370 0.17432873  1.8949086 6.439972e-02
11  virginica Petal.Length  0.9455356 0.09072204 10.4223360 1.074269e-13
12  virginica  Petal.Width -0.1697527 0.19807243 -0.8570233 3.958750e-01

I took the iris data set and treated the Species column as the equivalent of the product in your problem. It was my grouping variable.

I fit a regression with Sepal.Length as the response and the other variables as predictors. The do function applied this regression to every level of Species; that is, I ran three regressions at once.

I organized the final result with the tidy function, to make everything more presentable.
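As an aside, in versions of dplyr and broom released after this answer, do() and this use of tidy() were superseded. A sketch of the same computation with tidyr::nest() and purrr::map(), assuming those two packages are installed:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Nest the rows of each species, fit one model per nested data frame,
# tidy each model, then unnest the coefficient tables into one tibble
iris.regressao <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(modelo = map(data, ~ lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = .x)),
         coeficientes = map(modelo, tidy)) %>%
  select(Species, coeficientes) %>%
  unnest(coeficientes)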

Edit, made after the dataset was provided in the comments.

I used the original dataset and could not reproduce the reported error.

library(dplyr)
library(broom)
dados <- read.csv(file="arquivo.csv", sep=";", dec=",")  # semicolon-separated file with decimal commas
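As an aside, read.csv2() has exactly these defaults (sep=";", dec=","), so the same call could be written as:

dados <- read.csv2(file="arquivo.csv")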

Just to be on the safe side, I wanted to see how many times each product appears in the dataset:

# Count how many rows each product has, from least to most frequent
produtos.contagem <- dados %>% 
  select(Produto) %>%
  group_by(Produto) %>%
  count() %>%
  arrange(n)

# A tibble: 2,418 x 2
# Groups:   Produto [2,418]
        Produto     n
         <fctr> <int>
 1    Produto 1     1
 2  Produto 100     1
 3 Produto 1000     1
 4 Produto 1006     1
 5 Produto 1011     1
 6 Produto 1012     1
 7 Produto 1013     1
 8 Produto 1021     1
 9 Produto 1022     1
10 Produto 1023     1
# ... with 2,408 more rows
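As a side note, count() already groups by the column it receives, so the same table can be built more compactly (a sketch assuming the same column name):

produtos.contagem <- dados %>%
  count(Produto) %>%
  arrange(n)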

We can see that many products appear only once, which will cause problems for the regression. The model being fit here has the form y = b0 + b1*x1 + b2*x2, with three coefficients, so, geometrically speaking, you need at least 3 points in space to determine the plane. And even then, with exactly 3 points the plane is uniquely defined, leaving no residual degrees of freedom to estimate variability.
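A minimal illustration of this with three made-up points: the plane fits them exactly, so the residual standard error (and hence every standard error) comes out as NaN.

# Three observations, three coefficients: the fit is exact and there are
# no residual degrees of freedom left to estimate variability
d <- data.frame(x1 = c(1, 2, 4), x2 = c(3, 1, 2), y = c(10, 12, 9))
summary(lm(y ~ x1 + x2, data = d))$sigma
# [1] NaN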

But leaving the theoretical assumptions of linear regression aside, it is still possible to fit the desired regression:

dados.regressao <- dados %>%
  group_by(Produto) %>%
  do(regressao = 
       lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))

tidy(dados.regressao, regressao)

# A tibble: 3,725 x 6
# Groups:   Produto [2,418]
        Produto                    term   estimate  std.error  statistic   p.value
         <fctr>                   <chr>      <dbl>      <dbl>      <dbl>     <dbl>
 1    Produto 1             (Intercept)   2.000000        NaN        NaN       NaN
 2   Produto 10             (Intercept) -73.045872 75.5475262 -0.9668863 0.4356077
 3   Produto 10          Preco.Meu.Site   1.834862  1.8348624  1.0000000 0.4226497
 4  Produto 100             (Intercept)   1.000000        NaN        NaN       NaN
 5 Produto 1000             (Intercept)   2.000000        NaN        NaN       NaN
 6 Produto 1001             (Intercept)   3.500000  1.5000000  2.3333333 0.2577621
 7 Produto 1002             (Intercept)  38.970000 21.3485433  1.8254173 0.3190534
 8 Produto 1002 Menor.Preco.Concorrente  -0.300000  0.1732051 -1.7320508 0.3333333
 9 Produto 1003             (Intercept) 110.760000        NaN        NaN       NaN
10 Produto 1003 Menor.Preco.Concorrente  -0.800000        NaN        NaN       NaN
# ... with 3,715 more rows

The NaN values appear precisely in the cases where it is impossible to fit the desired model. Take Produto 1, for example: it occurs only once in the dataset. It is impossible to single out one plane through a single point, because infinitely many planes pass through it.
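This is easy to reproduce; a sketch with a single made-up row (the prices are hypothetical, and the sales value of 2 matches the Produto 1 output above):

# With one observation, lm() can only estimate the intercept (equal to
# the single observed response); the slopes are NA and, with zero
# residual degrees of freedom, every standard error is NaN
d1 <- data.frame(Vendas.Diarias = 2,
                 Menor.Preco.Concorrente = 100,  # hypothetical price
                 Preco.Meu.Site = 110)           # hypothetical price
tidy(lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data = d1))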

What I recommend is removing products with few observations from the dataset. In the example below I keep only products with 5 or more observations:

produtos.definitivos <- produtos.contagem %>% 
  filter(n >= 5)

dados.limpos <- dados %>%
  filter(Produto %in% produtos.definitivos$Produto)

dados.regressao <- dados.limpos %>%
  group_by(Produto) %>%
  do(regressao = 
       lm(Vendas.Diarias ~ Menor.Preco.Concorrente + Preco.Meu.Site, data=.))

tidy(dados.regressao, regressao)

# A tibble: 1,153 x 6
# Groups:   Produto [526]
        Produto                    term      estimate   std.error   statistic
         <fctr>                   <chr>         <dbl>       <dbl>       <dbl>
 1 Produto 1004             (Intercept)   3.626591760 55.74608778  0.06505554
 2 Produto 1004 Menor.Preco.Concorrente   0.048689139  0.06657824  0.73130714
 3 Produto 1004          Preco.Meu.Site  -0.087390762  0.84020377 -0.10401139
 4 Produto 1005             (Intercept)  22.974217287 68.67298304  0.33454521
 5 Produto 1005 Menor.Preco.Concorrente   0.008938733  0.09727655  0.09188990
 6 Produto 1005          Preco.Meu.Site  -0.117537498  0.34017868 -0.34551694
 7 Produto 1007             (Intercept)  38.987889938 77.15185884  0.50533961
 8 Produto 1007 Menor.Preco.Concorrente  -0.015792461  0.03091735 -0.51079615
 9 Produto 1007          Preco.Meu.Site  -0.175931611  0.39215747 -0.44862492
10  Produto 101             (Intercept) -45.326666667 23.83058282 -1.90203769
# ... with 1,143 more rows, and 1 more variables: p.value <dbl>

Note that the NaN values are gone, simply because every remaining product now has more than the minimum number of observations required to fit the desired model.
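As a side note, the two-step filter above can also be written as a single grouped filter, without building produtos.contagem first (a sketch assuming the same column names):

dados.limpos <- dados %>%
  group_by(Produto) %>%
  filter(n() >= 5) %>%   # keep only products with 5 or more rows
  ungroup()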

17.08.2017 / 03:50