What is rank-deficient and how to get around it?

Question

I fitted a linear regression with lm(), declaring some variables as factors, and some of the coefficients came out as NA, for example:

citySão José
NA

When I called predict(), the predictions were produced, but I received the following warning:

Warning message:
In predict.lm(modeloAIC, matriz_de_estimação) :
prediction from a rank-deficient fit may be misleading

I was left wondering how to get around this, and how a prediction was produced for the observations whose factor level is São José.

r lm

asked by anonymous 23.10.2018 / 20:30

1 answer


score 4 · Accepted Answer

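The general formula of linear regression is given by

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \ldots, n$$

It can be represented in matrix form through the relation

$$Y = X\beta + \varepsilon$$

where Y and epsilon are vectors of n elements and X is a matrix given by

$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

The least squares estimator of the beta parameters can be obtained through the relation

$$\hat{\beta} = (X'X)^{-1} X'Y$$

where X' is the transpose of X and (X'X)^(-1) is the inverse of X'X.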

In order for the inverse (X'X)^(-1) to exist, X'X must be a full rank matrix. X'X will have full rank if, and only if, its columns are not linear combinations of each other. In that case, the determinant of the matrix is nonzero and it is invertible.

When the columns of a matrix are linear combinations of one another, we say that the matrix is rank-deficient (or of incomplete rank). The problem is that such matrices are not invertible. Therefore, it is not possible to estimate the regression parameters with the formula shown above, since (X'X)^(-1) does not exist.
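The rank condition described above can be checked numerically. The sketch below is only an illustration, assuming the built-in mtcars data and a design matrix in which one column is deliberately twice another; the object names X, XtX and inverse are made up for this example.

```r
# Design matrix for mpg ~ wt + I(2 * wt): the third column is twice the second
X <- model.matrix(~ wt + I(2 * wt), data = mtcars)

ncol(X)      # number of columns (parameters) in X
qr(X)$rank   # numerical rank of X; smaller than ncol(X) here

# X'X is singular, so solve() cannot invert it
XtX <- crossprod(X)   # same as t(X) %*% X
inverse <- tryCatch(solve(XtX), error = function(e) NULL)
is.null(inverse)      # TRUE when the inversion failed
```

Comparing qr(X)$rank with ncol(X) is a quick way to see whether a design matrix has redundant columns before fitting anything.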

It is impossible to give a definitive solution to a rank-deficient regression problem without looking at the data. However, there are a few things worth checking:

1) One of the predictor variables is a linear combination of the others. That is, some variable in your model is redundant. Read up on multicollinearity and on how to remove variables from your model. In particular, look at what the variance inflation factor (VIF) means.
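As a rough sketch of what the variance inflation factor measures: for each predictor j, regress it on the other predictors and compute VIF_j = 1 / (1 - R_j^2). The choice of wt, disp and hp from mtcars below is an assumption made purely for illustration, and the names preditores and vif are hypothetical. Large values indicate that a predictor is nearly a linear combination of the others; under exact collinearity R_j^2 equals 1 and the VIF is infinite.

```r
# Hypothetical choice of predictors from mtcars, for illustration only
preditores <- c("wt", "disp", "hp")

vif <- sapply(preditores, function(v) {
  outros <- setdiff(preditores, v)
  # Regress predictor v on the remaining predictors
  r2 <- summary(lm(reformulate(outros, response = v), data = mtcars))$r.squared
  1 / (1 - r2)   # VIF_j = 1 / (1 - R_j^2)
})

vif   # one value per predictor; values close to 1 mean little collinearity
```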

The example below, constructed specifically to be rank-deficient, behaves like your problem: of the two predictors, one is exactly double the other and is therefore a linear combination of it.

ajuste <- lm(mpg ~ wt + I(2*wt), data=mtcars)
predict(ajuste, mtcars)

Warning message:
In predict.lm(ajuste, mtcars) :
  prediction from a rank-deficient fit may be misleading
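As for how a prediction can be produced at all: when lm() finds a redundant column, it sets that coefficient to NA and computes fitted values and predictions from the remaining coefficients only, which is equivalent to dropping the column. The sketch below illustrates this with the same mtcars example; the name reduzido is made up for the comparison model.

```r
ajuste <- lm(mpg ~ wt + I(2 * wt), data = mtcars)

coef(ajuste)                        # the redundant term has an NA coefficient
names(which(is.na(coef(ajuste))))   # terms that lm() could not estimate
ajuste$rank                         # rank of the fitted design matrix
alias(ajuste)                       # how the dropped term depends on the others

# Dropping the redundant term by hand gives the same fitted values
reduzido <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(fitted(ajuste)), unname(fitted(reduzido)))
```

In the question's case, the NA for citySão José suggests that this dummy column is a linear combination of other columns in the model, so observations at that level are predicted from the coefficients that could be estimated.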

2) The sample may not be large enough for the model to be fitted. It takes at least two points to define a line. However, if I supply a single point, with one x and one y coordinate, R will fit a linear model to it without complaining:

x <- 1
y <- 3

ajuste <- lm(y ~ x)

predict(ajuste, data.frame(x=1.5))

The warning only appears at prediction time. So it may be that your model has too many parameters for the sample size available. Consider the following case, with two predictor variables:

x <- c(1, 2)
y <- c(3, 1)
z <- c(5, 0)

ajuste <- lm(z ~ x + y)

predict(ajuste, data.frame(x=1.5, y=2.5))

This fit is also rank-deficient, because there are too few observations. Here is how the problem is resolved when I increase the sample size:

x <- c(1, 2, 1)
y <- c(3, 1, 1)
z <- c(5, 0, 1)

ajuste <- lm(z ~ x + y)

predict(ajuste, data.frame(x=1.5, y=2.5))

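The general rule is to have at least as many points as there are parameters to be estimated in the model. This is a necessary condition for the matrix not to be rank-deficient, although it is not sufficient: redundant columns, as in item 1, still cause rank deficiency. Even so, having exactly as many points as parameters is not ideal, because other problems can occur. Run the command below and see that it was not possible to build the hypothesis tests for the parameters, even though the matrix is not rank-deficient: the fit leaves no residual degrees of freedom, so the standard errors cannot be computed.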

summary(ajuste)
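The reason summary() cannot produce standard errors in this case can be checked directly by comparing the number of observations with the number of estimated parameters. A minimal sketch, reusing the three-point example:

```r
x <- c(1, 2, 1)
y <- c(3, 1, 1)
z <- c(5, 0, 1)
ajuste <- lm(z ~ x + y)

length(z)                     # number of observations
length(coef(ajuste))          # number of estimated parameters
df.residual(ajuste)           # residual degrees of freedom
max(abs(residuals(ajuste)))   # essentially zero: the fit passes through every point
```

With as many parameters as observations, the model reproduces the data exactly and nothing is left over to estimate the error variance.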

If the predictor variables are categorical, there is a further aggravating factor: the size of the matrix X'X grows with the number of levels, since with R's default coding a factor with k levels contributes k - 1 columns to X. The rule I gave above, counting one parameter per predictor, only applies if the predictor variables are quantitative.
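The effect of factors on the design matrix can be sketched with a small made-up data set. Everything below is hypothetical (the data frame dados, its values, and the variables estado and cidade); it assumes each city belongs to exactly one state, which makes the state dummy a linear combination of the city dummies.

```r
# Hypothetical data: each city belongs to exactly one state,
# so the state indicator is determined by the city indicators
dados <- data.frame(
  y      = c(10, 12, 11, 20, 21, 19, 30, 31),
  estado = factor(c("SC", "SC", "SC", "SC", "SC", "SC", "SP", "SP")),
  cidade = factor(c("Florianópolis", "Florianópolis", "Florianópolis",
                    "São José", "São José", "São José",
                    "Campinas", "Campinas"))
)

# Each factor with k levels adds k - 1 dummy columns to X
X <- model.matrix(~ estado + cidade, data = dados)
colnames(X)
ncol(X)      # columns in the design matrix
qr(X)$rank   # rank is lower: one dummy column is redundant

ajuste_fator <- lm(y ~ estado + cidade, data = dados)
coef(ajuste_fator)   # one coefficient is reported as NA
```

Removing the redundant factor from the formula (here, keeping only cidade) would give a full-rank fit with the same fitted values.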

In summary:

Simplify your model; or

Collect more data; or

Read a good book on multiple linear regression.