Warning: "In sqrt(diag(object$vcov)): NaNs produced" in Hurdle Model


Hello

I have a data set with which I want to test the influence of some predictor variables on a response variable. Because the response variable contains many zeros (766 zeros out of 2830 sample units), I decided to use a hurdle model. In R, I ran these commands:

fórmula <- dados$BC ~ dados$z_primeiro_artigo +
 dados$z_capacidade_científica + dados$z_tamanho_corporal +
 z_reproduções_por_ano + dados$Red_List_Status +
 dados$Tipo_de_desenvolvimento | dados$z_capacidade_científica +
 dados$z_tamanho_corporal + z_reproduções_por_ano +
 dados$Red_List_Status + dados$Tipo_de_desenvolvimento

resultado <- hurdle(formula = fórmula, dist = "negbin", data = dados, na.action = "na.fail")
summary(resultado)

Call:
hurdle(formula = fórmula, data = dados, na.action = "na.fail", dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.1840 -0.6896 -0.2369  0.1864 16.3096 

Count model coefficients (truncated negbin with log link):
                                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           89.2998674  0.1065855 837.824  < 2e-16 ***
dados$z_primeiro_artigo               -0.0475314         NA      NA       NA    
dados$z_capacidade_científica          0.0751863  0.0048415  15.530  < 2e-16 ***
dados$z_tamanho_corporal               0.0020403  0.0006407   3.185  0.00145 ** 
z_reproduções_por_ano                  0.1797664  0.0761702   2.360  0.01827 *  
dados$Red_List_StatusEN               -0.4140505  0.1725280  -2.400  0.01640 *  
dados$Red_List_StatusLC                0.2434877  0.1372437   1.774  0.07604 .  
dados$Red_List_StatusNT               -0.2326801  0.1856711  -1.253  0.21014    
dados$Red_List_StatusVU                0.0002679  0.1702307   0.002  0.99874    
dados$Tipo_de_desenvolvimentoLarval    0.4254052  0.0928358   4.582  4.6e-06 ***
dados$Tipo_de_desenvolvimentoVivípara  0.0109588  0.3846127   0.028  0.97727    
Log(theta)                            -1.1538934  0.1093832 -10.549  < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
                                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)                            1.3147054  0.2712539   4.847 1.25e-06 ***
dados$z_capacidade_científica          0.0682073  0.0100039   6.818 9.23e-12 ***
dados$z_tamanho_corporal               0.0015036  0.0008404   1.789   0.0736 .  
z_reproduções_por_ano                  0.3522174  0.2009335   1.753   0.0796 .  
dados$Red_List_StatusEN               -0.4264203  0.1776977  -2.400   0.0164 *  
dados$Red_List_StatusLC               -0.1618832  0.1555683  -1.041   0.2981    
dados$Red_List_StatusNT               -0.2458956  0.2064901  -1.191   0.2337    
dados$Red_List_StatusVU               -0.2674147  0.1880392  -1.422   0.1550    
dados$Tipo_de_desenvolvimentoLarval    0.0385487  0.0989498   0.390   0.6968    
dados$Tipo_de_desenvolvimentoVivípara  0.1392403  0.4588545   0.303   0.7615    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 0.3154
Number of iterations in BFGS optimization: 27 
Log-likelihood: -5853 on 22 Df
Warning message:
In sqrt(diag(object$vcov)): NaNs produced
Note that the standard error, z value and p-value of the variable "dados$z_primeiro_artigo" appear as "NA", and I do not understand the warning message at the end: "In sqrt(diag(object$vcov)): NaNs produced". Could anyone help me?

    
asked by anonymous 19.08.2018 / 01:54

1 answer

Generalized linear models do not work magic. It is not enough to have data, fit a model to them, and trust that everything will work out. It is also very difficult (perhaps impossible) to give a definitive answer without working with the same data you are using. Still, it is possible to raise some hypotheses about what might be happening.

0) Before throwing the data into a model, do an exploratory analysis. Plot the data. Compute simple statistics such as means and standard deviations for the quantitative variables and frequency tables for the categorical ones. This will help you find better ways to solve problems that may arise in your analyses.
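
As a rough illustration of that exploratory step, here is a minimal sketch in base R. The data frame below is simulated and only borrows some column names from the question; the values, the sample size and the "CR" level of Red_List_Status are assumptions made up for the example.

```r
# Simulated stand-in for the asker's data frame `dados`.
# Values, sample size and the "CR" level are assumptions for illustration only.
set.seed(1)
n <- 200
dados <- data.frame(
  BC = rnbinom(n, size = 0.3, mu = 5),
  z_tamanho_corporal = rnorm(n),
  Red_List_Status = factor(sample(c("CR", "EN", "LC", "NT", "VU"), n, replace = TRUE))
)

# Mean and standard deviation of every numeric column
num_vars <- sapply(dados, is.numeric)
resumo <- sapply(dados[num_vars], function(x) c(mean = mean(x), sd = sd(x)))
print(resumo)

# Frequency table for a categorical variable
freq <- table(dados$Red_List_Status)
print(freq)

# Proportion of zeros in the response, relevant when choosing a hurdle model
prop_zeros <- mean(dados$BC == 0)
print(prop_zeros)
```

With the real data, the same calls would be run on the actual `dados` object, and something like `hist(dados$BC)` would give a first plot of the response.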

1) I counted 6 variables in the count part of the model and 5 in the zero hurdle part. Is that correct? Do you have a reason for excluding z_primeiro_artigo from the zero part? Does the zero part need to be this complex? In addition, with 6 covariates, some of them may be highly correlated with each other. This creates a problem called multicollinearity. Read up on it and see how it can affect your regression.
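
One simple way to look for highly correlated covariates is a correlation matrix of the numeric predictors. The sketch below uses simulated columns (the names are ASCII versions of names in the question, and one column is deliberately built to be nearly collinear with another). The 0.8 cutoff is an arbitrary rule of thumb chosen for the example, not a formal test.

```r
# Simulated covariates; z_primeiro_artigo is constructed to be almost
# collinear with z_capacidade_cientifica, purely for illustration.
set.seed(2)
n <- 300
x1 <- rnorm(n)
covs <- data.frame(
  z_capacidade_cientifica = x1,
  z_tamanho_corporal = rnorm(n),
  z_primeiro_artigo = -0.95 * x1 + rnorm(n, sd = 0.2)
)

# Pairwise Pearson correlations
cor_mat <- cor(covs)
print(round(cor_mat, 2))

# List the pairs whose absolute correlation exceeds the (arbitrary) cutoff,
# looking only at the upper triangle so each pair appears once
alta <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
pares <- data.frame(
  var1 = rownames(cor_mat)[alta[, "row"]],
  var2 = colnames(cor_mat)[alta[, "col"]],
  r = cor_mat[alta]
)
print(pares)
```

Pairwise correlations only catch collinearity between two variables at a time, so treat this as a first screen.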

2) z_primeiro_artigo has a standard error equal to NA. This means that the variability of this parameter's estimate could not be calculated. Check whether z_primeiro_artigo is constant. A lack of variation in this covariate may be the reason.
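
Checking for a constant covariate takes one line. The helper `is_constant` below is a hypothetical name introduced for this sketch, and the data frame is simulated.

```r
# Hypothetical helper: TRUE when a vector has at most one distinct
# non-missing value
is_constant <- function(x) {
  x <- x[!is.na(x)]
  length(unique(x)) <= 1
}

# Simulated example with one varying and one constant column
set.seed(3)
exemplo <- data.frame(varia = rnorm(50), constante = rep(0, 50))

print(sapply(exemplo, is_constant))

# A standard deviation of zero tells the same story
print(sapply(exemplo, sd))
```

On the real data the call would be `is_constant(dados$z_primeiro_artigo)`.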

3) In sqrt(diag(object$vcov)): NaNs produced means that some elements of the diagonal of the object$vcov matrix are negative. Check whether diag(resultado$vcov) contains negative numbers. If it does, the Hessian matrix of the model is not positive definite. One way to address this is to check whether your covariates are on the same scale. For example, some covariates may be on the order of units and others on the order of hundreds. This very often causes problems when fitting this kind of model. Standardizing the covariates, for example with scale(), puts them on a common scale. Just be aware that inferences made from transformed data differ from those made on the original data, because the coefficients are then expressed in the transformed units.
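
The check described in this item can be sketched as follows. The matrix `V` is a small made-up stand-in for `resultado$vcov` with one negative diagonal entry, and `brutos` is simulated data; neither comes from the question.

```r
# Made-up 3x3 matrix standing in for resultado$vcov (assumption for illustration)
nomes <- c("(Intercept)", "x1", "x2")
V <- matrix(c(0.011,  0.001, 0.000,
              0.001, -0.002, 0.000,
              0.000,  0.000, 0.006),
            nrow = 3, byrow = TRUE, dimnames = list(nomes, nomes))

# Which variances (diagonal entries) are negative?
d <- diag(V)
negativos <- names(d)[d < 0]
print(negativos)

# Taking the square root reproduces the NaN standard error
# (the "NaNs produced" warning is suppressed here on purpose)
se <- suppressWarnings(sqrt(d))
print(se)

# A positive definite matrix has only positive eigenvalues
autovalores <- eigen(V, symmetric = TRUE, only.values = TRUE)$values
pos_def <- all(autovalores > 0)
print(pos_def)

# Standardizing covariates that live on very different scales (simulated data)
set.seed(4)
brutos <- data.frame(a = rnorm(100, mean = 5, sd = 2),
                     b = rnorm(100, mean = 300, sd = 150))
padronizados <- as.data.frame(scale(brutos))
print(sapply(padronizados, sd))
```

With the fitted model from the question, the first check would be `diag(resultado$vcov) < 0`.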

As you can see, none of my answers is definitive. This problem is not simple, and it is impossible to give an accurate diagnosis without access to the data. Finally, I do not think sample size is a problem: 2830 is a reasonable size for this case, with this number of covariates.

    
19.08.2018 / 02:59