How to correlate qualitative and quantitative variables in R?

3

I have a data frame with qualitative variables, such as sex and origin, and quantitative variables, such as cholesterol, weight and height. Is it possible to correlate these variables using the cor() function? When I use it, I get an error saying that the variable must be numeric:

cor(rehab.2)
  

Error in cor(rehab.2) : 'x' must be numeric

Is there a function in R that can correlate all of these variables, regardless of whether they are quantitative or qualitative?

Example of my data frame:

    
asked by anonymous 24.03.2016 / 19:28

2 answers

5

As already mentioned, this question is more about statistics, but since there is no statistics Stack Exchange site in Portuguese, I'll help you here.

The correlation method you are trying only works for numeric variables. If you want to visualize the relationship between a categorical variable and a continuous variable, what I recommend most is boxplots or histograms / density plots.
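Because cor() only accepts numeric input, one way to at least get the correlations among the quantitative variables is to drop the non-numeric columns first. The following is a minimal sketch, assuming the iris dataset as a stand-in for the asker's rehab.2 data frame (the names numeric_cols and cor_matrix are just illustrative):

```r
# Flag the numeric columns; in iris, Species is a factor and is excluded.
numeric_cols <- vapply(iris, is.numeric, logical(1))

# cor() now receives only numeric columns, so the error does not occur.
cor_matrix <- cor(iris[, numeric_cols])
round(cor_matrix, 2)
```

This says nothing about the categorical variables themselves; for those, the plots and tests in the rest of this answer are more appropriate.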

I will demonstrate some examples of these analyses in R. For this I'm using the iris dataset, found in the default datasets package, and the ggplot2 package to draw the plots. Within the dataset we will compare the sepal lengths, iris$Sepal.Length, of the different species we have, iris$Species.

BOXPLOT

require(datasets)
require(ggplot2)

ggplot(iris, aes(x = Species, y = Sepal.Length)) + 
  geom_boxplot()

DENSITY

require(datasets)
require(ggplot2)

ggplot(iris, aes(x = Sepal.Length, fill = Species)) + 
  geom_density(alpha = 0.3)

But if you really want a "number" to guide you, an ANOVA test can give you one. Basically, it tells you whether the differences in the means of your continuous variable across the categories are "statistically significant" (the test can also be applied to other attributes).

ANOVA

require(datasets)

anova <- aov(Sepal.Length ~ Species, iris)
summary(anova)

output:

             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this case, the null hypothesis that the sepals of all species have the same mean length is rejected, with a p-value below 2e-16.
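The p-value only says that the means differ, not how strong the association is. As a rough sketch of one way to get an effect size from the same ANOVA table, the share of the total sum of squares explained by Species (often called eta squared, a term not used elsewhere in this answer) can be computed as follows. The variable names are illustrative:

```r
fit <- aov(Sepal.Length ~ Species, data = iris)

# Sum Sq column of the ANOVA table: first entry is Species (between groups),
# second entry is Residuals (within groups).
ss <- summary(fit)[[1]][["Sum Sq"]]

# Proportion of the variance in Sepal.Length explained by Species.
eta_sq <- ss[1] / sum(ss)
eta_sq        # about 0.62, i.e. 63.21 / (63.21 + 38.96)
sqrt(eta_sq)  # about 0.79, on the same 0-to-1 scale as |correlation|
```

This is the same quantity as the R^2 of lm(Sepal.Length ~ Species, data = iris).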

28.03.2016 / 15:10

3

One way to obtain a coefficient that measures the strength of the association between a categorical variable and a continuous variable is to use the square root of the coefficient of determination of a fitted logistic regression model.

This idea came from a question I asked on Cross Validated some time ago.

The square root of the coefficient of determination is always a number between 0 and 1, with 1 indicating strongly related and 0 weakly related, much like the absolute value of the Pearson correlation coefficient. Using this measure seems to make sense because, in simple linear regression, R^2 is equal to the square of the Pearson correlation.
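The linear-regression fact that motivates this measure is easy to check numerically. A minimal sketch, using two continuous columns of iris chosen only for illustration:

```r
# Simple linear regression of one continuous variable on another.
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
r2 <- summary(fit)$r.squared

# Pearson correlation between the same two variables.
r <- cor(iris$Sepal.Length, iris$Petal.Length)

# R^2 of the regression equals the squared correlation.
all.equal(r2, r^2)  # TRUE
```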

In R, a function that computes this measure for a categorical and a continuous variable can easily be written as follows:

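cor_cat_cont <- function(cat, cont){
  # Logistic regression of the categorical variable on the continuous one.
  # Note: with a factor response, binomial() treats the first level as
  # "failure" and all other levels as "success".
  modelo <- glm(cat ~ cont, family = binomial(link = "logit"), 
                control = glm.control(maxit = 10e6))

  # Coefficient of determination reported by binomTools, then its square root.
  R2 <- binomTools::Rsq(modelo)$R2cor
  sqrt(R2)  
}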

For example, on the iris dataset, you can use it like this:

> cor_cat_cont(iris$Species, iris$Sepal.Length)
[1] 0.8158366

To use the function, you need to install the binomTools package, using install.packages("binomTools").

At the time I wrote a post on my blog simulating some categorical data and measuring the correlation calculated this way, and I found the result quite satisfactory.

    
28.03.2016 / 15:39