How to correlate qualitative and quantitative variables in R?

Question

I have a data frame with qualitative variables, such as sex and origin, and quantitative variables, such as cholesterol, weight and height. Is it possible to correlate these variables using the cor() function? When I use it, I get an error saying that the input must be numeric:

cor(rehab.2)

Error in cor(rehab.2) : 'x' must be numeric

Is there a function in R that can correlate all these variables, regardless of whether they are quantitative or qualitative?

Example of my data frame:

asked by anonymous 24.03.2016 / 19:28

2 answers

score 5 · Accepted Answer

As I said, this question is more about statistics than programming, but since there is no statistics Stack Exchange in Portuguese, I'll help you here.

The correlation method you are trying only works for numeric variables. If you want to visualize relationships between categorical and continuous variables, what I recommend most are boxplots or histograms / density plots.
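To illustrate the point about cor() accepting only numeric input, here is a minimal sketch that keeps just the quantitative columns before computing the correlation matrix. It uses the built-in iris data as a stand-in for rehab.2, which is an assumption, since the original data frame is not shown; the names df, numeric_cols and cor_matrix are only illustrative.

```r
# iris stands in for the asker's data frame (assumption: rehab.2 is not available here)
df <- iris

# Flag the numeric (quantitative) columns; Species is a factor and is dropped
numeric_cols <- vapply(df, is.numeric, logical(1))

# cor() works now because every remaining column is numeric
cor_matrix <- cor(df[, numeric_cols])
print(round(cor_matrix, 2))
```

This only sidesteps the error: the qualitative columns are left out of the matrix, which is why the plots and the test below are still needed to relate them to the quantitative ones.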

I will demonstrate some examples of these analyses in R. For this I'm using the iris dataset, found in the default datasets package, and the ggplot2 package to draw the plots. Within the dataset we will compare the sepal lengths, iris$Sepal.Length, of the different species that we have.

BOXPLOT

require(datasets)
require(ggplot2)

ggplot(iris, aes(x = Species, y = Sepal.Length)) + 
  geom_boxplot()

DENSITY

require(datasets)
require(ggplot2)

ggplot(iris, aes(x = Sepal.Length, fill = Species)) + 
  geom_density(alpha = 0.3)

But if you really want a "number" to guide yourself, an ANOVA test can give you this. Basically, it tells you whether the differences in the means of your continuous variable across the categories are "statistically significant" (the test can also be applied to other attributes).

ANOVA

require(datasets)

anova <- aov(Sepal.Length ~ Species, iris)
summary(anova)

output:

             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this case, the null hypothesis that the sepals of all species have the same mean length is rejected, with a p-value below 2e-16 (the Pr(>F) column in the output).
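If a single number resembling a correlation is wanted, one option that is not part of the original answer is eta squared: the share of the total sum of squares explained by the categorical variable. The sketch below computes it from the same ANOVA table; the names fit, tab, ss and eta_sq are illustrative, and indexing rows by position assumes the one-factor model shown above.

```r
# Same model as above, stored under a name that does not shadow stats::anova()
fit <- aov(Sepal.Length ~ Species, data = iris)

# summary() of an aov fit is a list; its first element is the ANOVA table
tab <- summary(fit)[[1]]

# Sums of squares: first row is Species, second row is Residuals
ss <- tab[["Sum Sq"]]

# Eta squared = between-group sum of squares / total sum of squares
eta_sq <- ss[1] / sum(ss)
print(eta_sq)
```

Values range from 0 (the groups explain none of the variance) to 1 (the groups explain all of it). With the sums of squares in the output above, this comes to roughly 0.62.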

28.03.2016 / 15:10