Negative variance in R? Floating-Point Error Propagation

10

Consider the following function for calculating the variance:

variancia <- function(x) {
  n <- length(x)
  (1/(n^2-n))*(n*(sum(x^2))-(sum(x)^2))
}

In most cases it gives the same result as the var function:

teste <- 1:5
var(teste)
[1] 2.5
variancia(teste)
[1] 2.5
all.equal(var(teste),variancia(teste))
[1] TRUE

Or in this other example:

set.seed(1)
x1 <- rnorm(100, 10, 100)
var(x1)
[1] 8067.621
variancia(x1)
[1] 8067.621
all.equal(variancia(x1), var(x1))
[1] TRUE

However, in the case below it returns an impossible (negative) value:

set.seed(1)
x2 <- runif(1000) + 10^12
variancia(x2)
[1] -140878367
var(x2)
[1] 0.08316728

Why do the two functions give different results? How can I ensure that the variancia function returns the correct value in the last example?

    
asked by anonymous 21.02.2014 / 21:14

2 answers

11

Your calculation has fallen victim to catastrophic cancellation. This can happen when you subtract two nearly equal numbers of the same sign, as in your function:

sum(x2^2)
[1] 1e+27

sum(x2)^2 / length(x2)
[1] 1e+27

With the formula used in the variancia function, this usually occurs when the variance of the vector is much smaller than the square of its mean.
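To see how the problem grows with the size of the mean, here is a small sketch that shifts the same uniform sample by increasingly large offsets and compares variancia with var. The names offsets and rel_error are only illustrative, and the exact numbers printed may vary by platform.

```r
# The function from the question, repeated so the example is self-contained
variancia <- function(x) {
  n <- length(x)
  (1/(n^2-n))*(n*(sum(x^2))-(sum(x)^2))
}

set.seed(1)
u <- runif(1000)

# Illustrative offsets: the variance stays the same, only the mean grows
offsets <- c(0, 1e4, 1e8, 1e12)

# Relative error of variancia against var for each shifted copy of u
rel_error <- sapply(offsets, function(d) {
  abs(variancia(u + d) - var(u + d)) / var(u + d)
})

print(data.frame(offset = offsets, rel_error = rel_error))
```

With no offset the two functions agree to many digits; at an offset of 10^12 the relative error is no longer small, which matches the behaviour seen with x2.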

I will propose two inefficient but simple solutions:

  • Use another formula:
variancia2 <- function(x) {
  n <- length(x)
  media <- mean(x)
  sum((x - media)^2) / (n - 1)
}

variancia2(x2)
[1] 0.08316728
  • Use your formula, but subtract the vector's mean first; this does not change the variance.
variancia(x2 - mean(x2))
[1] 0.08316727
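If a single pass over the data is preferred, one alternative that is not part of the answer above is Welford's online algorithm, which updates a running mean and a running sum of squared deviations. The following is only a minimal sketch; the name variancia_welford is hypothetical.

```r
# Hypothetical helper (not from the original answer): Welford's one-pass variance
variancia_welford <- function(x) {
  media <- 0   # running mean of the elements seen so far
  m2 <- 0      # running sum of squared deviations from the running mean
  for (i in seq_along(x)) {
    delta <- x[i] - media              # deviation from the previous mean
    media <- media + delta / i         # update the mean
    m2 <- m2 + delta * (x[i] - media)  # uses both the old and the new mean
  }
  m2 / (length(x) - 1)                 # sample variance
}

variancia_welford(1:5)   # 2.5, same as var(1:5)
```

On the x2 example it should land close to var(x2), although the trailing digits may differ because the running mean is itself rounded at every step. An explicit loop like this is also slow in R compared with the vectorised versions.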
    
22.02.2014 / 01:02
4

Complementing Marcos Banik's answer.

A double-precision (64-bit) floating-point number can roughly be divided into 3 parts:

Floating-point type double: sign (1 bit), order of magnitude (11 bits) and precision (52 bits).

This format can represent magnitudes up to about 10^308, but with a precision of only about 16 decimal digits (details on how base R handles numbers can be found in the ?.Machine help page). Moreover, irrational numbers and fractions whose denominator is not a power of 2 are stored only approximately.
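These limits can be inspected directly in base R. The snippet below is just an illustration using .Machine and a few decimal fractions:

```r
.Machine$double.digits   # 53: the 52 stored bits plus one implicit leading bit
.Machine$double.eps      # 2^-52, about 2.22e-16
.Machine$double.xmax     # about 1.797693e+308

# 0.1, 0.2 and 0.3 have no finite binary expansion, so they are stored approximately
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal up to a small tolerance

# Fractions with a power-of-2 denominator are stored exactly
0.5 + 0.25 == 0.75                  # TRUE
```

This is also why the question compares results with all.equal rather than with ==.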

So a very large number can be represented by a double, but not exactly. This can cause major problems in operations such as addition and subtraction. The squared values computed in the variancia formula are of the order of (10^12)^2 = 10^24 in the third example, and we only have 52 bits (about 16 decimal digits) for the significant digits; any digits beyond that are imprecise. When we subtract one term from the other, the "good digits" cancel and only the "bad digits" remain, which causes the absurd result.
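As a rough check of this argument, the sketch below compares the gap between adjacent doubles near the two terms of the formula with the difference the formula would need to resolve. The variable names a, b, spacing and needed are only illustrative, and the needed difference is reconstructed from var rather than computed exactly.

```r
set.seed(1)
x2 <- runif(1000) + 10^12
n  <- length(x2)

a <- n * sum(x2^2)   # first term of the formula, about 1e30
b <- sum(x2)^2       # second term, also about 1e30

# Gap between adjacent doubles near a: 2^(binary exponent - 52)
spacing <- 2^(floor(log2(a)) - 52)

# Difference the formula has to resolve, reconstructed from var()
needed <- var(x2) * (n^2 - n)

spacing            # 2^47, roughly 1.4e14
needed             # roughly 8.3e4
needed < spacing   # TRUE: the true difference is far below the resolution of a and b
```

Since a and b can only be stored to within about 10^14, their computed difference is essentially rounding noise, and it can come out negative.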

One way to solve the problem is to use more numerically stable algorithms, such as those proposed by Marcos. But supposing this is not possible, you can use arbitrary-precision numbers.

In R, the package Rmpfr (Multiple Precision Floating-Point Reliable) provides arbitrary-precision numbers, at the cost of more memory and runtime; depending on your computer and the problem, this is not always feasible.

So if it were not possible to improve the algorithm used in the variancia formula, we could use Rmpfr. We would need more than 30 digits of precision, which requires at least log2(10^30) ≈ 99.66 bits. Rounding up to 128 bits:

 
library(Rmpfr)
x2.mpfr <- mpfr(x2, 128)
variancia(x2.mpfr)
1 'mpfr' number of precision  128   bits 
[1] 0.0831672741323709434253943823576340475867
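The conversion from decimal digits to bits used above can be written as a small base R helper. This is only a sketch, and the name bits_for_digits is hypothetical:

```r
# Hypothetical helper: smallest whole number of bits that holds `digits` decimal digits
bits_for_digits <- function(digits) ceiling(digits * log2(10))

bits_for_digits(30)   # 100, so 128 bits leaves a comfortable margin

# Going the other way: decimal digits carried by the 53 significant bits of a double
53 * log10(2)         # about 15.95
```

The figure of 30 digits is reasonable here because n * sum(x^2) is about 10^30 in the third example (n = 1000 and x around 10^12).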
    
22.02.2014 / 02:24