# Negative variance in R? Floating-Point Error Propagation

10

Consider the following function for computing the variance:

``````
variancia <- function(x) {
  n <- length(x)
  (1 / (n^2 - n)) * (n * sum(x^2) - sum(x)^2)
}
``````

It agrees with the built-in `var` function in most cases:

``````
teste <- 1:5
var(teste)
# [1] 2.5
variancia(teste)
# [1] 2.5
all.equal(var(teste), variancia(teste))
# [1] TRUE
``````

Or in this other example:

``````
set.seed(1)
x1 <- rnorm(100, 10, 100)
var(x1)
# [1] 8067.621
variancia(x1)
# [1] 8067.621
all.equal(variancia(x1), var(x1))
# [1] TRUE
``````

However, in the case below it returns an impossible (negative) value:

``````
set.seed(1)
x2 <- runif(1000) + 10^12
variancia(x2)
# [1] -140878367
var(x2)
# [1] 0.08316728
``````

Why do the two functions differ? How can I make sure that `variancia` returns the correct value in the last example?

asked by anonymous 21.02.2014 / 21:14

11

Your function has fallen victim to catastrophic cancellation. This can happen when you subtract two nearly equal numbers of the same sign. In the case of your function:

``````
sum(x2^2)
# [1] 1e+27

sum(x2)^2 / length(x2)
# [1] 1e+27
``````

With the formula used in `variancia`, this typically happens when the variance of the vector is much smaller than the square of its mean.
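To make the cancellation concrete, the sketch below compares the spacing between adjacent doubles near `1e27` with the difference the formula is trying to recover. The names `a`, `b`, `ulp`, and `true_diff` are introduced here for illustration only; the data are the same `x2` as in the question.

```r
set.seed(1)
x2 <- runif(1000) + 10^12
n <- length(x2)

a <- sum(x2^2)        # about 1e27
b <- sum(x2)^2 / n    # about 1e27

# Spacing between adjacent doubles near `a`: a double has a 52-bit
# fraction, so the gap is 2^(exponent - 52).
ulp <- 2^(floor(log2(a)) - 52)

# In exact arithmetic, a - b would equal (n - 1) * var(x2), about 83.
true_diff <- (n - 1) * var(x2)

ulp          # 2^37, roughly 1.4e11
true_diff    # roughly 83
a - b        # dominated by rounding error, not by the true difference
```

Because the quantity of interest is many orders of magnitude smaller than the resolution of the two terms, the subtraction returns essentially noise.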

Here are two simple, though not especially efficient, solutions:

• Use another formula:
``````
variancia2 <- function(x) {
  n <- length(x)
  media <- mean(x)
  sum((x - media)^2) / (n - 1)
}

variancia2(x2)
# [1] 0.08316728
``````
• Use your formula, but subtract the mean from the vector first; this does not change the variance.
``````
variancia(x2 - mean(x2))
# [1] 0.08316727
``````
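The second solution works because the variance is unchanged by subtracting any constant, not only the mean. As a minimal sketch of that idea, the hypothetical helper `variancia_shift` below (not part of the original answer) shifts by the first element, which avoids the extra pass needed to compute the mean. It assumes the first element is a reasonable proxy for the location of the data.

```r
# Formula from the question, unchanged.
variancia <- function(x) {
  n <- length(x)
  (1 / (n^2 - n)) * (n * sum(x^2) - sum(x)^2)
}

# Hypothetical helper: shift by the first element before applying the formula.
# After the shift the values are small, so the squares no longer swamp
# the difference being computed.
variancia_shift <- function(x) {
  variancia(x - x[1])
}

set.seed(1)
x2 <- runif(1000) + 10^12

variancia_shift(x2)
# Compare with a loose tolerance: var() itself carries a small rounding
# error on data this far from zero.
all.equal(variancia_shift(x2), var(x2), tolerance = 1e-6)
```

If the first element happens to be an outlier, the shift helps less; subtracting the mean, as in the answer, is the safer choice.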

22.02.2014 / 01:02
4

A double-precision (64-bit) floating-point number can be roughly described in three parts:

Layout of a `double`: sign (1 bit), exponent, i.e. the order of magnitude (11 bits), and significand, i.e. the precision (52 bits).

This format can represent magnitudes up to about `10^308`, but with only about `16` significant decimal digits (see `?.Machine` for details on how base `R` handles numbers). Moreover, irrational numbers, and rational numbers whose denominator is not a power of 2, are only approximated.

So a very large number can be represented as a `double`, but not with many exact digits. This can cause serious problems in operations such as addition and subtraction. In the third example, each squared element computed in `variancia` is of the order of `(10^12)^2=10^24` (and the sums are larger still), yet we have only 52 bits for the significant digits; anything beyond that is lost to rounding. When we subtract two such nearly equal numbers, the "good digits" cancel and only the "bad digits" remain, which produces the absurd result.
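These limits can be checked directly in an R session. The snippet below is only an illustrative sketch using base R; the variable names are introduced here and the specific comparisons were chosen for demonstration.

```r
# Number of significand bits (including the implicit leading bit)
# and the machine epsilon of a double.
bits <- .Machine$double.digits    # 53
eps  <- .Machine$double.eps       # 2^-52

# Above 2^53, not every integer is representable any more.
lost_integer <- (2^53 + 1 == 2^53)      # TRUE

# 0.1, 0.2 and 0.3 have denominators that are not powers of 2,
# so they are stored only approximately.
decimal_exact <- (0.1 + 0.2 == 0.3)     # FALSE

# Near 1e24 adjacent doubles are 2^27 (about 1.3e8) apart,
# so adding ten million changes nothing.
absorbed <- (1e24 + 1e7 == 1e24)        # TRUE

c(lost_integer, decimal_exact, absorbed)
```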

One way to solve the problem is to use more numerically stable algorithms, such as those proposed by Marcos. But supposing that is not possible, you can use arbitrary-precision numbers.

In `R`, the package `Rmpfr` (Multiple Precision Floating-Point Reliable) provides numbers with arbitrary precision, at the cost of more memory and run time, so depending on your computer and the problem it is not always feasible.

So, if the algorithm behind `variancia` could not be improved, we could use `Rmpfr`. We would need more than 30 digits of precision, which means more than `log2(10^30)=99.65` bits. Rounding up to 128 bits:

``````
library(Rmpfr)
x2.mpfr <- mpfr(x2, 128)
variancia(x2.mpfr)
# 1 'mpfr' number of precision  128   bits
# [1] 0.0831672741323709434253943823576340475867
``````
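The bit count used above follows from converting decimal digits to binary digits. As a small sketch in base R, the hypothetical helper `bits_for_digits` (a name introduced here, not part of `Rmpfr`) performs that conversion:

```r
# Hypothetical helper: number of bits needed to hold `digits`
# significant decimal digits. Each decimal digit carries log2(10) bits.
bits_for_digits <- function(digits) {
  ceiling(digits * log2(10))
}

bits_for_digits(30)    # 100, so 128 bits leaves some headroom

# For comparison, the decimal digits a 53-bit double can hold:
.Machine$double.digits / log2(10)    # just under 16
```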

22.02.2014 / 02:24