How to calculate the median when the data is in chunks?


Suppose my data is divided into 2 vectors:

x <- c(100, 400, 120)
y <- c(500, 112)

I could calculate the median by joining the two vectors and then using the median function:

median(c(x,y))
[1] 120

Suppose for some reason I cannot load all my data into RAM at once, so I cannot combine the two vectors and then calculate the median.

If I were calculating the mean in chunks this way, I could do the following:

  • Calculate the sum and the number of elements of x
  • Calculate the sum and the number of elements of y
  • Add the sum of x to the sum of y, and the number of elements of x to the number of elements of y
  • Divide one by the other

Example:

x_sum <- sum(x)
x_l <- length(x)
y_sum <- sum(y)
y_l <- length(y)
(x_sum + y_sum)/(x_l + y_l)
[1] 246.4

This would allow me to calculate the average without at any time needing to have the two vectors loaded in RAM.
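The steps above can be wrapped in a small helper that sees one chunk at a time, keeping only the running total and count in RAM (the function name chunk_mean is my own, for illustration, not part of the question):

```r
# Sketch: a streaming mean computed one chunk at a time.
# Only the running total and count stay in memory.
chunk_mean <- function(chunks) {
  total <- 0
  count <- 0
  for (chunk in chunks) {
    total <- total + sum(chunk)
    count <- count + length(chunk)
  }
  total / count
}

chunk_mean(list(c(100, 400, 120), c(500, 112)))
# [1] 246.4
```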

What algorithm could I use to do something like this, but to calculate the median?

Note: I used the R language for the examples, but the answer need not be in R, as long as the algorithm is explained.

    
asked by anonymous 30.12.2016 / 02:08

1 answer


My intuition says that this problem does not have an exact solution, as it does in the case of the mean. Consider a simple case, with two samples x and y, where

x <- c(1, 1, 1)
y <- c(1, 2, 3)

We have

median(c(x, y))
[1] 1

But if we take the mean of the medians, or the median of the medians, we get values different from the one sought:

mean(c(median(x), median(y)))
[1] 1.5
median(c(median(x), median(y)))
[1] 1.5

This is because the median throws away information, even more information than the mean does. For example, the exact sample size does not matter when calculating the median; the only thing that matters is whether it is even or odd.
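A small check (my own example, not from the original text) makes this concrete: two pairs of chunks can have identical per-chunk medians and yet different combined medians, so no function of the chunk medians alone can recover the exact answer:

```r
# Both pairs of chunks have per-chunk medians 1 and 2...
a1 <- c(1, 1, 1); b1 <- c(1, 2, 3)
a2 <- c(0, 1, 2); b2 <- c(2, 2, 2)

c(median(a1), median(b1))  # 1 2
c(median(a2), median(b2))  # 1 2

# ...but the combined medians differ:
median(c(a1, b1))
# [1] 1
median(c(a2, b2))
# [1] 2
```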

So I come back to my initial statement: my intuition says this problem does not have an exact solution, but nothing prevents us from finding an approximate one.

Have you ever heard of the bootstrap? It is a statistical technique for finding the sampling distribution of a statistic about which we have no further information.

The basic idea is to take a sample of size n = x_l + y_l, with replacement, from the two vectors you have, and calculate its median. Repeat this procedure thousands of times. With this, you will have the sampling distribution of the median of your population of interest. It is then possible to calculate the mean or the median of these samples. This link has a pretty cool application of this concept. I wrote a small piece of code that does this, without using R's boot package. In this code I take samples of size 10: a size x_l, drawn from 0 to 10, for the vector x, and a size n - x_l for the vector y. I repeat this procedure one hundred thousand times. In the end, I take the mean of these simulations as the median value of c(x, y):

n           <- 10
repl        <- 100000
median_boot <- numeric(repl)  # preallocate the results vector

for (i in 1:repl){
    # split the sample size n between the two chunks at random
    x_l <- sample(0:n, 1)
    y_l <- n - x_l

    # resample each chunk with replacement
    x_sample <- sample(x, x_l, replace=TRUE)
    y_sample <- sample(y, y_l, replace=TRUE)

    median_boot[i] <- median(c(x_sample, y_sample))
}

mean(median_boot)
1.29458

median(median_boot)
1

sd(median_boot)
0.4767598

hist(median_boot)

Note that the mean, 1.29458, was far from the actual median value, which is 1. However, it was much better than the 1.5 value I estimated earlier, naively. This is because the data I used is pretty badly behaved. These results should look much better if your data has "better" behavior, with more observations, though not necessarily symmetric around the mean.

In addition, it is possible, with the median_boot vector, to create an empirical confidence interval for the median. In this case, just take the 2.5% and 97.5% percentiles to find it:

c(quantile(median_boot, 0.025), quantile(median_boot, 0.975))
 2.5% 97.5% 
  1.0   2.5

Again, the result was not very good because the distribution I used is not good. The results will be more appropriate if you work with a better-behaved distribution, not something degenerate like this example.

Have you tried R packages that allow working with very large data sets without loading them fully into memory? Maybe something like that could help solve your problem, since the data would never be loaded completely into RAM.

    
30.12.2016 / 03:45