My intuition says that this problem does not have an exact solution as in the case of the mean. Thinking of a simple case, with two samples x
and y
, where
x <- c(1, 1, 1)
y <- c(1, 2, 3)
We have to
median(c(x, y))
[1] 1
But if we try to average the medians or the medians of the medians, we get values different from the value sought:
mean(c(median(x), median(y)))
[1] 1.5
median(c(median(x), median(y)))
[1] 1.5
This is due to the fact that the average throw away information, even more information than the average throws. For example, the sample size is not important to calculate the median. The only thing that matters is whether the sample size is even or odd.
So I come back to my initial statement. My intuition says this problem does not have an exact solution , but nothing prevents us from finding an approximate solution.
Ever heard of bootstrap ? It is a statistical technique that allows finding the sample distribution of any statistic about which we have no further information.
The basic idea is to take a sample of size n = x_l + y_l , with replacement, of the two vectors that you have, and calculate the median of each one of them. Repeat this procedure thousands of times. With this, you will have the sample distribution of the median of your population of interest. It will then be possible to calculate the mean or median of these samples. This link has a pretty cool application of this concept . I wrote a small code that does this, without using the package boot
of R
. In this code I take samples of size 10, which can have size x_l
from 0 to 10 for the vector x
and size n-x_l
for the y
vector. I repeat this procedure ten thousand times. In the end, I average these simulations as the median value of c(x,y)
:
n <- 10
repl <- 100000
median_boot <- 0
for (i in 1:repl){
x_l <- sample(0:n, 1)
y_l <- n-x_l
x_sample <- sample(x, x_l, replace=TRUE)
y_sample <- sample(y, y_l, replace=TRUE)
median_boot[i] <- median(c(x_sample, y_sample))
}
mean(median_boot)
1.29458
median(median_boot)
1
sd(median_boot)
0.4767598
hist(median_boot)
Notethatthemean1.29458
wasfarfromthemedianactualvalue,whichis1
.However,itwasmuchbetterthanthe1.5
valueIestimatedearlier,naively.ThisisbecausethedataIusedisprettybadlybehaved.Theseresultsshouldlookalotbetterifyouhavea"better" behavior, with more data, but not necessarily symmetric to the average.
In addition, it is possible, with the median_boot
vector, to create an empirical confidence interval for the median. In this case, just pick up the percentiles at 2.5% and 97.5% to find it:
c(quantile(median_boot, 0.025), quantile(median_boot, 0.975))
2.5% 97.5%
1.0 2.5
Again, the result was not very good because the distribution I used is not good. The results will be more appropriate if you work with a distribution with better behavior, not something degenerate like this example.
You even tried to use R artifacts that allow you to load very large databases in memory ? Maybe something like this might help solve your problem, since the data would not be loaded completely into RAM?