Here is some statistical theory. I thought it up because it is applicable to an aspect of my AI project. I don’t know much statistics; it seems likely that I’m groping towards something that already exists. But here it is.
Suppose that we have a black box which outputs a series of numbers between zero and one. We don’t know how it works, and we want to determine the mean value of its output.
One way we could do this would be to record a series of outputs, and take their mean. This works if the black box’s behavior is regular; that is, the mean will be the same in the distant future as it is right now. But what if the trends of the black box’s output gradually change over time, so that the mean will be different in the future?
An analogous case would be the mean income of a U.S. citizen. This is a mean which changes over time. How do we deal with this sort of case?
One solution would be to re-calculate the mean periodically, taking a new sample every so often and calculating its mean. This is how we produce statistics about mean income.
But we could also compute the mean dynamically. Returning to our black box, we could come up with a scheme by which every time the black box outputs a number, it adjusts our estimate of the mean slightly. This solution is more elegant in the case of an AI which wants to dynamically monitor the trends in its own internal data structures.
In the case of the black box, how do we compute a running estimate of the mean?
Suppose we imitate the conventional way of estimating the mean. We can do this by keeping track of the number of numbers we have observed (the sample size), and a running sum of all of the numbers we have observed. Every time a new number comes down the pipe, we add one to the sample size, add the number to the running sum, and re-compute the mean as the running sum over the sample size.
The problem with this method is that, as time goes by, each number will have a progressively smaller impact on the estimate of the mean. It doesn’t capture the idea that the mean is changing over time, which demands that we put more weight on recent samples than on samples in the distant past.
Let us consider another method. Suppose we keep a running mean (m). Let p = 0.5 be its initial estimate, before we have observed any numbers. Furthermore, we have a number (w, for “weight”) which says how much impact every sample has on the running mean. w is analogous to the sample size. For instance, w = .05 is analogous to a sample size of 20 (1/.05 = 20).
Every time we observe a new number (n), we say m’ = (1 – w) * m + w * n. m’ is the new mean, and m is the old mean.
Let t be the number of numbers we have observed since we started our running estimate, and n1, n2, …, nt be those numbers. Then we can say that, at any time, m = p * (1 – w)^t + n1 * w(1 – w)^(t-1) + n2 * w(1 – w)^(t-2) + … + nt * w(1 – w)^(t-t).
Suppose that w is set to a value analogous to a statistically significant sample size. Then, if the real mean doesn’t change over time, this method of estimating the mean will always keep the estimate near the real mean; m will never change significantly.
But this method will capture fluctuations in the real mean, keeping the estimate near the real mean as the real mean changes. It will do so more in a more fine-grained fashion with larger values of w, or a more coarse-grained fashion with smaller values of w.
The biggest questions for this method are: how do we choose a value for p (the initial estimate of the mean), and how do we choose a value for w (the amount of impact each new number has on our mean estimate, which determines the coarseness with which we follow fluctuations in the mean)?
These are two very difficult questions. The first question is a basic problem of Bayesian statistics: how do we choose our prior probabilities? The second question I haven’t seen before, but it also seems fairly enigmatic.