Rationalists, like those at Less Wrong (think Eliezer Yudkowsky and Scott Alexander), are prone to fetishize Bayes theorem, seeing it as the key to all thought. It isn’t. Bayes is a helpful tool, and no more, and like all tools, it is not always needed. But because of the perceived importance of Bayes, people think they have discovered flaws in it. These supposed flaws are almost always based on simple mistakes, which can go decades without anybody noticing. As in the so-called problem of old evidence.
Here’s what one prominent author (Colin Howson) thinks is the “problem” of old evidence: Can a hypothesis h be confirmed by evidence e if the evidence is old and already known?
The answer will seem obvious: yes. Howson and others say no. That “no”, and the “problem”, arise when people write things like this (as in the link):
Pr(h|e) = [ Pr(e|h)Pr(h) ] / Pr(e).
That might look to you like Bayes theorem, favorite of “rationalists” everywhere, but it is not. It is missing something. The missing parts are what cause the “problem.”
Howson, and many like him, says (modifying his notation so that it’s consistent with mine): “This [existence of background knowledge] has the following unpleasant consequence, however. If e is known at the time h is proposed, then e is in [the background knowledge] and so Pr(e) = Pr(e|h) = 1, giving Pr(h|e) = Pr(h); which means that e gives no support to h.”
Before reading further, and recalling the hint about something missing, see if you can spot the flaw in this thinking.
Don’t cheat. Think.
The answer is this: There is no such thing as “Pr(h)” or “Pr(e)”. While “Pr(h|e)” and “Pr(e|h)” are fine, as such, the bare “Pr(h)” and “Pr(e)” are incomplete.
There is no such thing as unconditional probability: all probability is conditional. Every probability everywhere needs premises, conditions, assumptions, some evidence upon which to pass the judgement. That means “Pr(e)” is impossible. No such creature exists.
We can write, perhaps, Pr(h|K), which is the probability of h given some background knowledge K (the K is from Howson). We could also—and here comes the trouble—write Pr(e|K).
That’s fine as it stands, and it could be as Howson suggests that Pr(e|K) = 1. But that only happens when K includes the premise (or proposition, or assumption, or whatever you want to call it), “e has been observed.” That makes K = “‘bunch of other premises related to h’ & ‘e has been observed’.”
With that K, then indeed Pr(e|K) = 1. (Make sure you see this.)
Let’s rewrite the equation above properly, using this K (two letters put together mean logical “and”, so that “eK” means “e and K”):
Pr(h|eK) = [ Pr(e|hK)Pr(h|K) ] / Pr(e|K).
We have Pr(e|K) = 1, since K says e was observed, which obviously makes the probability of e equal to 1, given e was observed. Of course it does! Adding the h, unless that h says “e is impossible” or something like that, gives Pr(e|hK) = Pr(e|K) = 1. But since logically eK = K, then Pr(h|eK) = Pr(h|K). The math works! Both sides are Pr(h|K).
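To see the arithmetic, here is a minimal sketch with a toy joint distribution over h and e. The numbers are my own illustrative assumptions, not anything from Howson; the point is only that when K already contains “e has been observed”, conditioning on e a second time changes nothing:

```python
# Toy joint distribution over h and e; the numbers are illustrative assumptions.
joint = {("h", "e"): 0.30, ("h", "~e"): 0.10,
         ("~h", "e"): 0.20, ("~h", "~e"): 0.40}

def pr(pred):
    """Sum the probability of every (h, e) outcome satisfying pred."""
    return sum(p for outcome, p in joint.items() if pred(outcome))

# K includes "e has been observed", so every probability is conditioned on e.
pr_e = pr(lambda o: o[1] == "e")                      # 0.50
pr_e_given_K = pr(lambda o: o[1] == "e") / pr_e       # = 1, as the text says
pr_h_given_K = pr(lambda o: o == ("h", "e")) / pr_e   # 0.30 / 0.50 = 0.6

# Since eK = K logically, conditioning on e again is the identical computation:
pr_h_given_eK = pr(lambda o: o == ("h", "e")) / pr_e

print(pr_e_given_K, pr_h_given_K, pr_h_given_eK)  # 1.0 0.6 0.6
```

Both sides come out to Pr(h|K), exactly as the algebra says they must.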
And so it seems e says nothing about h. But that’s not how evidence works.
What happens with evidence in real life is this. We do indeed start with some background knowledge, or surmises, etc. about h. Call that B. B says nothing about e already having been observed. It says stuff about h. We then write:
Pr(h|eB) = [ Pr(e|hB)Pr(h|B) ] / Pr(e|B).
No change, except from K to B. Let’s look at each piece.
Pr(e|hB) is the probability that e can be observed given h is true and B (which are our assumptions). This is so even if e never is observed! Even if e remains a thought experiment. Don’t read more until you grasp this.
Since B is silent on e having been observed (and ignoring “degenerate” situations like hB = “e is impossible”), then 0 < Pr(e|hB) < 1.

Pr(h|B) is our “prior”, given by our background information. Again (and still ignoring degenerate scenarios like B = “h is impossible”), 0 < Pr(h|B) < 1.

Pr(e|B) is the probability e could be true given B, but it says nothing directly about h. We can always “expand” Pr(e|B) like this (using “total probability”):

Pr(e|B) = Pr(e|hB)Pr(h|B) + Pr(e|not-hB)Pr(not-h|B).

The first term on the right we have already handled. The second is similar, where “not-h” is the logical contrary of whatever h is¹. We could find Pr(e|not-hB), the probability e is true given h is false and B, recalling that Pr(h|B) + Pr(not-h|B) = 1 (this works for every h!).
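As a sketch of the total-probability expansion, with made-up numbers (my assumption, purely for illustration):

```python
# Illustrative assumptions: B is silent about e, so these are strictly
# between 0 and 1. The values themselves are invented for the example.
pr_h_given_B = 0.5        # the "prior" Pr(h|B)
pr_e_given_hB = 0.9       # Pr(e|hB)
pr_e_given_nothB = 0.3    # Pr(e|not-h B)

pr_noth_given_B = 1 - pr_h_given_B  # since Pr(h|B) + Pr(not-h|B) = 1

# Total probability: Pr(e|B) = Pr(e|hB)Pr(h|B) + Pr(e|not-hB)Pr(not-h|B)
pr_e_given_B = (pr_e_given_hB * pr_h_given_B
                + pr_e_given_nothB * pr_noth_given_B)

print(pr_e_given_B)  # ≈ 0.6
```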
So as long as
[ Pr(e|hB) / Pr(e|B) ] > 1,
which is to say, as long as the evidence e is more probable under hB than under B alone, then e supports or confirms h. Even if nobody in the world ever observes e! You must get this.
If [ Pr(e|hB) / Pr(e|B) ] < 1, then e disconfirms h. If [ Pr(e|hB) / Pr(e|B) ] = 1, then knowledge of e is irrelevant to h.
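A small sketch, again with invented numbers, showing that the ratio Pr(e|hB)/Pr(e|B) determines whether the posterior rises above, falls below, or equals the prior:

```python
def update(prior, lik_h, lik_noth):
    """Return (Pr(e|hB)/Pr(e|B), Pr(h|eB)) via Bayes with total probability.

    prior    = Pr(h|B), lik_h = Pr(e|hB), lik_noth = Pr(e|not-h B).
    All arguments are illustrative assumptions supplied by the caller.
    """
    pr_e = lik_h * prior + lik_noth * (1 - prior)  # Pr(e|B)
    return lik_h / pr_e, lik_h * prior / pr_e

print(update(0.5, 0.9, 0.3))  # ratio > 1: posterior ≈ 0.75 > 0.5, e confirms h
print(update(0.5, 0.3, 0.9))  # ratio < 1: posterior ≈ 0.25 < 0.5, e disconfirms h
print(update(0.5, 0.6, 0.6))  # ratio = 1: posterior = prior, e is irrelevant
```

Note that nothing in the computation asks when, or whether, e was actually observed.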
That’s it. The simple solution to the “problem”. It does not matter when e is observed, or even if it is observed. It could be ancient wisdom—like apples fall onto heads and do not soar into the air. And h is “gravity attracts”. Or it could be entirely novel.
It only matters whether e is already part of the background knowledge, as in the “problem” which uses K, or whether it is considered on its own, as with B.
There has been a lot of ink spilled on this “problem”, all of it because of bad notation. That notation became popular because it was forgotten that all probability is conditional. Change the conditions, change the probability.
¹ h is a complex proposition, usually, of the form P_1 & P_2 & … & P_q, where each P_i is some proposition; thus not-h is not-“P_1 & P_2 & … & P_q”. Only one of the P_i need be false for not-h to be true. Failure to understand this leads to much confusion about what models and theories are.
This is not the first time we have tackled this subject; the first article, however, was put in obscure terms in answer to a technical question, and the point was lost.