The Probabilistic Age
Q: Why are people so uncomfortable with Wikipedia? And Google? And, well, that whole blog thing?
A: Because these systems operate on the alien logic of probabilistic statistics, which sacrifices perfection at the microscale for optimization at the macroscale.
Q: Huh?
A: Exactly. Our brains aren't wired to think in terms of statistics and probability. We want to know whether an encyclopedia entry is right or wrong. We want to know that there's a wise hand (ideally human) guiding Google's results. We want to trust what we read.
When professionals--editors, academics, journalists--are running the show, we at least know that it's someone's job to look out for such things as accuracy. But now we're depending more and more on systems where nobody's in charge; the intelligence is simply emergent. These probabilistic systems aren't perfect, but they are statistically optimized to excel over time and large numbers. They're designed to scale, and to improve with size. And a little slop at the microscale is the price of such efficiency at the macroscale.
But how can that be right when it feels so wrong?
There's the rub. This tradeoff is just hard for people to wrap their heads around. There's a reason why we're still debating Darwin. And why Jim Surowiecki's book on Adam Smith's invisible hand is still surprising (and still needed to be written) more than 200 years after the great Scotsman's death. Both market economics and evolution are probabilistic systems, which are simply counterintuitive to our mammalian brains. The fact that a few smart humans figured this out and used that insight to build the foundations of our modern economy, from the stock market to Google, is just evidence that our mental software has evolved faster than our hardware.
Probability-based systems are, to use Kevin Kelly's term, "out of control". His seminal book by that name looks at example after example, from democracy to bird-flocking, where order arises from what appears to be chaos, seemingly reversing entropy's arrow. The book is more than a dozen years old and decades from now we'll still find the insight surprising. But it's right.
Is Wikipedia "authoritative"? Well, no. But what really is? Britannica is reviewed by a smaller group of reviewers with higher academic degrees on average. There are, to be sure, fewer (if any) total clunkers or fabrications than in Wikipedia. But it's not infallible either; indeed, it's a lot more flawed than we usually give it credit for.
Britannica's biggest errors are of omission, not commission. It's shallow in some categories and out of date in many others. And then there are the millions of entries that it simply doesn't--and can't, given its editorial process--have. But Wikipedia can scale to include those and many more. Today Wikipedia offers 860,000 articles in English - compared with Britannica's 80,000 and Encarta's 4,500. Tomorrow the gap will be far larger.
The good thing about probabilistic systems is that they benefit from the wisdom of the crowd and as a result can scale nicely both in breadth and depth. But because they do this by sacrificing absolute certainty on the microscale, you need to take any single result with a grain of salt. As Zephoria puts it in this smart post, Wikipedia "should be the first source of information, not the last. It should be a site for information exploration, not the definitive source of facts."
The same is true for blogs, no single one of which is authoritative. As I put it in this post, "blogs are a Long Tail, and it is always a mistake to generalize about the quality or nature of content in the Long Tail--it is, by definition, variable and diverse." But collectively they are proving more than an equal to mainstream media. You just need to read more than one of them before making up your own mind.
Likewise for Google, which seems both omniscient and inscrutable. It makes connections that you or I might not, because they emerge naturally from math on a scale we can't comprehend. Google is arguably the first company to be born with the alien intelligence of the Web's large-N statistics hard-wired into its DNA. That's why it's so successful, and so seemingly unstoppable.
Paul Graham puts it beautifully:
"The Web naturally has a certain grain, and Google is aligned with it. That's why their success seems so effortless. They're sailing with the wind, instead of sitting becalmed praying for a business model, like the print media, or trying to tack upwind by suing their customers, like Microsoft and the record labels. Google doesn't try to force things to happen their way. They try to figure out what's going to happen, and arrange to be standing there when it does."
The Web is the ultimate marketplace of ideas, governed by the laws of big numbers. That grain Graham sees is the weave of statistical mechanics, the only logic that such really large systems understand. Perhaps someday we will, too.
[Update: Nicholas Carr, who seems to have inherited the Clifford Stoll chair of reliable techno-skepticism, has a clever and well-written response here.]


Wikipedia is not a probabilistic system.
I do not really "understand" Google because the math is beyond me, but I trust it. I understand Wikipedia just fine, which is why I don't trust it.
Information systems are only useful to the user at the point in time at which the system is accessed. At the time of a Google search you are presented with a mathematically determined 'average' value; the sum wisdom of the internet's hyperlinks. It is an average value, and even if 30% of the links on the web are "wrong" you still get the right answer.
Wikipedia does not work like that. When you access Wikipedia you do not get the average value of an article; you get the last author's value only. Instead of getting a probabilistic average you instead are getting a single data-point.
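A rough way to see the contrast being drawn here is a toy simulation. This is only a sketch under assumed numbers: the 30% error rate comes from the comment above, while the number of sources, the trial count, and the function names are hypothetical, and neither Google nor Wikipedia literally works this way.

```python
import random
from collections import Counter


def aggregate_answer(votes):
    """Return the most common answer among many independent sources."""
    return Counter(votes).most_common(1)[0][0]


def simulate(n_sources=201, p_wrong=0.30, trials=1000, seed=0):
    """Compare an aggregated answer with a single data point.

    Each source is independently "wrong" with probability p_wrong.
    Returns (share of trials where the majority was right,
             share of trials where the last source alone was right).
    """
    rng = random.Random(seed)
    majority_right = 0
    last_right = 0
    for _ in range(trials):
        votes = [
            "right" if rng.random() >= p_wrong else "wrong"
            for _ in range(n_sources)
        ]
        if aggregate_answer(votes) == "right":
            majority_right += 1
        # The "last author's value": one data point, no aggregation.
        if votes[-1] == "right":
            last_right += 1
    return majority_right / trials, last_right / trials


if __name__ == "__main__":
    majority, single = simulate()
    print(f"majority right: {majority:.3f}, single source right: {single:.3f}")
```

Under these assumptions the majority is almost never wrong, while a single source is wrong about as often as the underlying error rate. Whether a Wikipedia article really behaves like a single data point is exactly what the replies below dispute.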
Google is "wrong" only when the entire web is wrong. This happens on occasion, such as when an urban legend becomes more popular than the truth (when it's done purposefully it's called a Google Bomb). Wikipedia is wrong when a single person is wrong. It is also far easier to "bomb" Wikipedia. Anyone with a login can do it with 1 minute's work. With 860,000 articles, an error in an obscure article can remain undetected for some time.
(I found an article where someone had inserted "Jake is the best!" or something like that in the middle of a sentence. As an experiment I left it there to see how long it took for someone to find it. It's still there 4 months later, and that's with an obvious error. An error in the data that only an authoritative source would know was wrong is likely to last even longer.)
To use an analogy most survivors of the Dot.Bomb would understand, a Google search is like predicting stock performance by taking the average stock price of every Wall St. analyst (occasionally wrong and sometimes very wrong, but usually close); while a Wikipedia search is like doing the same by trolling chat rooms for tips.
Posted by: Brock | December 18, 2005 at 05:39 PM
Brock,
In the popular entries with many eyes watching, Wikipedia becomes closer to the statistical average of the views of the participants, weighted by such factors as the authority of each as defined by the others (frequent contributors to any entry tend to win any vote-offs). Studies have shown that for such entries, the mean time to repair vandalism of the sort you describe is measured in minutes. As Wikipedia grows that rapid self-repairing property will spread to more entries.
But the main point I was making about Wikipedia was not that any single entry is probabilistic, but that the *entire encyclopedia* is probabilistic. Your odds of getting a substantive, up-to-date and accurate entry for any given subject are excellent on Wikipedia, even if every individual entry isn't excellent.
To put it another way, the quality range in Britannica goes from, say, 5 to 9, with an average of 7. Wikipedia goes from 0 to 10, with an average of, say, 5. But given that Wikipedia has ten times as many entries as Britannica, your chances of finding a reasonable entry on the topic you're looking for are actually higher on Wikipedia.
That doesn't mean that any given entry will be better, only that the overall value of Wikipedia is higher than Britannica when you consider it from this statistical perspective.
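The arithmetic behind this argument can be sketched in a few lines. The quality ranges are the ones given above and the article counts are the ones quoted in the post; everything else is an assumption made only for illustration: quality is treated as uniformly spread across each range, "reasonable" is taken to mean a score of 6 or better, and the universe of topics a reader might look up is set at a hypothetical 1,000,000.

```python
def chance_of_reasonable_entry(n_entries, q_min, q_max, threshold, n_topics):
    """Chance that a randomly chosen topic has an entry scoring >= threshold.

    Assumes every topic is equally likely to be looked up and that entry
    quality is uniformly distributed between q_min and q_max.
    """
    coverage = min(1.0, n_entries / n_topics)
    if threshold <= q_min:
        good_share = 1.0
    elif threshold >= q_max:
        good_share = 0.0
    else:
        good_share = (q_max - threshold) / (q_max - q_min)
    return coverage * good_share


N_TOPICS = 1_000_000   # hypothetical number of topics people look up
THRESHOLD = 6          # hypothetical cutoff for a "reasonable" entry

britannica = chance_of_reasonable_entry(80_000, 5, 9, THRESHOLD, N_TOPICS)
wikipedia = chance_of_reasonable_entry(860_000, 0, 10, THRESHOLD, N_TOPICS)

if __name__ == "__main__":
    print(f"Britannica: {britannica:.3f}")
    print(f"Wikipedia:  {wikipedia:.3f}")
```

With these made-up parameters the coverage gap outweighs the quality gap. The same numbers also say that, given an entry exists, the Britannica one is more likely to clear the bar (0.75 versus 0.4), which is the distinction between the encyclopedia as a whole and any single entry.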
Posted by: chris anderson | December 18, 2005 at 06:03 PM
Either way it takes the academics in ivory towers out of the equation, which is both a very good and a very bad thing.
Posted by: kitchen hand | December 18, 2005 at 08:04 PM
Chris,
I agree that Wikipedia as a whole has more total value than Britannica as a whole. It probably does produce more social utility than Britannica, just as the Web + Google produces more utility than a good library + a card catalog.
But no one needs the whole of Wikipedia. They need the article they need, and they need it to be (mostly) right.
My point was that individual Google searches are probabilistic, but that individual Wikipedia articles (the ones in the Long Tail at any rate) are not. Since individual searches and articles are what matter to individual people, I think that's the more important thing to focus on.
I think Wikipedia would be more probabilistic to the user if disputed issues, history of changes, and "voting" were displayed in the actual article without having to comb through the changes. Put the statistics of opinion right out in front where the intelligent reader can judge them for himself.
I just want to make clear that I think Wikipedia is great in a lot of ways, but it is engineered poorly. Wikipedia is a lot like Communism - a nice idea, but inappropriate for humans. Too many of us have motivations far from the pursuit of objective truth. It would be far better if each author could write his own, complete version (perhaps borrowing sections using a Creative Commons license). If you don't like it, write your own, but don't mess with his. Then all a reader has to do is find both articles, read them, and judge for himself.
Of course Step 1, "finding", brings us back to Google ... :-)
Posted by: Brock | December 18, 2005 at 09:56 PM
Brock;
You don't care about the entire Google database either, just one or two entries. PageRank isn't an average either, it's basically whoever has the most links today (with weighting).
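To make "most links, with weighting" a bit more concrete, here is a minimal power-iteration sketch of the PageRank idea as it is usually described in public. The toy link graph, the damping factor of 0.85, the iteration count, and the function name are all illustrative assumptions; this is not a description of Google's actual system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a small link graph.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score, with scores summing to about 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)

if __name__ == "__main__":
    for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In this toy graph "c" ranks highest because it has the most inbound links, but "a" ranks well above "b" and "d" on the strength of a single link from "c". That is the weighting: a link counts for more when it comes from a page that itself scores highly.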
I think you make an erroneous argument, that the latest Wikipedia article is the result of only the last person's edit. This would be true if every edit involved a complete rewrite of the article. This is astronomically rare. Almost all changes are incremental, and as a matter of practical interest they're often reviewed by the most recent contributors. As such, the wiki article you view is more of an average, or better an aggregation, of all previous edits. The most recent edit might be less trusted than the previous ten, but it usually represents a small portion of the article.
Add to that, if you have even the slightest doubt about something, you can peruse the article history to find when such a crazy thing was added.
Then there's the human habit of yielding to people who seem to know what they're talking about. This means that uninformed people tend to avoid putting in the work to contest something they don't understand, and informed and motivated people tend to do most of the work. Wikipedia's NPOV policy, maintained by a crowd without the natural stimuli towards mob mentality, means that demagogues naturally lose. This is rather unlike the practice of mid-sized groups that produce traditional encyclopedias.
And finally, I have to say that anyone who regards any *single* source as authoritative gets what they deserve. Wikipedia is my first stop, and it's sometimes my last stop (for revisions) when I find out most authoritative sources say something a little different.
Just try to write a report on something like witchcraft based on the Encyclopedia Britannica I grew up with. It won't even get you started. Wikipedia will though, because contributors try to be comprehensive to all input, not authoritative about what something should be. That's precisely Wikipedia's strength: It's not meant to be authoritative, but it will take authoritative input (even when two authorities viciously disagree). It doesn't take academic authorities out of the equation--they're reduced from all powerful to merit-weighted influence.
Posted by: JCJ | December 18, 2005 at 11:20 PM
Brock makes some good points about Wikipedia. Surowiecki explains in WoC that a good "aggregation function" is critical to extracting the wisdom from the crowd, such as a voting mechanism or calculating the average. Wikipedia doesn't really have one. Chris suggests that "frequent contributors" win vote-offs, but that is rare, and it puts the quality issue back in the hands of a few. (Google's aggregation function is the math that Brock and I don't understand, and is their core asset).
There is another concept relevant to the WoC that Surowiecki does not spend much time on called the Condorcet Jury Theorem, which says that if the members of the crowd each individually have a less than 50% chance of getting the answer right, then the chance the