In January, in my column “Beware of Greeks Bearing Spam,” I wrote of spam attacks, one of the perils of the brave new World Wide Web. The solution, in my case, was one of those “Captcha” filters, which makes anyone who wishes to email me from my website decipher a pair of words before their email is sent.
The “Captcha” filter has worked like a charm. The spam that used to flood my inbox has practically vanished, thanks to the filter’s ability to distinguish between human beings and “bots,” those evil automated spam generators.
In fact, “Captcha” is an acronym for “completely automated public Turing test to tell computers and humans apart,” a nod to computer pioneer Alan Turing. Turing was one of the British codebreakers who cracked the German Enigma code during World War Two, and he devised the famous Turing test, which is still used by computer scientists today as a benchmark of “artificial intelligence.”
The Turing test is very simple. A computer is said to “think” if a human interrogator, through conversation, can’t tell that it’s a computer.
The Captcha filter flips that principle around: human beings are pretty good at deciphering wavy words; whatever can’t decipher them must be a computer.
I’ve been on both sides of the Captcha fence. As a user of websites, I find it irritating to have to submit to the wavy word test. Sometimes, even I — a bona fide human being — have trouble making sense of the wiggly letters.
On the other hand, Captcha has materially improved my life by blocking an ocean of phony pharmaceutical ads.
It’s estimated that something like 200 million Captchas are deciphered each day. At roughly nine seconds apiece, that amounts to about 500,000 man-hours a day. Wouldn’t it be nice if all that effort were put to productive use?
Well, apparently, it is. A company called reCaptcha, which is now used by about three quarters of the websites that employ a Captcha filter, is taking that vast cloud of seemingly wasted human brainpower and using it to perfect digital versions of old manuscripts.
Here’s how this ingenious system works. Let’s say you’re scanning a 19th century manuscript for Google Books, with the goal of offering its obscure text to the public. Most of the process is automated. The manuscript is scanned, then that scan is processed by Optical Character Recognition (OCR) software, which converts an image of the page into a file full of actual English words.
There’s plenty of magic in that process. Think of the difference between a photograph of one of my newspaper columns and the Microsoft Word document I used to write it. You wouldn’t be able to search the photograph for keywords, but you could easily search the Word document.
The OCR software is very good at converting the “photograph” of the manuscript page into a kind of “Word” document, but it’s not perfect. Smudges can confuse the software, as can broken bits of text, or even rare or misspelled words.
The geniuses at reCaptcha have figured out how to farm out these problem words to the general public, which is providing solutions without even realizing it. There’s a very good chance that one of those two wavy words you’ve been entering in the blank box is actually derived from an old manuscript!
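For the technically curious, here is a minimal Python sketch of how such a scheme might work. It assumes that one of the two words is a “control” word whose answer is already known, and that a reading of the mystery word is trusted only once several people who passed the control word agree on it. Those details are assumptions made for illustration, not a description of reCaptcha’s actual code, and the names WordPuzzle and VOTES_NEEDED are hypothetical.

```python
from collections import Counter

# Hypothetical threshold: how many matching human answers we want
# before trusting a transcription. This number is an assumption.
VOTES_NEEDED = 3


def normalize(text):
    """Ignore case and surrounding spaces when comparing answers."""
    return text.strip().lower()


class WordPuzzle:
    """One challenge: a control word with a known answer, plus a word
    the OCR software could not read."""

    def __init__(self, control_answer):
        self.control_answer = normalize(control_answer)
        self.votes = Counter()  # guesses collected for the unknown word

    def submit(self, control_guess, unknown_guess):
        """Return True if the visitor passes the test.

        A visitor who gets the control word wrong is treated as a bot,
        and the guess for the unknown word is thrown away.
        """
        if normalize(control_guess) != self.control_answer:
            return False
        self.votes[normalize(unknown_guess)] += 1
        return True

    def accepted_transcription(self):
        """Return the agreed reading of the unknown word, or None if
        there is no consensus yet."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= VOTES_NEEDED else None
```

In this sketch, each visitor who types the control word correctly casts one vote for the mystery word. Once three visitors agree, accepted_transcription() returns their shared answer, and the manuscript gets one more word right.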
When people talk about the transformative power of the Internet, they’re often talking about a phenomenon called “distributed computing,” which is a way of breaking a huge problem into lots and lots of much smaller problems.
One of the oldest examples of this is the SETI@home project, which has been running since 1999. “SETI” stands for “search for extraterrestrial intelligence,” as in, looking for aliens in outer space. Every day, the massive Arecibo radio telescope in Puerto Rico gathers gigabits of recordings of radio noise from deep space. Somewhere in this haystack of data, there could be the needle of a radio message from another civilization. The problem is sifting through all the noise.
Crunching that much data would be a workout even for a supercomputer, but by divvying the data into little chunks, and farming out those chunks to tens of thousands of civic-minded PC owners, a Herculean task is converted into lots of tiny tasks, each one small enough to be handled by a simple screensaver program.
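The divide-and-farm-out idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not SETI@home’s actual software: the “signal” is just a list of numbers, the “analysis” is simply finding the loudest one, and a pool of threads on a single machine stands in for the volunteers’ PCs. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(samples, chunk_size):
    """Cut one long recording into small, equal-sized pieces
    (the last piece may be shorter)."""
    return [samples[i:i + chunk_size]
            for i in range(0, len(samples), chunk_size)]


def strongest_signal(chunk):
    """The tiny job a single volunteer machine would do:
    report the loudest sample in its own chunk."""
    return max(chunk)


def search(samples, chunk_size=4, volunteers=3):
    """Farm the chunks out to a pool of workers, then combine
    their small answers into one overall answer."""
    chunks = split_into_chunks(samples, chunk_size)
    with ThreadPoolExecutor(max_workers=volunteers) as pool:
        results = list(pool.map(strongest_signal, chunks))
    return max(results)
```

No single worker ever sees the whole recording. Each one handles a small piece, and the coordinator only has to compare the handful of answers that come back.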
SETI@home is just one example of distributed computing, a movement aimed at putting millions of sleeping computers to productive use. Other projects span the sciences, from earthquake detection to modeling the Milky Way.
The idea that “many hands make light work” is nothing new. But when you add to those hands the dormant computing power of a world full of PCs, laptops, and smartphones, not even the sky’s the limit.
This column was published in the Perry Co Times on 14 April 2011.
For more information, please contact Mr. Olshan at writing@matthewolshan.com