Sunday, May 11, 2008

Google Speaks in Tongues

Teaching a computer to understand languages isn't rocket science -- it's not nearly that easy, said Peter Norvig, director of research at Google (Nasdaq: GOOG).

It takes a limited number of calculations to send a spacecraft to the moon, Mars or other planets. And while those calculations aren't simple, a computer can manage them fairly easily, he said.

But learning what words mean, how they fit together and how they translate into other languages is much more challenging, he said.

Rules and Exceptions

"In physics, we've been able to use computers very well for a long time. We can get our spacecraft to the moon or Mars very accurately," Norvig said. "But part of the problem with language is there's lots and lots of rules, and there are lots and lots of exceptions to those rules."

About two years ago, rather than encoding grammar rules, Google began taking a different approach to teaching a computer to understand languages, one closer to the way humans learn them, he said.

Every Word Counts

The strategy comes down to programming the computer to learn through examples. Exposed to an abundance of text in a specific language, the computer learns to pick out patterns, Norvig said.

And if you teach it to compare two different languages side by side, it can figure out which words or characters generally correspond to one another.

"Most of the answer to how you do this is counting -- it's just the fancy phrase for counting is 'probability theory,'" Norvig said.
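The counting Norvig describes can be sketched in a few lines. This is a minimal illustration, not Google's actual system: it tallies how often each English word co-occurs with each Spanish word across aligned sentence pairs, then turns the counts into rough translation probabilities. The toy corpus is invented for the example.

```python
from collections import Counter

# Toy parallel corpus: (English sentence, Spanish sentence) pairs.
# These sentences are invented purely for illustration.
parallel = [
    ("the dog runs", "el perro corre"),
    ("the cat runs", "el gato corre"),
    ("the dog sleeps", "el perro duerme"),
]

pair_counts = Counter()   # (english_word, spanish_word) co-occurrences
word_counts = Counter()   # how often each English word appears

for en, es in parallel:
    for e in en.split():
        word_counts[e] += 1
        for s in es.split():
            pair_counts[(e, s)] += 1

def translation_prob(e, s):
    """Estimate P(s | e) by simple co-occurrence counting."""
    return pair_counts[(e, s)] / word_counts[e] if word_counts[e] else 0.0
```

With even this tiny corpus, `translation_prob("dog", "perro")` comes out higher than `translation_prob("dog", "gato")`, because "perro" appears in every sentence that contains "dog" while "gato" appears in none. Real systems refine these raw counts iteratively, but counting is where it starts.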

Google's language tools, for example, let you search for a word or phrase in English, find results among Web sites written in Spanish, and translate them so the English-language user can sort through those links in English.

Building a Collection

So far, it works with about 15 languages, but the hope is to add more soon, he said.

The tools also let you translate Web pages and text, among other things.

The key to building the language tools program was to feed it lots and lots of texts, gathering them from groups that already have documents translated into several languages, such as international news sites and United Nations archives, Norvig said.

"Then we build a model that says, 'Here's all these translations, and we know this page is a translation of that page, but we don't know exactly which corresponds to which,'" Norvig said. "What we have, though, is probabilities. Like the first sentence in English is similar to the first sentence in Chinese, but it could be the first two sentences, the first three, or it could be one to one."

After one example, the computer is still confused. But after a million examples, it starts to make associations that make sense, he said.

For instance, a Chinese character may come up often in relation to the English word "dog" or "terrier." And from that the computer learns to make a connection, he said.
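The "dog"/"terrier" association can be made concrete with a rough score of how much more often a character appears near the English word than chance would predict, a PMI-like ratio. All counts below are hypothetical, invented for the sketch; the system learns real counts from millions of sentence pairs.

```python
from collections import Counter

# Hypothetical counts: how often each Chinese character co-occurs with
# the English word "dog" in aligned sentences, and how often each
# character appears overall in the corpus.
cooccur = Counter({"狗": 950, "的": 400, "跑": 120, "猫": 15})
char_totals = Counter({"狗": 1000, "的": 90000, "跑": 5000, "猫": 2000})
dog_total = 1200          # occurrences of "dog" overall
corpus_total = 100000     # total character occurrences (assumed)

def association(char):
    """Ratio of observed co-occurrence rate to the character's base
    rate: values far above 1 suggest a genuine association."""
    return (cooccur[char] / dog_total) / (char_totals[char] / corpus_total)

best = max(cooccur, key=association)
```

The common particle 的 co-occurs with "dog" often in raw counts, but it co-occurs with everything, so its association score stays near 1; 狗, which is rare in general but frequent next to "dog", scores far higher. That is how sheer counting, at scale, separates real translations from coincidence.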

"We've been able to do this, and our translation software is usually right at the top of a search," Norvig said. "And we've even been able to do this in some languages where nobody on the team speaks the language."
