Tuesday, May 15, 2007

How Google translates without understanding

The Register: "The Google approach is a lesson in practical software development: try things and see what sticks. It has just a few major steps:
1. Google starts with lots and lots of paired-example texts, like formal documents from the United Nations, in which identical content is expertly translated into many different languages. With these documents they can discover that 'white house' tends to co-occur with 'casa blanca,' so that the next time they have to translate a text containing 'white house' they will tend to use 'casa blanca' in the output.
2. They have even more untranslated text in each language, which lets them make models of 'well-formed' sentence fragments (for example, preferring 'white house' to 'house white'). So the raw output from the first translation step can be further massaged into (statistically) nicer-sounding text.
3. Their key for improving the system - and winning competitions - is an automated performance metric, which assigns a translation quality number to each translation attempt. More on this fatally weak link below."
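
The three steps quoted above can be illustrated with a toy sketch. This is not Google's system: the tiny corpora, the function names, and the use of clipped unigram precision as the "quality number" are all assumptions made here for illustration. It only shows the flavor of each step: counting which words co-occur in paired sentences, counting word pairs in monolingual text to prefer "white house" over "house white", and scoring a candidate translation against a reference.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data, invented for this sketch.
PARALLEL = [
    ("the white house", "la casa blanca"),
    ("a white house", "una casa blanca"),
    ("the white car", "el coche blanco"),
]
MONOLINGUAL = ["the white house", "a white house", "the white car", "a big house"]


def cooccurrence_counts(pairs):
    """Step 1: count how often a source word and a target word
    appear in the same paired sentence."""
    counts = Counter()
    for src, tgt in pairs:
        for s, t in product(set(src.split()), set(tgt.split())):
            counts[(s, t)] += 1
    return counts


def bigram_counts(sentences):
    """Step 2: count adjacent word pairs in untranslated text,
    a crude stand-in for a model of 'well-formed' fragments."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def unigram_precision(candidate, reference):
    """Step 3: a deliberately simple automated score, the fraction of
    candidate words that also occur in the reference (clipped by count)."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / total


if __name__ == "__main__":
    cooc = cooccurrence_counts(PARALLEL)
    print(cooc[("white", "casa")], cooc[("white", "coche")])

    bigrams = bigram_counts(MONOLINGUAL)
    print(bigrams[("white", "house")], bigrams[("house", "white")])

    print(unigram_precision("house white the", "the white house"))
```

Note that the toy score in step 3 ignores word order entirely, so a scrambled sentence scores as well as a correct one. Real metrics are more elaborate, but the sketch hints at why an automated quality number can be a weak proxy for actual translation quality.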

