Saturday, November 25, 2006

Language Log: Morphemedar

Language Log: Morphemedar

When I read about the pattern apparently established by "gaydar" (a reanalysis of ra(dio)-dar that treats "-dar" as a suffix), the word seemed very familiar even though I was seeing it for the first time. Then Arkadiy Gaydar (Аркадий Гайдар), a Russian writer of the Leninist era, came to mind. Looking it up on Google, I found 1,130,000 ghits for Гайдар against 930,000 ghits for the English spelling. This made me think of the usefulness of a phonetic search engine. I built one a long time ago (1994), using finite state transducers from Xerox PARC; it ended up being used in telephone directories to look up names from an approximate spelling (or just from knowing how they are pronounced...)
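The Xerox PARC system used finite state transducers, which are not reproduced here; but the basic idea of matching names by sound rather than spelling can be sketched with a Soundex-style key (a deliberate simplification, using the classic Soundex consonant classes):

```python
def phonetic_key(name: str) -> str:
    """Very rough phonetic key: collapse similar-sounding consonants
    into digit classes (illustrative only, not the FST approach)."""
    groups = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }
    name = name.lower()
    key = name[0]                      # keep the first letter as-is
    prev = groups.get(name[0], "")
    for ch in name[1:]:
        code = groups.get(ch, "")
        if code and code != prev:      # skip vowels and repeated codes
            key += code
        prev = code
    return (key + "000")[:4]           # pad/truncate to 4 characters

# Different spellings, same pronunciation, same key:
print(phonetic_key("Gaidar"), phonetic_key("Gaydar"))  # g360 g360
```

A directory lookup would index every name under its key, so a query spelled "Gaidar" retrieves entries stored as "Gaydar" and vice versa.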

Saturday, October 21, 2006

Why We Can Be Confident of Turing Test Capability Within a Quarter Century

Why We Can Be Confident of Turing Test Capability Within a Quarter Century: "Computer language translation continues to improve gradually. Because this is a Turing-level task—that is, it requires full human-level understanding of language to perform at human levels—it will be one of the last application areas to compete with human performance. Franz Josef Och, a computer scientist at the University of Southern California, has developed a technique that can generate a new language-translation system between any pair of languages in a matter of hours or days. All he needs is a 'Rosetta stone'—that is, text in one language and the translation of that text in the other language—although he needs millions of words of such translated text. Using a self-organizing technique, the system is able to develop its own statistical models of how text is translated from one language to the other and develops these models in both directions.
This contrasts with other translation systems, in which linguists painstakingly code grammar rules with long lists of exceptions to each rule. Och's system recently received the highest score in a competition of translation systems conducted by the U.S. Commerce Department's National Institute of Standards and Technology."
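The article doesn't say which statistical models Och's system uses, but the classic textbook illustration of learning translation probabilities from a parallel "Rosetta stone" is IBM Model 1 trained with expectation-maximization, sketched here on toy data:

```python
from collections import defaultdict

# Tiny hypothetical parallel corpus (the "Rosetta stone").
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

# IBM Model 1-style EM: start with uniform t(f|e), iterate expected counts.
t = defaultdict(lambda: 1.0)  # unnormalized uniform start
for _ in range(10):
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)   # normalize over candidate e's
            for e in es:
                c = t[(f, e)] / z            # expected alignment count
                count[(f, e)] += c
                total[e] += c
    for (f, e) in count:
        t[(f, e)] = count[(f, e)] / total[e]  # M-step: re-estimate t(f|e)

print(round(t[("haus", "house")], 2))  # approaches 1.0 as EM iterates
```

No grammar rules are coded anywhere: the co-occurrence statistics alone pull "haus" toward "house" and "buch" toward "book", which is the self-organizing behavior the quote describes.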

Saturday, September 23, 2006

IBM technology translates Arabic media broadcasts to English

IBM Research Press Resources IBM technology translates Arabic media broadcasts to English: "Codenamed 'TALES' (for Translingual Automatic Language Exploitation System), the IBM technology processes the audio signal from Arabic television and radio stations and translates its spoken content into English text. Once this text is indexed by the CriticalTV platform, Critical Mention's clients will be able to conduct real-time searches of Arabic media, and receive alerts instantly when a search term is detected."
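The downstream alerting step (not IBM's TALES or the CriticalTV platform, whose internals aren't described) amounts to matching subscribed search terms against each chunk of translated text as it arrives; a toy sketch:

```python
# Hypothetical subscriptions: client name -> list of watched terms.
def alerts(translated_text: str, subscriptions: dict[str, list[str]]):
    """Return {client: matched terms} for one chunk of translated text."""
    text = translated_text.lower()
    return {
        client: [term for term in terms if term.lower() in text]
        for client, terms in subscriptions.items()
        if any(term.lower() in text for term in terms)
    }

hits = alerts("Officials discussed the oil pipeline today.",
              {"desk-1": ["oil pipeline"], "desk-2": ["election"]})
print(hits)  # {'desk-1': ['oil pipeline']}
```

A production system would run this over an inverted index rather than substring scans, but the real-time alert semantics are the same.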

Tuesday, September 12, 2006

There's no data like more data...

Intelligent Enterprise Magazine: Google, Competitors Look Toward the Ultimate Search: "'Page rank is one factor with which we work; others are classification, clustering and synonym finding,' says Peter Norvig, Google's director of search quality. Norvig adds that Google is also working with technologies such as statistical machine translation, speech recognition and entity detection. The plan is to leverage what Google 'owns' on the Web to learn as many words, and consequent word relations, as possible. That, he says, would enable intuitive, cognitive 'conversations' to take place between searcher and search engine.
'We are on our way to learning from more than 1 trillion words procured from public Web pages, where others may have a billion,' he says, adding, 'there's no data like more data. ... Regardless of how clever the algorithm, the number of words is a critical factor.'"
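Norvig's "no data like more data" point is easy to see with even the simplest model: maximum-likelihood n-gram estimates are just ratios of counts, so their coverage and reliability grow directly with corpus size. A toy bigram estimator (an illustration, not Google's system):

```python
from collections import Counter

def bigram_model(text: str):
    """Maximum-likelihood bigram probabilities P(w2 | w1) from raw text."""
    words = text.lower().split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = bigram_model("the cat sat on the mat the cat ran")
print(p("the", "cat"))  # 2 of the 3 "the" tokens are followed by "cat"
```

With nine words, most word pairs have zero counts; with a trillion words, even rare constructions get usable estimates, regardless of how clever the algorithm is.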

Friday, August 04, 2006

Google is sharing N-gram data

Official Google Research Blog: All Our N-gram are Belong to You: "we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times."
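The released data is plain text, one n-gram per line with a tab-separated count (assuming the published layout of the Web 1T 5-gram corpus; the sample lines and counts below are made up). A minimal parser:

```python
# Hypothetical sample lines in the published format: words, tab, count.
lines = [
    "ceramics collectables collectibles and more\t43",
    "ceramics collectables fine art and\t50",
]

def parse_ngram(line: str):
    """Split one data line into (tuple of words, count)."""
    ngram, count = line.rsplit("\t", 1)
    return tuple(ngram.split()), int(count)

counts = dict(parse_ngram(line) for line in lines)
print(counts[("ceramics", "collectables", "collectibles", "and", "more")])  # 43
```

At 1.1 billion five-grams the real files are processed as a stream, but each record has exactly this shape.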

Saturday, July 29, 2006

Check this out!

AMTA 2006 - Boston, MA: "Context-Based Machine Translation Authors: Jaime Carbonell, Steve Klein, David Miller, Mike Steinbaum, Tomer Grassiany and Jochen Frey"

This is the most exciting new MT approach around, and Meaningful Machines will finally tell us something more detailed. The corresponding patents have been out for a while, but I am looking forward to reading this paper...

UCI researchers 'text mine' the New York Times, demonstrating evolution of potent new technology

press release @ the bren school of information and computer sciences: "'We have shown in a very practical way how a new text mining technique makes understanding huge volumes of text quicker and easier,' said David Newman, a computer scientist in the Donald Bren School of Information and Computer Sciences at UCI. 'To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians.'"