Friday, December 01, 2006

Wired 14.12: Me Translate Pretty One Day

A nice article by Evan Ratliff on the Fluent Machines MT (machine translation) system.

Saturday, November 25, 2006

Language Log: Morphemedar

When I read about the pattern apparently established by "gaydar" through the reanalysis of ra(dio)-dar, the word seemed very familiar even though it was the first time I had seen it. Then Arkadiy Gaydar (Аркадий Гайдар), a Russian writer of the Leninist era, came to mind. Looking it up on Google, I found 1,130,000 ghits for Гайдар against 930,000 ghits for the English spelling. This made me think about the usefulness of a phonetic search engine. Using finite state transducers from Xerox PARC, I built one a long time ago (1994); it ended up being used in telephone directories to look up names from an approximate spelling (or just from knowing how to pronounce them...).
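To make the idea of phonetic lookup concrete, here is a minimal sketch. It is not the finite state transducer system mentioned above: it swaps in the much simpler classic Soundex key, and the tiny Cyrillic table (TRANSLIT) and the helper names are assumptions made up for this illustration.

```python
from collections import defaultdict

# Standard American Soundex consonant classes.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

# Hypothetical, deliberately incomplete transliteration table:
# just enough Cyrillic letters for the example name.
TRANSLIT = {"Г": "G", "А": "A", "Й": "Y", "Д": "D", "Р": "R"}


def soundex(name):
    """Return the 4-character Soundex key of a name ('' if no usable letters)."""
    latin = "".join(TRANSLIT.get(c, c) for c in name.upper())
    letters = [c for c in latin if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return (key + "000")[:4]


def build_index(names):
    """Group directory entries by phonetic key."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index


directory = build_index(["Gaydar", "Gaidar", "Гайдар", "Robert", "Rupert"])
print(directory[soundex("Gaydar")])
```

With this key, all three spellings of the writer's name land in the same bucket, so a query in either script finds them. A real system would need a full transliteration table and language-specific pronunciation rules, which is where transducers earn their keep.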

Saturday, October 21, 2006

Why We Can Be Confident of Turing Test Capability Within a Quarter Century

Why We Can Be Confident of Turing Test Capability Within a Quarter Century: "Computer language translation continues to improve gradually. Because this is a Turing-level task—that is, it requires full human-level understanding of language to perform at human levels—it will be one of the last application areas to compete with human performance. Franz Josef Och, a computer scientist at the University of Southern California, has developed a technique that can generate a new language-translation system between any pair of languages in a matter of hours or days. All he needs is a 'Rosetta stone'—that is, text in one language and the translation of that text in the other language—although he needs millions of words of such translated text. Using a self-organizing technique, the system is able to develop its own statistical models of how text is translated from one language to the other and develops these models in both directions.
This contrasts with other translation systems, in which linguists painstakingly code grammar rules with long lists of exceptions to each rule. Och's system recently received the highest score in a competition of translation systems conducted by the U.S. Commerce Department's National Institute of Standards and Technology."
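For readers wondering what "developing its own statistical models" from a Rosetta stone can look like, here is a toy sketch. It is not Och's system: it uses the textbook IBM Model 1 word-translation model trained with EM, leaves out the NULL word, and runs on a three-sentence corpus invented for the example. All function names are mine.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(f | e) from sentence pairs with EM.

    pairs: list of (foreign_tokens, english_tokens).
    Simplification: no NULL word, no alignment or fertility model.
    """
    foreign_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(foreign_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per English word
        # E-step: spread each foreign word over the English words
        # of its sentence, in proportion to the current table.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_translation(t, e, foreign_vocab):
    """Most probable foreign word for the English word e."""
    return max(foreign_vocab, key=lambda f: t[(f, e)])


corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
table = train_ibm_model1(corpus)
vocab = {f for fs, _ in corpus for f in fs}
for word in ("the", "house", "book", "a"):
    print(word, "->", best_translation(table, word, vocab))
```

Nobody tells the model that "Haus" means "house"; it falls out of the co-occurrence statistics alone, which is why the same code works for any language pair given enough parallel text.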

Saturday, September 23, 2006

IBM technology translates Arabic media broadcasts to English

IBM Research Press Resources: IBM technology translates Arabic media broadcasts to English: "Codenamed 'TALES' (for Translingual Automatic Language Exploitation System), the IBM technology processes the audio signal from Arabic television and radio stations and translates its spoken content into English text. Once this text is indexed by the CriticalTV platform, Critical Mention's clients will be able to conduct real-time searches of Arabic media, and receive alerts instantly when a search term is detected."

Tuesday, September 12, 2006

There's no data like more data...

Intelligent Enterprise Magazine: Google, Competitors Look Toward the Ultimate Search: "'Page rank is one factor with which we work; others are classification, clustering and synonym finding,' says Peter Norvig, Google's director of search quality. Norvig adds that Google is also working with technologies such as statistical machine translation, speech recognition and entity detection. The plan is to leverage what Google 'owns' on the Web to learn as many words, and consequent word relations, as possible. That, he says, would enable intuitive, cognitive 'conversations' to take place between searcher and search engine.
'We are on our way to learning from more than 1 trillion words procured from public Web pages, where others may have a billion,' he says, adding, 'there's no data like more data. ... Regardless of how clever the algorithm, the number of words is a critical factor.'"

Friday, August 04, 2006

Google is sharing N-gram data

Official Google Research Blog: All Our N-gram are Belong to You: "we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times."
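The kind of table being published is easy to picture with a small sketch. This is only an in-memory illustration of counting word sequences and applying frequency cutoffs, not Google's pipeline; the function names and the toy text are assumptions for the example.

```python
from collections import Counter


def vocabulary(tokens, min_word_count=1):
    """Words that occur at least min_word_count times."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_word_count}


def ngram_counts(tokens, n=5, min_count=1):
    """Count every n-word sequence and keep those seen at least min_count times."""
    # zip over n shifted views of the token list yields each n-gram as a tuple.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c >= min_count}


text = "to be or not to be or not to be"
tokens = text.split()
print(ngram_counts(tokens, n=2, min_count=3))
print(sorted(vocabulary(tokens, min_word_count=3)))
```

On the real dataset the cutoffs would be 40 for the five-word sequences and 200 for single words, and the counting would have to be distributed over many machines rather than held in one dictionary.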

Saturday, July 29, 2006

Check this out!

AMTA 2006 - Boston, MA: "Context-Based Machine Translation Authors: Jaime Carbonell, Steve Klein, David Miller, Mike Steinbaum, Tomer Grassiany and Jochen Frey"

The most exciting new MT approach around. Meaningful Machines will finally let us know something more detailed. The corresponding patents have been out for a while, but I am looking forward to reading this paper...

UCI researchers 'text mine' the New York Times, demonstrating evolution of potent new technology

press release @ the bren school of information and computer sciences: "'We have shown in a very practical way how a new text mining technique makes understanding huge volumes of text quicker and easier,' said David Newman, a computer scientist in the Donald Bren School of Information and Computer Sciences at UCI. 'To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians.'"

Wednesday, May 31, 2006

ZyLAB - Document management software and digital archiving

ZyLAB - Document management software and digital archiving: "ZyLAB, an innovative developer of Information Access Solutions, today announced a new strategic partnership agreement with Language Weaver, Inc., a leading software company developing enterprise software for automated language translation. ZyLAB will integrate Language Weaver's statistical machine translation software (SMTS) with ZyIMAGE, its Information Access Platform, for customers requiring language translation capabilities alongside advanced data archiving and information management functionalities."

Wednesday, May 03, 2006

Great News from Google: Statistical machine translation live

Official Google Research Blog: Statistical machine translation live: "Several research systems, including ours, take a different approach: we feed the computer with billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model. We have achieved very good results in research evaluations."

A very informative commentary by Mark Liberman can be found here, and a not-so-exciting discussion is going on here.

Sunday, March 26, 2006

Bibliothèque Numérique Francophone

About a year ago, Language Log derided the reaction of the head of the National Library of France. Since then, there have been several attempts to defeat the evil Google. As "La République Internationale des Lettres" writes in very sarcastic terms, there is now yet another effort, this time a Francophone-only one... No hurry though: "Le projet reste pour l'instant au stade de la réflexion." (For now, the project remains at the reflection stage.)

Bibliothèque Numérique Francophone: "After Gallica, after the Bibliothèque Numérique Européenne, the course is now set for the Bibliothèque Numérique Francophone. No doubt, with this visionary squadron leader, the publishers and other sheep-like professional circles of the French book world who have charged behind him into the anti-Google battle are actively contributing to the future influence of French knowledge, culture and language on the internet. Who was talking about the decline of France?"

Monday, March 13, 2006

Found in Translation - Military Information Technology

Found in Translation - Military Information Technology: "Spurred by the military and intelligence communities’ growing need to translate and retrieve pertinent foreign-language intelligence, the Defense Advanced Research Project Agency has launched a program aimed at improving automated, searchable translations."

Tuesday, March 07, 2006

Speak It in Chinese, Hear It in English - Newsweek: International Editions -

Speak It in Chinese, Hear It in English - Newsweek: International Editions - "A three-year EU project called TC-STAR is pumping €10 million into language-software R&D."
That's great - but - what's new in there? OCR? Siemens' MT (METAL)? In any case, everything seems to be two years away - even this statement is not new...

Sunday, March 05, 2006

IBM's research juggling act | Tech News on ZDNet

Paul Horn, the director of IBM Research: It continues to be a big thing for IBM and for IBM Research, but it's not just WebFountain. The basic issues are, really, natural language understanding in general. What WebFountain was able to do, which made it powerful, was it would go in and would scan text documents on the Web and it would understand enough about what people were saying that you could query it about what people were saying. You could imagine that there's a lot of countries, including our own, that would care a lot about scanning documents and even open documents and crawling through them to see what people were saying. A lot of the early work on WebFountain was done in three languages--English, Arabic and Chinese--and you can guess who might sponsor that work.

WebFountain is an example of a natural language technology that allows you to essentially analyze from an intelligence point of view what people are saying, but the important point is that this is just a small piece of many, many problems that companies have and where you want to take advantage of natural language understanding, such as translating spoken English to Russian and back again.

We talked about call centers. Natural language understanding can be incredibly powerful, even if you've got a call center operator, just by monitoring the calls and trying to understand what the issues are. There's enormous amounts of natural language and analytic issues in how companies interact with their customers. WebFountain was a specific application of natural language and search technology, but it's just one.

Wednesday, January 25, 2006

What's Next: Meet Your New Executives

Nice article about the current developments in text mining.

...Text-mining engines, which can cost as little as a few thousand dollars, take up where Google leaves off, searching articles, webpages, blogs, and e-mail (and eventually, even mobile phone calls or television broadcasts) for ideas and even emotions, rather than specific terms...