Language Processing News: 2005

Friday, December 02, 2005

Language Weaver Offers New Language Translation Module for Persian

Language Weaver, a leading developer of enterprise software for the automation of human language translation, today announced the commercial availability of a bidirectional Persian/English language pair module for its automated translation product. Persian may also be referred to as Farsi.
Bidirectional language pairs available include: Arabic/English, Chinese/English, Persian/English, French/English, and Spanish/English; unidirectional languages include Somali to English and Hindi to English.

Wednesday, November 23, 2005

OpenLogos

Members of the MT community may be interested in knowing, if they do not already, that the German Research Center for Artificial Intelligence (DFKI) is offering the Logos Machine Translation System in an open-source derivative known as OpenLogos. OpenLogos runs on the Linux platform with PostgreSQL and may be downloaded from http://logos-os.dfki.de/

This open-software offering is being made to individuals, universities and public institutions free-of-charge, with a view to its exploitation in both current and new language combinations.

OpenLogos is based upon the long-standing commercial, rule-driven Logos System owned by GlobalWare AG (Eisenach)
http://allpr.de/20096/GlobalWare-AG-und-DFKI-praesentieren-LOGOS-Open-Source.html

For those interested in knowing about the underlying linguistic technology of OpenLogos, the article Bernard (Bud) Scott: The Logos Model: An Historical Perspective. In: Machine Translation 18 (2003), pp. 1-72 provides a comprehensive overview of the Logos approach to machine translation.

An earlier on-line description of the linguistic and computational motivations for the Logos Model is available at http://iai.iai.uni-sb.de/iaien/iaiwp/p11/index.html

Bud Scott
Parse International, Inc.
bud.scott@verizon.net

[NLP around NYC] free toolkit for syntax-driven SMT

The 2005 JHU Language Engineering Workshop has released a free toolkit for syntax-driven statistical machine translation (a.k.a. "translation by parsing"). The "GenPar" Toolkit is intended to serve as a springboard for research. Its modular design also makes it useful for educational purposes.

GenPar features:
* User, system, and design documentation.
* Flexibility -- it is dynamically configurable via nested config files.
* Intuitive, object-oriented design, making it easy to modify and extend.
* Complete validation suite.
* Fully integrated prototype SMT systems for 3 language pairs. These prototypes are certainly not state-of-the-art (so far). However, they are complete, in the sense that no additional software is required to build an MT system, apply it to new input, and automatically evaluate the results. These prototypes can also serve as blueprints/templates for other language pairs.

GenPar is downloadable from here:
http://www.clsp.jhu.edu/ws2005/groups/statistical/GenPar.html

The accompanying "MTV" tool for visualizing tree-structured alignments is downloadable from here:
http://www.clsp.jhu.edu/ws2005/groups/statistical/mtv.html

A report outlining the context in which these tools were created is at
http://www.clsp.jhu.edu/ws2005/groups/statistical/documents/finalreport.pdf

Researchers at several institutions are actively developing GenPar and MTV. We welcome inquiries from potential contributors and collaborators. Of course, we also welcome feedback from users.

Contact:
Dan Melamed
New York University
lastname AT cs DOT nyu DOT edu

Monday, October 31, 2005

Slashdot | Can Your Mouth Become Multilingual?

Slashdot | Can Your Mouth Become Multilingual? is the question discussed on Slashdot...

Friday, October 28, 2005

Let's talk! The computer can translate

Let's talk! The computer can translate...announced he would take questions from reporters in Germany and America, the computer heard it as "so we glycogen it alternating questions between Germany and America."

I would be curious to see how this got translated into German and then hear it synthesized by some speech generator for German listeners :) ...and, as always, it's only five years away from working perfectly.

Thursday, September 15, 2005

Slashdot | A Useful Grammar Checker?

Yesterday was a very "linguistic" day at Slashdot. There was a post about A Useful Grammar Checker
Programming
Posted by Cliff on Wednesday September 14, @06:02PM
from the what-set-of-rules-do-you-use dept.
burtdub asks: 'With the amount of raw text data available, there seems to be no shortage of ambitious language projects on the horizon, from Universal Language Translators to Junk Email Filtering. However, the mess that is the English language still seems to elude commercial attempts while being relatively ignored by the open source community. What would it take to make a useful, functional grammar checker?'

And, of course, as we are used to seeing on Slashdot, responses flew in from all over and in every direction...

A graduate student, apparently from OSU, gave a tough answer to the usual assertion that English is messy compared to the supposed perfection of other languages:
"Most of the comments about grammar here have been incredibly stupid, by the way. Here's an important thing you learn in an intro to ling class: all languages are equally complicated. It's not going to be easier to write a grammar checker for any language above any other. e.g. You might have to worry more about morphology in one language and word order in another."

Monday, September 12, 2005

Information Sciences Institute - Grammar Lost Translation Machine In Researchers Fix Will

Daniel Marcu and Kevin Knight at USC/ISI "propose to implement a trainable tree-based language model and parser, and to carry out empirical machine-translation experiments with them. USC/ISI's state-of-the-art machine translation system already has the ability to produce, for any input sentence, a list of 25,000 candidate English outputs. This list can be manipulated in a post-processing step. We will re-rank these lists of candidate string translations with our tree-based language model, and we plan for better translations to rise to the top of the list."
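
The re-ranking step in that proposal can be illustrated with a small Python sketch. This is only a toy under loud assumptions: a simple add-one smoothed bigram model stands in for the tree-based language model (which is not described here), the training sentences are invented, and the names train_bigram_model, log_prob and rerank are hypothetical, not part of any USC/ISI software.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab_size = len(unigrams) + 1  # +1 reserves some mass for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

def rerank(candidates, unigrams, bigrams):
    """Return the candidate list sorted from most to least probable under the model."""
    return sorted(candidates, key=lambda c: log_prob(c, unigrams, bigrams), reverse=True)

# Toy usage: the fluent word order should rise to the top of the list.
unigrams, bigrams = train_bigram_model([
    "the white house confirmed the report",
    "the house is white",
])
ranked = rerank(
    ["house white the confirmed report the", "the white house confirmed the report"],
    unigrams,
    bigrams,
)
print(ranked[0])
```

The point of the design is that the translation system and the re-ranker stay decoupled: any model that can assign a score to a candidate string could be swapped in for the bigram model without touching the candidate generator.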

In this vein, David Chiang of the University of Maryland, Institute for Advanced Computer Studies, will present Friday, September 16, at the monthly NYCNLP meetings organized by NYU's Dan Melamed. Below is David's abstract:

The introduction of data-driven methods into machine translation (MT) in the 1990s created a whole new way of doing MT, and the recent move from the word-based models developed at IBM to the phrase-based models developed by Och and others has led to a breakthrough in MT performance. The next breakthrough, the move to syntax-based models that deal with the full hierarchical structures of sentences, is still waiting to happen. Several approaches have been tried, making considerable progress but not yet surpassing the performance level of simpler phrase-based models. Hiero is a step towards that breakthrough from the other side: it starts with a phrase-based model and incorporates formal characteristics of syntax-based models to improve on both. Like the latter, it deals with hierarchical structures, but it takes after the former in that it is unconstrained by syntactic theories, and can be trained from parallel bilingual text without any syntactic annotation, manual or automatic. In the recent NIST MT Evaluation, it outperformed several state-of-the-art systems, both phrase-based and syntax-based, on both Chinese-English and Arabic-English translation. I will present Hiero's underlying model, its implementation, and experimental results.

PAPER:
D. Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proc. ACL-05.
http://www.umiacs.umd.edu/~dchiang/papers/chiang-acl05.pdf

Wednesday, August 24, 2005

NIST 2005 Machine Translation Evaluation Results

Finally something real from Google... and this time even beating many of the old-timers. The table below shows the BLEU results for Arabic to English. While this is good advertising for Google, it lacks a comparison to the present leaders in Arabic translation: Language Weaver and the well-established AppTek. It's great, though, to see competition heating up. I wonder when Fluent Machines will show off their high BLEU scores.

Site         BLEU-4 score
GOOGLE       0.5131
ISI          0.4657
IBM          0.4646
UMD          0.4497
JHU-CU       0.4348
EDINBURGH    0.3970
SYSTRAN      0.1079
MITRE        0.0772
FSC          0.0037
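
For readers unfamiliar with the metric, here is a minimal Python sketch of how a BLEU-4 score can be computed. It is a simplification under stated assumptions: it scores a single sentence against a single reference with no smoothing, whereas evaluation campaigns typically score a whole test set against several references, so it will not reproduce the numbers above. The function names ngrams and bleu4 are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against one reference, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Modified precision: clip each candidate n-gram count by its reference count.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precision_sum += math.log(matches / total)
    # Brevity penalty: punish candidates that are shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the four precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(log_precision_sum / 4)

print(bleu4("the cat sat on the mat today", "the cat sat on the mat"))
```

Because the score is a geometric mean of 1-gram to 4-gram precisions, a system that gets individual words right but scrambles their order is penalized heavily, which helps explain the wide spread between the top and bottom of the table.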

The participants were:
U.S. Army Research Laboratory, Advanced Telecommunications Research Institute International Spoken Language Translation Research Laboratories - Japan, University of Edinburgh - UK, Fitchburg State College, Google, Harbin Institute of Technology Machine Intelligence & Translation Laboratory - China, IBM, Chinese Academy of Sciences Institute of Computing Technology - China, University of Southern California Information Sciences Institute, ITC-IRST - Italy, Johns Hopkins University & University of Cambridge, Linear B - UK, MITRE Corporation, National Research Council of Canada, NTT Communication Science Laboratories - Japan, RWTH Aachen University - Germany, Saarland University - Germany, Sakhr Software, SYSTRAN Language Translation Technologies, University of Maryland

Saturday, August 06, 2005

Can Google Stay Google?

Can Google Stay Google?: "'We're in a target-rich environment of interesting problems,' says Alan Eustace, one of Google's handful of vice presidents of engineering and its head of research. Take the technology for 'machine translation' of human language. Right now, Google can automatically translate Web pages from English into a bunch of major languages and vice versa -- German, Spanish, French, Italian, Portuguese, Japanese, Chinese, and Korean. The list will get longer in the next year or two. But that's just the beginning, Eustace says: 'The goal is to make the Internet language-independent.' Ultimately, all search results will come back instantly in your own language, regardless of what tongue you speak -- and what dialect the pages are written in. Every Google user will be like a delegate in the General Assembly of the United Nations putting on headphones to hear translations of the speaker up front. At the UN, it doesn't matter whether you speak only French and the orator is waxing eloquent in Chinese. The Web will be the same way.
Automated universal translation is the kind of long-range vision that inspires people like Eustace. It fascinates them because it's a technical Mount Everest that they can climb, but also because it's an idealistic goal that's potentially enriching to global society. 'In the long term, if you can create technology that can unify information around the world and remove the language barrier, that would be very special,' he says. "

Are we there yet? I would love to finally be able to translate something with this engine. From what can be seen at Google, the quality of translation isn't far from what the Systran, Logos, AppTek and Barcelona systems have been delivering since the '80s.

Here is what Google writes about its translation in their language tools FAQ:
The translation isn't as good as I'd like it to be. Can you make it more accurate?

The translation you are seeing was produced automatically by state-of-the-art technology. Unfortunately, today's most sophisticated software doesn't approach the fluency of a native speaker or possess the skill of a professional translator. Automatic translation is very difficult, as the meaning of words depends upon the context in which they are used. Because of this, accurate translation requires an understanding of context, as well as an understanding of the structure and rules of a language. While many engineers and linguists are working on the problem, it will be some time before anyone can offer a quick and seamless translation experience. In the interim, we hope the service we provide is useful for most purposes.

Saturday, June 25, 2005

Google and Albanian

Albeit with a few mistakes, Google does have an interface in Albanian. Yet its language ID tools do not allow searching for pages only in Albanian. Until now, I have been using a little trick: I include the most frequent Albanian word, "të", in a query and I get results from Albanian pages. Unfortunately, as discussed previously by David Beaver in the Language Log entry Pass the hát, Google also appears to change "ë" to "e". Fortunately for Albanophiles, Yahoo maintains the difference by not folding the accented characters.

Read also Language Log and Technologies du Langage.

[Just found out that if the word containing diacritics is enclosed in quotes, Google will limit the search to words that carry the diacritics.]
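
The folding behaviour discussed in this entry can be illustrated with a short Python sketch. It is only an approximation of what a search engine might do, not Google's or Yahoo's actual implementation; the function names fold_diacritics and matches are made up for illustration. The idea is to decompose each character into a base letter plus combining marks and then drop the marks.

```python
import unicodedata

def fold_diacritics(text):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def matches(query, word, fold=True):
    """Compare a query term with an indexed word, with or without folding."""
    if fold:
        return fold_diacritics(query) == fold_diacritics(word)
    # Exact mode: only normalize so that equivalent encodings compare equal.
    return unicodedata.normalize("NFC", query) == unicodedata.normalize("NFC", word)

print(fold_diacritics("të"))            # folded form, as a folding engine would index it
print(matches("të", "te"))              # folding on: the two words collide
print(matches("të", "te", fold=False))  # folding off: they stay distinct
```

With folding on, a query for "të" also hits every page containing "te", which is why the trick stops isolating Albanian pages; with folding off, the distinction survives.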

Thursday, June 02, 2005

The Google Translator

Slashdot | Coming Soon, The Google Translator: "Google gave journalists a glimpse of its next generation machine translation system at a May 19th Google Factory Tour." - A great publicity stunt from Google... and, as always in the MT world, it's 95% done. Let's hope Google (which needs billions of words in parallel corpora) or some other company (that doesn't need parallel corpora) starts carving something out of that 5% left there for decades. Looking at these patents, it doesn't seem a very crowded field.

Friday, May 20, 2005

Still waiting for that first translation...

Will Google be the first to reach this long-standing dream? Maybe Fluent Machines still has an advantage... Nothing new in this article from The Stanford Daily Online Edition: "Machine translations, for instance, have come a long way at Google.

“Historically, the approach to building machine translation systems is to have expert machine linguists write down dictionaries and rules on how to translate, say, from Chinese to English,” said researcher Franz Och. “Trying to write down all the rules on how to translate from Chinese to English is very hard.”

Instead, Google is fine-tuning a translation program that can automatically translate back and forth between documents in different languages — a sort of virtual Rosetta Stone.

Current machine translations are inconsistent at best, Och said. One current translation program translated “The White House confirmed the existence of a new bin Laden tape” in Arabic to “Alpine white new presence tape registered for coffee confirms Laden,” in English."

Tuesday, May 17, 2005

Washington Post uses Teragram

Teragram is doing well: after the news that the NY Times uses their categorization engine, the Washington Post is now mentioned in Teragram News: "'Our paper covers all facets of news and is continually receiving updates, information and electronic content,' said John Whall, director of applications development for Washingtonpost.Newsweek Interactive (WPNI). 'The taxonomy management abilities of Teragram TK240 enable us to manage, process and present this constant flow of information in a way that is useful to our editors and, more importantly, our readers.'"

Sunday, May 08, 2005

Computers Grading Students' Writing?

News: "SAGrader analyzes sentences and paragraphs, looking for keywords as well as the relationship between terms.

Other programs compare a student's paper with a database of already-scored papers, seeking to assign it a score based on what other similar-quality assignments have received.

Educational Testing Service sells Criterion, which includes the 'e-Rater' used to score GMAT essays. Vantage Learning has IntelliMetric, Maplesoft sells Maple T.A., and numerous other programs are used on a smaller scale. "

Thursday, May 05, 2005

Europeans join forces to create a virtual library

Europeans join forces to create a virtual library: "Last week, the President of the Republic therefore launched, together with German Chancellor Gerhard Schröder, a program to develop a new Franco-German Internet search engine. Quaero, as it is called, will be dedicated to images, sound and video."

So Quaero is the name of the EU library project covered in more linguistically relevant detail by Mark Liberman at Language Log.

Wednesday, May 04, 2005

Microsoft IP Ventures - Natural Language Processing for Educational Courseware

For years, Microsoft's research labs and development teams have produced technologies that were out of reach for outside entrepreneurs. Finally, they are making their IP available through Microsoft IP Ventures. Among other technologies, some of their NLP work is available as well:

Microsoft IP Ventures - Natural Language Processing for Educational Courseware: "Natural Language Processing for Educational Courseware creates dynamic learning programs from any static educational content consisting of questions from the material that continuously adapt to a student based on previous answers."

Based on NLPWin, which processes English, Spanish, German, French, and Japanese, it could become a very effective platform for creating new and interesting learning tools. I wonder whether it could break the monopoly of classrooms in language teaching and learning.

Monday, May 02, 2005

Teragram Adds Hungarian to its Linguistic Suite

Teragram Adds Hungarian to its Linguistic Suite: "Hungarian's linguistic challenges are easily handled by Teragram's dictionary as it breaks apart and parses meaning from highly agglutinative words. For example, the Hungarian word 'mostohagyerekeidhez,' meaning 'to your step children' is actually composed of many smaller pieces and can be deconstructed into 'mostoha gyerek e i d hez.' In this example, Teragram's software breaks the word down into its basic elements to derive meaning: 'mostoha' (meaning 'step' in English), gyerek ('child'), 'gyereke' (the possessive marker 'e' turns the meaning to 'child belonging to'), 'gyerekei' (the plural marker 'i' turns the meaning to 'children belonging to'), 'gyerekeid' (the second person marker 'd' turns the meaning to 'the children belonging to you' i.e. 'your children'), 'gyerekeidhez' (the inflectional 'hez' turns the meaning to 'to your children'). 'The ability of Teragram's powerful linguistic engine to deconstruct words into meaningful parts is critical to improving the precision of information retrieval applications and search accuracy,' says Dr. Schabes. 'That's what our customers look to us to uniquely provide.'"
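
To make the decomposition in that example concrete, here is a toy Python sketch of dictionary-based segmentation. It is not Teragram's method, which the press release does not describe; it is a plain longest-match search with backtracking over a hypothetical six-entry lexicon whose glosses are copied from the example above. The names MORPHEMES and segment are invented for illustration, and a real Hungarian analyzer would also need rules about which morphemes may follow which.

```python
# Hypothetical toy lexicon; glosses follow the press-release example.
MORPHEMES = {
    "mostoha": "step",
    "gyerek": "child",
    "e": "possessive marker",
    "i": "plural marker",
    "d": "second person marker",
    "hez": "inflectional ending meaning 'to'",
}

def segment(word, lexicon=MORPHEMES):
    """Return one segmentation of word into lexicon entries, or None if none exists."""
    if not word:
        return []
    # Try longer pieces first; backtrack when the remainder cannot be segmented.
    for end in range(len(word), 0, -1):
        piece = word[:end]
        if piece in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [piece] + rest
    return None

pieces = segment("mostohagyerekeidhez")
print(" ".join(pieces))
for piece in pieces:
    print(piece, "=", MORPHEMES[piece])
```

Even this toy shows why agglutinative languages matter for search: indexing only the full surface form would miss a query for "gyerek", while indexing the segmented stem would not.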

Saturday, April 30, 2005

Microsoft Looks to Yukon for Data Mining Gold - but text mining remains just a "capability"

Redmond | Redmond Report Article: Microsoft Looks to Yukon for Data Mining Gold: "Microsoft added seven more algorithms in Yukon, including regression trees, sequence clustering, association rules and time series. It also included a capability called text mining, a tool for finding trends in unstructured data such as e-mails and documents."

Sunday, February 20, 2005

Why chattering classes have nothing to say

Now it all makes sense... People are longing for real conversation, but that happens only through telephones or other media.

The Observer | UK News | Why chattering classes have nothing to say: The art of conversation is dead but the artistry of chatter is thriving, with Britons overwhelmingly admitting they rarely talk about anything more serious than traffic and television.

According to a survey of more than 2,000 adults, almost two-thirds of us admit to indulging in shallow chit-chat at the expense of weighty dialogue - even though we secretly long for more meaningful exchanges.

...

"The survey also found that more than two-thirds of people believe the telephone is the best way to have intelligent conversations, although Ned Sherrin, presenter of Loose Ends, the Radio 4 comedy show, a lexicographer and author of 20 books, admits hating the telephone. 'I would rather see the contours of their face, the clouds and the flicker of their tears. I find the telephone irritating and unsatisfactory, and like to get them over with as quickly as possible,' he said."

Monday, February 07, 2005

Cindy Adams of PageSix: Natural Language Dialogue

There is a lot of NLP research dealing with the modelling and simulation of human dialogues. I would love to see the following as one of their test cases.

Yahoo! Movies: Entertainment News & Gossip: "LAGUARDIA Airport ladies room. A voice from another stall says, 'Hi, how are you?' The other lady, not one to chat up restroom strangers, sputters, 'Oh . . . fine . . . .' The Voice: 'So what're you up to?' The Embarrassed Sputterer, 'Ohhh, just traveling . . . .' The Voice: 'Can I come over?' Not quite knowing how to handle this bizarre turn, the Embarrassed Sputterer sputters: 'N-n-n-no. I'm a little busy right now.' The next sound is The Voice saying nervously: 'Listen, I'll have to call you back. There's an idiot in the other stall who keeps answering all my questions.'"

Friday, February 04, 2005

Slashdot | DARPA Contracts For AI Technology

Slashdot | DARPA Contracts For AI Technology: "DARPA has contracted two professors from RPI to develop artificial intelligences that can learn by reading and understanding natural language."

2B enhances Factiva's reputation

2B enhances Factiva's reputation - Computeractive: "Factiva has acquired the business and assets of 2B Reputation Intelligence, but details of the deal were not disclosed. 2B provide media and reputation monitoring software, as well as consulting services. Clare Hart, president and CEO of Factiva said the two companies have been working together for over a year and described 2B as a 'critical piece for an effective reputation management solution'.

'Whilst we were re-assessing the market this looked like the best way to accelerate our re-entry into the market,' Hart said of the decision to acquire 2B after dropping IBM.

Factiva announced in December 2004 that the IBM WebFountain web analysis platform was being dropped as the core technology for Factiva Insight for Reputation. WebFountain failed to provide timely content for analysis according to Factiva insiders. Hart denied that the IBM chapter had put Factiva behind in its reputation management plans."

Corpora launches 'language-savvy' knowledge discovery tool

SourceWire: "'Language-savvy' Jump! tackles the assault of information overload and simplifies working life by enabling people to interrogate large electronic documents for relevant details quickly and easily. Launched by knowledge management company, Corpora plc, the new tool 'speed-reads' material to create a map of contents that accelerates understanding and improves efficiency."

Saturday, January 08, 2005

Live OpenSource Dictionary Project

I am glad this group of people started this project which contains a lot of the features we had projected years ago at Logos Corp. but never implemented. Live OpenSource Dictionary Project - About Us: "Lingster is the first ever freeware multilingual dictionary initiative launched to help language enthusiasts to keep up with constantly evolving languages. New words emerge on a daily basis, and both general and domain-specific dictionaries are hopelessly lagging behind. Argot and slang represent yet another rapidly moving target that is hard to follow with conventional language-learning methods."

Monday, January 03, 2005

CBS News | Defining Google | January 2, 2005 20:01:07

CBS News | Defining Google | January 2, 2005 20:01:07
...Google engineer Alan Eustace explains, "One of the ideas that we’re working on is machine translation. We strongly believe that there’s enough data on the Web and in the world right now to allow us to automatically translate from one language to another." ...