Thursday, September 15, 2005
Posted by Cliff on Wednesday September 14, @06:02PM
from the what-set-of-rules-do-you-use dept.
burtdub asks: "With the amount of raw text data available, there seems to be no shortage of ambitious language projects on the horizon, from Universal Language Translators to Junk Email Filtering. However, the mess that is the English language still seems to elude commercial attempts while being relatively ignored by the open source community. What would it take to make a useful, functional grammar checker?"
And, of course, as we are used to seeing on Slashdot, responses flew in from every direction...
A graduate student, apparently from OSU, gave a blunt answer in his post to the usual claim that English is messy while other languages are tidy:
"Most of the comments about grammar here have been incredibly stupid, by the way. Here's an important thing you learn in an intro to ling class: all languages are equally complicated. It's not going to be easier to write a grammar checker for any language above any other. e.g. You might have to worry more about morphology in one language and word order in another."
Monday, September 12, 2005
Daniel Marcu and Kevin Knight at USC/ISI "propose to implement a trainable tree-based language model and parser, and to carry out empirical machine-translation experiments with them. USC/ISI's state-of-the-art machine translation system already has the ability to produce, for any input sentence, a list of 25,000 candidate English outputs. This list can be manipulated in a post-processing step. We will re-rank these lists of candidate string translations with our tree-based language model, and we plan for better translations to rise to the top of the list."
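To make the re-ranking step concrete, here is a minimal Python sketch of n-best list re-ranking. It is only an illustration under stated assumptions: the function names (rerank, toy_lm_score), the bigram table, and the scores are all hypothetical, and the toy bigram scorer merely stands in for the tree-based language model described in the proposal, which it does not implement.

```python
from typing import Callable, List, Tuple

def rerank(candidates: List[Tuple[str, float]],
           lm_score: Callable[[str], float],
           lm_weight: float = 1.0) -> List[Tuple[str, float]]:
    """Add a weighted language-model score to each candidate's base score
    and return the candidates sorted best-first."""
    rescored = [(sent, base + lm_weight * lm_score(sent))
                for sent, base in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-in for a real language model: it simply counts
# how many adjacent word pairs appear in a tiny table of "good" bigrams.
GOOD_BIGRAMS = {("the", "cat"), ("cat", "sat"), ("sat", "down")}

def toy_lm_score(sentence: str) -> float:
    words = sentence.lower().split()
    return float(sum((a, b) in GOOD_BIGRAMS for a, b in zip(words, words[1:])))

# A two-item "n-best list" with made-up base scores from the MT system.
nbest = [("cat the sat down", -1.0), ("the cat sat down", -1.5)]
reranked = rerank(nbest, toy_lm_score)
print(reranked[0][0])
```

In this toy run the second candidate starts with the lower base score but moves to the top once the language-model score is added, which is the effect the proposal hopes for on its 25,000-item lists.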
In this vein, David Chiang of the University of Maryland, Institute for Advanced Computer Studies, will present on Friday, September 16, at the monthly NYCNLP meeting organized by NYU's Dan Melamed. Below is David's abstract:
The introduction of data-driven methods into machine translation (MT) in the 1990s created a whole new way of doing MT, and the recent move from the word-based models developed at IBM to the phrase-based models developed by Och and others has led to a breakthrough in MT performance. The next breakthrough, the move to syntax-based models that deal with the full hierarchical structures of sentences, is still waiting to happen. Several approaches have been tried, making considerable progress but not yet surpassing the performance level of simpler phrase-based models. Hiero is a step towards that breakthrough from the other side: it starts with a phrase-based model and incorporates formal characteristics of syntax-based models to improve on both. Like the latter, it deals with hierarchical structures, but it takes after the former in that it is unconstrained by syntactic theories, and can be trained from parallel bilingual text without any syntactic annotation, manual or automatic. In the recent NIST MT Evaluation, it outperformed several state-of-the-art systems, both phrase-based and syntax-based, on both Chinese-English and Arabic-English translation. I will present Hiero's underlying model, its implementation, and experimental results.
D. Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proc. ACL-05.
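As a rough illustration of what "hierarchical" means in the abstract above, the following Python sketch applies phrase pairs that contain linked gaps (X1, X2), so that one phrase pair can reorder the translations of the sub-phrases nested inside it. This is a toy under stated assumptions, not Hiero's implementation: the rule table is hand-written rather than learned from parallel text, the names (RULES, match, translate) are hypothetical, and the greedy first-match search replaces the weighted rules and chart-based decoding a real system would use.

```python
from typing import Dict, List, Optional, Tuple

GAPS = {"X1", "X2"}  # gap symbols linking source and target sides

# Each rule pairs a source pattern with a target pattern (hand-written toy table).
Rule = Tuple[List[str], List[str]]
RULES: List[Rule] = [
    (["yu", "X1", "you", "X2"], ["have", "X2", "with", "X1"]),
    (["Beihan"], ["North", "Korea"]),
    (["bangjiao"], ["diplomatic", "relations"]),
]

def match(pattern: List[str], words: List[str]) -> Optional[Dict[str, List[str]]]:
    """Match a source pattern against words; return gap bindings or None."""
    if not pattern:
        return {} if not words else None
    head, rest = pattern[0], pattern[1:]
    if head in GAPS:
        # A gap must cover at least one word; try every split point.
        for i in range(1, len(words) + 1):
            tail = match(rest, words[i:])
            if tail is not None:
                return {head: words[:i], **tail}
        return None
    if words and words[0] == head:
        return match(rest, words[1:])
    return None

def translate(words: List[str]) -> Optional[List[str]]:
    """Translate with the first rule that matches, recursing into the gaps."""
    for src, tgt in RULES:
        bindings = match(src, words)
        if bindings is None:
            continue
        subs = {gap: translate(span) for gap, span in bindings.items()}
        if any(s is None for s in subs.values()):
            continue
        out: List[str] = []
        for tok in tgt:
            out.extend(subs[tok] if tok in subs else [tok])
        return out
    return None

result = translate("yu Beihan you bangjiao".split())
print(" ".join(result))
```

The first rule swaps the order of its two gaps on the English side, which a flat phrase table could only capture by memorizing the whole phrase; no syntactic labels are involved, only the single gap symbol.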