Saturday, June 25, 2005

Google and Albanian

Google does have an interface in Albanian, albeit with a few mistakes. Yet its language ID tools do not allow searching for pages in Albanian only. Until now, I have been using a little trick: I include the most frequent Albanian word, "të", in a query and get results from Albanian pages. Unfortunately, as David Beaver discussed previously in the Language Log entry Pass the hát, Google also appears to change "ë" into "e". Fortunately for Albanophiles, Yahoo maintains the difference by not folding accented characters.
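
For the curious, here is a rough Python sketch of what accent folding does to a query. It only illustrates the general idea and is not how Google or Yahoo actually implement it; the function names `fold_diacritics` and `matches` are made up for this example.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks, so that "ë" becomes "e" (hypothetical helper)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, word: str, fold: bool = True) -> bool:
    """Compare a query term with a page word, with or without folding."""
    if fold:
        return fold_diacritics(query) == fold_diacritics(word)
    # Without folding, normalize both sides to NFC so that equivalent
    # encodings of the same accented letter still compare equal.
    return unicodedata.normalize("NFC", query) == unicodedata.normalize("NFC", word)


print(fold_diacritics("të"))             # the folded form of the query
print(matches("të", "te"))               # folding engine: treated as the same word
print(matches("të", "te", fold=False))   # non-folding engine: kept distinct
```

With folding switched on, "të" and "te" become indistinguishable, which is why the trick stops selecting Albanian pages; with folding switched off, the two stay distinct.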

Read also Language Log and Technologies du Langage.

[Just found out that if the word containing diacritics is surrounded by quotes, Google will limit the search to words carrying those diacritics.]

Thursday, June 02, 2005

The Google Translator

Slashdot | Coming Soon, The Google Translator: "Google gave journalists a glimpse of its next generation machine translation system at a May 19th Google Factory Tour." - A great publicity stunt from Google... and, as always in the MT world, it's 95% done. Let's hope that Google (which needs billions of words in parallel corpora) or some other company (one that doesn't need parallel corpora) starts carving something out of the remaining 5% that has been left there for decades. Looking at these patents, it doesn't seem to be a very crowded field.