Language Processing News: 2008

Saturday, November 29, 2008

Official Google Blog: Our international approach to search

Official Google Blog: Our international approach to search: "... improving Google's international search. This is a tough challenge, since Google search is used in many countries and languages where our engineers have little personal knowledge. Initially, the international search improvements were done by Search Quality engineers who were passionate about their languages and countries: Lina from Sweden improved our parsing of compound words in German and Swedish; Dimitra from Greece introduced diacritical support; Ishai from Israel worked on transliteration corrections for Hebrew and Arabic; Trystan from Australia created methods for identifying local search results and ranking them together with foreign ones from the same language; Alex, a bilingual Ukrainian and Russian, introduced morphological understanding of these languages. As the importance of our international search grew, we solicited help from Googlers in all our offices. Finally, we are leveraging an international network of search specialists who help us understand search within the unique combination of their language and country."

The post is long and presents many improvements. However, after rerunning my old tests for Albanian, I see that diacritics still cannot play the discriminating role they should. One of the top pages when searching for 'të' is the page for Tellurium. As suggested in a previous post here, 'të' should be entered in quotes; it is the only way to be sure that the results will contain this exact string. One of the pages returned, containing as many 'ë' as an Albanian page would, was in Mixe (Ayuk), a language spoken in Totontepec, Oaxaca, Mexico.
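A minimal sketch of why an unquoted 'të' can land on the Tellurium page, assuming the engine folds diacritics and case before matching. This is only my guess at the mechanism, not Google's documented behavior, and the class and method names (DiacriticFolding, fold) are made up for the illustration. Only java.text.Normalizer from the standard library is used.

```java
import java.text.Normalizer;

public class DiacriticFolding {

    // Hypothetical folding step: decompose each character (NFD),
    // then drop the combining marks that carried the diacritics.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "t\u00eb" is the Albanian word 'të', written with an escape
        // so the source file does not depend on its encoding.
        String folded = fold("t\u00eb");
        System.out.println(folded); // prints "te"

        // After folding, the query is indistinguishable (ignoring case)
        // from "Te", the chemical symbol for Tellurium.
        System.out.println(folded.equalsIgnoreCase("Te")); // prints "true"
    }
}
```

Under that assumption, once 'ë' and 'e' collapse into the same index term, the diacritic can no longer discriminate between languages, which matches what the tests above show.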

Tuesday, October 14, 2008

Google goes to the top of the language class - ZDNet.co.uk

Google goes to the top of the language class - ZDNet.co.uk: "Google's machine translation wasn't perfect, but it was well ahead of the competition. On a scale from zero to one, the company's software scored 0.5137 on the Arabic tests and 0.3531 on the Chinese tests. In Arabic, the University of Southern California's Information Sciences Institute came in second with a .4657 and second in Chinese with .3073. IBM scored .4646 on Arabic and .2571 on Chinese."

Sunday, October 05, 2008

Next Week's Turing Test

'Intelligent' computers put to the test | Technology | The Observer: "In the 'Turing test' a machine seeks to fool judges into believing that it could be human. The test is performed by conducting a text-based conversation on any subject. If the computer's responses are indistinguishable from those of a human, it has passed the Turing test and can be said to be 'thinking'.

No machine has yet passed the test devised by Turing, who helped to crack German military codes during the Second World War. But at 9am next Sunday, six computer programs - 'artificial conversational entities' - will answer questions posed by human volunteers at the University of Reading in a bid to become the first recognised 'thinking' machine. If any program succeeds, it is likely to be hailed as the most significant breakthrough in artificial intelligence since the IBM supercomputer Deep Blue beat world chess champion Garry Kasparov in 1997. It could also raise profound questions about whether a computer has the potential to be 'conscious' - and if humans should have the 'right' to switch it off."

Sunday, September 21, 2008

Official Google Blog: The intelligent cloud

Official Google Blog: The intelligent cloud: "We could train our systems to discern not only the characters or place names in a YouTube video or a book, for example, but also to recognize the plot or the symbolism. The potential result would be a kind of conceptual search: 'Find me a story with an exciting chase scene and a happy ending.' As systems are allowed to learn from interactions at an individual level, they can provide results customized to an individual's situational needs: where they are located, what time of day it is, what they are doing. And translation and multi-modal systems will also be feasible, so people speaking one language can seamlessly interact with people and information in other languages."

Saturday, August 16, 2008

Sun's or IBM's BreakIterator

Sun's java.text.BreakIterator is supposed to be synchronized with IBM's ICU4J BreakIterator. Apparently not... The code below shows a puzzling difference in behavior between the sentence iterators included in Java 6 and ICU4J 4.0.

public class Sentences {

  public static void main(String[] args) {
      String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";

      // Sentence iterators for the default locale, one from each library.
      java.text.BreakIterator sunSentenceTokenizer = java.text.BreakIterator.getSentenceInstance();
      com.ibm.icu.text.BreakIterator icuSentenceTokenizer = com.ibm.icu.text.BreakIterator.getSentenceInstance();

      sunSentenceTokenizer.setText(testText);
      icuSentenceTokenizer.setText(testText);

      // Print each segment reported by Sun's iterator.
      int sentenceStart = 0;
      int sentenceOffset = 0;
      int sentenceCounter = 0;
      System.out.println("sun");
      while ((sentenceOffset = sunSentenceTokenizer.next()) != java.text.BreakIterator.DONE) {
          System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
          sentenceStart = sentenceOffset;
          sentenceCounter++;
      }

      // Same loop with the ICU4J iterator.
      sentenceStart = 0;
      sentenceOffset = 0;
      sentenceCounter = 0;
      System.out.println("icu");
      while ((sentenceOffset = icuSentenceTokenizer.next()) != com.ibm.icu.text.BreakIterator.DONE) {
          System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
          sentenceStart = sentenceOffset;
          sentenceCounter++;
      }
  }
}

This is what I get:

sun
  0 Elle courut à son père et l'embrassa, en l'étreignant
  1 .
  2
   - Eh bien, partons-nous?
  3 dit-elle.
icu
  0 Elle courut à son père et l'embrassa, en l'étreignant.

  1  - Eh bien, partons-nous?
  2 dit-elle.

I did try it with the French locale and with other languages and their corresponding locales. The presence of '.' followed by '\n' causes Sun's break iterator to split the preceding '.' off as a separate sentence. I decided to use IBM's ICU4J.

'\n' is left at the beginning of the sentence when using Sun's iterator.
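To repeat the test with an explicit locale and make a stray '\n' visible, a small helper along these lines may be handy. It is only a sketch: SentenceSplitter and split are names I made up, and it uses java.text.BreakIterator alone. ICU4J's BreakIterator offers the same getSentenceInstance(Locale), setText, first, next and DONE, so the same loop should carry over by changing the import.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Cuts the text at every sentence boundary the iterator reports
    // for the given locale; the pieces concatenate back to the input.
    static List<String> split(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; end = iterator.next()) {
            sentences.add(text.substring(start, end));
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Same test sentence as above, with non-ASCII letters escaped.
        String text = "Elle courut \u00e0 son p\u00e8re et l'embrassa, en l'\u00e9treignant.\n"
                + " - Eh bien, partons-nous? dit-elle.";
        List<String> sentences = split(text, Locale.FRENCH);
        for (int i = 0; i < sentences.size(); i++) {
            // Escape newlines and bracket each piece so leading
            // whitespace or a leading '\n' is easy to spot.
            System.out.println(i + " [" + sentences.get(i).replace("\n", "\\n") + "]");
        }
    }
}
```

Where the boundaries fall depends on the library and version in use, so I give no expected output here.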

Thursday, August 07, 2008

Google Translation Center: The World’s Largest Translation Memory - GigaOM

Google Translation Center: The World’s Largest Translation Memory - GigaOM: "Google has been investing significant resources in a multi-year effort to develop its statistical machine translation technology. Statistical MT works by comparing large numbers of parallel texts that have been translated between languages and from these learns which words and phrases usually map to others — similar to the way humans acquire language. The problem with statistical MT is that it requires a large number of directly translated sentences. These are hard to find, and because of this SMT systems use sources like the proceedings from the European Parliament, United Nations, etc. Which are fine if you’re writing in bureaucrat-speak, but aren’t so great for other texts. Google Translation Center is a straightforward and very clever way to gather a large corpus of parallel texts to train its machine translation systems.

Part machine translator and part translation memory (a sort of search engine for translation that helps translators to recall translations), GTC will help translators by providing a free, global translation memory, and in turn drive costs down by reducing the amount of work needed to complete a text. It will help Google by providing an excellent source of high quality parallel texts that can be fed back into the statistical translation systems."

Tuesday, July 29, 2008

CUIL vs Google

p2pnet news » Blog Archive » CUIL vs Google: "“Rather than assigning priority to pages based on inbound links as Google does (“Pagerank”), Cuil analyzes the content of Web pages to divine their relevance to a search query. Costello bristled when I asked if this was a semantic search engine like PowerSet (recently sold to Microsoft). Costello said Cuil’s search is ‘contextual,’ and that, ‘we’re trying to understand the real world, not the Web’.”
Cuil claims to have better search results than Google and others, “based on how they index websites,” says TechCrunch, going on:
“They do not simply catalog keywords on a site and then rank the site based on its importance. They also work to understand how words are related (France - cheese - wine, for example), to return more relevant results to users. This is a semantic approach to search, but very different from Powerset’s natural language approach.”
Powerset, “uses artificial intelligence to try to understand what sentences on a website actually mean” but Cuil, “simply tries to properly categorize and file a web page, even if the category name doesn’t appear on the site.”"

Thursday, July 24, 2008

Semantic Search Arrives at the Web

Semantic Search Arrives at the Web: "Semantic search has attracted a lot of attention in the past year, largely due to the growth of the semantic web as a whole. The term semantic search itself is popular enough to be considered overused. The term refers to searching large semantic web datasets, which is a typical problem for semantic web search engines such as Swoogle, Sindice, SWSE, Falcon-S, and Watson. The term also refers to methods of searching web documents beyond the syntactic level of matching keywords. This article discusses semantic search in this second sense." (Article by Peter Mika @ Yahoo)

Monday, May 26, 2008

Teragram Integrates Linguistic Tools with Apache Lucene

Teragram Integrates Linguistic Tools with Apache Lucene: "'Lucene is expanding its user base to high-profile corporate and consumer-facing websites around the world, and quickly becoming the open source alternative to traditional enterprise search,' said Dr. Yves Schabes, president and co-founder of Teragram. 'We're happy to provide Lucene users with language processing enhancements so they can meet the high-performance standards of traditional enterprise search engines, while still enjoying the freedom of the open source experience.'"

Monday, May 12, 2008

Powerset, with new search technology, launches today - SiliconValley.com

Powerset, with new search technology, launches today - SiliconValley.com: "Now based in San Francisco, the core of Powerset's technology was developed at Xerox's Palo Alto Research Center (PARC), which is famous for incubating breakthroughs like the computer mouse and the graphical user interface. Pell's co-founder Lorenzo Thione is a research scientist who has worked at CommerceNet consortium and the Fuji-Xerox Palo Alto Laboratory. About 25 of Powerset's 60 employees have Ph.Ds., mostly in computational linguistics.
Google has also hired dozens of specialists in computational linguistics, though its executives warn it will take years before machines can truly understand human text.
In the meantime, Powerset may face a bigger challenge: turning its computational breakthroughs into cash."

Powerset Debuts With Search of Wikipedia - Bits - Technology - New York Times Blog

Powerset Debuts With Search of Wikipedia - Bits - Technology - New York Times Blog: "Ask “Who did Henry VIII marry?” or “What did the FDA ban?” or “What did Bill Clinton sign?” and Powerset will come up with remarkably good answers. (Incidentally, Google does a decent job of answering the first of these questions but not the other two.) Powerset also has other nifty features, like its ability to create mini-dossiers that summarize the information it finds and to take users directly to a section of a document that is most relevant to their search. But Powerset remains a long way off from its promise and faces a seemingly intractable problem: for a very large fraction, if not the vast majority, of searches, keywords work just fine."

Powerset unveils semantic Wikipedia search tool

Article Reuters: "SAN FRANCISCO (Reuters) - Powerset on Sunday unveiled tools for searching Wikipedia that use conversational phrasing instead of keywords, marking the first step of its challenge to established Web search services such as Google.
Powerset's technology breaks down the meaning of words and sentences into related concepts, freeing users from always needing to type the exact words they want to find."

Saturday, February 23, 2008

Gates Sees Diminished Role for Keyboards: Financial News - Yahoo! Finance

Gates Sees Diminished Role for Keyboards: Financial News - Yahoo! Finance: "People will increasingly interact with computers using speech or touch screens rather than keyboards, Microsoft Corp. Chairman Bill Gates said.
'It's one of the big bets we're making,' he said during the final stop of a farewell tour before he withdraws from the company's daily operations in July."