Tuesday, December 18, 2007
Q&A: Peter Norvig: "We think what's important about natural language is the mapping of words onto the concepts that users are looking for. But we don't think it's a big advance to be able to type something as a question as opposed to keywords. Typing 'What is the capital of France?' won't get you better results than typing 'capital of France.' But understanding how words go together is important. To give some examples, 'New York' is different from 'York,' but 'Vegas' is the same as 'Las Vegas,' and 'Jersey' may or may not be the same as 'New Jersey.' That's a natural-language aspect that we're focusing on. Most of what we do is at the word and phrase level; we're not concentrating on the sentence. We think it's important to get the right results rather than change the interface."
[comments at Slashdot]
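Norvig's point about phrases can be approximated with simple corpus statistics. Here's a minimal sketch (my illustration, not Google's method; the toy corpus is invented) that scores how strongly two adjacent words stick together using pointwise mutual information, one standard way to decide that 'new york' is a unit while 'a new' is just an ordinary bigram:

    import math
    from collections import Counter

    def pmi_bigram(tokens, w1, w2):
        # Pointwise mutual information: log P(w1,w2) / (P(w1) * P(w2)).
        # High values mean the pair co-occurs far more than chance predicts.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        p_pair = bigrams[(w1, w2)] / (n - 1)
        if p_pair == 0:
            return float("-inf")
        return math.log(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

    # Invented toy corpus; real systems use web-scale counts.
    tokens = ("new york is big . new york has subways . "
              "a new car . a dog . a cat . york is old .").split()
    print(pmi_bigram(tokens, "new", "york"))  # ~1.7: coheres as a phrase
    print(pmi_bigram(tokens, "a", "new"))     # ~1.0: ordinary bigram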
Thursday, December 13, 2007
Document & Media Exploitation
From ACM Queue, Vol. 5, No. 7 (Going Multimedia), November/December 2007, by Simson L. Garfinkel, Ph.D.
The DOMEX challenge is to turn digital bits into actionable intelligence.
A computer used by Al Qaeda ends up in the hands of a Wall Street Journal reporter. A laptop from Iran is discovered that contains details of that country's nuclear weapons program. Photographs and videos are downloaded from terrorist Web sites.
As evidenced by these and countless other cases, digital documents and storage devices hold the key to many ongoing military and criminal investigations. The most straightforward approach to using these media and documents is to explore them with ordinary tools - open the Word files with Microsoft Word, view the Web pages with Internet Explorer, and so on.
Although this straightforward approach is easy to understand, it can miss a lot. Deleted and invisible files can be made visible using basic forensic tools. Programs called carvers can locate information that isn't even a complete file and turn it into a form that can be readily processed. Detailed examination of e-mail headers and log files can reveal where a computer was used and other computers with which it came into contact. Linguistic tools can discover multiple documents that refer to the same individuals, even though names in the different documents have different spellings and are in different human languages. Data-mining techniques such as cross-drive analysis can reconstruct social networks - automatically determining, for example, if the computer's previous user was in contact with known terrorists. This sort of advanced analysis is the stuff of DOMEX, the little-known intelligence practice of document and media exploitation.
The U.S. intelligence community defines DOMEX as "the processing, translation, analysis, and dissemination of collected hard-copy documents and electronic media, which are under the U.S. government's physical control and are not publicly available." [1] That definition goes on to exclude "the handling of documents and media during the collection, initial review, and inventory process." DOMEX is not about being a digital librarian; it's about being a digital detective.
Although very little has been disclosed about the government's DOMEX activities, in recent years academic researchers - particularly those concerned with electronic privacy - have learned a great deal about the general process of electronic document and media exploitation. My interest in DOMEX started while studying data left on hard drives and memory sticks after files had been deleted or the media had been "formatted." I built a system to automatically copy the data off the hard drives, store it on a server, and search for confidential information. In the process I built a rudimentary DOMEX system. Other recent academic research in the fields of computer forensics, data recovery, machine translation, and data mining is also directly applicable to DOMEX.
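To make the "carver" idea above concrete, here is a bare-bones sketch of header/footer carving (my illustration, not any production DOMEX tool; the file name is a placeholder). It scans a raw disk image for JPEG start and end markers and writes out every candidate run of bytes, which is how data can be recovered even when no file-system entry points to it. Real carvers validate internal structure to cut down on false positives.

    import sys

    JPEG_START = b"\xff\xd8\xff"  # JPEG start-of-image marker
    JPEG_END = b"\xff\xd9"        # JPEG end-of-image marker

    def carve_jpegs(image_path, out_prefix="carved"):
        # Scan the raw bytes of a disk image, ignoring the file system,
        # and save every span that looks like a complete JPEG.
        data = open(image_path, "rb").read()
        count = 0
        pos = data.find(JPEG_START)
        while pos != -1:
            end = data.find(JPEG_END, pos + 2)
            if end == -1:
                break
            with open(f"{out_prefix}_{count:04d}.jpg", "wb") as f:
                f.write(data[pos:end + 2])
            count += 1
            pos = data.find(JPEG_START, end + 2)
        return count

    if __name__ == "__main__":
        print(carve_jpegs(sys.argv[1]), "candidate JPEGs carved")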
Thursday, December 06, 2007
Google and Its Enemies
The much-hyped project to digitize 32 million books sounds like a good idea. Why are so many people taking shots at it?
Nice article by Jonathan V. Last at the Weekly Standard.
Tuesday, October 23, 2007
How Do You Say... - WSJ.com
How Do You Say... - WSJ.com: "Google also has been developing its own translation software, which it uses to translate Web sites written in Chinese and Arabic. Google's technology is different from other translation software. Google feeds massive volumes of existing translations of text into a program, which uses that material to perform new translations by determining the statistical probability that a word or phrase in one language is equivalent to that of the other, says Peter Norvig, Google's director of research. The source text can include matching articles from news sites written in both Chinese and English, or European Union documents that are translated into the languages of the group's member countries, he says. Other translation technologies rely on preprogrammed dictionaries and grammatical rules to perform translations."
That's all the article says about the future... The rest is a look at the past, at technologies invented in the 1960s and early 1970s.
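For what it's worth, the statistical idea in the quote is easy to caricature. This sketch (mine, not Google's system; the three-pair "corpus" is invented) estimates how likely a target word is to translate a source word by counting co-occurrence in aligned sentence pairs, which is roughly step zero of statistical MT:

    from collections import Counter, defaultdict

    # Invented stand-in for a parallel corpus of aligned sentence pairs.
    PARALLEL = [
        ("the white house", "la casa blanca"),
        ("a white house", "una casa blanca"),
        ("the house", "la casa"),
    ]

    def cooccurrence_table(pairs):
        # Count (source word, target word) co-occurrences within aligned
        # pairs, then normalize into a crude estimate of P(target | source).
        joint, totals = Counter(), Counter()
        for src, tgt in pairs:
            for s in src.split():
                for t in tgt.split():
                    joint[(s, t)] += 1
                    totals[s] += 1
        table = defaultdict(dict)
        for (s, t), c in joint.items():
            table[s][t] = c / totals[s]
        return table

    table = cooccurrence_table(PARALLEL)
    # Raw counts can't yet separate 'blanca' from 'casa' for 'white';
    # real aligners iterate (EM) so that 'casa', already explained by
    # 'house', loses mass. This shows only the first counting step.
    print(sorted(table["white"].items(), key=lambda kv: -kv[1]))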
Saturday, September 22, 2007
Taking on Google: Is Semantic Technology the Answer?
Taking on Google: Is Semantic Technology the Answer?: "The startups have one key advantage: Google is rapidly pushing into new markets such as word processing, online payment systems, and mobile devices. These new markets provide higher growth - and more satisfaction for Wall Street - than rebuilding its existing search engine would. That leaves an opening for upstarts - if they can provide users with a good enough reason to switch from Google's powerful simplicity, said Greg Sterling of Sterling Market Intelligence. 'These engines need to create incentives to change and reward people for their behavioral change,' he said. 'If (semantic search engines) deliver, people will likely respond.'"
Tuesday, September 18, 2007
Drudge Report links to language news
This is London: "Following the operation, William, a pupil at Hempland Primary School in York, was in hospital for more than four weeks. He lost the ability to read and write, and his memory was also affected. But remarkably he was able to play the piano and trumpet much better than before. After he came out of hospital, William went on a family holiday to Northumberland with his parents and brothers Alex, 16, and Edward, 15. 'William was playing on the beach,' said Mrs McCartney-Moore. 'He suddenly said, "Look, I've made a sand castle," but really stretched the vowels out, which made him sound really posh. We all just stared back at him - we couldn't believe what we had just heard, because he had a northern accent before his illness. But the strange thing was that he had no idea why we were staring at him - he just thought he was speaking normally.'"
Tuesday, September 11, 2007
Microsoft Launches Translation Service
Haven't seen much from the Microsoft Research MT group. The latest papers listed on their website are from 2002. Maybe they are doing something in production now...
Microsoft Launches Translation Service: "Windows Live Translator's presentation is extremely interesting: the default view shows the original page and the translation side by side in two vertical frames. If you hover over a sentence in one of the pages, the sentence is highlighted in both pages. If you scroll in one of the pages, the other page performs the same action. This is an interesting approach, especially for those who speak both languages fairly well or want to learn a new language. Unfortunately, it's difficult to read a page that requires horizontal scrolling."
Thursday, September 06, 2007
You can't index meaning...
Hakia, a meaning-based search engine: "'You can't index meaning,' Riza explains. 'You can only index words, addresses, and URLs. We have invented a new system called Qdexing, which is specifically designed for meaning representation. Qdex means query detection and extraction. This entails analyzing the entire content of a webpage, then extracting all possible queries that can be asked of this content, at various lengths and forms. These queries become gateways to the originating documents, paragraphs, and sentences during the retrieval mode. Note that this is done off-line, before any actual query is received from a user.'"
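Hakia hasn't published how Qdexing works beyond this description, so the following is only a loose sketch of the stated idea (every pattern and data item below is invented): at index time, generate the queries a sentence could answer and map each query form back to its source passage, so retrieval becomes a lookup.

    from collections import defaultdict

    def candidate_queries(sentence):
        # Toy query extraction: an 'X is Y' sentence yields 'what is X',
        # and short n-grams are indexed as additional query forms.
        queries = set()
        words = sentence.lower().rstrip(".").split()
        if "is" in words:
            subject = " ".join(words[:words.index("is")])
            queries.add(f"what is {subject}")
        for n in (2, 3):
            for i in range(len(words) - n + 1):
                queries.add(" ".join(words[i:i + n]))
        return queries

    def build_qdex(documents):
        # Offline pass, before any user query arrives: map every
        # extracted query form to the sentences that can answer it.
        index = defaultdict(list)
        for doc_id, text in documents.items():
            for sentence in text.split(". "):
                for q in candidate_queries(sentence):
                    index[q].append((doc_id, sentence))
        return index

    docs = {"d1": "Paris is the capital of France. The Seine flows through Paris."}
    index = build_qdex(docs)
    print(index["what is paris"])  # retrieval is now a direct lookup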
Saturday, July 28, 2007
IBM Research | Almaden Research Center | Computer Science
IBM Research | Almaden Research Center | Computer Science: "An increasingly important class of keyword search tasks comprises those where users are looking for a specific piece of information buried within a few documents in a large collection. Examples include searching for (a) someone's phone number or a package-tracking URL within a personal email collection, (b) reviews from blogs, and (c) the internal homepage for a person or a group within the company intranet. While modern information extraction techniques can be used to extract the concepts involved in these tasks (persons, phone numbers, restaurant reviews, etc.), since users only provide keywords as input, the problem of identifying the documents that contain the information of interest remains a challenge.
In Avatar Semantic Search, we are building a solution to this problem based on the concept of automatically generating 'interpretations' of keyword queries. Interpretations are precise structured queries, over the extracted concepts, that model the real intent behind a keyword query. We have formalized the notion of interpretations and are addressing the various challenges in identifying the most likely interpretations for a given keyword query. The resulting interpretations are presented in an intuitive interface, resulting in a dialogue between the user and the system to determine the true user intent."
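The excerpt doesn't give Avatar's internals, so here is only a schematic sketch of the "interpretations" idea (the concept catalog and keywords are invented): map each keyword to the concept types it could denote, then enumerate the structured queries those choices imply, which is the candidate set a dialogue with the user would narrow down.

    from itertools import product

    # Invented catalog: concept types each keyword might denote.
    CONCEPTS = {
        "smith": ["person.name", "email.sender"],
        "phone": ["phone.number_field", "plain_keyword"],
    }

    def interpretations(keywords):
        # Each assignment of one concept type per keyword is a precise
        # structured reading of the fuzzy keyword query.
        options = [CONCEPTS.get(k, ["plain_keyword"]) for k in keywords]
        for combo in product(*options):
            yield dict(zip(keywords, combo))

    for reading in interpretations(["smith", "phone"]):
        print(reading)
    # {'smith': 'person.name', 'phone': 'phone.number_field'} models
    # "find the phone number of the person named Smith".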
Tuesday, July 17, 2007
Technology Review: The Future of Search
Technology Review: The Future of Search: "The two biggest projects are machine translation and the speech project. Translation and speech went all the way from one or two people working on them to, now, live systems."
Tuesday, June 26, 2007
Human Language Technology Center at Johns Hopkins
Johns Hopkins Gazette | June 25, 2007: "The Johns Hopkins University has been awarded a long-term multimillion-dollar contract to establish and operate a Human Language Technology Center of Excellence near the Homewood campus. The center's research will focus on advanced technology for automatically analyzing a wide range of speech, text and document image data in multiple languages."
Thursday, June 21, 2007
FactSpotter from XRCE Grenoble
Searching for documents that contain specific information can be time-consuming and frustrating in today's office environment. Xerox scientist Frédérique Segond helped develop FactSpotter, a new technology that takes ordinary search to the next level by digging into more documents, analyzing the meaning of words in context, and accepting queries in everyday language. Segond manages parsing and semantics research at the Xerox Research Centre Europe in Grenoble.
Friday, May 25, 2007
Business Objects to Acquire Text Analytics Leader Inxight Software
Combination of Inxight and Business Objects to Deliver First Full Spectrum Business Intelligence Platform: "With the acquisition of Inxight Software, Inc., Business Objects expands its leadership in extending BI to embrace enterprise search. Going beyond basic keyword searches and solutions that simply provide a ranked listing of searched items, Inxight’s web services-based federated search and extraction capabilities extend the value of enterprise search engines by instantly clustering and filtering results from multiple search engines, including Google Search Appliance and Oracle Secure Enterprise Search. By providing a BI platform that leverages these capabilities, Business Objects will become the first vendor to bridge the gap between search and intelligence – delivering a broader view of data and dramatically accelerating the ability to locate hidden information in search results that might otherwise be overlooked. "
Tuesday, May 15, 2007
How Google translates without understanding
How Google translates without understanding | The Register: "The Google approach is a lesson in practical software development: try things and see what sticks. It has just a few major steps:
1. Google starts with lots and lots of paired-example texts, like formal documents from the United Nations, in which identical content is expertly translated into many different languages. With these documents they can discover that 'white house' tends to co-occur with 'casa blanca,' so that the next time they have to translate a text containing 'white house' they will tend to use 'casa blanca' in the output.
2. They have even more untranslated text in each language, which lets them make models of 'well-formed' sentence fragments (for example, preferring 'white house' to 'house white'). So the raw output from the first translation step can be further massaged into (statistically) nicer-sounding text.
3. Their key for improving the system - and winning competitions - is an automated performance metric, which assigns a translation quality number to each translation attempt. More on this fatally weak link below."
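Steps 2 and 3 are easy to caricature in a few lines. This sketch (my illustration with an invented toy corpus; the production systems are vastly more elaborate) shows a smoothed bigram language model preferring 'white house' to 'house white', and the BLEU-style clipped n-gram precision that such automated metrics are built on:

    import math
    from collections import Counter

    def bigram_logprob(sentence, corpus):
        # Step 2: score word order with a bigram model estimated from
        # monolingual text (add-one smoothing keeps unseen pairs finite).
        tokens = corpus.split()
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab = len(unigrams)
        words = sentence.split()
        return sum(math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab))
                   for w1, w2 in zip(words, words[1:]))

    CORPUS = "the white house said the white house will move"
    print(bigram_logprob("white house", CORPUS) >
          bigram_logprob("house white", CORPUS))  # True: prefers 'white house'

    def ngram_precision(candidate, reference, n=1):
        # Step 3 ingredient: clip each candidate n-gram count by its count
        # in the reference, so repeating a common word earns no credit.
        cand = Counter(zip(*[candidate.split()[i:] for i in range(n)]))
        ref = Counter(zip(*[reference.split()[i:] for i in range(n)]))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        return clipped / max(1, sum(cand.values()))

    print(ngram_precision("the the the", "the cat sat"))  # 1/3, not 1.0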
Monday, May 07, 2007
PROMT 8.0: revamped translation software
PROMT revamped translation software product line: OSP International: "Evaluation of machine translation quality is usually quite subjective, but PROMT claims that PROMT 8.0 analyzes the context and generates grammatically correct translations of most linguistic structures and set expressions. The user can teach the translator, enriching its vocabulary by adding personal dictionaries and reusing earlier translated text pieces in further translations. The quality of translation, especially of specialized texts, also largely depends on setting up the software according to the document's subject. The system set-up procedure, which many users used to ignore because of its length and complexity, has been much simplified in version 8.0."
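"Using earlier translated text pieces" is the classic translation-memory idea. A minimal sketch of fuzzy TM lookup (not PROMT's implementation; the memory contents and threshold are invented): retrieve the stored translation of the most similar previously translated segment for the user to accept or post-edit.

    from difflib import SequenceMatcher

    # Invented memory of previously translated segments.
    MEMORY = {
        "Press the power button.": "Appuyez sur le bouton d'alimentation.",
        "Close all open files.": "Fermez tous les fichiers ouverts.",
    }

    def tm_lookup(sentence, memory, threshold=0.7):
        # Return the stored translation of the most similar source
        # segment, provided similarity clears the fuzzy-match threshold.
        best_src, best_score = None, 0.0
        for src in memory:
            score = SequenceMatcher(None, sentence.lower(), src.lower()).ratio()
            if score > best_score:
                best_src, best_score = src, score
        if best_score >= threshold:
            return memory[best_src], best_score
        return None, best_score

    print(tm_lookup("Press the power button now.", MEMORY))
    # High-similarity fuzzy match: reuses the earlier translation.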
Saturday, April 07, 2007
Rosette Linguistics Platform by Basis Technology
Rosette Linguistics Platform by Basis Technology: "Basis Technology's interface module for the Rosette® Linguistics Platform (RLP) adds extensive multilingual support to Lucene quickly and easily.
RLP is the same multilingual text analysis technology used by the leading commercial search engines including Google, Yahoo!, Ask, and Live.com Search. That means users can enjoy the same quality of experience with Lucene they have come to expect with their favorite web and enterprise search engines."
Sunday, April 01, 2007
Machine Translation: Google News Headlines
Buzz of the week: it's interesting to see how each of the titles of this same story puts a different slant on it - from 'seeks' to 'on the cards' to 'speaking' and, finally, to 'has visions.'
Google seeks world of instant translations
Boston Globe - Boston, MA, USA
Franz Och, head of statistical machine translation efforts at Google, is photographed at his office in Mountain View, California, March 20, 2007. ...

Instant translation of content on the cards for Google
IT PRO - London, Greater London, UK
Could statistical machine translation deliver real-time language translation of text and other content for the search giant? ...

Google speaking everyone's language
CNNMoney.com - USA
Google's approach, called statistical machine translation, differs from past efforts in that it forgoes language experts who ...

Google has visions of instant online translation
Mobile Digest - London, England, UK
This 'statistical machine translation' doesn't rely directly on language experts, grammatical rules, and dictionaries, as existing systems do, ...
Thursday, March 29, 2007
Google Translates Into Revenues [Fool.com] March 29, 2007
Google Translates Into Revenues [Fool.com] March 29, 2007: "All of this should be music to Google investors' ears. Just imagine the possibilities as millions of citizens all across the globe stream onto the Internet and are no longer restricted to reading materials or accessing websites only in their native language. At a minimum, the service should lead to a sizeable increase in advertising revenue and that, in turn, should translate into higher revenues and profits for Google."
1954 all over again...
Language Weaver Launches Consumer-Focused Subsidiary, Kontrib, First to Offer Multilingual Social Bookmarking Site
Language Weaver Launches Consumer-Focused Subsidiary, Kontrib, First to Offer Multilingual Social Bookmarking Site: "Kontrib uses Language Weaver’s automated language translation software, a proprietary software engine developed using statistical methodology, or mathematical probability algorithms, to automatically translate English, Spanish, French and Arabic-language news items and postings into any or all of these languages. If a user submits a story in English, Spanish, French or Arabic it is available in the other languages in a matter of minutes. "
Tuesday, March 27, 2007
Google seeks world of instant translations - washingtonpost.com
Google seeks world of instant translations - washingtonpost.com: "Google chairman Eric Schmidt also sees broad political consequences of a world with easy translations.
'What happens when we have 100 languages in simultaneous translation? Google and other companies are working on statistical machine translation so that we can on demand translate everything all the time,' he told a conference earlier this year.
'Many, many societies have operated in language-defined communities where they really don't understand and are not particularly sympathetic to other peoples' views because of the barrier of language. We're about to have that breakthrough and it is a huge thing.'"
Thursday, March 22, 2007
The Phoenix Online - Linguistics professor’s new book laments dying languages
The Phoenix Online - Linguistics professor’s new book laments dying languages: "K. David Harrison’s new book “When Languages Die: The Extinction of the World’s Languages and the Erosion of Human Knowledge” looks at what is lost from scientific, linguistic and humanistic vantage points when a language dies by examining field studies of endangered languages in Siberia, Mongolia, the Himalayas, North America and elsewhere."
Just finished reading this book. Very interesting, especially the discussion on calendars. It was a topic I hadn't read about in a long time...
Here is an interview with National Geographic.
Thursday, March 15, 2007
Powerset Welcomes Ronald Kaplan as Chief Technology and Science Officer @ SYS-CON Media
http://www.sys-con.com/read/348724.htm: Barney Pell, Founder and CEO of Powerset: "Ron's computational linguistic expertise is widely respected, and his decision to join our team gives us a clear competitive advantage over today's keyword-based search offerings. In addition, our ongoing exclusive right to the technology that Ron and his former team at PARC are developing comprises a core part of our aggressive business strategy."
Saturday, February 10, 2007
Technology News: Portals & Search: PARC Licenses Search System, Aims to Upstage Google
Technology News: Portals & Search: PARC Licenses Search System, Aims to Upstage Google: "PARC's natural language technology -- which enables computers to understand plain-language expressions instead of having to work with keywords or preprogrammed commands -- is considered among the best in the world by search mavens.
The question remains, however, how well and how quickly that technology can be converted into a consumer-facing search engine.
Powerset said its cofounders -- Barney Pell, Steve Newcomb and Lorenzo Thione -- have been working with PARC since 2005 to explore and develop a market opportunity that could employ translating natural language in Internet search.
'Our collaboration with PARC results in remarkable new search capabilities that will turn the current statistical search model on its head,' said Pell.
As part of the deal, PARC researcher Ron Kaplan will join Powerset as chief technology officer. Pell called Kaplan 'an esteemed voice within the computational linguistics community.'
Though they did not release specific go-to-market plans, the two parties hinted the technology was close to ready.
'The time is right to tell the world about the game-changing technology we've created,' Kaplan said."