Document & Media Exploitation
From Going Multimedia Vol. 5, No. 7 - November/December 2007 by Simson L. Garfinkel, Ph.D.
The DOMEX challenge is to turn digital bits into actionable intelligence.
A computer used by Al Qaeda ends up in the hands of a Wall Street Journal reporter. A laptop from Iran is discovered that contains details of that country's nuclear weapons program. Photographs and videos are downloaded from terrorist Web sites.
As evidenced by these and countless other cases, digital documents and storage devices hold the key to many ongoing military and criminal investigations. The most straightforward approach to using these media and documents is to explore them with ordinary tools - open the Word files with Microsoft Word, view the Web pages with Internet Explorer, and so on.
Although this straightforward approach is easy to understand, it can miss a lot. Deleted and invisible files can be made visible using basic forensic tools. Programs called carvers can locate information that isn't even a complete file and turn it into a form that can be readily processed. Detailed examination of e-mail headers and log files can reveal where a computer was used and other computers with which it came into contact. Linguistic tools can discover multiple documents that refer to the same individuals, even though names in the different documents have different spellings and are in different human languages. Data-mining techniques such as cross-drive analysis can reconstruct social networks - automatically determining, for example, if the computer's previous user was in contact with known terrorists. This sort of advanced analysis is the stuff of DOMEX, the little-known intelligence practice of document and media exploitation.
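To make the idea of a carver concrete, here is a minimal sketch in Python that scans a raw byte buffer for JPEG start and end markers and pulls out whatever lies between them. It is an illustration under simplifying assumptions, not how any particular forensic tool works: the function name `carve_jpegs` and the `max_size` cutoff are hypothetical, and it assumes each image is stored contiguously.

```python
# JPEG files begin with the SOI marker (FF D8) followed by another marker byte (FF),
# and end with the EOI marker (FF D9).
JPEG_START = b"\xff\xd8\xff"
JPEG_END = b"\xff\xd9"


def carve_jpegs(raw: bytes, max_size: int = 10_000_000) -> list:
    """Return candidate JPEG byte strings found in a raw buffer.

    Assumption: each image is contiguous (not fragmented) in the buffer.
    """
    found = []
    pos = 0
    while True:
        start = raw.find(JPEG_START, pos)
        if start == -1:
            break
        end = raw.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break
        end += len(JPEG_END)
        if end - start <= max_size:
            found.append(raw[start:end])
            pos = end
        else:
            # Implausibly large span: skip this header and keep scanning.
            pos = start + 1
    return found


if __name__ == "__main__":
    image = b"junk" + b"\xff\xd8\xff\xe0abc\xff\xd9" + b"more junk"
    print(len(carve_jpegs(image)), "candidate file(s) carved")
```

Candidates recovered this way still need to be validated, since the two-byte end marker can also occur by chance inside unrelated data.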
The U.S. intelligence community defines DOMEX as "the processing, translation, analysis, and dissemination of collected hard-copy documents and electronic media, which are under the U.S. government's physical control and are not publicly available."1 That definition goes on to exclude "the handling of documents and media during the collection, initial review, and inventory process." DOMEX is not about being a digital librarian; it's about being a digital detective.
Although very little has been disclosed about the government's DOMEX activities, in recent years academic researchers - particularly those concerned with electronic privacy - have learned a great deal about the general process of electronic document and media exploitation. My interest in DOMEX started while studying data left on hard drives and memory sticks after files had been deleted or the media had been "formatted." I built a system to automatically copy the data off the hard drives, store it on a server, and search for confidential information. In the process I built a rudimentary DOMEX system. Other recent academic research in the fields of computer forensics, data recovery, machine translation, and data mining is also directly applicable to DOMEX.
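The cross-drive idea mentioned above can be sketched in a few lines of Python: extract a simple feature (here, e-mail addresses) from each drive image and report the features that turn up on more than one drive. This is a toy illustration, not the system described in the article; the function names, the regular expression, and the in-memory `images` dictionary are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Simplified pattern for e-mail-like strings in raw bytes (an assumption; real
# address syntax is broader than this).
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(raw: bytes) -> set:
    """Return the lowercased e-mail-like strings found anywhere in a raw buffer."""
    return {m.group().decode("ascii").lower() for m in EMAIL_RE.finditer(raw)}


def shared_addresses(images: dict) -> dict:
    """Map each address seen on two or more drives to the set of drive names."""
    seen = defaultdict(set)
    for name, raw in images.items():
        for addr in extract_emails(raw):
            seen[addr].add(name)
    return {addr: drives for addr, drives in seen.items() if len(drives) > 1}


if __name__ == "__main__":
    images = {
        "drive_a": b"\x00alice@example.com\x00bob@example.org",
        "drive_b": b"\x00ALICE@example.com;",
    }
    print(shared_addresses(images))
```

Because the scan runs over raw bytes rather than files, it also picks up addresses left behind in deleted or partially overwritten data.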
Thursday, December 13, 2007
Thursday, December 06, 2007
Google and Its Enemies
The much-hyped project to digitize 32 million books sounds like a good idea. Why are so many people taking shots at it?
Nice article by Jonathan V. Last at Weekly Standard
Tuesday, October 23, 2007
How Do You Say... - WSJ.com
How Do You Say... - WSJ.com: "Google also has been developing its own translation software, which it uses to translate Web sites written in Chinese and Arabic. Google's technology is different from other translation software. Google feeds massive volumes of existing translations of text into a program, which uses that material to perform new translations by determining the statistical probability that a word or phrase in one language is equivalent to that of the other, says Peter Norvig, Google's director of research. The source text can include matching articles from news sites written in both Chinese and English, or European Union documents that are translated into the languages of the group's member countries, he says. Other translation technologies rely on preprogrammed dictionaries and grammatical rules to perform translations."
That's all the article says about the future... The rest is a look at the past, at technologies that were invented in the 1960s and early 1970s.
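For readers curious what "determining the statistical probability that a word or phrase in one language is equivalent to that of the other" can look like in practice, here is a minimal Python sketch of IBM Model 1, a classic textbook word-translation model trained with expectation-maximization on sentence pairs. It is not Google's system, which the article does not describe in detail; the function names and the three-sentence German-English corpus are assumptions for illustration.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate word translation probabilities from sentence pairs.

    pairs: list of (source_tokens, target_tokens).
    Returns a dict-like t where t[(s, w)] approximates P(target word w | source word s).
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: spread each target word's mass over the source words in its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    frac = t[(sw, tw)] / norm
                    count[(sw, tw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {k: c / total[k[0]] for k, c in count.items()})
    return t


def best_translation(t, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = {w: p for (s, w), p in t.items() if s == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    corpus = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]
    model = train_model1(corpus)
    for word in ("das", "haus", "buch", "ein"):
        print(word, "->", best_translation(model, word))
```

No dictionary or grammar rule is supplied; the word pairings emerge only from which words co-occur across the translated sentences, which is why such systems depend on large volumes of existing translations.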
Saturday, September 22, 2007
Taking on Google: Is Semantic Technology the Answer?
Taking on Google: Is Semantic Technology the Answer?: "The startups have one key advantage: Google is rapidly pushing into new markets such as word processing, online payment systems, and mobile devices. These new markets provide higher growth—and more satisfaction for Wall Street—than rebuilding its existing search engine would. That leaves an opening for upstarts – if they can provide users with a good enough reason to switch from Google’s powerful simplicity, said Greg Sterling of Sterling Market Intelligence. “These engines need to create incentives to change and reward people for their behavioral change,” he said. “If (semantic search engines) deliver, people will likely respond.”
Topics: Google, Search, Geron, Hakia, Tomio, Semantic, Semantic Search, Natural Language, Powerset, Radar Networks, Adaptive Blue "
Tuesday, September 18, 2007
Drudge Report links to language news
News | This is London: "Following the operation William, a pupil at Hempland Primary School in York, was in hospital for more than four weeks. He lost the ability to read and write and his memory was also affected. But remarkably he was able to play the piano and trumpet much better than before. After he came out of hospital William went on a family holiday to Northumberland with his parents and brothers Alex, 16, and Edward, 15. 'William was playing on the beach,' said Mrs McCartney-Moore. 'He suddenly said, 'Look, I've made a sand castle' but really stretched the vowels out, which made him sound really posh. 'We all just stared back at him - we couldn't believe what we had just heard because he had a northern accent before his illness. 'But the strange thing was that he had no idea why we were staring at him - he just thought he was speaking normally.'"
Tuesday, September 11, 2007
Microsoft Launches Translation Service
I haven't seen much from the Microsoft Research MT group. The latest papers listed on their website are from 2002. Maybe they are doing something in production now...
Microsoft Launches Translation Service: "Windows Live Translator's presentation is extremely interesting: the default view shows the original page and the translation side by side in two vertical frames. If you hover over a sentence in one of the pages, the sentence is highlighted in both pages. If you scroll in one of the pages, the other page performs the same action. This is an interesting approach especially for those who speak both languages fairly well or want to learn a new language. Unfortunately, it's difficult to read a page that requires to scroll horizontally."
Thursday, September 06, 2007
You can't index meaning...
Hakia, a meaning-based search engine: "“You can’t index meaning,” Riza explains. “You can only index words, addresses, and URLs.” “We have invented a new system called Qdexing, which is specifically designed for meaning representation. Qdex means query detection and extraction. This entails analyzing the entire content of a webpage, then extracting all possible queries that can be asked to this content, at various lengths and forms. These queries become gateways to the originating documents, paragraphs and sentences during the retrieval mode. Note that this is done off-line before any actual query is received from a user.”"
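The quote gives no implementation details, so the following Python sketch is only a guess at the general shape of the idea: ahead of time, generate candidate "queries" from each sentence and store them as keys pointing back to the sentences they came from. Here word n-grams stand in for "all possible queries"; that substitution, along with every function name below, is an assumption and says nothing about how hakia's Qdexing actually works.

```python
import re
from collections import defaultdict


def candidate_queries(sentence, max_len=3):
    """Yield word n-grams (1..max_len words) as stand-in 'queries' for a sentence."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def build_index(docs, max_len=3):
    """Off-line step: map each candidate query to (doc_id, sentence_index) pairs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        for s_idx, sentence in enumerate(sentences):
            for query in candidate_queries(sentence, max_len):
                index[query].add((doc_id, s_idx))
    return index


def lookup(index, query):
    """Retrieval step: a normalized query is just a key into the prebuilt index."""
    return index.get(" ".join(query.lower().split()), set())


if __name__ == "__main__":
    docs = {
        "d1": "Carvers recover deleted files. Headers reveal contacts.",
        "d2": "Deleted files can be recovered.",
    }
    index = build_index(docs)
    print(sorted(lookup(index, "deleted files")))
```

The trade-off is visible even at this scale: all the work happens before any user query arrives, at the cost of an index that grows with the number of candidate queries per sentence.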