Friday, August 04, 2006

Google is sharing N-gram data

Official Google Research Blog: All Our N-gram are Belong to You: "we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times."
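
To make the two cutoffs in the quote concrete, here is a minimal Python sketch of thresholded n-gram counting on a small in-memory token list. It is not Google's pipeline: the function name `count_ngrams`, its parameters, and the choice to replace rare words with an `<UNK>` placeholder before counting are all assumptions made for illustration, and a trillion-word corpus would need a distributed job rather than a single `Counter`.

```python
from collections import Counter


def count_ngrams(tokens, n=5, min_ngram_count=40, min_word_count=200, unk="<UNK>"):
    """Count n-grams, keeping only frequent words and frequent n-grams.

    Hypothetical helper: the defaults mirror the thresholds in the quote,
    but the rare-word handling (mapping to `unk`) is an assumption.
    """
    # Pass 1: word frequencies, then keep words at or above the cutoff.
    word_counts = Counter(tokens)
    vocab = {w for w, c in word_counts.items() if c >= min_word_count}

    # Replace out-of-vocabulary words with the placeholder token.
    mapped = [w if w in vocab else unk for w in tokens]

    # Pass 2: slide a window of length n over the mapped tokens.
    ngram_counts = Counter(
        tuple(mapped[i:i + n]) for i in range(len(mapped) - n + 1)
    )

    # Keep only n-grams that occur at least min_ngram_count times.
    kept = {g: c for g, c in ngram_counts.items() if c >= min_ngram_count}
    return vocab, kept


# Toy usage with tiny thresholds so the filtering is visible.
toy_tokens = "a b a b a c".split()
toy_vocab, toy_bigrams = count_ngrams(
    toy_tokens, n=2, min_ngram_count=2, min_word_count=2
)
```

On the toy input, `c` occurs only once, so it drops out of the vocabulary, and the only bigrams that survive are the ones seen at least twice.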

