For all the Machine Learning fans out there, here is a short list of datasets Google has released over the years.
Word co-occurrence counts for training n-gram language models (translation, spelling correction, speech recognition; see the first sketch after this list): blog post
Job queue traces from Google clusters: blog post, data
800M documents (search corpus) annotated with Freebase entities: blog post
Wikilinks, 40M disambiguated mentions in 10M web pages linked to Wikipedia entities: blog post
Human-judged corpus of binary relations about Wikipedia public figures (pairings of people to Freebase concepts, each annotated with a supporting document and a human rater's confidence): blog post, data
Wikipedia Infobox edit history (39M updates of attributes of 1.8M entities): blog post
Triples of (phrase, URL of a Wikipedia entity, number of times the phrase appears in the page at that URL), useful for building entity word dictionaries (see the second sketch after this list): blog post
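To give a feel for how the n-gram co-occurrence counts from the first dataset get used, here is a minimal sketch of a smoothed bigram language model for spelling-correction-style scoring. The tab-separated "n-gram, count" layout and the shard file name are assumptions for illustration, not the dataset's documented schema.

```python
# Minimal sketch: scoring candidate phrases with bigram counts.
# Assumes tab-separated "w1 w2<TAB>count" lines (an assumption about
# the file layout, not the released format).
import math
from collections import defaultdict

def load_bigram_counts(path):
    """Load 'word1 word2<TAB>count' lines into a nested count table."""
    counts = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            w1, w2 = ngram.split(" ")
            counts[w1][w2] = int(count)
    return counts

def log_prob(sentence, counts, alpha=1.0, vocab_size=1_000_000):
    """Add-alpha smoothed bigram log-probability of a tokenized sentence."""
    words = sentence.lower().split()
    total = 0.0
    for w1, w2 in zip(words, words[1:]):
        c12 = counts.get(w1, {}).get(w2, 0)
        c1 = sum(counts.get(w1, {}).values())
        total += math.log((c12 + alpha) / (c1 + alpha * vocab_size))
    return total

# Pick whichever candidate correction the model scores higher:
# counts = load_bigram_counts("2gms/2gm-0000")   # hypothetical shard name
# best = max(["their going home", "they're going home"],
#            key=lambda s: log_prob(s, counts))
```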
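And since the last item calls its (phrase, URL, count) triples "useful for entity word dictionaries", here is a minimal sketch of turning them into one: a map from each phrase to the entity it most often links to, with an estimate of how often. The tab-separated layout and the file name below are assumptions, not the released format.

```python
# Minimal sketch: building an entity word dictionary from
# (phrase, Wikipedia URL, count) triples. Assumes three tab-separated
# columns per line, which is an assumption for illustration.
import csv
from collections import defaultdict

def build_entity_dictionary(path):
    """Map each phrase to the Wikipedia entity it most often refers to."""
    totals = defaultdict(lambda: defaultdict(int))  # phrase -> url -> count
    with open(path, encoding="utf-8", newline="") as f:
        for phrase, url, count in csv.reader(f, delimiter="\t"):
            totals[phrase][url] += int(count)
    # Keep the single most frequent entity per phrase, along with the
    # fraction of the phrase's total count that this entity accounts for.
    dictionary = {}
    for phrase, by_url in totals.items():
        best_url, best_count = max(by_url.items(), key=lambda kv: kv[1])
        dictionary[phrase] = (best_url, best_count / sum(by_url.values()))
    return dictionary

# d = build_entity_dictionary("phrase_entity_counts.tsv")  # hypothetical name
# d["jaguar"] might then map to the Jaguar Wikipedia URL plus a probability.
```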