
Daily learnings

For all the Machine Learning fans out there, here is a short list of various datasets released by Google over the years.

  • Co-occurrence of words for word n-gram model training (translation, spelling correction, speech recognition): blog post
  • Job queue traces from Google clusters: blog post, data
  • 800M documents (search corpus) annotated with Freebase entities: blog post
  • Wikilinks, 40M disambiguated mentions in 10M web pages linked to Wikipedia entities: blog post
  • Human-judged corpus of binary relations about Wikipedia public figures (pairings of people with Freebase concepts, each annotated with a supporting document and a human rater's confidence): blog post, data
  • Wikipedia Infobox edit history (39M updates to attributes of 1.8M entities): blog post
  • Triples of (phrase, URL of a Wikipedia entity, number of times the phrase appears in the page at that URL), useful for building entity word dictionaries: blog post
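
To make the last item more concrete, here is a minimal sketch of how (phrase, URL, count) triples could be turned into an entity word dictionary. The function name `build_entity_dictionary`, the lowercasing step, and the sample rows are all assumptions made for illustration; they are not taken from the released data or its documentation, and the real file format may differ.

```python
from collections import defaultdict


def build_entity_dictionary(triples):
    """Map each lowercased phrase to candidate entity URLs with relative weights.

    `triples` is an iterable of (phrase, url, count) tuples. The weight of a
    URL is its share of all counts recorded for that phrase.
    """
    counts = defaultdict(dict)
    for phrase, url, count in triples:
        key = phrase.lower()  # assumption: treat phrases case-insensitively
        counts[key][url] = counts[key].get(url, 0) + count

    dictionary = {}
    for phrase, by_url in counts.items():
        total = sum(by_url.values())
        # Highest count first; ties broken by URL for a stable order.
        ranked = sorted(by_url.items(), key=lambda kv: (-kv[1], kv[0]))
        dictionary[phrase] = [(url, n / total) for url, n in ranked]
    return dictionary


# Hypothetical sample rows, not taken from the released dataset.
sample_triples = [
    ("jaguar", "https://en.wikipedia.org/wiki/Jaguar", 6),
    ("jaguar", "https://en.wikipedia.org/wiki/Jaguar_Cars", 3),
    ("Jaguar", "https://en.wikipedia.org/wiki/Jaguar_Cars", 1),
    ("big apple", "https://en.wikipedia.org/wiki/New_York_City", 5),
]

entity_dictionary = build_entity_dictionary(sample_triples)
for phrase, candidates in entity_dictionary.items():
    print(phrase, candidates)
```

Looking up a phrase then yields its candidate entities ranked by how often the phrase appears on each entity's page, which is one simple way to seed an entity disambiguation step.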