Syntactic annotations for the Google Books Ngram Corpus

Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, Slav Petrov

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

350 Scopus citations

Abstract

We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and head-modifier relationships are recorded. The annotations are produced automatically with statistical models that are specifically adapted to historical text. The corpus will facilitate the study of linguistic trends, especially those related to the evolution of syntax.

Original language: English (US)
Title of host publication: ACL 2012 - 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the System Demonstrations
Editors: Min Zhang
Publisher: Association for Computational Linguistics (ACL)
Pages: 169-174
Number of pages: 6
ISBN (Electronic): 9781937284275
State: Published - 2012
Externally published: Yes
Event: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Jeju Island, Korea, Republic of
Duration: Jul 10 2012 → …

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012
Country/Territory: Korea, Republic of
City: Jeju Island
Period: 7/10/12 → …

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics
