@sdjacobs
Created December 4, 2014 18:15
Get count of all French unigrams in the Google Books corpus
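#!/usr/bin/env python
# -*- coding: utf-8 -*-
from google_ngram_downloader import readline_google_store

# Stream every French 1-gram file from the Google Books store.
all_records = readline_google_store(ngram_len=1, lang="fre")

# Seeding with the column names makes the first flush print the TSV header.
this_ngram = "WORDS"
this_count = "COUNT"

for (fname, url, records) in all_records:
    for r in records:
        if r.year >= 1990:
            if r.ngram == this_ngram:
                this_count += r.match_count
            else:
                # New ngram: emit the finished total, then start the next one.
                print(u'{}\t{}'.format(this_ngram, this_count))
                this_ngram = r.ngram
                this_count = r.match_count

# Flush the last ngram, which the loop itself never prints.
print(u'{}\t{}'.format(this_ngram, this_count))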

sdjacobs commented Dec 4, 2014

This script writes tab-separated values to standard output: the first column is the ngram, and the second column is that ngram's total match count, summed over all years from 1990 onward. Redirect the output to a file to save it.
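
For checking a small slice of that output, here is a minimal sketch of reading it back. The helper name `read_counts` and the sample rows are made up for illustration; the only thing taken from the script is the layout, a `WORDS`/`COUNT` header line followed by one ngram and one count per line.

```python
import io


def read_counts(lines):
    """Parse the script's TSV output into (header, {ngram: count}).

    Hypothetical helper, not part of the gist. It builds a dict in memory,
    so it is only suitable for small slices of the output.
    """
    it = iter(lines)
    # The first line is the header row that the script prints.
    header = next(it).rstrip("\n").split("\t")
    counts = {}
    for line in it:
        ngram, count = line.rstrip("\n").split("\t")
        counts[ngram] = int(count)
    return header, counts


# Made-up sample standing in for a few lines of real output.
sample_output = io.StringIO(u"WORDS\tCOUNT\nchat\t17\nchien\t3\n")
header, counts = read_counts(sample_output)
print(header, counts)
```

Splitting on the tab by hand, rather than using a CSV parser, avoids any special treatment of quote characters that may appear inside ngrams.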

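Ngrams are enumerated in alphabetical order, so all records for a given ngram arrive together. That lets us stream through the entire corpus while keeping only the current ngram and its running count in memory, rather than building up large data structures.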
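
The same streaming idea can be sketched with `itertools.groupby`, which also relies on equal keys being adjacent. This is an alternative formulation, not what the script above does. The `Record` tuple, the `sum_counts` helper, and the sample data are assumptions made for illustration; only the field names `ngram`, `year`, and `match_count` come from the script.

```python
from collections import namedtuple
from itertools import groupby

# Stand-in for the downloader's records, with only the fields used here.
Record = namedtuple("Record", ["ngram", "year", "match_count"])


def sum_counts(records, min_year=1990):
    """Yield (ngram, total) pairs from records already grouped by ngram.

    Hypothetical helper. Only one group is held at a time, so memory use
    does not grow with the size of the corpus.
    """
    recent = (r for r in records if r.year >= min_year)
    for ngram, group in groupby(recent, key=lambda r: r.ngram):
        yield ngram, sum(r.match_count for r in group)


# Made-up records; the 1985 row is dropped by the year filter.
sample = [
    Record("chat", 1985, 4),
    Record("chat", 1995, 10),
    Record("chat", 2000, 7),
    Record("chien", 1999, 3),
]
totals = list(sum_counts(sample))
for ngram, total in totals:
    print("{}\t{}".format(ngram, total))
```

If the records for one ngram were ever split into non-adjacent runs, both this sketch and the original loop would emit that ngram more than once, so the approach depends on the ordering guarantee.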
