Created July 16, 2015 17:01
honnibal/30499850449a46c167a8
Syntax-specific search with spaCy
Example use of the spaCy NLP tools for data exploration.
Here we will look for reddit comments that describe Google doing something,
i.e. discuss the company's actions. This is difficult, because other senses of
"Google" now dominate usage of the word in conversation, particularly references to
using Google products.
The heuristics here are quick and dirty: about 5 minutes' work. A better approach
would be to use the word vector of the verb. But the demo here is just to show what
can be built quickly, to start to understand some data.
from __future__ import unicode_literals
from __future__ import print_function
import sys

import plac
import bz2
import ujson
import spacy.en


def main(input_loc):
    nlp = spacy.en.English()  # Loading the model takes 10-20 seconds.
    for line in bz2.BZ2File(input_loc):  # Iterate over the reddit comments from the dump.
        comment_str = ujson.loads(line)['body']  # Parse the JSON object and extract the 'body' attribute.
        comment_parse = nlp(comment_str)  # Apply the spaCy NLP pipeline.
        for word in comment_parse:  # Look for the cases we want.
            if google_doing_something(word):
                # Print the clause headed by the verb that "Google" attaches to.
                print(''.join(w.string for w in word.head.subtree).strip())


def google_doing_something(w):
    if w.lower_ != 'google':
        return False
    elif w.dep_ != 'nsubj':  # Is it the subject of a verb?
        return False
    elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux':  # Exclude main-verb 'be', e.g. "Google is ..."
        return False
    elif w.head.lemma_ in ('say', 'show'):  # Exclude e.g. "Google says..."
        return False
    return True


if __name__ == '__main__':
    plac.call(main)  # Assumed: the listing is cut off here, and plac is otherwise unused.

Example output. Many false positives remain. Some come from spaCy misparsing the sentence, others from flaws in our filtering logic. But the results are vastly better than those of a string-based search, which returns almost no examples of the pattern we're looking for.
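
For comparison, here is a rough sketch of what a string-based search might look like. The gist does not show the search it was compared against, so the `string_search` helper and its substring test are assumptions; the sample lines are taken from the output in this thread.

```python
# Sample lines copied from the example output; only the first one
# has "Google" as a company doing something.
comments = [
    "Google dropped support for Android < 4.0 already",
    "A quick google looks like Synology does have a sync'ing feature",
    "google translate with my life or anything",
    "that you can Google and shop",
]


def string_search(lines, needle="google"):
    """Hypothetical baseline: keep every line that mentions the needle."""
    return [line for line in lines if needle in line.lower()]


hits = string_search(comments)
print(len(hits), "of", len(comments), "lines match")
```

Every line matches, so the string search cannot tell "Google dropped support" apart from "google translate". The hits you actually want are buried among the other senses of the word.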

Google dropped support for Android < 4.0 already
google drive
Google to enforce a little more uniformity in its hardware so that we can see a better 3rd party market for things like mounts, cases, etc
When Google responds
Google translate cyka pasterino.
A quick google looks like Synology does have a sync'ing feature which does support block level so that should work (Never used Synology, but they really do have some great features you expect in NAS's.
(google came up with some weird One Piece/FairyTail crossover stuff), and is their knowledge universally infallible?
Until you have the gear, google some videos on best farming runs on each planet, you can get a lot REAL fast with the right loop.
Google offers something like this already, but it is truly terrible.
google isn't helping me
Google tells me: 0 results, 250 pages removed from google.
how did Google swoop in and eat our lunch
Google does come up with some good ideas.
google translate with my life or anything
what Google told you-
Google now cards
that Google are actively paying Snapchat to ignore Windows Phone
that google has a monopoly in search engines
Google which handles 9 million take downs a week doesn't hire out to thousands of TFW to handle the work load
So, Google bought the flying cars patent and Apple acquired self lacing shoes.
Google making cross platform applications tied to a web-browser
Google crawls the web and takes snapshots of each page as a backup just in case the current page is not available.
because Google is going to steal your credit card info
Google disagrees with you
before Google bought them
Google had not pushed L to these devices or would make it easy for people like me who remain in stock with OTA updates to easily revert back to Kitkat
Google putting the work in to make themselves a direct competitor
newer versions of android and Google moving away from SD cards
that you can Google and shop

Awesome. Absolutely awesome work man.

The logic of catching some of those tricky adverbs/verbs (e.g. 'A quick Google') would be hard to generalize... Maybe this is too strict, but I assume it's possible to check the words falling directly before/after Google and reject results where either one is on a given blacklist of verbs/adverbs?

Also, 1000 points just for trawling this:
So, Google bought the flying cars patent and Apple acquired self lacing shoes.
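
A minimal sketch of that before/after check, assuming plain whitespace tokenization. The `BLACKLIST` words and both helper names are made up for illustration; nothing here comes from the gist or from spaCy.

```python
# Hypothetical blacklist: the words are illustrative guesses only.
BLACKLIST = {"quick", "translate", "drive"}


def neighbour_blacklisted(tokens, i, blacklist=BLACKLIST):
    """True if the word right before or right after tokens[i] is blacklisted."""
    before = tokens[i - 1].lower() if i > 0 else None
    after = tokens[i + 1].lower() if i + 1 < len(tokens) else None
    return before in blacklist or after in blacklist


def keep_mentions(sentence):
    """Positions of 'google' whose direct neighbours are not blacklisted."""
    tokens = sentence.split()
    return [
        i for i, tok in enumerate(tokens)
        if tok.lower() == "google" and not neighbour_blacklisted(tokens, i)
    ]


print(keep_mentions("A quick google looks like Synology"))  # rejected
print(keep_mentions("before Google bought them"))           # kept
```

It only ever looks at the two adjacent words, so each new kind of false positive needs another entry in the list.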

Well, the idea is that words-before-and-after is really just a proxy measure for the syntactic structure, which is a "tree". But we don't have to use the string order: spaCy gives you that tree :).

Like, compare these sentences (trees provided by CMU's parser, since I don't have spaCy linked up to a visualiser yet):

a) "a quick Google would show you're wrong"
b) "Google shows you're wrong"

You see the arc labelled "nsubj" from "show" to Google? That's the sort of relationship we're checking in the google_doing_something function. The "dep_" property gives the label of the arc (e.g. nsubj), and the "lemma_" property gives the uninflected form ("show", not "shows").
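
To make the arc check concrete without needing the parser, here is a small sketch. The `Token` class is a hypothetical stand-in rather than spaCy's real class, and the arcs are assigned by hand, so treat the labels (including "compound") as assumptions, not parser output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical stand-in for a parsed token; not spaCy's class."""
    text: str
    dep_: str                        # label of the arc from the head to this token
    lemma_: str                      # uninflected form
    head: Optional["Token"] = None   # the token this one attaches to

    @property
    def lower_(self):
        return self.text.lower()


def google_doing_something(w):
    # Same checks as the function in the gist.
    if w.lower_ != 'google':
        return False
    elif w.dep_ != 'nsubj':
        return False
    elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux':
        return False
    elif w.head.lemma_ in ('say', 'show'):
        return False
    return True


# b) "Google shows you're wrong": nsubj arc from "shows" to "Google".
shows = Token("shows", "ROOT", "show")
google_b = Token("Google", "nsubj", "google", head=shows)

# "before Google bought them": nsubj arc from "bought" to "Google".
bought = Token("bought", "ROOT", "buy")
google_c = Token("Google", "nsubj", "google", head=bought)

# "google drive": "google" modifies "drive", so it is not a subject.
drive = Token("drive", "ROOT", "drive")
google_d = Token("google", "compound", "google", head=drive)

for tok in (google_b, google_c, google_d):
    print(tok.text, tok.dep_, google_doing_something(tok))
```

Only the "bought" case passes. "shows" is caught through its lemma "show", and "google drive" is rejected because the arc label is not nsubj, with no word list involved.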

The idea is to give representations that abstract away a lot of the incidental variation, so that you can write more precise rules for what you're looking for. The CMU parser page has an example of a representation that's more abstract still, the semantic parse. But then the accuracy starts to go down, and we get too many parse errors. The syntactic parse is a sort of compromise: we can extract this "view" of the sentence reasonably reliably (about 92% of the arcs are correct), and it is still abstract enough to be helpful.
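
How that 92% was measured isn't stated here. One common way to score a parser is the share of words whose predicted head matches a hand-annotated one, and the sketch below assumes that reading. The `attachment_score` helper and the head indices are made up for illustration.

```python
def attachment_score(predicted_heads, gold_heads):
    """Fraction of words whose predicted head index equals the gold one."""
    if len(predicted_heads) != len(gold_heads):
        raise ValueError("both parses must cover the same words")
    if not gold_heads:
        raise ValueError("need at least one word")
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)


# Made-up head indices for a five-word sentence; one arc is wrong.
gold = [1, 1, 4, 4, 1]
predicted = [1, 1, 4, 1, 1]
print(attachment_score(predicted, gold))
```

Under this reading, 92% means roughly one arc in twelve points at the wrong head, which is why some false positives in the output come from the parse rather than from the filter.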
