@thejefflarson
Created October 19, 2016 15:35
find {../data/www.huffingtonpost.com,../data/www.thenation.com} -type f -print0 |\
xargs -0 pv |\
iconv -c -t UTF8 |\
gsed "s/['’]s//g" | gsed "s/s['’]//g" |\
gsed 's/http[^ ]* //g' |\
gsed "s|[“”,‘/\"—…:;()#@!<>{}?=% &*_]| |g" |\
gtr -d "'" |\
gsed "s/’//g" |\
gtr "[:upper:]" "[:lower:]" |\
gsed 's/[0-9]/ /g' |\
gsed 's/--/ /g' |\
gtr '[' ' ' |\
gtr '.' ' ' |\
gsed -E "s/[[:space:]]+/ /g" |\
gsed "s/creepilybut/creepily but/g" |\
gsed 's/-year-old//g' |\
gsed 's/-month-old//g' |\
gsed 's/ - / /g' |\
gtr ']' ' ' > $@