@edwindj
Last active March 17, 2016 21:59
devtools::install_github("hrbrmstr/hyphenatr")
library(magrittr)
library(hyphenatr)
switch_dict("nl_NL")
words <- c(
  "hottentottententententoonstelling", # extremely long word (not really used in Dutch)
  "feeëriek"                           # = fairy-like; contains a non-ASCII character (ë)
)
# and that is correctly hyphenated!
words %>%
  hyphenate %>%
  writeLines()
# However, UTF-8 encoded input is not handled well: "feeëriek" is messed up.
words %>%
  enc2utf8() %>%
  hyphenate %>%
  writeLines()
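# A minimal sketch (not part of the original gist) for inspecting how R marks
# the encoding of each string before it is passed to hyphenate(). It uses base
# R only; `demo_words` is a hypothetical name chosen so `words` is not overwritten.
demo_words <- c("hottentottententententoonstelling", "fee\u00ebriek")
Encoding(demo_words)                      # declared encoding mark per string
demo_utf8 <- enc2utf8(demo_words)         # force UTF-8 representation
nchar(demo_utf8, type = "chars")          # number of characters
nchar(demo_utf8, type = "bytes")          # number of bytes (ë takes two in UTF-8)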
### Test on a larger corpus: the Dutch Wikipedia entry on evolution
library(rvest)
library(stringi)
text <-
  read_html("https://nl.wikipedia.org/wiki/Evolutie_(biologie)") %>%
  html_node("#mw-content-text") %>%
  html_text(trim = TRUE)
Encoding(text) <- "UTF-8" # otherwise word extraction does not work correctly
# looks great
words <-
  text %>%
  stri_extract_all(regex = "([:alpha:]+)") %>%
  unlist %>%
  unique %>%
  iconv(from = "UTF-8") %>%
  sort %>%
  hyphenate()
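# A hedged sketch (not part of the original gist): pick out the entries that
# contain non-ASCII characters, since those are the ones where encoding trouble
# would show up. `sample_words` is a hypothetical stand-in for the extracted
# words; iconv() returns NA for strings it cannot convert to ASCII.
sample_words <- c("evolutie", "fee\u00ebriek", "co\u00f6rdinatie")
has_non_ascii <- is.na(iconv(sample_words, from = "UTF-8", to = "ASCII"))
sample_words[has_non_ascii]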