@hrbrmstr
Created December 29, 2014 15:29
Scraping gnarly sites with phantomjs & rvest
library(rvest)
# example of using phantomjs for scraping sites that use a twisty maze
# of javascript to render HTML tables or other tags
# grab the phantomjs binary from here: http://phantomjs.org/
# and put it somewhere your PATH will find it
# this example scrapes the user table from:
url <- "http://64px.com/instagram/"
# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# process it with phantomjs
system("phantomjs scrape.js > scrape.html")
# use rvest as you would normally use it
page_html <- read_html("scrape.html") # html() is deprecated in newer rvest releases
page_html %>% html_nodes(xpath="//table[2]") %>% html_table()
# OR #
page_html %>% html_nodes("table:nth-of-type(2)") %>% html_table()
# if you prefer CSS selectors over XPath
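
The write-script, run-phantomjs, read-output steps above could be wrapped into a reusable function. What follows is only a sketch under some assumptions: `build_scrape_js()` and `render_with_phantomjs()` are hypothetical helper names that do not appear in the original gist, and the script checks the `status` argument that PhantomJS passes to the `page.open` callback so that a failed load exits with a non-zero code instead of writing an empty page.

```r
# Hypothetical helper (not in the original gist): build the PhantomJS
# script text for a given URL.
build_scrape_js <- function(url) {
  sprintf(
"var page = require('webpage').create();
page.open('%s', function (status) {
  if (status !== 'success') {
    phantom.exit(1);              // signal a failed page load
  } else {
    console.log(page.content);    // rendered page source
    phantom.exit(0);
  }
});", url)
}

# Hypothetical helper: render `url` with phantomjs and write the HTML
# to `out_file`. Assumes the phantomjs binary is on the PATH.
render_with_phantomjs <- function(url, out_file = "scrape.html") {
  phantomjs <- unname(Sys.which("phantomjs"))
  if (!nzchar(phantomjs)) stop("phantomjs not found on PATH")

  # write the script to a temporary file and clean it up afterwards
  script <- tempfile(fileext = ".js")
  on.exit(unlink(script))
  writeLines(build_scrape_js(url), con = script)

  # system2() sends stdout to a file, so no shell redirection is needed
  status <- system2(phantomjs, args = shQuote(script), stdout = out_file)
  if (status != 0) stop("phantomjs exited with status ", status)
  invisible(out_file)
}

# Usage (requires phantomjs and rvest):
# out <- render_with_phantomjs("http://64px.com/instagram/")
# read_html(out) %>% html_nodes(xpath = "//table[2]") %>% html_table()
```

Checking `Sys.which()` first turns a missing binary into an immediate, readable error instead of an empty output file.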
@bmacNHL

bmacNHL commented May 15, 2018

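If you use the intern=TRUE option in system(), you can capture the phantomjs output as a character vector instead of writing it to a file. Drop the shell redirect, otherwise the output goes to scrape.html and nothing is captured:

data = system("phantomjs scrape.js", intern=TRUE)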
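
As a rough sketch of how a captured vector might be used: `intern=TRUE` returns one element per output line, so the lines need to be joined back into a single string before parsing. The `collapse_output()` helper and the sample `lines` vector are made up for illustration. In real use `lines` would come from `system("phantomjs scrape.js", intern = TRUE)` with no `>` redirect in the command.

```r
# Hypothetical helper: join captured output lines back into one HTML string.
collapse_output <- function(lines) paste(lines, collapse = "\n")

# Stand-in for: lines <- system("phantomjs scrape.js", intern = TRUE)
lines <- c("<html><body>",
           "<table><tr><td>x</td></tr></table>",
           "</body></html>")
html_text <- collapse_output(lines)

# With rvest loaded, the string can be parsed directly,
# without an intermediate scrape.html on disk:
# page_html <- read_html(html_text)
# page_html %>% html_nodes("table") %>% html_table()
```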

@RoelensThomas

Hi, whenever I execute the system() command to launch phantomjs the R console freezes and I end up with an empty html page. Would you have any idea why this happens?
