@hrbrmstr
Created December 29, 2014 15:29
Scraping gnarly sites with phantomjs & rvest
library(rvest)
# example of using phantomjs for scraping sites that use a twisty maze
# of javascript to render HTML tables or other tags
# grab the phantomjs binaries from here: http://phantomjs.org/
# and put them somewhere on your PATH
# this example scrapes the user table from:
url <- "http://64px.com/instagram/"
# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# process it with phantomjs
system("phantomjs scrape.js > scrape.html")
# use rvest as you would normally use it
# (read_html() replaces html(), which is deprecated in current rvest)
page_html <- read_html("scrape.html")
page_html %>% html_nodes(xpath="//table[2]") %>% html_table()
# OR #
page_html %>% html_nodes("table:nth-of-type(2)") %>% html_table()
# if you prefer CSS selectors over XPath
@tchakravarty

tchakravarty commented Nov 4, 2016

On Windows, L23 dumps the HTML to the R console, rather than to scrape.html. This works instead: write(readLines(pipe("phantomjs scrape.js", "r")), "scrape.html"). Any idea why the former might not be working?
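
One possible explanation, offered as a guess rather than a confirmed diagnosis: on Windows, R's system() runs the command directly instead of passing it to a shell, so the `>` is never treated as a redirect. A minimal sketch of a more portable call is below. It uses system2() and lets R write the output file itself. The helper name render_to_file is hypothetical, and the sketch assumes phantomjs is on the PATH and scrape.js is in the working directory.

```r
# Sketch: let R handle the redirect instead of relying on a shell.
# render_to_file() is a hypothetical helper name, not part of the gist.
# Assumes phantomjs is on the PATH and `script` exists.
render_to_file <- function(script = "scrape.js", out = "scrape.html") {
  # stdout = <file name> tells system2() to write the program's stdout there
  status <- system2("phantomjs", args = shQuote(script), stdout = out)
  if (status != 0) stop("phantomjs exited with status ", status)
  invisible(out)
}

# Usage (not run here):
# render_to_file("scrape.js", "scrape.html")
```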

@bmacNHL

bmacNHL commented May 15, 2018

If you use the intern=TRUE option in system(), you can save the result as a character vector.

data = system("phantomjs scrape.js > scrape.html", intern=TRUE)
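
A caveat on the command above: where the call does go through a shell, the `> scrape.html` redirect sends the output to the file, so the captured vector may come back empty. Here is a hedged sketch of capturing the output in memory and parsing it without an intermediate file. The helper names capture_rendered and parse_rendered are hypothetical. It assumes phantomjs is on the PATH and that the captured text begins with the page markup.

```r
library(rvest)

# Capture phantomjs output as a character vector, one element per line.
# Hypothetical helper; assumes phantomjs is on the PATH.
# Note there is no "> scrape.html" in the command.
capture_rendered <- function(script = "scrape.js") {
  system2("phantomjs", args = shQuote(script), stdout = TRUE)
}

# Join the captured lines and parse them as literal HTML.
parse_rendered <- function(lines) {
  read_html(paste(lines, collapse = "\n"))
}

# Usage (not run here):
# page_html <- parse_rendered(capture_rendered())
# page_html %>% html_nodes("table") %>% html_table()
```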

@RoelensThomas

Hi, whenever I execute the system() command to launch phantomjs the R console freezes and I end up with an empty html page. Would you have any idea why this happens?
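
The cause of the freeze is not established in this thread, so treat the following as a sketch of things to try rather than a fix. It has the phantomjs script check the load status and give up on slow resources, and it puts a time limit on the R side. The helper name build_script and the timeout values are assumptions. page.settings.resourceTimeout depends on the installed PhantomJS version, and the timeout argument of system2() needs R 3.5.0 or later.

```r
# Sketch only. build_script() is a hypothetical helper name.
build_script <- function(url, resource_timeout_ms = 10000) {
  sprintf("var page = require('webpage').create();
page.settings.resourceTimeout = %d; // give up on slow resources
page.open('%s', function (status) {
  if (status === 'success') {
    console.log(page.content); // page source
  }
  phantom.exit(status === 'success' ? 0 : 1);
});", as.integer(resource_timeout_ms), url)
}

# Usage (not run here):
# writeLines(build_script("http://64px.com/instagram/"), con = "scrape.js")
# status <- system2("phantomjs", "scrape.js",
#                   stdout = "scrape.html", timeout = 60)
# A non-zero status means the page failed to load or the time limit was hit.
```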
