@lmullen
Created January 23, 2015 16:16
Scraping the Joseph Smith Papers
library(rvest)
library(dplyr)
library(magrittr)

# First find the list of people and parse out their names and urls.
base <- "http://josephsmithpapers.org"
list_of_people <- "/reference/people#a::"

results <- paste0(base, list_of_people) %>%
  read_html() %>%
  html_nodes(".alphaItem")

names <- results %>%
  html_text()

path <- results %>%
  html_attr("href")

people <- tibble(names, path)
# Given the url of a person's page, extract the biographical metadata
# and the papers that mention that person.
get_person_data <- function(url) {
  result <- read_html(url)
  full_name <- result %>%
    html_node(".metadata:nth-child(1) dd") %>%
    html_text()
  gender <- result %>%
    html_node(".metadata:nth-child(2) dd") %>%
    html_text()
  # The biographical sketch is the third paragraph on the page
  bio <- result %>%
    html_nodes("p") %>%
    .[3] %>%
    html_text()
  # Titles of the papers that mention this person, kept as a list-column
  # so that each person stays a single row
  mentions <- result %>%
    html_nodes("#paper-link a") %>%
    html_text()
  tibble(full_name, gender, bio, mentions = list(mentions))
}
# Try it out on the first person in the list.
temp <- paste0(base, people$path[1]) %>%
  get_person_data()
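
# A possible extension, not part of the steps above: a minimal sketch of how
# one might scrape every person rather than just the first, assuming the same
# selectors hold on every person page. It pauses between requests to be
# polite to the server.
all_people <- people$path %>%
  lapply(function(p) {
    Sys.sleep(1)  # wait one second between requests
    get_person_data(paste0(base, p))
  }) %>%
  bind_rows()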