
@jrladd
Last active March 21, 2017 19:13
How to Prepare a Corpus for DocuScope

This gist contains a Jupyter Notebook tutorial for corpus preparation.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Finding and Preparing a Plaintext Corpus\n",
"\n",
"Corpus preparation is the first step of any text analysis project. Before you can run any kind of text analysis, you need to have texts to analyze. Fortunately, there are lots of resources for finding and preparing plaintext files.\n",
"\n",
"## What is plaintext?\n",
"\n",
"Most text analysis tools, including DocuScope, take *plaintext* files as input. (DocuScope, of course, takes other kind of input as well, but plaintext is a good standard.) This refers to files, usually with the `.txt` suffix, that don't have any additional encoding or notation. Plaintext files are very versatile---since they don't have any tags or encoding they can be read by almost any program and their simplicity makes them easy to manage. In this tutorial, we'll work toward getting our files into plaintext. We'll create a folder and each file in it will be a single text, as a file ending in `.txt`.\n",
"\n",
"## Ready-to-use corpora\n",
"\n",
"While web scraping is an attractive method---see a set of texts, pull them off the web---not all websites allow scraping or would be happy if they were scraped. That's why it's always best to look first at existing available collections and datasets. Fortunately, as DH and text analysis get more popular, there are more and more of these around. Some resources:\n",
"\n",
"### [Alan Liu's DH resources](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets)\n",
"\n",
"Digital Humanities scholar Alan Liu, from UC Santa Barbara, curates an excellent list of corpora readily available for download. The list is quite extensive, and many of the \"demo corpora\" at the top of the page are already in plaintext format, ready for download.\n",
"\n",
"### [Project Gutenberg](http://www.gutenberg.org/)\n",
"\n",
"Over 53,000 public domain texts available for free download. It would be relatively easy to pick a small subset of these based on your interests.\n",
"\n",
"### [Data is Plural Archives](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0)\n",
"\n",
"Jeremy Singer-Vine's [Data is Plural](https://tinyletter.com/data-is-plural) newsletter keeps track of all the dataset it publicized in a Google sheet. There are some gems in here, including [lyrics to all the Billboard hot 100 songs from the last 50 years](http://kaylinwalker.com/50-years-of-pop-music/).\n",
"\n",
"### [Public APIs](https://github.com/toddmotto/public-apis)\n",
"\n",
"APIs (Application Programming Interfaces) allow you to access site data through a set of pre-defined rules---sort of a language that allows your computer to talk to their site. These require a little more expertise, but they're free to use.\n"
]
},
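{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As a quick, optional illustration of the API approach (a minimal sketch, not required for the rest of this tutorial): Wikipedia's own [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page) can return a page's plain text directly, without any HTML parsing. The query parameters below are one reasonable way to ask for a plaintext extract, and the page title is just an example; consult the API documentation for the full set of options."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# A minimal sketch of pulling plaintext from a public API with requests.\n",
"# The endpoint and parameters follow the MediaWiki API's \"query\" action;\n",
"# the page title used here is just an example.\n",
"import requests\n",
"\n",
"params = {\n",
"    'action': 'query',          # Ask the API for page data\n",
"    'format': 'json',           # Get the response back as JSON\n",
"    'titles': 'Douglas_Adams',  # The page we want (example title)\n",
"    'prop': 'extracts',         # Request the page text\n",
"    'explaintext': 1            # ...as plain text rather than HTML\n",
"}\n",
"res = requests.get('https://en.wikipedia.org/w/api.php', params=params)\n",
"data = res.json()               # Parse the JSON response\n",
"\n",
"# The extract is nested under query -> pages -> <page id> -> extract\n",
"for page in data['query']['pages'].values():\n",
"    print(page.get('extract', '')[:500])  # Print the first 500 characters"
]
},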
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Web Scraping\n",
"\n",
"If you still need access to texts on the web that aren't in any other form, you can use a web scraping technique to get the material you need. Keep in mind that it's best to check the site's terms of service or any public copyright notices before you start pulling information down.\n",
"\n",
"We'll be following [this tutorial from *Automate the Boring Stuff with Python*](https://automatetheboringstuff.com/chapter11/). If you need more detailed help, this book, the entirety of which is available online, is a great start.\n",
"\n",
"Let's list our goals:\n",
"\n",
"- get necessary libraries/packages in Python\n",
"- find a list of links to pages we want to scrape\n",
"- scrape HTML from websites\n",
"- extract plaintext from HTML\n",
"- export plaintext into files"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Step 1: Python Packages\n",
"\n",
"The Python programming language comes pre-installed on most Linux and Mac computers. Open the Terminal application and type Python to make sure. [n.b. This tutorial is in a newer version of Python, Python3, but the changes are minute. I've marked in the code anything you'd need to change for it to work in Python2, the default version.]\n",
"\n",
"Python comes installed automatically with a package manager called `pip`. The two packages we will need for this short program are [`requests`](http://docs.python-requests.org/en/latest/), for pulling down web pages, and [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing the resulting HTML. Install them with pip by typing:\n",
"\n",
"`pip install requests`\n",
"\n",
"and\n",
"\n",
"`pip install beautifulsoup4`\n",
"\n",
"Then in our code itself, we can begin by importing these libraries:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup"
]
},
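{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"(Optional) If you're not sure which version of Python the notebook itself is running, a quick check from inside the notebook looks like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"print(sys.version)  # Shows the interpreter version, e.g. 3.x or 2.x"
]
},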
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Step 2: Get a list of links\n",
"\n",
"Many times we find a website that has the texts we want, but we need a list of links to the individual pages in order to get those texts. To get that list, you can look for a page that has links to the material you need. This can often be an archive page or other page that has lists of links. In this example, we'll use Wikipedia's [list of science fiction authors](https://en.wikipedia.org/wiki/List_of_science_fiction_authors) since we know that data is freely available and well-structured. The first thing to do is go to the page in question and use your browser's \"inspect\" tool to see how the site is structured.\n",
"\n",
"![](inspect_archive.png)\n",
"\n",
"Here we can see that the links are nested in a series of `<dd>` tags. We can use this information to construct our script:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://en.wikipedia.org/wiki/Dafydd_ab_Hugh\n",
"https://en.wikipedia.org/wiki/Alexander_Abasheli\n",
"https://en.wikipedia.org/wiki/Edwin_Abbott_Abbott\n",
"https://en.wikipedia.org/wiki/K%C5%8Db%C5%8D_Abe\n",
"https://en.wikipedia.org/wiki/Robert_Abernathy\n",
"https://en.wikipedia.org/wiki/Dan_Abnett\n",
"https://en.wikipedia.org/wiki/Daniel_Abraham_(author)\n",
"https://en.wikipedia.org/wiki/Forrest_J_Ackerman\n",
"https://en.wikipedia.org/wiki/Douglas_Adams\n",
"https://en.wikipedia.org/wiki/Robert_Adams_(science_fiction_writer)\n",
"https://en.wikipedia.org/wiki/Ann_Aguirre\n",
"https://en.wikipedia.org/wiki/Jerry_Ahern\n",
"https://en.wikipedia.org/wiki/Humayun_Ahmed\n",
"https://en.wikipedia.org/wiki/Jim_Aikin\n",
"https://en.wikipedia.org/wiki/Alan_Burt_Akers\n",
"https://en.wikipedia.org/wiki/Kenneth_Bulmer\n",
"https://en.wikipedia.org/wiki/Brian_Aldiss\n",
"https://en.wikipedia.org/wiki/David_M._Alexander\n",
"https://en.wikipedia.org/wiki/Lloyd_Alexander\n",
"https://en.wikipedia.org/wiki/Roger_MacBride_Allen\n"
]
}
],
"source": [
"# First, use requests to pull down the page\n",
"\n",
"res = requests.get('https://en.wikipedia.org/wiki/List_of_science_fiction_authors') # Request the page\n",
"html = res.text # Get just the html\n",
"\n",
"# Since we know what the html looks like, we can use BeautifulSoup to capture the links\n",
"\n",
"soup = BeautifulSoup(html, 'html.parser') # Turn our html into a more readable \"soup\"\n",
"\n",
"# And since we know that the links we want are inside any \"a\" tag that is within a \"dd\" tag, we can use\n",
"# BeautifulSoup's \"select\" method to get those:\n",
"\n",
"a_tags = soup.select('dd a')\n",
"\n",
"# Then we simply go through each of those tags and pull out the necessary link, adding to it the prefix we want.\n",
"\n",
"all_links = []\n",
"for a_tag in a_tags:\n",
" link_end = a_tag.get('href') # Get the contents of the href attribute\n",
" link = 'https://en.wikipedia.org' + link_end # Add the link prefix, which we already know\n",
" all_links.append(link) # Put these links in a list for later use\n",
" \n",
"# Let's print the first 20 links in the list:\n",
"for link in all_links[:20]:\n",
" print(link) # This is the only difference between Python 2 and 3: just make sure you don't use parantheses in Python2\n",
" # Instead type: print link"
]
},
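{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As an optional refinement (a sketch, not something the script above requires): list pages like this one can include `a` tags that aren't links to author articles, such as footnote anchors or external links. One way to tidy the list is to keep only internal `/wiki/` links and drop duplicates while preserving order:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# An optional cleanup pass over the a_tags collected above (a sketch).\n",
"clean_links = []\n",
"seen = set()\n",
"for a_tag in a_tags:\n",
"    href = a_tag.get('href')\n",
"    if not href or not href.startswith('/wiki/'):  # Skip anchors, external links, etc.\n",
"        continue\n",
"    link = 'https://en.wikipedia.org' + href\n",
"    if link not in seen:                           # Drop duplicates, keep first occurrence\n",
"        seen.add(link)\n",
"        clean_links.append(link)\n",
"\n",
"print(len(clean_links), 'links after cleaning')"
]
},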
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Step 3: Request Pages and Get Text\n",
"\n",
"Now that we have a list of links that we want to scrape, we can use the same \"inspect\" developer tool to examine one of our sample article pages:\n",
"\n",
"![](inspect_article.png)\n",
"\n",
"We can see the material we want is inside a \"div\" tag where the \"id\" element is \"mw-content-text\". Here's how we get that part of every page in the above list of links:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n",
"Finished a text!\n"
]
}
],
"source": [
"all_texts = [] # This is the list where our texts will go.\n",
"\n",
"for link in all_links[:20]: #Loop through all the links in our lists. For testing purposes, we'll just do the first 20.\n",
" res = requests.get(link) # As above, request the page.\n",
" html = res.text # Get the HTML\n",
" soup = BeautifulSoup(html, 'html.parser') # Turn the HTML into a \"soup\" object\n",
" content = soup.select('div[id=\"mw-content-text\"]') # Get only the content that's inside the tag we want\n",
" # Now we've got our content, but we only want the text, without the tags. Thankfully, that's easy:\n",
" text = content[0].getText()\n",
" all_texts.append(text) # Put all the texts in a list\n",
" print(\"Finished a text!\") # So we can see the progress, print a confirmation every time"
]
},
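{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"One optional, more polite variation on the loop above (a sketch, assuming the `all_links` list from Step 2): pausing for a second between requests with Python's built-in `time` module reduces the load you place on the site you're scraping."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# A sketch of the same scraping loop with a short pause between requests.\n",
"# It builds a separate list (polite_texts) so it doesn't interfere with all_texts above.\n",
"import time\n",
"\n",
"polite_texts = []\n",
"for link in all_links[:20]:\n",
"    res = requests.get(link)\n",
"    soup = BeautifulSoup(res.text, 'html.parser')\n",
"    content = soup.select('div[id=\"mw-content-text\"]')\n",
"    polite_texts.append(content[0].getText())\n",
"    time.sleep(1)  # Wait one second before the next request\n",
"    print(\"Finished a text!\")"
]
},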
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Step 4: Write Texts to File\n",
"\n",
"Now that we have our list of plaintext documents, we can write each one to a file. I've already created a subdirectory in this folder called `scifi_authors` where all the texts will go. This is simply a matter of using Python's built-in file writing capabilities."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dafydd_ab_Hugh.txt\n",
"File written!\n",
"Alexander_Abasheli.txt\n",
"File written!\n",
"Edwin_Abbott_Abbott.txt\n",
"File written!\n",
"K%C5%8Db%C5%8D_Abe.txt\n",
"File written!\n",
"Robert_Abernathy.txt\n",
"File written!\n",
"Dan_Abnett.txt\n",
"File written!\n",
"Daniel_Abraham_(author).txt\n",
"File written!\n",
"Forrest_J_Ackerman.txt\n",
"File written!\n",
"Douglas_Adams.txt\n",
"File written!\n",
"Robert_Adams_(science_fiction_writer).txt\n",
"File written!\n",
"Ann_Aguirre.txt\n",
"File written!\n",
"Jerry_Ahern.txt\n",
"File written!\n",
"Humayun_Ahmed.txt\n",
"File written!\n",
"Jim_Aikin.txt\n",
"File written!\n",
"Alan_Burt_Akers.txt\n",
"File written!\n",
"Kenneth_Bulmer.txt\n",
"File written!\n",
"Brian_Aldiss.txt\n",
"File written!\n",
"David_M._Alexander.txt\n",
"File written!\n",
"Lloyd_Alexander.txt\n",
"File written!\n",
"Roger_MacBride_Allen.txt\n",
"File written!\n"
]
}
],
"source": [
"for i,text in enumerate(all_texts): # Loop through each of our scraped texts\n",
" url = all_links[i] # Get the matching url for that file\n",
" filename = url.split('/')[-1] + \".txt\" # Do some text manipulation to make a unique filename from the URL\n",
" print(filename)\n",
" with open('scifi_authors/'+filename, 'w') as newfile: # Create a new file in our target directory\n",
" newfile.write(text) # Write text to file\n",
" print('File written!')"
]
},
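{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"You may notice that some of the filenames keep the URL's percent-encoding (for example, `K%C5%8Db%C5%8D_Abe.txt`). As an optional final cleanup (a sketch using Python 3's built-in `urllib.parse`), you can decode those names into readable characters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# An optional cleanup step (a sketch): decode percent-encoded characters in the\n",
"# filenames so that, e.g., 'K%C5%8Db%C5%8D_Abe.txt' becomes 'Kōbō_Abe.txt'.\n",
"# In Python 2 the equivalent function lives in urllib (urllib.unquote).\n",
"from urllib.parse import unquote\n",
"\n",
"for url in all_links[:20]:\n",
"    raw_name = url.split('/')[-1] + \".txt\"\n",
"    decoded_name = unquote(raw_name)  # Convert %-escapes back to readable characters\n",
"    print(raw_name, '->', decoded_name)"
]
},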
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Congratulations!\n",
"\n",
"You should now have every text you wanted as a plaintext file on your computer, ready to upload into DocuScope!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}