{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "**Goal** In this assignment, you'll make a first pass look at your newly adopted text collection similar to the Wolfram Alpha's view.\n\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Title, author, and other metadata**. First, print out some summary information that gives the background explaining what this collection is and where it comes from:"
},
{
"metadata": {},
"cell_type": "code",
"input": "print 'Title: Clinical Trial Database Trial Criteria User Input'\nprint 'Authors: Investigators running Interventional Clinical Trials'",
"prompt_number": 1,
"outputs": [
{
"output_type": "stream",
"text": "Title: Clinical Trial Database Trial Criteria User Input\nAuthors: Investigators running Interventional Clinical Trials\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "import nltk\nimport re\nsent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')",
"prompt_number": 2,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**First, load in the file or files below.** First, take a look at your text. An easy way to get started is to first read it in, and then run it through the sentence tokenizer to divide it up, even if this division is not fully accurate. You may have to do a bit of work to figure out which will be the \"opening phrase\" that Wolfram Alpha shows. Below, write the code to read in the text and split it into sentences, and then print out the **opening phrase**."
},
{
"metadata": {},
"cell_type": "code",
"input": "r = open('../Data/ct_criteria_colin.txt')\nr_lines = r.readlines()\nr_string = ' '.join(r_lines)",
"prompt_number": 3,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
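{
"metadata": {},
"cell_type": "markdown",
"source": "Before splitting anything, take a quick peek at the raw text. This is a minimal sanity check, assuming the file loaded above; the 300-character cutoff is arbitrary."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Print the first few hundred characters to get a feel for the formatting.\n# The cutoff of 300 characters is an arbitrary choice.\nprint r_string[:300]",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},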
{
"metadata": {},
"cell_type": "markdown",
"source": "Split on ' - ' which is being used as bullet points"
},
{
"metadata": {},
"cell_type": "code",
"input": "lines_split = [re.split(' - ', line) for line in r_lines]",
"prompt_number": 5,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Then run punkt on each indivudual string. Assuming the string split on the bullet points will be the closest to sentences at this point."
},
{
"metadata": {},
"cell_type": "code",
"input": "sentence_groups = []\nfor sent_group in lines_split:\n group_holder = []\n for sent in sent_group:\n group_holder.append(sent_tokenizer.tokenize(sent))\n sentence_groups.append(group_holder)\n del group_holder",
"prompt_number": 7,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Flatten the groups so each group is a list of strings"
},
{
"metadata": {},
"cell_type": "code",
"input": "for n,group in enumerate(sentence_groups):\n sentence_groups[n] = list(itertools.chain.from_iterable(group))",
"prompt_number": 106,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "sentence_groups[0]",
"prompt_number": 107,
"outputs": [
{
"text": "['Inclusion Criteria:',\n 'Healthy male',\n '18-50 years of age',\n 'Non-smoker',\n 'Not taking any medications other than the study drug for the duration of the study.',\n 'Must be willing to use an accepted method of contraception during the study.',\n 'Exclusion Criteria:',\n 'BMI > 35',\n 'Abnormal evaluation on screening exam and labs',\n 'Known history of alcohol abuse, illicit drugs or steroids and/or use of more that 3 alcoholic beverages/day',\n 'History of current testosterone use or infertility',\n 'History of testicular disease or severe testicular trauma',\n 'History of major psychiatric disorder or sleep apnea',\n 'History of bleeding disorder or need for anticoagulation',\n 'Current smoker or utilizing nicotine patches or gum',\n 'Participation in a hormonal drug study within past month.\\n']",
"output_type": "pyout",
"metadata": {},
"prompt_number": 107
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
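{
"metadata": {},
"cell_type": "markdown",
"source": "Wolfram Alpha also shows an **opening phrase**. A minimal sketch of one way to surface it here, assuming the first sentence of the first criteria group is the best candidate (this collection has no single canonical opening)."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Assumption: treat the first sentence of the first group as the opening phrase.\nopening_phrase = sentence_groups[0][0]\nprint 'Opening phrase:', opening_phrase",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},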
{
"metadata": {},
"cell_type": "markdown",
"source": "**Next, tokenize.** Look at the several dozen sentences to see what kind of tokenization issues you'll have. Write a regular expression tokenizer, using the nltk.regexp_tokenize() as seen in class, to do a nice job of breaking your text up into words. You may need to make changes to the regex pattern that is given in the book to make it work well for your text collection. \n\n*Note that this is the key part of the assignment. How you break up the words will have effects down the line for how you can manipulate your text collection. You may want to refine this code later.*"
},
{
"metadata": {},
"cell_type": "code",
"input": "pattern = r'''(?x) # set flag to allow verbose regexps\n ([A-Z]\\.)+ # abbreviations, e.g. U.S.A\n | \\w+([-‘]\\w+)* # words with optional internal hyphens\n | \\$?\\d+(\\.\\d+)?%? # currency and percentages, e.g. $12.40, 82%\n | \\.\\.\\. # ellipsis... \n | [][.,;\"'?():\\-_`]+ # these are separate tokens\n '''",
"prompt_number": 9,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
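{
"metadata": {},
"cell_type": "markdown",
"source": "A quick sanity check of the pattern on a made-up criteria-style string before running it over the whole collection. The sample text below is hypothetical, not drawn from the collection."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Hypothetical sample mixing an abbreviation, a hyphenated word, a numeric\n# range, and punctuation, to exercise the different branches of the pattern.\nsample = 'Known U.S. FDA-approved drugs; BMI > 35, age 18-50.'\nprint nltk.regexp_tokenize(sample, pattern)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},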
{
"metadata": {},
"cell_type": "markdown",
"source": "Flatten list of lines that have had the bullet points removed"
},
{
"metadata": {},
"cell_type": "code",
"input": "flattened_list = [item for sublist in lines_split for item in sublist]",
"prompt_number": 13,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "token_list = nltk.regexp_tokenize(' '.join(flattened_list), pattern)\ntoken_list[:20]",
"prompt_number": 19,
"outputs": [
{
"text": "['Inclusion',\n 'Criteria',\n ':',\n 'Healthy',\n 'male',\n '18-50',\n 'years',\n 'of',\n 'age',\n 'Non-smoker',\n 'Not',\n 'taking',\n 'any',\n 'medications',\n 'other',\n 'than',\n 'the',\n 'study',\n 'drug',\n 'for']",
"output_type": "pyout",
"metadata": {},
"prompt_number": 19
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Compute word counts.** Now compute your frequency distribution using a FreqDist over the words. Let's not do lowercasing or stemming yet. You can run this over the whole collection together, or sentence by sentence. Write the code for computing the FreqDist below."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Remove one character long punctuation for word counts and distribution"
},
{
"metadata": {},
"cell_type": "code",
"input": "from string import punctuation\ntoken_list_no_punct = [token for token in token_list if token not in punctuation]",
"prompt_number": 20,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Creating a table.**\nPython provides an easy way to line columns up in a table. You can specify a width for a string such as %6s, producing a string that is padded to width 6. It is right-justified by default, but a minus sign in front of it switches it to left-justified, so -3d% means left justify an integer with width 3. *AND* if you don't know the width in advance, you can make it a variable by using an asterisk rather than a number before the '\\*s%' or the '-\\*d%'. Check out this example (this is just fyi):"
},
{
"metadata": {},
"cell_type": "code",
"input": "print '%-16s' % 'Info type', '%-16s' % 'Value'\nprint '%-16s' % 'number of words', '%-16d' % 100000\n",
"prompt_number": 16,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Info type Value \nnumber of words 100000 \n"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
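{
"metadata": {},
"cell_type": "markdown",
"source": "And the variable-width form mentioned above: the asterisk takes the width from the argument tuple at run time (a minimal illustration, just fyi)."
},
{
"metadata": {},
"cell_type": "code",
"input": "# The width is supplied as the first element of the argument tuple.\nwidth = 24\nprint '%-*s' % (width, 'Info type'), '%-16s' % 'Value'\nprint '%-*s' % (width, 'number of words'), '%-16d' % 100000",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},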
{
"metadata": {},
"cell_type": "markdown",
"source": "**Word Properties Table** Next there is a table of word properties, which you should compute (skip unique word stems, since we will do stemming in class on Wed). Make a table that prints out:\n1. number of words\n2. number of unique words\n3. average word length\n4. longest word\n\nYou can make your table look prettier than the example I showed above if you like!\n\nYou can decide for yourself if you want to eliminate punctuation and function words (stop words) or not. It's your collection! \n"
},
{
"metadata": {},
"cell_type": "code",
"input": "print '%-24s' % 'Info type', '%-16s' % 'Value'\nprint '%-24s' % 'number of words', '%-16d' % len(token_list_no_punct)\nprint '%-24s' % 'number of unique words', '%-16d' % len(set(token_list_no_punct))\nprint '%-24s' % 'average word length', '%-16d' % (sum([len(x) for x in token_list_no_punct])/float(len(token_list_no_punct)))\nprint '%-24s' % 'longest word', '%-16s' % sorted([(len(x),x) for x in token_list_no_punct], reverse=True)[0][1]",
"prompt_number": 23,
"outputs": [
{
"output_type": "stream",
"text": "Info type Value \nnumber of words 530299 \nnumber of unique words ",
"stream": "stdout"
},
{
"output_type": "stream",
"text": "22713 \naverage word length ",
"stream": "stdout"
},
{
"output_type": "stream",
"text": "5 \nlongest word ",
"stream": "stdout"
},
{
"output_type": "stream",
"text": "plasma-alanine-amino-transaminases\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Most Frequent Words List.** Next is the most frequent words list. This table shows the percent of the total as well as the most frequent words, so compute this number as well. "
},
{
"metadata": {},
"cell_type": "code",
"input": "fdist = nltk.FreqDist(word for word in token_list_no_punct)",
"prompt_number": 25,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "percent_freq_word_list = []\ntotal_word_count = float(len(token_list_no_punct))\nfor key, value in fdist.items():\n percent_freq_word_list.append((key,value, value/total_word_count))",
"prompt_number": 36,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "print '%-24s' % 'Word', '%-16s' % 'Percent of Total'\n\nfor n in range(20):\n print '%-24s' % percent_freq_word_list[n][0], '%-4.2f%%' % (percent_freq_word_list[n][2]*100)",
"prompt_number": 55,
"outputs": [
{
"output_type": "stream",
"text": "Word Percent of Total\nor 3.80%\nof 3.68%\nthe 3.01%\nto 2.07%\nand 1.50%\nwith 1.29%\nin 1.05%\na 1.01%\nfor 0.98%\nstudy 0.95%\nCriteria 0.67%\n1 0.66%\nat 0.65%\nwithin 0.65%\nprior 0.62%\nbe 0.60%\n2 0.59%\ndisease 0.58%\nPatients 0.52%\nthan 0.52%\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Most Frequent Capitalized Words List** We haven't lower-cased the text so you should be able to compute this. Don't worry about whether capitalization comes from proper nouns, start of sentences, or elsewhere. You need to make a different FreqDist to do this one. Write the code here for the new FreqDist and the List itself. Show the list here."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Only includes words that start with a capital followed by lower case"
},
{
"metadata": {},
"cell_type": "code",
"input": "title_word_list = [word for word in token_list_no_punct if word.istitle()]",
"prompt_number": 73,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "fdist_title = nltk.FreqDist(word for word in title_word_list)\npercent_freq_title_list = []\ntitle_word_count = float(sum(fdist_title.values()))\nfor key, value in fdist_title.items():\n percent_freq_title_list.append((key,value, value/title_word_count))",
"prompt_number": 81,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "print '%-24s' % 'Word', '%-16s' % 'Percent of Total'\n\nfor n in range(30):\n print '%-24s' % percent_freq_title_list[n][0], '%-4.2f%%' % (percent_freq_title_list[n][2]*100)",
"prompt_number": 85,
"outputs": [
{
"output_type": "stream",
"text": "Word Percent of Total\nCriteria 6.29%\nPatients 4.89%\nInclusion 3.26%\nExclusion 3.19%\nNo 3.07%\nHistory 2.06%\nSubjects 1.69%\nThe 1.38%\nPatient 1.23%\nAny 1.17%\nHave 1.14%\nSubject 1.11%\nA 0.90%\nAge 0.89%\nKnown 0.80%\nNot 0.77%\nL 0.69%\nUse 0.68%\nC 0.66%\nPrior 0.66%\nB 0.65%\nAt 0.63%\nWomen 0.57%\nVisit 0.57%\nOther 0.56%\nPregnant 0.55%\nCurrent 0.52%\nScreening 0.48%\nFemale 0.48%\nActive 0.44%\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Includes words where any character is capitalized"
},
{
"metadata": {},
"cell_type": "code",
"input": "caps_word_list = [word for word in token_list_no_punct if any(letter.isupper() for letter in word)]",
"prompt_number": 75,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "fdist_caps = nltk.FreqDist(word for word in caps_word_list)\npercent_freq_caps_list = []\ncaps_word_count = float(sum(fdist_caps.values()))\nfor key, value in fdist_caps.items():\n percent_freq_caps_list.append((key,value, value/caps_word_count))",
"prompt_number": 87,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "print '%-24s' % 'Word', '%-16s' % 'Percent of Total'\n\nfor n in range(30):\n print '%-24s' % percent_freq_caps_list[n][0], '%-4.2f%%' % (percent_freq_caps_list[n][2]*100)",
"prompt_number": 88,
"outputs": [
{
"output_type": "stream",
"text": "Word Percent of Total\nCriteria 4.78%\nPatients 3.72%\nInclusion 2.48%\nExclusion 2.43%\nNo 2.34%\nHistory 1.57%\nSubjects 1.28%\nThe 1.05%\nPatient 0.94%\nAny 0.89%\nHave 0.87%\nSubject 0.85%\nULN 0.79%\ndL 0.75%\nHIV 0.69%\nA 0.68%\nAge 0.68%\nKnown 0.61%\nNot 0.58%\nL 0.52%\nUse 0.52%\nC 0.50%\nPrior 0.50%\nB 0.50%\nAt 0.48%\nWomen 0.44%\nVisit 0.43%\nOther 0.43%\nPregnant 0.42%\nmL 0.41%\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Sentence Properties Table** This summarizes number of sentences and average sentence length in words and characters (you decide if you want to include stopwords/punctuation or not). Print those out in a table here."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Doing sentence stats on each individual input group.\n\nFor now just splitting on space for the average word count."
},
{
"metadata": {},
"cell_type": "code",
"input": "group_stats_list = []\ntotal_sentences = 0\nfor n,group in enumerate(sentence_groups):\n total_sentences += len(group)\n group_stats_list.append(('group'+str(n) ,len(group),\n len(' '.join(group))/len(group),\n len(' '.join(group).split())/len(group)))",
"prompt_number": 113,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Data is (group, sentence count, sentence avg by character, sentence avg by word)"
},
{
"metadata": {},
"cell_type": "code",
"input": "group_stats_list[:10]",
"prompt_number": 119,
"outputs": [
{
"text": "[('group0', 16, 46, 7),\n ('group1', 49, 40, 5),\n ('group2', 21, 82, 12),\n ('group3', 9, 74, 11),\n ('group4', 11, 78, 10),\n ('group5', 25, 60, 8),\n ('group6', 54, 87, 12),\n ('group7', 10, 91, 12),\n ('group8', 67, 81, 12),\n ('group9', 18, 72, 9)]",
"output_type": "pyout",
"metadata": {},
"prompt_number": 119
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Basic Sentence Stats"
},
{
"metadata": {},
"cell_type": "code",
"input": "print '%-38s' % 'Info type', '%-16s' % 'Value'\nprint '%-38s' % 'number of sentence', '%-16d' % (total_sentences)\nprint '%-38s' % 'average sentence length - word', '%-16d' % (len(token_list_no_punct)/total_sentences)\nprint '%-38s' % 'average sentence length - character', '%-16d' % (len(' '.join(token_list_no_punct))/total_sentences)",
"prompt_number": 120,
"outputs": [
{
"output_type": "stream",
"text": "Info type Value \nnumber of sentence 42919 \naverage sentence length - word 12 \naverage sentence length - character 79 \n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:4b1769b974fa557c31297a8b9876144f54ee0f1dc89452b41619ad728ea4219f"
},
"nbformat": 3
}