Skip to content

Instantly share code, notes, and snippets.

@psychemedia
Created August 8, 2022 15:31
Show Gist options
  • Save psychemedia/2679e2fe5590b01814d560bf28dffa29 to your computer and use it in GitHub Desktop.
Save psychemedia/2679e2fe5590b01814d560bf28dffa29 to your computer and use it in GitHub Desktop.
Attempt to generate BibTeX file for publications on archive.org
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"id": "d0bd8036-74f0-4a06-bb39-24551ed09adc",
"metadata": {},
"source": [
"# `BibTeX` Record Generator for archive.org\n",
"\n",
"Simple script to generate a BibTex record from an `archive.org` identifier."
]
},
{
"cell_type": "markdown",
"id": "29c2c4fb-84a5-41b5-9a05-a94eb523d232",
"metadata": {},
"source": [
"## Proof of Concept\n",
"\n",
"Given an `archive.org` identifier, we can look up its metadata as:\n",
"\n",
"`https://archive.org/metadata/IDENTIFIER/metadata`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "767ede36-0d7b-403d-acf6-54d08f2e3e8b",
"metadata": {},
"outputs": [],
"source": [
"url_ = 'https://archive.org/metadata/{uid}/metadata'"
]
},
{
"cell_type": "markdown",
"id": "b377eb84-6554-44e6-bb7b-34602ff3ee34",
"metadata": {},
"source": [
"The `archive.org` metadata schema is given [here](https://archive.org/services/docs/api/metadata-schema/index.html).\n",
"\n",
"Relevant fields include:\n",
"\n",
"- `title`\n",
"- `creator`\n",
"- `publisher`\n",
"- `date` (replaces the deprecated `year`)\n",
"- `volume`\n",
"- `description`\n",
"- `issn` / `isbn`\n",
"- `subject` (subject / topic tags)\n",
"- `rights` / `possible-copyright-status`\n",
"\n",
"We can then attempt to map fields onto appropriate fields in a [`BibTeX` `book` entry record](https://www.bibtex.com/e/book-entry/). Appropriate fields might include:\n",
"\n",
"- `author`;\n",
"- `editor`;\n",
"- `title`;\n",
"- `publisher`\n",
"- `address`\n",
"- `year`\n",
"- `volume` / `number`\n",
"- `note`\n",
"- `issn` / `isbn`\n",
"\n",
"We can map many of the items directly.\n",
"\n",
"*The volume we may want to try to parse into a volume and part (which is to say, `volumne` and `number`). For now, just map the `volume`literally.*"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6dd443b3-ceeb-4e22-9584-e3ffa5a2ae8c",
"metadata": {},
"outputs": [],
"source": [
"example_id = 'dli.granth.84831'"
]
},
{
"cell_type": "markdown",
"id": "e2ec316e-4cca-449f-ae97-55c097e908de",
"metadata": {},
"source": [
"Get the `archive.org` metadata:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "92fd2458-8335-49b8-94d4-fbe0bc72c550",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'identifier': 'dli.granth.84831',\n",
" 'collection': ['digitallibraryindia', 'JaiGyan'],\n",
" 'creator': 'Groome, Francis Hindes',\n",
" 'date': '1899',\n",
" 'language': 'eng',\n",
" 'mediatype': 'texts',\n",
" 'publisher': 'Hurst and Blackett, Limited (London)',\n",
" 'scanner': 'Internet Archive Python library 1.9.0',\n",
" 'subject': ['Book', ' Customs', ' Etiquette and Folklore'],\n",
" 'title': 'Gypsy Folk-Tales',\n",
" 'uploader': 'carl@media.org',\n",
" 'publicdate': '2020-07-18 01:48:07',\n",
" 'addeddate': '2020-07-18 01:48:07',\n",
" 'identifier-access': 'http://archive.org/details/dli.granth.84831',\n",
" 'identifier-ark': 'ark:/13960/t00093j2k',\n",
" 'ppi': '300',\n",
" 'ocr': 'ABBYY FineReader 11.0 (Extended OCR)',\n",
" 'page_number_confidence': '76.84',\n",
" 'notes': '<p>This item is part of a library of books, audio, video, and other materials from and about India is curated and maintained by Public Resource. The purpose of this library is to assist the students and the lifelong learners of India in their pursuit of an education so that they may better their status and their opportunities and to secure for themselves and for others justice, social, economic and political.</p> <p>This library has been posted for non-commercial purposes and facilitates fair dealing usage of academic and research materials for private use including research, for criticism and review of the work or of other works and reproduction by teachers and students in the course of instruction. Many of these materials are either unavailable or inaccessible in libraries in India, especially in some of the poorer states and this collection seeks to fill a major gap that exists in access to knowledge.</p> <p>For other collections we curate and more information, please visit the <a href=\"https://archive.org/details/JaiGyan?and%5B%5D=mediatype%3Acollection\" rel=\"nofollow\">Bharat Ek Khoj</a> page. Jai Gyan!</p>',\n",
" 'noarchivetorrent': 'true',\n",
" 'curation': '[curator]carl@media.org[/curator][date]20220414194920[/date][state]un-dark[/state][comment]Undarkened by Carl[/comment]'}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"\n",
"r = requests.get(url_.format(uid=example_id))\n",
"result = r.json()['result']\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b0ca65b4-fe41-4433-94e6-041a44afd2dd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'title': 'Gypsy Folk-Tales',\n",
" 'publisher': 'Hurst and Blackett, Limited (London)',\n",
" 'year': '1899',\n",
" 'author': 'Groome, Francis Hindes'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bib_data = {}\n",
"\n",
"bib_map = {\"date\": \"year\",\n",
" \"description\": \"note\",\n",
" \"creator\": \"author\"}\n",
"\n",
"for k in ['title', 'publisher', 'description',\n",
" 'volume', 'issn', 'isbn', \"date\", \"creator\"]:\n",
" if k in result:\n",
" k_ = bib_map[k] if k in bib_map else k \n",
" bib_data[k_] = result[k]\n",
" \n",
"bib_data"
]
},
{
"cell_type": "markdown",
"id": "02c5a29e-c21c-4130-9020-f9566ac55b14",
"metadata": {},
"source": [
"For now, we're naively mapping the creator on to the author, although we might later want to try to improve *author* vs. *editor* resultion.\n",
"\n",
"Note that the `creator` metadata may be presented as a list of creators, often with birth/death dates, so we need to potentially tidy that up."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "63cfd43e-01f0-4e8b-8e96-7b146e18e3b3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gregory, Lady,\n",
"Finn, MacCumaill, 3rd cent\n",
"Yeats, W. B. (William Butler),\n"
]
}
],
"source": [
"import re\n",
"\n",
"_example = ['Gregory, Lady, 1852-1932',\n",
" 'Finn, MacCumaill, 3rd cent',\n",
" 'Yeats, W. B. (William Butler), 1865-1939']\n",
"\n",
"for c in _example:\n",
" print(re.sub(' \\(?[0-9]+-[0-9]+\\)?', '', c))"
]
},
{
"cell_type": "markdown",
"id": "f490ff7b-6c62-4442-b7f4-3bf48e70a94e",
"metadata": {},
"source": [
"Create an identifier for the book (we may need to leaborate the to make sure it generates a unique identifier):"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "db42e495-1ac1-4148-8c5a-d7f76388b763",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'groomef1899'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bib_id = f\"{re.sub('[^09a-zA-Z]', '', result['creator']).lower()[:7]}{result['date']}\"\n",
"bib_data[\"bib_id\"] = bib_id\n",
"\n",
"bib_id"
]
},
{
"cell_type": "markdown",
"id": "c8845cba-0560-4890-a115-1352741b8014",
"metadata": {},
"source": [
"Use a heuristic to generate the publisher `address`..."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "40028a05-fa8d-4baa-84f2-5af007b5a0cb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'title': 'Gypsy Folk-Tales',\n",
" 'publisher': 'Hurst and Blackett, Limited',\n",
" 'year': '1899',\n",
" 'author': 'Groome, Francis Hindes',\n",
" 'address': 'London',\n",
" 'bib_id': 'groomef1899'}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import parse\n",
"\n",
"addr = parse.parse('{publisher} ({address})', bib_data['publisher'])\n",
"if addr:\n",
" bib_data['publisher'] = addr['publisher']\n",
" bib_data['address'] = addr['address']\n",
"\n",
"bib_data"
]
},
{
"cell_type": "markdown",
"id": "b526ea35-c3aa-4668-980c-bd4fe6fa341d",
"metadata": {},
"source": [
"We now need to render the data via an appropriate BibTeX template:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0ac17371-ab61-4cff-b097-560370824ddb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"@book{ groomef1899,\n",
" title = \"Gypsy Folk-Tales\",\n",
" author = \"Groome, Francis Hindes\",\n",
" year = \"1899\",\n",
" \n",
" publisher = \"Hurst and Blackett, Limited\",\n",
" address = \"London\",\n",
" \n",
" \n",
" \n",
"}\n"
]
}
],
"source": [
"from jinja2 import Template\n",
"\n",
"tm = Template(\"\"\"@book{ {{bib_id}},\n",
" title = \"{{title}}\",\n",
" author = \"{{author}}\",\n",
" year = \"{{year}}\",\n",
" {% if volume %}volume = \"{{volume}}\",{% endif %}\n",
" {% if publisher %}publisher = \"{{publisher}}\",{% endif %}\n",
" {% if address %}address = \"{{address}}\",{% endif %}\n",
" {% if isbn %}isbn = \"{{isbn}}\",{% endif %}\n",
" {% if issn %}issn = \"{{issn}}\",{% endif %}\n",
"}\n",
"\"\"\")\n",
"\n",
"print(tm.render(**bib_data))"
]
},
{
"cell_type": "markdown",
"id": "e1c7ca91-0f8b-45a9-8ac0-5e534add2cd4",
"metadata": {},
"source": [
"Cjeck that the record parses correctly, and then export it in a well-formatted way:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a4ea9e0e-de6f-41cc-a2a4-75109b7fdb45",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"@book{groomef1899,\n",
" address = {London},\n",
" author = {Groome, Francis Hindes},\n",
" publisher = {Hurst and Blackett, Limited},\n",
" title = {Gypsy Folk-Tales},\n",
" year = {1899}\n",
"}\n",
"\n"
]
}
],
"source": [
"#%pip install bibtexparser\n",
"import bibtexparser\n",
"\n",
"tex_ = bibtexparser.loads(tm.render(**bib_data))\n",
"print(bibtexparser.dumps(tex_))"
]
},
{
"cell_type": "markdown",
"id": "ec138b11-7065-46d9-87fd-0916d52b88dd",
"metadata": {},
"source": [
"We can extract `archive.org` identifers from a file with the following simple pattern matcher:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "059f645c-0869-4101-acbe-14b658d112bf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['riujournalschoo01acadgoog', 'bub_gb_dE7pMtIozskC', 'popularstudiesin00lond']"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with open(\"irish-legends-finn-oisin.md\") as f:\n",
" urls = re.findall('https?://archive.org/details/([^\\s\\n]*)[\\s\\n]+', f.read())\n",
"\n",
"# Find the unique archive.org identifiers\n",
"ids_ = list({u.split(\"/\")[0] for u in urls})\n",
"ids_[:3]"
]
},
{
"cell_type": "markdown",
"id": "e978ac6e-3bc2-45ee-9893-651558a6ca40",
"metadata": {},
"source": [
"## Generate a BibTeX Record Collection\n",
"\n",
"Let's now put the pieces together to extract a list of `archive.org` identifiers from a text file, look up the metadata associated with each one, and then generate a full list of BibTeX records for them.\n",
"\n",
"The following function is derived from the skecthes shown above, repackaged as a function:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "33dd1ce0-feb9-464c-a36d-37d9619cc658",
"metadata": {},
"outputs": [],
"source": [
"# Cache requests\n",
"import requests_cache\n",
"requests_cache.install_cache('.archive_org_metadata')\n",
"\n",
"def get_metadata(uid):\n",
" \"\"\"Get metadata given an archive.org identifier.\"\"\"\n",
" r = requests.get(url_.format(uid=uid))\n",
" result = r.json()['result']\n",
" return result\n",
"\n",
"def generate_bib_record(uid):\n",
" \"\"\"Generate a bibliographic data record\n",
" from archive.org metadata.\"\"\"\n",
"\n",
" metadata = get_metadata(uid)\n",
" bib_data = {}\n",
"\n",
" bib_map = {\"date\": \"year\",\n",
" \"creator\": \"author\"}\n",
"\n",
" # Handle a list of creators\n",
" if 'creator' in metadata:\n",
" _creators = metadata['creator'] if isinstance(metadata['creator'], list) \\\n",
" else [metadata['creator']]\n",
" _clean_creators = []\n",
" for _c in _creators:\n",
" _clean_creators.append(re.sub(' \\(?[0-9]+-[0-9]+\\)?', '', _c))\n",
" metadata['creator'] = \", \".join(_clean_creators)\n",
" if 'creator' in metadata and 'date' in metadata:\n",
" # Create id\n",
" record_id = re.sub('[^09a-zA-Z]', '',\n",
" metadata['creator']).lower()[:7]\n",
" bib_id = f\"{record_id}{metadata['date']}\"\n",
" else:\n",
" bib_id = uid\n",
" \n",
" for k in ['creator', 'title', 'publisher',\n",
" 'volume', 'issn', 'isbn', \"date\"]:\n",
" if k in metadata:\n",
" k_ = bib_map[k] if k in bib_map else k \n",
" bib_data[k_] = metadata[k]\n",
"\n",
" bib_data[\"bib_id\"] = bib_id\n",
"\n",
" # Try to find publisher address using simple heuristics\n",
" if 'publisher' in bib_data:\n",
" addr = parse.parse('{publisher} ({address})',\n",
" bib_data['publisher'])\n",
" if not addr:\n",
" addr = parse.parse('{address} : {publisher}',\n",
" bib_data['publisher'])\n",
" if addr:\n",
" bib_data['publisher'] = addr['publisher']\n",
" bib_data['address'] = addr['address']\n",
" \n",
" _bibtex = bibtexparser.loads(tm.render(**bib_data))\n",
" bibtex_ = bibtexparser.dumps(_bibtex)\n",
" return bibtex_"
]
},
{
"cell_type": "markdown",
"id": "fd061933-0bc9-4743-b505-a9146ebfbc56",
"metadata": {},
"source": [
"We can now iterate through the identifiers and generate or list of BibTeX records.\n",
"\n",
"We can also add a progress bar to help keep track of how far along we are (making the `archive.org` reuests might take some time...)."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "cd98b7c4-6eb6-4cb2-8740-e330256ccb80",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a57f39e8fee349cc97c7f5f3590732ef",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/69 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"['@book{schoolo1904,\\n author = {School of Irish Learning (Dublin , Ireland), Royal Irish Academy},\\n publisher = {Royal Irish Academy},\\n title = {Ériu: The Journal of the School of Irish Learning, Dublin},\\n volume = {1, pt. 2},\\n year = {1904}\\n}\\n',\n",
" \"@book{johnoma1866,\\n author = {JOHN O'MAHONY},\\n title = {FORAS FEASA AR EIRINN DO REIR AN ATHAR, SEATHRUN CEITING, OLLAMH RE DIADHACHTA.THE HISTORY OF IRELAND, FROM THE GAELIEST PERIOD TO THE ENGLISH INBASION.},\\n year = {1866}\\n}\\n\",\n",
" '@book{popularstudiesin00lond,\\n address = {London},\\n author = {},\\n publisher = {D. Nutt},\\n title = {Popular studies in mythology, romance and folklore},\\n year = {1899}\\n}\\n',\n",
" '@book{ossiani1853,\\n address = {Dublin},\\n author = {Ossianic Society},\\n publisher = {Printed under the direction of the Council},\\n title = {Transactions of the Ossianic Society},\\n volume = {4},\\n year = {1853}\\n}\\n',\n",
" '@book{hydedou1890,\\n address = {London},\\n author = {Hyde, Douglas,, Nutt, Alfred,},\\n publisher = {Nutt},\\n title = {Beside the fire : a collection of Irish Gaelic folk stories},\\n year = {1890}\\n}\\n',\n",
" '@book{youngel1910,\\n address = {Dublin},\\n author = {Young, Ella,, Gonne, Maud,, ill},\\n publisher = {Maunsel & company},\\n title = {Celtic wonder-tales},\\n year = {1910}\\n}\\n',\n",
" \"@book{barryoc1890,\\n author = {Barry O'Connor},\\n publisher = {P. J. Kenedy},\\n title = {Turf-fire Stories and Fairy Tales of Ireland},\\n year = {1890}\\n}\\n\",\n",
" '@book{wildela1888,\\n address = {London},\\n author = {Wilde, Lady,.},\\n publisher = {Ward and Downey},\\n title = {Ancient legends, mystic charms, and superstitions of Ireland : With sketches of the Irish past},\\n year = {1888}\\n}\\n',\n",
" '@book{ossiani1853,\\n address = {Dublin},\\n author = {Ossianic Society},\\n publisher = {Printed under the direction of the Council},\\n title = {Transactions of the Ossianic Society},\\n volume = {1},\\n year = {1853}\\n}\\n',\n",
" '@book{gregory1904,\\n address = {London},\\n author = {Gregory, Lady,, Finn, MacCumaill, 3rd cent, Yeats, W. B. (William Butler),},\\n publisher = {J. Murray},\\n title = {Gods and fighting men : the story of the Tuatha de Danaan and of the Fiana of Ireland},\\n year = {1904}\\n}\\n']"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tqdm.notebook import tqdm\n",
"\n",
"records = []\n",
"\n",
"# For tqdm, ensure to update jupyterlab_widgets\n",
"for uid in tqdm(ids_):\n",
" records.append(generate_bib_record(uid))\n",
"\n",
"records[:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "425a3fb3-efe9-4170-8739-17924c1655f5",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f95c4d05-7ef4-48aa-8204-11feca8cfe3a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "0eab2a83-8d15-43b0-a504-da59cb3cc67d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment