@jasongrout
Last active July 24, 2020 04:25
JupyterCon 2020 Reviewer Assignments
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# JupyterCon 2020 Community Review Assignments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook assigns reviewers to proposals, trying to adhere to several principles:\n",
"\n",
"- Respect a maximum number of reviews per reviewer and minimum number of reviews per proposal\n",
"- A reviewer should only review proposals in tracks they are willing to review\n",
"- Minimize the number of tracks each reviewer reviews for, so they can compare proposals better. Hopefully most reviewers only review for a single track.\n",
"- Avoid conflicts of interest (detected as a reviewer email being a proposal author email)\n",
"- Randomize assignments, so generally two reviewers do not review the same set of proposals\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`REVIEWS_PER_PROPOSAL` is effectively the minimum number of reviews we try to get for each proposal, and `REVIEWS_PER_REVIEWER` is effectively the maximum number of reviews we allocate to each reviewer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"REVIEWS_PER_PROPOSAL = 6\n",
"REVIEWS_PER_REVIEWER = 7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Input files\n",
"\n",
"`JupyterCon2020_ReviewerSignup.csv` is a CSV file from the Google review signup. It has columns for email and tracks the reviewer is willing to review.\n",
"\n",
"`reviewers.csv` is the list of reviewers exported as a CSV file from the conference submission system. It contains the reviewer id and email address. We join these two data sets on the email address.\n",
"\n",
"`proposals.csv` is the list of proposals exported as a CSV file from the conference submission system. For each proposal, it contains the proposal id, track, and list of author emails."
]
},
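{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the following cell sketches the minimal columns this notebook assumes in each file. The example rows are hypothetical; the column names are the ones the code below actually reads, and the real exports may contain more columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Hypothetical example rows; only the columns this notebook reads are listed:\n",
"# JupyterCon2020_ReviewerSignup.csv: Email, Track you volunteer to review for (check all that apply)\n",
"# reviewers.csv: ID, Email\n",
"# proposals.csv: Number, Kind, All Speaker Email Addresses\n",
"pd.DataFrame({'Email': ['reviewer@example.com'],\n",
"              'Track you volunteer to review for (check all that apply)': ['Posters;Tutorials']})"
]
},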
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from dataclasses import dataclass, field\n",
"from itertools import chain\n",
"from collections import defaultdict, namedtuple\n",
"import numpy as np\n",
"from typing import List, Set\n",
"rng = np.random.default_rng(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reviewers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we read in the reviewers from our Google signup sheet. These have been imported already into the reviewing system, so we'll also need to import the list from there to get the reviewer id for each reviewer, then merge the two lists. The information from the Google signup sheet has the tracks each reviewer is willing to review for, while the cfp system export has the reviewer id we'll need to construct the assignment list at the end."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read in the reviewer signup sheet and normalize email addresses. Lowercasing introduces some duplicate emails\n",
"reviewer_signup = pd.read_csv('JupyterCon2020_ReviewerSignup.csv')\n",
"reviewer_signup.Email = reviewer_signup.Email.str.lower()\n",
"reviewer_signup.Email;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read in the exported list of reviewers from the cfp system, and make sure we have exactly the same set of emails\n",
"reviewerids = pd.read_csv('reviewers.csv')\n",
"assert set(reviewerids.Email) == set(reviewer_signup.Email)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge the two lists on the email field. This basically adds the cfp system ids to each reviewer row from the signup sheet\n",
"reviewer_signup = reviewer_signup.merge(reviewerids, on='Email')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get a list of track names selected by reviewers, then we construct the mapping between these and our canonical track names by hand below.\n",
"# tracks = list(set(chain(*[s.split(';') for s in list(reviewers['Track you volunteer to review for (check all that apply)'])])))\n",
"reviewerTracks = {\n",
" 'Sprints': 'sprints',\n",
" 'Talks: Jupyter in Scientific Research': 'research',\n",
" 'Posters': 'posters',\n",
" 'Talks: Enterprise Jupyter Infrastructure': 'enterprise',\n",
" 'Talks: Data Science Applications': 'datascience',\n",
" 'Talks: Jupyter in Education': 'education',\n",
" 'Tutorials': 'tutorials',\n",
" 'Talks: Jupyter Community–Tools and Practices':'tools'\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we merge the reviewer information into a single list of reviewers. There were several duplicate reviewer signups in the Google signup sheet, with different tracks for each signup, so we merge things here to get a canonical list of reviewers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@dataclass\n",
"class Reviewer:\n",
" \"\"\"A reviewer\"\"\"\n",
" id: int\n",
" email: str\n",
" tracks: Set[str] = field(default_factory=set)\n",
" proposals: List[int] = field(default_factory=list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reviewers_merge = {}\n",
"for index,row in reviewer_signup.iterrows():\n",
" if row.ID not in reviewers_merge:\n",
" reviewers_merge[row.ID] = Reviewer(id=row.ID, email=row.Email)\n",
" reviewers_merge[row.ID].tracks.update(reviewerTracks[s] for s in row['Track you volunteer to review for (check all that apply)'].split(';') if s not in ['Sprints'])\n",
"reviewers = list(reviewers_merge.values())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Out of curiosity, here are the reviewers allocated to each track (of course, there is a lot of overlap, since most reviewers volunteered to review for more than one track)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sorted((t,sum(1 for r in reviewers if t in r.tracks)) for t in set(reviewerTracks.values()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Proposals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we get a list of the proposals. I originally used a named tuple, but the numpy random sampling function below (or perhaps the `.tolist` I used?) converted the named tuple into a normal tuple :(. However, this is a great excuse for learning about dataclasses :)."
]
},
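{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustration of the issue, using a hypothetical `Point` type: a named tuple *is* a tuple, so `np.array` unpacks a list of them into a 2-D numeric array and the type is lost, while dataclass instances are stored as plain objects and survive `.tolist()` intact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical types, just to illustrate the behavior\n",
"PointT = namedtuple('PointT', ['x', 'y'])\n",
"print(np.array([PointT(1, 2), PointT(3, 4)]))  # 2-D int array; the PointT type is gone\n",
"\n",
"@dataclass\n",
"class PointD:\n",
"    x: int\n",
"    y: int\n",
"print(np.array([PointD(1, 2), PointD(3, 4)]))  # 1-D object array; instances preserved"
]
},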
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@dataclass\n",
"class Proposal:\n",
" \"\"\"A proposal\"\"\"\n",
" id: int\n",
" track: str\n",
" emails: List[str]\n",
" reviewers: List[int] = field(default_factory=list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"proposals_signup = pd.read_csv('proposals.csv')\n",
"proposals_signup.head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the list of tracks from the proposal signup, then make a mapping with our canonical track names by hand\n",
"# tracks = list(set(chain(*[s.split(';') for s in list(proposals['Kind'])])))\n",
"proposal_tracks = {\n",
" 'Jupyter in Education Talk': 'education',\n",
" 'Poster': 'posters',\n",
" 'Tutorial': 'tutorials',\n",
" 'Enterprise Jupyter Infrastructure Talk': 'enterprise',\n",
" 'Data Science Applications Talk': 'datascience',\n",
" 'Jupyter Community: Tools and Practices Talk': 'tools',\n",
" 'Jupyter in Scientific Research Talk': 'research'\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The list of proposals is grouped by track, since the assignment algorithm below works track by track. We also keep a flat dict of all proposals, indexed by proposal id, which is convenient later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# proposals groups Proposal objects by track: {track: [Proposal]}; allproposals is a flat {proposal_id: Proposal} dict\n",
"proposals = {t: [] for t in proposal_tracks.values()}\n",
"allproposals = {}\n",
"\n",
"for index,row in proposals_signup.iterrows():\n",
" p = Proposal(id=row.Number, track=proposal_tracks[row.Kind], emails=row['All Speaker Email Addresses'].split(', '))\n",
" proposals[p.track].append(p)\n",
" allproposals[p.id] = p"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our algorithm below, we'll need to know how many reviews we still need in a track, so we have a function to compute that."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def outstanding_reviews(track):\n",
" \"The number of reviews we still need for this track\"\n",
" return len(proposals[track])*REVIEWS_PER_PROPOSAL - sum(len(p.reviewers) for p in proposals[track])\n",
" \n",
"# Print out how many reviews are needed in each track\n",
"sorted((t,outstanding_reviews(t)) for t in set(proposal_tracks.values()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assign reviewers to proposals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we get to the core assignment algorithm. We may run this assignment cell multiple times, so we are careful to not overwrite any existing review assignments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For each reviewer that still has reviews to allocate, sorted by number of tracks the reviewer can review, so we assign more restricted reviewers first\n",
"for reviewer in sorted([r for r in reviewers if len(r.proposals)<REVIEWS_PER_REVIEWER], key=lambda r: len(r.tracks)): \n",
" # Go through each track acceptable to the reviewer that still needs reviews, starting with the one that needs the most reviews\n",
" for track in sorted([t for t in reviewer.tracks if outstanding_reviews(t)>0], key=lambda x: outstanding_reviews(x), reverse=True):\n",
" # Find the proposals that still need reviewers and that don't have the reviewer as an author, and haven't been assigned to this reviewer yet\n",
" assignment_candidates = [p for p in proposals[track] if len(p.reviewers)<REVIEWS_PER_PROPOSAL \n",
" and reviewer.email not in p.emails \n",
" and p.id not in reviewer.proposals]\n",
"\n",
" # Calculate the weighting for choosing each proposal.\n",
" # This levels out assignments by making it more likely to assign a proposal that needs more reviews\n",
" weights = [REVIEWS_PER_PROPOSAL-len(p.reviewers) for p in assignment_candidates]\n",
" total = sum(weights)\n",
" \n",
" # Randomly choose as many assignments as we can for this reviewer\n",
" assignment = rng.choice(assignment_candidates,\n",
" size=min(REVIEWS_PER_REVIEWER-len(reviewer.proposals), len(assignment_candidates)),\n",
" replace=False,\n",
" p=[i/total for i in weights]).tolist()\n",
" # Record the assignment\n",
" for p in assignment:\n",
" p.reviewers.append(reviewer.id)\n",
" reviewer.proposals.append(p.id)\n",
"\n",
" # If we're done with this reviewer, go to the next reviewer, otherwise iterate to the next track\n",
" if len(reviewer.proposals) == REVIEWS_PER_REVIEWER:\n",
" break"
]
},
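{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the weighting in the cell above levels out assignments, here is a small sketch with hypothetical review counts: a proposal with fewer existing reviews gets a proportionally larger selection probability."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical current review counts for three proposals\n",
"demo_counts = [0, 3, 5]\n",
"demo_weights = [REVIEWS_PER_PROPOSAL - c for c in demo_counts]\n",
"demo_total = sum(demo_weights)\n",
"[w/demo_total for w in demo_weights]  # selection probabilities; proposals needing more reviews are favored"
]
},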
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We check whether any reviewers have no assignments, or fewer assignments than their quota. If so, we can increment `REVIEWS_PER_PROPOSAL` and rerun the assignment cell to allocate the extra reviews."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Find reviewers without a full assignment load; record how many reviews each currently has\n",
"assigned_counts = [len(r.proposals) for r in reviewers if len(r.proposals) < REVIEWS_PER_REVIEWER]\n",
"if len(assigned_counts)>0:\n",
"    print(\"There are {} reviewers that do not have a full assignment load, \"\n",
"          \"with {} having no assignments at all, comprising {} possible further reviews.\".format(\n",
"        len(assigned_counts),\n",
"        len([i for i in assigned_counts if i == 0]),\n",
"        sum(REVIEWS_PER_REVIEWER - i for i in assigned_counts)\n",
"    ))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also check to see how many reviewers are reviewing more than one track. For any reviewer reviewing more than one track, we print out the list of tracks they are reviewing. Hopefully there are not many of these multi-track assignments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For any reviewer reviewing for more than one track, print the list of tracks they are reviewing\n",
"for r in reviewers:\n",
" pt = set(allproposals[p].track for p in r.proposals)\n",
" if len(pt)>1:\n",
" print(sorted(pt))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A convenience cell to bump the reviews per proposal before running the assignment step again to assign the extra reviews. Uncomment the following line and rerun the assignments cell to allocate more reviews for reviewers that do not yet have full assignments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# REVIEWS_PER_PROPOSAL +=1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing the review assignments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we visualize the review assignments per track. We hope that each proposal is reviewed at least `REVIEWS_PER_PROPOSAL` times."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Seeking at least {} reviews per proposal\".format(REVIEWS_PER_PROPOSAL))\n",
"from matplotlib import pyplot as plt\n",
"for t,p in sorted(proposals.items()):\n",
" plt.title('Reviews per proposal in track {}'.format(t))\n",
" plt.xlabel('Reviews')\n",
" plt.ylabel('Frequency')\n",
" plt.bar(np.arange(0, REVIEWS_PER_PROPOSAL+1), np.bincount([len(v.reviewers) for v in p], minlength=REVIEWS_PER_PROPOSAL+1))\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reviewers, we hope that all reviewers are fully committed with `REVIEWS_PER_REVIEWER` reviews. Because our algorithm tries to fully assign one reviewer at a time, it's likely that most reviewers have full assignments and others have none."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('{} reviewers assigned'.format(sum(1 for r in reviewers if len(r.proposals) > 0)))\n",
"plt.title('Reviews per reviewer')\n",
"plt.xlabel('Reviews')\n",
"plt.ylabel('Frequency')\n",
"plt.bar(np.arange(0, REVIEWS_PER_REVIEWER+1), np.bincount([len(r.proposals) for r in reviewers], minlength=REVIEWS_PER_REVIEWER+1))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check assumptions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we check some of our assumptions: reviewers are only assigned proposals in tracks they chose, and they are not reviewing their own proposals. We also print out again how many reviewers are reviewing for multiple tracks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for r in reviewers:\n",
" assert len(r.proposals) <= REVIEWS_PER_REVIEWER\n",
" for p in r.proposals:\n",
" assert allproposals[p].track in r.tracks\n",
" assert r.email not in allproposals[p].emails\n",
" pt = set(allproposals[p].track for p in r.proposals)\n",
" if len(pt)>1:\n",
" print(sorted(pt))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write out assignment CSV file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we write out our assignments to a CSV file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assignments = []\n",
"for r in reviewers:\n",
" assignments.extend([r.id, p] for p in r.proposals)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"with open('assignments.csv', 'w', newline='') as f:\n",
" writer = csv.writer(f)\n",
" writer.writerow(['ReviewerID', 'ProposalID'])\n",
" writer.writerows(assignments)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}