Leon Yin yinleon

World Tour 2024

Duke University - Durham, North Carolina
2024-02-28, 10:05 - 11:20 AM ET
Class visit: what is investigative data journalism?

NICAR 2024 - Baltimore, Maryland

2024-03-08, 9:00 - 10:00 AM ET
Workshop: Finding and using undocumented APIs

Remembering 4 years at The Markup

2023-07-19

This is my last week at The Markup. It’s been a true privilege to practice and produce impactful hypothesis-driven journalism with first-class journalists over the past four years.

In year one of publication, Adrianne Jeffries, Sam Morris, Evelyn Larrubia and I measured Google’s self-preferential search results using a method adapted from the life sciences. Our findings were cited in a congressional hearing on Big Tech and antitrust.

Aaron Sankin, Sam Morris, Evelyn Larrubia and I found that Google blocked advertisers from finding YouTube videos related to Black Lives Matter and other [social justice phrases](https://themarkup.org/google-the-giant/2021/04/09/google-blocks-advertisers-from-targeting-black-lives-mat

World Tour 2023

Tow Tea @ The Tow Center - New York, New York
2023-02-17, 5:00 - 6:30 PM ET
Workshop: Finding and using undocumented APIs

Net Inclusion - San Antonio, Texas
2023-03-01, 2:30 - 3:30 PM CT
Panel: Advancing Digital Inclusion Data Quality, Tools, and Applications
Co-paneling with David Keyes, Christine Parker, and Ryan Palmer

Auditing Algorithms in the Public Interest @ Impact Summit 2021

Machine Bias - Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner (2016)

Gender Shades - Joy Buolamwini and Timnit Gebru (2018)

Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor - Virginia Eubanks (2018)

How We Analyzed Google's Search Results - Leon Yin and Adrianne Jeffries (2020)

Apply in Pandas

Here are some handy helpers for df.apply. You will learn to use a progress bar and multiple cores.

import time
import pandas as pd

Let's use the iris dataset and perform an arbitrary function.
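As a rough sketch of what "an arbitrary function" with df.apply can look like: the three rows below are toy values with iris-style column names (not the real dataset), and `slow_area` is a hypothetical function whose sleep stands in for real work.

```python
import time

import pandas as pd

# Toy rows with iris-style column names; the values are illustrative only.
df = pd.DataFrame({
    "sepal_length": [5.0, 7.0, 6.0],
    "sepal_width": [3.0, 3.0, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})


def slow_area(row: pd.Series) -> float:
    """Arbitrary row-wise function; the sleep stands in for real work."""
    time.sleep(0.01)
    return row["sepal_length"] * row["sepal_width"]


# axis=1 passes each row to the function as a Series.
df["sepal_area"] = df.apply(slow_area, axis=1)
print(df)
```

If tqdm is installed, calling `tqdm.pandas()` once and swapping `apply` for `progress_apply` shows a progress bar while the rows are processed.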

Multiprocessing in Pandas

If you need to read many files into one dataframe, use this snippet:

from multiprocessing import Pool
from tqdm import tqdm
import pandas as pd

def file_parser_func(fn: str):
    """
    Read a file into a dataframe and return a list of dictionaries
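The excerpt stops partway through the parser, so here is a minimal sketch of the overall pattern rather than the original code. It assumes CSV inputs, and `parse_csv` and `read_many` are hypothetical names. To keep it runnable anywhere it uses the thread-backed pool from `multiprocessing.dummy`, which has the same API as `multiprocessing.Pool`; swapping the import gives real worker processes.

```python
import os
import tempfile
from multiprocessing.dummy import Pool  # thread-backed pool, same API as multiprocessing.Pool

import pandas as pd


def parse_csv(fn: str) -> list:
    """Read one CSV file and return its rows as a list of dictionaries."""
    return pd.read_csv(fn).to_dict(orient="records")


def read_many(files: list, n_workers: int = 2) -> pd.DataFrame:
    """Parse files in parallel and combine every row into one DataFrame."""
    with Pool(n_workers) as pool:
        # imap yields results in the same order as the input files.
        chunks = list(pool.imap(parse_csv, files))
    rows = [row for chunk in chunks for row in chunk]
    return pd.DataFrame(rows)


# Demo: write three small CSV files to a temp directory, then read them back.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"part_{i}.csv")
    pd.DataFrame({"file_id": [i, i], "value": [i * 10, i * 10 + 1]}).to_csv(path, index=False)
    paths.append(path)

combined = read_many(paths)
print(combined)
```

Wrapping the `pool.imap(...)` call in `tqdm(..., total=len(files))` adds a progress bar over the files.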

numpy
tqdm
pdf2image
opencv-python
pytesseract
Pillow

def value_counts(df: pd.DataFrame,
                 col: str,
                 *args, **kwargs) -> pd.DataFrame:
    """
    For a DataFrame (`df`): display normalized (percentage)
    `value_counts(normalize=True)` and regular counts
    `value_counts()` for a given `col`.
    """
    count = df[col].value_counts(*args, **kwargs).to_frame(name='count')
    perc = df[col].value_counts(normalize=True, *args, **kwargs) \
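The excerpt is cut off before the function returns, so the following is only one plausible way to finish such a helper, not the original. It assumes the two results are joined side by side on the value index; `value_counts_table` and the `percentage` column name are hypothetical. It accepts keyword arguments only, since a positional argument would collide with `normalize`.

```python
import pandas as pd


def value_counts_table(df: pd.DataFrame, col: str, **kwargs) -> pd.DataFrame:
    """Return raw counts and normalized shares for `col`, side by side."""
    count = df[col].value_counts(**kwargs).to_frame(name="count")
    # normalize=True returns fractions between 0 and 1, not 0 to 100.
    perc = df[col].value_counts(normalize=True, **kwargs).to_frame(name="percentage")
    # Both frames are indexed by the distinct values of `col`, so join aligns them.
    return count.join(perc)


# Toy data, illustrative only.
toy = pd.DataFrame({"species": ["setosa", "setosa", "setosa", "virginica"]})
table = value_counts_table(toy, "species")
print(table)
```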

import json

fn = 'notebook.ipynb'
# Open the notebook in a context manager so the file handle is closed.
with open(fn) as f:
    notebook = json.load(f)
notebook.keys()

for cell in notebook['cells']:
    if cell['cell_type'] == "markdown":
        for sent in cell['source']:
            if sent == '\n':
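The loop above is truncated, so what it does with blank lines is unknown. As a sketch under the assumption that blank lines are simply skipped, the hypothetical helper `markdown_lines` below collects the non-blank Markdown lines of a notebook. It writes a tiny stand-in notebook to a temp file first so the example is self-contained.

```python
import json
import os
import tempfile

# A minimal stand-in notebook in the nbformat 4 JSON layout.
toy_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "\n", "Some prose."]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hi')"],
         "outputs": [], "execution_count": None},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
fn = os.path.join(tempfile.mkdtemp(), "notebook.ipynb")
with open(fn, "w") as f:
    json.dump(toy_notebook, f)


def markdown_lines(path: str) -> list:
    """Return the non-blank lines from every Markdown cell in a notebook."""
    with open(path) as f:
        notebook = json.load(f)
    lines = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "markdown":
            continue
        source = cell["source"]
        # "source" may be a single string or a list of line strings.
        if isinstance(source, str):
            source = source.splitlines(keepends=True)
        for sent in source:
            if sent == "\n":
                continue  # assumption: blank lines are skipped
            lines.append(sent.rstrip("\n"))
    return lines


lines = markdown_lines(fn)
print(lines)
```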

	"""
	A simple script to make a Markdown table for a data dictionary (assumes you just have a column name and description).
	"""

	import pandas as pd

	col2description = {
	"Name": "What you can call me",
	"Id": "The identifier",
	"Nickname": "Do you have to ask?"