Leon Yin yinleon

World Tour 2024

Duke University - Durham, North Carolina
2024-02-28, 10:05 - 11:20 AM ET
Class visit: what is investigative data journalism?

NICAR 2024 - Baltimore, Maryland

2024-03-08, 9:00 - 10:00 AM ET
Workshop: Finding and using undocumented APIs

@yinleon
yinleon / professional-update-2023-07-19.md
Last active July 19, 2023 17:47
Remembering 4 years at The Markup

2023-07-19

This is my last week at The Markup. It’s been a true privilege to practice and produce impactful hypothesis-driven journalism with first-class journalists over the past four years.

In year one of publication, Adrianne Jeffries, Sam Morris, Evelyn Larrubia and I measured Google’s self-preferential search results using a method adapted from the life sciences. Our findings were cited in a congressional hearing on Big Tech and antitrust.

Aaron Sankin, Sam Morris, Evelyn Larrubia and I found that Google blocked advertisers from finding YouTube videos related to Black Lives Matter and other [social justice phrases](https://themarkup.org/google-the-giant/2021/04/09/google-blocks-advertisers-from-targeting-black-lives-mat

@yinleon
yinleon / world-tour-2023.md
Last active May 22, 2023 22:05
Workshops and Talks!

World Tour 2023

Tow Tea @ The Tow Center - New York, New York
2023-02-17, 5:00 - 6:30 PM ET
Workshop: Finding and using undocumented APIs

Net Inclusion - San Antonio, Texas
2023-03-01, 2:30 - 3:30 PM CT
Panel: Advancing Digital Inclusion Data Quality, Tools, and Applications
Co-paneling with David Keyes, Christine Parker, and Ryan Palmer

@yinleon
yinleon / requirements.txt
Created January 3, 2022 21:32
Tesseract OCR in Python
numpy
tqdm
pdf2image
opencv-python
pytesseract
Pillow
@yinleon
yinleon / value_counts.py
Last active March 9, 2023 16:49
Perform both a normalized and a regular value_counts on a column (`col`) in a dataframe (`df`).
def value_counts(df: pd.DataFrame,
                 col: str,
                 *args, **kwargs) -> pd.DataFrame:
    """
    For a DataFrame (`df`): display normalized (percentage)
    `value_counts(normalize=True)` and regular counts
    `value_counts()` for a given `col`.
    """
    count = df[col].value_counts(*args, **kwargs).to_frame(name='count')
    perc = df[col].value_counts(normalize=True, *args, **kwargs) \
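The preview above is cut off before the two results are combined. As a minimal sketch of the same idea, and not the gist's actual remainder, the helper below places raw counts and normalized shares side by side. The name `value_counts_with_share`, the `share` column label, and the sample frame are all assumptions made for illustration.

```python
import pandas as pd


def value_counts_with_share(df: pd.DataFrame, col: str, **kwargs) -> pd.DataFrame:
    """Return raw counts and normalized shares for `col` side by side.

    Hypothetical helper; do not pass `normalize` in kwargs, it is set here.
    """
    # Raw counts per distinct value (most frequent first by default).
    count = df[col].value_counts(**kwargs).rename("count")
    # The same call with normalize=True gives each value's share of the rows.
    share = df[col].value_counts(normalize=True, **kwargs).rename("share")
    # Both Series are indexed by the distinct values, so concat aligns them.
    return pd.concat([count, share], axis=1)


# Made-up sample data, only to exercise the helper.
example = pd.DataFrame({"species": ["setosa", "setosa", "virginica", "setosa"]})
summary = value_counts_with_share(example, "species")
print(summary)
```

Renaming each Series explicitly keeps the column labels stable across pandas versions, which name the result of `value_counts` differently.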
@yinleon
yinleon / notebook_markdown_to_text.py
Created March 12, 2021 18:07
This code snippet reads the Markdown cells in a Jupyter notebook and prints them. This should help when you want to spellcheck and vet the text.
import json
fn = 'notebook.ipynb'
notebook = json.load(open(fn))
notebook.keys()
for cell in notebook['cells']:
    if cell['cell_type'] == "markdown":
        for sent in cell['source']:
            if sent == '\n':
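The preview stops inside the loop, so the handling of blank lines is not visible. The following is a rough, self-contained sketch of the same approach, assuming blank lines are simply skipped. The function name `markdown_lines` and the in-memory sample notebook are hypothetical and stand in for `json.load` on a real `.ipynb` file.

```python
import json


def markdown_lines(notebook: dict) -> list[str]:
    """Collect the non-blank source lines of every Markdown cell."""
    lines = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        # The notebook format allows `source` to be one string or a list of strings.
        if isinstance(source, str):
            source = source.splitlines()
        for line in source:
            # Skip empty separators (assumed behavior; the original is cut off here).
            if line.strip():
                lines.append(line.rstrip("\n"))
    return lines


# A tiny made-up notebook, parsed the same way a file on disk would be.
sample = json.loads("""
{"cells": [
  {"cell_type": "markdown", "source": ["# Title\\n", "\\n", "Some prose."]},
  {"cell_type": "code", "source": ["print('hi')"]}
]}
""")
for line in markdown_lines(sample):
    print(line)
```

Returning a list rather than printing inside the loop makes it easy to pipe the text into a spellchecker afterwards.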
@yinleon
yinleon / pandas_apply_tips.md
Last active March 1, 2021 23:43
Some tips for Pandas df.apply(). This includes using status bars and multiple CPU cores.

Apply in Pandas

Here are some handy helpers for df.apply. You will learn to use a status bar and multiple cores.

import time
import pandas as pd

Let's use the iris dataset and perform an arbitrary function.
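The body of this example is not shown in the preview. One common way to get a status bar on `apply` is tqdm's pandas integration, sketched below on the assumption that this is the kind of helper meant. The three-row frame is a made-up stand-in for the iris data, and `slow_area` is a hypothetical placeholder for the arbitrary function.

```python
import time

import pandas as pd
from tqdm import tqdm

# Registers `progress_apply` on DataFrame and Series objects.
tqdm.pandas()

# Made-up stand-in for a few iris rows (not the real dataset).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
})


def slow_area(row: pd.Series) -> float:
    """Hypothetical row-wise function; the sleep mimics slow work."""
    time.sleep(0.01)
    return row["sepal_length"] * row["sepal_width"]


# Same call shape as df.apply(..., axis=1), with a progress bar on stderr.
df["sepal_area"] = df.progress_apply(slow_area, axis=1)
print(df)
```

`progress_apply` takes the same arguments as `apply`, so switching between the two is a one-word change.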

Multiprocessing in Pandas

If you need to read many files into one dataframe, use this snippet:

from multiprocessing import Pool
from tqdm import tqdm
import pandas as pd

def file_parser_func(fn: str):
    """
    Read a file into a dataframe and return a list of dictionaries
@yinleon
yinleon / create_markdown_table.py
Last active March 11, 2021 00:44
A simple script to make a Markdown table for a data dictionary (assumes you just have a column name and description).
"""
A simple script to make a Markdown table for a data dictionary (assumes you just have a column name and description).
"""
import pandas as pd
col2description = {
    "Name": "What you can call me",
    "Id": "The identifier",
    "Nickname": "Do you have to ask?"
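The preview ends partway through the dictionary, so the table-building step is not visible. As a sketch of how such a mapping could become a Markdown table, the version below uses only the standard library instead of pandas. The helper names `escape_cell` and `to_markdown_table` and the header labels are assumptions, and the sample reuses the three entries visible above.

```python
def escape_cell(text: str) -> str:
    """Escape pipes so a literal `|` does not end the table cell early."""
    return text.replace("|", "\\|")


def to_markdown_table(col2description: dict[str, str]) -> str:
    """Render a column-name -> description mapping as a two-column Markdown table."""
    lines = ["| Column | Description |", "| --- | --- |"]
    for name, description in col2description.items():
        lines.append(f"| {escape_cell(name)} | {escape_cell(description)} |")
    return "\n".join(lines)


# The three entries visible in the preview above.
sample_dictionary = {
    "Name": "What you can call me",
    "Id": "The identifier",
    "Nickname": "Do you have to ask?",
}
print(to_markdown_table(sample_dictionary))
```

Dictionaries keep insertion order, so the rows appear in the order the columns were listed.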