@yinleon
Last active March 1, 2021 23:43
Some tips for Pandas df.apply(), including progress bars and multiple CPU cores.

Apply in Pandas

Here are some handy helpers for df.apply. You will learn to add a progress bar and to use multiple CPU cores.

import time
import pandas as pd

Let's use the iris dataset and perform an arbitrary function.

file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(file_name)
df['sepal_length'].describe()
count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal_length, dtype: float64
def add_one(row):
    """
    This adds 1 to `row` after a brief sleep.
    """
    time.sleep(.01)
    return row + 1
df['sepal_length'].apply(add_one)
0      6.1
1      5.9
2      5.7
3      5.6
4      6.0
      ... 
145    7.7
146    7.3
147    7.5
148    7.2
149    6.9
Name: sepal_length, Length: 150, dtype: float64
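df.apply also works row-wise across a whole DataFrame when you pass axis=1: each row arrives at your function as a Series. A minimal sketch, using two of the iris columns:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
})

# axis=1 passes each row (as a Series) to the function,
# so we can combine multiple columns per row
ratios = df.apply(lambda row: row["sepal_length"] / row["sepal_width"], axis=1)
print(ratios.tolist())
```

Row-wise apply is convenient but slow, since Pandas builds a Series object per row; it is exactly the kind of workload the tips below help with.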

Status update

To get a progress bar for apply functions, call tqdm.pandas() and then use df.progress_apply in lieu of df.apply.

First, install tqdm: pip install tqdm

from tqdm import tqdm
tqdm.pandas()
df['sepal_length'].progress_apply(add_one)
100%|██████████| 150/150 [00:01<00:00, 97.78it/s]
0      6.1
1      5.9
2      5.7
3      5.6
4      6.0
      ... 
145    7.7
146    7.3
147    7.5
148    7.2
149    6.9
Name: sepal_length, Length: 150, dtype: float64

That was fast: it took about a second and a half to process the entire column. Let's see what happens when we make the dataframe 50x larger...

df = pd.DataFrame(df.to_dict(orient='records') * 50)
len(df)
7500
df['sepal_length'].progress_apply(add_one)
100%|██████████| 7500/7500 [01:16<00:00, 97.60it/s]
0       6.1
1       5.9
2       5.7
3       5.6
4       6.0
       ... 
7495    7.7
7496    7.3
7497    7.5
7498    7.2
7499    6.9
Name: sepal_length, Length: 7500, dtype: float64

Using multiple-cores

For larger dataframes, processing takes longer. To speed things up, we can use more than one CPU core. In pure Python, we'd reach for something like the multiprocessing module. In Pandas, we can use pandarallel.

To install: pip install pandarallel

This will use 16 workers across your CPU cores:

from pandarallel import pandarallel
n_jobs = 16
pandarallel.initialize(nb_workers=n_jobs)
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.

Note: pandarallel.initialize accepts a progress_bar (boolean) argument, but it doesn't work in JupyterLab.

start_time = time.time()
df['sepal_length'].parallel_apply(add_one)
end_time = time.time()
print(f"finished in {end_time - start_time:.2f} seconds")
finished in 4.85 seconds

The previous, non-parallel version took 76 seconds!
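One caveat: the parallel win here comes from add_one's sleep. For cheap element-wise math, a plain vectorized operation beats any flavor of apply, because it runs in compiled code with no per-row Python call:

```python
import pandas as pd

s = pd.Series([5.1, 4.9, 4.7])

# vectorized addition: no apply, no workers, effectively instant
result = s + 1
print(result.tolist())
```

Reach for progress_apply and parallel_apply when the per-row function is genuinely expensive (I/O, model inference, parsing), and for vectorized operations when it isn't.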
