@yinleon
Last active March 1, 2021 23:43
Some tips for Pandas df.apply(), including progress bars and multiple CPU cores.

Apply in Pandas

Here are some handy helpers for df.apply. You will learn to add a progress bar and to use multiple CPU cores.

import time
import pandas as pd

Let's use the iris dataset and perform an arbitrary function.

file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(file_name)
df['sepal_length'].describe()
count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal_length, dtype: float64
def add_one(row):
    """
    This adds 1 to `row` after a brief sleep.
    """
    time.sleep(.01)
    return row + 1
df['sepal_length'].apply(add_one)
0      6.1
1      5.9
2      5.7
3      5.6
4      6.0
      ... 
145    7.7
146    7.3
147    7.5
148    7.2
149    6.9
Name: sepal_length, Length: 150, dtype: float64
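df.apply also works row-wise across a whole DataFrame when you pass axis=1: each row arrives at your function as a Series. A minimal sketch, using two of the iris columns:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
})

# axis=1 passes each row (as a Series) to the function,
# so we can combine multiple columns per row
ratios = df.apply(lambda row: row["sepal_length"] / row["sepal_width"], axis=1)
print(ratios.tolist())
```

Row-wise apply is convenient but slow, since Pandas builds a Series object per row; it is exactly the kind of workload the tips below help with.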

Status update

To get a progress bar for apply functions, call tqdm.pandas() and then use df.progress_apply in lieu of df.apply.

First, install tqdm: pip install tqdm

from tqdm import tqdm
tqdm.pandas()
df['sepal_length'].progress_apply(add_one)
100%|██████████| 150/150 [00:01<00:00, 97.78it/s]
0      6.1
1      5.9
2      5.7
3      5.6
4      6.0
      ... 
145    7.7
146    7.3
147    7.5
148    7.2
149    6.9
Name: sepal_length, Length: 150, dtype: float64

That was fast: it took about a second and a half to process the entire column. Let's see what happens when we make the dataframe 50x larger...

df = pd.DataFrame(df.to_dict(orient='records') * 50)
len(df)
7500
df['sepal_length'].progress_apply(add_one)
100%|██████████| 7500/7500 [01:16<00:00, 97.60it/s]
0       6.1
1       5.9
2       5.7
3       5.6
4       6.0
       ... 
7495    7.7
7496    7.3
7497    7.5
7498    7.2
7499    6.9
Name: sepal_length, Length: 7500, dtype: float64

Using multiple-cores

For larger dataframes, processing takes longer. To speed things up, we can use more than one CPU core. In pure Python, we'd reach for something like the multiprocessing module. In Pandas, we can use pandarallel.

To install: pip install pandarallel

This will use 16 workers across your CPU cores:

from pandarallel import pandarallel
n_jobs = 16
pandarallel.initialize(nb_workers=n_jobs)
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.

Note: pandarallel.initialize accepts a progress_bar (boolean) argument, but it doesn't work in JupyterLab.

start_time = time.time()
df['sepal_length'].parallel_apply(add_one)
end_time = time.time()
print(f"finished in {end_time - start_time:.2f} seconds")
finished in 4.85 seconds

The previous, non-parallel version took 76 seconds!
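One caveat: the parallel win here comes from add_one's sleep. For cheap element-wise math, a plain vectorized operation beats any flavor of apply, because it runs in compiled code with no per-row Python call:

```python
import pandas as pd

s = pd.Series([5.1, 4.9, 4.7])

# vectorized addition: no apply, no workers, effectively instant
result = s + 1
print(result.tolist())
```

Reach for progress_apply and parallel_apply when the per-row function is genuinely expensive (I/O, model inference, parsing), and for vectorized operations when it isn't.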
