@yinleon
Created March 1, 2021 21:31

Multiprocessing in Pandas

If you need to read many files into one dataframe, use this snippet:

from multiprocessing import Pool
from tqdm import tqdm
import pandas as pd

def file_parser_func(fn: str):
    """
    Read a file into a dataframe and return a list of dictionaries
    """
    return pd.read_csv(fn).to_dict('records')

files = ['a.csv', 'b.csv']

if __name__ == '__main__':  # needed where workers are spawned (Windows, macOS)
    data = []
    with Pool(processes=8) as pool:
        # each result is the list of row records for one file
        for records in tqdm(pool.imap_unordered(file_parser_func, files),
                            total=len(files)):
            data.extend(records)

    df = pd.DataFrame(data)
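
The snippet above turns every file into a list of dictionaries and then rebuilds a dataframe from them. A possible variation, sketched here as an assumption rather than as part of the original gist, is to have each worker return its dataframe and stack them with `pd.concat`. The names `read_one` and `read_many` are hypothetical.

```python
from multiprocessing import Pool

import pandas as pd


def read_one(path: str) -> pd.DataFrame:
    """Read a single CSV file into a dataframe (runs in a worker process)."""
    return pd.read_csv(path)


def read_many(paths: list[str], processes: int = 2) -> pd.DataFrame:
    """Read every path in parallel and stack the results into one dataframe."""
    if not paths:
        # Nothing to read, so skip starting a pool at all.
        return pd.DataFrame()
    with Pool(processes=processes) as pool:
        # imap_unordered yields frames as workers finish, so the row order
        # across files may not follow the order of `paths`.
        frames = list(pool.imap_unordered(read_one, paths))
    # ignore_index=True renumbers rows 0..n-1 instead of repeating each
    # file's own index.
    return pd.concat(frames, ignore_index=True)
```

To try it, call `read_many(['a.csv', 'b.csv'])` from under an `if __name__ == '__main__':` guard. Whether this is faster than the records approach depends on the files, so it is worth timing both on your own data.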

If you have a large dataframe and need to run a row-wise apply function, try pandarallel:

from pandarallel import pandarallel

n_jobs = 8

pandarallel.initialize(nb_workers=n_jobs, progress_bar=True)

# some_func is your own function; with axis=1 it receives one row at a time
df.parallel_apply(some_func, axis=1)
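
If installing pandarallel is not an option, the same idea can be approximated with plain `multiprocessing`: cut the dataframe into row slices, apply the function to each slice in a worker, and concatenate the results. This is a minimal sketch under that assumption, not a description of how pandarallel works internally. `row_total`, `apply_chunk`, `split_frame`, and `parallel_row_apply` are hypothetical names, and `row_total` stands in for `some_func`.

```python
from multiprocessing import Pool

import pandas as pd


def row_total(row: pd.Series):
    """Hypothetical stand-in for some_func: add columns 'a' and 'b'."""
    return row["a"] + row["b"]


def apply_chunk(chunk: pd.DataFrame) -> pd.Series:
    """Run the row-wise function on one slice of the frame (in a worker)."""
    return chunk.apply(row_total, axis=1)


def split_frame(df: pd.DataFrame, n_chunks: int) -> list[pd.DataFrame]:
    """Cut df into at most n_chunks contiguous, non-empty row slices."""
    size = max(1, -(-len(df) // n_chunks))  # ceiling division
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]


def parallel_row_apply(df: pd.DataFrame, n_jobs: int = 2) -> pd.Series:
    """Apply row_total to every row of df using n_jobs worker processes."""
    chunks = split_frame(df, n_jobs)
    if not chunks:
        # Empty input: return an empty result without starting a pool.
        return pd.Series(dtype="float64")
    with Pool(processes=n_jobs) as pool:
        # map returns results in chunk order, so the original row order
        # and index labels are preserved after concatenation.
        parts = pool.map(apply_chunk, chunks)
    return pd.concat(parts)
```

As with the file-reading snippet, call `parallel_row_apply(df, n_jobs=8)` from under an `if __name__ == '__main__':` guard. Each slice is pickled and sent to a worker, so for cheap functions the copying overhead can outweigh the gain.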