This repository has been archived by the owner on Nov 30, 2023. It is now read-only.

Is it possible to measure progress of pandas.concat? #28

Open
joshjacobson opened this issue May 3, 2017 · 6 comments

Comments

@joshjacobson

I have 25,000 pandas dataframes, each with ~300 columns and a single data row plus column labels. About 50% of each dataframe's column labels are unique to it, and the other 50% are shared with some of the other dataframes.

I have a list l containing these 25,000 dataframes, such that l[i] is a dataframe.

I need to combine these, and pd.concat(l) does exactly what I want.

That said, the operation is taking a long time. I'd like to estimate, or at least measure, how long it is going to take. I was hoping tqdm would give me a progress bar, but I don't see how to implement it. It looks like support for this isn't there?

@milesgranger

Hi @joshjacobson

Might I suggest the following, which should accomplish what pd.concat() does in one line but lets you use this tool. Hope it helps. :)

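import numpy as np
import pandas as pd
from tqdm import tqdm

# Ten single-row frames with columns col_0 .. col_9
dataframes = [pd.DataFrame({'col_{}'.format(i): np.random.randint(0, 10, size=1) for i in range(10)})
              for _ in range(10)]

alltogether = pd.DataFrame()
for df in tqdm(dataframes):
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
    alltogether = pd.concat([alltogether, df], ignore_index=True)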

@casperdcl

@joshjacobson @milesgranger the real maintained repo is at https://github.com/tqdm/tqdm.

@swalkoAI

swalkoAI commented Dec 3, 2019

The only problem with this solution is that appending to a DataFrame is very slow. In my case, when I have ~700,000 dataframes it takes over 6 hours while the standard pd.concat() does its job in 20 minutes.

@casperdcl

Apart from the fact that this question should've been 1) asked on Stack Overflow, or 2) maybe opened as a feature request on the maintained repo (https://github.com/tqdm/tqdm), as requested a year ago (#28 (comment)), there's 3) alltogether = pd.concat(tqdm(dataframes)), which seems to work 😕

@austinmw

austinmw commented Aug 30, 2020

@casperdcl That doesn't actually measure the concat progress.
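
One possible middle ground between the row-by-row loop and a single pd.concat call is to concatenate in batches and advance the bar once per batch. The sketch below is only an illustration: the helper name concat_in_batches and the batch_size parameter are hypothetical and do not come from tqdm or pandas, and no timings were measured for it. The bar reflects the batch concatenations only, so the final concatenation of the partial results still runs without feedback.

```python
import pandas as pd
from tqdm import tqdm


def concat_in_batches(frames, batch_size=1000):
    """Concatenate a list of DataFrames batch by batch (hypothetical helper).

    The tqdm bar advances after each batch has been concatenated.
    """
    if not frames:
        # pd.concat raises ValueError on an empty list, so return early.
        return pd.DataFrame()

    partials = []
    with tqdm(total=len(frames), unit="df") as bar:
        for start in range(0, len(frames), batch_size):
            batch = frames[start:start + batch_size]
            partials.append(pd.concat(batch, ignore_index=True))
            bar.update(len(batch))

    # This last step combines the partial results and is not tracked by the bar.
    return pd.concat(partials, ignore_index=True)


# Small demonstration input: 50 single-row frames with partly overlapping columns.
frames = [pd.DataFrame({"shared": [i], "only_{}".format(i % 5): [i]})
          for i in range(50)]
combined = concat_in_batches(frames, batch_size=10)
```

A larger batch_size means fewer bar updates but less overhead from the intermediate results; a reasonable value would have to be found by experiment on the real data.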

@softhints

I think that the only way to measure the progress is by using Dask as a workaround:

import numpy as np
import pandas as pd
import dask.dataframe as dd
from tqdm.dask import TqdmCallback

n = 450000   # rows per dataframe
maxa = 700   # join keys are drawn from 0 .. maxa - 1

df1 = pd.DataFrame({'lkey': np.random.randint(0, maxa, n), 'lvalue': np.random.randint(0, int(1e8), n)})
df2 = pd.DataFrame({'rkey': np.random.randint(0, maxa, n), 'rvalue': np.random.randint(0, int(1e8), n)})

# Split each dataframe into 3 partitions so Dask has separate tasks to report on
sd1 = dd.from_pandas(df1, npartitions=3)
sd2 = dd.from_pandas(df2, npartitions=3)

# TqdmCallback draws a tqdm bar that advances as Dask tasks complete
with TqdmCallback(desc="compute"):
    sd1.merge(sd2, left_on='lkey', right_on='rkey').compute()

Source: Progress Bar for Merge Or Concat Operation With tqdm in Pandas
