Problems with creating a large MultiIndex (10 million rows) in Python Pandas, used to reindex a large DataFrame
My situation: I have a DataFrame with a MultiIndex made up of a timestamp and a number (wavelength, 280-4000 nm). The wavelength spacing changes from every 1 nm to every 5 nm. I need 1 nm spacing throughout, so I plan to linearly interpolate after reindexing the DataFrame.
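The reindex-then-interpolate plan described above can be sketched on a toy frame. The data, the column name `value`, and the 5 nm starting grid are assumptions made up for illustration; only the level names come from the question. Interpolation is done per timestamp so values never bleed from one timestamp into the next, and `method="linear"` treats rows as equally spaced, which matches a uniform 1 nm target grid.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two timestamps, wavelengths at 5 nm spacing.
times = pd.date_range("2013-01-01", periods=2, freq="D")
coarse_waves = np.arange(280, 301, 5)  # 280, 285, ..., 300
coarse_idx = pd.MultiIndex.from_product(
    [times, coarse_waves], names=["tmstamp", "wvlgth"]
)
# The value equals the wavelength, so the interpolation is easy to verify.
df = pd.DataFrame(
    {"value": coarse_idx.get_level_values("wvlgth").to_numpy(dtype=float)},
    index=coarse_idx,
)

# Target index: every timestamp paired with every 1 nm wavelength.
fine_waves = np.arange(280, 301, 1)
fine_idx = pd.MultiIndex.from_product(
    [times, fine_waves], names=["tmstamp", "wvlgth"]
)

# Reindexing introduces NaN rows at the wavelengths that were missing.
ri_df = df.reindex(fine_idx)

# Fill the gaps linearly, one timestamp at a time.
filled = ri_df.copy()
filled["value"] = ri_df.groupby(level="tmstamp")["value"].transform(
    lambda s: s.interpolate(method="linear")
)
```

On real data the same pattern applies column by column; whether per-timestamp grouping is needed depends on whether each timestamp starts and ends on a measured wavelength.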
I tried to create the MultiIndex using pd.MultiIndex.from_product(), but providing 2 lists of 4000 items in length each resulted in Python using up all of the computer's RAM. The code looks like:
mindex = pd.MultiIndex.from_product([times_list, waves_list], names=['tmstamp', 'wvlgth'])
from_product() is such a simple function that I don't think I'm messing it up, and I would think it should be able to handle larger lists than the ones I've passed it.
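As a quick sanity check on the sizes involved, the length of a from_product() index is the product of the input lengths. The sketch below uses small stand-in inputs (hypothetical values, not the real times_list and waves_list) and then works out the row count for 4000 timestamps and wavelengths 280 to 4000 nm inclusive.

```python
import numpy as np
import pandas as pd

# Small stand-ins for times_list and waves_list (hypothetical values).
times_small = pd.date_range("2013-01-01", periods=3, freq="s")
waves_small = np.arange(280, 285)  # 5 wavelengths

mi_small = pd.MultiIndex.from_product(
    [times_small, waves_small], names=["tmstamp", "wvlgth"]
)
# The length is the product of the input lengths: 3 * 5 = 15.
print(len(mi_small))

# Scaled up: 4000 timestamps x 3721 wavelengths (280..4000 inclusive).
n_rows = 4000 * len(np.arange(280, 4001))
print(n_rows)  # 14884000
```

Printing len(times_list) and len(waves_list) before calling from_product() is a cheap way to confirm that the inputs really are the size you expect.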
To try to get around this, I've used pd.MultiIndex() directly and passed it the unique levels, identical to what I passed to .from_product(), but constructed the labels for each level using the code below:
times = pd.Series(df.index.get_level_values('tmstamp').values).unique()
times_series = pd.Series(times)
times_label_list = list()
counter = 0
for i in times_series:
    # one label per wavelength (3721 of them) for each timestamp
    temp_list = pd.Series([counter] * 3721)
    times_label_list.append(temp_list)
    counter += 1
times_label = pd.concat(times_label_list)
and
waves_levels = np.arange(280, 4001, 1).tolist()
waves_label = np.arange(0, 3721, 1).tolist() * times_count
which are used in
midx = pd.MultiIndex([times_list, waves_levels], labels=[times_label, waves_label], names=['tmstamp', 'wvlng'])
and the MultiIndex is then used to reindex the df:
ri_df = df.reindex(midx)
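For comparison, the integer labels for a full cross product can also be built without a Python loop, using np.repeat and np.tile. This is only a sketch with small hypothetical inputs; note that recent pandas versions (0.24 and later) name the keyword codes, where older versions used labels.

```python
import numpy as np
import pandas as pd

# Hypothetical small inputs standing in for the unique timestamps and wavelengths.
times_levels = pd.date_range("2013-01-01", periods=3, freq="s")
waves_levels = np.arange(280, 285)

n_times, n_waves = len(times_levels), len(waves_levels)

# Each timestamp code is repeated once per wavelength: 0,0,0,0,0,1,1,...
times_codes = np.repeat(np.arange(n_times), n_waves)
# The wavelength codes cycle once per timestamp: 0,1,2,3,4,0,1,...
waves_codes = np.tile(np.arange(n_waves), n_times)

# `codes` is the keyword in pandas >= 0.24; older versions called it `labels`.
midx = pd.MultiIndex(
    levels=[times_levels, waves_levels],
    codes=[times_codes, waves_codes],
    names=["tmstamp", "wvlgth"],
)

# The hand-built index should match what from_product() produces.
expected = pd.MultiIndex.from_product(
    [times_levels, waves_levels], names=["tmstamp", "wvlgth"]
)
print(midx.equals(expected))
```

Comparing against from_product() on a small input like this is a simple way to confirm that a hand-built index has the ordering you intended before scaling it up.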
My questions are:
- Am I messing up pd.MultiIndex.from_product(), or can it not handle being passed large lists?
- Is my workaround valid, or am I falling into any pitfalls?
Thanks for the help!
This shouldn't be a problem. You need to be more specific about what the times_list values are.
In [2]: mi = pd.MultiIndex.from_product([pd.date_range('20130101',freq='s',periods=4000),
   ...:                                  np.arange(280,4000)],names=['times','wl'])
In [4]: mi.nbytes/(1024*1024.0)
Out[4]: 56.82167148590088
In [6]: len(mi)
Out[6]: 14880000