python - Replacing missing values with the median/mean for continuous variables and the mode for categorical variables in a pandas DataFrame (after grouping the data by a column)
I have a pandas DataFrame with missing values (np.nan) and I am trying to replace them. The last column of my data is "class". I need to group the data by class, get the mean/median/mode of each group for each column (depending on whether the data is categorical or continuous, and normal or not), and replace the missing values in that group of the column with the respective mean/median/mode.
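This is the code I have come up with, and I know it is overkill. If I could:
- group the columns of the DataFrame by class
- get the median/mode/mean of the groups for each column
- replace the missing values in those groups
- recombine them into the original df
it would be great.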
But I ended up finding the replacement values (mean/median/mode) group-wise, storing them in a dict, separating the rows that contain NaN from the rows that do not, replacing the missing values in the NaN rows, and then trying to join them back into a DataFrame (which I don't yet know how to do).
    import collections

    import numpy as np
    import pandas as pd
    from scipy import stats


    def fillmissing(df, datatype):
        '''
        Args:
            df (2d array / dict):
                e.g. {'attribute1': [12, 24, 25], 'attribute2': ['good', 'bad']}
            datatype (dict): maps each attribute name of df to 0/1,
                indicating a categorical/continuous variable,
                e.g. {'attribute1': 1, 'attribute2': 0}

        Returns:
            DataFrame with the missing values filled.
            Also writes a file with the missing values replaced.
        '''
        datalabels = list(df.columns.values)

        # dictionary to hold the values to put in place of NaN
        replacevalues = {}

        for eachlabel in datalabels:
            thisser = df[eachlabel]
            if datatype[eachlabel] == 1:  # if continuous variable
                _, pval = stats.normaltest(thisser)
                groupedd = thisser.groupby(df['class'])
                innerdict = {}
                for name, group in groupedd:
                    if pval < 0.5:
                        groupmiddle = group.median()  # median of the group
                    else:
                        groupmiddle = group.mean()  # mean (if the group is normal)
                    innerdict[name.strip()] = groupmiddle
                replacevalues[eachlabel] = innerdict
            else:  # if the series is categorical
                # freqcount = collections.Counter(thisser)
                groupedd = thisser.groupby(df['class'])
                innerdict = {}
                for name, group in groupedd:
                    freqc = collections.Counter(group)
                    # most frequent value of the attribute (grouped by class)
                    mostfreq = freqc.most_common(1)
                    # newgroup = group.replace(np.nan, mostfreq)
                    innerdict[name.strip()] = mostfreq[0][0].strip()
                replacevalues[eachlabel] = innerdict
        print(replacevalues)

        # replace the missing values =======================
        newfile = open('missingreplaced.csv', 'w')
        newdf = df
        mask = False
        for col in df.columns:
            mask = mask | df[col].isnull()
        # DataFrame of the tuples that contain nulls
        dfnulls = df[mask]
        dfnotnulls = df[~mask]

        for _, row in dfnulls.iterrows():
            for colname in datalabels:
                if pd.isnull(row[colname]):
                    if row['class'].strip() == '>50k':
                        row[colname] = replacevalues[colname]['>50k']
                    else:
                        row[colname] = replacevalues[colname]['<=50k']
                newfile.write(str(row[colname]) + ",")
            newdf.append(row)
            newfile.write("\n")

        # here: add newdf to dfnotnulls to get finaldf
        return finaldf
If I understand correctly, this is in the documentation, but probably not where you'd be looking if you're asking this question. See the note regarding mode at the bottom, as it is a little trickier than mean and median.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'v': [1, 2, 2, np.nan, 3, 4, 4, np.nan]},
                      index=[1, 1, 1, 1, 2, 2, 2, 2])

    # fill the NaN in each index group with that group's mean, median, or mode
    df['v_mean'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mean()))
    df['v_med'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.median()))
    df['v_mode'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mode()[0]))

    df

         v    v_mean  v_med  v_mode
    1    1  1.000000      1       1
    1    2  2.000000      2       2
    1    2  2.000000      2       2
    1  NaN  1.666667      2       2
    2    3  3.000000      3       3
    2    4  4.000000      4       4
    2    4  4.000000      4       4
    2  NaN  3.666667      4       4

Note that mode() may not be unique, unlike mean and median, and pandas returns it as a Series for that reason. To deal with that, I just took the simplest route and added [0] in order to extract the first member of the Series.
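To carry the same idea over to the situation in the question (grouping by a regular "class" column rather than an index level, with a mix of continuous and categorical columns), something like the following might work. It is only a sketch under some assumptions: fill_by_group and first_mode are hypothetical helper names, the sample data is made up, the mean-versus-median choice is left to a parameter instead of a normality test, and the DataFrame is assumed to have a unique index.

```python
import numpy as np
import pandas as pd


def first_mode(s):
    """Return the first mode of s, or NaN if s has no non-missing values."""
    modes = s.mode()  # mode() ignores NaN and may return several values
    return modes.iloc[0] if not modes.empty else np.nan


def fill_by_group(df, group_col, numeric_stat="median"):
    """Fill NaN per group: numeric columns with mean/median, others with the mode."""
    out = df.copy()
    grouped = df.groupby(group_col)
    for col in df.columns.drop(group_col):
        if pd.api.types.is_numeric_dtype(df[col]):
            # group statistic broadcast back to the shape of the column
            fill = grouped[col].transform(numeric_stat)
        else:
            # most frequent value of each group, broadcast the same way
            fill = grouped[col].transform(first_mode)
        out[col] = df[col].fillna(fill)
    return out


# made-up sample data, for illustration only
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 50.0, np.nan, 60.0],
    "workclass": ["private", "private", np.nan, "gov", "gov", np.nan],
    "class": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

filled = fill_by_group(df, "class", numeric_stat="median")
print(filled)
```

Because transform returns a result aligned with the original rows, there is no need to split the rows into NaN and non-NaN parts and join them again afterwards. In the sample above the two missing ages are filled with 30.0 and 55.0 (the medians of their classes), and the original df is left unchanged.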