python - Replace missing values with the mean/median for continuous variables and the mode for categorical variables in a pandas DataFrame (after grouping the data by a column) -


I have a pandas DataFrame with some missing values (np.nan), and I am trying to replace them. The last column of my data is "class". I need to group the data by class, get the mean/median/mode of each column within each group (depending on whether the data is categorical or continuous, normal or not), and replace the missing values in each group of a column with the respective mean/median/mode.

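This is the code I have come up with, and I know it is overkill. If I could:

  1. group each column of the DataFrame by class
  2. get the median/mode/mean of each group in each column
  3. replace the missing values within those groups
  4. recombine them into the original df

it would be great.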

But for now I have landed on finding the replacement values (mean/median/mode) group-wise, storing them in a dict, separating the rows that contain NaN from the rows that do not, replacing the missing values in the NaN rows, and trying to join them back into a DataFrame (which I don't know how to do yet).

    import collections

    import pandas as pd
    from scipy import stats


    def fillmissing(df, datatype):
        '''
        args:
            df (2d array/dict):
                eg: {'attribute1': [12, 24, 25], 'attribute2': ['good', 'bad']}
            datatype (dict): dictionary with the attribute names of df as keys
                and values 0/1 indicating a categorical/continuous variable
                eg: {'attribute1': 1, 'attribute2': 0}

        returns:
            dataframe with missing values filled
            writes a file with the missing values replaced.
        '''
        datalabels = list(df.columns.values)

        # dictionary to hold the values to put in place of nan
        replacevalues = {}

        for eachlabel in datalabels:
            thisser = df[eachlabel]
            if datatype[eachlabel] == 1:  # if continuous variable
                _, pval = stats.normaltest(thisser)
                groupedd = thisser.groupby(df['class'])

                innerdict = {}
                for name, group in groupedd:
                    if pval < 0.5:
                        groupmiddle = group.median()  # median of the group
                    else:
                        groupmiddle = group.mean()    # mean (if the group is normal)
                    innerdict[name.strip()] = groupmiddle
                replacevalues[eachlabel] = innerdict

            else:  # if the series is categorical
                # freqcount = collections.Counter(thisser)
                groupedd = thisser.groupby(df['class'])
                innerdict = {}
                for name, group in groupedd:
                    freqc = collections.Counter(group)
                    # most frequent value of the attribute (grouped by class)
                    mostfreq = freqc.most_common(1)
                    # newgroup = group.replace(np.nan, mostfreq)
                    innerdict[name.strip()] = mostfreq[0][0].strip()
                replacevalues[eachlabel] = innerdict
        print(replacevalues)

        # replace the missing values =======================
        newfile = open('missingreplaced.csv', 'w')
        newdf = df

        mask = False
        for col in df.columns:
            mask = mask | df[col].isnull()

        # dataframe of all the tuples that contain nulls
        dfnulls = df[mask]
        dfnotnulls = df[~mask]

        for _, row in dfnulls.iterrows():
            for colname in datalabels:
                if pd.isnull(row[colname]):
                    if row['class'].strip() == '>50k':
                        row[colname] = replacevalues[colname]['>50k']
                    else:
                        row[colname] = replacevalues[colname]['<=50k']
                newfile.write(str(row[colname]) + ",")
            newdf.append(row)
            newfile.write("\n")

        # here I have to add newdf to dfnotnulls to get finaldf

        return finaldf

If I understand correctly, this is covered in the documentation, but probably not where you'd be looking if you're asking this question. See the note regarding mode at the bottom, as it is a bit trickier than mean and median.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'v': [1, 2, 2, np.nan, 3, 4, 4, np.nan]
    }, index=[1, 1, 1, 1, 2, 2, 2, 2])

    # fill each group's NaN with that group's mean, median and mode
    df['v_mean'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mean()))
    df['v_med'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.median()))
    df['v_mode'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mode()[0]))

    df

         v    v_mean  v_med  v_mode
    1    1  1.000000      1       1
    1    2  2.000000      2       2
    1    2  2.000000      2       2
    1  NaN  1.666667      2       2
    2    3  3.000000      3       3
    2    4  4.000000      4       4
    2    4  4.000000      4       4
    2  NaN  3.666667      4       4
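The same idea should carry over to the layout in the question, where the grouping key is a regular column rather than the index. Below is a minimal sketch under that assumption. The toy data and the column names `age` and `workclass` are made up for illustration; only `class` and the 0/1 `datatype` dict come from the question. The median is used for the continuous column here, and the normality check from the question is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "age" and "workclass" are invented column names.
df = pd.DataFrame({
    "age":       [20.0, np.nan, 40.0, 50.0, np.nan, 70.0],
    "workclass": ["private", "private", np.nan, "gov", "gov", np.nan],
    "class":     ["<=50k", "<=50k", "<=50k", ">50k", ">50k", ">50k"],
})

# 1 = continuous, 0 = categorical, mirroring the `datatype` dict in the question.
datatype = {"age": 1, "workclass": 0}

filled = df.copy()
for col, is_continuous in datatype.items():
    grouped = df.groupby("class")[col]
    if is_continuous:
        # Fill with the median of each class group (x.mean() would also work).
        filled[col] = grouped.transform(lambda x: x.fillna(x.median()))
    else:
        # Fill with the most frequent value of each class group.
        filled[col] = grouped.transform(lambda x: x.fillna(x.mode().iloc[0]))

print(filled)
```

Because `transform` returns a result aligned with the original index, there is no separate split-and-recombine step: the filled columns are assigned straight back.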

Note that mode() may not be unique, unlike mean and median, and pandas returns it as a Series for that reason. To deal with that, I took the simplest route and added [0] in order to extract the first member of the Series.
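One more edge case is worth guarding against: if every value in a group is missing, `mode()` returns an empty Series, and indexing its first element raises an error. A small helper along these lines may be safer. The name `fill_with_mode` is hypothetical, and leaving an all-NaN group untouched is just one possible policy.

```python
import numpy as np
import pandas as pd

def fill_with_mode(group):
    """Fill NaN in one group with its most frequent value, if there is one."""
    modes = group.mode()      # NaN is ignored by default (dropna=True)
    if modes.empty:           # the whole group is missing: nothing to fill with
        return group
    # mode() returns its values in sorted order, so on a tie this picks the smallest.
    return group.fillna(modes.iloc[0])

s = pd.Series([1, 1, 2, 2, np.nan, np.nan, np.nan],
              index=["a", "a", "a", "a", "a", "b", "b"])

result = s.groupby(level=0).transform(fill_with_mode)
print(result)
```

In group `a` the values 1 and 2 tie, so the missing entry is filled with 1. Group `b` has no observed values and stays NaN.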

