How can I swap numbers inside a data block of repeating format using Linux commands?

June 15, 2013

I have a huge data file, and I hope to swap some numbers in the 2nd column only, in a file of the following format. The file has 25,000,000 datasets, with 8768 lines each.

%% Edited: a shorter 10-line example. Sorry for the inconvenience. This is a typical data block.

# dataset 1
#
# number of lines 10
#
# header lines
 5 11 3 10 120 90 0         0.952         0.881         0.898         2.744         0.034         0.030
 10 12 3 5 125 112 0         0.952         0.897         0.905         2.775         0.026         0.030
 50 10 3 48 129 120 0         1.061         0.977         0.965         3.063         0.001         0.026
 120 2 4 5 50 186 193 0         0.881         0.965         0.899         0.917         3.669         0.000        -0.005
 125 3 4 10 43 186 183 0         0.897         0.945         0.910         0.883         3.641         0.000         0.003
 186 5 4 120 125 249 280 0         0.899         0.910         0.931         0.961         3.727         0.000        -0.001
 193 6 4 120 275 118 268 0         0.917         0.895         0.897         0.937         3.799         0.000         0.023
 201 8 4 278 129 131 280 0         0.921         0.837         0.870         0.934         3.572         0.000         0.008
 249 9 4 186 355 179 317 0         0.931         0.844         0.907         0.928         3.615         0.000         0.008
 280 10 4 186 201 340 359 0         0.961         0.934         0.904         0.898         3.700         0.000         0.033
#
# dataset 1
#
# number of lines 10
...

As you can see, there are 7 repeating header lines at the head, and 1 trailing line at the end of each dataset. The header and trailing lines all begin with #. As a result, the data has 7 header lines, 8768 data lines, and 1 trailing line, for a total of 8776 lines per data block. The trailing line contains only a single '#'.

I want to swap numbers in the 2nd column only. First, I want to replace

1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222

in the 2nd column, and then,

666 => 6
333 => 3
222 => 2

in the 2nd column. I hope to apply this replacement to every one of the repeating datasets.

I tried this with Python, but the data is too big, so it fails with a memory error. How can I perform this swapping with Linux commands such as sed, awk, or cat?

thanks

best,

This might work for you, but you'd have to use GNU awk, as it's using the gensub command and $0 reassignment.

Put the following into an executable awk file (for example, script.awk):

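#!/usr/bin/awk -f

# Map old 2nd-column values to their new values.
BEGIN {
    a[1] = a[9] = a[10] = a[11] = 6
    a[2] = a[6] = a[7]  = a[8]  = 3
    a[3] = a[4] = a[5]          = 2
}

# Return the mapped value for c2, or c2 itself if it has no mapping.
# val is a local variable (declared as an extra parameter).
function swap( c2,            val ) {
    val = a[c2]
    return( val=="" ? c2 : val )
}

# Data lines: keep column 1 as-is and substitute column 2 (gensub is gawk-only).
/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }

47 # print line

Here's the breakdown: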

BEGIN - set up array a with the mappings to the new values.
Create a user-defined function swap to provide values for the 2nd column from the a array, or the value itself if it has no mapping. c2 is the element passed in, while val is a local variable (because no 2nd argument is passed in).
When a line starts with a space followed by a number and a space (the pattern), use gensub to replace the first occurrence of the first two numbers with the first number concatenated with a space and the return value of swap (the action). In this case, I'm using gensub's replacement text to preserve the first column's data. The second column is passed to swap using the field identifier $2. Using gensub should preserve the formatting of the data lines.
47 - an expression that evaluates to true triggers the default action of printing $0, which for data lines might have been modified. Any line that wasn't "data" is printed here without modification.
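The reason for going through gensub rather than assigning to $2 directly may be easier to see in a small comparison. The sketch below is not part of the answer's script: it uses a made-up sample line, hard-codes the replacement value 6 (the value 11 maps to in the question), and relies only on POSIX awk features. Assigning to a field makes awk rebuild $0 with single spaces, while splicing the new value into the original string with match and substr (a portable stand-in for gensub) leaves the spacing alone.

```shell
# Hypothetical sample line: leading spaces and wide column gaps.
line='   5     11      3   0.952'

# Assigning to $2 makes awk rebuild $0 using OFS (a single space).
rebuilt=$(printf '%s\n' "$line" | awk '{ $2 = 6 } 1')

# Splicing the new value into the original text keeps the spacing.
spliced=$(printf '%s\n' "$line" | awk '
    match($0, /^[ \t]*[0-9]+[ \t]+/) {       # column 1 plus the gap before column 2
        head = substr($0, 1, RLENGTH)
        rest = substr($0, RLENGTH + 1)
        if (match(rest, /^[0-9]+/))          # the digits of column 2
            $0 = head "6" substr(rest, RLENGTH + 1)
    }
    1')

printf '[%s]\n[%s]\n' "$rebuilt" "$spliced"
```

The first bracketed line comes out as `[5 6 3 0.952]` and the second as `[   5     6      3   0.952]`. Note that the spliced line is one character shorter than the input, because a two-digit value was replaced by a one-digit one; if strict column alignment matters, the new value would need padding.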

The provided data doesn't show all of the cases, so I made my own test file:

# 2 skip me
9 2 not going process me
 1 1 don't              change  matting
 2 2    4       23242.223       data
 3 3 data       that's  formatted
 4 4 7  that's  formatted
 5 5 data       that's  formatted
 6 6 data       that's  formatted
 7 7 data       that's  formatted
 8 8 data       that's  formatted
 9 9 data       that's  formatted
 10 10 data     that's  formatted
 11 11 data     that's  formatted
 12 12 data     that's  formatted
 13 13 data     that's  formatted
 14 s data      that's  formatted
# other data

Running the executable awk script (like ./script.awk data) gives the following output:

# 2 skip me
9 2 not going process me
 1 6 don't              change  matting
 2 3    4       23242.223       data
 3 2 data       that's  formatted
 4 2 7  that's  formatted
 5 2 data       that's  formatted
 6 3 data       that's  formatted
 7 3 data       that's  formatted
 8 3 data       that's  formatted
 9 6 data       that's  formatted
 10 6 data      that's  formatted
 11 6 data      that's  formatted
 12 12 data     that's  formatted
 13 13 data     that's  formatted
 14 s data      that's  formatted
# other data

That looks alright to me, but I'm not the one with 25 million datasets.

You'd also want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file, perhaps like:

head -n 26328 data | ./script.awk - > tempfile
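The 26328 in that command is three full data blocks of 8776 lines each. A small sketch of deriving the count instead of hard-coding it, assuming the block layout described in the question (7 header lines, 8768 data lines, 1 trailing line); the variable names are made up for illustration:

```shell
# Block layout as described in the question.
header=7; data=8768; trailer=1
per_block=$(( header + data + trailer ))   # lines in one dataset
blocks=3                                   # how many datasets to sample
sample=$(( blocks * per_block ))
echo "$sample"

# Then, with the file and script names used in the answer:
#   head -n "$sample" data | ./script.awk - > tempfile
```

This prints 26328. Changing blocks gives a sample that always ends on a dataset boundary, so the test run never cuts a block in half.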

You can learn more about the elements used in this script (gensub, user-defined functions, and $0 reassignment) in the GNU awk manual.

And of course, you should spend some quality time reviewing awk-related questions and answers on Stack Overflow ;)
