How to import csv data where some observations are on two rows -


i have dataset couple million rows. in csv format. wish import stata. can this, there problem - small percentage (but still many) of observations appear on 2 lines in csv file. of entries occur on 1 line. troublesome observations take 2 lines still follow same pattern far being delimited commas. in stata dataset, observation shows on 2 rows, both rows containing part of total data.

i used import delimited import data. there can done @ data import stage of process in stata? prefer not have deal in original csv file if possible.

***update

here example of csv file looks like:

var1,var2,var3,var4,var5  text 1,    text 2,text 3   ,text 4,text 5 text 6,text 7,text 8,text9,text10 text 11,text 1               2,text 13,text14,text15 text16,text17,text18,text19,text20  

notice there no comma @ end of line. notice problem observation begins text 11.

this how shows in stata:

    var1     var2     var3     var4     var5  1   text 1   text 2   text 3   text 4   text 5 2   text 6   text 7   text 8   text9    text10 3   text 11  text 1 4   2        text 13  text14  text15 5   text16   text17   text18   text19   text20 

that number right next text isn't mistake - illustrate data more complex shown here.

of course, how need data:

    var1     var2     var3     var4     var5  1   text 1   text 2   text 3   text 4   text 5 2   text 6   text 7   text 8   text9    text10 3   text 11  text 12  text 13  text14   text15 4   text16   text17   text18   text19   text20 

a convoluted way (comments inline):

clear set more off  *----- example data -----  // change delimiter, if necessary insheet using "~/desktop/stata_tests/test.csv", names delim(;)  list  *----- want -----  // compute number of commas gen numcom = length(var1var2var3var4var5) ///     - length(subinstr(var1var2var3var4var5, ",", "", .))  // save data tempfile orig save "`orig'"  // keep observations fine drop if numcom != 4  // save fine data tempfile origfine save "`origfine'"  *-----  // load data use "`orig'", clear  // keep offending observations drop if numcom == 4  // -reshape- gen = int((_n-1)/2) +1 bysort : gen j = _n  // check pairs add 4 commas : egen check = total(numcom) assert check == 4  // no longer necessary drop numcom check  // reshape wide reshape wide var1var2var3var4var5, i(i) j(j)  // gen definitive variable gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52 keep var1var2var3var4var5  // append new observations original ones append using "`origfine'"  // split split var1var2var3var4var5, parse(,) gen(var)  // we're "done" drop var1var2var3var4var5 numcom list 

but don't have details of data, may or may not work. it's meant rough draft. depending on memory space occupied data, , other details, may need improve parts of code made more efficient.

note: file test.csv looks like

var1,var2,var3,var4,var5  text 1,    text 2,text 3   ,text 4,text 5 text 6,text 7,text 8,text9,text10 text 11,text 1               2,text 13,text14,text15 text16,text17,text18,text19,text20 

note 2: i'm using insheet because don't have stata 13 @ moment. import delimited way go if available.

note 3: details on how counting of commas works can reviewed @ stata tip 98: counting substrings within strings, nick cox.


Comments

Popular posts from this blog

cakephp - simple blog with croogo -

How to group boxplot outliers in gnuplot -

bash - Performing variable substitution in a string -