How to import csv data where some observations are on two rows -
i have dataset couple million rows. in csv format. wish import stata. can this, there problem - small percentage (but still many) of observations appear on 2 lines in csv file. of entries occur on 1 line. troublesome observations take 2 lines still follow same pattern far being delimited commas. in stata dataset, observation shows on 2 rows, both rows containing part of total data.
i used import delimited import data. there can done @ data import stage of process in stata? prefer not have deal in original csv file if possible.
***update
here example of csv file looks like:
var1,var2,var3,var4,var5 text 1, text 2,text 3 ,text 4,text 5 text 6,text 7,text 8,text9,text10 text 11,text 1 2,text 13,text14,text15 text16,text17,text18,text19,text20 notice there no comma @ end of line. notice problem observation begins text 11.
this how shows in stata:
var1 var2 var3 var4 var5 1 text 1 text 2 text 3 text 4 text 5 2 text 6 text 7 text 8 text9 text10 3 text 11 text 1 4 2 text 13 text14 text15 5 text16 text17 text18 text19 text20 that number right next text isn't mistake - illustrate data more complex shown here.
of course, how need data:
var1 var2 var3 var4 var5 1 text 1 text 2 text 3 text 4 text 5 2 text 6 text 7 text 8 text9 text10 3 text 11 text 12 text 13 text14 text15 4 text16 text17 text18 text19 text20
a convoluted way (comments inline):
clear set more off *----- example data ----- // change delimiter, if necessary insheet using "~/desktop/stata_tests/test.csv", names delim(;) list *----- want ----- // compute number of commas gen numcom = length(var1var2var3var4var5) /// - length(subinstr(var1var2var3var4var5, ",", "", .)) // save data tempfile orig save "`orig'" // keep observations fine drop if numcom != 4 // save fine data tempfile origfine save "`origfine'" *----- // load data use "`orig'", clear // keep offending observations drop if numcom == 4 // -reshape- gen = int((_n-1)/2) +1 bysort : gen j = _n // check pairs add 4 commas : egen check = total(numcom) assert check == 4 // no longer necessary drop numcom check // reshape wide reshape wide var1var2var3var4var5, i(i) j(j) // gen definitive variable gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52 keep var1var2var3var4var5 // append new observations original ones append using "`origfine'" // split split var1var2var3var4var5, parse(,) gen(var) // we're "done" drop var1var2var3var4var5 numcom list but don't have details of data, may or may not work. it's meant rough draft. depending on memory space occupied data, , other details, may need improve parts of code made more efficient.
note: file test.csv looks like
var1,var2,var3,var4,var5 text 1, text 2,text 3 ,text 4,text 5 text 6,text 7,text 8,text9,text10 text 11,text 1 2,text 13,text14,text15 text16,text17,text18,text19,text20 note 2: i'm using insheet because don't have stata 13 @ moment. import delimited way go if available.
note 3: details on how counting of commas works can reviewed @ stata tip 98: counting substrings within strings, nick cox.
Comments
Post a Comment