haskell - How to parse into records?
I asked a question before and learned from it. I then discovered that my implementation resulted in a list of strings rather than a list of records. The file I am parsing has records that look like this:
sp|p30375|1a01_gorgo class histocompatibility antigen gogo-a*0101 alpha chain os=gorilla gorilla gorilla pe=2 sv=1 mavmaprtlvlllsgalaltqtwagshsmryfstsvsrpgrgeprfiavgyvddtqfvrf dsdaasqrmeprapwieqegpeywdrntrnvkahsqtdrvdlgtlrgyynqsedgshtiq rmygcdvgsdgrflrgyqqdaydgkdyialnedlrswtaadmaaeitkrkweaahfaeql raylegtcvewlrrhlengketlqrtdapkthmthhavsdheailrcwalsfypaeitlt wqrdgedqtqdtelvetrpagdgtfqkwaavvvpsgqeqrytchvqheglpepltlrwep ssqptipivgiiaglvlfgaviagavvaavrwrrkssdrkggsysqaassdsaqgsdvsl tackv sp|p30443|1a01_human hla class histocompatibility antigen a-1 alpha chain os=homo sapiens gn=hla-a pe=1 sv=1 mavmaprtlllllsgalaltqtwagshsmryfftsvsrpgrgeprfiavgyvddtqfvrf dsdaasqkmeprapwieqegpeywdqetrnmkahsqtdranlgtlrgyynqsedgshtiq imygcdvgpdgrflrgyrqdaydgkdyialnedlrswtaadmaaqitkrkweavhaaeqr rvylegrcvdglrrylengketlqrtdppkthmthhpisdheatlrcwalgfypaeitlt wqrdgedqtqdtelvetrpagdgtfqkwaavvvpsgeeqrytchvqheglpkpltlrwel ssqptipivgiiaglvllgavitgavvaavmwrrkssdrkggsytqaassdsaqgsdvsl tackv
Just before each "sp" there is a ">", which I planned to use as the point at which to divide the records. So, how can I end up with:
[[>sp|p30375|1a01_gorgo class histocompatibility antigen gogo-a*0101 alpha chain os=gorilla gorilla gorilla pe=2 sv=1 mavmaprtlvlllsgalaltqtwagshsmryfstsvsrpgrgeprfiavgyvddtqfvrf dsdaasqrmeprapwieqegpeywdrntrnvkahsqtdrvdlgtlrgyynqsedgshtiq rmygcdvgsdgrflrgyqqdaydgkdyialnedlrswtaadmaaeitkrkweaahfaeql raylegtcvewlrrhlengketlqrtdapkthmthhavsdheailrcwalsfypaeitlt wqrdgedqtqdtelvetrpagdgtfqkwaavvvpsgqeqrytchvqheglpepltlrwep ssqptipivgiiaglvlfgaviagavvaavrwrrkssdrkggsysqaassdsaqgsdvsl tackv] [>sp|p30443|1a01_human hla class histocompatibility antigen a-1 alpha chain os=homo sapiens gn=hla-a pe=1 sv=1 mavmaprtlllllsgalaltqtwagshsmryfftsvsrpgrgeprfiavgyvddtqfvrf dsdaasqkmeprapwieqegpeywdqetrnmkahsqtdranlgtlrgyynqsedgshtiq imygcdvgpdgrflrgyrqdaydgkdyialnedlrswtaadmaaqitkrkweavhaaeqr rvylegrcvdglrrylengketlqrtdppkthmthhpisdheatlrcwalgfypaeitlt wqrdgedqtqdtelvetrpagdgtfqkwaavvvpsgeeqrytchvqheglpkpltlrwel ssqptipivgiiaglvllgavitgavvaavmwrrkssdrkggsytqaassdsaqgsdvsl tackv]]
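using Parsec? The code I started out with is in my earlier question: How to parse a UniProt file with Parsec?
As far as I understand the problem, you need to parse records separated by '>'. Each record is a string containing any characters other than '>', so you are looking for something like this: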
    import Control.Applicative ((*>))
    import Text.Parsec
    import Text.Parsec.ByteString (Parser, parseFromFile)

    type Record = String

    -- Parse the whole file, aborting with the parse error on failure.
    parserFile :: FilePath -> IO [Record]
    parserFile filename = do
        r <- parseFromFile parseRecords filename
        case r of
            Left msg -> error . show $ msg
            Right xs -> return xs

    -- One or more records, each a '>' followed by everything up to the next '>'.
    parseRecords :: Parser [Record]
    parseRecords = many1 $ (char '>') *> (many1 $ noneOf ['>'])
The "parseFromFile" function reads the data using an efficient binary representation (a ByteString) and takes as its argument the parser that analyzes the stream of bytes resulting from reading the file.
Now, records begin with a '>' symbol, so we need a parser that matches the '>' symbol at the beginning and stores the rest of the symbols in a list until the next '>' symbol.
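Since "parseFromFile" returns an Either, the caller can also receive the parse error as a value instead of aborting with "error". The following is only a minimal sketch of that variation; the names rawRecords and loadRecords are hypothetical and do not come from the original code.

```haskell
import Text.Parsec (ParseError, char, many1, noneOf)
import Text.Parsec.ByteString (Parser, parseFromFile)

-- One raw record per '>': everything between one '>' and the next.
rawRecords :: Parser [String]
rawRecords = many1 (char '>' *> many1 (noneOf ">"))

-- Hypothetical helper: parse a file and hand the result back as an Either,
-- leaving the decision about how to report a failure to the caller.
loadRecords :: FilePath -> IO (Either ParseError [String])
loadRecords = parseFromFile rawRecords
```

A caller would pattern match on the result, for example printing the ParseError on Left and processing the list of records on Right.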
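If the goal is an actual record rather than one string per entry, the same idea can be taken one step further by splitting each entry into its header line and its sequence. This is a sketch under assumptions, not the original answer's code: the FastaRecord type and its field names are made up for illustration, it assumes the header occupies exactly the first line after '>', and it parses an in-memory String (Text.Parsec.String) to keep the example small.

```haskell
import Data.Char (isSpace)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical record type; the field names are assumptions.
data FastaRecord = FastaRecord
  { header   :: String  -- text after '>' up to the end of that line
  , residues :: String  -- sequence lines joined, whitespace removed
  } deriving (Show, Eq)

-- One record: '>', the header line, then everything up to the next '>'.
fastaRecord :: Parser FastaRecord
fastaRecord = do
  _    <- char '>'
  hdr  <- many (noneOf "\r\n")
  body <- many (noneOf ">")
  return (FastaRecord hdr (filter (not . isSpace) body))

-- One or more records, and nothing left over afterwards.
fastaRecords :: Parser [FastaRecord]
fastaRecords = many1 fastaRecord <* eof
```

Running it with parse fastaRecords "input" contents yields either a ParseError or a list of FastaRecord values. Further splitting of the header (for example on '|') is left out here because the desired fields are not stated in the question.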