Python - NLTK, n-grams and concordance: multiple words
I am working on newspaper archives, focusing on banking-related coverage. The problem is that names such as Bank of America Merrill Lynch, Morgan Stanley, and JP Morgan are reported differently in different countries: BankAm, BofA, BAML, or MS, JPM, J.P. Morgan, JP. Morgan. I am using a regexp tokenizer for pre-processing. How can I build some kind of equivalence/lookup table? Citigroup has the same issue: in news reporting it appears as Citibank, Citi, Citi Group, or Citi Bank. Any help is appreciated.

@jksnw: A dictionary maps one word to many. In my case, I need to map many variations to one 'proper noun'. That means I need to read {Bank of America Merrill Lynch} as one NNP and, on the flip side, read {MS} as an NNP, in the proper context, as equivalent to Morgan Stanley.
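One possible way to get a many-to-one table is to invert a canonical-name dictionary and rewrite every variant to a single underscore-joined token before tokenizing. The sketch below uses only the standard `re` module. The names `ALIASES`, `AMBIGUOUS`, and `normalize` are hypothetical, the alias lists just repeat the variants mentioned above, and the "proper context" rule (trust `MS` only if the same text also spells out Morgan Stanley) is an assumed heuristic, not an established method.

```python
import re

# Hypothetical alias table: canonical token -> surface variants (lowercase).
ALIASES = {
    "Bank_of_America_Merrill_Lynch": [
        "bank of america merrill lynch", "bankam", "bofa", "baml"],
    "Morgan_Stanley": ["morgan stanley"],
    "JP_Morgan": ["jp morgan", "j.p. morgan", "jp. morgan", "jpm"],
    "Citigroup": ["citigroup", "citi group", "citibank", "citi bank", "citi"],
}

# Assumption: short forms that are only trusted when the same text
# also contains an unambiguous variant of the same bank.
AMBIGUOUS = {"Morgan_Stanley": ["ms"]}

# Inverted table: many variants -> one canonical name.
LOOKUP = {v: canon for canon, variants in ALIASES.items() for v in variants}


def _compile(variants):
    """Build one case-insensitive pattern matching any of the variants."""
    # Longest first, so "citi group" is tried before "citi".
    ordered = sorted(variants, key=len, reverse=True)
    parts = [r"\s+".join(re.escape(word) for word in v.split())
             for v in ordered]
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)",
                      re.IGNORECASE)


PATTERN = _compile(LOOKUP)


def normalize(text):
    """Replace known bank-name variants with their canonical token."""
    found = set()

    def repl(match):
        key = " ".join(match.group(0).lower().split())
        canon = LOOKUP[key]
        found.add(canon)
        return canon

    text = PATTERN.sub(repl, text)
    # Second pass: expand ambiguous short forms only with supporting context.
    for canon, shorts in AMBIGUOUS.items():
        if canon in found:
            text = _compile(shorts).sub(canon, text)
    return text


print(normalize("BofA and J.P. Morgan cut rates; MS followed Morgan Stanley."))
print(normalize("Ms Smith joined Citi."))
```

Because each canonical name is a single `\w+`-style token, a regexp tokenizer should keep it in one piece, so it can be tagged as one NNP or looked up in a concordance. NLTK's `MWETokenizer` is another option for merging multi-word names after tokenization, though it does not by itself map variants to one name.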