python - Efficiently working with a large list of strings while generating dicts


I'm new to Python and new to working with large amounts of data. I'm working on a little fun project, effectively an upscale of something I've done before in another language.

For now, I'm loading a sizeable (100 MB+) text document, breaking it into words, and determining the frequencies of the words that follow each prefix (each prefix being one or more words). It's simple and fun to implement in Python, and I ended up with something along the lines of:

def prefix(symbols):
    # Map each prefix string to a dict of {following word: count}.
    relationships = {}
    for i in reversed(range(len(symbols))):
        prefix = seperator.join(symbols[i:i+samples])
        suffix = None
        if i+samples < len(symbols):
            suffix = symbols[i+samples]
        if prefix not in relationships:
            relationships[prefix] = {}
        if suffix not in relationships[prefix]:
            relationships[prefix][suffix] = 1
        else:
            relationships[prefix][suffix] += 1
    return relationships

(The function name, argument, and use of the global "samples" are temporary while I work out the kinks.)
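For context, this is roughly how I'm driving it at the moment; the file name, the whitespace tokenisation and the values of the globals below are placeholders rather than my actual setup:

seperator = " "   # global separator used by prefix() when joining words
samples = 3       # global prefix length, in words

with open("dump.txt", encoding="utf-8") as f:
    words = f.read().split()   # naive whitespace split into a word list

relationships = prefix(words)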

This works well, taking 20 seconds or so to process a 60 MB plaintext dump of the top articles from Wikipedia. Stepping the sample size (the "samples" variable) up from 1 to 4 or 5, however, greatly increases memory usage (as expected: there are ~10 million words and, for each, "samples" many words are sliced and joined into a new string). This approaches and then reaches the memory limit of 2 gigabytes.

One method I've applied to alleviate the problem is to delete the initial words from memory as I iterate, since they are no longer needed (the list of words is constructed as part of the function, so I'm not modifying anything that was passed in).

def prefix(symbols):
    relationships = {}
    n = len(symbols)
    for i in range(n):
        # Always read from the front of the list; the consumed word is
        # deleted at the end of each iteration so it can be freed.
        prefix = seperator.join(symbols[0:samples])
        suffix = None
        if samples < len(symbols):
            suffix = symbols[samples]
        if prefix not in relationships:
            relationships[prefix] = {}
        if suffix not in relationships[prefix]:
            relationships[prefix][suffix] = 1
        else:
            relationships[prefix][suffix] += 1
        del symbols[0]
    return relationships

This does help, but not much, and at the cost of performance.
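For illustration, the same idea using collections.deque would look roughly like this; popleft() is O(1), whereas del symbols[0] shifts every remaining list element. This is only a sketch, not something I've profiled:

from collections import deque
from itertools import islice

def prefix_deque(symbols):
    # Same counting logic as above, but consuming words from a deque so that
    # removing the word at the front is O(1) rather than O(n).
    relationships = {}
    queue = deque(symbols)
    while queue:
        window = list(islice(queue, 0, samples))
        key = seperator.join(window)
        suffix = queue[samples] if len(queue) > samples else None
        counts = relationships.setdefault(key, {})
        counts[suffix] = counts.get(suffix, 0) + 1
        queue.popleft()
    return relationships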

So I'm asking whether this kind of approach is efficient or recommended, and if not, what would be more suitable? I may be missing some method of avoiding redundantly creating strings and copying lists, seeing as all of this is new to me. I'm considering:

  • Chunking the symbols/words list, processing it, dumping the relationships to disk, and combining them after the fact (see the sketch after this list)
  • Working with Redis as opposed to keeping the relationships within Python the whole time
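By chunking I mean something roughly along these lines; it's only a sketch of the idea (the chunk size is arbitrary, and the "dump to disk, combine afterwards" step is reduced to an in-memory merge here), not code I've actually run:

def count_chunk(chunk, limit, relationships):
    # Count positions 0..limit-1 of this chunk; anything past `limit` is only
    # lookahead context, so prefixes/suffixes at the boundary stay complete.
    for i in range(limit):
        key = seperator.join(chunk[i:i + samples])
        suffix = chunk[i + samples] if i + samples < len(chunk) else None
        counts = relationships.setdefault(key, {})
        counts[suffix] = counts.get(suffix, 0) + 1

def prefix_chunked(symbols, chunk_size=1000000):
    # Produces the same counts as prefix(), but processes the word list one
    # slice at a time; each slice carries `samples` extra words of lookahead.
    relationships = {}
    n = len(symbols)
    for start in range(0, n, chunk_size):
        chunk = symbols[start:start + chunk_size + samples]
        count_chunk(chunk, min(chunk_size, n - start), relationships)
    return relationships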

Cheers for any advice or assistance!

The main advice when working with large strings in Python, in case you need to make a lot of changes, is:

  1. Convert the string to a list
  2. Do all the work on the list
  3. When finished, convert it back to a string

The reason is that strings in Python are immutable. Every action like symbols[i:i+samples], for example, forces Python to allocate new memory, copy the needed part of the string, and return a new string object. Because of that, when you need to make many changes to a string (changing by index, splitting), it is better to work with lists, which are mutable.
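A tiny illustration of the difference (the strings here are just example values):

text = "the quick brown fox"

# Strings are immutable: every slice allocates and copies a new object.
piece = text[4:9]        # a brand new string "quick"
# text[0] = "T"          # TypeError: 'str' object does not support item assignment

# Lists are mutable: changes happen in place, no copy of the whole sequence.
words = text.split()     # convert the string to a list once
words[0] = "The"         # modify in place
result = " ".join(words) # convert back to a string when finished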

