python - Efficiently working with a large list of strings while generating dicts


I'm new to Python, and new to working with large amounts of data. I'm working on a little fun project, effectively an upscale of something I've done before in another language.

For now, I'm loading a sizeable (100 MB+) text document, breaking it into words, and determining the frequencies of the words that follow each prefix (each prefix being 1 or more words). It's simple and fun to implement in Python, and I ended up with something along the lines of:

def prefix(symbols):
    # count how often each suffix word follows each joined prefix
    relationships = {}

    for i in reversed(range(len(symbols))):
        prefix = seperator.join(symbols[i:i+samples])
        suffix = None

        if i + samples < len(symbols):
            suffix = symbols[i + samples]

        if prefix not in relationships:
            relationships[prefix] = {}

        if suffix not in relationships[prefix]:
            relationships[prefix][suffix] = 1
        else:
            relationships[prefix][suffix] += 1

    return relationships
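For example, with samples = 2 and seperator = " ", a tiny input behaves like this (just a sanity check I've been using, not the real data):

samples = 2
seperator = " "

words = "the quick brown fox jumps over the lazy dog".split()
relationships = prefix(words)

# A two-word prefix maps to the words seen after it, with counts:
# relationships["the quick"] == {"brown": 1}
# The final prefix has nothing after it, so its suffix is None:
# relationships["lazy dog"] == {None: 1}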

(The function name, argument, and use of the global "samples" are temporary while I work out the kinks.)

This works well so far, taking 20 seconds or so to process a 60 MB plaintext dump of the top articles from Wikipedia. Stepping the sample size (the "samples" variable) up from 1 to 4 or 5, however, greatly increases memory usage (as expected -- there are ~10 million words, and for each one, "samples"-many words are sliced and joined into a new string). This approaches and then reaches the memory limit of 2 gigabytes.
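As a rough back-of-envelope of where the memory goes (my own guesswork, assuming 64-bit CPython; exact sizes vary by version):

import sys

# Every joined prefix is a brand-new string object with its own overhead.
words = ["the", "quick", "brown", "fox", "jumps"]
one_prefix = " ".join(words)          # what a samples=5 prefix looks like
print(sys.getsizeof(one_prefix))      # several dozen bytes for one short prefix
# Around 10 million of these, plus a nested dict per unique prefix,
# can easily add up to the better part of 2 GB.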

One method I've applied to alleviate the problem is to delete the initial words from memory as I iterate, since they are no longer needed (the list of words is constructed as part of the function, so I'm not modifying anything that was passed in).

def prefix(symbols):
    relationships = {}

    n = len(symbols)
    for i in range(n):
        prefix = seperator.join(symbols[0:samples])

        suffix = None
        if samples < len(symbols):
            suffix = symbols[samples]

        if prefix not in relationships:
            relationships[prefix] = {}

        if suffix not in relationships[prefix]:
            relationships[prefix][suffix] = 1
        else:
            relationships[prefix][suffix] += 1

        # discard the word we just consumed so it can be garbage collected
        del symbols[0]
    return relationships

This does help, but not by much, and at the cost of performance.

So I'm asking whether this kind of approach is efficient or recommended, and if not, what would be more suitable? I may be missing some method of avoiding redundantly creating strings and copying lists, seeing as all of this is new to me. I'm considering:

  • Chunking the symbols/words list, processing each chunk, dumping the relationships to disk and combining them after the fact (roughly as in the sketch after this list)
  • Working with Redis as opposed to keeping the relationships within Python the whole time at all
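Here is the rough shape of what I mean by the first option (only a sketch building on the prefix() function above; the names process_in_chunks, combine and chunk_size are placeholders, and prefixes that straddle a chunk boundary simply get truncated):

import pickle

def process_in_chunks(symbols, chunk_size=1000000):
    # Run prefix() on slices of the word list and dump each partial
    # result to disk instead of holding everything in memory at once.
    paths = []
    for n, start in enumerate(range(0, len(symbols), chunk_size)):
        part = prefix(symbols[start:start + chunk_size])
        path = "relations_%d.pkl" % n
        with open(path, "wb") as f:
            pickle.dump(part, f)
        paths.append(path)
    return paths

def combine(paths):
    # Merge the partial dicts back together by summing suffix counts.
    combined = {}
    for path in paths:
        with open(path, "rb") as f:
            for pre, suffixes in pickle.load(f).items():
                bucket = combined.setdefault(pre, {})
                for suf, count in suffixes.items():
                    bucket[suf] = bucket.get(suf, 0) + count
    return combined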

Cheers for any advice or assistance!

The main advice when working with large strings in Python, in a case where you need to make a lot of changes, is:

  1. Convert the string to a list
  2. Do the work on the list
  3. When finished, convert it back to a string

The reason is that strings in Python are immutable. Every action like symbols[i:i+samples], for example, forces Python to allocate new memory, copy the needed data, and return a new string object. Because of that, when you need to make many changes to a string (changing by index, splitting), it is better to work with lists, which are mutable.
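A minimal illustration of that pattern (my own example, not taken from your code):

text = "the quick brown fox"

# 1. Convert the string to a list (split into words here; list(text) would give characters).
words = text.split(" ")

# 2. Do the work on the mutable list -- no new string is built for each change.
words[1] = "slow"
words.append("jumps")

# 3. When finished, convert back to a single string.
result = " ".join(words)    # "the slow brown fox jumps"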

