python - Efficiently working with a large list of strings while generating dicts


I'm new to Python, and new to working with large amounts of data. I'm working on a little fun project, effectively an upscale of something I've done before in another language.

For now, I'm loading a sizeable (100 MB+) text document, breaking it into words, and determining the frequencies of the words that follow each prefix (each prefix being 1 or more words). It's simple and fun to implement in Python, and I ended up with something along the lines of:

def prefix(symbols):
    # count how often each suffix word follows each joined prefix
    relationships = {}

    for i in reversed(range(len(symbols))):
        prefix = seperator.join(symbols[i:i+samples])
        suffix = None

        if i + samples < len(symbols):
            suffix = symbols[i + samples]

        if prefix not in relationships:
            relationships[prefix] = {}

        if suffix not in relationships[prefix]:
            relationships[prefix][suffix] = 1
        else:
            relationships[prefix][suffix] += 1

    return relationships
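For example, with samples = 2 and seperator = " ", a tiny input behaves like this (just a sanity check I've been using, not the real data):

samples = 2
seperator = " "

words = "the quick brown fox jumps over the lazy dog".split()
relationships = prefix(words)

# A two-word prefix maps to the words seen after it, with counts:
# relationships["the quick"] == {"brown": 1}
# The final prefix has nothing after it, so its suffix is None:
# relationships["lazy dog"] == {None: 1}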

(The function name, argument, and use of the global "samples" are temporary while I work out the kinks.)

This works well so far, taking 20 seconds or so to process a 60 MB plaintext dump of the top articles from Wikipedia. Stepping the sample size (the "samples" variable) up from 1 to 4 or 5, however, greatly increases memory usage (as expected -- there are ~10 million words, and for each one, "samples"-many words are sliced and joined into a new string). This approaches and then reaches the memory limit of 2 gigabytes.
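As a rough back-of-envelope of where the memory goes (my own guesswork, assuming 64-bit CPython; exact sizes vary by version):

import sys

# Every joined prefix is a brand-new string object with its own overhead.
words = ["the", "quick", "brown", "fox", "jumps"]
one_prefix = " ".join(words)          # what a samples=5 prefix looks like
print(sys.getsizeof(one_prefix))      # several dozen bytes for one short prefix
# Around 10 million of these, plus a nested dict per unique prefix,
# can easily add up to the better part of 2 GB.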

One method I've applied to alleviate the problem is to delete the initial words from memory as I iterate, since they are no longer needed (the list of words is constructed as part of the function, so I'm not modifying anything that was passed in).

def prefix(symbols):
    relationships = {}

    n = len(symbols)
    for i in range(n):
        prefix = seperator.join(symbols[0:samples])

        suffix = None
        if samples < len(symbols):
            suffix = symbols[samples]

        if prefix not in relationships:
            relationships[prefix] = {}

        if suffix not in relationships[prefix]:
            relationships[prefix][suffix] = 1
        else:
            relationships[prefix][suffix] += 1

        # discard the word we just consumed so it can be garbage collected
        del symbols[0]
    return relationships

This does help, but not by much, and at the cost of performance.

So I'm asking whether this kind of approach is efficient or recommended, and if not, what would be more suitable? I may be missing some method of avoiding redundantly creating strings and copying lists, seeing as all of this is new to me. I'm considering:

  • Chunking the symbols/words list, processing each chunk, dumping the relationships to disk and combining them after the fact (roughly as in the sketch after this list)
  • Working with Redis as opposed to keeping the relationships within Python the whole time at all
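Here is the rough shape of what I mean by the first option (only a sketch building on the prefix() function above; the names process_in_chunks, combine and chunk_size are placeholders, and prefixes that straddle a chunk boundary simply get truncated):

import pickle

def process_in_chunks(symbols, chunk_size=1000000):
    # Run prefix() on slices of the word list and dump each partial
    # result to disk instead of holding everything in memory at once.
    paths = []
    for n, start in enumerate(range(0, len(symbols), chunk_size)):
        part = prefix(symbols[start:start + chunk_size])
        path = "relations_%d.pkl" % n
        with open(path, "wb") as f:
            pickle.dump(part, f)
        paths.append(path)
    return paths

def combine(paths):
    # Merge the partial dicts back together by summing suffix counts.
    combined = {}
    for path in paths:
        with open(path, "rb") as f:
            for pre, suffixes in pickle.load(f).items():
                bucket = combined.setdefault(pre, {})
                for suf, count in suffixes.items():
                    bucket[suf] = bucket.get(suf, 0) + count
    return combined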

Cheers for any advice or assistance!

The main advice when working with large strings in Python, in a case where you need to make a lot of changes, is:

  1. Convert the string to a list
  2. Do the work on the list
  3. When finished, convert it back to a string

The reason is that strings in Python are immutable. Every action like symbols[i:i+samples], for example, forces Python to allocate new memory, copy the needed data, and return a new string object. Because of that, when you need to make many changes to a string (changing by index, splitting), it is better to work with lists, which are mutable.
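A minimal illustration of that pattern (my own example, not taken from your code):

text = "the quick brown fox"

# 1. Convert the string to a list (split into words here; list(text) would give characters).
words = text.split(" ")

# 2. Do the work on the mutable list -- no new string is built for each change.
words[1] = "slow"
words.append("jumps")

# 3. When finished, convert back to a single string.
result = " ".join(words)    # "the slow brown fox jumps"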

