python - Efficiently working with a large list of strings while generating dicts
I'm new to Python, and new to working with large amounts of data. I'm working on a fun little project, effectively an upscale of something I've done before in another language.
For now, I'm loading a sizeable (100MB+) text document, breaking it into words, and determining the frequencies of the words that follow each prefix (each prefix being one or more words). It's simple and fun to implement in Python, and I ended up with something along the lines of:
    def prefix(symbols):
        relationships = {}
        for i in reversed(range(len(symbols))):
            # join "samples" consecutive words into the prefix key
            prefix = seperator.join(symbols[i:i + samples])
            suffix = None
            if i + samples < len(symbols):
                suffix = symbols[i + samples]
            if prefix not in relationships:
                relationships[prefix] = {}
            # count how often this suffix follows this prefix
            if suffix not in relationships[prefix]:
                relationships[prefix][suffix] = 1
            else:
                relationships[prefix][suffix] += 1
        return relationships
(The function name, the argument, and the use of the global "samples" are temporary while I work out the kinks.)
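For context, a minimal driver along these lines would exercise the function; the file name, the tokenization, and the values of the globals seperator and samples are my assumptions, not from the post:

    import re

    seperator = " "   # assumed: words in a prefix are joined with a space
    samples = 2       # assumed: each prefix is two words

    with open("wikipedia_dump.txt", encoding="utf-8") as f:
        words = re.findall(r"\S+", f.read())   # naive whitespace tokenization

    relationships = prefix(words)
    print(len(relationships), "distinct prefixes")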
This works well, taking 20 seconds or so to process a 60MB plaintext dump of the top articles from Wikipedia. Stepping the sample size (the "samples" variable) up from 1 to 4 or 5, however, dramatically increases memory usage (as expected -- there are ~10 million words, and for each one, "samples" many words are sliced out and joined into a new string). This quickly approaches and reaches my memory limit of 2 gigabytes.
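As an aside on that memory cost: one way to avoid allocating a new joined string per position (my sketch, not something from the post) is to key the dict on a tuple of the existing word objects. A tuple stores references to strings already in memory rather than copying their characters:

    def prefix_tuple_keys(symbols):
        relationships = {}
        for i in range(len(symbols) - samples):
            key = tuple(symbols[i:i + samples])   # references only, no new string data
            suffix = symbols[i + samples]
            counts = relationships.setdefault(key, {})
            counts[suffix] = counts.get(suffix, 0) + 1
        # (unlike the original, tail prefixes with no suffix are skipped here)
        return relationships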
One method I've applied to alleviate the problem is to delete the initial words from memory as I iterate, since they are no longer needed (the list of words could be constructed as part of the function, so I'm not modifying anything passed in):
    def prefix(symbols):
        relationships = {}
        n = len(symbols)
        for _ in range(n):
            prefix = seperator.join(symbols[0:samples])
            suffix = None
            if samples < len(symbols):
                suffix = symbols[samples]
            if prefix not in relationships:
                relationships[prefix] = {}
            if suffix not in relationships[prefix]:
                relationships[prefix][suffix] = 1
            else:
                relationships[prefix][suffix] += 1
            del symbols[0]   # drop the word we just consumed
        return relationships
This does help, but not by much, and at the cost of some performance.
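Part of that cost is that del symbols[0] shifts every remaining element, so it is O(n) per deletion and O(n^2) over the whole run. A collections.deque pops from the left in O(1); a sketch along those lines (my illustration, reusing the seperator and samples globals):

    from collections import deque
    from itertools import islice

    def prefix_deque(symbols):
        relationships = {}
        window = deque(maxlen=samples + 1)   # sliding window: prefix words + suffix
        for word in symbols:
            window.append(word)              # old words fall off the left automatically
            if len(window) == samples + 1:
                key = seperator.join(islice(window, samples))
                suffix = window[-1]
                counts = relationships.setdefault(key, {})
                counts[suffix] = counts.get(suffix, 0) + 1
        return relationships

Since this only ever holds samples + 1 words at a time, symbols can be any iterable; if the words are streamed from the file rather than read into a list, the full word list never needs to be in memory at once.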
So what I'm asking is whether this kind of approach is efficient or recommended, and if not, what would be more suitable? I may be missing some method of avoiding redundantly creating strings and copying lists, since most of this is new to me. I'm considering:
- Chunking the symbols/words list, processing each chunk, dumping its relationships to disk, and combining them after the fact (a rough sketch of this follows the list)
- Working with Redis, as opposed to keeping the relationships within Python memory the whole time
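For the first idea, the merge step is the only non-obvious part; a rough sketch under my own assumptions (the chunked helper, the chunk size, and the file naming are hypothetical, and words and prefix() are as above):

    import pickle

    def chunked(seq, size):
        # hypothetical helper: split the word list into fixed-size pieces
        for start in range(0, len(seq), size):
            yield seq[start:start + size]

    def merge_into(total, part):
        # fold one chunk's {prefix: {suffix: count}} dict into the running total
        for pfx, counts in part.items():
            bucket = total.setdefault(pfx, {})
            for sfx, n in counts.items():
                bucket[sfx] = bucket.get(sfx, 0) + n

    paths = []
    for k, chunk in enumerate(chunked(words, 1000000)):
        path = "relations_%d.pkl" % k
        with open(path, "wb") as f:
            pickle.dump(prefix(chunk), f)   # write each partial result to disk
        paths.append(path)

    relationships = {}
    for path in paths:
        with open(path, "rb") as f:
            merge_into(relationships, pickle.load(f))

Note that prefixes spanning a chunk boundary are lost in this sketch; overlapping the chunks by samples words (and discarding the duplicated tail entries) would recover them at the cost of some bookkeeping.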
Cheers for any advice or assistance!
The main advice when working with large strings in Python, in cases where you need to make a lot of changes, is:
- convert the string to a list
- do the work on the list
- when finished, convert the list back to a string
The reason is that strings in Python are immutable. Every operation like symbols[i:i+samples] followed by a join, for example, forces Python to allocate new memory, copy the needed data, and return a new string object. Because of that, when you need to make many changes to a string (changing it by index, splitting it), it is better to work with lists, which are mutable.
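A minimal illustration of that pattern (my example; words stands for any list of tokens):

    words = ["some", "large", "list", "of", "words"]

    # Slow in general: each += allocates a new string and copies
    # everything built so far.
    text = ""
    for w in words:
        text += w + " "

    # Better: mutate a list, then convert back to a string once at the end.
    parts = []
    for w in words:
        parts.append(w)
    text = " ".join(parts)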