Persistent Memoization In Python
Solution 1:
sqlite3 provides ACID guarantees out of the box. File locking is prone to race conditions and concurrency problems that you simply won't have using sqlite3.
Basically, yeah, sqlite3 is more than what you need, but it's not a huge burden. It can run on mobile phones, so it's not like you're committing to running some beastly software. It's going to save you time reinventing wheels and debugging locking issues.
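For illustration, here is a minimal, untested sketch (assuming Python 3) of what a sqlite3-backed memo could look like; the table name, the use of pickle to serialize keys and values, and the decorator wrapper are my own choices rather than anything prescribed by this answer:
import pickle
import sqlite3

# one table mapping a pickled argument tuple to a pickled result
conn = sqlite3.connect("memo_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS memo (key BLOB PRIMARY KEY, value BLOB)")
conn.commit()

def memoized(func):
    def wrapper(*args):
        key = pickle.dumps(args, pickle.HIGHEST_PROTOCOL)
        row = conn.execute("SELECT value FROM memo WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return pickle.loads(row[0])
        result = func(*args)
        conn.execute("INSERT OR REPLACE INTO memo (key, value) VALUES (?, ?)",
                     (key, pickle.dumps(result, pickle.HIGHEST_PROTOCOL)))
        conn.commit()
        return result
    return wrapper
sqlite3 handles the on-disk locking for you here; if you call this from several threads or processes, each one would open its own connection, since by default a connection object can't be shared across threads.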
Solution 2:
I assume you want to continue to memoize the results of the function in RAM, probably in a dictionary, but use the persistence to reduce the "warmup" time of the application. In this case you're not going to be randomly accessing items directly in the backing store so a database might indeed be overkill (though as synthesizerpatel notes, maybe not as much as you think).
Still, if you want to roll your own, a viable strategy might be to simply load the dictionary from a file at the beginning of your run, before starting any threads. When a result isn't in the dictionary, you add it to the dictionary and also write it to the file. You can do this by adding it to a queue and using a single worker thread that flushes items from the queue to disk (just appending them to a single file would be fine). You might occasionally add the same result more than once, but this is not fatal: it'll be the same result each time, so reading it back in twice or more will do no real harm. Python's threading model will keep you out of most kinds of concurrency trouble (e.g., appending to a list is atomic).
Here is some (untested, generic, incomplete) code showing what I'm talking about:
try:
    import cPickle as pickle   # Python 2
except ImportError:
    import pickle              # Python 3
import time, os.path

cache = {}
queue = []

# run at script start to warm up cache
def preload_cache(filename):
    if os.path.isfile(filename):
        with open(filename, "rb") as f:
            while True:
                try:
                    key, value = pickle.load(f), pickle.load(f)
                except EOFError:
                    break
                cache[key] = value

# your memoized function
def time_consuming_function(a, b, c, d):
    key = (a, b, c, d)
    if key in cache:
        return cache[key]
    else:
        # generate the result here
        # ...
        # add to cache, checking to see if it's already there again to avoid writing
        # it twice (in case another thread also added it) (this is not fatal, though)
        if key not in cache:
            cache[key] = result
            queue.append((key, result))
        return result

# run on worker thread to write new items out
def write_cache(filename):
    with open(filename, "ab") as f:
        while True:
            while queue:
                key, value = queue.pop()  # item order not important
                # but must write key and value in a single call to ensure
                # both get written (otherwise, interrupting the script might
                # leave only one written, corrupting the file)
                f.write(pickle.dumps(key, pickle.HIGHEST_PROTOCOL) +
                        pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
            f.flush()
            time.sleep(1)
If I had time, I'd turn this into a decorator... and put the persistence into a dict subclass... the use of global variables is also sub-optimal. :-) If you use this approach with multiprocessing, you'd probably want to use a multiprocessing.Queue rather than a list; you can then use queue.get() as a blocking wait for a new result in the worker process that writes to the file. I've not used multiprocessing, though, so take this bit of advice with a grain of salt.
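For completeness, an untested sketch of that multiprocessing variant might look something like the following; the cache_writer function, the None sentinel, and the "cache.pkl" filename are my own placeholders, in the same spirit as the code above:
import multiprocessing
import pickle

def cache_writer(filename, q):
    # worker process: block on the queue and append each (key, result) pair to the file
    with open(filename, "ab") as f:
        while True:
            item = q.get()          # blocks until a new result arrives
            if item is None:        # sentinel telling the writer to shut down
                break
            key, value = item
            f.write(pickle.dumps(key, pickle.HIGHEST_PROTOCOL) +
                    pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
            f.flush()

if __name__ == "__main__":
    q = multiprocessing.Queue()
    writer = multiprocessing.Process(target=cache_writer, args=("cache.pkl", q))
    writer.start()
    # ... in time_consuming_function, you'd replace queue.append((key, result))
    # with q.put((key, result)) ...
    q.put(None)   # tell the writer to finish
    writer.join()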