Searching A List Of Words From A Large File In Python
Solution 1:
You can use intersection from set theory to check whether the list of words and the words from a line have anything in common.
list_of_words = []
sett = set(list_of_words)
with open(inputfile) as f1, open(outputfile, 'w') as f2:
    for line in f1:
        if len(set(line.split()).intersection(sett)) >= 1:
            pass
        else:
            f2.write(line)
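The pass/else can also be collapsed by negating the test, which reads a bit more directly (a small rewrite of the same logic, not part of the original answer):
with open(inputfile) as f1, open(outputfile, 'w') as f2:
    for line in f1:
        # keep the line only if it shares no words with the set
        if not set(line.split()).intersection(sett):
            f2.write(line)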
Solution 2:
If the source file contains only words separated by whitespace, you can use sets:
words = set(your_words_list)
for line in infile:
    if words.isdisjoint(line.split()):
        outfile.write(line)
Note that this doesn't handle punctuation, e.g. given words = ['foo', 'bar'], a line like foo, bar,stuff won't be removed. To handle this, you need regular expressions:
import re

rr = r'\b(%s)\b' % '|'.join(your_words_list)
for line in infile:
    if not re.search(rr, line):
        outfile.write(line)
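One caveat worth noting (my addition, not from the original answer): if any of the words contain regex metacharacters such as . or +, they should be escaped, and compiling the pattern once avoids rebuilding it for every line:
import re

# escape each word so metacharacters are matched literally,
# and compile the alternation once up front
rr = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, your_words_list)))
for line in infile:
    if not rr.search(line):
        outfile.write(line)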
Solution 3:
The lines and words in the big file would need to be sorted somehow for you to implement binary search. It does not seem like they are, so the best you can do is a linear search, checking whether each word in the list occurs in a given line.
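A minimal sketch of that linear scan, assuming the word list and file names used in the other answers:
with open(inputfile) as infile, open(outputfile, 'w') as outfile:
    for line in infile:
        # linear search: test every listed word against the line's words
        if not any(word in line.split() for word in list_of_words):
            outfile.write(line)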
Solution 4:
import re

contents = file.read()
words = sorted(the_list, key=len, reverse=True)
stripped_contents = re.sub(r'^.*(%s).*\n' % '|'.join(words), '', contents, flags=re.MULTILINE)
something like that should work... not sure if it will be faster than going through line by line
[edit] this is untested code and may need some slight tweaks
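A self-contained version of the same idea, with placeholder file and variable names of my choosing, plus re.escape in case a word contains regex metacharacters:
import re

the_list = ["foo", "bar"]

with open("input.txt") as f:
    contents = f.read()

# longest words first, so longer alternatives win inside the alternation
words = sorted((re.escape(w) for w in the_list), key=len, reverse=True)
pattern = re.compile(r'^.*(%s).*\n' % '|'.join(words), re.MULTILINE)

with open("output.txt", "w") as f:
    f.write(pattern.sub('', contents))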
Solution 5:
You cannot delete the lines in place; you need to write to a second file. You may overwrite the old one afterwards (see shutil.copy for this).
The rest reads like pseudo-code:
forbidden_words = set("these words shall not occur".split())

with open(inputfile) as infile, open(outputfile, 'w+') as outfile:
    outfile.writelines(line for line in infile
                       if not any(word in forbidden_words for word in line.split()))
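To overwrite the original file afterwards, as suggested above, shutil can copy the filtered output back over the input (a minimal sketch using the same file names):
import shutil

# replace the original with the filtered copy
shutil.copy(outputfile, inputfile)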
See this question for approaches to getting rid of punctuation-induced false negatives.