Searching A List Of Words From A Large File In Python

I am new to Python. I have a list of words and a very large file. I would like to delete the lines in the file that contain a word from the list of words. The list of words is given

Solution 1:

You can use set intersection to check whether the list of words and the words in a line have anything in common.

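list_of_words = []  # fill in the words to filter out
sett = set(list_of_words)
with open(inputfile) as f1, open(outputfile, 'w') as f2:
    for line in f1:
        # Skip the line if it shares at least one word with the set.
        if len(set(line.split()).intersection(sett)) >= 1:
            pass
        else:
            f2.write(line)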

Solution 2:

If the source file contains only words separated by whitespace, you can use sets:

words = set(your_words_list)
for line in infile:
    if words.isdisjoint(line.split()):
        outfile.write(line)

Note that this doesn't handle punctuation, e.g. given words = ['foo', 'bar'] a line like foo, bar,stuff won't be removed. To handle this, you need regular expressions:

import re

# \b anchors each alternative at word boundaries, so "foo," still matches foo.
rr = r'\b(%s)\b' % '|'.join(your_words_list)
for line in infile:
    if not re.search(rr, line):
        outfile.write(line)
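
As a hedged variant of the regex approach above: if the words may contain regex metacharacters such as `.` or `+`, escaping them first avoids surprises, and compiling the pattern once saves work on a large file. The helper names `build_word_pattern` and `keep_line` are made up for this sketch, and the sample lines are invented.

```python
import re

def build_word_pattern(words):
    """Compile one regex that matches any of the words as a whole word."""
    # re.escape keeps characters like '.' or '+' from being read as regex syntax.
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:%s)\b" % alternation)

def keep_line(pattern, line):
    """Return True if the line contains none of the words."""
    return pattern.search(line) is None

pattern = build_word_pattern(["foo", "bar"])
sample = ["foo, bar,stuff\n", "nothing here\n", "foobar is fine\n"]
kept = [line for line in sample if keep_line(pattern, line)]
```

Note that `\b` only marks a boundary next to a letter, digit, or underscore, so a word that starts or ends with punctuation may not match as expected.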

Solution 3:

Binary search would require the lines and words in the big file to be sorted in some way. They do not appear to be, so the best you can do is a linear search: for each line, check whether any word in the list occurs in it.
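
A minimal sketch of that linear search, assuming a plain substring test is acceptable. The function name `line_contains_any` and the sample data are hypothetical.

```python
def line_contains_any(line, words):
    """Linear search: test every word against the line (substring match)."""
    return any(word in line for word in words)

words = ["foo", "bar"]
sample = ["a foo here\n", "nothing\n", "foobar too\n"]
# Keep only the lines in which none of the words occurs.
kept = [line for line in sample if not line_contains_any(line, words)]
```

Because `in` tests substrings, a line containing foobar is also dropped here; the set-based and regex-based answers match whole words instead.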

Solution 4:

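import re
contents = file.read()
# sorted() returns a new list; list.sort() sorts in place and returns None
words = sorted(the_list, key=len, reverse=True)
# re.sub (there is no re.replace); re.MULTILINE makes ^ match at each line start
stripped_contents = re.sub(r'^.*(%s).*\n' % '|'.join(words), '', contents, flags=re.MULTILINE)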

Something like that should work, though I am not sure whether it will be faster than going through the file line by line.

[edit] this is untested code and may need some slight tweaks

Solution 5:

You cannot delete the lines in place; you need to write a second file. You may overwrite the old one afterwards (see shutil.copy for this).
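
One way to sketch the "write a second file, then overwrite" step is shown below. It swaps in `os.replace` with a temporary file instead of `shutil.copy`, which is a choice made for this example and not part of the answer; the function name `remove_lines_with_words` is hypothetical.

```python
import os
import tempfile

def remove_lines_with_words(path, forbidden_words):
    """Rewrite the file at `path` without lines containing a forbidden word."""
    forbidden = set(forbidden_words)
    # Create the temporary file next to the original so the final
    # rename stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    with open(path) as infile, tempfile.NamedTemporaryFile(
            "w", dir=directory, delete=False) as tmp:
        for line in infile:
            # Keep the line only if it shares no word with the set.
            if forbidden.isdisjoint(line.split()):
                tmp.write(line)
    # Replace the original with the filtered copy.
    os.replace(tmp.name, path)
```

The file is streamed line by line, so memory use stays small even for a very large input. Like the whitespace-splitting answers, this does not handle punctuation attached to words.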

The rest reads like pseudo-code:

forbidden_words = set("these words shall not occur".split())

with open(inputfile) as infile, open(outputfile, 'w+') as outfile:
  outfile.writelines(line for line in infile
      if not any(word in forbidden_words for word in line.split()))

See this question for approaches to getting rid of punctuation-induced false negatives.
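
The linked question is not reproduced here, but one possible approach, offered only as a sketch, is to turn ASCII punctuation into spaces before splitting. The helper name `line_words` and the sample lines are invented for this example.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def line_words(line):
    """Return the set of words in a line, ignoring ASCII punctuation."""
    return set(line.translate(_PUNCT_TO_SPACE).split())

forbidden_words = {"foo", "bar"}
sample = ["foo, bar,stuff\n", "clean line\n", "(bar)\n"]
# Keep only the lines that share no word with the forbidden set.
kept = [line for line in sample if forbidden_words.isdisjoint(line_words(line))]
```

This also splits contractions such as don't into two tokens and leaves non-ASCII punctuation untouched, so it is a rough fix, not a full tokenizer.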
