
Removing Words That Appear More Than X% In A Corpus Python

I am dealing with a large corpus in the form of a list of tokens/words. The corpus contains ~1,900,000 words. I have run code to get the most frequent words, and now, when I try to remove the words that appear in more than 95% of the corpus using CountVectorizer, I get an error saying that no terms remain after pruning.

Solution 1:

As the error suggests, no terms remain after the transformation. That is, every word appears in more than 95% of the documents or in fewer than 5% of them. For example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is a good sentence',
          'This is a good sentence',
          'This is a good sentence',
          'This is a good sentence',
          'This is a good sentence']

# Every term appears in 100% of the documents, which is above
# max_df=0.95, so all terms are pruned and fit_transform() fails.
cv = CountVectorizer(min_df=0.05, max_df=0.95)
X = cv.fit_transform(corpus)

will raise the same error. But this does not make sense for a corpus of ~1,900,000 words. Perhaps you should check whether your corpus is a valid input for CountVectorizer: fit_transform() expects an iterable of raw document strings, not a flat list of tokens. See more detail at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html .
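Here is a minimal sketch of that check, assuming (hypothetically) that your corpus ended up as a flat list of tokens; the docs, token_list, and rejoined names are illustrative, not from the original question:

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer expects an iterable of document *strings*.
docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'a completely different sentence']

cv = CountVectorizer(min_df=0.05, max_df=0.95)  # document-frequency thresholds
X = cv.fit_transform(docs)
print(cv.get_feature_names_out())

# If your corpus is a flat list of tokens, each token would be treated
# as its own one-word "document" and min_df/max_df would behave very
# differently than intended. Rejoin the tokens into document strings first:
token_list = ['the', 'cat', 'sat', 'on', 'the', 'mat']   # hypothetical
rejoined = ' '.join(token_list)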


Solution 2:

Here is some code that computes the frequency fraction of each token in a list. You can use the fraction to perform your filtering.

import nltk

def get_frac_dist(token_list):
    '''
    Computes frequency count and fraction of individual words in a list.

    Parameters
    ----------
    token_list : list
        List of all (non-unique) tokens from some corpus

    Returns
    -------
    dict
        Dictionary of { token: (count, fraction) } pairs.
    '''
    total_token_count = len(token_list)

    # Count occurrences of each distinct token.
    freq_dict = nltk.FreqDist(token_list)

    frac_dict = {}
    for token, count in freq_dict.items():
        frac_dict[token] = (count, count / total_token_count)

    return frac_dict

Here's how you can use it:

  1. Open a file, tokenize it with nltk.word_tokenize(), and pass the token list into get_frac_dist().
with open('speech.txt') as f:
    content = f.read()

token_list = nltk.word_tokenize(content)  # needs the 'punkt' models: nltk.download('punkt')
frac_dist = get_frac_dist(token_list)
  2. get_frac_dist() returns a dictionary whose keys are tokens and whose values are (frequency count, frequency fraction) tuples. Let's take a look at the first 10 items in the dictionary.
for i, (token, (count, frac)) in enumerate(frac_dist.items()):
    if i == 10:
        break
    print('%-20s: %5d, %0.5f' % (token, count, frac))

prints out:

President           :     2, 0.00081
Pitzer              :     1, 0.00040
,                   :   162, 0.06535
Mr.                 :     3, 0.00121
Vice                :     1, 0.00040
Governor            :     1, 0.00040
Congressman         :     2, 0.00081
Thomas              :     1, 0.00040
Senator             :     1, 0.00040
Wiley               :     1, 0.00040
  3. Now, if you want to get the words whose fraction is greater (or less) than some value, run a loop or a list comprehension with < or > as appropriate.
l = [ (k, v[0], v[1]) for k, v in frac_dist.items() if v[1] > 0.04 ]
print(l)

prints out:

[(',', 162, 0.0653489310205728), ('and', 104, 0.04195240016135539), ('the', 115, 0.0463896732553449)]
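With that in place, here is a minimal sketch of the original goal of removing the over-frequent words from the token list; the 0.04 cutoff is an illustrative value, not one from the question:

# Drop every token whose frequency fraction exceeds the cutoff.
cutoff = 0.04   # illustrative threshold
filtered_tokens = [t for t in token_list if frac_dist[t][1] <= cutoff]

print(len(token_list), '->', len(filtered_tokens))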

Be aware that your desire to get the words that occur more than 95% of the time may not return any valid answer: in the sample output above, even the comma, the most frequent token, accounts for only ~6.5% of all tokens. You should plot the distribution of words and see what their counts are before picking a threshold.
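One way to do that plot is nltk's built-in FreqDist.plot(), which requires matplotlib to be installed; a minimal sketch using the same token list:

import nltk

# Plot the raw counts of the 50 most common tokens.
freq_dist = nltk.FreqDist(token_list)
freq_dist.plot(50)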

