Removing Words That Appear More Than X% In A Corpus Python
Solution 1:
As the error suggests, no terms remain after the transformation. That is, every word appears in more than 95% or fewer than 5% of the documents. For example:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is a good sentence',
          'This is a good sentence',
          'This is a good sentence',
          'This is a good sentence',
          'This is a good sentence']
cv = CountVectorizer(min_df=0.05, max_df=0.95)
X = cv.fit_transform(corpus)
will raise the same error, because every word appears in 100% of the documents. But this does not make sense when your corpus has 1,900,000 words. Perhaps you should check whether your corpus is a valid argument for CountVectorizer. See the details at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html .
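For contrast, here is a minimal sketch (with a made-up corpus) where the same min_df/max_df settings succeed, because the vocabulary is varied: only the word that appears in every document gets pruned by max_df.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus with both a ubiquitous word ('the') and rarer words.
corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'the bird flew over the hill',
    'the fish swam under the bridge',
]

# 'the' occurs in all 4 documents (100% > 95%), so max_df=0.95 removes it;
# every other word occurs in at least 5% of documents, so min_df=0.05 keeps it.
cv = CountVectorizer(min_df=0.05, max_df=0.95)
X = cv.fit_transform(corpus)
print(sorted(cv.vocabulary_))
```

Because at least one term survives the pruning, fit_transform() returns a document-term matrix instead of raising the "no terms remain" error.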
Solution 2:
Here is some code that computes the frequency fraction of each token in a list. You can use the fraction to perform your filtering.
import nltk

def get_frac_dist(token_list):
    '''
    Computes the frequency count and fraction of each word in a list.

    Parameters
    ----------
    token_list : list
        List of all (non-unique) tokens from some corpus.

    Returns
    -------
    dict
        Dictionary of { token: (count, fraction) } pairs.
    '''
    total_token_count = len(token_list)
    freq_dict = nltk.FreqDist(token_list)
    frac_dict = {}
    for token, count in freq_dict.items():
        frac_dict[token] = (count, count / total_token_count)
    return frac_dict
Here's how you can use it:
- Open a file, tokenize it with nltk.word_tokenize(), and pass the token list into get_frac_dist().
with open('speech.txt') as f:
    content = f.read()

token_list = nltk.word_tokenize(content)
frac_dist = get_frac_dist(token_list)
- get_frac_dist() returns a dictionary. The key is a token, and the value is a tuple of (frequency count, frequency fraction). Let's take a look at the first 10 items in the dictionary.
i = 0
for k, v in frac_dist.items():
    i += 1
    if i > 10:
        break
    token = k
    count = v[0]
    frac = v[1]
    print('%-20s: %5d, %0.5f' % (token, count, frac))
prints out:
President : 2, 0.00081
Pitzer : 1, 0.00040
, : 162, 0.06535
Mr. : 3, 0.00121
Vice : 1, 0.00040
Governor : 1, 0.00040
Congressman : 2, 0.00081
Thomas : 1, 0.00040
Senator : 1, 0.00040
Wiley : 1, 0.00040
- Now, if you want to get the words with a fraction greater or less than some value, run a loop or a list comprehension using < or > as appropriate.
l = [ (k, v[0], v[1]) for k, v in frac_dist.items() if v[1] > 0.04 ]
print(l)
prints out:
[(',', 162, 0.0653489310205728), ('and', 104, 0.04195240016135539), ('the', 115, 0.0463896732553449)]
Be aware that your desire to get the words that occur more than 95% of the time may not return any valid answer. You should plot the distribution of words first and see what their counts actually are.
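A quick way to eyeball that distribution, sketched below with a made-up text and a plain whitespace tokenizer (standing in for nltk.word_tokenize() so the snippet has no NLTK dependency), is to rank tokens by their frequency fraction and print the top few:

```python
from collections import Counter

# Hypothetical stand-in text; split() is a crude tokenizer used only
# to keep this sketch self-contained.
text = 'the cat sat on the mat and the dog sat on the log'
tokens = text.lower().split()
total = len(tokens)

# Rank tokens by frequency, highest first, alongside their fraction.
dist = Counter(tokens)
ranked = sorted(((t, c, c / total) for t, c in dist.items()),
                key=lambda x: x[1], reverse=True)
for token, count, frac in ranked[:5]:
    print('%-10s: %3d, %0.3f' % (token, count, frac))
```

Even in this toy text the most common word reaches only about 31% of all tokens, which illustrates why a "more than 95%" threshold on a real corpus will often match nothing.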