擅长:python、mysql、java
<p>您可以定义一组特殊组合,并在标记化之前预处理短语:</p>
<pre><code>import nltk
def preprocess_text(original_text):
specials = {"computer vision": "computer_vision",
"fake news": "fake_news"}
out = original_text.lower()
for k in specials:
out = out.replace(k, specials[k])
return out
def main():
txt = preprocess_text("Computer vision has nothing to do wiht fake news")
tokens = nltk.word_tokenize(txt)
nltk.FreqDist(tokens).tabulate()
if __name__ == "__main__":
main()
</code></pre>
<p>不过,最好有一个专门的标记化。在</p>