Python - Sentiment analysis using pointwise mutual information

Asked 2025-04-17 20:26

from __future__ import division
import urllib
import json
from math import log


def hits(word1,word2=""):
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s"
    if word2 == "":
        results = urllib.urlopen(query % word1)
    else:
        results = urllib.urlopen(query % word1+" "+"AROUND(10)"+" "+word2)
    json_res = json.loads(results.read())
    google_hits=int(json_res['responseData']['cursor']['estimatedResultCount'])
    return google_hits


def so(phrase):
    num = hits(phrase,"excellent")
    #print num
    den = hits(phrase,"poor")
    #print den
    ratio = num / den
    #print ratio
    sop = log(ratio)
    return sop

print so("ugly product")

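I need this code to compute pointwise mutual information (PMI), which can be used to classify reviews as positive or negative. Basically, I am using the technique described in Turney (2002): http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf, which is an example of an unsupervised classification method for sentiment analysis.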

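As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor", and positive if it is more strongly associated with the word "excellent".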
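For reference, my reading of Turney (2002) is that the semantic orientation is the difference of two PMI scores, which reduces to log2 of hits(phrase NEAR "excellent") * hits("poor") divided by hits(phrase NEAR "poor") * hits("excellent"). Below is a minimal sketch of that calculation from raw hit counts. The function name, the argument names, and the example counts are hypothetical, and the 0.01 smoothing term is how I recall the paper avoiding division by zero, so treat it as an assumption.

```python
from math import log2


def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor").

    near_excellent / near_poor: hit counts for the phrase near each anchor word.
    hits_excellent / hits_poor: hit counts for each anchor word on its own.
    smoothing: small constant added to the co-occurrence counts (assumed value).
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return log2(numerator / denominator)


# Hypothetical counts, for illustration only.
raw_ratio = 30 / 20  # ratio of co-occurrence hits alone
so = semantic_orientation(near_excellent=30, near_poor=20,
                          hits_excellent=3000, hits_poor=1000)
print(raw_ratio > 1, so < 0)
```

With these made-up counts the raw ratio 30/20 looks positive, but once the standalone frequencies of "excellent" and "poor" are factored in, the score is negative. That is one way a ratio-only calculation can end up with the wrong sign.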

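The code above calculates the semantic orientation (SO) of a phrase. I use Google to get the number of hits and compute the SO from them (since AltaVista no longer exists).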

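The computed values are very erratic and do not follow any particular pattern. For example, SO("ugly product") comes out as 2.85462098541 and SO("beautiful product") as 1.71395061117, whereas the former should be negative and the latter positive.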

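Is there anything wrong with the code? Is there an easier way to compute the SO of a phrase (using PMI) with a Python library such as NLTK? I tried NLTK but could not find any explicit method that computes PMI.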


3 Answers

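To understand why your results are erratic, you first need to know that Google Search is not a reliable source of word frequencies. The frequencies it returns are only estimates, and they are particularly inaccurate, and may even contradict each other, when you query for multiple words. This is not a criticism of Google; it simply is not a tool for counting frequencies. So your implementation may be fine, but the results based on those counts can still be nonsensical.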

For a more in-depth discussion of this issue, see Adam Kilgarriff's article "Googleology is Bad Science".

Answered 2025-04-17 by Python大师


The Python library DISSECT provides several methods for computing pointwise mutual information (PMI) on a co-occurrence matrix.

For example:

#ex03.py
#-------
from composes.utils import io_utils
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting

#create a space from co-occurrence counts in sparse format
my_space = io_utils.load("./data/out/ex01.pkl")

#print the co-occurrence matrix of the space
print my_space.cooccurrence_matrix

#apply ppmi weighting
my_space = my_space.apply(PpmiWeighting())

#print the co-occurrence matrix of the transformed space
print my_space.cooccurrence_matrix

The code for the PMI method is available on GitHub.
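To make it clearer what a PPMI weighting step does to a co-occurrence matrix, here is a small pure-Python sketch of the general idea: each cell is replaced by log(p(row, col) / (p(row) * p(col))), and negative or undefined values are clipped to zero. This is not DISSECT's implementation. The function name `ppmi_matrix` and the toy counts are hypothetical, and the library may differ in details such as sparse storage.

```python
from math import log


def ppmi_matrix(counts):
    """Apply positive PMI weighting to a dense matrix of non-negative counts."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    weighted = []
    for i, row in enumerate(counts):
        new_row = []
        for j, n in enumerate(row):
            if n == 0:
                # log(0) is undefined; PPMI conventionally maps this to 0.
                new_row.append(0.0)
                continue
            # p(i,j) / (p(i) * p(j)) simplifies to n * total / (row_sum * col_sum).
            pmi = log(n * total / (row_sums[i] * col_sums[j]))
            new_row.append(max(pmi, 0.0))
        weighted.append(new_row)
    return weighted


# Toy co-occurrence counts: rows are target words, columns are context words.
toy_counts = [[2, 0],
              [0, 2]]
toy_ppmi = ppmi_matrix(toy_counts)
print(toy_ppmi)
```

Cells where a word and a context co-occur more often than chance get a positive weight, and everything else becomes zero, which is why the weighted matrix is usually sparser than the raw counts.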

Reference: Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the System Demonstrations of ACL 2013, Sofia, Bulgaria.

Related: Calculating pointwise mutual information between two strings

Answered 2025-04-17 by Python大师


In general, computing PMI is tricky because the formula changes depending on the size of the n-gram you want to consider.

Mathematically, for bigrams, you can simply use:

log(p(a,b) / ( p(a) * p(b) ))

Programmatically, assuming you have already computed the frequencies of all unigrams and bigrams in your text, you can do this:

import math

def pmi(word1, word2, unigram_freq, bigram_freq):
    # Turn raw counts into relative frequencies (probability estimates).
    total_unigrams = float(sum(unigram_freq.values()))
    prob_word1 = unigram_freq[word1] / total_unigrams
    prob_word2 = unigram_freq[word2] / total_unigrams
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values()))
    # PMI in bits: log2(p(a,b) / (p(a) * p(b)))
    return math.log(prob_word1_word2 / (prob_word1 * prob_word2), 2)
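The function expects two frequency dictionaries, with bigrams keyed by space-joined strings. As a minimal sketch of how those could be built and used, the following assumes whitespace tokenization and a made-up one-line corpus. The helper name `count_ngrams` is hypothetical, and the `pmi` function is restated so that the block runs on its own.

```python
import math
from collections import Counter


def count_ngrams(tokens):
    """Return (unigram_freq, bigram_freq); bigram keys are space-joined strings."""
    unigram_freq = Counter(tokens)
    bigram_freq = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return unigram_freq, bigram_freq


def pmi(word1, word2, unigram_freq, bigram_freq):
    """PMI in bits, estimated from relative frequencies."""
    total_unigrams = sum(unigram_freq.values())
    total_bigrams = sum(bigram_freq.values())
    p1 = unigram_freq[word1] / total_unigrams
    p2 = unigram_freq[word2] / total_unigrams
    p12 = bigram_freq[word1 + " " + word2] / total_bigrams
    return math.log2(p12 / (p1 * p2))


# Made-up corpus, for illustration only.
tokens = "foo bar is here and foo bar is there".split()
unigram_freq, bigram_freq = count_ngrams(tokens)
score = pmi("foo", "bar", unigram_freq, bigram_freq)
print(round(score, 3))
```

Note that a bigram which never occurs in the corpus has probability zero, so the logarithm is undefined. A real pipeline needs smoothing or has to skip such pairs.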

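This is a snippet from the MWE library, which is still at a pre-development stage (https://github.com/alvations/Terminator/blob/master/mwe.py). Note, however, that it is meant for parallel multi-word expression (MWE) extraction, so here is a way to "hack" it to extract monolingual MWEs: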

$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt
$ printf "" > trg.txt
$ python
>>> import codecs
>>> from mwe import load_ngramfreq, extract_mwe

>>> # Calculates the unigrams and bigrams counts.
>>> # More superfluously, "Training a bigram 'language model'."
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt')

>>> sent = "This is another foo bar sentence not in the training corpus ."

>>> for threshold in range(-2, 4):
...     print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)]

[out]:

-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
0 ['this is', 'foo bar', 'bar sentence']
1 ['this is', 'foo bar', 'bar sentence']
2 ['this is', 'foo bar', 'bar sentence']
3 ['foo bar', 'bar sentence']
4 []
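The session above filters bigrams by a PMI threshold. As a rough sketch of that idea only, and not of what `mwe.py` actually does, the following counts n-grams from a few training lines and keeps the bigrams of a new sentence whose PMI exceeds a threshold. The function names and the toy corpus are hypothetical, and unseen bigrams are simply skipped here, so its behavior at low thresholds differs from the output shown above.

```python
import math
from collections import Counter


def train_counts(lines):
    """Count unigrams and adjacent-word bigrams over lowercased, whitespace-split lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        tokens = line.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigrams_above(sentence, unigrams, bigrams, threshold):
    """Return the bigrams of `sentence` whose PMI (in bits) exceeds `threshold`."""
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    tokens = sentence.lower().split()
    kept = []
    for pair in zip(tokens, tokens[1:]):
        if bigrams[pair] == 0:
            continue  # unseen bigram: PMI is undefined without smoothing
        p_pair = bigrams[pair] / n_bi
        p_a = unigrams[pair[0]] / n_uni
        p_b = unigrams[pair[1]] / n_uni
        if math.log2(p_pair / (p_a * p_b)) > threshold:
            kept.append(" ".join(pair))
    return kept


# Toy training data, for illustration only.
corpus = ["this is a foo bar sentence",
          "more foo bar is needed",
          "the foo bar again"]
uni, bi = train_counts(corpus)
for threshold in range(0, 4):
    print(threshold, bigrams_above("another foo bar example", uni, bi, threshold))
```

Raising the threshold keeps only the most strongly associated pairs, which is the same trade-off the thresholds in the session above are exploring.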

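For more details, I find this paper a quick and easy introduction to MWE extraction: "Extending the Log-Likelihood Measure to Improve Collocation Identification"; see http://goo.gl/5ebTJJ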

Answered 2025-04-17 by Python大师
