The premise is that I classify tweets into 6 different polarities: Positive, Moderately Positive, Highly Positive, Negative, Moderately Negative and Highly Negative.
Since each tweet goes through the NLP steps (using NLTK), I need to process it one sentence, one tweet at a time.
Problem:
These polarities are defined by patterns over the part-of-speech tags. One of the patterns is adverb + adverb + adjective, where the words are included in D (drought-related terms) and in F (frequent words).
I need the frequent word that put the sentence into one of these 6 polarities to be saved into my dataframe.
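As a minimal illustration of the adverb + adverb + adjective pattern test, the sketch below hard-codes a tagged sentence so it runs without NLTK's tagger models; in the real pipeline the `(word, tag)` pairs would come from `nltk.pos_tag`:

```python
# Sketch of the POS-trigram pattern test. The tagged sentence is made up
# for the example; real tags would come from nltk.pos_tag(tokens).
def rb_rb_jj_trigrams(tagged):
    """Return (w1, w2, w3) word triples whose tags are RB, RB, JJ."""
    hits = []
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        if t1.startswith("RB") and t2.startswith("RB") and t3.startswith("JJ"):
            hits.append((w1, w2, w3))
    return hits

tagged = [("drought", "NN"), ("very", "RB"), ("extremely", "RB"), ("severe", "JJ")]
print(rb_rb_jj_trigrams(tagged))  # [('very', 'extremely', 'severe')]
```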
Snippet:
This is what I tried:
for (w1, tag1), (w2, tag2), (w3, tag3) in nltk.trigrams(PoS_TAGS):
    if tag1.startswith("RB") and tag2.startswith("RB") and tag3.startswith("JJ"):
        tri_pairs.append((w1, w2, w3))
        if tri_pairs[0] or tri_pairs[1] or tri_pairs[2] in D:
            print("[True]: Tri Pairs are found in Drought Rel. Term")
            for j in range(len(F)):
                if tri_pairs[0] or tri_pairs[1] or tri_pairs[2] in F[j]:
                    print("[True]: Tri Pairs are found in Frequent Wordset")
                    if RES is "Positive":
                        RES = "Highly Positive"
                    elif RES is "Negative":
                        RES = "Highly Negative"
                    print("=" * 25, F[j])
                    FW_list.append(F[j])
                else:
                    print("[False]: Doesn't Match with Frequent Wordset\n")
        else:
            print("[False]: Tri Pairs Matched Nowhere in D\n")
    else:
        print("[TriPair(F)]: Pattern for Adverb, Adverb, Adjective did not match.\n Looking for Bi-Pair Patterns\n")
print(tri_pairs)
print(">" * 13, FW)
As you can see, I have tried printing in most of the ways possible, using the list and even the inner loop. Neither returned anything useful. The other two patterns, which determine the remaining polarities, behave similarly.
I also wrote the code to add the result to the dataframe:
fuzzy_df = fuzzy_df.append({'Tweets': tweets[i], 'Classified': RES, 'FreqWord': FW}, ignore_index=True)
But so far the corresponding column in the CSV comes back empty.
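One detail worth checking here (my observation, not part of the original post): in Python, a condition like `x or y or z in D` parses as `x or y or (z in D)`, so any truthy first operand short-circuits the whole test to True regardless of `D`. A small sketch of the difference, with made-up `D` and `tri_pairs` values:

```python
# 'a or b or c in D' means 'a or b or (c in D)': a truthy first operand
# makes the whole condition True no matter what D contains.
D = {"drought", "famine"}
tri_pairs = [("very", "extremely", "severe")]

broken = tri_pairs[0] or ("x" in D)      # non-empty tuple short-circuits
explicit = any(w in D for triple in tri_pairs for w in triple)

print(bool(broken))   # True, even though no word is in D
print(explicit)       # False
```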
I already have the frequent words extracted. They are as follows:
>>> F
['drought', 'water', 'love', 'rain', 'year', 'famine', 'farmers', 'crops', 'south', 'http', 'europe', 'scarcity', 'near', 'thought', 'ever', 'devastates', 'feed', 'message', 'eduaubdedubu', 'instant', 'italy', 'severe', 'by', 'beaches', 'wildfires', 'heat', 'us']
The CSV looks like this:
Tweets,Classified,FreqWord
real time strategy password wastelands depletion groundwater skyrocketing debts make years anantapur drought worse,Negative,
calm director day science meetings nasal talk cutting edge remote sensing research drought veg fluorescence calm love,Positive,
love thought drought,Positive,
neville rooney end ever tons trophy drought,Positive,
lakes drought,Positive,
lakes fan joint trailblazers dot forget play drought,Positive,
reign mother kerr funny none tried make come back drought,Positive,
wonder could help thai market b post reuters drought devastates south europe crops,Negative,
Input file:
tweets,polarity
real time strategy password wastelands depletion groundwater skyrocketing debts make years anantapur drought worse,Positive
calm director day science meetings nasal talk cutting edge remote sensing research drought veg fluorescence calm love,Positive
hate thought drought,Negative
Nevertheless, the output I showed above is tokenized, with stop words removed.
Expected output file:
Tweets,Classified,FreqWord
real time strategy password wastelands depletion groundwater skyrocketing debts make years anantapur drought worse,Negative,water
calm director day science meetings nasal talk cutting edge remote sensing research drought veg fluorescence calm love,Positive,drought
love thought drought,Positive,drought
neville rooney end ever tons trophy drought,Positive,rain
lakes drought,Positive,drought
lakes fan joint trailblazers dot forget play drought,Positive,farmer
reign mother kerr funny none tried make come back drought,Positive,crops
wonder could help thai market b post reuters drought devastates south europe crops,Negative,crops
FW = ''
for i in range(len(tweets)):
    sent = nltk.word_tokenize(tweets[i])
    PoS_TAGS = nltk.pos_tag(sent)
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    one_sentence = tweets.iloc[i]
    scores = sia.polarity_scores(text=one_sentence)
    print("POS:", scores.get('pos'))
    print("NEG:", scores.get('neg'))
    print("NEU:", scores.get('neu'))
    POS = scores.get('pos')
    NEG = scores.get('neg')
    NEU = scores.get('neu')
    RES = str()
    if POS > NEG:
        RES = 'Positive'
    elif NEG > POS:
        RES = 'Negative'
    elif NEU >= 0.5 or POS > NEU:
        RES = 'Positive'
    elif NEU < 0.5:
        RES = 'Negative'
    # -------------------------------------------------------- PATTERN ADVERB, ADVERB, ADJECTIVE (Down)
    tri_pairs = list()
    for (w1, tag1), (w2, tag2), (w3, tag3) in nltk.trigrams(PoS_TAGS):
        if tag1.startswith("RB") and tag2.startswith("RB") and tag3.startswith("JJ"):
            tri_pairs.append((w1, w2, w3))
            if tri_pairs[0] or tri_pairs[1] or tri_pairs[2] in D:
                print("[True]: Tri Pairs are found in Drought Rel. Term")
                # TRIGGER AREA
                for j in range(len(F)):
                    if tri_pairs[0] or tri_pairs[1] or tri_pairs[2] in F[j]:
                        print("[True]: Tri Pairs are found in Frequent Wordset")
                        if RES is "Positive":
                            RES = "Highly Positive"
                            FW = F[j]
                            # fuzzy_df['FreqWord'].map(lambda x: next((y for y in x.split() if y in F), 'Not Found'))
                        elif RES is "Negative":
                            RES = "Highly Negative"
                            FW = F[j]
                    else:
                        print("[False]: Doesn't Match with Frequent Wordset\n")
            else:
                print("[False]: Tri Pairs Matched Nowhere in D\n")
        else:
            print("[TriPair(F)]: Pattern for Adverb, Adverb, Adjective did not match.\n Looking for Bi-Pair Patterns\n")
    print(tri_pairs)
    # -------------------------------------------------------- PATTERN ADVERB, ADJECTIVE (Down)
    bi_pairs = list()
    for (w1, tag1), (w2, tag2) in nltk.bigrams(PoS_TAGS):
        if tag1.startswith("RB") and tag2.startswith("JJ"):
            bi_pairs.append((w1, w2))
            if bi_pairs[0] or bi_pairs[1] in D:
                print("[True]: Bi Pairs are found in Drought Rel. Term")
                for k in range(len(F)):
                    if bi_pairs[0] or bi_pairs[1] is F[k]:
                        print("[True]: Bi Pairs are found in Frequent Wordset")
                        if RES is "Positive":
                            RES = "Moderately Positive"
                            FW = F[k]
                        elif RES is "Negative":
                            RES = "Moderately Negative"
                            FW = F[k]
                    else:
                        print("[False]: Bi Pairs found missing in Freq. Wordset")
            else:
                print("[False]: Bi Pairs Matched Nowhere in D")
        else:
            print("[BiPair(F)]: Pattern Not Matched, Looking for Mono Pattern")
    print(bi_pairs)
    # -------------------------------------------------------- PATTERN ADJECTIVE (Down)
    for w, tag in PoS_TAGS:
        print(w, " - ", tag)
        if tag.startswith("JJ"):
            if w in D:
                print("Matched with D")
                for l in range(len(F)):
                    if w is F[l]:
                        print("Matched with F")
                        if RES is "Positive":
                            RES = "Positive"
                            FW = F[l]
                        elif RES is "Negative":
                            RES = "Negative"
                            FW = F[l]
                    else:
                        print("Unmatched in F")
                        FW = F[l] in sent
            else:
                print("Unmatched in D")
        else:
            print(w, "is not an ADJECTIVE")
    # -------------------------------------------------------- MAKING ENTRY OF RECORDS OF TWEETS and POLARITY RESULT
    fuzzy_df = fuzzy_df.append({'Tweets': tweets[i], 'Classified': RES, 'FreqWord': FW}, ignore_index=True)
# ADDING RECORDS IN DATAFRAME
fuzzy_df.to_csv("fuzzy.csv", index=False)
Is this what you want to do? First, create a simple Counter() object (i.e. a dictionary) from your defined word list. Then apply a Counter() intersection to each row of tweets to create a df['FreqCounter'] column. Finally, extract the set of unique keys from df['FreqCounter'] to populate df['FreqWord'].
If you don't need the counts of the dictionary words per tweet row, you can simply use a set instead; the same works if you want to find the most frequently used words from df['FreqCounter'].
With this minimal example you can also try the following simple approach: next will iterate over a row of tweets, check whether any of the frequent words occurs in the tweet, and return 'Not Found' if none does.