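This is my first time working with n-grams. I have a df with multiple rows and columns. I removed the stop words and tokenized the text. My code looks like this: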
from nltk.corpus import stopwords
stop = stopwords.words('english')
# Exclude stopwords with Python's list comprehension and pandas.DataFrame
testdf['issues_without_stopwords'] = testdf['issue'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
testdf['questions_without_stopwords'] = testdf['question'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# Remove Punctuations and Tokenize
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
testdf['questions_tokenized'] = testdf['question'].apply(lambda x: tokenizer.tokenize(x))
testdf['issue_tokenized'] = testdf['issue'].apply(lambda x: tokenizer.tokenize(x))
testdf["Concate"] = testdf['issue_tokenized']+ testdf['questions_tokenized']
#Create your n-grams (1st method)
def find_ngrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))
df1 = testdf["Concate"].apply(lambda x: find_ngrams(x, 4))
from itertools import tee, islice
from collections import Counter
#Create your n-grams and count them in cell (2nd method)
def ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
df2 = Counter(ngrams(df2["value"], 4))
Then I converted them into 4-grams.
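To make the 4-gram step concrete, here is a minimal sketch of the zip-based helper (the "1st method" above) applied to a single token list. The token list is taken from one row of the output shown later in the question; the variable names `tokens` and `four_grams` are only for illustration.

```python
def find_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so a list shorter than n yields no n-grams
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["Vaping", "with", "nicotine", "before", "tonsillectomy"]
four_grams = find_ngrams(tokens, 4)
print(four_grams)
# [('Vaping', 'with', 'nicotine', 'before'), ('with', 'nicotine', 'before', 'tonsillectomy')]
```

Each row of the DataFrame column ends up holding a list like this one, which is why the result is a column of lists rather than a table of counts.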
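Here is my original sample data:
[sample data missing]
What I want is one column containing all the n-grams and another column with their frequency: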
N-grams                        Freq
[(n, gram, talha)]             2
[(talha, software, python)]    1
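I also need to collapse duplicate n-grams that differ only in word order. For example, [(n, gram, talha)] and [(talha, gram, n)] should be counted as 2 but shown only once (just to clarify, since I already mentioned the freq column above, lol).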
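One possible way to get that result, sketched with the standard library only: count every n-gram across all rows under a sorted copy of itself, so that reordered n-grams share one key, and remember the first ordering seen for display. The function name `count_ngrams_ignoring_order` and the sample `rows` are hypothetical, modelled on the example above rather than on the real data.

```python
from collections import Counter


def count_ngrams_ignoring_order(rows):
    """Count n-grams across all rows, treating reordered n-grams as the same."""
    counts = Counter()
    first_seen = {}  # canonical key -> first n-gram seen, in its original order
    for grams in rows:
        for gram in grams:
            key = tuple(sorted(gram))  # same words in any order share one key
            counts[key] += 1
            first_seen.setdefault(key, gram)
    # most_common() orders the keys by descending frequency
    return [(first_seen[key], freq) for key, freq in counts.most_common()]


# Hypothetical rows: each row is a list of n-gram tuples, like the "Concate" column
rows = [
    [("n", "gram", "talha")],
    [("talha", "gram", "n"), ("talha", "software", "python")],
]
table = count_ngrams_ignoring_order(rows)
for gram, freq in table:
    print(gram, freq)
# ('n', 'gram', 'talha') 2
# ('talha', 'software', 'python') 1
```

If a DataFrame is wanted, passing the list to `pd.DataFrame(table, columns=["N-grams", "Freq"])` should give the two-column layout. Sorting the tuple keeps repeated words distinct, which a `frozenset` key would not.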
Edit: to avoid confusion, this is what I currently get:
Concate
0 [('Menstrual', 'health', 'How', 'to'), ('health', 'How', 'to', 'get'), ('How', 'to', 'get', 'my')]
1 [('stomach', 'pain', 'any', 'advise')]
2 [('Vaping', 'with', 'nicotine', 'before'), ('with', 'nicotine', 'before', 'tonsillectomy')]
3 [('Mental', 'health', 'Ive', 'been'), ('health', 'Ive', 'been', 'feeling'), ('Ive', 'been', 'feeling', 'sad'), ('been', 'feeling', 'sad', 'most'), ('feeling', 'sad', 'most', 'of'), ('sad', 'most', 'of', 'the'), ('most', 'of', 'the', 'time'), ('of', 'the', 'time', 'and')]
4 [('Kidney', 'stone', 'I', 'was'), ('stone', 'I', 'was', 'diagnosed'), ('I', 'was', 'diagnosed', 'with'), ('was', 'diagnosed', 'with', 'one')]