Python n-gram frequency count

2 answers

User

Answer #1 · edited 2024-04-25 10:21:54

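This is very C-style, but it works. The idea is to track the "current" bigrams of each document, making sure each bigram is added only once per document (cur_bigrams = set()), and after each document to increment a global frequency counter (bigram_freq) for every bigram that appeared in that document. The information in bigram_freq (the global counter across documents) is then used to build a new DataFrame.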

import pandas as pd

# df is the question's DataFrame; "text_column" holds one document per row.
bigram_freq = {}  # bigram -> number of documents that contain it
for doc in df["text_column"]:
    words = doc.split(" ")
    # A set keeps each bigram only once per document.
    cur_bigrams = set(zip(words, words[1:]))
    for bigram in cur_bigrams:
        if bigram in bigram_freq:
            bigram_freq[bigram] += 1
        else:
            bigram_freq[bigram] = 1

# Collect the rows first, then build the DataFrame in one step.
row_list = []
for bigram, freq in bigram_freq.items():
    row_list.append([bigram[0] + " " + bigram[1], freq])
result_df = pd.DataFrame(row_list, columns=["2_gram", "count"])

print(result_df)

Output:

[output block missing from the source]

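You could probably trim this code down considerably with a more functional style and/or list comprehensions. I leave that as an exercise for the reader.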
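For what it's worth, here is one way that exercise might go. This is only a sketch: the function name bigram_doc_freq and the sample documents are made up for illustration and are not part of the answer above. It uses generator expressions and collections.Counter in place of the explicit loops, and it keeps the same "once per document" rule by turning each document's bigrams into a set first.

```python
from collections import Counter
from itertools import chain


def bigram_doc_freq(docs):
    """Return a Counter mapping each bigram to the number of docs containing it.

    Hypothetical helper for illustration; `docs` is any iterable of strings.
    """
    word_lists = (doc.split(" ") for doc in docs)
    # One set per document, so a repeated bigram counts once per document.
    per_doc_sets = (set(zip(words, words[1:])) for words in word_lists)
    return Counter(chain.from_iterable(per_doc_sets))


# Made-up sample data; in the answer above this would be df["text_column"].
sample_docs = ["the cat sat", "the cat ran", "the cat the cat"]
bigram_freq = bigram_doc_freq(sample_docs)

# Same row layout as the answer above: ["word1 word2", count].
rows = [[" ".join(bigram), count] for bigram, count in bigram_freq.items()]
```

Here ("the", "cat") is counted 3 times, once for each document, even though it occurs twice in the last one.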

User

Answer #2 · edited 2024-04-25 10:21:54

A Pythonic answer (written generically, so it can be applied to a file, a DataFrame, or anything else):

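import collections

c = collections.Counter()
for line in fh:  # fh: any iterable of text lines, e.g. an open file
    words = line.rstrip().split(" ")
    c.update(set(zip(words[:-1], words[1:])))  # each 2-gram counts once per line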

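Now c holds the frequency of each 2-gram, that is, the number of lines it appears in.

Explanation:

Each line is split on spaces into a list.
Then zip() returns an iterator over tuples of length 2 (the 2-grams).
The iterator is fed into set() to eliminate duplicates.
The set is then passed to the collections.Counter() object, which keeps track of how many times each tuple has been seen. You need to import collections to use it.
From here it is easy to list the counter's contents or convert it to whatever other format you like (e.g. a DataFrame).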
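As a rough illustration of that last step, the sketch below lists a counter's contents and reshapes them into plain records. The counter contents are made-up stand-ins for c, and the column names "2_gram" and "count" are borrowed from the first answer rather than required by anything here.

```python
from collections import Counter

# Made-up counter contents standing in for `c` from the answer above.
c = Counter({("the", "cat"): 3, ("cat", "sat"): 1, ("cat", "ran"): 1})

# The single most frequent 2-gram, as a list of (bigram, count) pairs.
top = c.most_common(1)

# All 2-grams, most frequent first, as one dict per row.
records = [
    {"2_gram": " ".join(bigram), "count": count}
    for bigram, count in c.most_common()
]
```

If pandas is available, pandas.DataFrame(records) should turn such a list of dicts into a two-column DataFrame.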

Yes, Python is great.
