Counting word frequency with pandas and matplotlib

Asked 2025-04-17 21:38

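I'd like to know how to plot a word-frequency histogram of the author column of a CSV file using pandas and matplotlib. My CSV has the columns id, author, title, language. Sometimes the author column contains several authors, separated by spaces.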

import pandas as pd

file = 'c:/books.csv'
# read_csv accepts a path directly, so no explicit open() is needed.
df = pd.read_csv(file)
print(df['author'])

Tags: data visualization, data analysis, CSV file processing, word frequency analysis

2 Answers

You can use value_counts to count how many times each name occurs:

In [11]: df['author'].value_counts()
Out[11]: 
peter       3
bob         2
marianne    1
dtype: int64

Both Series and DataFrames have a hist method that plots a histogram:

In [12]: df['author'].value_counts().hist()
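
Note that calling hist() on the result of value_counts() bins the counts themselves (how many authors appear once, twice, and so on). If one bar per author is what is wanted, a bar plot is one option. The following is a minimal sketch, assuming the sample data from the output above, an inline CSV string in place of the asker's file, and a hypothetical output filename author_counts.png:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show() instead
import matplotlib.pyplot as plt
import pandas as pd

# Inline stand-in for books.csv (assumed sample data).
csv_text = """id,author,title,language
1,peter,t1,de
2,peter,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Count how often each author appears, most frequent first.
counts = df["author"].value_counts()

# One bar per author; bar height is the number of rows for that author.
ax = counts.plot(kind="bar")
ax.set_xlabel("author")
ax.set_ylabel("count")
plt.tight_layout()
plt.savefig("author_counts.png")  # hypothetical output filename
```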

Answered 2025-04-17 by Python大师


Use collections.Counter to build the histogram data, following the example linked here, i.e.:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

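# Read the CSV file and count how often each author appears.
df = pd.read_csv("books.csv", index_col="id")
counter = Counter(df['author'])
# Materialize the dict views as lists so matplotlib can consume them (Python 3).
author_names = list(counter.keys())
author_counts = list(counter.values())

# Draw one bar per author with matplotlib's bar().
indexes = np.arange(len(author_names))
width = 0.7
plt.bar(indexes, author_counts, width)
# Bars are centered on their x positions by default (matplotlib >= 2.0),
# so the tick labels go at the same positions.
plt.xticks(indexes, author_names)
plt.show()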

Suppose you have this test file:

$ cat books.csv 
id,author,title,language
1,peter,t1,de
2,peter,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp

The code above produces a bar chart with one bar per author: peter (3), bob (2) and marianne (1).

Edit:

You added the extra condition that the author column may contain several names separated by spaces. The following code handles that case:

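from collections import Counter
from itertools import chain

import pandas as pd

# Read the CSV file, split each author cell on whitespace, and count every name.
df = pd.read_csv("books2.csv", index_col="id")
authors_notflat = [a.split() for a in df['author']]
counter = Counter(chain.from_iterable(authors_notflat))
print(counter)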

For this example:

$ cat books2.csv 
id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp

it prints (entries with equal counts may appear in a different order depending on the Python version):

$ python test.py 
Counter({'peter': 3, 'bob': 2, 'harald': 2, 'marianne': 1})

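Note that this works because a.split() returns a list of names for each row, and chain.from_iterable flattens those lists into a single sequence of names that Counter can tally.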
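
To see what the split-and-flatten step does on its own, here is a small sketch with the author cells written out as a plain Python list. The list `cells` is an assumption standing in for df['author'] from the books2.csv example:

```python
from collections import Counter
from itertools import chain

# Hypothetical author cells, mirroring the books2.csv example.
cells = ["peter harald", "peter harald", "bob", "bob", "peter", "marianne"]

# str.split() with no argument splits on any run of whitespace,
# so each cell becomes a list of one or more names.
names_per_row = [cell.split() for cell in cells]

# chain.from_iterable flattens the list of lists into one sequence of names.
flat_names = list(chain.from_iterable(names_per_row))

# Counter tallies how often each name occurs.
counter = Counter(flat_names)
```

Passing the unsplit cells straight to Counter would instead count "peter harald" as a single key, which is why the split has to happen first.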

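This code is essentially independent of pandas, apart from the CSV-parsing part that produces the DataFrame df. If you need pandas' default plot style, a suggestion can be found in the post mentioned above.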
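
For comparison, the same multi-author count can probably be done with pandas alone. This is a sketch rather than the answerer's method: it assumes pandas 0.25 or later (for Series.explode) and uses an inline CSV string in place of books2.csv:

```python
import io

import pandas as pd

# Inline stand-in for books2.csv (assumed sample data).
csv_text = """id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

counts = (
    df["author"]
    .str.split()      # each cell becomes a list of names
    .explode()        # one row per individual name
    .value_counts()   # tally names, most frequent first
)
print(counts)
```

Calling counts.plot(kind="bar") on the result would then draw one bar per author using pandas' default plot styling.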

Answered 2025-04-17 by Python大师

