Converting a dataframe into a frequency list with two column variables in Python


I have a dataframe consisting of the columns node, component, and preceding word. node contains many identical values (sorted alphabetically), component also contains many identical values but shuffled, and the preceding word can be any kind of word, though some of those repeat as well.

What I want to do now is create a kind of cross-section/frequency list that shows, per node, the frequency of the components and of the preceding words linked to it.

Suppose this is my df:

node    precedingWord comp
banana  the           lel
banana  a             lel
banana  a             lal
coconut some          lal
coconut few           lil
coconut the           lel

I expect a frequency list that shows each unique node and how often a value matching a given condition is found in the other columns, e.g.

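For instance, the matching conditions might look like this (a sketch; the comp groupings are assumptions inferred from the expected output below):

det1 = ["a"]
det2 = ["the"]
comp1 = ["lel"]
comp2 = ["lil"]
comp3 = ["lal"]
# anything not covered by det1 or det2 is counted as "unspecified"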

Expected output:

node    det1  det2 unspecified comp1 comp2 comp3
banana  2     1    0           2     0     1
coconut 0     1    0           1     1     1

I've already done this for one variable, but I don't know how to fit the comp column in as well:

det1 = ["a"]
det2 = ["the"]

df.loc[df.precedingWord.isin(det1), "determiner"] = "det1"
df.loc[df.precedingWord.isin(det2), "determiner"] = "det2"
df.loc[~df.precedingWord.isin(det1 + det2), "determiner"] = "unspecified"

# Create a crosstab of node and determiner
freqDf = pd.crosstab(df.node, df.determiner)

I got that answer from here. It would also be a great help if someone could explain what loc does.


Taking Andy's answer into account, I tried the following. Note that "precedingWord" has been replaced by "gender", which only contains values such as neuter, non_neuter, and unspecified.

def frequency_list():
    # Define the content of the gender classes
    neuter = ["het"]
    non_neuter = ["de"]

    # Add a `gender` column to df
    df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[~df.precedingWord.isin(neuter + non_neuter), "gender"] = "unspecified"

    g = df.groupby("node")

    # Create a crosstab of node against comp and gender
    freqDf = pd.concat([g["comp"].value_counts().unstack(1), g["gender"].value_counts().unstack(1)])

    # Reset the index so it starts from 1, not the default 0.
    # The concat result doesn't come with a fresh index, so we first call
    # `reset_index` and then replace the index.
    freqDf.reset_index(inplace=True)
    freqDf.index = np.arange(1, len(freqDf) + 1)

    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

The output is close to what I want, but not quite:

[screenshot of the current output]

  1. The crosstabs are not "merged"; in other words, the rows for comp (component) come first, and then the same nodes are repeated for gender.
  2. Empty values should be 0 (see the sketch after this list).
  3. All values should be integers, not floats.
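Issues 2 and 3 can be fixed directly on the concatenated table, before the reset_index step (a minimal sketch):

# fill missing combinations with 0 and cast the counts to integers
freqDf = freqDf.fillna(0).astype(int)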

What I want is:

[screenshot of the desired output]

Please note that I'm looking for the most efficient answer. I'm actually working with loads and loads of data, so every second per loop counts!


2 Answers

Your question can be split into at least three parts:

  • How do you group and pivot the table?
  • How do you merge tables?
  • What does loc do?

General remarks

Pandas provides speedups for certain operations, so try a library implementation before resorting to loops (see below).

Pivoting

1. With plain pandas:

df = pd.DataFrame({"det":["a","the","a","a","a", "the"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
"you will need a dummy variable:"
df["counts"] = 1
"you probably need to reset the index"
df_counts = df.groupby(["det","word"]).agg("count").reset_index()
#   det word  counts
#0    a  cat       2
#1    a  pet       3
#2  the  pet       1
"and pivot it"
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det   a  the
#word        
#cat   2    0
#pet   3    1

Two-column example:

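A sketch of what the two-column version could look like (reconstructed along the same lines, using the df from the question; not the answerer's original snippet):

df = pd.DataFrame({"node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
                   "precedingWord": ["the", "a", "a", "some", "few", "the"],
                   "comp": ["lel", "lel", "lal", "lal", "lil", "lel"]})
df["counts"] = 1
# count each (node, precedingWord) and each (node, comp) pair separately,
# pivot both to wide form, then put them side by side
det_counts = (df.groupby(["node", "precedingWord"]).agg("count").reset_index()
                .pivot(index="node", columns="precedingWord", values="counts"))
comp_counts = (df.groupby(["node", "comp"]).agg("count").reset_index()
                 .pivot(index="node", columns="comp", values="counts"))
pd.concat([det_counts, comp_counts], axis=1).fillna(0)
#          a  few  some  the  lal  lel  lil
#node
#banana    2    0     0    1    1    2    0
#coconut   0    1     1    1    1    1    1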

2. Using Counter

from collections import Counter

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"], "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})
acounter = Counter(tuple(x) for x in df.to_numpy())
#Counter({('a', 'pet'): 3, ('a', 'cat'): 2, ('the', 'pet'): 1})
df_counts = pd.DataFrame(list(zip([y[0] for y in acounter.keys()],
                                  [y[1] for y in acounter.keys()],
                                  acounter.values())),
                         columns=["det", "word", "counts"])
#   det word  counts
#0    a  cat       2
#1  the  pet       1
#2    a  pet       3
df_counts.pivot(index="word", columns="det", values="counts").fillna(0)
#det   a  the
#word        
#cat   2    0
#pet   3    1

In my case this was slightly faster than plain pandas (52.6 µs vs. 92.9 µs per loop for the grouping, not counting the pivot).
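A sketch of how such a comparison can be reproduced with timeit (the numbers above are the answerer's own measurements):

import timeit

df["counts"] = 1  # dummy column, as in the plain-pandas variant
print(timeit.timeit(lambda: df.groupby(["det", "word"]).agg("count"), number=1000))
print(timeit.timeit(lambda: Counter(tuple(x) for x in df[["det", "word"]].to_numpy()), number=1000))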

3. As far as I understand it, this is essentially a natural language processing problem. You could try combining all the data into a single string and using CountVectorizer from sklearn with ngram_range=(2, 2) to count the pairs. Something like:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"], "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})

listofpairs = []
for _, row in df.iterrows():
    listofpairs.append(" ".join(row))

countvect = CountVectorizer(ngram_range=(2, 2), min_df=0.0, token_pattern='(?u)\\b\\w+\\b')
sparse_counts = countvect.fit_transform(listofpairs)

print("* input list:\n",listofpairs)
print("* array of counts:\n",sparse_counts.toarray())
print("* vocabulary [order of columns in the sparse array]:\n",countvect.vocabulary_)

counter_keys = [x[1:] for x  in sorted([ tuple([v] + k.split(" ")) for k,v in countvect.vocabulary_.items()])]
counter_values = np.sum(sparse_counts.toarray(), 0)

df_counts = pd.DataFrame([(x[0], x[1], y) for x,y in  zip(counter_keys, counter_values)], columns=["det", "word", "counts"])
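From here, the same pivot as in the earlier variants produces the wide table (a sketch):

df_counts.pivot(index="word", columns="det", values="counts").fillna(0)
#det   a  the
#word
#cat   2    0
#pet   3    1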

Merging

Two options:

1. concat:

df1 = df1.set_index("word")
df2 = df2.set_index("word")
df_out = pd.concat([df1, df2], axis=1)

2. merge:
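A minimal sketch of the merge route (assuming both frames still carry word as a regular column rather than as the index):

# an outer join keeps words that appear in only one of the two frames
df_out = pd.merge(df1, df2, on="word", how="outer")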

loc

With one argument it indexes rows; with two, rows and columns. It accepts row/column labels or boolean indexing (as in your case for the rows).
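A quick illustration using the question's df (a sketch):

# one argument: a boolean mask selects rows
df.loc[df["precedingWord"] == "the"]

# two arguments: rows by boolean mask, column by name;
# on the left-hand side this assigns into that column, as in the question
df.loc[df["precedingWord"] == "the", "determiner"] = "det2"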

If there is only one article per gender, you can use a direct comparison instead of the isin operation, which might speed things up:

df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"

indices_neutral = df["precedingWord"] == "de"
df.loc[indices_neutral, "gender"] = "neuter"

Or, shorter but less readable:

df.loc[df["precedingWord"]=="de", "gender"] = "neuter"

Update: here it is as a crosstab:

In [11]: df1 = pd.crosstab(df['node'], df['precedingWord'])

In [12]: df1
Out[12]:
precedingWord  a  few  some  the
node
banana         2    0     0    1
coconut        0    1     1    1

In [13]: df2 = pd.crosstab(df['node'], df['comp'])

This is clearly much cleaner (and a more efficient algorithm on big data).

Then glue them together with concat with axis=1 (i.e. adding more columns, rather than more rows).

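A sketch of what the keys variant could look like (reconstructed, not the answerer's original snippet; the key names are assumptions):

In [14]: pd.concat([df1, df2], axis=1, keys=["precedingWord", "comp"])
Out[14]:
        precedingWord              comp
                    a few some the  lal lel lil
node
banana              2   0    0   1    1   2   0
coconut             0   1    1   1    1   1   1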

I would probably do it like this (as a MultiIndex); if you want it flat, just don't pass the keys (though there could be a problem with duplicated words):

In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
         a  few  some  the  lal  lel  lil
node
banana   2    0     0    1    1    2    0
coconut  0    1     1    1    1    1    1

Aside: it would be nicer if concat didn't require the column names to be passed explicitly (as the keys kwarg) when they already exist...


Original answer

You can use value_counts:

In [21]: g = df.groupby("node")

In [22]: g["comp"].value_counts()
Out[22]:
node     comp
banana   lel     2
         lal     1
coconut  lal     1
         lel     1
         lil     1
dtype: int64

In [23]: g["precedingWord"].value_counts()
Out[23]:
node     precedingWord
banana   a                2
         the              1
coconut  few              1
         some             1
         the              1
dtype: int64

Putting this into a single frame is a little trickier:

In [24]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)])
Out[24]:
          a  few  lal  lel  lil  some  the
node
banana  NaN  NaN    1    2  NaN   NaN  NaN
coconut NaN  NaN    1    1    1   NaN  NaN
banana    2  NaN  NaN  NaN  NaN   NaN    1
coconut NaN    1  NaN  NaN  NaN     1    1

In [25]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)]).fillna(0)
Out[25]:
         a  few  lal  lel  lil  some  the
node
banana   0    0    1    2    0     0    0
coconut  0    0    1    1    1     0    0
banana   2    0    0    0    0     0    1
coconut  0    1    0    0    0     1    1

Before doing the concat you can map the columns to det1, det2, etc., e.g. if you have the mapping as a dictionary:

In [31]: res = g["comp"].value_counts().unstack(1)

In [32]: res
Out[32]:
comp     lal  lel  lil
node
banana     1    2  NaN
coconut    1    1    1

In [33]: res.columns = res.columns.map({"lal": "det1", "lel": "det2", "lil": "det3"}.get)

In [34]: res
Out[34]:
         det1  det2  det3
node
banana      1     2   NaN
coconut     1     1     1

Alternatively, you could use a list comprehension (if you don't have a dict or specific labels):

In [41]: res = g["comp"].value_counts().unstack(1)

In [42]: res.columns = ['det%s' % (i + 1) for i, _ in enumerate(res.columns)]
