How do I compute the co-occurrence of variable values within days to generate an adjacency list?

# Import packages
import pandas as pd
import numpy as np

# Read in data file
df = pd.read_csv(r'C:\Users\james\Desktop\Documents\Downloads\Cybersecurity\cybertime.csv')
df.head()

# Create bigrams of themes by days, based on co-occurrences weighted by frequencies.
# Iterate rows until a new date is found, then compute weighted co-occurrences.
# Weights are products of theme A frequency (freq) and theme B frequency.
# Output the adjacency list.

2 Answers

User

#1 · Edited 2024-04-23 18:23:02

First, you can optionally drop from the initial CSV file every row whose theme is not listed in GDELT-Global_Knowledge_Graph_CategoryList:

df = pd.read_csv('cybertime.csv')
gdelt = pd.read_csv('GDELT-Global_Knowledge_Graph_CategoryList.csv')
df.drop(df.loc[~df.theme.isin(gdelt.Name)].index, inplace=True)   # optional
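If you would rather not modify `df` in place, the same filter can be written as a boolean selection. This is a minimal sketch on made-up data; the `toy` and `allowed` frames are hypothetical stand-ins for `df` and `gdelt`, assuming the same `theme` and `Name` columns:

```python
import pandas as pd

# Hypothetical stand-ins for df (themes with frequencies) and gdelt (reference list).
toy = pd.DataFrame({"theme": ["A", "B", "X"], "freq": [1, 2, 3]})
allowed = pd.DataFrame({"Name": ["A", "B"]})

# Keep only rows whose theme appears in the reference list; returns a new frame.
filtered = toy[toy["theme"].isin(allowed["Name"])].reset_index(drop=True)
```

Here the row with theme "X" is dropped because it is absent from `allowed`.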

Next, you can pivot your dataframe to get a matrix with 30 rows (one per day) and 194 columns (one per theme). If you skip the filter, you get a 30x1028 dataframe instead.

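From there, you can take the matrix product of the transposed matrix with the original matrix: it gives a 194x194 matrix in which each cell holds, for one pair of themes, the sum over days of the products of their frequencies (1028x1028 if unfiltered).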
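To see why the product yields the pair weights, here is a small sketch on a made-up matrix of 3 days and 2 themes (the numbers are illustrative and not taken from cybertime.csv):

```python
import numpy as np

# Hypothetical day-by-theme frequency matrix: 3 days (rows), 2 themes (columns).
X = np.array([[1, 2],
              [0, 3],
              [4, 0]])

# Entry (a, b) of X.T @ X is the sum over days of freq_a * freq_b.
W = X.T @ X

# The same off-diagonal entry computed by hand, day by day.
manual = sum(X[d, 0] * X[d, 1] for d in range(X.shape[0]))
```

The off-diagonal entry `W[0, 1]` matches `manual`, and the diagonal holds each theme's sum of squared frequencies, which you may want to discard when building a graph.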

You then only need to unpivot (melt) that matrix to get the adjacency list.

The code could be:

df2 = df.pivot(index='date', columns='theme', values='freq').fillna(0)

df3 = pd.DataFrame(np.transpose(df2.values) @ df2.values,
                   index=df2.columns, columns = df2.columns)

df4 = df3.rename_axis('theme_A').reset_index().melt(
    id_vars=['theme_A'], var_name='theme_B', value_name='weight')
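As a quick check, the same steps can be run end to end on a tiny made-up dataset. This is a sketch that assumes the `date`, `theme` and `freq` columns used above; the final filter (one row per unordered pair, no self pairs, no zero weights) is an optional addition and not part of the answer's code:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data with the columns used above: date, theme, freq.
toy = pd.DataFrame({
    "date":  ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "theme": ["A", "B", "A", "C"],
    "freq":  [2, 3, 1, 5],
})

# One row per day, one column per theme; themes absent on a day count as 0.
wide = toy.pivot(index="date", columns="theme", values="freq").fillna(0)

# Theme-by-theme weights: sum over days of freq_A * freq_B.
weights = pd.DataFrame(wide.values.T @ wide.values,
                       index=wide.columns, columns=wide.columns)

# Unpivot into (theme_A, theme_B, weight) rows.
edges = weights.rename_axis("theme_A").reset_index().melt(
    id_vars=["theme_A"], var_name="theme_B", value_name="weight")

# Optional: keep each unordered pair once and drop self pairs and zero weights.
edges = edges[(edges.theme_A < edges.theme_B) & (edges.weight > 0)]
edges = edges.reset_index(drop=True)
```

Without the final filter, the melted frame lists every ordered pair, so (A, B) and (B, A) both appear, along with the (A, A) diagonal entries.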

User

#2 · Edited 2024-04-23 18:23:02

You could try using a custom function with groupby and apply on the pandas dataframe. See here

Or do:

df.groupby(['date', 'theme'])['frequency'].apply(lambda x: x.astype(int).sum())
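One place this aggregation may help: `DataFrame.pivot` raises a ValueError when the same (date, theme) pair occurs more than once, so summing per pair first makes the pivot safe. A minimal sketch on made-up data, assuming the frequency column is named `freq` as in the first answer:

```python
import pandas as pd

# Hypothetical data in which theme "A" appears twice on the same day.
toy = pd.DataFrame({
    "date":  ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"],
    "theme": ["A", "A", "B", "A"],
    "freq":  [2, 4, 3, 1],
})

# Sum frequencies per (date, theme) so each pair appears exactly once.
daily = toy.groupby(["date", "theme"], as_index=False)["freq"].sum()

# The aggregated frame can now be pivoted without duplicate-entry errors.
wide = daily.pivot(index="date", columns="theme", values="freq").fillna(0)
```

Note that the groupby on its own only totals frequencies per day and theme; the pair weights still come from the matrix product or an equivalent step.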
