Grouping by month in a JSON file

Posted 2024-05-12 18:13:00


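I have a JSON file containing newspaper articles. Each line holds one article's date, title, and body. I want to count, per month, how many articles mention any of a set of keywords in their text. So far I can only print the full date of each match, but I would like a count per month: for example, instead of January, January, January, something like January = 3. My code so far: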

# import json module for parsing
import json
import re

# define a list of keywords
keywords = ('tax', 'Tax', 'policy', 'Policy',  'regulation', 'Regulation', 
 'spending', 'Spending', 'budget', 'Budget', 'oil', 'Oil',
 'Holyrood', 'holyrood', 'Scottish parliament', 'Scottish Parliament', 'scottish parliament' )

with open('Aberdeen2005.json') as json_file:

    # read json file line by line
    for line in json_file.readlines():
        json_dict = json.loads(line)

        if any(keyword in json_dict["body"].lower() for keyword in keywords):
            print(json_dict['date'].split()[0])

2 answers

This is only an example, since you did not show what your JSON file looks like:

import re

months = ('January', 
         'February', 
         'March', 
         'April',
         'May', 
         'June', 
         'July',
         'August',
         'September',
         'October',
         'November',
         'December')

file_content = '''
December 29, 2005 Thursday
December 15, 2005 Thursday
April 21, 2005
April 6, 2005
January 19, 2005
January 19, 2005
January 11, 2005
'''

d = {m:0 for m in months}

for line in file_content.splitlines():
    if line != '':
        # split on runs of commas/whitespace and drop any empty strings
        data = [x for x in re.split(r'[,\s]+', line) if x]
        d[data[0]] += 1  # group by the month name at the start of the line

print(d)
print(d['January'])

Output

{'August': 0, 'July': 0, 'November': 0, 'December': 2, 'April': 2, 'May': 0, 'October': 0, 'January': 3, 'September': 0, 'June': 0, 'March': 0, 'February': 0}
3
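
Applied to the file from the question, the same grouping could be sketched with collections.Counter. This is only a sketch under assumptions: the sample lines below are made up to stand in for Aberdeen2005.json, the helper name count_months is hypothetical, and the date string is assumed to start with the month name.

```python
import json
from collections import Counter

# Hypothetical sample standing in for the JSON Lines file. The field names
# "date" and "body" follow the question; the contents are invented.
sample_lines = [
    '{"date": "January 19, 2005", "title": "a", "body": "New tax rules announced"}',
    '{"date": "January 11, 2005", "title": "b", "body": "Oil prices rise again"}',
    '{"date": "April 6, 2005", "title": "c", "body": "Local football results"}',
    '{"date": "December 29, 2005 Thursday", "title": "d", "body": "Holyrood debates the budget"}',
]

# Lowercase only, because the body is lowercased before matching.
keywords = ('tax', 'policy', 'regulation', 'spending', 'budget', 'oil',
            'holyrood', 'scottish parliament')


def count_months(lines, keywords):
    """Count, per month name, the articles whose body mentions any keyword."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        article = json.loads(line)
        body = article["body"].lower()
        if any(keyword in body for keyword in keywords):
            # Assumption: the date string starts with the month name.
            counts[article["date"].split()[0]] += 1
    return counts


month_counts = count_months(sample_lines, keywords)
for month, n in month_counts.items():
    print(f"{month} = {n}")
```

With the real file, an open file object can be passed in place of sample_lines, since iterating over it also yields one line at a time.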

You could try this with pandas:

import pandas
import json

# Note: if this works, the file holds one JSON object per line
# rather than a single well-formed JSON document
with open('Aberdeen2005.json') as json_file:
    df = pandas.DataFrame([json.loads(line) for line in json_file])

# Parse dates and set index
df.date = pandas.to_datetime(df.date)
df.set_index('date', inplace=True)

# match keywords (uses the keywords tuple defined in the question)
matchingbodies = df[df.body.str.contains("|".join(keywords))].body

# Count by month
counts = matchingbodies.groupby(lambda x: x.month).agg(len)
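
For a version that runs without the original file, here is a minimal pandas sketch on made-up data. The sample records and the "Month day, year" date format are assumptions, not taken from the real file. It matches keywords case-insensitively and reports month names rather than month numbers.

```python
import json
import re

import pandas as pd

# Hypothetical records standing in for Aberdeen2005.json (one JSON object per line).
sample_lines = [
    '{"date": "January 19, 2005", "title": "a", "body": "New tax rules announced"}',
    '{"date": "January 11, 2005", "title": "b", "body": "Oil prices rise again"}',
    '{"date": "April 6, 2005", "title": "c", "body": "Local football results"}',
    '{"date": "December 29, 2005", "title": "d", "body": "Holyrood debates the budget"}',
]

keywords = ('tax', 'policy', 'regulation', 'spending', 'budget', 'oil',
            'holyrood', 'scottish parliament')

df = pd.DataFrame([json.loads(line) for line in sample_lines])

# Assumed date format; adjust it if the real dates carry extra text.
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y")

# Build one alternation pattern, escaping each keyword so it is matched literally.
pattern = "|".join(re.escape(keyword) for keyword in keywords)
mask = df["body"].str.contains(pattern, case=False, regex=True)

# Count matching articles per month name.
month_counts = df.loc[mask, "date"].dt.month_name().value_counts()
print(month_counts)
```

Months with no matching article simply do not appear in the result.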
