Parsing data from a text file

Posted 2024-06-16 10:54:23


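I have built a contact form that sends me an email for every user who registers. My question is mostly about parsing some text data into CSV format. I have received messages from several users in my mailbox, and I copied them into a text file. The data looks like this: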

Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3@gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes

Name: testuser4
Email: tuser4@yahoo.com
Cluster Name: Mediterranea
Contact No.: 7892174896
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2

Name: Test User
Email: testuser@yahoo.co.in
Cluster Name: RD
Contact No.: 09833123445
Coming: Yes
Members Participating: 2

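As shown, the data contains some common fields and some fields that are not always present. I am looking for a solution or suggestions on how to parse this data so that, under a "Name" heading, I collect the name values in that column, and likewise for the other fields. For the "Members Participating" data, I can take the number and add it under the same heading in the Excel sheet; if a user did not provide this information, the cell can be left empty.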


3 answers

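You can use the blank lines between records to mark the end of a record. Then process the input file line by line and build a list of dictionaries. Finally, write the dictionaries out to a CSV file.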

from csv import DictWriter
from collections import OrderedDict

with open('input') as infile:
    registrations = []
    fields = OrderedDict()
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:    # handle EOF
            registrations.append(d)


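# To force a specific column order, uncomment the next line and comment out the one after it.
# fieldnames = ['Name', 'Email', 'Cluster Name', 'Contact No.', 'Coming', 'Members Participating']
fieldnames = list(fields)

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('registrations.csv', 'w', newline='') as outfile:
    writer = DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(registrations)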

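This code tries to collect the field names automatically, and uses the order in which each unique key is first seen in the input. If you need a specific field order in the output, you can get it by uncommenting the corresponding line.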
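As a minimal sketch of the field-name collection described above (not the answer's exact code): on Python 3.7+, plain dicts preserve insertion order, so `dict.fromkeys` can stand in for `OrderedDict`. The helper name `collect_fieldnames` and the sample records are hypothetical.

```python
def collect_fieldnames(records):
    """Return the unique keys of all records, in first-seen order."""
    seen = {}
    for record in records:
        # Updating with an existing key does not change its position,
        # so each key stays where it was first seen
        seen.update(dict.fromkeys(record))
    return list(seen)


# Hypothetical sample records for illustration
records = [
    {"Name": "testuser2", "Coming": "Yes"},
    {"Name": "tuser5", "Coming": "Yes", "Members Participating": "2"},
]
print(collect_fieldnames(records))
```

The resulting list can be passed directly as the `fieldnames` argument of `csv.DictWriter`.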

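Running this code on the sample input produces a registrations.csv with a header row followed by one row per registration; the Members Participating column is left empty for records that do not include it.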

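Let's break the problem into smaller sub-problems:

  1. Split the large block of text into individual registrations
  2. Convert each of those registrations into a dictionary
  3. Write the list of dictionaries to CSV

First, let's split the block of registration data into separate elements: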

DATA = '''
Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3@gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes
'''

def parse_registrations(data):
    data = data.strip()
    return data.split('\n\n')

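This function returns a list with one string per registration. The interactive examples that follow assume the result has been bound to regs, i.e. regs = parse_registrations(DATA).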
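The blank-line split above assumes records are separated by exactly one empty line with Unix line endings. A hedged variant, assuming the input may contain Windows line endings or whitespace-only separator lines, could use a regular expression instead; `split_registrations` is a hypothetical name, not part of the answer.

```python
import re


def split_registrations(data):
    """Split a text block into registrations separated by blank lines."""
    # Normalize Windows line endings, then trim leading/trailing whitespace
    data = data.replace("\r\n", "\n").strip()
    if not data:
        return []
    # One or more empty or whitespace-only lines act as a single separator
    return re.split(r"\n\s*\n", data)


# Hypothetical input with mixed line endings and a whitespace-only line
sample = "Name: a\nComing: Yes\r\n  \r\n\r\nName: b\nComing: No\n"
print(split_registrations(sample))
```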

Next, we can turn each of these substrings into a list of (key, value) pairs:

>>> [field.split(': ', 1) for field in regs[0].split('\n')]
[['Name', 'testuser2'], ['Email', 'testuser2@gmail.com'], ['Cluster Name', 'o  b'], ['Contact No.', '12346971239'], ['Coming', 'Yes']]

The dict() function can convert a list of (key, value) pairs into a dictionary:

>>> dict(field.split(': ', 1) for field in regs[0].split('\n'))
{'Coming': 'Yes', 'Cluster Name': 'o  b', 'Name': 'testuser2', 'Contact No.': '12346971239', 'Email': 'testuser2@gmail.com'}

We can pass these dictionaries to a csv.DictWriter to write the records as CSV, with a default supplied for any missing values.

>>> w = csv.DictWriter(open("/tmp/foo.csv", "w"), fieldnames=["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"])
>>> w.writeheader()
>>> w.writerow({'Name': 'Steve'})
12
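To make the missing-value behaviour concrete, here is a small sketch that writes to an in-memory buffer. `restval` and `extrasaction` are documented `csv.DictWriter` parameters; the `"N/A"` placeholder, the `"Unexpected"` key, and the sample rows are illustrative assumptions.

```python
import csv
import io

columns = ["Name", "Coming", "Members Participating"]
buffer = io.StringIO()

# restval fills in columns that a row does not provide;
# extrasaction="ignore" drops keys that are not listed in fieldnames
writer = csv.DictWriter(buffer, fieldnames=columns, restval="N/A",
                        extrasaction="ignore")
writer.writeheader()
writer.writerow({"Name": "testuser2", "Coming": "Yes"})
writer.writerow({"Name": "tuser5", "Coming": "Yes",
                 "Members Participating": "2", "Unexpected": "x"})

print(buffer.getvalue())
```

With the default `restval=''`, the missing cell is simply left empty, which matches what the question asks for.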

Now, let's put it all together!

import csv

DATA = '''
Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2
'''

COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]

def parse_registration(reg):
    return dict(field.split(': ', 1) for field in reg.split('\n'))

def parse_registrations(data):
    data = data.strip()
    regs = data.split('\n\n')
    return [parse_registration(r) for r in regs]

def write_csv(data, filename):
    regs = parse_registrations(data)
    with open(filename, 'w', newline='') as f:  # newline='' as recommended for csv files
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(regs)

if __name__ == '__main__':
    write_csv(DATA, "/tmp/test.csv")

Output:

$ python3 write_csv.py

$ cat /tmp/test.csv
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o  b,12346971239,Yes,
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
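Note that `parse_registration` above raises a `ValueError` if any line lacks the `': '` separator, because `dict()` needs two-element pairs. A more forgiving sketch, assuming malformed lines should simply be skipped, could use `str.partition`; the name `parse_registration_lenient` is hypothetical.

```python
def parse_registration_lenient(reg):
    """Convert one registration block into a dict, skipping malformed lines."""
    record = {}
    for line in reg.splitlines():
        # partition splits on the first colon only, so values may contain colons
        key, sep, value = line.partition(":")
        if not sep:
            # No colon on this line: not a "key: value" field, so skip it
            continue
        record[key.strip()] = value.strip()
    return record


# Hypothetical block containing one malformed line
block = "Name: Test User\nthis line has no separator\nContact No.: 09833123445"
print(parse_registration_lenient(block))
```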

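The following program should do what you want. The overall strategy:

  • First read all of the email files and parse the data "by hand", then
  • use csv.DictWriter.writerows() to write the data to a CSV file.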

import sys
import csv

# Usage:
# python cfg2csv.py input1.cfg input2.cfg ...
# The data is combined and written to 'output.csv'

def parse_file(data):
    total_result = []
    single_result = []
    for line in data:
        line = line.strip()
        if line:
            single_result.append([item.strip() for item in line.split(':', 1)])
        else:
            if single_result:
                total_result.append(dict(single_result))
            single_result = []
    if single_result:
        total_result.append(dict(single_result))
    return total_result

def read_file(filename):
    with open(filename) as fp:
        return parse_file(fp)

# First parse the data:
data = sum((read_file(filename) for filename in sys.argv[1:]), [])
keys = set().union(*data)

# Next write the data to a CSV file
with open('output.csv', 'w', newline='') as fp:  # newline='' as recommended for csv files
    writer = csv.DictWriter(fp, sorted(keys))
    writer.writeheader()
    writer.writerows(data)
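The `set().union(*data)` call works because iterating a dict yields its keys, and `sorted(keys)` then puts the columns in alphabetical order. As a sketch of the same combining step, assuming `data_per_file` holds hypothetical parsed results from two files, `itertools.chain.from_iterable` can flatten the lists without the repeated list copying that `sum(..., [])` performs:

```python
from itertools import chain

# Hypothetical parsed output of two input files
data_per_file = [
    [{"Name": "testuser2", "Coming": "Yes"}],
    [{"Name": "tuser5", "Coming": "Yes", "Members Participating": "2"}],
]

# Flatten the per-file lists into one list of records
data = list(chain.from_iterable(data_per_file))

# Union of all keys; iterating a dict yields its keys
keys = set().union(*data)

print(sorted(keys))
```

If the original column order matters more than alphabetical order, an explicit list of field names can be passed to `csv.DictWriter` instead of `sorted(keys)`.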
