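How can I merge two CSV files on a field and keep the same number of attributes in every record?
I am trying to merge two CSV files based on a specific field in each file.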
file1.csv
id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"
file2.csv
id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False
This is the code I am using:
import csv
from collections import OrderedDict

# Index the file2 rows by id (first column)
with open('file2.csv', 'r', newline='') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader, None)  # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}

# Same for file1, keeping the original row order
with open('file1.csv', 'r', newline='') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader, None)  # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)

# Append the file2 values to the file1 values for each id
result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.items():
        result.setdefault(key, []).extend(value)

with open('merged.csv', 'w', newline='') as f:
    w = csv.writer(f)
    for key, value in result.items():
        w.writerow([key] + value)
This is the output I get. The merge itself works, but not every row has the same number of attributes:
1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure
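file2 will not contain a record for every id in file1. I would like the merged output to show empty fields for the missing file2 values. For example, the record with id 1 should look like this: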
1,True,7,Purple,,,
How can I add empty fields for the records that have no data in file2, so that every record in the merged CSV has the same number of attributes?
3 Answers
1
Use a dict of dicts and then update it. Something like this:
import csv
from collections import OrderedDict

with open('file2.csv', 'r', newline='') as f2:
    reader = csv.reader(f2)
    lines2 = list(reader)
with open('file1.csv', 'r', newline='') as f1:
    reader = csv.reader(f1)
    lines1 = list(reader)

# Map each id to a {column name: value} dict
dict1 = {row[0]: dict(zip(lines1[0][1:], row[1:])) for row in lines1[1:]}
dict2 = {row[0]: dict(zip(lines2[0][1:], row[1:])) for row in lines2[1:]}

# merge
updatedDict = OrderedDict()
# "?" marks a missing value; use "" instead to get empty fields
mergedAttrs = OrderedDict.fromkeys(lines1[0][1:] + lines2[0][1:], "?")
for id, attrs in dict1.items():
    d = mergedAttrs.copy()
    d.update(attrs)
    updatedDict[id] = d
for id, attrs in dict2.items():
    # setdefault avoids a KeyError for ids that appear only in file2
    updatedDict.setdefault(id, mergedAttrs.copy()).update(attrs)

# out (rows are sorted by id, compared as strings)
with open('merged.csv', 'w', newline='') as f:
    w = csv.writer(f)
    for id, rest in sorted(updatedDict.items()):
        w.writerow([id] + list(rest.values()))
9
You can use pandas to do this:
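import pandas

csv1 = pandas.read_csv('file1.csv')
csv2 = pandas.read_csv('file2.csv')
# how='left' keeps every file1 row; the default inner join would drop ids missing from file2
merged = csv1.merge(csv2, on='id', how='left')
merged.to_csv("output.csv", index=False)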
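I have not tested this code yet, but it should point you in the right direction; I will try it when I get a chance. The code is fairly straightforward: first, import the pandas library so you can use it. Next, read the two CSV files with pandas.read_csv, then combine them with the merge method. The on parameter specifies which column is used as the key. Finally, the merged data is written to output.csv.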
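To illustrate how the join type affects the result: merge performs an inner join unless how is given, so ids present in only one frame are dropped. The sketch below is a minimal illustration, not part of the answer above; it assumes small in-memory frames that reuse the sample ids with only one attribute column per file for brevity.

```python
import pandas as pd

# Illustrative frames: same ids as the sample files, one attribute column each
df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                    "attr1": [True, False, False, True, True, False]})
df2 = pd.DataFrame({"id": [2, 5, 3],
                    "attr4": ["python", "program", "Another string"]})

inner = df1.merge(df2, on="id")             # default: how="inner"
left = df1.merge(df2, on="id", how="left")  # keep every row of df1

print(len(inner), "rows with the default inner join")
print(len(left), "rows with how='left'")
print(left)
```

In the left join, the rows without a match in df2 hold NaN in attr4, and to_csv writes NaN as an empty field by default.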
9
Without the pandas library, I would change the code to something like this:
import csv
from collections import OrderedDict

filenames = "file1.csv", "file2.csv"
data = OrderedDict()
fieldnames = []
for filename in filenames:
    with open(filename, "r", newline="") as fp:  # on Python 2, open with "rb" instead
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            data.setdefault(row["id"], {}).update(row)

# Drop duplicate column names (such as "id") while keeping their order
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.values():
        # Attributes missing from a row are written as empty fields
        writer.writerow([row.get(field, '') for field in fieldnames])
This gives:
id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
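The same padding can also be delegated to csv.DictWriter, whose restval argument supplies the value for any field a row lacks. The following is a minimal sketch under stated assumptions, not the code from this answer: merge_csv is a hypothetical helper, and it reads from in-memory text streams with shortened sample data so that it runs without any files on disk.

```python
import csv
import io


def merge_csv(sources, key="id"):
    """Merge CSV text streams on `key` (hypothetical helper).

    Fields missing from a row are written as empty strings.
    """
    rows = {}        # key value -> merged row dict (insertion-ordered)
    fieldnames = []  # union of all column names, in first-seen order
    for src in sources:
        reader = csv.DictReader(src)
        for name in reader.fieldnames:
            if name not in fieldnames:
                fieldnames.append(name)
        for row in reader:
            rows.setdefault(row[key], {}).update(row)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows.values())
    return out.getvalue()


# Shortened, illustrative data: id 1 has no counterpart in the second stream
first = io.StringIO("id,attr1\n1,True\n2,False\n")
second = io.StringIO("id,attr4\n2,python\n")
merged_text = merge_csv([first, second])
print(merged_text)
```

With real files, the streams would be file objects opened with newline="" in place of the io.StringIO instances.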
For comparison, the pandas equivalent would be something like:
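import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
# An outer join keeps ids from both files; fillna("") blanks the missing values
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)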
In my opinion this is much simpler, and it lets you spend more time working with your data instead of reinventing the wheel.
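One caveat with the pandas route: read_csv infers column types, so values may be reformatted on output (an integer in a column that also holds decimals is written back as a float, for example). A minimal sketch of one way to keep the original text, assuming that reading every column with dtype=str is acceptable for the task; the data is a shortened in-memory copy of the sample files.

```python
import io
import pandas as pd

# Shortened, illustrative copies of the sample files
FILE1 = '''id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
4,True,2,"Nope"
'''
FILE2 = '''id,attr4,attr5,attr6
2,"python",500000.12,False
'''

# dtype=str disables type inference, so "7" stays "7" instead of becoming 7.0
df1 = pd.read_csv(io.StringIO(FILE1), dtype=str)
df2 = pd.read_csv(io.StringIO(FILE2), dtype=str)

merged = df1.merge(df2, on="id", how="outer").fillna("")
csv_text = merged.to_csv(index=False)
print(csv_text)
```

Because the ids are then strings, an outer join orders them lexicographically, which matters once ids reach two digits.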