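<p>The simplest approach is to make two passes over the set of CSV files: the first builds a list of all digitized tapes, and the second builds a unique list of all tapes that are not in the digitized list:</p>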
<pre><code>import csv

# `names` is assumed to be defined already (one entry per input file).
# Python 2 style: the files are opened in binary mode ("rb" / "wb").
# On Python 3, open with mode "r" / "w" and newline="" instead.

# build list of digitized tapes
digitized = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        next(reader) # skip header
        for row in reader:
            if row[0] and ((row[1], row[2]) not in digitized):
                digitized.append((row[1], row[2]))

# build list of non-digitized tapes
digitize_me = []
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        header = next(reader)[1:3] # skip / save header
        for row in reader:
            if not row[0] and ((row[1], row[2]) not in digitized + digitize_me):
                digitize_me.append((row[1], row[2]))

# write non-digitized tapes to 'digitize.csv'
with open("digitize.csv","wb") as result:
    writer = csv.writer(result)
    writer.writerow(header)
    for tape in digitize_me:
        writer.writerow(tape)
</code></pre>
<p><em>Input file 1:</em></p>
<pre><code>Date Digitized,Series,Episode Number,Title,Format
01-01-2016,Series A,101,,VHS
,Series A,101,,Beta
,Series C,101,,Beta
,Series D,102,,VHS
,Series B,101,,U-Matic
</code></pre>
<p><em>Input file 2:</em></p>
<pre><code>Date Digitized,Series,Episode Number,Title,Format
,Series B,101,,VHS
,Series D,101,,Beta
01-01-2016,Series C,101,,VHS
</code></pre>
<p><strong>Output:</strong></p>
<pre><code>Series,Episode Number
Series D,102
Series B,101
Series D,101
</code></pre>
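<p>As a rough check of the two-pass idea, here is a minimal Python 3 sketch that runs against the sample input above held in memory. The function name <code>find_undigitized</code>, the use of <code>io.StringIO</code> in place of real files, and the switch from lists to sets for the membership tests are my own assumptions, not part of the code above:</p>
<pre><code>import csv
import io


def find_undigitized(sources):
    """Return (header, tapes): tapes with no digitized copy in any source.

    `sources` is a list of CSV texts (hypothetical in-memory stand-ins for
    the files); the column layout follows the sample input above.
    """
    # First pass: collect every (series, episode) that has a digitized copy.
    digitized = set()
    for text in sources:
        reader = csv.reader(io.StringIO(text))
        next(reader)  # skip header
        for row in reader:
            if row[0]:
                digitized.add((row[1], row[2]))

    # Second pass: collect undigitized tapes once each, in first-seen order.
    header = None
    seen = set()
    digitize_me = []
    for text in sources:
        reader = csv.reader(io.StringIO(text))
        header = next(reader)[1:3]  # skip / save header
        for row in reader:
            key = (row[1], row[2])
            if not row[0] and key not in digitized and key not in seen:
                seen.add(key)
                digitize_me.append(key)
    return header, digitize_me


file1 = """Date Digitized,Series,Episode Number,Title,Format
01-01-2016,Series A,101,,VHS
,Series A,101,,Beta
,Series C,101,,Beta
,Series D,102,,VHS
,Series B,101,,U-Matic
"""
file2 = """Date Digitized,Series,Episode Number,Title,Format
,Series B,101,,VHS
,Series D,101,,Beta
01-01-2016,Series C,101,,VHS
"""

header, tapes = find_undigitized([file1, file2])
print(header)
print(tapes)
</code></pre>
<p>With the sample data this should yield the same three tapes as the output shown above. The sets only change how membership is tested; the order of the result still comes from the <code>digitize_me</code> list.</p>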
<hr/>
<p>Regarding the OP's comment, the line</p>
<pre><code>header = next(reader)[1:3] # skip / save header
</code></pre>
<p>serves two purposes:</p>
<ol>
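<li>Assuming each <code>csv</code> file begins with a header, we do not want
to read the header row as if it contained data about our tapes, so we
need to "skip" the header row in that sense.</li>
<li>We also want to save the relevant part of the header for later, when
we write the output <code>csv</code> file. We want that file to have a header
too. Since we only write the <code>series</code> and <code>episode
number</code>, which are <code>row</code> fields <code>1</code> and <code>2</code>, we assign just that slice,
i.e. <code>[1:3]</code>, of the header row to the <code>header</code> variable.</li>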
</ol>
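<p>Both effects of that line can be seen in isolation in the small Python 3 sketch below. The <code>io.StringIO</code> object is only an assumed stand-in for one of the real input files, using the first two lines of the sample data:</p>
<pre><code>import csv
import io

# In-memory stand-in for one input file (first two lines of the sample data).
source = io.StringIO(
    "Date Digitized,Series,Episode Number,Title,Format\n"
    "01-01-2016,Series A,101,,VHS\n"
)
reader = csv.reader(source)

# next() consumes the header row; the slice keeps only fields 1 and 2.
header = next(reader)[1:3]
print(header)  # ['Series', 'Episode Number']

# The reader has moved past the header, so iteration starts at the data.
first_row = next(reader)
print(first_row[1:3])  # ['Series A', '101']
</code></pre>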
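<p>Having one line of code serve two quite unrelated purposes is not really standard, which is why I commented it. It also assigns to <code>header</code> several times (assuming there are multiple input files) when <code>header</code> only needs to be assigned once. Perhaps a cleaner way to write that section would be:</p>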
<pre><code># build list of non-digitized tapes
digitize_me = []
header = None
for name in names:
    with open("%s_.csv" % name, "rb") as source:
        reader = csv.reader(source)
        if header:
            next(reader) # skip header
        else:
            header = next(reader)[1:3] # read header
        for row in reader:
            ...
</code></pre>
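<p>The question is which form is more readable. It is a close call either way, but I felt that collapsing five lines into one kept the focus on the more salient parts of the code. Next time I would probably write it the other way.</p>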