<h2>Fix the files:</h2>
<ul>
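<li>Use <code>m = re.findall(r'(?<=[a-zA-Z])\s+\n[a-zA-Z]', text)</code> to find cases like <code>,green \ngrape</code>
<ul>
<li>The pattern finds whitespace and a newline sitting between two letters (<code>alpha \nalpha</code>) and ignores <code>alpha \nnumeric</code>, so a genuine row break before a numeric <code>orderid</code> is left alone</li>
<li><code>m</code> is a list of all the matches (e.g. <code>[' \ng']</code>)</li>
<li>Each match is then replaced, e.g. <code>text.replace(' \ng', ' g')</code>, which results in <code>,green grape</code></li>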
</ul></li>
<li>Use <a href="https://docs.python.org/3/library/pathlib.html" rel="nofollow noreferrer"><code>pathlib</code></a> to find all the files
<ul>
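<li><code>.rglob</code> searches the directory and all of its subdirectories. If all the files are in one directory, use <code>.glob</code></li>
<li><code>pathlib</code> treats paths as objects rather than strings, so <code>pathlib</code> objects have many useful methods</li>
<li><code>.stem</code> returns the file name without its extension</li>
<li><code>.suffix</code> returns the file extension (e.g. <code>.csv</code>)</li>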
</ul></li>
<li>This does not overwrite the existing files. It creates a new file with <code>_fixed</code> added to the name.</li>
</ul>
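<p>The pattern from the first bullet can be checked on a single string before it is run against real files. This is only a minimal sketch: the <code>sample</code> text is a made-up stand-in for file contents, not data from the question.</p>

```python
import re

# Pattern from the answer, written as a raw string:
# whitespace followed by a newline, preceded by a letter and followed by a letter.
pattern = r'(?<=[a-zA-Z])\s+\n[a-zA-Z]'

# Hypothetical sample: "green " is split from "grape" by a stray line break,
# while "peter" is followed by a genuine row break (the next row starts with a digit).
sample = "3523,apple,84,peter\n2522,green \ngrape, 99, mary\n"

matches = re.findall(pattern, sample)
print(matches)

fixed = sample
for match in matches:
    # keep the trailing letter and collapse the whitespace + newline to one space
    fixed = fixed.replace(match, f' {match[-1]}')
print(fixed)
```

<p>One limitation to keep in mind: a row that ends in a letter plus trailing whitespace, followed by a row that starts with a letter, would also be joined, so the approach assumes every real row starts with a numeric <code>orderid</code>.</p>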
<pre class="lang-py prettyprint-override"><code>import csv
import re
from pathlib import Path

# list of all the files
files = list(Path(r'c:\some_path').rglob('*.csv'))

# iterate through each file
for file in files:
    # create new filename name_fixed
    new_file = file.with_name(f'{file.stem}_fixed{file.suffix}')
    # read all the text in as a string
    text = file.read_text()
    # find and fix the sections that need fixing
    m = re.findall(r'(?<=[a-zA-Z])\s+\n[a-zA-Z]', text)
    for match in m:
        # replace the whitespace + newline with a single space, keeping the letter
        text = text.replace(match, f' {match[-1:]}')
    # strip each line and skip blank ones (e.g. after the final newline)
    text_list = [x.strip() for x in text.split('\n') if x.strip()]
    # write the new file
    with new_file.open('w', newline='') as f:
        w = csv.writer(f, delimiter=',')
        w.writerows([x.split(',') for x in text_list])
</code></pre>
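<p>The way the new file name is built can be tried in isolation. The sketch below assumes a hypothetical file called <code>orders.csv</code> and uses <code>PureWindowsPath</code>, which only manipulates the path text, so nothing needs to exist on disk and it behaves the same on any operating system.</p>

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only to illustrate the attributes
file = PureWindowsPath(r'c:\some_path\orders.csv')

print(file.stem)    # file name without the extension
print(file.suffix)  # extension, including the leading dot

# same expression as in the script: insert "_fixed" before the extension
new_file = file.with_name(f'{file.stem}_fixed{file.suffix}')
print(new_file)
```

<p>Because the script collects every <code>*.csv</code>, running it a second time would also pick up the <code>_fixed</code> files from the first run, so it may be worth skipping names that already end in <code>_fixed</code>.</p>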
<h2>Example:</h2>
<h3>Given a <code>.csv</code> with the following contents (each <code>green</code> line ends with a trailing space):</h3>
<pre class="lang-py prettyprint-override"><code>orderid,fruit,count,person
3523,apple,84,peter
2522,green
grape, 99, mary
1299, watermelon, 93, paul
3523,apple,84,peter
2522,green
banana, 99, mary
1299, watermelon, 93, paul
3523,apple,84,peter
2522,green
apple, 99, mary
1299, watermelon, 93, paul
</code></pre>
<h3>New file:</h3>
<pre class="lang-py prettyprint-override"><code>orderid,fruit,count,person
3523,apple,84,peter
2522,green grape, 99, mary
1299, watermelon, 93, paul
3523,apple,84,peter
2522,green banana, 99, mary
1299, watermelon, 93, paul
3523,apple,84,peter
2522,green apple, 99, mary
1299, watermelon, 93, paul
</code></pre>
<h2>Create a DataFrame:</h2>
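<pre class="lang-py prettyprint-override"><code>import pandas as pd
from pathlib import Path

# collect the fixed files; use .rglob if they are spread across subdirectories
new_files = list(Path(r'c:\some_path').glob('*_fixed.csv'))
# read each file and combine them into a single DataFrame
df = pd.concat([pd.read_csv(f) for f in new_files])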
</code></pre>
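<p>The fixed files still contain a space after some commas (for example <code>, mary</code>), so text columns would keep that leading space when read with the defaults. A possible refinement, not part of the original answer, is to pass <code>skipinitialspace=True</code> to <code>pd.read_csv</code>. The sketch below uses an in-memory string as a stand-in for one <code>_fixed.csv</code> file.</p>

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of one *_fixed.csv file
fixed_csv = (
    "orderid,fruit,count,person\n"
    "3523,apple,84,peter\n"
    "2522,green grape, 99, mary\n"
    "1299, watermelon, 93, paul\n"
)

# skipinitialspace=True drops the whitespace that follows each delimiter
df = pd.read_csv(io.StringIO(fixed_csv), skipinitialspace=True)
print(df)
```

<p>The same keyword can be added to the <code>pd.read_csv(f)</code> call inside the list comprehension above.</p>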