Removing unwanted line breaks from a CSV file

import pandas as pd

superhero_df = pd.read_csv("superheroes.csv", sep=' *; *', skiprows=12, names=["location", "superhero", "superpower"], index_col=False, engine="python")
superhero_df = superhero_df.replace('\r\n', '', regex=True)

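3 Answers

User

Answer 1 · edited 2024-05-23 21:52:36

The regular expression below removes unwanted line breaks and other whitespace after every third field. It assumes that no field contains an internal semicolon: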

import re

# `line` holds the raw text to clean, for example the contents of the file
print(re.sub(r'([^;]*);\s*([^;]*);\s*([^;]*);\s+', r'\1;\2;\3\n',
      line, flags=re.M))
#New York City; Iron Man;no superpowers
#Metropolis;Superman;superpowers
#New York City;Spider-Man;superpowers
#Gotham;Batman;no superpowers
#New York City;Doctor Strange;superpowers

You can run it in a loop to preprocess the file before loading it with Pandas.
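
As a rough illustration of that preprocessing step, the sketch below applies the same pattern to a whole block of text at once. The helper name `join_records` and the sample text are assumptions made for this example, not part of the original answer.

```python
import re

# Same pattern as in the answer: three semicolon-terminated fields,
# followed by at least one whitespace character.
RECORD = re.compile(r'([^;]*);\s*([^;]*);\s*([^;]*);\s+')

def join_records(text):
    """Collapse each group of three fields onto one line (hypothetical helper)."""
    return RECORD.sub(r'\1;\2;\3\n', text)

# Hypothetical sample in which the second record is spread over three lines.
sample = (
    "Gotham; Batman; no superpowers;\n"
    "New York City;\n"
    "Spider-Man;\n"
    "superpowers;\n"
)

print(join_records(sample), end="")
```

Because the pattern ends in `\s+`, a final record that is not followed by any whitespace (for example, a file with no trailing newline) would be left unchanged.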

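User

Answer 2 · edited 2024-05-23 21:52:36

If I were you, I would rewrite the whole data set into a new text file with a single pass over the source file, then load the resulting file into Pandas. No re is needed: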

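with open('source.txt') as fin, open('target.txt', 'w') as fout:
    lc = 0  # semicolons seen so far in the current record
    for line in fin:
        lc += line.count(';')
        if lc < 3:
            # Record is incomplete: write the line without its trailing newline
            fout.write(line[:-1])
        else:
            # Record is complete: keep the newline and reset the counter
            fout.write(line)
            lc = 0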

Result:

# New York City; Iron Man; no superpowers;
# Metropolis; Superman; superpowers;
# New York City;Spider-Man;superpowers;
# Gotham; Batman; no superpowers;
# New York City; Doctor Strange; superpowers;

Reading it into Pandas:

pd.read_csv('target.txt', header=None, sep=';', usecols=range(3))

#                0                1                2
# 0  New York City         Iron Man   no superpowers
# 1     Metropolis         Superman      superpowers
# 2  New York City       Spider-Man      superpowers
# 3         Gotham           Batman   no superpowers
# 4  New York City   Doctor Strange      superpowers

Note: usecols is needed only because of the trailing semicolons. You can avoid it by stripping them while rewriting the file:

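with open('source.txt') as fin, open('target.txt', 'w') as fout:
    lc = 0  # semicolons seen so far in the current record
    for line in fin:
        lc += line.count(';')
        if lc < 3:
            # Record is incomplete: write the line without surrounding whitespace
            fout.write(line.strip())
        else:
            # Record is complete: drop the trailing semicolon and end the line
            fout.write(line.strip()[:-1] + '\n')
            lc = 0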

Reading it into Pandas:

pd.read_csv('target.txt', header=None, sep=';')

#                0                1                2
# 0  New York City         Iron Man   no superpowers
# 1     Metropolis         Superman      superpowers
# 2  New York City       Spider-Man      superpowers
# 3         Gotham           Batman   no superpowers
# 4  New York City   Doctor Strange      superpowers
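
If you would rather skip the intermediate file, the same counting idea can build the cleaned text in memory. This is only a sketch under stated assumptions: the helper name `clean_text`, its `fields` parameter, and the sample lines are hypothetical. Unlike the loop above, it strips every trailing semicolon rather than exactly one.

```python
import io

def clean_text(lines, fields=3):
    """Join physical lines into records of `fields` semicolon-terminated fields.

    Hypothetical helper: returns the cleaned text instead of writing a file.
    """
    out = io.StringIO()
    parts, seen = [], 0
    for line in lines:
        parts.append(line.strip())
        seen += line.count(";")
        if seen >= fields:
            # Record complete: drop trailing semicolon(s) and end the line.
            out.write("".join(parts).rstrip(";") + "\n")
            parts, seen = [], 0
    tail = "".join(parts)
    if tail:
        # Incomplete last record: keep it as-is instead of dropping it.
        out.write(tail + "\n")
    return out.getvalue()

# Hypothetical input with mixed line endings and no final newline.
sample = [
    "Metropolis; Superman; superpowers;\n",
    "New York City;\r\n",
    "Spider-Man;\n",
    "superpowers;",
]
cleaned = clean_text(sample)
print(cleaned, end="")
```

Since pd.read_csv accepts file-like objects, the returned string can be wrapped in io.StringIO and passed in place of the filename.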

User

Answer 3 · edited 2024-05-23 21:52:36

How about:

^([^;]+);[\r\n]*([^;]+);[\r\n]*([^;]+);

Replace with:

\1;\2;\3;

import re

regex = r"^([^;]+);[\r\n]*([^;]+);[\r\n]*([^;]+);"

test_str = ("New York City; Iron Man; no superpowers;\n"
    "Metropolis; Superman; superpowers;\n"
    "New York City;\n"
    "Spider-Man;\n"
    "superpowers;\n"
    "Gotham; Batman; no superpowers;\n"
    "New York City; Doctor Strange; superpowers;\n\n")

subst = "\\1;\\2;\\3;"

# count=0 replaces every match; change count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.DOTALL)

if result:
    print(result)
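
To check that the substitution really leaves one record per line, one option is to parse the cleaned text with the standard csv module. The sketch below is an assumption-laden illustration: the sample text and variable names are made up for the example, and the trailing semicolon produces an empty fourth field that is sliced off.

```python
import csv
import io
import re

regex = r"^([^;]+);[\r\n]*([^;]+);[\r\n]*([^;]+);"

# Hypothetical sample: the first record is split across three lines.
text = (
    "New York City;\n"
    "Spider-Man;\n"
    "superpowers;\n"
    "Gotham; Batman; no superpowers;\n"
)

cleaned = re.sub(regex, r"\1;\2;\3;", text, flags=re.MULTILINE)

# Parse the cleaned text; keep the first three fields of each non-empty row
# and strip the spaces that follow the semicolons.
rows = [
    [field.strip() for field in row[:3]]
    for row in csv.reader(io.StringIO(cleaned), delimiter=";")
    if row
]

for row in rows:
    print(row)
```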
