Python: re.sub has no effect

2 votes

2 answers

596 views

Asked 2025-04-17 22:53

I have the following code:

def gettextbyxpath(tree, xpath):
    node = tree.xpath(xpath)[0]
    try:
        text = etree.tostring(node, method="text", encoding='UTF-8').strip()
        text = re.sub(' +',' ', text)
        text = re.sub('\n+','\n', text)
        text = re.sub('\n \n','\n', text)
    except:
        text = 'ERROR'
    return text

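In the last re.sub call, I am trying to remove lines that contain only a single space. There are a lot of these lines in the real data.

When I run the code above on its own it works fine, but in my real code the last call does nothing at all! I compared the files generated with and without that line, and there is no difference.

Sample input: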

        Brand：

   777,Royal Lion



    Main Products:

           battery, 777, carbon zinc, paper jacket,

I want to remove the vertical whitespace between the lines.

Does anyone know why my code behaves this way?

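2 Answers

The following code removes tabs, newlines, and every run of two or more spaces; single spaces are kept. Note that matches are replaced with an empty string, so the lines end up joined into one.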

import re

a ="""
 Brand：

 777,Royal Lion



 Main Products:

 battery, 777, carbon zinc, paper jacket,
"""
# Match runs of newlines/tabs, or runs of two or more spaces
p = re.compile(r'[\n\t]+|[ ]{2,}')
print(p.sub('', a))
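
If the line breaks should survive, one possible variant is sketched below. It is not part of the answer above: the helper name squeeze_keep_lines and the sample string (with an ASCII colon) are assumptions made for illustration. It first shows that the original pattern joins everything into one line, then collapses each newline together with its surrounding whitespace into a single newline.

```python
import re

# Hypothetical sample, shaped like the input in the question.
a = ("\n Brand:\n\n 777,Royal Lion\n\n\n\n Main Products:\n\n"
     " battery, 777, carbon zinc, paper jacket,\n")

# The pattern from the answer: newlines are replaced with '', so lines are joined.
joined = re.sub(r'[\n\t]+|[ ]{2,}', '', a)

def squeeze_keep_lines(text):
    """Hypothetical helper: keep one newline between non-empty lines."""
    # A newline plus any whitespace around it (including further
    # whitespace-only lines) becomes exactly one newline.
    text = re.sub(r'\s*\n\s*', '\n', text.strip())
    # Collapse remaining runs of spaces/tabs inside a line to one space.
    return re.sub(r'[ \t]{2,}', ' ', text)

kept = squeeze_keep_lines(a)
print(repr(joined))
print(kept)
```

Because \s also matches newlines, the first substitution swallows whitespace-only lines in the same pass, which avoids chaining several re.sub calls.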

Answered 2025-04-17 by Python大师

As for why your code behaves this way: the value of text returned by the second call to re.sub does not contain the pattern you are trying to replace in the last call to re.sub.

>>> text = re.sub('\n+', '\n', text) # 2nd call to re.sub
>>> text
'Brand：\n 777,Royal Lion\n Main Products:\n battery, 777, carbon zinc, paper jacket,'

So you need to drop the second \n from the pattern in the last call to re.sub:

text = re.sub('\n ','\n', text)

This gives:

Brand：
777,Royal Lion
Main Products:
battery, 777, carbon zinc, paper jacket,
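
The explanation above can be reproduced with a minimal sketch. The sample string is hypothetical (ASCII colon, shortened last line) and assumes the blank lines in the input are completely empty, as the REPL output above implies.

```python
import re

# Hypothetical input; blank lines are assumed to contain no characters at all.
raw = ("Brand:\n\n   777,Royal Lion\n\n\n\n    Main Products:\n\n"
       "           battery, 777")

text = re.sub(' +', ' ', raw)       # 1st call: collapse runs of spaces
text = re.sub('\n+', '\n', text)    # 2nd call: collapse runs of newlines

# Every newline is now followed by a space and then content,
# so the sequence '\n \n' never occurs and the 3rd call has nothing to match.
has_pattern = '\n \n' in text
unchanged = re.sub('\n \n', '\n', text)

# Dropping the second \n from the pattern removes the leading space instead.
fixed = re.sub('\n ', '\n', text)

print(has_pattern)          # False
print(unchanged == text)    # True
print(fixed)
```

If the real data does contain whitespace-only lines, the intermediate text will look different from this sketch, so it is worth printing repr(text) after each call to see what the next pattern is actually matched against.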

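Another solution

from lxml import etree  # assumed: tree.xpath suggests lxml

def gettextbyxpath(tree, xpath):
    node = tree.xpath(xpath)[0]
    try:
        # tostring returns bytes for a byte encoding, so decode before string operations
        text = etree.tostring(node, method="text", encoding='UTF-8').decode('UTF-8').strip()
        # Strip every line and keep only the lines that still have content
        text = '\n'.join(line.strip() for line in text.split('\n') if line.strip())
    except Exception:
        text = 'ERROR'
    return text

Output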

Brand：
777,Royal Lion
Main Products:
battery, 777, carbon zinc, paper jacket,

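The difference from the previous approach is that instead of chaining re.sub substitutions, we first split the output of etree.tostring on \n. We then filter out the lines that become empty strings after calling .strip(). That leaves only the lines with actual content, with the whitespace on both sides removed. Finally, we join those lines with a newline (\n) to get the final result.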
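
The split, filter, and join steps can be tried without lxml on a plain string. In the sketch below, the function name clean_lines and the sample text are assumptions for illustration; the sample deliberately includes lines holding only spaces or a tab, which is the case the regex chain in the question struggled with.

```python
def clean_lines(text):
    """Hypothetical helper: strip each line and drop lines that end up empty."""
    # splitlines() is used here instead of split('\n') so that '\r\n'
    # line endings are handled as well.
    return '\n'.join(line.strip() for line in text.splitlines() if line.strip())

# Hypothetical sample with empty, space-only, and tab-only lines.
sample = (
    "        Brand:\n"
    " \n"
    "   777,Royal Lion\n"
    "\t\n"
    " \n"
    "\n"
    "    Main Products:\n"
    "  \n"
    "           battery, 777, carbon zinc, paper jacket,\n"
)

cleaned = clean_lines(sample)
print(cleaned)
```

Since str.strip() removes any mix of spaces, tabs, and carriage returns, this approach does not depend on how many whitespace characters a "blank" line contains.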

Answered 2025-04-17 by Python大师
