Search and replace the cent symbol in Python

0 votes

2 answers

557 views

Asked 2025-04-18 15:35

OS: CentOS 6.5
Python version: 2.7.5

I have a file with some information in it. I want to replace each cent sign (¢) with a leading $0. so the amount is written in dollars.

Alpha $1.00
Beta  ¢55  <<<< note
Charlie $2.00
Delta  ¢23  <<<< note

I want it to look like this:

Alpha $1.00
Beta  $0.55  <<<< note
Charlie $2.00
Delta  $0.23  <<<< note

On the command line, this works fine:

sed 's/¢/$0./g' *file name*

But the same thing written in Python does not:

import subprocess
hello = subprocess.call('cat datafile ' + '| sed "s/¢/$0./g"',shell=True)
print hello

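Every time I try to paste the ¢ symbol, something seems to go wrong.

Slightly better: when I print the Unicode code point for the cent sign in Python, I get this: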

print(u"\u00A2")
Â¢

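When I look at my data file, it actually shows the ¢ symbol, but without the Â. << not sure whether this helps

I think the character in front of the ¢ is what keeps me from doing the search and replace when I use Unicode.

The error I get when trying to use Unicode: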

hello = subprocess.call(u"cat datafile | sed 's/\uxA2/$0./g'",shell=True)
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 25-26: truncated \uXXXX escape

After correcting uxA2 to u00A2, I got this:

sed: -e expression #1, char 7: unknown option to `s'
1

Any ideas or suggestions?

With both of these examples, I get the error below:

[root@centOS user]# python test2.py
Traceback (most recent call last):
  File "test2.py", line 3, in <module>
    data = data.decode('utf-8')             # decode immediately to Unicode
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 6: invalid start byte

[root@centOS user]# python test1.py
Traceback (most recent call last):
  File "test1.py", line 11, in <module>
    hello_unicode = hello_utf8.decode('utf-8')
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 6: invalid start byte

This is the content of the file:

[root@centOS user]# cat datafile
alpha ¢79

This is the data file as shown in Nano:

alpha �79

This is the data file as shown in Vim:

[root@centOS user]# vim fbasdf
alpha Â¢79
~

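Thanks again for all the help

ANSWER!!

The sed output from Rob and Thomas works. The file was saved as charset=iso-8859-1, which is why I could not search the document for UTF-8 encoded characters.

Detected character set of the file: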

file -bi datafile
text/plain; charset=iso-8859-1
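A short sketch of why the charset matters here, offered as an illustration rather than a diagnosis of every symptom above. It assumes the file holds the cent sign as the single ISO-8859-1 byte 0xA2, as the tracebacks suggest, and that the terminal or editor showing "Â¢" was reading UTF-8 bytes as ISO-8859-1. All variable names are made up for the example.

```python
# Sketch: the same cent sign in two encodings, and what goes wrong when they are mixed.
CENT = u'\u00a2'

latin1_bytes = CENT.encode('iso-8859-1')   # one byte: 0xA2
utf8_bytes = CENT.encode('utf-8')          # two bytes: 0xC2 0xA2

# UTF-8 bytes interpreted as ISO-8859-1 yield U+00C2 (A with circumflex)
# followed by the cent sign, which matches the stray character seen earlier.
mojibake = utf8_bytes.decode('iso-8859-1')

# A lone 0xA2 is only valid as a continuation byte in UTF-8, so decoding
# an ISO-8859-1 line as UTF-8 fails at that byte.
line = b'alpha \xa279\n'
try:
    line.decode('utf-8')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the encoding the file really uses succeeds.
decoded = line.decode('iso-8859-1')
```

Here `decode_error.start` is 6, the same position reported in the tracebacks, because "alpha " occupies bytes 0 through 5.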

I used the following command to convert the file:

iconv -f iso-8859-1 -t utf8 datafile > datafile1
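If you would rather do the conversion from Python than call iconv, something along these lines should be roughly equivalent. This is a minimal sketch: `convert_encoding` and its parameter names are hypothetical, and it reads the whole file into memory, which is fine for small data files only.

```python
import io


def convert_encoding(src_path, dst_path, src_enc='iso-8859-1', dst_enc='utf-8'):
    """Re-encode a text file, similar in spirit to `iconv -f SRC -t DST`."""
    # newline='' leaves line endings exactly as they are in the source file.
    with io.open(src_path, 'r', encoding=src_enc, newline='') as src:
        text = src.read()
    with io.open(dst_path, 'w', encoding=dst_enc, newline='') as dst:
        dst.write(text)
```

Calling `convert_encoding('datafile', 'datafile1')` would then mirror the iconv command above.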

2 Answers

Also, change your string to a unicode string and replace the cent symbol with \u00A2.

Here is the corrected code:

import subprocess
# Single quotes around the sed script keep the shell from expanding $0.
hello = subprocess.call(u"cat datafile | sed 's#\u00A2#$0.#g'",shell=True)
print hello
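One thing to watch in commands like this is the quoting around the sed script. Inside double quotes the shell expands `$0` (typically to the shell's own name, such as /bin/sh) before sed ever sees it, and the extra slashes would be consistent with the "unknown option to `s'" error in the question. Inside single quotes `$0` stays literal. The sketch below uses `echo` instead of sed just to show what the shell hands over; it assumes a POSIX shell is available, and the variable names are illustrative.

```python
import subprocess

# Double quotes: the shell substitutes $0 before the command runs.
double_quoted = subprocess.check_output('echo "s/x/$0./g"', shell=True)

# Single quotes: the text reaches the command unchanged.
single_quoted = subprocess.check_output("echo 's/x/$0./g'", shell=True)
```

Passing the command as a list without `shell=True` would sidestep shell quoting altogether, at the cost of not being able to use a pipe in the command string.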

Answered 2025-04-18 by Python大师

Borrowing from Thomas's answer and expanding on it:

import subprocess

# Keep all strings in unicode as long as you can.
cmd_unicode = u"sed 's/\u00A2/$0./g' < datafile"

# only convert them to encoded byte strings when you send them out
# also note the use of .check_output(), NOT .call()
cmd_utf8 = cmd_unicode.encode('utf-8')
hello_utf8 = subprocess.check_output(cmd_utf8, shell=True)

# Decode any incoming byte string to unicode immediately on receipt
hello_unicode = hello_utf8.decode('utf-8')

# And you have your answer
print hello_unicode

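The code above demonstrates a pattern called the "Unicode sandwich": bytes on the outside, Unicode on the inside. For more, see this link: http://nedbatchelder.com/text/unipain.html

For this simple example, you can do everything in Python: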

with open('datafile') as datafile:
    data = datafile.read()              # Read in bytes
data = data.decode('utf-8')             # decode immediately to Unicode
data = data.replace(u'\xa2', u'$0.')    # Do all operations in Unicode
print data                              # Implicit encode during output
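Since the file in the question turned out to be ISO-8859-1 rather than UTF-8, the same sandwich works once the decode step uses the right encoding. The following is a sketch under that assumption; `replace_cents` and its parameters are hypothetical names, and the output encoding defaults to UTF-8 only for illustration.

```python
def replace_cents(raw_bytes, in_enc='iso-8859-1', out_enc='utf-8'):
    """Decode bytes, replace the cent sign with '$0.', and encode again."""
    text = raw_bytes.decode(in_enc)              # bytes in, Unicode inside
    text = text.replace(u'\u00a2', u'$0.')       # all work done on Unicode
    return text.encode(out_enc)                  # bytes back out
```

For the data file you would read it in binary mode, for example `open('datafile', 'rb').read()`, and pass the result to `replace_cents`.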

Answered 2025-04-18 by Python大师
