How to remove part of a Unicode string in Python based on a special character

数据工程师

Asked 2025-04-16 05:09

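First, a short summary:

Python version: 3.1

System: Linux (Ubuntu)

I am trying to extract some data using Python and BeautifulSoup.

Unfortunately, some of the tables I need to process have cells containing text like this:

789.82 ± 10.28

To get this working, I need to solve two problems:

First, how do I handle "odd" symbols such as "±"?

Second, how do I remove the "±" and everything to the right of it?

The error I currently get is: SyntaxError: Non-ASCII character '\xc2' in file ......

Thanks for your help.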

[Edit]:

# data retrieval from html files from DETHERM
# -*- coding: utf8 -*-

import sys,os,re
from BeautifulSoup import BeautifulSoup


sys.path.insert(0, os.getcwd())

raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)


for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1

        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
            if col2.index('±'):
                col2=col2[:col2.index('±')]
            print(col)
        print(numrow)
        ref=numdiv.find('a')
        niceref=''.join(ref.findAll(text=True))
        print(niceref)

This code now produces the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Where is ASCII coming from here?
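
A likely explanation, assuming the script is actually being run by a Python 2 interpreter (the `from BeautifulSoup import BeautifulSoup` import and the earlier SyntaxError wording both suggest this): BeautifulSoup returns Unicode text, while the bare literal '±' is a byte string holding the two UTF-8 bytes 0xC2 0xB1. When Python 2 mixes the two, it implicitly decodes the byte string with the ASCII codec, and that fails on the first byte. If that is the case, writing the literal as u'±' avoids the implicit conversion. The sketch below is written for Python 3 and only reproduces the decoding failure in isolation; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-

# UTF-8 encodes '±' (U+00B1) as the two bytes 0xC2 0xB1.
raw = '±'.encode('utf-8')
assert raw == b'\xc2\xb1'

# Decoding those bytes as ASCII fails on the first byte, 0xC2,
# which is the same failure reported in the traceback above.
message = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the correct codec gives back the one-character string.
symbol = raw.decode('utf-8')
print(symbol)
```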


1 Answer

You need to make sure your Python file is saved in UTF-8 encoding. Beyond that, the problem is trivial:

>>> s = '789.82 ± 10.28'
>>> s[:s.index('±')]
'789.82 '
>>> s.partition('±')
('789.82 ', '±', ' 10.28')
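
Applied to a whole table, the same idea could be wrapped in a small helper. This is a minimal sketch for Python 3; the function name `strip_uncertainty` and the sample cell values are made up for illustration. It uses `partition` rather than `index` because `index` raises ValueError when the symbol is missing and returns 0 (which is falsy) when the symbol is the first character, so a test like `if col2.index('±'):` is not reliable.

```python
# -*- coding: utf-8 -*-

def strip_uncertainty(text, marker='±'):
    """Return the part of text before marker, with surrounding whitespace removed.

    str.partition returns the whole string as its first element when the
    marker is absent, so cells without an uncertainty pass through unchanged.
    """
    head, _sep, _tail = text.partition(marker)
    return head.strip()


# Hypothetical cell contents: with a marker, without one, and marker first.
cells = ['789.82 ± 10.28', '42.0', '± 3']
values = [strip_uncertainty(c) for c in cells]
print(values)  # ['789.82', '42.0', '']
```

The stripped strings can then be passed to `float()` where a number is expected; the empty string from the last cell would need separate handling.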

Answered 2025-04-16 by Python大师
