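Python strip() function not working as expected
I am using Python to scrape some data from a website.
I want to do two things:
1. Skip the first two words, "Dubai" and "UAE", because they appear in every scraped result.
2. Save the last two parts into two separate variables, with the extra whitespace removed.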
try:
    area = soup.find('div', 'location')
    area_result = str(area.get_text().strip().encode("utf-8"))
    print "Area: ", area_result
except StandardError as e:
    area_result = "Error was {0}".format(e)
    print area_result
area_result contains the following data:
'UAE \xe2\x80\xaa>\xe2\x80\xaa\n \n Dubai \xe2\x80\xaa>\xe2\x80\xaa\n \n Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n \n Executive Towers \n \n\n\n \n\n\n\t \n\t \n\t \n\t\n\n\n \n ;\n \n \n \n 1.4 km from Burj Khalifa Tower'
I want the result above to be displayed as follows (note the > between Executive Towers and 1.4 km..):
Executive Towers > 1.4 km from Burj Khalifa Tower
2 Answers
2
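area_result = area_result.replace("UAE", "")
area_result = area_result.replace("Dubai", "")
area_result = area_result.strip()
Using a regular expression:
import re
# Drop the invisible U+202A marks (UTF-8 bytes \xe2\x80\xaa) around each '>';
# otherwise the literal "UAE > Dubai >" below never matches.
area_result = area_result.replace('\xe2\x80\xaa', '')
area_result = re.sub(r'\s+', ' ', area_result)
area_result = area_result.replace("UAE > Dubai >", "")
area_result = area_result.strip()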
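As a side note on why strip() alone appears "not to work" here: it only trims whitespace at the two ends of a string and leaves interior runs untouched. The following is a minimal sketch of that behaviour; the sample text and the variable names are made up for illustration and are not taken from the scraped page.

```python
# Hypothetical sample with whitespace both at the ends and in the middle.
text = u'  Executive Towers \n\n\t ;\n  1.4 km  '

# strip() only trims the leading and trailing whitespace.
stripped = text.strip()

# split() with no argument splits on any whitespace run, so joining the
# pieces with a single space also collapses the interior runs.
collapsed = u' '.join(text.split())

print(repr(stripped))
print(repr(collapsed))
```

This is why the answers here combine strip() with either re.sub or a character-by-character cleanup.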
0
How about adding something like this to clean up your string?
import string
def cleanup(s, remove=('\n', '\t')):
    newString = ''
    for c in s:
        # Remove special characters defined above.
        # Then we remove anything that is not printable (for instance \xe2)
        # Finally we remove duplicates within the string matching certain characters.
        if c in remove:
            continue
        elif c not in string.printable:
            continue
        elif len(newString) > 0 and c == newString[-1] and c in ('\n', ' ', ',', '.'):
            continue
        newString += c
    return newString
With the final result:
>>> s = 'UAE \xe2\x80\xaa>\xe2\x80\xaa\n \n Dubai \xe2\x80\xaa>\xe2\x80\xaa\n \n Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n \n Executive Towers \n \n\n\n \n\n\n\t \n\t \n\t \n\t\n\n\n \n ;\n \n \n \n 1.4 km from Burj Khalifa Tower'
>>> cleanup(s)
'UAE > Dubai > Business Bay > Executive Towers ; 1.4 km from Burj Khalifa Tower'
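Here is a good SO reference on the string library.
Back to the question: I see the user does not want the leading blocks (the parts separated by >), so simply take the last one, turn the ; into >, and strip the leftover leading space:
area_result = cleanup(area_result).split('>')[3].replace(';', '>').strip()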
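The question also asks for the last two parts in two separate variables. Here is a minimal sketch of one way to do that, assuming the text is kept as Unicode (so the stray bytes \xe2\x80\xaa are the single character U+202A) rather than encoded to UTF-8 first. The names split_location, building and distance are hypothetical, and the sample string is a shortened version of the data in the question.

```python
import re


def split_location(text):
    """Return the last two parts of a '>' / ';' separated location string."""
    # Remove invisible directional-formatting marks such as U+202A.
    text = re.sub(u'[\u200e\u200f\u202a-\u202e]', u'', text)
    # Split on the separators, then collapse whitespace runs inside each part.
    parts = [u' '.join(p.split()) for p in re.split(u'[>;]', text)]
    # Discard parts that were only whitespace.
    parts = [p for p in parts if p]
    return parts[-2], parts[-1]


raw = (u'UAE \u202a>\u202a\n \n Dubai \u202a>\u202a\n \n Business Bay '
       u'\u202a>\u202a\n \n Executive Towers \n \n\t \n ;\n \n '
       u'1.4 km from Burj Khalifa Tower')

building, distance = split_location(raw)
print(building + u' > ' + distance)
```

Taking the parts from the end (parts[-2], parts[-1]) avoids hard-coding how many leading blocks such as "UAE" and "Dubai" a given page contains.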