Return a lowercase ASCII string from a string fetched with urllib2 or BeautifulSoup

2 votes

3 answers

1842 views

Asked 2025-04-17 11:06

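I'm using urllib2 to fetch data from web pages. All the pages are in English, so handling non-English text is not an issue. The pages are encoded, however, and sometimes contain HTML entities such as the pound sign (£) or the copyright symbol.

I want to check whether parts of a page contain certain keywords, and (obviously) I want that check to be case-insensitive.

So what is the best way to convert the returned page content to all lowercase?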

import urllib2

def get_page_content_as_lower_case(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    temp = page.read()

    return str(temp).lower()  # this doesn't work because the page contains utf-8 data

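[[Update]]

I don't necessarily have to use urllib2 to fetch the data. In fact I could use BeautifulSoup, since I need data from specific elements in the page, and BeautifulSoup is much better at that. I have changed the title to reflect this.

However, the problem remains: the fetched data is in some non-ASCII encoding, (supposedly) UTF-8. I checked one of the pages and found that it was encoded as iso-8859-1.

Since I only care about English content, I want to know how to get a lowercase ASCII string version of the data fetched from the page, so that I can run a case-insensitive keyword test on it.

I assume that restricting myself to English (pages from English-speaking websites) reduces the choice of encodings? I don't know much about encodings, but I think the valid choices are: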

ASCII
iso-8859-1
utf-8

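Is that a reasonable assumption? If so, is there a way to write a "robust" function that accepts an encoded string containing English text and returns a lowercase ASCII string version of it?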

Tags: string handling, encoding conversion, HTML entities, beautifulsoup, data scraping, keyword matching, ASCII encoding, case conversion

3 Answers

Alternatively, use the Requests library:

import requests

page_text = requests.get(url).text
lowercase_text = page_text.lower()

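(Requests decodes the response content automatically.)

As @tchrist points out, .lower() does not do the job for Unicode text.

You could have a look at this alternative regex implementation, which supports Unicode case-insensitive comparison: http://code.google.com/p/mrab-regex-hg/

Case-folding tables are also available: http://unicode.org/Public/UNIDATA/CaseFolding.txt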
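
For the English-only case described in the question, a lowercase ASCII version can be approximated with the standard library alone. The following is a minimal sketch, not a definitive implementation: it assumes Python 3 (the code in this thread targets Python 2) and text that has already been decoded, and the helper name to_lower_ascii is made up for illustration. Characters with no ASCII equivalent, such as £ or ©, are simply dropped.

```python
import html
import unicodedata


def to_lower_ascii(text):
    """Return a lowercase, ASCII-only approximation of already-decoded text."""
    # Turn entities such as '&pound;' or '&copy;' into real characters.
    text = html.unescape(text)
    # Split accented letters into a base letter plus combining marks.
    text = unicodedata.normalize('NFKD', text)
    # Drop everything that is not ASCII (combining marks, currency signs, ...).
    ascii_text = text.encode('ascii', 'ignore').decode('ascii')
    return ascii_text.lower()


print(to_lower_ascii('Caf&eacute; SALE &pound;5 &copy; Acme'))
```

Because non-ASCII characters are discarded rather than transliterated, this only suits keyword tests where the keywords themselves are plain English.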

Answered 2025-04-17 by Python大师

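Case-insensitive string search is more complicated than simply lowercasing the text. For example, a German user would expect the search term Straße to match both STRASSE and Straße. But 'STRASSE'.lower() yields 'strasse', which does not equal 'straße', and you can't simply replace every double s with ß either: the word Trasse, for instance, contains no ß. Other languages (Turkish in particular) pose similar problems.

So, if you ever want to support languages other than English, you should use a library that handles case conversion properly, such as Matthew Barnett's regex module.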
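
To make the Straße case concrete, here is a small sketch that assumes Python 3, where str.casefold() applies Unicode full case folding (the code in this thread is Python 2, which has no such method). The helper contains_keyword is a hypothetical name used only for illustration.

```python
def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text, ignoring case.

    Both sides are case-folded, so 'ß' matches 'ss' and the
    ligature 'ﬆ' matches 'st'.
    """
    folded = text.casefold()
    return any(keyword.casefold() in folded for keyword in keywords)


print('STRASSE'.lower() == 'Straße'.lower())        # False
print('STRASSE'.casefold() == 'Straße'.casefold())  # True
print(contains_keyword('and poﬆ', ['post']))        # True
```

Case folding is locale-independent, so language-specific rules such as the Turkish dotted and dotless i are not covered by this approach.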

That being said, the way to extract the page's content is:

import contextlib
import urllib2

def get_page_content(url):
    with contextlib.closing(urllib2.urlopen(url)) as uh:
        # Assumes the page is UTF-8; other encodings need a different codec
        content = uh.read().decode('utf-8')
    # You can call .lower() on the result, but that won't work in general
    return content
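
Since the question mentions pages in both UTF-8 and iso-8859-1, the decoding step could try several encodings in turn instead of hard-coding one. The sketch below assumes Python 3, and decode_page with its declared_charset parameter is a hypothetical helper, not part of any library. On Python 3 the declared charset would typically come from the response's headers.get_content_charset(). ASCII needs no separate branch because it is a subset of both other encodings.

```python
def decode_page(raw, declared_charset=None):
    """Decode raw page bytes to text, trying the likeliest encodings in turn."""
    for encoding in (declared_charset, 'utf-8'):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown charset name: try the next candidate.
            continue
    # iso-8859-1 maps every byte to a character, so this never fails.
    return raw.decode('iso-8859-1')


print(decode_page(b'\xc2\xa35 off'))  # UTF-8 bytes for a pound sign
print(decode_page(b'\xa35 off'))      # iso-8859-1 bytes for a pound sign
```

The final fallback always succeeds, which also means it can silently produce wrong characters for pages in some other encoding, so a declared charset should be preferred whenever the server provides one.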

Answered 2025-04-17 by Python大师

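BeautifulSoup stores data as Unicode internally, so you don't need to handle character encodings manually.

To find keywords in the text case-insensitively (note that this does not search attribute values or tag names):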

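#!/usr/bin/env python
import urllib2
from contextlib import closing

import regex # pip install regex
from BeautifulSoup import BeautifulSoup

URL = 'http://example.com/'  # placeholder; replace with the page to search

with closing(urllib2.urlopen(URL)) as page:
    soup = BeautifulSoup(page)
    # (?f) enables full case-folding, (?i) case-insensitive matching;
    # \L<keywords> matches any string from the named list
    print soup(text=regex.compile(ur'(?fi)\L<keywords>',
                                  keywords=['your', 'keywords', 'go', 'here']))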
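
The same idea, searching text nodes only, can also be sketched with the standard library. This is an illustration under stated assumptions rather than a replacement for the BeautifulSoup version: it assumes Python 3, and TextCollector and find_text_with_keywords are hypothetical names. Attribute values, tag names and comments never reach handle_data, so they are not searched.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes only; tags, attributes and comments are skipped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def find_text_with_keywords(html_source, keywords):
    """Return the text nodes that contain any keyword, ignoring case."""
    collector = TextCollector()
    collector.feed(html_source)
    collector.close()
    folded_keywords = [k.casefold() for k in keywords]
    return [chunk for chunk in collector.chunks
            if any(k in chunk.casefold() for k in folded_keywords)]


sample = ('<div title="PoSt in attribute"><!-- post in comment -->'
          '<p>Post found</p><p>ignored</p></div>')
print(find_text_with_keywords(sample, ['post']))
```

Unlike a real HTML tree, this collects the contents of script and style elements as text too, so those may need filtering for actual pages.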

Example (Unicode words courtesy of @tchrist)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex
from BeautifulSoup import BeautifulSoup, Comment

html = u'''<div attr="PoSt in attribute should not be found">
<!-- it must not find post inside a comment either -->
<ol> <li> tag names must not match
<li> Post will be found
<li> the same with post
<li> and poﬆ
<li> and poﬅ
<li> this is ignored
</ol>
</div>'''

soup = BeautifulSoup(html)

# remove comments
comments = soup.findAll(text=lambda t: isinstance(t, Comment))
for comment in comments: comment.extract()

# find text with keywords (case-insensitive)
print ''.join(soup(text=regex.compile(ur'(?fi)\L<opts>', opts=['post', 'li'])))
# compare it with '.lower()'
print '.lower():'
print ''.join(soup(text=lambda t: any(k in t.lower() for k in ['post', 'li'])))
# or exact match
print 'exact match:'
print ''.join(soup(text=' the same with post\n'))

Output

 Post will be found
 the same with post
 and poﬆ
 and poﬅ

.lower():
 Post will be found
 the same with post

exact match:
 the same with post

Answered 2025-04-17 by Python大师
