How do I remove line breaks and line separators from a page's title tag? (Google App Engine - Python)

2 votes

2 answers

4620 views

Asked 2025-04-16 14:22

I have some code that extracts a web page's title:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
title = str(soup.html.head.title.string).lstrip("\r\n").rstrip("\r\n")

Some sites put line breaks or spaces before and after the title text (why do they do that?). To remove that extra whitespace, I added:

.lstrip("\r\n").rstrip("\r\n")

This works on, for example, http://www.readwriteweb.com/, but not on http://poundwire.com/. Can you tell me why it works for one and not the other?

Update

Following Steve Jessop's comment, I switched to replace, and it seems to work:

title = str(soup.html.head.title.string).replace("\t", "").replace("\r", "").replace("\n", "")

If there is a better way, please let me know. Thanks.

Update 2

I found an answer that looks better:

title = " ".join(str(soup.html.head.title.string).split())


2 Answers

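On the poundwire site, there is a tab character inside the <title> tag. There are also some spaces (the indentation you would see in "view source"), which you probably want to remove as well.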

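As samplebias says, strip() removes whitespace from both ends of the string. Also, find a text editor with a "visible whitespace" mode, turn it on, and never turn it off again; it really is useful :-)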
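To illustrate the difference, here is a minimal sketch using a made-up title string (not the actual poundwire markup). When given an argument, lstrip and rstrip remove only the listed characters and stop at the first character outside that set, so a tab or space at either end blocks them. strip() with no argument removes all leading and trailing whitespace.

```python
# Hypothetical raw title: newline, tab, and spaces around the text.
raw = "\n\t  Pound Wire  \t\r\n"

# With an argument, only the listed characters are stripped, and
# stripping stops at the first character that is not in the set.
partial = raw.lstrip("\r\n").rstrip("\r\n")

# Without an argument, all leading and trailing whitespace is removed.
clean = raw.strip()

print(repr(partial))
print(repr(clean))
```

Here partial still carries the tabs and spaces at both ends, while clean contains only the title text.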

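By the way, if you're on Google App Engine, that means you're using Python 2.5, which in turn means str is the non-Unicode string type. BeautifulSoup goes to some effort to convert its input to Unicode, so it would be a shame to throw an exception as soon as you hit a page with non-ASCII characters in its title.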
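A small sketch of that failure mode, written so that it also runs on Python 3: in Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, which the explicit encode("ascii") call below mimics. The sample title is invented for illustration. Keeping the value as Unicode and cleaning it with split and join avoids the exception.

```python
# Hypothetical title containing a non-ASCII character (e-acute).
title = u"\n  Caf\u00e9 Reviews \t\n"

# Python 2's str(title) implicitly encodes to ASCII; the explicit
# call below reproduces that behaviour.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Stay in Unicode and normalise whitespace instead of calling str().
clean = u" ".join(title.split())

print(ascii_ok)
print(repr(clean))
```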

Edit: a third case:

$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib
>>> soup = BeautifulSoup(urllib.urlopen('http://code.google.com/p/google-refine/'))
>>> soup.html.head.title.string
u'\n google-refine -\n \n \n Google Refine, a power tool for working with messy data (formerly Freebase Gridworks) - Google Project Hosting\n '
>>>

So the trailing space means your rstrip doesn't remove the \n near the end.
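As a rough illustration using the string from the session above (the whitespace is copied as displayed, so the exact run lengths are an assumption): rstrip("\r\n") leaves it unchanged because the last character is a space, strip() cleans the ends but keeps the interior newlines, and the split/join idiom from the question's second update collapses every whitespace run to a single space.

```python
title = (u"\n google-refine -\n \n \n Google Refine, a power tool for "
         u"working with messy data (formerly Freebase Gridworks) - "
         u"Google Project Hosting\n ")

# The string ends with a space, so rstrip("\r\n") removes nothing.
unchanged = title.rstrip("\r\n") == title

# strip() trims both ends but leaves the newlines in the middle.
ends_only = title.strip()

# split() with no argument splits on any run of whitespace; joining with
# a single space collapses tabs, newlines, and repeated spaces.
normalized = u" ".join(title.split())

print(unchanged)
print(repr(ends_only))
print(repr(normalized))
```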

Answered 2025-04-16 by Python大师


Try str(title).strip(); it removes all whitespace characters from the beginning and end of the string.

Answered 2025-04-16 by Python大师

