Unable to prettify scraped HTML in BeautifulSoup

2 votes

3 answers

2459 views

Asked on 2025-04-15 17:45

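I have a small script that uses urllib2 to fetch the contents of a website, finds all the link tags, adds a short piece of HTML at the top and bottom, and then tries to prettify the result. It always fails with TypeError: sequence item 1: expected string, Tag found. I have looked around a lot but can't find where the problem is. As always, any help is much appreciated.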

import urllib2
from BeautifulSoup import BeautifulSoup
import re

reddit = 'http://www.reddit.com'
pre = '<html><head><title>Page title</title></head>'
post = '</html>'
site = urllib2.urlopen(reddit)
html=site.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a')
tags.insert(0,pre)
tags.append(post)
soup1 = BeautifulSoup(''.join(tags))
print soup1.prettify()

Here is the traceback:

Traceback (most recent call last):
  File "C:\Python26\bea.py", line 21, in <module>
    soup1 = BeautifulSoup(''.join(tags))
TypeError: sequence item 1: expected string, Tag found


3 Answers

Jonathan's answer has a small syntax error; here is the corrected version:

    soup1 = BeautifulSoup(''.join([unicode(tag) for tag in tags]))
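
The fix works because `''.join()` accepts only strings, while `findAll` returns Tag objects; after `pre` is inserted at index 0, the first Tag sits at index 1, which is the item the error message names. The sketch below reproduces this without BeautifulSoup. The `Tag` class here is a hypothetical stand-in, not the library's class, and the code is written for Python 3, where `str()` plays the role of `unicode()`. The exact wording of the error message varies slightly between Python versions.

```python
class Tag(object):
    """Hypothetical stand-in for a BeautifulSoup Tag (not the real class)."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        # Like a real Tag, converting to text yields the markup.
        return self.markup


# Same shape as the list in the question: string, Tag objects, string.
items = ['<html>', Tag('<a href="#">x</a>'), '</html>']

error = None
try:
    ''.join(items)  # fails: item 1 is a Tag, not a string
except TypeError as exc:
    error = str(exc)

# Converting every item to text first makes the join succeed.
joined = ''.join(str(item) for item in items)

print(error)
print(joined)
```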

Answered on 2025-04-15 by Python大师


soup1 = BeautifulSoup(''.join(unicode(tag) for tag in tags))


Answered on 2025-04-15 by Python大师


This approach worked for me:

soup1 = BeautifulSoup(''.join(str(t) for t in tags))
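
For readers on Python 3, the same idea might look like the sketch below. It assumes the `bs4` package (BeautifulSoup 4) is installed, which is not what the question uses, and it parses a small inline HTML string (made-up sample data) instead of downloading a page, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the downloaded page (hypothetical data).
html = '<p><a href="/r/pics/">pics</a> and <a href="/r/funny/">funny</a></p>'
pre = '<html><head><title>Page title</title></head>'
post = '</html>'

soup = BeautifulSoup(html, 'html.parser')

# Convert each Tag to text before joining; the wrapper strings are
# concatenated separately so the list holds only link markup.
links = [str(tag) for tag in soup.find_all('a')]

soup1 = BeautifulSoup(pre + ''.join(links) + post, 'html.parser')
pretty = soup1.prettify()
print(pretty)
```

Keeping `pre` and `post` out of the list avoids mixing plain strings and Tag objects in one sequence, which is what triggered the original error.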

This pyparsing-based solution also produces decent output:

from pyparsing import makeHTMLTags, originalTextFor, SkipTo, Combine

# makeHTMLTags defines HTML tag patterns for given tag string
aTag,aEnd = makeHTMLTags("A")

# makeHTMLTags by default returns a structure containing
# the tag's attributes - we just want the original input text
aTag = originalTextFor(aTag)
aEnd = originalTextFor(aEnd)

# define an expression for a full link, and use a parse action to
# combine the returned tokens into a single string
aLink = aTag + SkipTo(aEnd) + aEnd
aLink.setParseAction(lambda tokens : ''.join(tokens))

# extract links from the input html
links = aLink.searchString(html)

# build list of strings for output
out = []
out.append(pre)
out.extend(['  '+lnk[0] for lnk in links])
out.append(post)

print '\n'.join(out)

It prints:

<html><head><title>Page title</title></head>
  <a href="http://www.reddit.com/r/pics/" >pics</a>
  <a href="http://www.reddit.com/r/reddit.com/" >reddit.com</a>
  <a href="http://www.reddit.com/r/politics/" >politics</a>
  <a href="http://www.reddit.com/r/funny/" >funny</a>
  <a href="http://www.reddit.com/r/AskReddit/" >AskReddit</a>
  <a href="http://www.reddit.com/r/WTF/" >WTF</a>
  .
  .
  .
  <a href="http://reddit.com/help/privacypolicy" >Privacy Policy</a>
  <a href="#" onclick="return hidecover(this)">close this window</a>
  <a href="http://www.reddit.com/feedback" >volunteer to translate</a>
  <a href="#" onclick="return hidecover(this)">close this window</a>
</html>

Answered on 2025-04-15 by Python大师

