Problem writing BeautifulSoup output to a .txt file

Asked 2025-04-17 21:18

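I'm scraping web articles with BeautifulSoup. The output prints correctly, but the complete content is not written to the file. The output seems to break off as soon as it reaches a sentence in quotation marks. The relevant code is below. Any suggestions would be very helpful.

This URL can be used to reproduce the result: http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306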

import urllib2
from bs4 import BeautifulSoup
import re
import codecs

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

# Make sure file is clear for new content
open('ctp_output.txt', 'w').close()

# Open txt document for output
txt = open('ctp_output.txt', 'w')

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# retrieve all of the paragraph tags
with open('ctp_output.txt', 'w'):
    for tag in soup.find_all('p'):
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')

# Close txt file with new content added
txt.close()


3 Answers

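If you applied Hugh Bothwell's changes and then got this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 21-23: ordinal not in range(128), you can try the following.

Open the text file with codecs.open() or io.open() and specify a suitable text encoding (that is, pass encoding="..." in the call), rather than using open() to open a byte file.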

fp = codecs.open('ctp_output.txt', 'w', encoding="utf-8")

Then write the content directly.

fp.write("what you want to write")

You need to import codecs first to use this approach.
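
A minimal sketch of this approach, using io.open() so that the same code runs on Python 2 and Python 3. The helper name write_paragraphs, the sample strings, and the temporary path are made up for illustration; in the real script the strings would come from tag.get_text().

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_paragraphs(path, paragraphs):
    """Write each paragraph followed by a blank line, encoded as UTF-8."""
    # io.open() returns a text-mode file that encodes unicode text on write.
    with io.open(path, "w", encoding="utf-8") as fp:
        for para in paragraphs:
            fp.write(para + u"\n\n")


# Hypothetical sample text with curly quotes, an example of non-ASCII
# characters that the 'ascii' codec cannot encode.
sample = [u"He said \u201cwe will stay\u201d.", u"Second paragraph."]
out_path = os.path.join(tempfile.mkdtemp(), "ctp_output.txt")
write_paragraphs(out_path, sample)

# Read the file back with the same encoding to check what was written.
with io.open(out_path, "r", encoding="utf-8") as fp:
    written = fp.read()
```

Because the encoding is fixed when the file is opened, the writing code can pass text straight through and never needs to call encode() itself.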

Answered 2025-04-17 by Python大师


There are a few problems:

for tag in tags:
    f.write(tag.get_text() + '\n' + '\n')

needs to be indented further (it should be a child block of with open('ctp_output.txt', 'w') as f:);

txt.close()

is redundant, because the with statement already ensures that the file gets closed;

I don't see anything missing from the output. Can you point to a sentence that disappeared?
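
To illustrate the first two points, here is a small sketch with the loop placed inside the with block. The list tags_text is a stand-in for the text of the parsed <p> tags, and the temporary path is an assumption made for the example.

```python
import os
import tempfile

# Stand-in for [tag.get_text() for tag in soup.find_all('p')].
tags_text = ["First paragraph.", "Second paragraph."]
out_path = os.path.join(tempfile.mkdtemp(), "ctp_output.txt")

with open(out_path, "w") as f:
    # The loop is indented under the with statement, so every write
    # happens while the file is still open.
    for text in tags_text:
        f.write(text + "\n" + "\n")

# No explicit close() call: the with statement closed the file on exit.
closed_after_block = f.closed

# Read the result back to confirm both paragraphs were written.
with open(out_path) as check:
    contents = check.read()
```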

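Edit: This looks like a Python 3 problem; the code runs fine in Python 2.7.5.

Edit 2: Fixed by decoding the downloaded page with str.decode() (bytes.decode() in Python 3):

Your code can be simplified to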

from bs4 import BeautifulSoup

import sys
if sys.hexversion < 0x3000000:
    # Python 2.x
    from urllib2 import urlopen
    inp = raw_input
else:
    # Python 3.x
    from urllib.request import urlopen
    inp = input

def get_paras(url):
    page = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(page)
    return [para.get_text() for para in soup('p')]

def write_lst(f, lst, fmt="{}\n\n".format):
    for item in lst:
        f.write(fmt(item))

def main():
    url   = inp("Please enter a fully qualified URL: ")
    fname = inp("Please enter the output file name: ")

    with open(fname, "w") as outf:
        write_lst(outf, get_paras(url))

if __name__=="__main__":
    main()
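
The decode step in get_paras matters on Python 3 because urlopen(url).read() returns bytes rather than text. A tiny sketch of the difference, with a hard-coded byte string standing in for a downloaded page (an assumption made so that the example needs no network access):

```python
# Bytes as they might arrive from urlopen(url).read(); the content is
# a made-up stand-in, not the real article.
raw = b"<p>He said \xe2\x80\x9cwe will stay\xe2\x80\x9d.</p>"

# Decoding turns the UTF-8 bytes into text; each curly quote becomes
# one character instead of a three-byte sequence.
page = raw.decode("utf-8")

# True on both Python 2 (unicode) and Python 3 (str).
is_text = not isinstance(page, bytes)
```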

Answered 2025-04-17 by Python大师


My guess is that you need to encode what you want to write as UTF-8:

to_write = tag.get_text() + "\n"
f.write(to_write.encode("utf-8"))

This is the same problem I ran into recently.
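
A sketch of this idea that also runs on Python 3, where encoded bytes can only be written to a file opened in binary mode ("wb"). The sample string and the temporary path are assumptions made for illustration.

```python
import os
import tempfile

# Made-up paragraph text with curly quotes, standing in for tag.get_text().
to_write = u"He said \u201cwe will stay\u201d." + u"\n"
out_path = os.path.join(tempfile.mkdtemp(), "ctp_output.txt")

# Binary mode: the file receives exactly the UTF-8 bytes we pass in.
with open(out_path, "wb") as f:
    f.write(to_write.encode("utf-8"))

# Read the raw bytes back so they can be compared with the original text.
with open(out_path, "rb") as f:
    data = f.read()
```

On Python 2 the original two-line snippet also works with a file opened in "w" mode, since str and bytes are the same type there.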

Answered 2025-04-17 by Python大师

