Cleaning scraped text in Python

Posted 2022-12-01 05:28:37


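I am new to Python and have just started learning web scraping with BeautifulSoup (in a Jupyter notebook). I scraped a book from Project Gutenberg and want to translate it. However, I am having trouble cleaning the text and then translating it.

I want to remove the material at the start of the text (for example BODY{color:Black;background:White;…) and then translate the whole text with the Google API.

Any help or advice on either point would be much appreciated. My code is below. The translation code does not work and returns the error "WriteError: [Errno 32] Broken pipe".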

import requests
from bs4 import BeautifulSoup
from googletrans import Translator  # assumed: the question does not show where Translator comes from

# Store the URL and download the page
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
r = requests.get(url)
html = r.text
print(html)
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")
type(soup)
# Scrape the entire text using get_text() and print it
text = soup.get_text()
print(text)
# Initialise the Google API translator and translate the text
translator = Translator()
translation = translator.translate(text, dest="ar")
print(translation)

1 answer
User
Reply 1 · Posted 2022-12-01 05:28:37

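To scrape only the text, look at which elements hold it. On this page the book's text is written inside p tags, so you can collect those tags with the find_all method from the bs4 module and read the text from each one. This also skips the style rules at the top of the page, because they are not inside a p tag.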

from bs4 import BeautifulSoup
import requests

url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
response = requests.get(url)
html = response.text
# print(html)
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html.parser")
# Collect every <p> element and print its text
paragraph = soup.find_all("p")
for para in paragraph:
    print(para.text)

Output:
"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.
...
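
The output above still carries the hard line breaks from the HTML source, and the question sends the whole book to the translator in a single call. The thread does not establish what causes the Broken pipe error, but the size of that single request is one plausible cause. The following is a minimal sketch, assuming that is the case: it collapses whitespace inside each paragraph and groups paragraphs into smaller chunks that can be translated one at a time. The names normalize_paragraph, chunk_paragraphs, translate_in_chunks and translate_one are hypothetical, and the 4000-character limit is an assumed value, not a documented one.

```python
import re
from typing import Callable, Iterable, List


def normalize_paragraph(raw: str) -> str:
    """Collapse line breaks and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def chunk_paragraphs(paragraphs: Iterable[str], max_chars: int = 4000) -> List[str]:
    """Group cleaned paragraphs into chunks of at most max_chars characters.

    max_chars is an assumed limit. A single paragraph longer than the
    limit is kept whole as its own chunk rather than being split.
    """
    chunks: List[str] = []
    current = ""
    for raw in paragraphs:
        para = normalize_paragraph(raw)
        if not para:
            continue  # skip empty paragraphs
        candidate = para if not current else current + "\n\n" + para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_in_chunks(
    paragraphs: Iterable[str],
    translate_one: Callable[[str], str],
    max_chars: int = 4000,
) -> str:
    """Translate chunk by chunk with a caller-supplied function and rejoin."""
    translated = [translate_one(chunk) for chunk in chunk_paragraphs(paragraphs, max_chars)]
    return "\n\n".join(translated)
```

If the Translator in the question comes from the third-party googletrans package (an assumption), a call such as translate_in_chunks((p.text for p in paragraph), lambda s: translator.translate(s, dest="ar").text) would send one chunk per request. Whether that removes the error depends on its actual cause.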