Web scraping: URL construction

s = name[:3] + '_' + name[3:]
url0 = 'https://en.wikipedia.org/wiki/' + s
url = requests.get(url0).text
soup = BeautifulSoup(url,"lxml")
soup.prettify()
table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')

2 Answers

User

Answer 1 · edited 2024-06-08 14:04:43

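If this only happens when the names are read from a file, the name must contain some special (Unicode) or whitespace characters. If you are using PyCharm you can step through it in the debugger, or simply print the name string with pprint() or repr() right after reading it from the file to see which character is causing the problem. Here is an example in which the normal print function does not make the extra characters visible, but repr() and pprint() do: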

from bs4 import BeautifulSoup
from pprint import pprint
import requests

# Suppose this is an article id read from the file (note the trailing whitespace)
article_id = "NGC2808   "

# print() outputs whitespace and invisible characters as-is, so they are easy to miss
print(article_id)

# repr() wraps the string in quotes and escapes non-printable characters
print(repr(article_id))

# pprint() formats strings via repr(), so it exposes the same characters
pprint(article_id)

# The stray characters are carried into the URL, so the infobox lookup below may come back empty
article_id_mod = article_id[:3] + '_' + article_id[3:]
url = 'https://en.wikipedia.org/wiki/' + article_id_mod

response = requests.get(url)
soup = BeautifulSoup(response.text,"lxml")

table = soup.find('table',{'class':'infobox'})
if table:
    tags = table.find_all('tr')
    print(tags)

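To fix this, you can do the following:

If the string has extra whitespace at the start or end, use the strip() method:
article_id = article_id.strip()
If there are special characters, use an appropriate regular expression, or open the file in an editor such as VS Code, Sublime Text, or Notepad++ and use its find/replace option.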
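
As a rough sketch of the two fixes above combined: the helper names clean_article_id and to_wiki_title are made up for illustration, and the regular expression assumes that valid ids consist only of ASCII letters and digits (as in "NGC2808"). Adjust the pattern if your ids can contain other characters.

```python
import re


def clean_article_id(raw: str) -> str:
    """Strip surrounding whitespace, then drop anything that is not a letter or digit."""
    stripped = raw.strip()
    # Assumption: valid ids contain only ASCII letters and digits
    return re.sub(r"[^A-Za-z0-9]", "", stripped)


def to_wiki_title(article_id: str) -> str:
    """Insert '_' after the first three characters, as in the question."""
    return article_id[:3] + "_" + article_id[3:]


# A name with a byte order mark, a zero-width space and trailing whitespace
dirty = "\ufeffNGC2808\u200b  \n"
clean = clean_article_id(dirty)

print(repr(dirty), "->", repr(clean))
print(to_wiki_title(clean))
```

strip() alone handles ordinary whitespace; the regular expression is only needed for characters such as a byte order mark or zero-width space, which strip() leaves in place.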

User
Answer 2 · edited 2024-06-08 14:04:43

Providing a minimal reproducible example and a copy of the error message would help a lot here, and might give more insight into your problem.
That said, the following works for me:
import requests

name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
print(temp)
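Edited because the question changed:
The error you posted indicates that Beautiful Soup cannot find any table in the document returned by the get request. Have you checked the URL being passed to that request, and what comes back?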
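
One way to check the URL, sketched with only the standard library: percent-encoding the title makes stray whitespace visible as %20 or %0A. The function name build_wiki_url is hypothetical, and the sample inputs are made up to show the effect.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"  # same base URL as in the question


def build_wiki_url(name: str) -> str:
    """Insert '_' after the first three characters and percent-encode the title."""
    title = name[:3] + "_" + name[3:]
    return WIKI_BASE + quote(title)


for candidate in ["NGC2808", "NGC2808 ", "NGC2808\n"]:
    # repr() shows the raw input; the encoded URL shows what would be requested
    print(repr(candidate), "->", build_wiki_url(candidate))
```

With requests, printing response.status_code and response.url after the call is a similar check on what was actually fetched.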
As it stands, I can get a list of tags (as you wanted) with the following:
import requests
from bs4 import BeautifulSoup
import lxml

name = "NGC2808"
s = name[:3] + '_' + name[3:]
url = 'https://en.wikipedia.org/wiki/' + s
temp = requests.get(url).text
soup = BeautifulSoup(temp,"lxml")
soup.prettify()
table = soup.find('table',{'class':'infobox'})
tags = table.find_all('tr')
print(tags)
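The line s = name[:3] + '_' + name[3:] is oddly indented, which suggests that some detail is missing from the top of your example. That context would be useful, because whatever logic is involved there is causing a malformed URL to be passed to the get request.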
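
As a guess at what that missing context might look like, here is a sketch in which the names come from a text file with one name per line. The file contents are hypothetical and io.StringIO stands in for the real file. Iterating over a file yields each line with its trailing newline still attached, which is enough to produce a malformed URL unless the line is stripped first.

```python
import io

# Stand-in for a text file with one object name per line (hypothetical contents)
names_file = io.StringIO("NGC2808\nNGC6397\n")

# Each line keeps its trailing newline when read this way
raw_names = list(names_file)
print(raw_names)

titles = []
for raw in raw_names:
    name = raw.strip()  # drop the newline and any surrounding whitespace
    if not name:
        continue  # skip blank lines
    titles.append(name[:3] + "_" + name[3:])

print(titles)
```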
