I started programming a week ago and am using BeautifulSoup and https://cagematch.net to write a scraper in Python that collects complex metadata.
Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2

link = "https://www.cagematch.net/?id=8&nr=12&page=4"
print link
url = urllib2.urlopen(link)  # Cagematch URL for PWG events
content = url.read()
soup = BeautifulSoup(content)

# Captures all event rows into a list; each event on the site is separated by '<tr class="TRow">'
events = soup.findAll("tr", {"class": "TRow"})
for i in events[1:12]:  # Only searches events within a year's scope
    # Captures each cell of an event into a list; cells are separated by '<td class="TCol TColSeparator">'
    data = i.findAll("td", {"class": "TCol TColSeparator"})
    date = data[0].text    # Date of the show, always the first value of "data"
    show = data[1].text    # Name of the show, always the second value of "data"
    status = data[2].text  # Event type: "Event (Card)" if the show hasn't occurred, "Event" if it has
    print date, show, status
    if status == "Event":  # If the event has occurred, get the card data
        print "Event already taken place"
        link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
        print content
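So the idea is:
1 works perfectly: it goes to the site just fine and gets what it needs. 2 does not.
I reassigned my "link" variable inside the if statement, and the link variable does change to the correct link. However, when I print content again, I still get the original page from when I first declared link.
If I redeclare all of the variables it works, but surely there must be another way?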
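Merely redefining the link variable does not trigger any change in the page content. You have to request and download the page from the new link, that is, call urllib2.urlopen(link) again and read the response into content.

Other notes: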
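You are using a very outdated BeautifulSoup version 3. Update to BeautifulSoup 4 and change the import to use the bs4 package (from bs4 import BeautifulSoup).
You can improve performance by switching to requests and reusing the same session for multiple requests to the same domain.
It is recommended to use a dedicated function, such as the standard library's urljoin, to join the parts of a URL.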
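
Putting the points above together, a minimal sketch might look like the following. It assumes Python 3 with the beautifulsoup4 and requests packages installed (pip install beautifulsoup4 requests). The helper names fetch, parse_events, iter_event_pages and main are hypothetical, chosen only for illustration, and urljoin is assumed to be the URL-joining helper meant above. The selectors and the [1:12] slice are copied from the question and have not been checked against the live site.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Listing URL taken from the question.
LISTING_URL = "https://www.cagematch.net/?id=8&nr=12&page=4"


def fetch(session, url):
    """Download one page through the shared session and return its HTML."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_events(html, page_url):
    """Return (date, show, status, link) tuples, one per event row.

    The selectors are copied from the question (an assumption about the
    page layout). link is absolute, or None when the row has no anchor.
    """
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for row in soup.find_all("tr", class_="TRow"):
        cells = row.find_all("td", class_="TCol TColSeparator")
        if len(cells) < 5:
            continue  # not an event row in the layout the question assumes
        anchor = cells[4].find("a", href=True)
        # urljoin resolves the relative href against the page it came from.
        link = urljoin(page_url, anchor["href"]) if anchor else None
        date, show, status = (cell.get_text(strip=True) for cell in cells[:3])
        events.append((date, show, status, link))
    return events


def iter_event_pages(session, listing_url=LISTING_URL):
    """Yield (date, show, html) for each event that has already taken place."""
    listing_html = fetch(session, listing_url)
    # Same slice as in the question: roughly one year of events.
    for date, show, status, link in parse_events(listing_html, listing_url)[1:12]:
        if status == "Event" and link:
            # A new link needs a new request; reassigning a variable is not enough.
            yield date, show, fetch(session, link)


def main():
    # One Session reuses the connection for all requests to the same host.
    with requests.Session() as session:
        for date, show, html in iter_event_pages(session):
            title = BeautifulSoup(html, "html.parser").title
            print(date, show, title.get_text(strip=True) if title else "")
```

Calling main() performs the real requests. Keeping the parsing in parse_events, separate from the downloading in fetch, means the parsing can be tried on a saved copy of the page without touching the network.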