Stuck on this

z = 0
atags = []
listurl = []

#import modules
import urllib
from bs4 import BeautifulSoup
import re

newurl = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Desmond.html"
while z < 7:
    url = newurl
    z = z + 1
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    soup.find_all("url")
    a = soup.find_all('a')
    for x in a:
        atags.append(str(x))
    url_end_full = atags[19]
    url_end = re.findall(r'"(.*?)"', url_end_full)
    url_end = str(url_end[0])
    newurl = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/' + url_end
    str(newurl)
    listurl.append(newurl)
    url = newurl
    print url

1 answer

User

#1 · Posted 2024-06-16 13:02:14

There are a few problems.

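atags[19] is not the 18th item, it is the 20th (lst[0] is the first item in a list).
soup.find_all("url") does nothing, because its result is discarded; get rid of it.
You don't need re.
The links returned are relative; you are hard-coding a join with the base path to make them absolute. That works in this case, but only by luck; do it properly with urljoin.
While str(link) does give you text that contains the URL, the "proper" way is to index into the tag's attributes, i.e. link['href'].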
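To illustrate the point about relative links, here is a minimal sketch comparing string concatenation with urljoin. It uses only the Python 3 standard library and makes no network requests. The helper names concat and resolve, and every href other than known_by_Gideon.html, are made up for the illustration and are not taken from the actual pages.

```python
from urllib.parse import urljoin

# URL of the page the links were found on (from the question).
PAGE_URL = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Desmond.html"
# Prefix the question's code glues onto every link.
PREFIX = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/"


def concat(href):
    """Hypothetical helper: what the original code does (plain concatenation)."""
    return PREFIX + href


def resolve(href):
    """Hypothetical helper: resolve href against the page it was found on."""
    return urljoin(PAGE_URL, href)


# A link in the same directory: both approaches agree.
same_dir = "known_by_Gideon.html"
print(concat(same_dir) == resolve(same_dir))

# Illustrative hrefs (assumptions, not from the real pages) where they differ.
for href in ("../other/page.html", "/root.html", "https://example.com/x.html"):
    print(href)
    print("  concat :", concat(href))
    print("  urljoin:", resolve(href))
```

Concatenation only gives the right answer while every href is a bare file name in the same directory; urljoin also handles parent-directory, root-relative and already-absolute links.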

With some judicious cleanup:

from bs4 import BeautifulSoup
from contextlib import closing
import sys

# version compatibility shim
if sys.hexversion < 0x3000000:
    # Python 2.x
    from urlparse import urljoin
    from urllib import urlopen
else:
    # Python 3.x
    from urllib.parse import urljoin
    from urllib.request import urlopen

START_URL = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Desmond.html"
STEPS = 7
ITEM = 18

def get_soup(url):
    # closing() is needed because Python 2's urlopen() result is not a context manager
    with closing(urlopen(url)) as page:
        # the 'lxml' parser requires the lxml package to be installed
        return BeautifulSoup(page.read(), 'lxml')

def main():
    url = START_URL
    for step in range(STEPS):
        print("\nStep {}: looking at '{}'".format(step, url))
        # get the right item (Python lists are indexed from 0)
        links = get_soup(url).find_all("a")
        rel_url = links[ITEM - 1]["href"]
        # convert from relative to absolute url
        url = urljoin(url, rel_url)
        print("  go to '{}'".format(url))

if __name__=="__main__":
    main()

If I've done it right, the result is known_by_Gideon.html
