Trouble with Python code

def get_next_target(page):
    start_link = page.find('<a href=')
    while True:
        if start_link == -1:
            x, y = None, 0
            return x, y
            break
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        url = page[start_quote + 1:end_quote]
        return url, end_quote

1 answer

User

#1 · Posted on 2024-04-23 18:04:08

def get_next_target(page, start=0):
    """Find the next link in page, searching from index start."""
    # Pass start to find() instead of slicing, so every index stays
    # relative to the whole page rather than to the slice
    start_link = page.find('<a href=', start)
    if start_link == -1:
        return None, None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    if start_quote == -1 or end_quote == -1:
        # malformed tag without a quoted value: stop instead of looping
        return None, None
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def find_all(page):
    """Find all links in page."""
    current_position = 0  # start with the full page
    urls = []
    while True:
        # get the url and update current_position, so the next search
        # covers only the rest of the page
        url, current_position = get_next_target(page, current_position)
        if url is None:
            # no more links; return without appending None
            return urls
        urls.append(url)

But I would suggest using a regular expression, for example:

import re

def find_all(page):
    # [^"]+ stops at the closing quote, so several links on one line stay separate
    return re.findall(r'<a href="([^"]+)"', page)
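One detail worth checking with this kind of pattern is greediness: `.+` matches as much as it can, so two links on the same line collapse into one match, while `[^"]+` stops at the first closing quote. A minimal sketch, using a made-up sample string:

```python
import re

# Hypothetical sample: two links on the same line
sample = '<a href="first.html">one</a> <a href="second.html">two</a>'

# Greedy: .+ runs up to the last double quote on the line
greedy = re.findall(r'<a href="(.+)"', sample)

# Bounded: [^"]+ cannot cross a double quote
bounded = re.findall(r'<a href="([^"]+)"', sample)

print(greedy)   # one long, wrong match
print(bounded)  # one entry per link
```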

Edit: neither of these solutions detects links such as:

<a  href="some/page">, or <a title="ti" href="some/page" >

To handle these, the regular expression has to be rewritten. That is the best option.
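One possible rewrite is sketched below. The pattern and the name `find_all_links` are my own assumptions, not part of the original answer. It allows extra whitespace after `<a` and other attributes before `href`, but it still only handles double-quoted values.

```python
import re

# Assumed pattern (illustrative only):
#   <a\s+            the tag name followed by at least one whitespace character
#   (?:[^>]*?\s)?    optionally, other attributes that end in whitespace
#   href="([^"]*)"   the double-quoted href value, captured
LINK_RE = re.compile(r'<a\s+(?:[^>]*?\s)?href="([^"]*)"', re.IGNORECASE)

def find_all_links(page):
    """Return the href values of all <a> tags found in page."""
    return LINK_RE.findall(page)

# Hypothetical sample covering the cases mentioned above
sample = (
    '<a href="plain.html">'
    '<a  href="some/page">'
    '<a title="ti" href="other/page" >'
)
print(find_all_links(sample))
```

Single-quoted or unquoted `href` values are not matched by this sketch and would need further alternatives in the pattern.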
