Scraping with Python: using the select() method on strings that contain spaces

<a class="overlay no-outline" href="/photos/28716729@N06/2834595694/" tabindex="0" role="heading" aria-level="3" aria-label="puppy by mpappas83" data-rapid_p="61" id="yui_3_16_0_1_1477971884605_5513"></a>

2 answers

User

Answer 1 · edited 2024-06-02 06:33:22

import fnmatch

import requests
from lxml import html


class HtmlRat:
    """Small helper for fetching a page and reading text out of it."""

    def req_page(self, url):
        # Fetch the page; callers read .text or .content from the response.
        return requests.get(url)

    def tag_data(self, tree, txpath):
        # tree is an lxml tree, e.g. html.fromstring(page.content).
        # Collect the text nodes under the XPath and split them on spaces.
        tag_val = tree.xpath(txpath + "/text()")
        return ''.join(tag_val).strip(' ').split(' ')


def link_grabber(url, pattern):
    # Print every whitespace-separated token of the raw HTML that matches
    # the glob pattern.
    markup = HtmlRat()
    page = markup.req_page(url)
    for token in page.text.split():
        if fnmatch.fnmatch(token, pattern):
            print(token)


link_grabber("https://www.flickr.com/search/?text=cars", 'href="*"')
link_grabber("http://www.superstreetonline.com/features/1610-2013-scion-fr-s-multipurposed/", 'href="*.jpg"')

# From here you can split each match on "=" to get the link itself.

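This should work. However, when you read the page's source code, the links are not there. I'm fairly sure they are generated on the back end. Check the code used by pexels or some other site and you should be fine.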
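The last step mentioned in the code comment, splitting each match on "=" to get the link itself, could look roughly like the sketch below. It runs on a made-up HTML string instead of a downloaded page, and both the `extract_links` helper and the `sample` markup are hypothetical names introduced only for illustration.

```python
import fnmatch


def extract_links(markup, pattern):
    """Return the unquoted value of every whitespace-separated token matching pattern."""
    links = []
    for token in markup.split():
        if fnmatch.fnmatch(token, pattern):
            # Split on the first "=" only, because URLs can contain "=" themselves.
            _, _, value = token.partition("=")
            links.append(value.strip('"'))
    return links


# Hypothetical sample markup standing in for a downloaded page.
sample = (
    '<a href="/img/car.jpg" class="thumb"></a> '
    '<a href="/search?text=cars" class="more">more</a>'
)

print(extract_links(sample, 'href="*"'))
print(extract_links(sample, 'href="*.jpg"'))
```

Note that this token-based matching only finds an href when the attribute is surrounded by whitespace; an href that is the last attribute before `>` ends up in a longer token and is missed, which is one reason an HTML parser is usually the more robust choice.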

User

Answer 2 · edited 2024-06-02 06:33:22

The following approach should help:

import bs4

html = """<a class="overlay no-outline" href="/photos/28716729@N06/2834595694/" tabindex="0" role="heading" aria-level="3" aria-label="puppy by mpappas83" data-rapid_p="61" id="yui_3_16_0_1_1477971884605_5513"></a>"""
soup = bs4.BeautifulSoup(html, "html.parser")

for link in soup.select("a.overlay.no-outline"):
    print(link['href'])

This prints:

/photos/28716729@N06/2834595694/

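The space in the class attribute indicates that two different classes are applied to the tag. The BeautifulSoup documentation has a section on handling this with the approach above; search it for the text "If you want to search for tags that match two or more CSS classes".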
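To make the point about two separate classes concrete, here is a small sketch that reuses the tag from the question, trimmed to its class and href attributes. The variable names are only illustrative; the calls used (`select`, `find_all` with `class_`) are the ones described in the BeautifulSoup documentation.

```python
import bs4

# Same tag as in the question, trimmed to the attributes that matter here.
html = '<a class="overlay no-outline" href="/photos/28716729@N06/2834595694/"></a>'
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.a

# class is a multi-valued attribute, so it is parsed into a list of class names.
classes = link["class"]

# Chained classes in a CSS selector match whatever order they are written in.
forward = soup.select("a.overlay.no-outline")
reverse = soup.select("a.no-outline.overlay")

# Searching on just one of the classes also finds the tag.
single = soup.find_all("a", class_="overlay")

print(classes)
print(forward[0]["href"], reverse[0]["href"], len(single))
```

Because the classes are matched individually, the selector does not depend on the exact string in the class attribute, so it should keep working if the site reorders the classes or adds another one.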
