How can I extract URLs by their title with Beautiful Soup?

Posted 2024-03-28 20:35:26


I have a list of links I am interested in:

lis = ['https://example1.com', 'https://example2.com', ..., 'https://exampleN.com']

Each of those pages contains several URLs, and I want to extract some specific ones from them. Such URLs have the following form:

<a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>

How can I visit every element of lis and return, in a pandas DataFrame, only the URLs whose title is Url to news, like this (**):

^{pr2}$

Note that for elements of lis with no match I want to return NaN.

I tried this:

import requests
from lxml import html

def extract_jpg_url(a_link):
    page = requests.get(a_link)
    tree = html.fromstring(page.content)
    # here is the problem... not all interesting links have this xpath; how can I select by title?
    # (apparently all the jpg urls have this form: title="Url to news")
    interesting_link = tree.xpath(".//*[@id='object']//tbody//tr//td//span//a/@href")
    if len(interesting_link) == 0:
        return 'NaN'
    else:
        return 'image link ', interesting_link
then:

    df['news_link'] = df['urls_from_lis'].apply(extract_jpg_url)

However, this approach takes too long, and not all elements of lis match the given xpath (see the comment in the code). How can I get (**)?
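As an aside, lxml can select anchors by their title attribute directly, which avoids the brittle positional XPath above. A minimal sketch, parsing a hypothetical HTML snippet (modelled on the markup shown in the question) instead of a live page:

```python
from lxml import html

# Hypothetical HTML snippet mimicking the pages' structure
snippet = """
<div id="object">
  <a href="https://interesting-link1.com" target="_blank" title="Url to news"> News JPG </a>
  <a href="https://other-site.com" title="Something else">Other</a>
</div>
"""

tree = html.fromstring(snippet)
# Select the href of every anchor whose title attribute is exactly "Url to news"
links = tree.xpath('.//a[@title="Url to news"]/@href')
print(links)  # ['https://interesting-link1.com']
```

The predicate `[@title="Url to news"]` filters on the attribute value, so the anchor's position in the tree no longer matters.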


1 answer

User

#1 · Posted 2024-03-28 20:35:26

This will not return exactly what you want (NaN), but it will give you the general idea of how to make this simple and efficient.

from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
import requests

def extract_urls(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    # Keep only anchors whose title attribute is "Url to news"
    results = soup.find_all('a', {'title': 'Url to news'})
    results = [x['href'] for x in results]
    return (link, results)

links = [
    "https://example1.com",
    "https://example2.com",
    "https://exampleN.com",
]

# Fetch the pages concurrently with a pool of 10 threads
p = ThreadPool(10)
r = p.map(extract_urls, links)

for url, results in r:
    print(url, results)
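To get the NaN behaviour the question asks for, the `(url, results)` pairs can then be collapsed into a pandas DataFrame, replacing empty match lists with NaN. A sketch using made-up scrape results in place of live requests (the URLs and matches are illustrative, not real):

```python
import numpy as np
import pandas as pd

# Hypothetical output of the ThreadPool map above: (url, list_of_matches) pairs
r = [
    ("https://example1.com", ["https://interesting-link1.com"]),
    ("https://example2.com", []),  # page had no anchor titled "Url to news"
    ("https://exampleN.com", ["https://interesting-linkN.com"]),
]

df = pd.DataFrame(r, columns=["urls_from_lis", "news_link"])
# Unwrap single-element match lists; empty lists become NaN
df["news_link"] = df["news_link"].apply(
    lambda matches: matches[0] if matches else np.nan
)
print(df)
```

Pages with no matching anchor end up with NaN in the `news_link` column, which is what the question's (**) example asks for.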
