Is there a way to parse data from multiple pages linked from a parent web page?

Posted 2024-05-21 01:40:55


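I have been going to a website, https://ndclist.com/?s=Solifenacin, to get NDC codes. I need the 10-digit NDC codes, but the search results page only shows 8-digit NDC codes, as in the image below.

[Image: 8 digit NDC code]

So I click the underlined NDC code and get this page:

[Image: 10 digit NDC code]

I then copy and paste these two NDC codes into an Excel sheet and repeat the process for the rest of the codes on the first page. This takes a long time, so I was wondering whether there is a Python library that could copy and paste the 10-digit NDC codes for me, or store them in a list that I can print once all the 8-digit NDC codes on the first page have been processed. Is there a better library for this?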

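EDIT: I actually need to go one level deeper. I have been trying to work this out but keep failing. Apparently the last level is this silly HTML table, and I only need one element from it. This is the last page, which you reach after clicking a second-level code.

[Image: last level page]

Here is my code, but when I run it, it returns a tr and None objects: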

url ='https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processin link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}... '.format(link_url2))
        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)

print('Trospium')
print(all_data)

1 Answer
User
#1 · Posted 2024-05-21 01:40:55

Yes, BeautifulSoup is a good fit for this. This script prints all the 10-digit codes reachable from the search page:

import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processin link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for link in soup2.select('#product-packages a'):
        print(link.text)
        all_data.append(link.text)

# all_data now holds every code; uncomment to print them:
# print(all_data)

Prints:

Processin link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processin link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processin link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processin link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09

... and so on.
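Since the question mentions pasting the codes into an Excel sheet, the collected list could also be written to a CSV file, which Excel can open. The following is only a minimal sketch: the helper name `save_codes`, the file name `ndc_codes.csv`, and the `ndc` column header are assumptions made for illustration, and the sample list just reuses a few codes from the output above.

```python
import csv


def save_codes(codes, path):
    """Write one NDC code per row to a CSV file."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['ndc'])              # header row (name is an assumption)
        writer.writerows([c] for c in codes)  # one single-column row per code


# Stand-in for the all_data list built by the scraping loop
all_data = ['0093-5263-56', '0093-5263-98', '0591-3796-19']
save_codes(all_data, 'ndc_codes.csv')
```

In the real script, `save_codes(all_data, 'ndc_codes.csv')` would go after the loop, in place of the commented-out `print(all_data)`.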

EDIT: (a version that also gets the descriptions):

import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processin link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
        code = code.get_text(strip=True).split(maxsplit=1)[-1]
        desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
        print(code, desc)
        all_data.append((code, desc))

# in all_data you have all codes:
# print(all_data)

Prints:

Processin link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processin link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processin link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE

...and so on.
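Regarding the edit in the question (going one level deeper): `soup3.findAll('tr', limit=7)[1]` is a single row tag, and iterating over a tag yields its direct children. Those children include the whitespace strings between cells, whose `.name` is `None`, so printing `.name` never shows the cell text. A possible fix is to ask the row for its cells and read their text. The sketch below runs on made-up markup, because the real structure of the last-level page is not shown in the question; the field labels and the choice of the second cell are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the last-level page; the real
# table on ndclist.com may be structured differently.
html = """
<table>
  <tr>
    <th>Field Name</th>
    <th>Field Value</th>
  </tr>
  <tr>
    <td>NDC Package Code</td>
    <td>0591-3796-19</td>
  </tr>
  <tr>
    <td>Package Description</td>
    <td>90 TABLET, FILM COATED in 1 BOTTLE</td>
  </tr>
</table>
"""

soup3 = BeautifulSoup(html, 'html.parser')

# find_all is the current name for the legacy findAll alias
rows = soup3.find_all('tr', limit=7)
second_row = rows[1]

# Iterating over the row yields cell tags and the whitespace strings
# between them; .name is None for the strings
child_names = [child.name for child in second_row]

# Selecting the cells explicitly skips the whitespace strings
cells = [td.get_text(strip=True) for td in second_row.find_all('td')]
value = cells[1]  # second cell of the second row (assumed to be the wanted one)
print(value)
```

In the question's loop, the same two lines (`find_all('td')` on the chosen row, then `get_text(strip=True)`) would replace the inner `for link in ...` loop, with the row and cell indices adjusted to whichever element of the table is actually needed.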
