Extracting chemical names with Beautiful Soup

Posted 2024-06-12 11:44:44

I am trying to extract the chemical names (all in uppercase) from the URL below:

https://www.legislation.gov.au/Details/F2020L01255

I am interested in the chemicals listed in Schedule 4.

import requests
import re
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

url = 'https://www.legislation.gov.au/Details/F2020L01255'
headers = {"Accept-Language": "EN-AU, en;q=0.5"}
results = requests.get(url, headers=headers)

soup = BeautifulSoup(results.text, "html.parser")

chemicals = []

chems_div = soup.find_all('div', class_='WordSection7')

This is where I am stuck. The chemical names are wrapped in P tags and span tags with class='MsoNormal' and lang='EN-AU'.


1 answer
User
#1 · Posted 2024-06-12 11:44:44

Try this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.legislation.gov.au/Details/F2020L01255'
headers = {"Accept-Language": "EN-AU, en;q=0.5"}
results = requests.get(url, headers=headers)

soup = BeautifulSoup(results.text, "html.parser")
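# find() returns the first matching div (find_all() would return a list)
chems_div = soup.find('div', class_='WordSection7')
# Collect the stripped text of every <span lang="EN-AU"> inside that div
all_spans = [
    t.get_text(strip=True) for t in
    chems_div.find_all("span", {"lang": "EN-AU"})
]

# Keep only the all-uppercase entries, skipping the section heading
print([w for w in all_spans if w.isupper() and w != "SCHEDULE 4"])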

Output:

['ABACAVIR.', 'ABATACEPT.', 'ABIRATERONE ACETATE.', 'ABCIXIMAB.', 'ABEMACICLIB.', 'ACALABRUTINIB.', 'ACAMPROSATE CALCIUM.', 'ACARBOSE.', 'ACEBUTOLOL.', 'ACEPROMAZINE.', 'ACETARSOL.', 'ACETAZOLAMIDE.', 'ACETOHEXAMIDE.', 'ACETYL ISOVALERYLTYLOSIN.', 'ACETYLCARBROMAL.', 'ACETYLCHOLINE.', 'ACETYLDIGITOXIN.', 'ACETYLMETHYLDIMETHYLOXIMIDOPHENYLHYDRAZINE.', 'ACETYLSTROPHANTHIDIN.', 'ACIPIMOX.', '# ACITRETIN.', 'ACLIDINIUM BROMIDE.', 'ACOKANTHERA OUABAIO.', 'ACOKANTHERA SCHIMPERI.', 'ACRIVASTINE.', 'ADALIMUMAB.', 'ADAPALENE.', 'ADEFOVIR.', 'ADIPHENINE.', 'ADONIS VERNALIS.', 'ADRAFINIL.', 'AFAMELANOTIDE.', 'AFATINIB DIMALEATE.'
and so on...
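
The entries above still end in a full stop, and at least one ('# ACITRETIN.') starts with a '#'. If you want the bare names, a small clean-up step may help. The sketch below is only an illustration: clean_name is a hypothetical helper, not part of the answer, and it assumes the leading '#' is a marker rather than part of the chemical name, which the page output alone does not confirm.

```python
import re


def clean_name(raw: str) -> str:
    """Tidy one scraped entry (hypothetical helper, not from the original answer)."""
    name = raw.strip()
    # Assumption: a leading '#' is a marker, not part of the chemical name.
    name = re.sub(r"^#\s*", "", name)
    # Drop the trailing full stop(s) and any whitespace left behind.
    return name.rstrip(".").strip()


# A few entries copied from the output shown above.
sample = ['ABACAVIR.', 'ABIRATERONE ACETATE.', '# ACITRETIN.']
cleaned = [clean_name(s) for s in sample]
print(cleaned)
```

In the answer's script this could be applied to each string that passes the isupper() filter before printing.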
