I want to go to this page and scrape every row in the "Activity" tab:
Antibacterial Activities
1 Flora:E.coli MIC:5.59µg/ml (2.0005µM) Method:MIC :
2 Flora:A.salmonicida subsp salmonicida MIC:11.18µg/ml (4.001µM) Method:MIC :
3 Flora:V.anguillarum MIC:2.79µg/ml (0.998461µM) Method:MIC :
4 Flora:S.typhimurium MIC:11.18µg/ml (4.001µM) Method:MIC :
5 Flora:B.subtilis MIC:5.59µg/ml (2.0005µM) Method:MIC :
6 Flora:L.ivanovii MIC:11.18µg/ml (4.001µM) Method:MIC :
I tried several approaches with BeautifulSoup, because I had already spent too much time trying to do this with Selenium. The approaches I tried:
from urllib.request import urlopen  # urlopen is used below, so import it explicitly
from bs4 import BeautifulSoup as bs

html_page = urlopen('http://biotechlab.fudan.edu.cn/database/lamp/detail.php?id=L01A003388')
soup = bs(html_page, 'html.parser')  # name the parser to avoid the bs4 warning

# method 1: select every <ul> on the page
li = soup.select('ul')
print(li)

# method 2: match the <ul> by its full class string, then print each <li>
for ultag in soup.find_all('ul', {'class': "ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom ui-accordion-content-active"}):
    for litag in ultag.find_all('li'):
        print(litag.text)

# method 3: same class filter, but print the text of each <a> inside the <ul>
for ul in soup.find_all('ul', class_="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom ui-accordion-content-active"):
    for link in ul.find_all('a'):
        print(link.text)
Method 1 prints every UL on the page, whereas I only want this part, returned in a cleaner format:
<ul><li><strong> Antibacterial Activities</strong></li><li> 1 Flora:E.coli MIC:5.59µg/ml (2.0005µM) Method:MIC :</li><li> 2 Flora:A.salmonicida subsp salmonicida MIC:11.18µg/ml (4.001µM) Method:MIC :</li><li> 3 Flora:V.anguillarum MIC:2.79µg/ml (0.998461µM) Method:MIC :</li><li> 4 Flora:S.typhimurium MIC:11.18µg/ml (4.001µM) Method:MIC :</li><li> 5 Flora:B.subtilis MIC:5.59µg/ml (2.0005µM) Method:MIC :</li><li> 6 Flora:L.ivanovii MIC:11.18µg/ml (4.001µM) Method:MIC :</li></ul>
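With methods 2 and 3, the only thing printed to the screen was "soup = bs(html_page)".
I would be grateful if someone could tell me what went wrong and how to extract the data I am interested in. By the way, I am still learning. I originally wanted to do this with Selenium, but I kept struggling with it, which is why I moved to BeautifulSoup.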
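With bs4 4.7.1+, you can use :contains to target the Activity tab, the adjacent sibling combinator to get the next div, and a type selector with the descendant combinator to get the child li elements. I use re to do some string cleaning on the output.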
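The selector described above can be sketched roughly as follows. This is a minimal illustration, not the answer's original code: SAMPLE_HTML is a hypothetical stand-in for the page's static markup (the real page may be structured differently), and activity_rows is a helper name invented for this example. It uses :-soup-contains, the newer spelling of :contains in recent soupsieve releases; on older installs the plain :contains named above should behave the same way.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's static HTML; the real markup may differ.
SAMPLE_HTML = """
<div id="accordion">
  <h3><a href="#">General</a></h3>
  <div><ul><li>Name: example</li></ul></div>
  <h3><a href="#">Activity</a></h3>
  <div>
    <ul>
      <li><strong> Antibacterial Activities</strong></li>
      <li> 1 Flora:E.coli MIC:5.59µg/ml (2.0005µM) Method:MIC :</li>
      <li> 2 Flora:V.anguillarum  MIC:2.79µg/ml
           (0.998461µM) Method:MIC :</li>
    </ul>
  </div>
</div>
"""


def activity_rows(html):
    """Return the cleaned text of each <li> under the 'Activity' heading."""
    soup = BeautifulSoup(html, "html.parser")
    # h3 whose text contains "Activity", then the <div> immediately after it
    # (adjacent sibling combinator), then every <li> inside that <div>
    # (type selector plus descendant combinator).
    items = soup.select('h3:-soup-contains("Activity") + div li')
    # Collapse runs of whitespace (newlines, tabs, double spaces) and trim.
    return [re.sub(r"\s+", " ", li.get_text()).strip() for li in items]


rows = activity_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

To run this against the live page, the HTML fetched with urlopen could be passed to activity_rows in place of SAMPLE_HTML. One likely reason methods 2 and 3 print nothing is that the ui-accordion-* classes are normally added by jQuery UI in the browser, so they may not be present in the HTML the server returns. Matching on the heading text avoids depending on them.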