Extracting text with BeautifulSoup

Posted 2024-03-29 14:05:28

I have the following HTML data:

<!DOCTYPE html>
<html>
<head>

    <script type="text/blzscript">
    </script>
    <title></title>
</head>
<body>
    <p class="status-box">In some countries, this medicine may only be approved for veterinary use.</p>
    <h3>Scheme</h3>
    <p>Rec.INN</p>
    <h3>CAS registry number (Chemical Abstracts Service)</h3>
    <p>0000850-52-2</p>
    <h3>Chemical Formula</h3>
    <p>C21-H26-O2</p>
    <h3>Molecular Weight</h3>
    <p>310</p>
    <h3>Therapeutic Category</h3>
    <p>Progestin</p>
    <h3>Chemical Names</h3>
    <p>17α-Allyl-17-hydroxyesta-4,9,11-trien-3-one (WHO)</p>
    <p>Estra-4,9,11-trien-3-one, 17β-hydroxy-17-(2-propenyl)- (USAN)</p>
    <h3>Foreign Names</h3>
    <ul>
        <li>Altrenogestum (Latin)</li>
        <li>Altrenogest (German)</li>
        <li>
            <a href="altr%C3%A9nogest.html">Altrénogest</a> (French)
        </li>
        <li>Altrenogest (Spanish)</li>
    </ul>
    <h3>Generic Names</h3>
    <ul>
        <li>Altrenogest (OS: BAN, USAN)</li>
        <li>
            <a href="altr%C3%A9nogest.html">Altrénogest</a> (OS: DCF)
        </li>
        <li>A 35957 (IS)</li>
        <li>A 41300 (IS)</li>
        <li>RH 2267 (IS)</li>
        <li>RU 2267 (IS: RousselUclaf)</li>
    </ul>
    <h3>Brand Names</h3>
    <div class='contentAdRight' id='third_ad_unit'>
        <div class='adsense-ad adsense-ad-text-image-flash-html adsense-ad-300 adsense-ad-300x600 adsense-ad-international'>
            <script type="text/blzscript">
            google_ad_client="pub-3964816748264478";google_ad_channel="";google_ad_format="300x600_pas_abgc";google_ad_width="300";google_ad_height="600";google_ad_type="text,image,flash,html";google_color_border="FFFFFF";google_color_bg="FFFFFF";google_color_link="0000FF";google_color_text="000000";google_color_url="008000";google_analytics_domain_name="drugs.com";
            </script>
            <h1></h1>
        </div>
    </div>
</body>
</html>

I want to extract the Foreign Names, Generic Names, and Brand Names.

I tried:

test = soup.select('h1')[0].text.strip()
print(test)

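But this does not give me what I want. I also tried extracting the contents of the script tags, but none of these attempts produced the result I need.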
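For reference, the only h1 in the posted markup is the empty one inside the ad div, so the attempt above can only print an empty string. The names themselves sit in the ul that follows each h3 heading. Below is a minimal sketch of one possible way to collect them, not a definitive solution. It assumes the page keeps the h3-then-ul layout shown above, uses an abbreviated copy of the markup so it is self-contained, and introduces a hypothetical helper called names_under. In the posted HTML the "Brand Names" heading is followed only by an ad block, so nothing can be extracted for it.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the markup from the question (an assumption made so the
# sketch is self-contained; in practice `html` would be the full page source).
html = """
<h3>Chemical Names</h3>
<p>17α-Allyl-17-hydroxyesta-4,9,11-trien-3-one (WHO)</p>
<h3>Foreign Names</h3>
<ul>
    <li>Altrenogestum (Latin)</li>
    <li>
        <a href="altr%C3%A9nogest.html">Altrénogest</a> (French)
    </li>
</ul>
<h3>Generic Names</h3>
<ul>
    <li>Altrenogest (OS: BAN, USAN)</li>
    <li>A 35957 (IS)</li>
</ul>
<h3>Brand Names</h3>
<div class='contentAdRight' id='third_ad_unit'>
    <h1></h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def names_under(soup, heading):
    """Hypothetical helper: list the <li> texts that belong to an <h3> heading."""
    h3 = soup.find("h3", string=heading)
    if h3 is None:
        return []
    # First following sibling that is either a list or the next heading.
    nxt = h3.find_next_sibling(["ul", "h3"])
    if nxt is None or nxt.name != "ul":
        return []  # no list before the next heading (e.g. "Brand Names")
    # " ".join(split()) collapses the whitespace left around nested <a> tags.
    return [" ".join(li.get_text().split()) for li in nxt.find_all("li")]


wanted = ("Foreign Names", "Generic Names", "Brand Names")
result = {heading: names_under(soup, heading) for heading in wanted}
print(result)
```

Stopping at the next h3 keeps a heading that has no list of its own from picking up the list of a later section.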

