Unable to extract text from span tags using BeautifulSoup and requests

-1 votes
3 answers
71 views
Asked 2025-04-13 01:03

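I am trying to scrape posts from an online forum. The forum link is https://csn.cancer.org/categories/prostate. All the posts appear to be inside span tags.

I use the code below to scrape the posts.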

import requests
from bs4 import BeautifulSoup as bs

url = "https://csn.cancer.org/categories/prostate"
response = requests.get(url)

soup = bs(response.text, 'html.parser')
ps = soup.find_all('span', attrs={'class': "css-yjh1h7-TruncatedText-styles-truncated"})
for p in ps:
    print(p.text)

But I get nothing back. I cannot extract anything from the span tags with 'class':"css-yjh1h7-TruncatedText-styles-truncated".

What confuses me is that if I copy part of the HTML from the site and do this:

html ='''  <div class="css-e9znss-ListItem-styles-item"><div class="css-1flthw6-ListItem-styles-iconContainer"><div class="css-n9xrp8-ListItem-styles-icon"><div class="css-7yur4m-DiscussionList-classes-userIcon"><a aria-current="false" href="https://csn.cancer.org/profile/336163/PhoenixM" tabindex="0" data-link-type="legacy"><div class="css-1eztffh-userPhotoStyles-medium css-11wwpgq-userPhotoStyles-root isOpen"><img title="PhoenixM" alt="User: &quot;PhoenixM&quot;" height="200" width="200" src="https://w6.vanillicon.com/v2/62fd0812499add3e38f8e90eee3af967.svg" class="css-10y567c-userPhotoStyles-photo" loading="lazy"></div></a></div></div></div><div class="css-2guvez-ListItem-styles-contentContainer"><div class="css-1kxjkhx-ListItem-styles-titleContainer"><h3 class="css-glebqx-ListItem-styles-title"><a aria-current="false" href="https://csn.cancer.org/discussion/327814/brachytherapy" class="css-ojxxy9-ListItem-styles-titleLink-DiscussionList-classes-title" tabindex="0" data-link-type="legacy"><span class="css-yjh1h7-TruncatedText-styles-truncated">Brachytherapy</span></a></h3></div><div class="css-1y6ygw7-ListItem-styles-metaWrapContainer"><div class="css-5swiwf-ListItem-styles-metaDescriptionContainer"><p class="css-1ggegep-ListItem-styles-description"><span class="css-yjh1h7-TruncatedText-styles-truncated">I have recently been diagnosed with locally advanced prostate cancer. Gleason 9 stage 4a. My cancer has spread outside my prostate to a very enlarged lymph node in my pelvic region. 
I’m currently taki…</span></p><div class="css-1uyxq88-Metas-styles-root css-h3lbxm-ListItem-styles-metasContainer"><div class="css-1a607mt-Metas-styles-meta">135 views</div><div class="css-1a607mt-Metas-styles-meta">6 comments</div><div class="css-1a607mt-Metas-styles-meta">0 point</div><div class="css-1a607mt-Metas-styles-meta">Started by <a aria-current="false" href="https://csn.cancer.org/profile/336163/PhoenixM" class="css-1unw87s-Metas-styles-metaLink" tabindex="0" data-link-type="legacy">PhoenixM</a></div><div class="css-1a607mt-Metas-styles-meta">Most recent by <a aria-current="false" href="https://csn.cancer.org/profile/285710/Steve1961" class="css-1unw87s-Metas-styles-metaLink" tabindex="0" data-link-type="legacy">Steve1961</a></div><div class="css-1a607mt-Metas-styles-meta"><time datetime="2024-03-22T01:59:58+00:00" title="Thursday, March 21, 2024 at 9:59 PM">Mar 21, 2024</time></div><div class="css-1a607mt-Metas-styles-meta"><a aria-current="false" href="https://csn.cancer.org/categories/prostate" class="css-1unw87s-Metas-styles-metaLink" tabindex="0" data-link-type="legacy"> Prostate Cancer </a></div></div></div></div></div><div class="css-1pv9k2p-ListItem-styles-actionsContainer"></div></div> '''

from bs4 import BeautifulSoup as bs
import requests

# Parse the HTML with BeautifulSoup
soup = bs(html, 'html.parser')

ps = soup.find_all('span', attrs={'class': "css-yjh1h7-TruncatedText-styles-truncated"})

for p in ps:
    print(p.text)

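then I can extract the post content. Can someone help me understand why I get nothing when scraping from the site URL? What am I doing wrong? Thanks, everyone.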

3 Answers

0

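The key point is that the browser's developer tools show you the dynamically rendered page source, whereas requests only sees the static response returned by the server.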

So first inspect your response and soup to see whether all the elements you expect are actually there, and whether they appear in the form you expect.
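One quick way to run that check is to count how many elements match the selector. The sketch below does this on two hypothetical inline HTML strings rather than the live site; the helper name `count_matches` and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

def count_matches(html, tag, css_class):
    """Return how many <tag> elements carrying css_class BeautifulSoup finds in html."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag, class_=css_class))

# Hypothetical stand-ins: markup as the browser shows it after JavaScript runs,
# and markup as a plain HTTP request might return it.
rendered_html = '<a href="#"><span class="css-yjh1h7-TruncatedText-styles-truncated">Brachytherapy</span></a>'
static_html = '<noscript><ul class="linkList"><li><a href="#">Brachytherapy</a><p>Excerpt</p></li></ul></noscript>'

target = "css-yjh1h7-TruncatedText-styles-truncated"
print(count_matches(rendered_html, "span", target))  # the span exists in the rendered markup
print(count_matches(static_html, "span", target))    # the span is absent from the static markup
```

With the live page you would pass `response.text` as the first argument; a count of 0 suggests the selector has to change.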

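The static version only returns an unordered list of links and paragraphs, so you need to adjust your element selection strategy accordingly.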

Example
import requests
from bs4 import BeautifulSoup as bs

url = "https://csn.cancer.org/categories/prostate"
response = requests.get(url)

soup = bs(response.text, 'html.parser')

# Each li in the static list holds a link (title and URL) and a paragraph (excerpt)
data = []
for e in soup.select('ul.linkList li'):
    data.append({
        'url': e.a.get('href'),
        'title': e.a.text,
        'content': e.p.text
    })

print(data)

Output:

[{'url': 'https://csn.cancer.org/discussion/324401/some-tips-from-csn',
  'title': 'Some tips from CSN',
  'content': 'Welcome to the CSN! Below are some tips to help you get started following and posting on the boards. You can follow your preferred discussion board to get notifications of new topics and comments. To follow a board, navigate to your preferred board and then select the bell at the top, followed by your notification…'},
 {'url': 'https://csn.cancer.org/discussion/327774/second-biopsy-results',
  'title': 'Second Biopsy results',
  'content': 'Last November at 67 after an MRI my first fusion biopsy was done transrectally at local hospital. It indicated (as far as I could figure) a single tumor residing in the right side of my prostate. I had a NM bone scan in January which did not show any bony metastases. Also a Prolaris test which indicated for single modal…'},
 ...
{'url': 'https://csn.cancer.org/discussion/327732/psa-value-at-0-54-on-1-24-24',
  'title': 'PSA Value at 0.54 on 1/24/24',
  'content': 'I had Prostate Cancer in 2018 and had radiation treatments all was good but now my value is all over the place since then on 11/17/21 V @ 0.7, on 3/15/22, 1.84 on 3/30/23 0.38, on 8/7/23 0.26, 1/24/24 0.54 is this common?'}]
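If some list items lack a link or a paragraph, `e.a` or `e.p` would be `None` and the loop would raise an `AttributeError`. A minimal defensive sketch, run on hypothetical sample markup (the function name `extract_posts` and the example.com URLs are assumptions, not part of the site):

```python
from bs4 import BeautifulSoup

def extract_posts(html):
    """Collect url/title/content from each li in ul.linkList, skipping items without a link."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for li in soup.select("ul.linkList li"):
        link = li.find("a")
        para = li.find("p")
        if link is None:
            continue  # nothing to identify the post by
        posts.append({
            "url": link.get("href"),
            "title": link.get_text(strip=True),
            "content": para.get_text(strip=True) if para else "",
        })
    return posts

# Hypothetical markup: one complete item, one without an excerpt, one without a link
sample = """
<ul class="linkList">
  <li><a href="https://example.com/discussion/1">First post</a><p>First excerpt</p></li>
  <li><a href="https://example.com/discussion/2">No excerpt here</a></li>
  <li>Just text, no link</li>
</ul>
"""
posts = extract_posts(sample)
print(posts)
```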
0

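The data you need is inside a noscript tag, which is why the span selector from the question finds nothing. I found a pattern in the response and split the text on it to get the HTML that holds the posts. The snippet below worked for me.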

import requests
from bs4 import BeautifulSoup as bs

url = "https://csn.cancer.org/categories/prostate"
response = requests.get(url)

try:
    # Extract the HTML inside the noscript tag so it can be parsed on its own
    noscript_value = response.text.split('noscript><div class="pageBox"')[1].split("</div></noscript>")[0].strip()
except IndexError as e:
    # The marker was not found; fall back to an empty document so parsing does not fail
    print(f"Error: {e}")
    noscript_value = ""

soup = bs(noscript_value, 'html.parser')

# Each li holds one post: its title followed by an excerpt
for p in soup.find_all('li'):
    print(p.text.strip())

Output:

Some tips from CSN
Welcome to the CSN! Below are some tips to help you get started following and posting on the boards. You can follow your preferred discussion board to get notifications of new topics and comments. To follow a board, navigate to your preferred board and then select the bell at the top, followed by your notification…
Second Biopsy results
Last November at 67 after an MRI my first fusion biopsy was done transrectally at local hospital. It indicated (as far as I could figure) a single tumor residing in the right side of my prostate. I had a NM bone scan in January which did not show any bony metastases. Also a Prolaris test which indicated for single modal…
Brachytherapy
I have recently been diagnosed with locally advanced prostate cancer. Gleason 9 stage 4a. My cancer has spread outside my prostate to a very enlarged lymph node in my pelvic region. I’m currently taking Orgovyx +Abiraterone and my Radiologist is recommending IMRT. Ive been doing research that suggests brachytherapy…
.
.
.
Benign biopsy. Confirm MDX test results: 21%< =G6; 15%>=G7
I had a MRI fusion transperineal biopsy. 26 cores benign. My urologist ordered a Confirm MDX test. Results show: Likelihood of prostate cancer on repeat biopsy: 36% 21% likelihood of detecting Gleason score < = 6 cancer 15% likelihood of detecting Gleason score >=7 cancer It is recommending for a second biopsy. I guess I…
PSA Value at 0.54 on 1/24/24
I had Prostate Cancer in 2018 and had radiation treatments all was good but now my value is all over the place since then on 11/17/21 V @ 0.7, on 3/15/22, 1.84 on 3/30/23 0.38, on 8/7/23 0.26, 1/24/24 0.54 is this common?
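As an alternative to splitting the raw text, the same items can usually be reached with a CSS selector, because `html.parser` treats the contents of noscript as ordinary markup. The sketch below assumes the static page has the structure implied above (a `pageBox` div holding a `linkList` inside noscript) and runs on hypothetical sample HTML rather than the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical page: an empty app container plus a noscript fallback list
sample = """
<html><body>
<div id="app"></div>
<noscript><div class="pageBox"><ul class="linkList">
  <li><a href="https://example.com/discussion/1">Title one</a><p>Excerpt one</p></li>
  <li><a href="https://example.com/discussion/2">Title two</a><p>Excerpt two</p></li>
</ul></div></noscript>
</body></html>
"""

soup = BeautifulSoup(sample, "html.parser")

# Select every li that sits anywhere inside a noscript element
items = [li.get_text(" ", strip=True) for li in soup.select("noscript li")]
print(items)
```

This avoids depending on exact string markers, which break as soon as the surrounding markup changes slightly.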
1

I recommend loading the data from the JSON string embedded in the page:

import json

import requests
from bs4 import BeautifulSoup

url = "https://csn.cancer.org/categories/prostate"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = soup.select_one('[data-react="DiscussionListModule"]')["data-props"]
data = json.loads(data)

print(json.dumps(data, indent=4))

This prints:

{
    "apiParams": {
        "rand": "5Y5ZJ48BKR"
    },
    "discussions": [
        {
            "discussionID": 324401,
            "type": "discussion",
            "name": "Some tips from CSN",
            "excerpt": "Welcome to the CSN! Below are some tips to help you get started following and posting on the boards. You can follow your preferred discussion board to get notifications of new topics and comments. To follow a board, navigate to your preferred board and then select the bell at the top, followed by your notification\u2026",
            "categoryID": 126,
            "dateInserted": "2021-12-01T22:15:59+00:00",
            "dateUpdated": "2023-12-18T14:32:44+00:00",
            "dateLastComment": "2021-12-01T22:15:59+00:00",
            "insertUserID": 231489,
            "insertUser": {
                "userID": 231489,
                "name": "CSNSupportTeam",
                "url": "https://csn.cancer.org/profile/CSNSupportTeam",
                "photoUrl": "https://us.v-cdn.net/6035652/uploads/userpics/OIBYPVTDVA21/nMI8TZCFNO605.jpg",
                "dateLastActive": "2024-03-22T20:35:00+00:00",
                "banned": 0,
                "punished": 0,
                "private": false,
                "label": ""
            },

...
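Once the JSON is loaded, the posts can be reduced to the fields of interest. The sketch below is self-contained: it builds a small hypothetical page whose `data-props` attribute mimics only the keys visible in the output above (`discussionID`, `name`, `excerpt`, `insertUser`), then pulls them out. The real payload may contain more fields or a different shape.

```python
import html
import json

from bs4 import BeautifulSoup

# Hypothetical props that mimic the keys shown in the output above
props = {
    "discussions": [
        {
            "discussionID": 324401,
            "name": "Some tips from CSN",
            "excerpt": "Welcome to the CSN!",
            "insertUser": {"name": "CSNSupportTeam"},
        }
    ]
}

# Build a small page whose data-props attribute holds the JSON, HTML-escaped
page = '<div data-react="DiscussionListModule" data-props="{}"></div>'.format(
    html.escape(json.dumps(props), quote=True)
)

soup = BeautifulSoup(page, "html.parser")
data = json.loads(soup.select_one('[data-react="DiscussionListModule"]')["data-props"])

# Keep one tuple per discussion: id, title, author name, excerpt
rows = [
    (d["discussionID"], d["name"], d["insertUser"]["name"], d["excerpt"])
    for d in data["discussions"]
]
for row in rows:
    print(row)
```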
