How to parse HTML tags in an RSS feed in Python

Asked 2025-04-18 09:01

I have a small utility that converts the content of an RSS feed to plain text. Here is some sample code:

#!/usr/bin/python

# /usr/lib/xscreensaver/phosphor -scale 3 -program 'python newsfeed.py | tee /dev/stderr | festival --tts'

import os
import feedparser

def printLine():
    # Print a row of dashes as wide as the terminal, followed by a blank line.
    terminalRows, terminalColumns = os.popen('stty size', 'r').read().split()
    print("-" * int(terminalColumns) + "\n")

feed = feedparser.parse('http://home.web.cern.ch/scientists/updates/feed')

for post in feed.entries:
    printLine()
    print(post.title + "\n")
    print(post.description + "\n")
printLine()

When this code runs, the output looks like this:

-----------------------------------------------------------------------------------------------------

LHC seminar: Higgs boson width

<div class="field-body">
    <p>Constraints on the total Higgs boson width, Gamma_H, are presented using off-shell production and decay to ZZ in the 4l and 2l2nu final states. The analysis is based on data collected in 2012 by the CMS experiment at the LHC, corresponding to an integrated luminosity of L = 19.7/fb at a centre-of-mass energy of 8 TeV. The combined analysis of the 4l and 2l2nu events at high mass with the 4l measurement of the Higgs boson peak at 125.6 GeV leads to an upper limit on the Higgs boson width of Gamma_H &lt; 4.2 x Gamma_H(SM) at the 95% confidence level, assuming Gamma_H(SM) = 4.15 MeV. This result considerably improves over previous experimental constraints from direct measurements at the Higgs resonance peak.</p>
<h2><a href="https://indico.cern.ch/event/313506/">Watch the webcast at 11am CET</a></h2>
  </div>

-----------------------------------------------------------------------------------------------------

Neutrinos and nucleons

<p class="field-byline-taxonomy">
<a href="http://home.web.cern.ch/authors/christine-sutton">Christine Sutton</a></p>
  <div class="field-body">
    <p>On 7 April 1934 the journal <em>Nature</em> published a paper in which Hans Bethe and Rudolf Peierls made a first calculation of the neutrino cross-section and concluded that "it seems highly improbable that, even for cosmic ray energies, the cross-section becomes large enough to allow the process to be observed". Forty years on, neutrino cross-sections were not only being measured with the <a href="http://home.web.cern.ch/about/experiments/gargamelle">Gargamelle</a> bubble chamber at CERN's <a href="http://home.web.cern.ch/about/accelerators/proton-synchrotron">Proton Synchrotron</a>, they were helping to reveal a more fundamental layer to nature - the quarks.</p>
<p><strong>Read more:</strong> "<a href="http://cerncourier.com/cws/article/cern/56605">Neutrinos and nucleons</a>"- <em>CERN Courier</em></p>
  </div>

-----------------------------------------------------------------------------------------------------

Is there a good way to convert the content of most RSS feeds to plain text, with the HTML markup removed?

1 Answer

You could try the Python module beautifulsoup4 (installable with pip). This question may show you how to use it for your purpose.

As a start:

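from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(post.description, "html.parser")
# Collect every text node, dropping the tags themselves.
texts = soup.find_all(string=True)
print(''.join(texts))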

This code prints:

Christine Sutton

On 7 April 1934 the journal Nature published a paper in which Hans Bethe and Rudolf Peierls made a first calculation of the neutrino cross-section and concluded that "it seems highly improbable that, even for cosmic ray energies, the cross-section becomes large enough to allow the process to be observed". Forty years on, neutrino cross-sections were not only being measured with the Gargamelle bubble chamber at CERN's Proton Synchrotron, they were helping to reveal a more fundamental layer to nature - the quarks.
Read more: "Neutrinos and nucleons"- CERN Courier
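
If installing a third-party package is not an option, a similar result can probably be had with the standard library alone. The following is only a minimal sketch and not part of the answer above: the names TextExtractor and html_to_text are hypothetical helpers, and the sample string is abridged from the feed output shown in the question. It keeps the text nodes, decodes entities such as &lt;, and collapses whitespace.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; into "<".
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of `html` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Sample abridged from the first entry's description in the question.
sample = (
    '<div class="field-body">\n'
    '    <p>Gamma_H &lt; 4.2 x Gamma_H(SM)</p>\n'
    '<h2><a href="https://indico.cern.ch/event/313506/">'
    'Watch the webcast at 11am CET</a></h2>\n'
    '  </div>'
)
print(html_to_text(sample))
```

In the loop from the question, this would be called as print(html_to_text(post.description) + "\n"). One limitation of the sketch: block elements that are not separated by any whitespace in the source (for example "</p><p>") are joined without a space, so a real feed may need extra handling for that case.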
