How do I get only the useful part of the description from an RSS feed item?

Posted 2024-05-15 05:34:35


I'm using Python feedparser to get this item["description"] from a Mashable feed:

<img alt="9f4397d9c05e474fa54291507ad9c03a" src="http://rack.2.mshcdn.com/media/ZgkyMDE2LzA0LzI2LzM0LzlmNDM5N2Q5YzA1LjMzODI0LmpwZwpwCXRodW1iCTU3NXgzMjMjCmUJanBn/393b8db2/53c/9f4397d9c05e474fa54291507ad9c03a.jpg" />
<div style="float: right; width: 50px;"><a href="http://twitter.com/share?via=Mashable&amp;text=Nail+polish+stockings+are+exactly+what+you+need+for+a+lazy+summer+pedicure&amp;src=http%3A%2F%2Fmashable.com%2F2016%2F04%2F26%2Ftoe-nail-polish-stockings%2F" style="margin: 10px;"><img alt="Feed-tw" border="0" src="http://rack.1.mshcdn.com/assets/feed-tw-f7c0a094d16b7ee7c91a1e50839a8e00.jpg" /></a><a href="http://www.facebook.com/sharer.php?u=http%3A%2F%2Fmashable.com%2F2016%2F04%2F26%2Ftoe-nail-polish-stockings%2F&amp;src=sp" style="margin: 10px;"><img alt="Feed-fb" border="0" src="http://rack.1.mshcdn.com/assets/feed-fb-c0a21e8841794479b8086c32c6f24ba1.jpg" /></a></div>
<div>
    <p>Say goodbye messy pedicures and hello to finally feeling the sweet freedom of open toed shoes in summer.</p>
    <p>Japanese fashion company <a href="http://www.bellemaison.jp/cpg/fashion/fakenail/fakenail_index.html">Belle Maison</a> has a time saving solution for those of us out there who have little time and little hand coordination for painting our toenails &#8212; thin stockings with pre-painted toenails.</p>
    <div>
        <p>SEE ALSO: <a href="http://mashable.com/2016/02/23/weiner-dog-ear-plugs/">Weiner dog ear plugs will help you sleep deeper than a newborn pup</a></p>
    </div>
    <figure>
        <p><img class="" src="http://rack.1.mshcdn.com/media/ZgkyMDE2LzA0LzI2L2M1L3RvZW5haWxhcnRwLjI4NjBiLmpwZwpwCXRodW1iCTU3NXg0MDk2Pg/4f07495a/b32/toe-nail-art-polish-stockings-japan-10.jpg" /></p>
        <div>
            <p>Image:  belle maison</p>
        </div>
    </figure>
    <p>If you're worried about looking a little out-of-date with the classic stockings and open-toed heels that your grandma used to wear, don't fret. The stockings are designed to fit individual toes, giving your pedicure a better fit as well. <a href="http://mashable.com/2016/04/26/toe-nail-polish-stockings/">Read more...</a></p>
</div>
More about <a href="http://mashable.com/conversations/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Conversations</a>, <a href="http://mashable.com/pics/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Pics</a>, <a href="http://mashable.com/category/products/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Products</a>, <a href="http://mashable.com/lifestyle/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Lifestyle</a>, and <a href="http://mashable.com/category/weird-products/?utm_campaign=Mash-Prod-RSS-Feedburner-All-Partial&amp;utm_cid=Mash-Prod-RSS-Feedburner-All-Partial">Weird Products</a>

That is a lot of information. All I really need for my reader is:

<p>Say goodbye messy pedicures and hello to finally feeling the sweet freedom of open toed shoes in summer.</p>
<p>Japanese fashion company <a href="http://www.bellemaison.jp/cpg/fashion/fakenail/fakenail_index.html">Belle Maison</a> has a time saving solution for those of us out there who have little time and little hand coordination for painting our toenails &#8212; thin stockings with pre-painted toenails.</p>

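How can I get just this part? Should I use Python regular expressions? I'm not sure, because almost every description is different, so writing a single expression would be hard. Is there another RSS item element that provides only the information I want? Thanks!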


2 Answers

If you want to go the re route, you can do the following:

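import re

# re.DOTALL lets "." match newlines, since the <div> body spans several lines
pat = re.compile(r"<div>(.*?)</div>", re.DOTALL)
s = pat.search(html).group(1)  # html holds item["description"]
result = [line.strip() for line in s.strip().splitlines()[:2]]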
# result
['<p>Say goodbye messy pedicures and hello to finally feeling the sweet freedom of open toed shoes in summer.</p>',
 '<p>Japanese fashion company <a href="http://www.bellemaison.jp/cpg/fashion/fakenail/fakenail_index.html">Belle Maison</a> has a time saving solution for those of us out there who have little time and little hand coordination for painting our toenails &#8212; thin stockings with pre-painted toenails.</p>']

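But as you can see, it's messy and likely to break. One solution would be to write a grammar and a small parser, but the robust and convenient approach is to use a parser such as BeautifulSoup or lxml.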

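As you correctly guessed, regex is not up to this task (obligatory link to this question). The best approach is to feed the HTML to a parser such as BeautifulSoup and write your logic against the parsed DOM object.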

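from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(my_input_html_string, "html.parser")
my_elements = soup.find_all('p')[0:2]  # first two <p> elements in the document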

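Obviously, this code assumes you are always looking for the first two <p> elements in any given DOM. You will have to adjust your logic depending on how consistent the descriptions in your input are.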
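As a rough sketch of one such adjustment, the snippet below looks only at <p> elements that are direct children of the article body, so paragraphs nested in the "SEE ALSO" block or the <figure> are skipped. It assumes the body is the first <div> with no attributes, which holds for the sample description in the question but may not hold for other feeds. The function name leading_paragraphs and its limit parameter are made up for illustration.

```python
from bs4 import BeautifulSoup


def leading_paragraphs(description_html, limit=2):
    """Return up to `limit` <p> tags that are direct children of the body <div>.

    Hypothetical helper: it assumes the article body is the first <div>
    that carries no attributes, as in the sample description.
    """
    soup = BeautifulSoup(description_html, "html.parser")
    # The share-button <div> has a style attribute, so it is not matched here.
    body = soup.find(lambda tag: tag.name == "div" and not tag.attrs)
    if body is None:
        return []
    # recursive=False ignores <p> elements nested in inner <div> or <figure>.
    return body.find_all("p", recursive=False, limit=limit)


if __name__ == "__main__":
    # Shortened stand-in for item["description"]
    sample = (
        '<img alt="x" src="a.jpg" />'
        '<div style="float: right;"><a href="#">share</a></div>'
        "<div>\n<p>First.</p>\n<p>Second <a href=\"u\">link</a>.</p>\n"
        "<div><p>SEE ALSO: other</p></div>\n<p>Third.</p>\n</div>More about"
    )
    for p in leading_paragraphs(sample):
        print(p)
```

Calling str() on each returned tag gives the markup back, and get_text() gives plain text if the reader should not see links.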
