Extracting multiple posts from a single blog archive page with BeautifulSoup, without the scripts

Asked 2025-04-18 11:40

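I am trying to extract the author, title, date, and post content from a series of WordPress and Blogger blog archive pages. I have saved the pages locally so that I don't have to keep hitting the server. Everything else works, but I can't seem to get all of the posts from each file at once without also picking up the "add-to-any" or "sociable" script clutter at the bottom. Here is what I have so far.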

import urllib2
from bs4 import BeautifulSoup
import re

file_list = open ("hafiles.txt", "r")
posts_file = open ("haposts.txt","w")


for indurl in file_list:
    indurl = indurl.rstrip("\n")
    with open(indurl,"r") as ha_file:
        soup_ha = BeautifulSoup(ha_file)

    #works the second find gets rid of the sociable crap
    # this is the way it looks on the page <div class='post-body'>

    posts = soup_ha.find("div", class_="post-body").find_all("p")


    #tried a trick i saw on http://stackoverflow.com/questions/24458353/cleaning-text-string-after-getting-body-text-using-beautifulsoup
    #no joy
    #posts = soup_ha.find("div", class_="post-body")
    #text = [''.join(s.findAll(text=True))for s in posts.findAll('p')] 
    text = str(posts) + "\n" + "\n"
    posts_file.write (text)

print ("All done!")



file_list.close()
posts_file.close()

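So if I use find_all to get all the posts (and I'm not even sure I am really getting all of them), I get the scripts as well. If I use only find, I can get nice, script-free posts in at least two ways. I have a list of files, and each file contains several posts that need to be extracted.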

I have searched Stack Overflow and the web.

Additional note: the input is a very messy web page with lots of scripts at the top and all the CSS definitions on the page, followed by:

<div id='main-wrapper'>
<div class='main section' id='main'><div class='widget Blog' id='Blog1'>
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>
<h3 class='post-title'>
<a href='http:// edited for anon.html'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit this is post text - what i want</p>
<script type='text/javascript'>
          var permlink='edit';
          var title='edit';

          var spans = document.getElementsByTagName('span');
          var number = 0;
          for(i=0; i <spans.length; i++){
                var c = " " + spans[i].className + " ";
                if (c.indexOf("fullpost") != -1) {
                number++;
                }
                }

                if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
           memory = number;
           </script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
this is the author name, also want, have way to get
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='http://edit' title='permanent link'>2:53 pm</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>1 comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edi</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>26 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='5518681505930320089'></a>
<h3 class='post-title'>
<a href='edit'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit post text, what I want.</p>
<script type='text/javascript'>
          var permlink='http://edit';
          var title='edit';

          var spans = document.getElementsByTagName('span');
          var number = 0;
          for(i=0; i <spans.length; i++){
                var c = " " + spans[i].className + " ";
                if (c.indexOf("fullpost") != -1) {
                number++;
                }
                }

                if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
           memory = number;
           </script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
edit author name
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='edit' title='permanent link'>9:00 am</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>5
comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edit</a>,
<a href='edit' rel='tag'>edit</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>22 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>

Phew! I have about 20 files, each with 1 to 10 posts (this file has 2)... It would be great to end up with a CSV or Excel file like this: date, author, title, post content

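One field per column, one post per row. I would also settle for a file containing only the post content, with some blank lines between posts. I don't mind a few links, bold text, or lists in the text, but I don't want the script clutter. Thanks!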

1 Answer


Here is an example that handles multiple posts on one page:

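from bs4 import BeautifulSoup

# Name the parser explicitly and close the file once it has been parsed.
with open('test.html') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

posts = []
# class_='post' matches any div whose class list contains "post".
for post in soup.find_all('div', class_='post'):
    title = post.find('h3', class_='post-title').text.strip()
    author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
    # Only the first <p> of the body is read, so the inline <script> text is skipped.
    content = post.find('div', class_='post-body').p.text.strip()
    # The date header is a sibling that precedes the post div.
    date = post.find_previous_sibling('h2', class_='date-header').text.strip()

    posts.append({'title': title,
                  'author': author,
                  'content': content,
                  'date': date})
print(posts)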

For the HTML you posted, it prints (Python 2 output shown; Python 3 gives the same values without the u prefixes):

[{'content': u'edit this is post text - what i want', 
  'date': u'27 February, 2007', 
  'author': u'this is the author name, also want, have way to get', 
  'title': u'edit'}, 
 {'content': u'edit post text, what I want.', 
  'date': u'26 February, 2007', 
  'author': u'edit author name', 
  'title': u'edit'}]
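
The question also asks for every paragraph of each post and for CSV output with one post per row. The following is a minimal sketch of one way to do that, assuming Python 3 and the Blogger markup shown in the question. The names `extract_posts`, `write_csv`, and `SAMPLE` are illustrative and not part of BeautifulSoup, and the sample HTML is a shortened stand-in for a saved archive page. WordPress pages use different class names, so the selectors would probably need adjusting there.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened stand-in for a saved Blogger archive page (illustrative data only).
SAMPLE = """
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<h3 class='post-title'><a href='#'>First title</a></h3>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<script type='text/javascript'>var permlink='x';</script>
</div>
<div class='post-footer'><p><span class='post-author'>
Posted by
Some Author
</span></p></div>
</div>
<h2 class='date-header'>26 February, 2007</h2>
<div class='post uncustomized-post-template'>
<h3 class='post-title'><a href='#'>Second title</a></h3>
<div class='post-body'><p>Only paragraph.</p></div>
<div class='post-footer'><p><span class='post-author'>Posted by Other Author</span></p></div>
</div>
<div class='post uncustomized-post-template'><a name='cut-off'></a></div>
</div>
"""


def extract_posts(html):
    """Return one dict per complete Blogger post found in the HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for post in soup.find_all("div", class_="post"):
        body = post.find("div", class_="post-body")
        title = post.find("h3", class_="post-title")
        if body is None or title is None:
            continue  # skip truncated or non-standard post blocks
        # Remove inline scripts and styles so their source never reaches the text.
        for junk in body.find_all(["script", "style"]):
            junk.decompose()
        paragraphs = [p.get_text(" ", strip=True) for p in body.find_all("p")]
        author = post.find("span", class_="post-author")
        date = post.find_previous_sibling("h2", class_="date-header")
        posts.append({
            "date": date.get_text(strip=True) if date else "",
            # Drop the "Posted by" label and collapse the surrounding whitespace.
            "author": " ".join(author.get_text().replace("Posted by", "").split())
                      if author else "",
            "title": title.get_text(strip=True),
            "content": "\n\n".join(p for p in paragraphs if p),
        })
    return posts


def write_csv(posts, out):
    """Write the posts to an open text stream, one post per row."""
    writer = csv.DictWriter(out, fieldnames=["date", "author", "title", "content"])
    writer.writeheader()
    writer.writerows(posts)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(extract_posts(SAMPLE), buffer)
    print(buffer.getvalue())
```

To process the list of saved files from the question, one could call `extract_posts` on the contents of each file, collect the results in a single list, and pass that list to `write_csv` with a file opened via `open("haposts.csv", "w", newline="")`. The output file name here is only an example.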
