Extracting multiple posts from a single blog archive page with BeautifulSoup, without the scripts

Question

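I am trying to extract the author, title, date, and post content from a series of WordPress and Blogger blog archive pages. I have saved the pages locally so that I don't have to keep hitting the server. Everything else works, but I can't seem to get all of the posts from each file at once without also picking up the "add-to-any" or "sociable" script junk at the bottom. Here is what I have so far.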

from bs4 import BeautifulSoup

file_list = open("hafiles.txt", "r")
posts_file = open("haposts.txt", "w")

for indurl in file_list:
    indurl = indurl.rstrip("\n")
    with open(indurl, "r") as ha_file:
        # Name the parser explicitly so the result does not depend on what is installed.
        soup_ha = BeautifulSoup(ha_file, "html.parser")

    # This works: the second find_all gets rid of the sociable junk.
    # On the page the container looks like <div class='post-body'>.
    # Note: find() returns only the first post-body in each file.
    posts = soup_ha.find("div", class_="post-body").find_all("p")

    # Tried a trick I saw on http://stackoverflow.com/questions/24458353/cleaning-text-string-after-getting-body-text-using-beautifulsoup
    # No joy.
    # posts = soup_ha.find("div", class_="post-body")
    # text = [''.join(s.findAll(text=True)) for s in posts.findAll('p')]
    text = str(posts) + "\n" + "\n"
    posts_file.write(text)

print("All done!")

file_list.close()
posts_file.close()

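So if I use find_all to get all of the posts (and I'm not even sure I was actually getting all of them), I get the scripts too. If I use only find, I can get nice script-free posts in at least two ways. I have a list of files, and each file has several posts that need to be extracted.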
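To make the find versus find_all trade-off concrete, here is a minimal sketch of one possible approach: select every `div.post-body` with `find_all`, then delete the `script` and `style` tags inside each one with `decompose()` before reading the text. The inline `SAMPLE_HTML` string and the function name `extract_post_bodies` are hypothetical; they only mimic the Blogger markup shown in this question and are not taken from the real files.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one saved archive page.
SAMPLE_HTML = """
<div class='blog-posts'>
<div class='post'>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>first post text</p>
<script type='text/javascript'>var permlink='edit';</script>
</div>
</div>
<div class='post'>
<div class='post-body'>
<p>second post text</p>
<script type='text/javascript'>var title='edit';</script>
</div>
</div>
</div>
"""


def extract_post_bodies(html):
    """Return the text of every div.post-body, with script/style removed."""
    soup = BeautifulSoup(html, "html.parser")
    bodies = []
    # find_all returns every matching div; find would stop at the first one.
    for body in soup.find_all("div", class_="post-body"):
        # Remove the tags that carry the unwanted JavaScript and CSS.
        for junk in body.find_all(["script", "style"]):
            junk.decompose()
        bodies.append(body.get_text(" ", strip=True))
    return bodies


if __name__ == "__main__":
    for text in extract_post_bodies(SAMPLE_HTML):
        print(text)
        print()
```

If links, bold text, and lists should survive, `str(body)` after the `decompose()` loop would keep the remaining HTML instead of flattening it to plain text.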

I have searched Stack Overflow and the web.

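Additional note: the input is a very messy web page with lots of scripts at the top and all of the CSS definitions in the page itself, followed by: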

<div id='main-wrapper'>
<div class='main section' id='main'><div class='widget Blog' id='Blog1'>
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>
<h3 class='post-title'>
<a href='http:// edited for anon.html'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit this is post text - what i want</p>
<script type='text/javascript'>
          var permlink='edit';
          var title='edit';

          var spans = document.getElementsByTagName('span');
          var number = 0;
          for(i=0; i <spans.length; i++){
                var c = " " + spans[i].className + " ";
                if (c.indexOf("fullpost") != -1) {
                number++;
                }
                }

                if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
           memory = number;
           </script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
this is the author name, also want, have way to get
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='http://edit' title='permanent link'>2:53 pm</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>1 comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edi</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>26 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='5518681505930320089'></a>
<h3 class='post-title'>
<a href='edit'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit post text, what I want.</p>
<script type='text/javascript'>
          var permlink='http://edit';
          var title='edit';

          var spans = document.getElementsByTagName('span');
          var number = 0;
          for(i=0; i <spans.length; i++){
                var c = " " + spans[i].className + " ";
                if (c.indexOf("fullpost") != -1) {
                number++;
                }
                }

                if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
           memory = number;
           </script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
edit author name
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='edit' title='permanent link'>9:00 am</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>5
comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edit</a>,
<a href='edit' rel='tag'>edit</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>22 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>

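Whew! I have about 20 files, each with 1 to 10 posts (this file has 2)... It would be great to end up with a CSV or Excel file like this: date, author, title, post content

One field per column, one post per row. I would also be fine with a file that holds only the post content, with a few blank lines between posts. I'm OK with some links, bold text, lists, and so on in the text, but I don't want the script junk. Thanks!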
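The desired output could be sketched roughly as follows, assuming the markup in every file matches the sample above: each `div` with class `post` is one row, the date comes from the nearest preceding `h2.date-header`, and the author is whatever follows "Posted by" in `span.post-author`. The names `SAMPLE_HTML`, `clean_text`, `extract_rows`, and `write_csv` are hypothetical, as are the column names; WordPress pages would likely need different selectors.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one saved Blogger archive page.
SAMPLE_HTML = """
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<h3 class='post-title'><a href='#'>First title</a></h3>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>first post text</p>
<script type='text/javascript'>var permlink='edit';</script>
</div>
<div class='post-footer'>
<span class='post-author'>
Posted by
Some Author
</span>
</div>
</div>
<h2 class='date-header'>26 February, 2007</h2>
<div class='post uncustomized-post-template'>
<h3 class='post-title'><a href='#'>Second title</a></h3>
<div class='post-body'>
<p>second post text</p>
<script type='text/javascript'>var title='edit';</script>
</div>
<div class='post-footer'>
<span class='post-author'>
Posted by
Other Author
</span>
</div>
</div>
</div>
"""


def clean_text(tag):
    """Collapse whitespace in a tag's text; return '' if the tag is missing."""
    return " ".join(tag.get_text().split()) if tag is not None else ""


def extract_rows(html):
    """Return one (date, author, title, content) tuple per post."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # class_="post" matches the "post" token only, not post-body or post-footer.
    for post in soup.find_all("div", class_="post"):
        body = post.find("div", class_="post-body")
        if body is not None:
            # Strip the script and style tags before reading the text.
            for junk in body.find_all(["script", "style"]):
                junk.decompose()
        # The date header is a sibling that precedes the post, not a child.
        date = clean_text(post.find_previous("h2", class_="date-header"))
        title = clean_text(post.find("h3", class_="post-title"))
        author = clean_text(post.find("span", class_="post-author"))
        if author.startswith("Posted by"):
            author = author[len("Posted by"):].strip()
        rows.append((date, author, title, clean_text(body)))
    return rows


def write_csv(rows, handle):
    """Write a header line plus one CSV line per post to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["date", "author", "title", "content"])
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(extract_rows(SAMPLE_HTML), buffer)
    print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, the `csv` module documentation recommends opening it with `newline=""`, for example `open("haposts.csv", "w", newline="")`, and collecting the rows from all of the saved pages before calling `write_csv`.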

Tags: data-extraction, web-scraping, html-parsing, data-cleaning, beautifulsoup, csv, wordpress, blogger
