scrapy-beautifulsoup Python package - PyPI

A simple Scrapy middleware that uses BeautifulSoup to process non-well-formed HTML

scrapy-beautifulsoup project description

scrapy-beautifulsoup

A simple Scrapy middleware that uses BeautifulSoup to process non-well-formed HTML.

Installation

The package is on PyPI and can be installed with pip:

pip install scrapy-beautifulsoup

Configuration

Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary setting:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
}

By default, BeautifulSoup uses the built-in html.parser parser. To change it, set the BEAUTIFULSOUP_PARSER setting:

BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

html5lib is an extremely lenient parser. If the target HTML is seriously broken, consider making it your first choice. Note: in that case, html5lib must be installed:

pip install html5lib
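Because html5lib (and lxml) are optional installs, one way to keep a project working across environments is to choose the parser at settings load time. The following is a minimal sketch for settings.py, not part of the package: the helper name pick_parser and its preference order are assumptions made for illustration. It relies only on the standard library's importlib.util.find_spec to check whether a parser module is importable.

```python
import importlib.util


def pick_parser(preferred=("html5lib", "lxml"), default="html.parser"):
    """Return the first parser from `preferred` whose module is installed.

    Hypothetical helper (not provided by scrapy-beautifulsoup); falls back
    to the built-in html.parser when no preferred parser is importable.
    """
    for name in preferred:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is not None:
            return name
    return default


# Value consumed by the middleware through the BEAUTIFULSOUP_PARSER setting.
BEAUTIFULSOUP_PARSER = pick_parser()
```

With this in place, the setting resolves to html5lib where it is installed and silently degrades to html.parser elsewhere. Whether a silent fallback is preferable to a hard failure depends on how broken the target HTML is expected to be.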

Motivation

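BeautifulSoup itself, with the help of an underlying parser of choice, does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to "fix" it.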
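To make the idea of piping HTML through BeautifulSoup concrete, here is a minimal sketch of what a downloader middleware of this kind could look like. It is an illustration under stated assumptions, not the package's actual source: the class name SoupRepairMiddleware is hypothetical, and the response is assumed to be a text response exposing .text and .replace(), as Scrapy's TextResponse does.

```python
from bs4 import BeautifulSoup


class SoupRepairMiddleware:
    """Hypothetical sketch; not the code shipped in scrapy-beautifulsoup."""

    def __init__(self, parser="html.parser"):
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Read the parser name from the settings, defaulting to html.parser.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Assumption: `response` is a text response (has .text and .replace()).
        # Parse the possibly broken markup, then serialise the repaired tree.
        repaired = str(BeautifulSoup(response.text, self.parser))
        # Return a copy of the response that carries the repaired markup.
        return response.replace(body=repaired.encode("utf-8"), encoding="utf-8")
```

Given a body such as `<p>Hello <b>world`, the parser builds a complete tree, so the serialised markup has the unclosed tags closed and downstream selectors see consistent HTML. Re-encoding as UTF-8 is a design choice of this sketch; it keeps the declared encoding and the body bytes in agreement.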

scrapy-beautifulsoup 0.0.2
