webarticle2text Python package - PyPI

Extracts the main article text from a web page.


# webarticle2text - Extracts the main article text from a web page.

[![](https://img.shields.io/pypi/v/webarticle2text.svg)](https://pypi.python.org/pypi/webarticle2text) [![Build Status](https://img.shields.io/travis/chrisspen/webarticle2text.svg?branch=master)](https://travis-ci.org/chrisspen/webarticle2text) [![](https://pyup.io/repos/github/chrisspen/webarticle2text/shield.svg)](https://pyup.io/repos/github/chrisspen/webarticle2text)

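## Overview

This project is obsolete and is now kept only for reference. I recommend using [newspaper](https://github.com/codelucas/newspaper) instead, which is an order of magnitude more accurate than any other article extraction library I have come across.

For a performance comparison of several similar tools, see compare.csv.

This attempts to locate and extract the main article text of a web page. It does so by walking the DOM tree, identifying every text segment and its depth in the DOM, grouping the segments that sit at roughly the same depth, and returning the group with the largest total length.

This method usually works for typical news sites that display one news article per URL. It usually fails for URLs that display several news summaries (for example, news aggregators).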
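The depth-based heuristic described above can be illustrated with a minimal sketch. This is not the package's actual implementation: the names `DepthTextCollector` and `extract_main_text` are hypothetical, the sketch uses only the standard library `html.parser`, it groups segments by exactly equal depth rather than "roughly" equal depth, and it assumes reasonably well-formed markup.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose text is never article content.
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthTextCollector(HTMLParser):
    """Hypothetical helper: records each text segment with its DOM depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.skip = 0        # > 0 while inside <script> or <style>
        self.segments = []   # list of (depth, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if tag in SKIP_TAGS:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip:
            self.segments.append((self.depth, text))


def extract_main_text(html):
    """Return the text of the depth group with the largest total length."""
    collector = DepthTextCollector()
    collector.feed(html)
    collector.close()

    by_depth = defaultdict(list)
    for depth, text in collector.segments:
        by_depth[depth].append(text)
    if not by_depth:
        return ""

    # Simplification: segments are grouped by identical depth only.
    best = max(by_depth.values(), key=lambda texts: sum(len(t) for t in texts))
    return "\n".join(best)
```

On a page with a single article, the paragraphs usually share one depth and outweigh navigation and footer text. On an aggregator page, many short summaries compete at similar depths, which is consistent with the failure mode noted above.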

## Installation

You may need to install the tidylib system package. On Ubuntu 12.04, use:

    sudo apt-get install libtidy-0.99-0

Or on Fedora, use:

    sudo yum install libtidy

Then simply install with pip:

    pip install webarticle2text

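## Usage

You can call the script as a Python module:

    from webarticle2text import webarticle2text
    print(webarticle2text.extractFromURL("http://some/arbitrary/url"))

Or as a standalone command-line script:

    webarticle2text.py http://some/arbitrary/url

Note that to use it from the command line, you need to make sure the script has execute permission and is located on your PATH. On most platforms, setup.py should take care of this automatically.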
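When extracting from several URLs, it can help to isolate failures per URL. The following is a minimal sketch under stated assumptions: `extract_many` and its `extractor` parameter are hypothetical names, not part of the package, and the only package call used is `extractFromURL` as shown above. Which exceptions that call raises is not documented here, so the sketch catches `Exception` broadly.

```python
def extract_many(urls, extractor=None):
    """Map each URL to its extracted text, or to None if extraction failed.

    `extractor` is a hypothetical hook: any callable that takes a URL and
    returns text. It defaults to webarticle2text.extractFromURL.
    """
    if extractor is None:
        # Imported lazily so the helper can be exercised without the package.
        from webarticle2text import webarticle2text
        extractor = webarticle2text.extractFromURL

    results = {}
    for url in urls:
        try:
            results[url] = extractor(url)
        except Exception:
            # Assumption: any network or parsing failure surfaces as an
            # exception; the failing URL is recorded and the loop continues.
            results[url] = None
    return results
```

Passing the extractor as a parameter keeps the helper testable offline and makes it straightforward to swap in another library, such as the newspaper package recommended in the overview.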

## Development

The tests require the Python development headers, which you can install on Ubuntu with:

    sudo apt-get install python-dev python3-dev python3.4-dev

To run the unit tests across multiple Python versions, install:

    sudo apt-get install python3.4-minimal python3.4-dev python3.5-minimal python3.5-dev

To run all [tests](http://tox.readthedocs.org/en/latest/):

    export TESTNAME=; tox

To run the tests for a specific environment (e.g. Python 2.7):

    export TESTNAME=; tox -e py27

To run a specific test:

    export TESTNAME=.test_extract; tox -e py27

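## History

1.0.0 (2008.9.16) First public release.
1.2.0 (2011.1.3) Updated to support Unicode.
1.2.2 (2011.12.17) Cleaned up the installation process and documentation, and moved to github.com.
1.2.3 (2011.12.21) Fixed an encoding error when redirecting stdout, e.g. webarticle2text.py http://some/arbitrary/url > output.txt
1.2.5 (2012.11.5) Added an option for specifying the user-agent header to use when requesting the URL.
2.0.0 (2014.4.20) Added support for Python 3.2.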


webarticle2text 3.0.2
