Python extraction package - PyPI

Extract basic info from HTML webpages.

Detailed description of the extraction Python project

==========
Extraction
==========

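Extraction is a Python package for extracting titles, descriptions,
images and canonical URLs from web pages. You might want to use Extraction
if you are building a link aggregator where users submit links and you
want to display them (like links submitted to Facebook, Digg or Delicious).

Extraction does not crawl or retrieve pages itself. Rather, it is a tool
to use on data that has already been retrieved or crawled by a different tool.

`See here for the last version with Python 2.x compatibility <https://github.com/lethain/extraction/tree/c96afe2a9fd6d1fc1ad8eb793b43e2b9c1484c>`_.

See it on `GitHub <https://github.com/lethain/extraction>`_, or on
`PyPI <http://pypi.python.org/pypi/extraction/0.1.0>`_.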

Hello World Usage
=================

An extremely simple example of using `extraction` is::

    >>> import extraction
    >>> import requests
    >>> url = "http://lethain.com/social-hierarchies-in-engineering-organizations/"
    >>> html = requests.get(url).text
    >>> extracted = extraction.Extractor().extract(html, source_url=url)
    >>> extracted.title
    'Social Hierarchies in Engineering Organizations - Irrational Exuberance'
    >>> print(extracted.title, extracted.description, extracted.image, extracted.url)
    >>> print(extracted.titles, extracted.descriptions, extracted.images, extracted.urls)

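Note that `source_url` is optional in `extract`, but it is recommended
because it makes it possible to rewrite relative URLs and image URLs
into absolute paths. `source_url` is not used for fetching data,
but it can be used to target extraction techniques to the correct domain.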
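As a rough illustration of why `source_url` helps, the sketch below resolves relative links against a page URL using only the standard library's `urllib.parse.urljoin`. The helper name `absolutize` and the sample image paths are hypothetical and are not part of `extraction`; the library's own rewriting may work differently.

```python
from urllib.parse import urljoin


def absolutize(urls, source_url=None):
    """Resolve each URL against source_url; without one, return them unchanged."""
    if source_url is None:
        return list(urls)
    return [urljoin(source_url, url) for url in urls]


source_url = "http://lethain.com/social-hierarchies-in-engineering-organizations/"
# Hypothetical image references as they might appear in a page's markup.
images = ["/static/img/chart.png", "photo.png", "http://example.org/a.png"]

for absolute in absolutize(images, source_url):
    print(absolute)
```

Without a `source_url` the relative entries cannot be resolved, which is why the sketch simply returns them unchanged in that case.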

More detailed usage examples, including how to add your own
extraction mechanisms, are beneath the installation section.

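Installation
============

While `extraction` already pulls down `html5lib <http://code.google.com/p/html5lib/>`_
through its requirements, I still recommend installing `lxml <http://lxml.de/>`_ as well,
because `html5lib` has some extremely gnarly issues failing to parse XHTML pages
(for example, the PyPI page fails to parse entirely with html5lib)::

    >>> bs4.BeautifulSoup(text, "html5lib").find_all("title")
    []
    >>> bs4.BeautifulSoup(text, "lxml").find_all("title")
    [<title>extraction 0.1.3 : Python Package Index</title>]

You should be able to install `lxml <http://lxml.de/>`_ via pip::

    pip install lxml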

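Then, after installing `lxml`, you can install from GitHub::

    git clone https://github.com/lethain/extraction.git
    cd extraction
    python3 -m venv env
    . ./env/bin/activate
    pip install -r requirements.txt
    pip install -e .

Then you can run the tests::

    python tests/tests.py

All of these tests should pass in a sane installation.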

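Using Extraction
================

This section covers various ways to use Extraction, both with the
existing extraction techniques and by adding your own.

For more examples, look in the `extraction/examples` directory.

Basic Usage
-----------

The simplest possible example is the same one shown earlier::

    >>> import extraction
    >>> import requests
    >>> url = "http://lethain.com/social-hierarchies-in-engineering-organizations/"
    >>> html = requests.get(url).text
    >>> extracted = extraction.Extractor().extract(html, source_url=url)
    >>> extracted.title
    'Social Hierarchies in Engineering Organizations - Irrational Exuberance'
    >>> print(extracted.title, extracted.description, extracted.image, extracted.url)

You can get the best title, description and so on out of an
`Extracted` instance (which is what `Extractor.extract` returns)::

    >>> print(extracted.title)
    >>> print(extracted.description)
    >>> print(extracted.url)
    >>> print(extracted.image)
    >>> print(extracted.feed)

You can get the full list of extracted values using the plural versions::

    >>> print(extracted.titles)
    >>> print(extracted.descriptions)
    >>> print(extracted.urls)
    >>> print(extracted.images)
    >>> print(extracted.feeds)

If you are looking for data which is being extracted but does not
fall into one of those categories (perhaps from a custom technique),
then take a look at the `_unexpected` dictionary::

    >>> print(extracted._unexpected)

Any values that were not anticipated are stored there (if you run into
this frequently, look at `Subclassing Extracted to Extract New Types of Data`).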

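Using Custom Techniques and Changing Technique Ordering
-------------------------------------------------------

The order in which techniques run is significant: the most accurate
techniques should always run first, and more general, lower quality
techniques should run later.

This is because titles, descriptions, images and URLs are each stored
internally in a list, which is built up as techniques are run,
and the `title`, `url`, `image` and `description` properties
simply return the first item from the corresponding list.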
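The effect of ordering can be sketched with plain dictionaries. This is an illustration only, not the library's code: the helper names `merge_results` and `first`, the sample data, and the choice to skip duplicate values are all assumptions.

```python
from collections import defaultdict


def merge_results(results):
    """Merge per-technique dicts, in run order, into one dict of lists."""
    merged = defaultdict(list)
    for result in results:
        for key, values in result.items():
            for value in values:
                # Assumption: a repeated value keeps its earliest position.
                if value not in merged[key]:
                    merged[key].append(value)
    return dict(merged)


def first(merged, key):
    """Return the first (best) value for key, or None if nothing was found."""
    values = merged.get(key, [])
    return values[0] if values else None


# Hypothetical outputs of two techniques, the more accurate one listed first.
opengraph = {"titles": ["Digg v4's Architecture"],
             "images": ["http://example.org/og.png"]}
semantic = {"titles": ["Irrational Exuberance", "Digg v4's Architecture"],
            "descriptions": ["A first paragraph."]}

merged = merge_results([opengraph, semantic])
print(first(merged, "titles"))
print(merged["titles"])
```

Swapping the two inputs changes which title comes first, which is the reason the more accurate technique should run earlier.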

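Techniques are represented by a string containing the full path to
the technique, including its class. For example,
`"extraction.techniques.FacebookOpengraphTags"`
is a valid representation of a technique.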
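One common way to turn such a dotted string into a class is `importlib`. The sketch below is an assumption about how this could be done, not a copy of what `extraction` does internally; `load_class` is a hypothetical helper, and a standard-library class stands in for a technique so that the example runs without `extraction` installed.

```python
import importlib


def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_path)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in path; with extraction installed, a technique path would go here.
cls = load_class("collections.OrderedDict")
print(cls.__name__)
```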

The default ordering of techniques lives in the `techniques` class
variable of `extraction.Extractor`, and is::

    extraction.techniques.FacebookOpengraphTags
    extraction.techniques.TwitterSummaryCardTags
    extraction.techniques.HTML5SemanticTags
    extraction.techniques.HeadTags
    extraction.techniques.SemanticTags

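You can modify the order and inclusion of techniques in three ways.
First, you can pass a list of techniques to the optional `techniques`
parameter when initializing an `extraction.Extractor`::

    >>> techniques = ["my_module.MyTechnique", "extraction.techniques.FacebookOpengraphTags"]
    >>> extractor = extraction.Extractor(techniques=techniques)

Second, you can subclass `Extractor` and give the subclass its own value
for the `techniques` class variable.

Finally, the third option is to modify the `techniques` class variable
directly. If possible, use one of the first two approaches instead to avoid
debugging issues in the future::

    >>> import extraction
    >>> extraction.Extractor.techniques.insert(0, "my_module.MyAwesomeTechnique")
    >>> extraction.Extractor.techniques.append("my_module.MyLastReportTechnique")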

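Writing New Techniques
----------------------

It may be that you frequently parse a given website and are not
impressed with how the default extraction techniques perform.
In that case, consider writing your own technique.

Let's take as an example a blog entry on `lethain.com`, which uses the `h1` tag
for the overall blog title and always uses the first `h2` tag in `div.page`
for the actual title. A technique for this could look like::

    from bs4 import BeautifulSoup
    from extraction.techniques import Technique

    class LethainComTechnique(Technique):
        def extract(self, html):
            "Extract data from lethain.com."
            soup = BeautifulSoup(html)
            page_div = soup.find('div', class_='page')
            text_div = soup.find('div', class_='text')
            return {'titles': [page_div.find('h2').string],
                    'dates': [page_div.find('span', class_='date').string],
                    'descriptions': [" ".join(text_div.find('p').strings)],
                    'tags': [x.find('a').string for x in page_div.find_all('span', class_='tag')],
                    'images': [x.attrs['src'] for x in text_div.find_all('img')],
                    }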

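To integrate your technique, see the
`Using Custom Techniques and Changing Technique Ordering` section above.

Adding new techniques that incorporate microformats is an interesting
area to consider. Most microformats have very limited usage, but where
they are in use they tend to be high quality sources of information.

Subclassing Extracted to Extract New Types of Data
--------------------------------------------------

Your techniques can return non-standard keys in the dictionary
returned by `extract`, and those keys will be available in the
`Extracted()._unexpected` dictionary. In this way you can fairly
easily add support for extracting addresses or other content.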
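The split between anticipated and unanticipated keys can be pictured with a small sketch. The key list and the helper name `split_results` below are assumptions made for illustration; they are not taken from the library's source.

```python
# Assumed set of anticipated keys, based on the plural properties named in this document.
KNOWN_KEYS = ("titles", "descriptions", "images", "urls", "feeds")


def split_results(result):
    """Separate anticipated keys from everything else a technique returned."""
    known = {key: list(result.get(key, [])) for key in KNOWN_KEYS}
    unexpected = {key: value for key, value in result.items()
                  if key not in KNOWN_KEYS}
    return known, unexpected


# Hypothetical technique output with two non-standard keys.
known, unexpected = split_results(
    {"titles": ["A title"], "dates": ["2012-02-19"], "tags": ["python"]})
print(known["titles"])
print(unexpected)
```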

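For a contrived example, we'll extract my address from `willarson.com`,
which is in no way a realistic example of extracting an address and
is only meant to show how to add a new type of extracted data.

Adding support for extracting addresses looks like this (I've tried to
keep it as concise as possible so that it fits into this document more cleanly)::

    from extraction.techniques import Technique
    from extraction import Extractor, Extracted
    from bs4 import BeautifulSoup

    class AddressExtracted(Extracted):
        def __init__(self, addresses=None, *args, **kwargs):
            self.addresses = addresses or []
            super(AddressExtracted, self).__init__(*args, **kwargs)

        @property
        def address(self):
            return self.addresses[0] if self.addresses else None

    class AddressExtractor(Extractor):
        "Extractor which supports addresses as first-class data."
        extracted_class = AddressExtracted
        text_types = ["titles", "descriptions", "addresses"]

    class AddressTechnique(Technique):
        def extract(self, html):
            "Extract address data from willarson.com."
            soup = BeautifulSoup(html)
            return {'addresses': [" ".join(soup.find('div', id='address').strings)]}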

Usage would then look like::

    >>> import requests
    >>> from extraction.examples.new_return_type import AddressExtractor
    >>> extractor = AddressExtractor()
    >>> extractor.techniques = ["extraction.examples.new_return_type.AddressTechnique"]
    >>> extracted = extractor.extract(requests.get("http://willarson.com/").text)
    >>> extracted.address
    'Cole Valley San Francisco, CA USA'

There you have it: extracted addresses as first-class extracted data.

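Passing Parameters to Techniques
--------------------------------

There are two ways to pass parameters to techniques.

First, you can simply subclass the technique with the specific behavior
you want, perhaps pulling the data from Django settings or elsewhere::

    from extraction.techniques import Technique

    class MyTechnique(Technique):
        def __init__(self, *args, **kwargs):
            if "something" in kwargs:
                self.something = kwargs["something"]
                del kwargs["something"]
            else:
                self.something = "something else"
            super(MyTechnique, self).__init__(*args, **kwargs)

        def extract(self, html, source_url=None):
            print(self.something)
            return super(MyTechnique, self).extract(html, source_url=source_url)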

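Second, every technique is handed the extractor that is processing it,
so you can bake the customization into an `extraction.Extractor` subclass::

    from extraction import Extractor
    from extraction.techniques import Technique

    class MyExtractor(Extractor):
        techniques = ["my_module.MyTechnique"]

        def __init__(self, something, *args, **kwargs):
            self.something = something
            super(MyExtractor, self).__init__(*args, **kwargs)

    class MyTechnique(Technique):
        def extract(self, html, source_url=None):
            print(self.extractor.something)
            return super(MyTechnique, self).extract(html, source_url=source_url)

Between these two approaches, it should be feasible to get the
customization of behavior you need.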

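Extraction Techniques
=====================

For more information about using or writing techniques, see the
`Using Extraction` section above.

extraction.techniques.HeadTags
------------------------------

The head tag of a page usually carries a title, and many pages also
include additional data there such as descriptions, RSS feeds and so on.
This technique parses data that looks like::

    <head>
      <meta name="description" content="Will Larson&#39;s blog about programming and other things." />
      <link rel="alternate" type="application/rss+xml" title="Page Feed" href="/feeds/" />
      <link rel="canonical" href="http://lethain.com/digg-v4-architecture-process/">
      <title>Digg v4&#39;s Architecture and Development Processes - Irrational Exuberance</title>
    </head>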

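While the head tag is the authoritative source of canonical URLs and RSS feeds,
it is often hit or miss for titles, descriptions and the like.
At worst, it is better than nothing.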
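To make the idea concrete, here is a minimal sketch that reads the same kinds of head tags using only the standard library's `html.parser`. It is not the `HeadTags` implementation (the package itself builds on BeautifulSoup); the class name `HeadTagsSketch`, the helper `extract_head` and the result keys are assumptions.

```python
from html.parser import HTMLParser


class HeadTagsSketch(HTMLParser):
    """Collect the title, meta description, canonical URL and RSS feeds."""

    def __init__(self):
        super().__init__()
        self.found = {"titles": [], "descriptions": [], "urls": [], "feeds": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.found["descriptions"].append(attrs.get("content") or "")
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.found["urls"].append(attrs.get("href") or "")
        elif (tag == "link" and attrs.get("rel") == "alternate"
              and attrs.get("type") == "application/rss+xml"):
            self.found["feeds"].append(attrs.get("href") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; whitespace between tags is ignored.
        if self._in_title and data.strip():
            self.found["titles"].append(data.strip())


def extract_head(html):
    parser = HeadTagsSketch()
    parser.feed(html)
    parser.close()
    return parser.found


page = """<head>
<meta name="description" content="A blog about programming." />
<link rel="alternate" type="application/rss+xml" title="Page Feed" href="/feeds/" />
<link rel="canonical" href="http://lethain.com/digg-v4-architecture-process/">
<title>Digg v4 Architecture - Irrational Exuberance</title>
</head>"""
print(extract_head(page))
```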

extraction.techniques.FacebookOpengraphTags
-------------------------------------------

For better or for worse, the highest quality source of page data is usually the
`Facebook Opengraph meta tags <https://developers.facebook.com/docs/opengraphprotocol/>`_.
This technique uses Opengraph tags, which look like this::

    <head>
      ...
      <meta property="og:title" content="Something" />
      <meta property="og:url" content="http://www.example.org/something//" />
      <meta property="og:image" content="http://images.example.org/a/" />
      <meta property="og:description" content="Something amazing." />
      ...
    </head>

as their source of data.
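A minimal sketch of reading such tags with the standard library follows. It is an illustration under assumptions, not the `FacebookOpengraphTags` code: the mapping `OG_KEYS`, the class `OpengraphSketch` and the helper `extract_opengraph` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed mapping from Opengraph property to the plural result key.
OG_KEYS = {"og:title": "titles", "og:url": "urls",
           "og:image": "images", "og:description": "descriptions"}


class OpengraphSketch(HTMLParser):
    """Collect the content of <meta property="og:..."> tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = OG_KEYS.get(attrs.get("property"))
        content = attrs.get("content")
        if key and content:
            self.found.setdefault(key, []).append(content)


def extract_opengraph(html):
    parser = OpengraphSketch()
    parser.feed(html)
    parser.close()
    return parser.found


sample = ('<head><meta property="og:title" content="Something" />'
          '<meta property="og:image" content="http://images.example.org/a/" />'
          '<meta name="description" content="not an Opengraph tag" /></head>')
print(extract_opengraph(sample))
```

Meta tags without a recognised `property` attribute are ignored, so ordinary `name="description"` tags are left for other techniques.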

extraction.techniques.TwitterSummaryCardTags
--------------------------------------------

Another increasingly common set of meta tags is the
`Twitter Card tags <https://dev.twitter.com/docs/cards/types/summary-card>`_.
This technique parses those tags, which look like::

    <head>
      ...
      <meta name="twitter:card" content="summary">
      <meta name="twitter:site" content="@nytimes">
      <meta name="twitter:creator" content="@sarahmaslinnir">
      <meta name="twitter:title" content="Parade of Fans for Houston's Funeral">
      <meta name="twitter:description" content="NEWARK - The guest list and parade ...">
      <meta name="twitter:image" content="http://graphics8.nytimes.com/images/2012/02/19/us/19whitney-span/19whitney-span-article.jpg">
      ...
    </head>

This is a very high quality source of data, since images are necessary
for rendering in the Twitter feed. The tags are of little help for
canonicalizing articles, however.

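extraction.techniques.HTML5SemanticTags
---------------------------------------

The HTML5 `article` tag, and also the `video` tag, give us some useful
hints for extracting page information from sites that happen to use these tags.

This technique will extract information from pages formed like::

    <html>
      <body>
        <h1>This is not a title to HTML5SemanticTags</h1>
        <article>
          <h1>This is a title</h1>
          <p>This is a description.</p>
          <p>This is not a description.</p>
        </article>
        <video>
          <source src="this_is_a_video.mp4">
        </video>
      </body>
    </html>

`HTML5SemanticTags` is deliberately conservative: it provides high quality
information in the small number of cases where it hits, and otherwise expects
`SemanticTags` to sweep up behind it for the lower quality, more abundant
hits that technique discovers.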

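extraction.techniques.SemanticTags
----------------------------------

This technique relies on the basic tags themselves: `img` tags contain
images, most `h1` and `h2` tags contain titles, and `p` tags often
include text usable as descriptions::

    <html>
      <body>
        <h1>This will be extracted as a title.</h1>
        <h2>So will this, but after all H1s.</h2>
        <img src="this_will_be_extracted_as_an_img.png">
        <p>And this as a description.</p>
        <p>This as another possible description.</p>
        <p>This as a third possible description.</p>
      </body>
    </html>

There is a limit, defined within `SemanticTags`, on how many tags of a
given type will be consumed. It is usually 3-5, with the exception of
images, where it is 10 (as this is actually a valid way to detect images,
unlike the others).

This is a true last resort technique.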
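The per-tag limits and the "all `h1` before any `h2`" ordering can be sketched with the standard library. The exact limits in `LIMITS` are assumptions picked from the ranges mentioned above, and `SemanticTagsSketch` and `extract_semantic` are hypothetical names; this is not the library's `SemanticTags` code.

```python
from html.parser import HTMLParser

# Assumed limits: the text only says "usually 3-5" and 10 for images.
LIMITS = {"h1": 3, "h2": 3, "p": 5, "img": 10}


class SemanticTagsSketch(HTMLParser):
    """Collect text of h1/h2/p tags and src of img tags, up to a limit each."""

    def __init__(self):
        super().__init__()
        self.by_tag = {tag: [] for tag in LIMITS}
        self._open = None   # the text tag currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and len(self.by_tag["img"]) < LIMITS["img"]:
                self.by_tag["img"].append(src)
        elif tag in ("h1", "h2", "p") and self._open is None:
            self._open, self._text = tag, []

    def handle_data(self, data):
        if self._open:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text and len(self.by_tag[tag]) < LIMITS[tag]:
                self.by_tag[tag].append(text)
            self._open = None


def extract_semantic(html):
    parser = SemanticTagsSketch()
    parser.feed(html)
    parser.close()
    found = parser.by_tag
    return {"titles": found["h1"] + found["h2"],   # every h1 before any h2
            "descriptions": found["p"],
            "images": found["img"]}


doc = ("<html><body><h2>Second</h2><h1>First</h1><img src='a.png'>"
       "<p>One</p><p>Two</p><p>Three</p><p>Four</p><p>Five</p><p>Six</p>"
       "</body></html>")
print(extract_semantic(doc))
```

With the assumed limit of five paragraphs, the sixth `p` tag in the sample is dropped.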

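Implementation Details
======================

For the implementation details, the recommended reading order is::

    extraction/tests.py
    extraction/__init__.py
    extraction/techniques.py

Hopefully all questions are answered therein.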

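Contributions, Questions, Concerns
==================================

Open a pull request on GitHub with any improvements and
I'll be glad to merge it in.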

extraction 0.3
