html2data Python package - PyPI

A simple way to transform an HTML file or URL into structured data.

Project description for html2data

Welcome to html2data

Author:	Daniel Perez Rada <dperezrada@gmail.com>

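Description

A simple way to transform an HTML file or URL into structured data. You only need to define the XPath of each element. Optionally, you can define functions to apply to the extracted value afterwards. You can write XPaths easily with the "Copy XPath" feature of the Firebug extension (I recommend editing the XPath that Firebug gives you to make it shorter).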
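The "functions to apply afterwards" idea can be sketched in plain Python. This is only an illustration of post-processing an extracted value; the helper name `apply_functions` is hypothetical, and how html2data itself names or accepts this option is not shown on this page.

```python
def apply_functions(value, functions=()):
    """Hypothetical helper: pass an extracted value through each function in order."""
    for function in functions:
        value = function(value)
    return value


# A raw text node as an XPath query might return it, with stray whitespace.
raw = '  Example Page \n'

# Strip the whitespace first, then upper-case the result.
cleaned = apply_functions(raw, [str.strip, str.upper])
print(cleaned)
```

Keeping the functions in an ordered list makes the clean-up steps explicit and easy to reorder.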

Example

Import

>>> from html2data import HTML2Data

Create instance

>>> html = """<!DOCTYPE html><html lang="en"><head>
        <meta charset="utf-8" />
        <title>Example Page</title>
        <link rel="stylesheet" href="css/main.css" type="text/css" />
                </head>
                <body>
                <h1><b>Title</b></h1>
                <div class="description">This is not a valid HTML
                </body>
        </html>"""
>>> h2d_instance = HTML2Data(html = html) #You can also create it from a url = url

Using XPath config

Once you have the object

>>> config = [
    {'name': 'header_title', 'xpath': '//head/title/text()'},
    {'name': 'body_title', 'xpath': '//h1/b/text()'},
    {'name': 'description', 'xpath': '//div[@class="description"]/text()'},
]

>>> h2d_instance.parse(config = config)
{'header_title': 'Example Page', 'body_title': 'Title', 'description': 'This is not a valid HTML'}
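To show the idea behind a name/path config without installing anything, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. It is not html2data's implementation: ElementTree supports only a subset of XPath (no `text()`) and needs well-formed markup, so the `div` is closed here and the paths are adjusted. The helper `parse_with_config` and the `path` key are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the example page (ElementTree cannot repair broken HTML).
PAGE = """<html lang="en"><head>
<title>Example Page</title>
</head>
<body>
<h1><b>Title</b></h1>
<div class="description">This is not a valid HTML</div>
</body>
</html>"""


def parse_with_config(root, config):
    """Hypothetical helper: map each entry's name to the text of its first match."""
    result = {}
    for entry in config:
        element = root.find(entry['path'])
        has_text = element is not None and element.text
        result[entry['name']] = element.text.strip() if has_text else None
    return result


# Paths use ElementTree's limited XPath syntax, relative to the <html> root.
config = [
    {'name': 'header_title', 'path': './head/title'},
    {'name': 'body_title', 'path': './/h1/b'},
    {'name': 'description', 'path': ".//div[@class='description']"},
]

data = parse_with_config(ET.fromstring(PAGE), config)
print(data)
```

An entry whose path matches nothing maps to `None` instead of raising, which keeps one missing element from aborting the whole parse.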

Using CSS selector config

>>> config = [
        {'name': 'header_title', 'css': 'head title'},
        {'name': 'body_title', 'css': 'h1 b '},
        {'name': 'description', 'css': 'div.description'},
    ]

>>> h2d_instance.parse(config = config)
{'header_title': 'Example Page', 'body_title': 'Title', 'description': 'This is not a valid HTML'}

A real-life example

import urllib2

from html2data import HTML2Data

response = urllib2.urlopen('http://sil.senado.cl/cgi-bin/sil_ultproy.pl')
html = response.read()

config = [
    {'name': 'fecha', 'css': 'td:nth-child(1)'},
    {'name': 'id', 'css': 'td:nth-child(2) a'},
    {'name': 'nombre', 'css': 'td:nth-child(3)'},
    {'name': 'estado', 'css': 'td:nth-child(4)'},
]

html_instance = HTML2Data(html = html)
rows = html_instance.parse_one(css = 'td td tr', multiple = True, text = False)
for row_element in rows:
    row_in_html = HTML2Data(tree = row_element)
    print row_in_html.parse(config = config)

You will get something like this:

{'nombre': 'Reforma Constitucional que restablece obligatoriedad del voto.', 'fecha': '24/11/2011', 'estado': 'En tramitación', 'id': '8062-07'}
..
{'nombre': 'Prohíbe el anatocismo.', 'fecha': '02/11/2011', 'estado': 'En tramitación', 'id': '8007-03'}
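The row-by-row pattern above (select each table row, then read its cells by position) can be sketched with the standard library alone. This is an illustration under assumptions, not html2data's code: the table markup below is a hypothetical, well-formed reconstruction built from the two output rows shown above, and `COLUMNS` and `parse_row` are made-up names that mirror the `td:nth-child(n)` selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed table holding the two rows shown in the output above.
TABLE = """<table>
<tr><td>24/11/2011</td><td><a>8062-07</a></td>
<td>Reforma Constitucional que restablece obligatoriedad del voto.</td>
<td>En tramitación</td></tr>
<tr><td>02/11/2011</td><td><a>8007-03</a></td>
<td>Prohíbe el anatocismo.</td>
<td>En tramitación</td></tr>
</table>"""

# (field name, 1-based cell position, optional child tag), like 'td:nth-child(2) a'.
COLUMNS = [
    ('fecha', 1, None),
    ('id', 2, 'a'),
    ('nombre', 3, None),
    ('estado', 4, None),
]


def parse_row(row):
    """Build one record from a <tr> element by reading its cells by position."""
    cells = row.findall('td')
    record = {}
    for name, position, child in COLUMNS:
        cell = cells[position - 1]
        node = cell.find(child) if child else cell
        record[name] = (node.text or '').strip()
    return record


rows = [parse_row(row) for row in ET.fromstring(TABLE).findall('tr')]
for record in rows:
    print(record)
```

Parsing each row as its own small tree keeps the column selectors simple, since they only need to be valid relative to one row.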

Requirements

lxml 2.0+
httplib2

Tests

Requirements

ludibrio
nose

Run

>> nosetests


html2data 0.4.3
