Converting HTML documents to JSON with Python

Detailed description of the tojson Python project


tojson

A Python package that converts HTML documents to JSON.

Converting an HTML document to JSON
>>> from tojson import HTML
>>> with open('sample.html', 'r') as src:
...    html = HTML(src.read(), text_skip=['html', 'head', 'body'])
...    html.tojson()  # returns the HTML document in JSON format
...
{
  "html": {
    "head": {
      "title": {
        "text": "test"
      }
    },
    "body": {
      "bgcolor": "FFFFFF",
      "img": {
        "src": "clouds.jpg",
        "align": "bottom"
      },
      "a": [
        {
          "href": "http://somegreatsite.com",
          "text": "Link Name"
        },
        {
          "href": "mailto:support@yourcompany.com",
          "text": "support@yourcompany.com"
        }
      ],
      "h1": {
        "text": "This is a Header"
      },
      "h2": {
        "text": "This is a Medium Header"
      },
      "p": [
        {
          "text": "first paragraph"
        },
        {
          "text": "second paragraph!"
        }
      ],
      "b": {
        "text": "This is a new sentence without a paragraph break",
        "i": {
          "text": "This is a new sentence without a paragraph break"
        }
      }
    }
  }
}
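
The output above suggests a simple mapping: each tag becomes a key, its attributes become string entries, its text goes under "text", and repeated sibling tags collapse into a list. The sketch below reproduces that mapping with the standard library's html.parser. It is not tojson's implementation; DictBuilder and html_to_json are hypothetical names, and the text_skip handling is only a guess at what the package's parameter does.

```python
import json
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DictBuilder(HTMLParser):
    """Hypothetical helper that builds a nested dict from HTML."""

    def __init__(self, text_skip=()):
        super().__init__()
        self.text_skip = set(text_skip)  # tags whose own text is ignored
        self.root = {}
        self.stack = [("", self.root)]  # (tag name, node dict) pairs

    def handle_starttag(self, tag, attrs):
        node = {name: value for name, value in attrs}
        parent = self.stack[-1][1]
        if tag not in parent:
            parent[tag] = node
        elif isinstance(parent[tag], list):
            parent[tag].append(node)
        else:
            # Second occurrence of the same tag: switch to a list.
            parent[tag] = [parent[tag], node]
        if tag not in VOID_TAGS:
            self.stack.append((tag, node))

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element.
        if len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        tag, node = self.stack[-1]
        text = data.strip()
        if text and tag and tag not in self.text_skip:
            node["text"] = (node.get("text", "") + " " + text).strip()


def html_to_json(source, text_skip=()):
    builder = DictBuilder(text_skip)
    builder.feed(source)
    builder.close()
    return json.dumps(builder.root, indent=2)
```

Calling html_to_json(src, text_skip=['html', 'head', 'body']) on a small document returns a JSON string shaped like the one shown above. The sketch does not handle malformed markup or an attribute that shares its name with a child tag.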

Iterating over tags and their values

Iterating over the HTML object yields (tag, value) tuples:

>>> from tojson import HTML
>>> with open('sample.html', 'r') as src:
...    html = HTML(src.read(), text_skip=['html', 'head', 'body'])
...
>>> for item in html:
...    item
...
('html', {'head': {'title': {'text': 'test'}}, 'body': {'bgcolor': 'FFFFFF', 'img': {'src': 'clouds.jpg', 'align': 'bottom'}, 'a': [{'href': 'http://somegreatsite.com', 'text': 'Link Name'}, {'href': 'mailto:support@yourcompany.com', 'text': 'support@yourcompany.com'}], 'h1': {'text': 'This is a Header'}, 'h2': {'text': 'This is a Medium Header'}, 'p': [{'text': 'first paragraph'}, {'text': 'second paragraph!'}], 'b': {'text': 'This is a new sentence without a paragraph break', 'i': {'text': 'This is a new sentence without a paragraph break'}}}})
('head', {'title': {'text': 'test'}})
('title', {'text': 'test'})
('body', {'bgcolor': 'FFFFFF', 'img': {'src': 'clouds.jpg', 'align': 'bottom'}, 'a': [{'href': 'http://somegreatsite.com', 'text': 'Link Name'}, {'href': 'mailto:support@yourcompany.com', 'text': 'support@yourcompany.com'}], 'h1': {'text': 'This is a Header'}, 'h2': {'text': 'This is a Medium Header'}, 'p': [{'text': 'first paragraph'}, {'text': 'second paragraph!'}], 'b': {'text': 'This is a new sentence without a paragraph break', 'i': {'text': 'This is a new sentence without a paragraph break'}}})
('img', {'src': 'clouds.jpg', 'align': 'bottom'})
('a', [{'href': 'http://somegreatsite.com', 'text': 'Link Name'}, {'href': 'mailto:support@yourcompany.com', 'text': 'support@yourcompany.com'}])
('h1', {'text': 'This is a Header'})
('h2', {'text': 'This is a Medium Header'})
('p', [{'text': 'first paragraph'}, {'text': 'second paragraph!'}])
('b', {'text': 'This is a new sentence without a paragraph break', 'i': {'text': 'This is a new sentence without a paragraph break'}})
('i', {'text': 'This is a new sentence without a paragraph break'})
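
The tuples above appear in depth-first order: a tag is reported first, then the tags nested inside it, while plain attribute strings such as bgcolor are skipped. A minimal sketch of the same traversal over an ordinary nested dict follows. iter_tags is a hypothetical function, and the ordering rule is inferred from the printed output rather than taken from tojson's source.

```python
def iter_tags(node):
    """Yield (tag, value) pairs from a nested dict in depth-first order."""
    for key, value in node.items():
        if isinstance(value, dict):
            # A single child tag: report it, then descend into it.
            yield key, value
            yield from iter_tags(value)
        elif isinstance(value, list):
            # Repeated sibling tags: report the whole list once,
            # then descend into each element.
            yield key, value
            for element in value:
                if isinstance(element, dict):
                    yield from iter_tags(element)
        # Strings are attributes or text, not tags, so they are skipped.


doc = {
    "html": {
        "head": {"title": {"text": "test"}},
        "body": {
            "bgcolor": "FFFFFF",
            "p": [{"text": "first paragraph"}, {"text": "second paragraph!"}],
            "b": {"text": "bold", "i": {"text": "italic"}},
        },
    }
}

tags = [tag for tag, _ in iter_tags(doc)]
```

Here tags lists the tag names in the order they are visited, with "bgcolor" left out because its value is a string.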
