Converting an HTML document to JSON in Python

Detailed description of the tojson Python project


tojson

A Python package that converts HTML documents to JSON

Convert an HTML document to JSON
>>> from tojson import HTML
>>> with open('sample.html', 'r') as src:
...    html = HTML(src.read(), text_skip=['html', 'head', 'body'])
...    print(html.tojson())  # print the JSON representation of the HTML
...
{
  "html": {
    "head": {
      "title": {
        "text": "test"
      }
    },
    "body": {
      "bgcolor": "FFFFFF",
      "img": {
        "src": "clouds.jpg",
        "align": "bottom"
      },
      "a": [
        {
          "href": "http://somegreatsite.com",
          "text": "Link Name"
        },
        {
          "href": "mailto:support@yourcompany.com",
          "text": "support@yourcompany.com"
        }
      ],
      "h1": {
        "text": "This is a Header"
      },
      "h2": {
        "text": "This is a Medium Header"
      },
      "p": [
        {
          "text": "first paragraph"
        },
        {
          "text": "second paragraph!"
        }
      ],
      "b": {
        "text": "This is a new sentence without a paragraph break",
        "i": {
          "text": "This is a new sentence without a paragraph break"
        }
      }
    }
  }
}
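
The output above can be consumed with the standard `json` module. The following is a minimal sketch, assuming `tojson()` returns a JSON string like the one shown; the `json_text` literal is an abridged copy of that output, and `as_list` is a hypothetical helper, not part of tojson. It is useful because, in the sample output, a tag that occurs once (`h1`) maps to an object while a repeated tag (`a`, `p`) maps to a list.

```python
import json

# Abridged copy of the JSON shown above. In practice this string would
# come from html.tojson() (assumed here to return a JSON string).
json_text = """
{
  "html": {
    "head": {"title": {"text": "test"}},
    "body": {
      "bgcolor": "FFFFFF",
      "a": [
        {"href": "http://somegreatsite.com", "text": "Link Name"},
        {"href": "mailto:support@yourcompany.com", "text": "support@yourcompany.com"}
      ],
      "h1": {"text": "This is a Header"}
    }
  }
}
"""


def as_list(value):
    """Hypothetical helper: wrap a single object in a list, leave lists as-is."""
    return value if isinstance(value, list) else [value]


doc = json.loads(json_text)
body = doc["html"]["body"]

title = doc["html"]["head"]["title"]["text"]
links = [a["href"] for a in as_list(body.get("a", []))]
headers = [h["text"] for h in as_list(body.get("h1", []))]

print(title)
print(links)
print(headers)
```

Normalizing with `as_list` lets the same code handle pages with one link or many, without checking the value type at every call site.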

Iterating over tags and their values

Iterating over an HTML object yields (tag, value) tuples:

>>> from tojson import HTML
>>> with open('sample.html', 'r') as src:
...    html = HTML(src.read(), text_skip=['html', 'head', 'body'])
...
>>> for item in html:
...    item
...
('html', {'head': {'title': {'text': 'test'}}, 'body': {'bgcolor': 'FFFFFF', 'img': {'src': 'clouds.jpg', 'align': 'bottom'}, 'a': [{'href': 'http://somegreatsite.com', 'text': 'Link Name'}, {'href': 'mailto:support@yourcompany.com', 'text': 'support@yourcompany.com'}], 'h1': {'text': 'This is a Header'}, 'h2': {'text': 'This is a Medium Header'}, 'p': [{'text': 'first paragraph'}, {'text': 'second paragraph!'}], 'b': {'text': 'This is a new sentence without a paragraph break', 'i': {'text': 'This is a new sentence without a paragraph break'}}}})
('head', {'title': {'text': 'test'}})
('title', {'text': 'test'})
('body', {'bgcolor': 'FFFFFF', 'img': {'src': 'clouds.jpg', 'align': 'bottom'}, 'a': [{'href': 'http://somegreatsite.com', 'text': 'Link Name'}, {'href': 'mailto:support@yourcompany.com', 'text': 'support@yourcompany.com'}], 'h1': {'text': 'This is a Header'}, 'h2': {'text': 'This is a Medium Header'}, 'p': [{'text': 'first paragraph'}, {'text': 'second paragraph!'}], 'b': {'text': 'This is a new sentence without a paragraph break', 'i': {'text': 'This is a new sentence without a paragraph break'}}})
('img', {'src': 'clouds.jpg', 'align': 'bottom'})
('a', [{'href': 'http://somegreatsite.com', 'text': 'Link Name'}, {'href': 'mailto:support@yourcompany.com', 'text': 'support@yourcompany.com'}])
('h1', {'text': 'This is a Header'})
('h2', {'text': 'This is a Medium Header'})
('p', [{'text': 'first paragraph'}, {'text': 'second paragraph!'}])
('b', {'text': 'This is a new sentence without a paragraph break', 'i': {'text': 'This is a new sentence without a paragraph break'}})
('i', {'text': 'This is a new sentence without a paragraph break'})
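
The iteration order above looks like a depth-first walk that yields every key whose value is an object or a list and skips plain string values such as `bgcolor` and `text`. The following is a sketch of such a walk over an ordinary nested dict. It is not the tojson implementation: `iter_tags` is a hypothetical function, and `doc` is an abridged copy of the sample data.

```python
def iter_tags(node):
    """Yield (tag, value) pairs depth-first for dict- or list-valued keys.

    String-valued keys (attributes and "text") are skipped.
    """
    for key, value in node.items():
        if isinstance(value, dict):
            yield key, value
            yield from iter_tags(value)
        elif isinstance(value, list):
            yield key, value
            # Descend into list items so nested tags inside repeated
            # elements are visited as well.
            for item in value:
                if isinstance(item, dict):
                    yield from iter_tags(item)


# Abridged copy of the structure shown in the output above.
doc = {
    "html": {
        "head": {"title": {"text": "test"}},
        "body": {
            "bgcolor": "FFFFFF",
            "img": {"src": "clouds.jpg", "align": "bottom"},
            "a": [
                {"href": "http://somegreatsite.com", "text": "Link Name"},
                {"href": "mailto:support@yourcompany.com",
                 "text": "support@yourcompany.com"},
            ],
            "h1": {"text": "This is a Header"},
            "b": {"text": "bold text", "i": {"text": "italic text"}},
        },
    }
}

tags = [tag for tag, _ in iter_tags(doc)]
print(tags)
```

For this abridged input the walk visits `html`, `head`, `title`, `body`, `img`, `a`, `h1`, `b`, `i`, which matches the order of the corresponding tuples in the output above.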
