A Python HTML/XML parser, handy for web scraping.
Detailed description of the pyDHTMLParser Python project
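What is it?
dhtmlparser is a lightweight HTML/XML parser created for one purpose: quick and easy selection of tags from the DOM.
It is very useful when you need to write your own "guerrilla" API for some web page, or a scraper.
If needed, it can also be used to create HTML/XML documents more easily than by concatenating strings.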
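To make "selecting tags from the DOM" concrete, here is a minimal sketch of that task written with only Python's standard library `html.parser`. It is not dhtmlparser's API. The names `TagCollector` and `find_text` are hypothetical and exist only for this illustration; see the module documentation for the library's own selector methods.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text content of every element with a given tag name.

    Hypothetical helper for illustration; not part of dhtmlparser.
    """

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.depth = 0        # > 0 while inside a matching element
        self.results = []     # text of each finished matching element
        self._buffer = []     # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            if self.depth == 0:
                self._buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer))

    def handle_data(self, data):
        # Keep text only while inside a matching element.
        if self.depth > 0:
            self._buffer.append(data)


def find_text(html, tag_name):
    """Return the text content of all <tag_name> elements in html."""
    collector = TagCollector(tag_name)
    collector.feed(html)
    collector.close()
    return collector.results


page = "<html><body><h1>Title</h1><p>First</p><p>Second <b>bold</b></p></body></html>"
print(find_text(page, "p"))
```

Even this small selector needs explicit state tracking, which is the kind of boilerplate a dedicated tag-selection library is meant to remove.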
Documentation
Full module documentation can be found here: http://pyDHTMLParser.rtfd.org
Changelog
2.2.2
- Attempt to fix strange recursive inheritance problem.
2.2.0
- Rewritten for compatibility with Python 3.
2.1.0-2.1.8
- State parser fixed - it can now recover from invalid html like ^{tt1}$.
- Rewritten to use ^{tt2}$ in parser for better readability.
- Garbage collector is now disabled during _raw_split().
- Fixed #16 - recovery after tags which don’t end with ^{tt3}$ (^{tt4}$ for example).
- Closed #17 - ^{tt5}$ is now ignored when it is used as a less-than sign.
- Restored support of multiline attributes.
- ^{tt6}$ no longer tries to parse HTML element parameters.
- Implemented ^{tt7}$ getter.
- License changed to MIT.
- Fixed #18: bug which in some cases caused invalid output.
- Added HTMLElement.__repr__().
- Added test_coverage.sh.
- Added extended test_equality() coverage.
- Formatting improvements.
- Improved constructor handling, which is now much more readable.
- Updated formatting of the setup.py.
- Added more tests.
- Fixed #22: bug in the SpecialDict.
- Fixed some nasty unicode problems.
- Fixed python 2 / 3 problem in docs/__init__.py.
- getVersion() -> get_version().
2.0.10
- Added more tests of removeTags().
- run_tests.sh now accepts arguments.
- Check for string in removeTags() changed to basestring from str.
2.0.6-2.0.9
- Fixed behaviour of toString() and tagToString().
- SpecialDict is now derived from OrderedDict.
- Changed and added tests of .params attribute (OrderedDict is now used).
- Fixed bug in _repair_tags().
- Removed _repair_tags() - it wasn’t really necessary.
- Fixed nasty bug which could cause invalid XML output.
2.0.1-2.0.5
- Fixed bugs in ^{tt8}$.
- Fixed broken links in documentation.
- Fixed bugs in ^{tt9}$.
- ^{tt10}$: Fixed bug which prevented tag_name from being None.
- Added operator ^{tt11}$ to the SpecialDict.
- Added new method ^{tt12}$ to ^{tt13}$.
2.0.0
- Rewritten, refactored, split into multiple files.
- Added unittest coverage of almost 100% of the code.
- Added better selector methods (^{tt14}$, ^{tt15}$).
- Added Sphinx documentation.
- Fixed a lot of bugs.