Template extraction in Python/PHP

3 votes

3 answers

2007 views

数据工程师

Asked 2025-04-15 18:35

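Are there any existing template extraction libraries for Python or PHP? Perl has Template::Extract, but I have not been able to find a similar implementation in either Python or PHP.

The only similar thing I have found in Python is TemplateMaker (http://code.google.com/p/templatemaker/), but that is not really a template extraction library.

Tags: text-processing, php, perl, libraries, template-extraction, TemplateMaker, dynamic-content-generation

3 answers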

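There is an interesting discussion by Adrian, the author of TemplateMaker, here: http://www.holovaty.com/writing/templatemaker/

It looks a lot like what I would call a "wrapper induction library".

If you are looking for something more flexible (and less specific to scraping web pages), take a look at lxml.html and BeautifulSoup, both of which are also for Python.

Answered 2025-04-15 by Python大师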


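TemplateMaker does look like it does what you need, at least according to its documentation. Instead of having you supply a template directly, it "learns" one from several documents. It then provides an extract method that pulls the data out of other documents created from the same template.

For example: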

# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')

# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b>  spacy  and <u>underlined</u></b>')
('  spacy ', '<u>underlined</u>')

# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...

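So, to accomplish what you need, I think you should:

1. Give it a few documents generated from your template; it will easily infer the template from them.
2. Use the inferred template to extract data from new documents.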
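
The two steps above can be sketched with nothing but the standard library. This is not TemplateMaker's code or algorithm: it is a rough illustration of the learn-then-extract idea that uses `difflib.SequenceMatcher` to find the text two samples share and a regular expression to pull out the parts that vary. The function names `learn_template` and `extract`, the `min_len` parameter, and the sample strings are all assumptions made for this sketch.

```python
import re
from difflib import SequenceMatcher


def learn_template(sample_a, sample_b, min_len=2):
    """Infer a template from two samples (hypothetical helper).

    Returns a list whose items are literal strings (text shared by both
    samples) or None (a "hole" where the samples differ).
    """
    matcher = SequenceMatcher(None, sample_a, sample_b, autojunk=False)
    parts = []
    pos_a = pos_b = 0
    for block in matcher.get_matching_blocks():
        # Very short shared runs are probably coincidences, so they are
        # left inside the surrounding hole instead of becoming literals.
        if 0 < block.size < min_len:
            continue
        # Text skipped in either sample since the last literal is variable.
        if block.a > pos_a or block.b > pos_b:
            parts.append(None)
        if block.size:
            parts.append(sample_a[block.a:block.a + block.size])
        pos_a, pos_b = block.a + block.size, block.b + block.size
    return parts


def extract(template, text):
    """Return the hole contents of text as a tuple, or raise ValueError."""
    pattern = "".join("(.*?)" if part is None else re.escape(part)
                      for part in template)
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("text does not match the learned template")
    return match.groups()


template = learn_template("<b>this and that</b>", "<b>alex and sue</b>")
print(template)
print(extract(template, "<b>red and green</b>"))
```

Like the extract() method shown above, this sketch is completely literal: it does not trim whitespace and knows nothing about HTML. Learning from only two samples is also fragile, since any text the two happen to share becomes part of the template.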

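Come to think of it, this is even more useful than Perl's Template::Extract, because it does not require you to supply a clean template; it can learn one by itself from sample text.

Answered 2025-04-15 by Python大师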
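
For contrast, extraction against an explicit, hand-written template can be sketched as follows. This is not Template::Extract's implementation: the `[% name %]` placeholder syntax is only loosely modelled on Template Toolkit style tags, and the function name `extract_with_template` is hypothetical. Only simple named placeholders are handled.

```python
import re

# Matches placeholders such as "[% first %]" (syntax assumed for this sketch).
PLACEHOLDER = re.compile(r"\[%\s*(\w+)\s*%\]")


def extract_with_template(template, text):
    """Return a dict of placeholder values, or None if text does not match.

    Each placeholder name may appear only once in the template, because it
    is turned into a named regex group.
    """
    pieces = []
    pos = 0
    for found in PLACEHOLDER.finditer(template):
        # Literal text before the placeholder must match exactly.
        pieces.append(re.escape(template[pos:found.start()]))
        # The placeholder itself becomes a lazy named capture group.
        pieces.append("(?P<%s>.*?)" % found.group(1))
        pos = found.end()
    pieces.append(re.escape(template[pos:]))
    match = re.fullmatch("".join(pieces), text, flags=re.DOTALL)
    return match.groupdict() if match else None


print(extract_with_template("<b>[% first %] and [% second %]</b>",
                            "<b>red and green</b>"))
```

The trade-off is the one described above: here someone has to write and maintain the template, whereas the learning approach only needs example documents.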


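After some more digging, I finally found the solution I was looking for. In the post "Options for HTML scraping?", filippo lists several Python solutions for screen scraping, one of which is a tool called scrapemark (http://arshaw.com/scrapemark/).

I hope this helps others who are looking for the same solution.

Answered 2025-04-15 by Python大师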

