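Extracting XML tag names, attributes, and values

User

Answer 1 · edited 2024-04-19 22:35:04

I'm not sure why you want this, but you should take a look at Python's lxml or BeautifulSoup.

Or, if you only want output in exactly the form you described above: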

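def parse_html(html_string):
    import re
    # Grab the inside of each opening or self-closing tag.
    fields = re.findall(r'(?<=\<)[\w=\s\"\']+?(?=\/?\>)', html_string)
    out = []
    for field in fields:
        # Greedy \w+ captures the whole tag name, not just its first letter.
        tag = re.match(r'(?P<tag>\w+)', field).group('tag')
        attrs = re.findall(r' (\w+?)\=[\"\'](.+?)[\"\']', field)
        if attrs:
            # One "tag,attribute,value" line per attribute.
            for x in attrs:
                out.append(','.join([tag] + list(x)))
        else:
            out.append(tag)

    print('\n'.join(out))

This is something of a hack, which is why you should normally use lxml or BeautifulSoup, but it gets this particular job done.

The program above outputs:

r
P
color,val,1F497D
t,val,123
t,val2,234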
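
To illustrate why a real parser is usually the safer choice, here is a minimal sketch comparing the tag-matching pattern from the function above with the standard library's xml.etree.ElementTree. The input snippet is hypothetical (it is not the XML from the question); it only differs from well-behaved input in that one attribute value contains a '.'.

```python
import re
import xml.etree.ElementTree as ET

# Same tag-matching pattern as in parse_html above.
FIELD_RE = re.compile(r'(?<=\<)[\w=\s\"\']+?(?=\/?\>)')

# Hypothetical input: the attribute value contains a '.', which the
# character class [\w=\s"'] in the pattern does not allow.
snippet = '<r><t val="1.5"/></r>'

# Regex approach: the <t> element never matches, so it is silently skipped.
regex_fields = FIELD_RE.findall(snippet)

# Parser approach: every element is visited, whatever its attribute values.
parsed = [(elem.tag, dict(elem.attrib)) for elem in ET.fromstring(snippet).iter()]

print(regex_fields)  # ['r']
print(parsed)        # [('r', {}), ('t', {'val': '1.5'})]
```

The same thing happens with hyphens, colons, or any other punctuation inside a tag, so the regex is only reasonable when the input is known to be as simple as the example in the question.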

User

Answer 2 · edited 2024-04-19 22:35:04

Install lxml, then (with s holding your XML string):

>>> from lxml import etree
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> parsed_xml = etree.XML(s, parser)
>>> for i in parsed_xml.iter('*'):
...     print(i.tag)
...     for x in i.items():
...         print('%s,%s' % (x[0], x[1]))
...
r
P
color
val,1F497D
t
val,123
val2,234

I'll leave formatting the output to you.
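
One possible way to do that formatting is sketched below, producing the "tag,attribute,value" layout from answer 1's output. It is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (lxml elements offer the same tag, items() and iter() members used here), and SAMPLE is a hypothetical stand-in for the question's XML, which is not shown in this thread. The name tag_attr_lines is also made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the question's original XML is not shown here.
SAMPLE = '<r><P><color val="1F497D"/></P><t val="123" val2="234">text</t></r>'


def tag_attr_lines(xml_string):
    """Return 'tag,attr,value' lines, or just 'tag' for elements without attributes."""
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():  # document order, root element included
        attrs = list(elem.items())
        if attrs:
            # One line per attribute, each prefixed with the element's tag.
            lines.extend(f"{elem.tag},{name},{value}" for name, value in attrs)
        else:
            lines.append(elem.tag)
    return lines


print("\n".join(tag_attr_lines(SAMPLE)))
```

Returning a list rather than printing inside the function keeps the formatting step separate, so the lines can be written to a file or joined with a different separator.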

User

Answer 3 · edited 2024-04-19 22:35:04

I think your best bet is to use BeautifulSoup.

For example (from their docs):

from bs4 import BeautifulSoup
# html_doc is the sample HTML string used in the Beautiful Soup docs.
soup = BeautifulSoup(html_doc, 'html.parser')
soup.title
# <title>The Dormouse's story</title>
soup.p['class']
# ['title']
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

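You could also look at lxml, which is simple and efficient, and which BeautifulSoup can use as its underlying parser. Specifically, you might want to look at this page.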
