Parsing an XML file with regex in Python

<?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE dblp SYSTEM "dblp.dtd"> <dblp> <article mdate="2011-01-11" key="journals/acta/Saxena96"> <author>Sanjeev Saxena</author> <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title> <pages>607-619</pages> <year>1996</year> <volume>33</volume> <journal>Acta Inf.</journal> <number>7</number> <url>db/journals/acta/acta33.html#Saxena96</url> <ee>http://dx.doi.org/10.1007/BF03036466</ee> </article> <article mdate="2011-01-11" key="journals/acta/Simon83"> <author>Hans-Ulrich Simon</author> <title>Pattern Matching in Trees and Nets.</title> <pages>227-248</pages> <year>1983</year> <volume>20</volume> <journal>Acta Inf.</journal> <url>db/journals/acta/acta20.html#Simon83</url> <ee>http://dx.doi.org/10.1007/BF01257084</ee> </article>

2 Answers

User

#1 · Edited 2024-05-12 19:14:34

Use an XML parser instead.

A working example using lxml:

import lxml.etree as ET

data = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
        <article mdate="2011-01-11" key="journals/acta/Saxena96">
                <author>Sanjeev Saxena</author>
                <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
                <pages>607-619</pages>
                <year>1996</year>
                <volume>33</volume>
                <journal>Acta Inf.</journal>
                <number>7</number>
                <url>db/journals/acta/acta33.html#Saxena96</url>
                <ee>http://dx.doi.org/10.1007/BF03036466</ee>
                </article>
                <article mdate="2011-01-11" key="journals/acta/Simon83">
                <author>Hans-Ulrich Simon</author>
                <title>Pattern Matching in Trees and Nets.</title>
                <pages>227-248</pages>
                <year>1983</year>
                <volume>20</volume>
                <journal>Acta Inf.</journal>
                <url>db/journals/acta/acta20.html#Simon83</url>
                <ee>http://dx.doi.org/10.1007/BF01257084</ee>
        </article>
</dblp>
"""

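# On Python 3, lxml rejects str input that carries an encoding declaration,
# so pass bytes in the declared encoding instead.
root = ET.fromstring(data.encode("ISO-8859-1"))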

title = 'Parallel'
article = root.xpath('.//article[starts-with(title, "%s")]' % title)[0]

for prop in ['author', 'pages', 'year', 'volume', 'journal']:
    print(article.findtext(prop))

Prints:

Sanjeev Saxena
607-619
1996
33
Acta Inf.
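If installing lxml is not an option, roughly the same lookup can be sketched with the standard library's xml.etree.ElementTree. ElementTree's XPath subset has no starts-with(), so the title prefix check is done in plain Python here. The helper names find_article and article_fields are made up for this illustration, and the sample data is a trimmed copy of the question's XML with the DOCTYPE line left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's data (bytes, so the encoding declaration is honored).
data = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
<author>Hans-Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
</article>
</dblp>
"""


def find_article(root, title_prefix):
    """Return the first <article> whose <title> starts with title_prefix, or None."""
    for article in root.iter("article"):
        title = article.findtext("title", default="")
        if title.startswith(title_prefix):
            return article
    return None


def article_fields(article, props=("author", "pages", "year", "volume", "journal")):
    """Collect the text of the requested child elements into a dict."""
    return {prop: article.findtext(prop) for prop in props}


root = ET.fromstring(data)
article = find_article(root, "Parallel")
fields = article_fields(article)

for prop, value in fields.items():
    print(prop, value)
```

Returning None when no title matches avoids the IndexError that indexing an empty result list with [0] would raise.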

User

#2 · Edited 2024-05-12 19:14:34

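The best way to parse XML or HTML documents is to use a proper XML/HTML parser, such as BeautifulSoup or a comparable parsing library, but as an alternative you can use the following pattern: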

>>> s="""<?xml version="1.0" encoding="ISO-8859-1"?>
... <!DOCTYPE dblp SYSTEM "dblp.dtd">
... <dblp>
... <article mdate="2011-01-11" key="journals/acta/Saxena96">
... <author>Sanjeev Saxena</author>
... <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
... <pages>607-619</pages>
... <year>1996</year>
... <volume>33</volume>
... <journal>Acta Inf.</journal>
... <number>7</number>
... <url>db/journals/acta/acta33.html#Saxena96</url>
... <ee>http://dx.doi.org/10.1007/BF03036466</ee>
... </article>
... <article mdate="2011-01-11" key="journals/acta/Simon83">
... <author>Hans-Ulrich Simon</author>
... <title>Pattern Matching in Trees and Nets.</title>
... <pages>227-248</pages>
... <year>1983</year>
... <volume>20</volume>
... <journal>Acta Inf.</journal>
... <url>db/journals/acta/acta20.html#Simon83</url>
... <ee>http://dx.doi.org/10.1007/BF01257084</ee>
... </article>"""
>>> import re
>>> l=['author','pages','year','volume','journal']
>>> pat=r'|'.join(('<{}>(.*)</{}>'.format(i,i) for i in l))
>>> [j  for i in re.findall(pat,s) for j in i if j]
['Sanjeev Saxena', '607-619', '1996', '33', 'Acta Inf.', 'Hans-Ulrich Simon', '227-248', '1983', '20', 'Acta Inf.']
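The pattern above returns one flat list and relies on every element sitting on its own line, because the greedy (.*) stops only at a newline. As a rough sketch of a variant that keeps the values grouped per article, the following first isolates each <article> block with a non-greedy match and then looks up each tag inside it. The function name extract_records and the trimmed sample string are assumptions made for this illustration, and this is still a regex workaround rather than real XML parsing.

```python
import re

# Trimmed sample in the same shape as the input above.
s = """<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
<author>Hans-Ulrich Simon</author>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
</article>
</dblp>"""

fields = ['author', 'pages', 'year', 'volume', 'journal']

# Non-greedy match with DOTALL so each <article>...</article> block is captured on its own.
article_pat = re.compile(r'<article\b[^>]*>(.*?)</article>', re.DOTALL)


def extract_records(text, names):
    """Return one dict per <article>, mapping each tag name to its text (or None)."""
    records = []
    for body in article_pat.findall(text):
        record = {}
        for name in names:
            m = re.search(r'<{0}>(.*?)</{0}>'.format(re.escape(name)), body, re.DOTALL)
            record[name] = m.group(1) if m else None
        records.append(record)
    return records


records = extract_records(s, fields)
for record in records:
    print(record)
```

A missing tag shows up as None in its record instead of silently shifting the remaining values, which the flat list cannot express.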

If you want to take the words from user input, the following extra command is needed:

^{pr2}$
