在python中使用Regex解析xml文件

2024-05-12 19:14:34 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个xml文件作为文本文件以下内容:-在

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
<author>Hans-Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>http://dx.doi.org/10.1007/BF01257084</ee>
</article>

如果我输入'Parallel',那么我应该得到,整个标题名,后跟'author'、'pages'、'year'、'volume'、'journal'

作为样本输出:

^{pr2}$

如何使用regex执行上述操作?请帮忙!在

提前谢谢!在


Tags: keyurltitlearticlexmlpagesyearee
2条回答

改用XML解析器。在

使用^{}的工作示例:

import lxml.etree as ET

data = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
        <article mdate="2011-01-11" key="journals/acta/Saxena96">
                <author>Sanjeev Saxena</author>
                <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
                <pages>607-619</pages>
                <year>1996</year>
                <volume>33</volume>
                <journal>Acta Inf.</journal>
                <number>7</number>
                <url>db/journals/acta/acta33.html#Saxena96</url>
                <ee>http://dx.doi.org/10.1007/BF03036466</ee>
                </article>
                <article mdate="2011-01-11" key="journals/acta/Simon83">
                <author>Hans-Ulrich Simon</author>
                <title>Pattern Matching in Trees and Nets.</title>
                <pages>227-248</pages>
                <year>1983</year>
                <volume>20</volume>
                <journal>Acta Inf.</journal>
                <url>db/journals/acta/acta20.html#Simon83</url>
                <ee>http://dx.doi.org/10.1007/BF01257084</ee>
        </article>
</dblp>
"""

root = ET.fromstring(data)

title = 'Parallel'
article = root.xpath('.//article[starts-with(title, "%s")]' % title)[0]

for prop in ['author', 'pages', 'year', 'volume', 'journal']:
    print article.findtext(prop)

印刷品:

^{pr2}$

解析xmlhtml文档的最佳方法是使用适当的html解析器,例如beautifulsoup或{}模块,但是作为替代,您可以使用以下模式:

>>> s="""<?xml version="1.0" encoding="ISO-8859-1"?>
... <!DOCTYPE dblp SYSTEM "dblp.dtd">
... <dblp>
... <article mdate="2011-01-11" key="journals/acta/Saxena96">
... <author>Sanjeev Saxena</author>
... <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
... <pages>607-619</pages>
... <year>1996</year>
... <volume>33</volume>
... <journal>Acta Inf.</journal>
... <number>7</number>
... <url>db/journals/acta/acta33.html#Saxena96</url>
... <ee>http://dx.doi.org/10.1007/BF03036466</ee>
... </article>
... <article mdate="2011-01-11" key="journals/acta/Simon83">
... <author>Hans-Ulrich Simon</author>
... <title>Pattern Matching in Trees and Nets.</title>
... <pages>227-248</pages>
... <year>1983</year>
... <volume>20</volume>
... <journal>Acta Inf.</journal>
... <url>db/journals/acta/acta20.html#Simon83</url>
... <ee>http://dx.doi.org/10.1007/BF01257084</ee>
... </article>"""
>>> import re
>>> l=['author','pages','year','volume','journal']
>>> pat=r'|'.join(('<{}>(.*)</{}>'.format(i,i) for i in l))
>>> [j  for i in re.findall(pat,s) for j in i if j]
['Sanjeev Saxena', '607-619', '1996', '33', 'Acta Inf.', 'Hans-Ulrich Simon', '227-248', '1983', '20', 'Acta Inf.']

如果要从输入中获取单词,则需要以下额外命令:

^{pr2}$

相关问题 更多 >