Parsing PubMed Central XML with Biopython

0 votes

1 answer

2051 views

Asked 2025-04-18 15:37

I am trying to parse a PubMed Central XML file with the Bio.Entrez parser in Biopython. Here is what I have tried so far:

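import glob

from Bio import Entrez


def read_xml(handle, outh):
    # Entrez.parse yields one record at a time from the open handle
    records = Entrez.parse(handle)
    for record in records:
        print(record)


for xmlfile in glob.glob('samplepmcxml.xml'):
    print(xmlfile)
    fh = open(xmlfile, "r")
    # outfp is not defined in this excerpt; presumably an output handle from the full script
    read_xml(fh, outfp)
    fh.close()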

I get the following error:

Traceback (most recent call last):
  File "3parse_info_from_pmc_nxml.py", line 78, in <module>
    read_xml (fh, outfp)
  File "3parse_info_from_pmc_nxml.py", line 10, in read_xml
    for record in records:
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 137, in parse
    self.parser.Parse(text, False)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 165, in startNamespaceDeclHandler
    raise NotImplementedError("The Bio.Entrez parser cannot handle XML data that make use of XML namespaces")
NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces

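I have already downloaded the archivearticle.dtd file. Are there other DTD files I need to install? Would those files describe the structure of PMC files? Has anyone successfully parsed PMC articles with Bio.Entrez or by some other method?

Thanks for your help!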

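1 Answer

You can use a different XML parser, such as minidom from the standard library, which does not reject XML that uses namespaces.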

from xml.dom import minidom

data = minidom.parse("pmc_full.xml")

From there, walk the XML tree to pull out whatever data you need. For example, to print each article title:

for title in data.getElementsByTagName("article-title"):
    for node in title.childNodes:
        # Only direct text children are printed; nested elements are skipped
        if node.nodeType == node.TEXT_NODE:
            print(node.data)
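
One caveat with the loop above: it only looks at text nodes that are direct children of each article-title element, so text wrapped in nested markup (an italic gene name, for instance) would be left out. Here is a rough sketch of one way to collect the full title text with minidom. The helper names node_text and article_titles and the inline SAMPLE document are made up for illustration; a real PMC file would be loaded with minidom.parse as shown earlier.

```python
from xml.dom import minidom

# Hypothetical inline sample standing in for a PMC article file.
# The xmlns:xlink declaration is there to show that minidom accepts namespaces.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Role of <italic>BRCA1</italic> in DNA repair</article-title>
      </title-group>
    </article-meta>
  </front>
</article>"""


def node_text(node):
    """Recursively collect the text of a node and all of its descendants."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def article_titles(dom):
    """Return the whitespace-normalized text of every article-title element."""
    return [
        " ".join(node_text(title).split())
        for title in dom.getElementsByTagName("article-title")
    ]


dom = minidom.parseString(SAMPLE)
titles = article_titles(dom)
print(titles)
```

Joining on split() collapses the line breaks and indentation that often appear inside pretty-printed XML, which may or may not be what you want for your data.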

Answered 2025-04-18 by Python大师
