Getting author affiliations from PubMed with Python
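I'm writing a Python script (modified from here) that counts the papers from a given university on PubMed and downloads the co-authors' affiliations. When I run the code, though, instead of the affiliation I get <Element 'Affiliation' at 0x106ea7e50>. Do you know how to fix this? And what should I do to download the affiliations of all the authors? Thanks!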
import urllib, urllib2, sys
import xml.etree.ElementTree as ET

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))

query = '(("University of Copenhagen"[Affiliation]))# AND ("1920"[Publication Date] : "1930"[Publication Date]))'
esearch = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&mindate=2001&maxdate=2010&retmode=xml&retmax=10000000&term=%s' % (query)
handle = urllib.urlopen(esearch)
data = handle.read()
root = ET.fromstring(data)
ids = [x.text for x in root.findall("IdList/Id")]
print 'Got %d articles' % (len(ids))

for group in chunker(ids, 100):
    efetch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?&db=pubmed&retmode=xml&id=%s" % (','.join(group))
    handle = urllib.urlopen(efetch)
    data = handle.read()
    root = ET.fromstring(data)
    for article in root.findall("PubmedArticle"):
        pmid = article.find("MedlineCitation/PMID").text
        year = article.find("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
        if year is None: year = 'NA'
        else: year = year.text
        aulist = article.findall("MedlineCitation/Article/AuthorList/Author")
        affiliation = article.find("MedlineCitation/Article/AuthorList/Author/Affiliation")
        print pmid, year, len(aulist), affiliation
2 Answers
1
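This answer updates the code for Python 3 and corrects where the author affiliation sits in the XML (I found it under MedlineCitation/Article/AuthorList/Author/AffiliationInfo rather than MedlineCitation/Article/AuthorList/Author/Affiliation; perhaps the location changed over time?). In this example we fetch the author affiliations for a single paper, https://pubmed.ncbi.nlm.nih.gov/31888621/, by its PMID (31888621):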
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def chunker(seq, size):
    # Split seq into consecutive chunks of at most `size` items (not used in this single-PMID example).
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

efetch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?&db=pubmed&retmode=xml&id=%s" % ('31888621')
handle = urlopen(efetch)
data = handle.read()
root = ET.fromstring(data)
for article in root.findall("PubmedArticle"):
    pmid = article.find("MedlineCitation/PMID").text
    year = article.find("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
    if year is None: year = 'NA'
    else: year = year.text
    aulist = article.findall("MedlineCitation/Article/AuthorList/Author")
    # find() returns only the first matching AffiliationInfo element (or None).
    affiliation = article.find("MedlineCitation/Article/AuthorList/Author/AffiliationInfo")
    #print(pmid, year, len(aulist), affiliation, aulist, ET.dump(root))
    for author in aulist:
        # ET.dump() writes the element to stdout and returns None,
        # so print() adds a "None" line after each author.
        print(ET.dump(author))
Output:
<Author ValidYN="Y">
    <LastName>Tang</LastName>
    <ForeName>Lingkai</ForeName>
    <Initials>L</Initials>
    <AffiliationInfo>
        <Affiliation>Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada.</Affiliation>
    </AffiliationInfo>
</Author>
None
<Author ValidYN="Y">
    <LastName>Mostafa</LastName>
    <ForeName>Sakib</ForeName>
    <Initials>S</Initials>
    <AffiliationInfo>
        <Affiliation>Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada.</Affiliation>
    </AffiliationInfo>
</Author>
None
<Author ValidYN="Y">
    <LastName>Liao</LastName>
    <ForeName>Bo</ForeName>
    <Initials>B</Initials>
    <AffiliationInfo>
        <Affiliation>School of Mathematics and Statistics, Hainan Normal University, Haikou, 571158, China.</Affiliation>
    </AffiliationInfo>
</Author>
None
<Author ValidYN="Y">
    <LastName>Wu</LastName>
    <ForeName>Fang-Xiang</ForeName>
    <Initials>FX</Initials>
    <Identifier Source="ORCID">0000-0002-4593-9332</Identifier>
    <AffiliationInfo>
        <Affiliation>Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada. faw341@mail.usask.ca.</Affiliation>
    </AffiliationInfo>
    <AffiliationInfo>
        <Affiliation>Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada. faw341@mail.usask.ca.</Affiliation>
    </AffiliationInfo>
</Author>
None
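To go from the dumped XML to plain strings, here is a minimal sketch (not part of the original answer) that collects the text of every Affiliation element for each author. It assumes the newer Author/AffiliationInfo/Affiliation layout and falls back to the older Author/Affiliation path used in the question. The function name author_affiliations and the SAMPLE string are hypothetical; SAMPLE is a trimmed stand-in shaped like the output above, so the code runs without a network call.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative sample shaped like the efetch output above (not a full record).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>31888621</PMID>
      <Article>
        <AuthorList>
          <Author ValidYN="Y">
            <LastName>Tang</LastName>
            <ForeName>Lingkai</ForeName>
            <AffiliationInfo>
              <Affiliation>Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada.</Affiliation>
            </AffiliationInfo>
          </Author>
          <Author ValidYN="Y">
            <LastName>Wu</LastName>
            <ForeName>Fang-Xiang</ForeName>
            <AffiliationInfo>
              <Affiliation>Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada. faw341@mail.usask.ca.</Affiliation>
            </AffiliationInfo>
            <AffiliationInfo>
              <Affiliation>Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada. faw341@mail.usask.ca.</Affiliation>
            </AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def author_affiliations(article):
    """Return [(author name, [affiliation strings]), ...] for one PubmedArticle element."""
    result = []
    for author in article.findall("MedlineCitation/Article/AuthorList/Author"):
        # findtext() returns None for a missing child; filter(None, ...) drops those.
        name = " ".join(filter(None, (author.findtext("ForeName"),
                                      author.findtext("LastName"))))
        # Try the newer layout first, then the older one (assumption: only these two occur).
        nodes = (author.findall("AffiliationInfo/Affiliation")
                 or author.findall("Affiliation"))
        result.append((name, [n.text.strip() for n in nodes if n.text]))
    return result


root = ET.fromstring(SAMPLE)
for article in root.findall("PubmedArticle"):
    pmid = article.findtext("MedlineCitation/PMID")
    for name, affiliations in author_affiliations(article):
        print(pmid, name, affiliations)
```

Using findall() rather than find() matters here: an author such as Wu has more than one AffiliationInfo block, and find() would return only the first. With real data, replace SAMPLE with the bytes read from the efetch response.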
2
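This happens because the affiliation object refers to an XML element, not a piece of text. If the string you want is the element's value, like this:
<affiliation>
your_affiliation_text
</affiliation>
then you should print affiliation.text.
If the string you want is in an attribute, like this:
<affiliation your_attribute_name="your_affiliation">
then you should use affiliation.attrib['your_attribute_name'].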
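As a small illustration of the difference, the sketch below parses a made-up element (the attribute name and values are placeholders, not real PubMed fields) and prints the element itself, its text, and its attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical element, used only to show the three access patterns.
elem = ET.fromstring(
    '<Affiliation your_attribute_name="attribute value">  element text  </Affiliation>'
)

print(elem)                                  # the Element object itself, as in the question
print(elem.text.strip())                     # the text between the opening and closing tags
print(elem.attrib["your_attribute_name"])    # an attribute value; raises KeyError if absent
print(elem.get("missing_attribute", "NA"))   # get() returns the default when the attribute is absent
```

Note that find() returns None when no element matches, so it is worth checking for None before reading .text, as the question's code already does for the year.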