How do I get the author list and citation from a PubMed ID with Python?
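I have a list of PubMed IDs and want to extract a citation with the complete author list for each one. Some online tools can do this, such as http://mickschroeder.com/citation/, but they abbreviate the author list to "et al."
I am trying to do this with the Entrez package from Biopython, using xml.etree.ElementTree to parse the XML.
Here is my code so far: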
from Bio.Entrez import efetch
import xml.etree.ElementTree as ET

def fetch_abstract(pmid):
    handle = efetch(db='pubmed', id=pmid, retmode='xml')
    xml_data = handle.read()
    print(xml_data)  # this prints the XML data structure correctly
    article = ET.XML(xml_data)
    # The problem starts here. I want to create a citation, so I start by trying to
    # get the names of the authors, but I am not sure why this is not working.
    for author_name in article.findall('AuthorValidYN'):
        print(author_name)
    return

fetch_abstract(22864638)
The XML looks like this:
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2014//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_140101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">22864638</PMID>
<DateCreated>
<Year>2012</Year>
<Month>10</Month>
<Day>31</Day>
</DateCreated>
<DateCompleted>
<Year>2013</Year>
<Month>04</Month>
<Day>23</Day>
</DateCompleted>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Electronic">1573-7292</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>11</Volume>
<Issue>4</Issue>
<PubDate>
<Year>2012</Year>
<Month>Dec</Month>
</PubDate>
</JournalIssue>
<Title>Familial cancer</Title>
<ISOAbbreviation>Fam. Cancer</ISOAbbreviation>
</Journal>
<ArticleTitle>No evidence for breast cancer susceptibility associated with variants of BRD7, a component of p53 and BRCA1 pathways.</ArticleTitle>
<Pagination>
<MedlinePgn>601-6</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1007/s10689-012-9556-0</ELocationID>
<Abstract>
<AbstractText>BRD7 (bromodomain 7), a subunit of poly-bromo-associated BRG1-associated factor (PBAF)-specific Swi/Snf chromatin remodeling complexes, has been proposed as a tumour suppressor protein following its identification as an important component of both functional p53 and BRCA1 (breast cancer 1, early onset) pathways. As low BRD7 expression levels have been linked to p53-wild-type breast tumour cells, we hypothesized an implication of BRD7 germline alterations in the pathogenesis of hereditary breast cancer similar to that of TP53 in Li-Fraumeni syndrome. We performed sequence analysis of the BRD7 gene on 61 high-risk individuals with hereditary or very-early-onset breast cancer and 100 healthy controls. Four potentially disease-causing single-nucleotide alterations were detected within the cohort of breast cancer patients (one listed as a rare single-nucleotide polymorphism (SNP) in the NCBI (National Center for Biotechnology Information) SNP database). Two of the detected variants were also each found once within the control collective. Segregation analysis on both families of those carrying the remaining two variants revealed segregation of these BRD7 alterations independent of breast cancer. In conclusion, it seems that the BRD7 variants we detected represent rare polymorphisms and mainly rule out BRD7 as a frequent high-penetrance breast cancer susceptibility gene. However, further analyses in larger cohorts of women with hereditary breast cancer should clarify the role of BRD7 in breast cancer predisposition.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Penkert</LastName>
<ForeName>Judith</ForeName>
<Initials>J</Initials>
<Affiliation>Institute of Cell and Molecular Pathology, Hannover Medical School, Carl-Neuberg-Strasse 1, Hannover, Germany.</Affiliation>
</Author>
<Author ValidYN="Y">
<LastName>Schlegelberger</LastName>
<ForeName>Brigitte</ForeName>
<Initials>B</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Steinemann</LastName>
<ForeName>Doris</ForeName>
<Initials>D</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Gadzicki</LastName>
<ForeName>Dorothea</ForeName>
<Initials>D</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType>Comparative Study</PublicationType>
<PublicationType>Journal Article</PublicationType>
<PublicationType>Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>Netherlands</Country>
<MedlineTA>Fam Cancer</MedlineTA>
<NlmUniqueID>100898211</NlmUniqueID>
<ISSNLinking>1389-9600</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>BRCA1 Protein</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>BRCA1 protein, human</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>BRD7 protein, human</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Chromosomal Proteins, Non-Histone</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>TP53 protein, human</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Tumor Suppressor Protein p53</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Adult</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Aged</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">BRCA1 Protein</DescriptorName>
<QualifierName MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Breast Neoplasms</DescriptorName>
<QualifierName MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Case-Control Studies</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Chromosomal Proteins, Non-Histone</DescriptorName>
<QualifierName MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Female</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y">Genetic Predisposition to Disease</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Male</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Middle Aged</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Mutation</DescriptorName>
<QualifierName MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Pedigree</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Polymorphism, Single Nucleotide</DescriptorName>
<QualifierName MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Prognosis</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Tumor Suppressor Protein p53</DescriptorName>
<QualifierName MajorTopicYN="Y">genetics</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Young Adult</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="entrez">
<Year>2012</Year>
<Month>8</Month>
<Day>7</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2012</Year>
<Month>8</Month>
<Day>7</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2013</Year>
<Month>4</Month>
<Day>24</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="doi">10.1007/s10689-012-9556-0</ArticleId>
<ArticleId IdType="pubmed">22864638</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
3 Answers
0
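You need to give the full path to the elements you want. Also, findall returns a list of elements, so you have to process that list (for example with a list comprehension) to get a list of text values; repeat this for the first name, initials, and so on.
Author_LNs = article.findall('PubmedArticle/Article/AuthorList/Author/LastName')
Author_Last_Names = [x.text for x in Author_LNs]
If you only want elements with a particular attribute, put the condition in square brackets, as in XPath. ValidYN is an attribute of Author, so the condition goes on that step of the path:
article.findall('PubmedArticle/Article/AuthorList/Author[@ValidYN="Y"]/LastName')
The default PubMed article structure has changed since then: Article is now under MedlineCitation, so the authors are at:
article.findall('PubmedArticle/MedlineCitation/Article/AuthorList/Author/LastName')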
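As a minimal sketch of the full-path approach with the MedlineCitation layout, the snippet below parses a trimmed copy of the record instead of calling efetch, so it runs without network access. SAMPLE_XML, AUTHOR_PATH and author_names are illustrative names chosen here, not part of Biopython or ElementTree.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the efetch output shown in the question (assumption:
# only the elements needed for the author list are kept).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <AuthorList CompleteYN="Y">
          <Author ValidYN="Y">
            <LastName>Penkert</LastName>
            <ForeName>Judith</ForeName>
            <Initials>J</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Schlegelberger</LastName>
            <ForeName>Brigitte</ForeName>
            <Initials>B</Initials>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Path is relative to the root element, <PubmedArticleSet>.
AUTHOR_PATH = 'PubmedArticle/MedlineCitation/Article/AuthorList/Author[@ValidYN="Y"]'


def author_names(xml_text):
    """Return 'LastName Initials' strings for every author marked valid."""
    root = ET.fromstring(xml_text)
    names = []
    for author in root.findall(AUTHOR_PATH):
        # findtext returns the default instead of failing on a missing child.
        last = author.findtext('LastName', default='')
        initials = author.findtext('Initials', default='')
        names.append((last + ' ' + initials).strip())
    return names


print(author_names(SAMPLE_XML))  # ['Penkert J', 'Schlegelberger B']
```

Looping over the Author elements, rather than collecting LastName and Initials in two separate findall calls, keeps each author's fields together even when one of them is missing.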
1
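I think you are looking for the wrong XML node. ValidYN is actually an attribute of the Author node, so you should use:
for author_name in article.findall('Author'):
    print(author_name)
However, "Element.findall() finds only elements with a tag which are direct children of the current element." I think you need to make AuthorList the current element first. find() with a bare tag name is limited to direct children too, so locate AuthorList with a './/' path and then call findall on it:
article.find('.//AuthorList').findall('Author')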
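The difference between a direct-child lookup and a descendant lookup can be checked on a small document. This is only a sketch on a hypothetical, heavily trimmed record; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record used only for this demonstration.
root = ET.fromstring(
    "<PubmedArticleSet><PubmedArticle><MedlineCitation><Article>"
    "<AuthorList>"
    "<Author ValidYN='Y'><LastName>Steinemann</LastName></Author>"
    "<Author ValidYN='Y'><LastName>Gadzicki</LastName></Author>"
    "</AuthorList>"
    "</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"
)

# A bare tag name only matches direct children of the root, so this is empty.
direct = root.findall('Author')

# './/' searches all descendants; iter() walks the whole subtree as well.
nested = root.findall('.//Author')
walked = list(root.iter('Author'))

# Attributes such as ValidYN are read with get(), not matched as tag names.
flags = [author.get('ValidYN') for author in nested]

print(len(direct), len(nested), len(walked), flags)  # 0 2 2 ['Y', 'Y']
```

This also shows why the original 'AuthorValidYN' lookup found nothing: no element has that tag, and ValidYN is only reachable as an attribute of Author.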
2
Here is the code I used to do the same thing with BeautifulSoup.
from bs4 import BeautifulSoup

# html.parser lowercases tag names, hence the lowercase lookups below
soup = BeautifulSoup(xml_data, "html.parser")
a_recs = []
for tag in soup.find_all("pubmedarticle"):  # I'm working with multiple articles in one file
    for a_tag in tag.find_all("author"):
        a_rec = {}
        a_rec['pmid'] = int(tag.pmid.text)
        a_rec['lastname'] = a_tag.lastname.text
        a_rec['forename'] = a_tag.forename.text
        a_rec['suffix'] = a_tag.suffix.text
        a_rec['initials'] = a_tag.initials.text
        a_rec['affiliation'] = a_tag.affiliation.text
        a_recs.append(a_rec)
Parts of an author's name are often missing, and accessing the text attribute of each element directly will then raise an error. So check whether the element exists before accessing its text (I wrote a simple function that returns None by default if the tag has no text attribute).
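The helper mentioned above is not shown, so the following is only a guess at its shape, written against xml.etree.ElementTree rather than BeautifulSoup so that it needs no third-party package. text_or_none and format_citation are hypothetical names, the sample record is invented, and the citation layout is an arbitrary choice rather than any official style.

```python
import xml.etree.ElementTree as ET

# Invented record for illustration; the second author has no Initials.
SAMPLE_XML = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <Journal>
        <JournalIssue>
          <PubDate><Year>2012</Year></PubDate>
        </JournalIssue>
        <ISOAbbreviation>Ex. J.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>An example title.</ArticleTitle>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y">
          <LastName>Doe</LastName>
          <Initials>J</Initials>
        </Author>
        <Author ValidYN="Y">
          <LastName>Roe</LastName>
        </Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def text_or_none(parent, path):
    """Return the stripped text of the first element matching path, else None."""
    child = parent.find(path)
    if child is None or child.text is None:
        return None
    return child.text.strip() or None


def format_citation(article_xml):
    """Build a one-line citation that lists every author (no 'et al.')."""
    root = ET.fromstring(article_xml)
    article = root.find('MedlineCitation/Article')
    authors = []
    for author in article.findall('AuthorList/Author'):
        parts = [text_or_none(author, 'LastName'), text_or_none(author, 'Initials')]
        name = ' '.join(p for p in parts if p)  # skip whichever parts are missing
        if name:
            authors.append(name)
    title = text_or_none(article, 'ArticleTitle') or ''
    journal = text_or_none(article, 'Journal/ISOAbbreviation') or ''
    year = text_or_none(article, 'Journal/JournalIssue/PubDate/Year') or ''
    return '{}. {} {} ({}).'.format(', '.join(authors), title, journal, year)


print(format_citation(SAMPLE_XML))  # Doe J, Roe. An example title. Ex. J. (2012).
```

Returning None for a missing element lets the caller decide how to handle the gap, for example by skipping that part of the name as done here.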