How do I get the author list and citation for a PubMed ID with Python?

Question

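I have a list of PubMed IDs and want to extract a citation for each one, with the complete author list. Some online tools can do this, for example http://mickschroeder.com/citation/, but they shorten the author list to "et al."

I am trying to do this with the Entrez package from Biopython, using xml.etree.ElementTree to parse the XML that comes back.

This is my code so far: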

from Bio.Entrez import efetch
import xml.etree.ElementTree as ET

def fetch_abstract(pmid):
    handle = efetch(db='pubmed', id=pmid, retmode='xml')
    xml_data = handle.read()
    print(xml_data)  # this prints the XML data structure correctly

    article = ET.XML(xml_data)

    # Problem starts here. I want to create a citation, so I start by trying to
    # get the names of the authors, but I am not sure why this is not working.
    for author_name in article.findall('AuthorValidYN'):
        print(author_name)

    return


fetch_abstract(22864638)
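
A likely reason the loop above prints nothing, offered as a sketch rather than a definitive diagnosis: in the XML, ValidYN is an attribute of the Author element, so no element has the tag name AuthorValidYN, and findall with a bare tag name only looks at direct children of the root in any case. The example below runs offline on a trimmed, hand-copied fragment of the record; the SAMPLE string and the author_names helper are made up for illustration and are not part of Biopython.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the record for PMID 22864638 (illustrative only).
SAMPLE = """<PubmedArticleSet>
<PubmedArticle>
  <MedlineCitation>
    <Article>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y">
          <LastName>Penkert</LastName>
          <ForeName>Judith</ForeName>
          <Initials>J</Initials>
        </Author>
        <Author ValidYN="Y">
          <LastName>Schlegelberger</LastName>
          <ForeName>Brigitte</ForeName>
          <Initials>B</Initials>
        </Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>"""


def author_names(root):
    """Return 'LastName Initials' for every valid <Author> below root."""
    names = []
    for author in root.iter('Author'):      # iter() searches all descendants
        if author.get('ValidYN') != 'Y':    # ValidYN is an attribute, not a tag
            continue
        last = author.findtext('LastName', default='')
        initials = author.findtext('Initials', default='')
        names.append((last + ' ' + initials).strip())
    return names


root = ET.XML(SAMPLE)
no_match = root.findall('AuthorValidYN')    # no element has this tag name
names = author_names(root)
print(no_match)
print(names)
```

The same author_names call should work on the tree built from the efetch response, since it only relies on the Author, LastName and Initials elements shown in the record.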

The XML looks like this:

<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2014//EN"      "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_140101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">22864638</PMID>
    <DateCreated>
        <Year>2012</Year>
        <Month>10</Month>
        <Day>31</Day>
    </DateCreated>
    <DateCompleted>
        <Year>2013</Year>
        <Month>04</Month>
        <Day>23</Day>
    </DateCompleted>
    <Article PubModel="Print">
        <Journal>
            <ISSN IssnType="Electronic">1573-7292</ISSN>
            <JournalIssue CitedMedium="Internet">
                <Volume>11</Volume>
                <Issue>4</Issue>
                <PubDate>
                    <Year>2012</Year>
                    <Month>Dec</Month>
                </PubDate>
            </JournalIssue>
            <Title>Familial cancer</Title>
            <ISOAbbreviation>Fam. Cancer</ISOAbbreviation>
        </Journal>
        <ArticleTitle>No evidence for breast cancer susceptibility associated with variants of BRD7, a component of p53 and BRCA1 pathways.</ArticleTitle>
        <Pagination>
            <MedlinePgn>601-6</MedlinePgn>
        </Pagination>
        <ELocationID EIdType="doi" ValidYN="Y">10.1007/s10689-012-9556-0</ELocationID>
        <Abstract>
            <AbstractText>BRD7 (bromodomain 7), a subunit of poly-bromo-associated BRG1-associated factor (PBAF)-specific Swi/Snf chromatin remodeling complexes, has been proposed as a tumour suppressor protein following its identification as an important component of both functional p53 and BRCA1 (breast cancer 1, early onset) pathways. As low BRD7 expression levels have been linked to p53-wild-type breast tumour cells, we hypothesized an implication of BRD7 germline alterations in the pathogenesis of hereditary breast cancer similar to that of TP53 in Li-Fraumeni syndrome. We performed sequence analysis of the BRD7 gene on 61 high-risk individuals with hereditary or very-early-onset breast cancer and 100 healthy controls. Four potentially disease-causing single-nucleotide alterations were detected within the cohort of breast cancer patients (one listed as a rare single-nucleotide polymorphism (SNP) in the NCBI (National Center for Biotechnology Information) SNP database). Two of the detected variants were also each found once within the control collective. Segregation analysis on both families of those carrying the remaining two variants revealed segregation of these BRD7 alterations independent of breast cancer. In conclusion, it seems that the BRD7 variants we detected represent rare polymorphisms and mainly rule out BRD7 as a frequent high-penetrance breast cancer susceptibility gene. However, further analyses in larger cohorts of women with hereditary breast cancer should clarify the role of BRD7 in breast cancer predisposition.</AbstractText>
        </Abstract>
        <AuthorList CompleteYN="Y">
            <Author ValidYN="Y">
                <LastName>Penkert</LastName>
                <ForeName>Judith</ForeName>
                <Initials>J</Initials>
                <Affiliation>Institute of Cell and Molecular Pathology, Hannover Medical School, Carl-Neuberg-Strasse 1, Hannover, Germany.</Affiliation>
            </Author>
            <Author ValidYN="Y">
                <LastName>Schlegelberger</LastName>
                <ForeName>Brigitte</ForeName>
                <Initials>B</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Steinemann</LastName>
                <ForeName>Doris</ForeName>
                <Initials>D</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Gadzicki</LastName>
                <ForeName>Dorothea</ForeName>
                <Initials>D</Initials>
            </Author>
        </AuthorList>
        <Language>eng</Language>
        <PublicationTypeList>
            <PublicationType>Comparative Study</PublicationType>
            <PublicationType>Journal Article</PublicationType>
            <PublicationType>Research Support, Non-U.S. Gov't</PublicationType>
        </PublicationTypeList>
    </Article>
    <MedlineJournalInfo>
        <Country>Netherlands</Country>
        <MedlineTA>Fam Cancer</MedlineTA>
        <NlmUniqueID>100898211</NlmUniqueID>
        <ISSNLinking>1389-9600</ISSNLinking>
    </MedlineJournalInfo>
    <ChemicalList>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>BRCA1 Protein</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>BRCA1 protein, human</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>BRD7 protein, human</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>Chromosomal Proteins, Non-Histone</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>TP53 protein, human</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>Tumor Suppressor Protein p53</NameOfSubstance>
        </Chemical>
    </ChemicalList>
    <CitationSubset>IM</CitationSubset>
    <MeshHeadingList>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Adult</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Aged</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">BRCA1 Protein</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Breast Neoplasms</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Case-Control Studies</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Chromosomal Proteins, Non-Histone</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Female</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="Y">Genetic Predisposition to Disease</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Male</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Middle Aged</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Mutation</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Pedigree</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Polymorphism, Single Nucleotide</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Prognosis</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Tumor Suppressor Protein p53</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Young Adult</DescriptorName>
        </MeshHeading>
    </MeshHeadingList>
</MedlineCitation>
<PubmedData>
    <History>
        <PubMedPubDate PubStatus="entrez">
            <Year>2012</Year>
            <Month>8</Month>
            <Day>7</Day>
            <Hour>6</Hour>
            <Minute>0</Minute>
        </PubMedPubDate>
        <PubMedPubDate PubStatus="pubmed">
            <Year>2012</Year>
            <Month>8</Month>
            <Day>7</Day>
            <Hour>6</Hour>
            <Minute>0</Minute>
        </PubMedPubDate>
        <PubMedPubDate PubStatus="medline">
            <Year>2013</Year>
            <Month>4</Month>
            <Day>24</Day>
            <Hour>6</Hour>
            <Minute>0</Minute>
        </PubMedPubDate>
    </History>
    <PublicationStatus>ppublish</PublicationStatus>
    <ArticleIdList>
        <ArticleId IdType="doi">10.1007/s10689-012-9556-0</ArticleId>
        <ArticleId IdType="pubmed">22864638</ArticleId>
    </ArticleIdList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
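
Given a record shaped like the one above, a citation with the full author list could be assembled roughly as follows. This is only a sketch under stated assumptions: SAMPLE is a trimmed copy of the record so the example runs without network access, format_citation is a hypothetical helper, and the layout (authors, title, journal abbreviation, year;volume(issue):pages) is an arbitrary choice rather than any official citation style. Records that lack a JournalIssue element are not handled.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record for PMID 22864638 (illustrative only).
SAMPLE = """<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation>
<Article PubModel="Print">
  <Journal>
    <JournalIssue CitedMedium="Internet">
      <Volume>11</Volume>
      <Issue>4</Issue>
      <PubDate><Year>2012</Year><Month>Dec</Month></PubDate>
    </JournalIssue>
    <ISOAbbreviation>Fam. Cancer</ISOAbbreviation>
  </Journal>
  <ArticleTitle>No evidence for breast cancer susceptibility associated with variants of BRD7, a component of p53 and BRCA1 pathways.</ArticleTitle>
  <Pagination><MedlinePgn>601-6</MedlinePgn></Pagination>
  <AuthorList CompleteYN="Y">
    <Author ValidYN="Y"><LastName>Penkert</LastName><Initials>J</Initials></Author>
    <Author ValidYN="Y"><LastName>Schlegelberger</LastName><Initials>B</Initials></Author>
    <Author ValidYN="Y"><LastName>Steinemann</LastName><Initials>D</Initials></Author>
    <Author ValidYN="Y"><LastName>Gadzicki</LastName><Initials>D</Initials></Author>
  </AuthorList>
</Article>
</MedlineCitation>
</PubmedArticle>
</PubmedArticleSet>"""


def format_citation(pubmed_article):
    """Build one citation string from a <PubmedArticle> element."""
    art = pubmed_article.find('MedlineCitation/Article')
    # Every author is listed, so nothing is shortened to "et al."
    authors = ', '.join(
        '{} {}'.format(a.findtext('LastName', ''), a.findtext('Initials', '')).strip()
        for a in art.findall('AuthorList/Author')
    )
    title = art.findtext('ArticleTitle', '')
    journal = art.findtext('Journal/ISOAbbreviation', '')
    issue = art.find('Journal/JournalIssue')
    year = issue.findtext('PubDate/Year', '')
    volume = issue.findtext('Volume', '')
    number = issue.findtext('Issue', '')
    pages = art.findtext('Pagination/MedlinePgn', '')
    return '{}. {} {} {};{}({}):{}.'.format(
        authors, title, journal, year, volume, number, pages)


root = ET.XML(SAMPLE)
# One PubmedArticleSet can hold several articles, one per requested ID.
citations = [format_citation(a) for a in root.findall('PubmedArticle')]
for citation in citations:
    print(citation)
```

Because the function takes a PubmedArticle element, the same loop would cover a whole list of IDs if they were fetched in a single response.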

data processing, XML parsing, bioinformatics, biopython, PubMed, entrez, citation extraction, author list

3 Answers
