Extracting text from a JATS XML file using Python

Published 2024-05-16 22:48:52


I want to extract text from a JATS-XML file.

JATS is a standardized XML format for representing research publications.

<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Elsevier Science B.V. All rights reserved.
P I I S</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>How does foreign direct investment affect economic 1 growth? E. Borenszteina ,*, J. De Gregoriob, J-W. Leec</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E. Borensztein</string-name>
          <email>eborensztein@imf.org</email>
          <xref ref-type="aff" rid="0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J. De Gregorio</string-name>
          <xref ref-type="aff" rid="2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J-W. Lee</string-name>
          <xref ref-type="aff" rid="3">3</xref>
        </contrib>
        <aff id="0">
          <label>0</label>
          <institution>International Monetary Fund, Research Department</institution>
          ,
          <addr-line>Washington DC 20431</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="1">
          <label>1</label>
          <institution>We are grateful for comments from Robert Barro</institution>
          ,
          <addr-line>Elhanan Helpman, Boyan Jovanovic, Mohsin Khan, Se-Jik Kim, Donald Mathieson, Sergio Rebelo, Jeffrey Sachs</addr-line>
          ,
          <institution>Peter Wickham, and two anonymous referees. Comments by participants in seminars at 1995 World Congress of the Econometric Society, Korean Macroeconomics Workshop, Kobe University, and Osaka University were very helpful. This paper was partially prepared while Jose ́ de Gregorio and Jong-Wha Lee were at the Research Department, International Monetary Fund. Any opinions expressed are only those of the</institution>
        </aff>
        <aff id="2">
          <label>2</label>
          <institution>Center for Applied Economics, Department of Industrial Engineering, Universidad de Chile</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="3">
          <label>3</label>
          <institution>Economics Department, Korea University and NBER</institution>
          ,
          <addr-line>Seoul 136 -701</addr-line>
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We test the effect of foreign direct investment (FDI) on economic growth in a cross-country regression framework, utilizing data on FDI flows from industrial countries to 69 developing countries over the last two decades. Our results suggest that FDI is an important vehicle for the transfer of technology, contributing relatively more to growth than domestic investment. However, the higher productivity of FDI holds only when the host country has a minimum threshold stock of human capital. Thus, FDI contributes to economic growth only when a sufficient absorptive capability of the advanced technologies is available in the host economy. 1998 Elsevier Science B.V.</p>
      </abstract>
      <kwd-group>
        <kwd>Foreign direct investment</kwd>
        <kwd>Economic growth</kwd>
        <kwd>Cross-country regression framework</kwd>
        <kwd>Developing countries</kwd>
      </kwd-group>
      <volume>0</volume>
      <issue>0</issue>
      <fpage>115</fpage>
      <lpage>135</lpage>
      <pub-date>
        <year>1998</year>
      </pub-date>
      <history>
        <date date-type="accepted">
          <day>20</day>
          <month>5</month>
          <year>1997</year>
        </date>
        <date date-type="received">
          <day>21</day>
          <month>2</month>
          <year>1996</year>
        </date>
        <date date-type="revised">
          <day>24</day>
          <month>2</month>
          <year>1997</year>
        </date>
      </history>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <ref id="1">
        <mixed-citation>
          <string-name>
            <surname>Aitken</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harrison</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>1993</year>
          ,
          <article-title>Do Domestically-Owned Firms Benefit from Foreign Direct Investment: Evidence from Panel Data, Unpublished manuscript</article-title>
          ,
          <source>International Monetary Fund.</source>
        </mixed-citation>
      </ref>
      <ref id="2">
        <mixed-citation>
          <string-name>
            <surname>Barro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J-W.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>International comparisons of educational attainment</article-title>
          .
          <source>Journal of Monetary Economics</source>
          <volume>32</volume>
          ,
          <fpage>361</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="3">
        <mixed-citation>
          <string-name>
            <surname>Barro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J-W.</given-names>
          </string-name>
          ,
          <year>1994</year>
          .
          <article-title>Sources of economic growth</article-title>
          .
          <source>Carnegie Rochester Conference Series on Public Policy</source>
          <volume>40</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="4">
        <mixed-citation>
          <string-name>
            <surname>Barro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Sala-i-</article-title>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <year>1995</year>
          . Economic Growth,
          <string-name>
            <surname>McGraw-Hill</surname>
          </string-name>
          , Cambridge, MA.
        </mixed-citation>
      </ref>
      <ref id="5">
        <mixed-citation>
          <string-name>
            <surname>Benhabib</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spiegel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>1994</year>
          .
          <article-title>The roles of human capital in economic development: evidence from aggregate cross-country data</article-title>
          .
          <source>Journal of Monetary Economics</source>
          <volume>34</volume>
          ,
          <fpage>143</fpage>
          -
          <lpage>173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="6">
        <mixed-citation>
          <string-name>
            <surname>Blomstrom</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipsey</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zejan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>1992</year>
          .
          <article-title>What Explains Developing Country Growth</article-title>
          . NBER Working Paper No.
          <volume>4132</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="7">
        <mixed-citation>
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>Foreign Finance and Economic Growth - An Empirical Analysis</article-title>
          .
          <article-title>Unpublished manuscript</article-title>
          , CEPREMAP.
        </mixed-citation>
      </ref>
      <ref id="8">
        <mixed-citation>
          <string-name>
            <surname>De Gregorio</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>1992</year>
          .
          <article-title>Economic growth in Latin America</article-title>
          .
          <source>Journal of Development Economics</source>
          <volume>39</volume>
          ,
          <fpage>58</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="9">
        <mixed-citation>
          <string-name>
            <surname>Easterly</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>How much do distortions affect growth</article-title>
          .
          <source>Journal of Monetary Economics</source>
          <volume>32</volume>
          ,
          <fpage>187</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="10">
        <mixed-citation>
          <string-name>
            <surname>Easterly</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>King</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levine</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rebelo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>1994</year>
          . Policy,
          <article-title>Technology Adoption and Growth</article-title>
          . NBER Working Paper No.
          <volume>4681</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="11">
        <mixed-citation>
          <string-name>
            <surname>Edwards</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>1990</year>
          . Capital Flows, Foreign Direct Investment, and
          <article-title>Debt-Equity Swaps in Developing Countries</article-title>
          . NBER Working Paper No.
          <volume>3497</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="12">
        <mixed-citation>
          <string-name>
            <surname>Ethier</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <year>1982</year>
          .
          <article-title>National and international returns to scale in the modern theory of international trade</article-title>
          .
          <source>American Economic Review</source>
          <volume>72</volume>
          ,
          <fpage>389</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="13">
        <mixed-citation>
          <string-name>
            <surname>Findlay</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <year>1978</year>
          .
          <article-title>Relative backwardness, direct foreign investment, and the transfer of technology: a simple dynamic model</article-title>
          .
          <source>Quarterly Journal of Economics 92</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="14">
        <mixed-citation>
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Helpman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <year>1991</year>
          .
          <article-title>Innovation and Growth in the Global Economy</article-title>
          , MIT Press Cambridge, MA.
        </mixed-citation>
      </ref>
      <ref id="15">
        <mixed-citation>
          <string-name>
            <surname>Gastil</surname>
            ,
            <given-names>R.D.</given-names>
          </string-name>
          ,
          <year>1987</year>
          . Freedom in the World, Greenwood Press, Westport, CT.
        </mixed-citation>
      </ref>
      <ref id="16">
        <mixed-citation>
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krugman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1991</year>
          .
          <article-title>Foreign Direct Investment in the United States</article-title>
          , Institute for International Economics, Washington DC.
        </mixed-citation>
      </ref>
      <ref id="17">
        <mixed-citation>
          <string-name>
            <surname>Jovanovic</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rob</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <year>1989</year>
          .
          <article-title>Growth and diffusion of technology</article-title>
          .
          <source>Review of Economic Studies</source>
          <volume>56</volume>
          ,
          <fpage>569</fpage>
          -
          <lpage>582</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="18">
        <mixed-citation>
          <string-name>
            <surname>King</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>Finance and growth: Schumpeter might be right</article-title>
          .
          <source>Quarterly Journal of Economics</source>
          <volume>108</volume>
          ,
          <fpage>717</fpage>
          -
          <lpage>738</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="19">
        <mixed-citation>
          <string-name>
            <surname>Knack</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keefer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1995</year>
          .
          <article-title>Institutions and economic performance: cross-country tests using alternative institutional measures</article-title>
          .
          <source>Economics and Politics</source>
          <volume>7</volume>
          ,
          <fpage>207</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="20">
        <mixed-citation>
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Renelt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>1992</year>
          .
          <article-title>A sensitivity analysis of cross-country growth regressions</article-title>
          .
          <source>American Economic Review</source>
          <volume>82</volume>
          ,
          <fpage>942</fpage>
          -
          <lpage>963</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="21">
        <mixed-citation>
          <string-name>
            <surname>Nelson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phelps</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <year>1966</year>
          .
          <article-title>Investment in humans, technological diffusion, and economic growth</article-title>
          .
          <source>American Economic Review: Papers and Proceedings</source>
          <volume>61</volume>
          ,
          <fpage>69</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="22">
        <mixed-citation>
          <string-name>
            <surname>Romer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1990</year>
          .
          <article-title>Endogenous technological change</article-title>
          .
          <source>Journal of Political Economy</source>
          <volume>98</volume>
          ,
          <fpage>S71</fpage>
          -
          <lpage>S102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="23">
        <mixed-citation>
          <string-name>
            <surname>Romer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
          <article-title>Idea gaps and object gaps in economic development</article-title>
          .
          <source>Journal of Monetary Economics</source>
          <volume>32</volume>
          ,
          <fpage>543</fpage>
          -
          <lpage>573</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="24">
        <mixed-citation>
          <string-name>
            <surname>Segerstrom</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          ,
          <year>1991</year>
          . Innovation, imitation, and
          <article-title>economic growth</article-title>
          .
          <source>Journal of Political Economy</source>
          <volume>99</volume>
          ,
          <fpage>807</fpage>
          -
          <lpage>827</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="25">
        <mixed-citation>
          <string-name>
            <given-names>United</given-names>
            <surname>Nations</surname>
          </string-name>
          ,
          <year>1992</year>
          .
          <source>World Investment Report 1992 Transnational Corporations as Engines of Growth</source>
          , Department of Economic and Social Development, United Nations, New York.
        </mixed-citation>
      </ref>
      <ref id="26">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J-Y.</given-names>
          </string-name>
          ,
          <year>1990</year>
          .
          <article-title>Growth, technology transfer, and the long-run theory of international capital movements</article-title>
          .
          <source>Journal of International Economics</source>
          <volume>29</volume>
          ,
          <fpage>255</fpage>
          -
          <lpage>271</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>

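There is an <abstract> tag around line 58, and I want to extract the text it contains. The file structure is quite complex, and although it looks like ordinary XML, I could not get any output. I tried several libraries, such as untangle, lxml and {}, but without success.

This is one of the snippets I tried.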

^{pr2}$

Edit: I also have some nested tags with dynamic names. I want to extract the text between those tags, for example:

<body>
    <sec id="1">
      <title>1. Introduction</title>
      <p>Technology diffusion plays a central role in the process of economic
    development.2 In contrast to the traditional growth framework, where technological
    change was left as an unexplained residual, the recent growth literature has
    highlighted the dependence of growth rates on the state of domestic technology
    relative to that of the rest of the world. Thus, growth rates in developing countries
    are, in part, explained by a ‘catch-up’ process in the level of technology. In a
    typical model of technology diffusion, the rate of economic growth of a backward
    country depends on the extent of adoption and implementation of new
    technologies that are already in use in leading countries.</p>
<p>The paper is divided into four sections. Section 2 presents a simple model to
motivate our empirical investigation; Section 3 provides an account of the data
used in the empirical analysis; Section 4 describes the regression results, and
Section 5 presents some concluding remarks.</p>
     </sec>
 <sec id="2">... </sec>
</body>

3 answers

I also managed to get the abstract by using XPath with the lxml.etree module.

import os
import lxml.etree as et

def get_article_abstract(article_file, tag_path_elements=None):
    """
    Extract the abstract of a single article as plain text.

    :param article_file: path to the XML file for a single article (e.g. an individual local PLOS XML article)
    :param tag_path_elements: sequence of XPath steps locating the abstract in the article's XML tree
    :return: plain-text string of content in abstract
    """
    if tag_path_elements is None:
        tag_path_elements = ("/",
                             "article",
                             "front",
                             "article-meta",
                             "abstract")

    article_tree = et.parse(article_file)
    article_root = article_tree.getroot()
    tag_location = '/'.join(tag_path_elements)
    abstract = article_root.xpath(tag_location)
    abstract_text = et.tostring(abstract[0], encoding='unicode', method='text')

    # clean up text: strip surrounding whitespace, remove two-space runs (indentation), drop blank lines
    abstract_text = abstract_text.strip().replace('  ', '')
    abstract_text = os.linesep.join([s for s in abstract_text.splitlines() if s])

    # return the string, as the docstring promises; print() would return None
    return abstract_text
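
For comparison, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree in place of lxml (a swapped-in parser, not the method used above). The SAMPLE string is a shortened, hypothetical stand-in for the file in the question, and the helper name abstract_text is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the JATS file in the question.
SAMPLE = """<article>
  <front>
    <article-meta>
      <abstract>
        <p>We test the effect of foreign direct investment (FDI)
        on economic growth.</p>
      </abstract>
    </article-meta>
  </front>
</article>"""


def abstract_text(xml_string):
    """Return the abstract as one whitespace-normalized string, or None."""
    root = ET.fromstring(xml_string)
    # ElementTree supports a limited XPath subset; './/abstract' matches
    # the first <abstract> anywhere below the root element.
    abstract = root.find(".//abstract")
    if abstract is None:
        return None
    # itertext() yields the text of the element and all its descendants;
    # split() followed by join() collapses runs of spaces and newlines.
    return " ".join("".join(abstract.itertext()).split())


print(abstract_text(SAMPLE))
```

Returning None when no <abstract> is found avoids the IndexError that indexing an empty result list would raise; whether that is the right behavior depends on how the caller wants to treat articles without an abstract.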

lxml with XPath seems to handle your data fine:

>>> from lxml import etree
>>> d = etree.parse(open('...'))  # file with your exact content
>>> e = d.getroot()
>>> e.xpath('.//abstract')
[<Element abstract at 0x7f9239c10710>]
>>> e.xpath('.//abstract/p')[0].text  # first p inside abstract
'We test the effect of foreign direct investment (FDI) ...'
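
The same element-by-element approach can probably be extended to the nested <sec> tags from the question's edit. The sketch below is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, SAMPLE_BODY is a shortened, hypothetical fragment (the content of the second section is a placeholder, since the question elides it), and the function name sections is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the <body> fragment in the question.
# The second section's content is a placeholder; the original elides it.
SAMPLE_BODY = """<body>
  <sec id="1">
    <title>1. Introduction</title>
    <p>Technology diffusion plays a central role.</p>
    <p>The paper is divided into four sections.</p>
  </sec>
  <sec id="2">
    <title>(placeholder title)</title>
    <p>(placeholder paragraph)</p>
  </sec>
</body>"""


def sections(xml_string):
    """Map each <sec> id to its title and its list of paragraph texts."""
    root = ET.fromstring(xml_string)
    result = {}
    # iter('sec') visits every <sec> element regardless of its id value,
    # so the ids do not need to be known in advance.
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        # findall('p') returns only the direct <p> children of this section.
        paragraphs = [
            " ".join("".join(p.itertext()).split()) for p in sec.findall("p")
        ]
        result[sec.get("id")] = {"title": title, "paragraphs": paragraphs}
    return result


for sec_id, content in sections(SAMPLE_BODY).items():
    print(sec_id, content["title"], len(content["paragraphs"]))
```

Because iter() also descends into sub-sections, a nested <sec> would appear as its own entry, while its paragraphs would not be repeated under the parent.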

You can find it with the bs4 library.

from bs4 import BeautifulSoup

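# xmla is assumed to hold the XML document as a string;
# the 'xml' parser requires lxml to be installed
soup = BeautifulSoup(xmla, 'xml')
print(soup.find('abstract'))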

>>> '<abstract>haha</abstract>'
