BeautifulSoup: when extracting text from a section, <emph> and other tags are ignored, causing adjacent words to be pushed together

Posted 2024-05-15 23:43:44


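I have an XML document. I'd like to extract all the text between all <p>..</p> tags. Below is a sample of the text. The problem is that for a sentence like: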

"Because the <emph>raspberry</emph> and.." 

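the output is "Because theraspberryand..". Somehow the emph tags are being dropped (which is good), but they are dropped in a way that pushes the adjacent words together.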

Here is the relevant code I'm using:

xml = BeautifulSoup(xml, convertEntities=BeautifulSoup.HTML_ENTITIES)
for para in xml.findAll('p'):
    text = text + " " + para.text + " "

Here is the beginning of the text, in case the full document helps:

<!DOCTYPE art SYSTEM "keton.dtd">
<art jid="PNAS" aid="1436" vid="94" iss="14" date="07-08-1997" ppf="7349" ppl="7355">
<fm>
<doctopic>Developmental Biology</doctopic>
<dochead>Inaugural Article</dochead>
<docsubj>Biological Sciences</docsubj>
<atl>Suspensor-derived polyembryony caused by altered expression of
valyl-tRNA synthetase in the <emph>twn2</emph>
mutant of <emph>Arabidopsis</emph></atl>
<prs>This contribution is part of the special series of Inaugural
Articles by members of the National Academy of Sciences elected on
April 30, 1996.</prs>
<aug>
<au><fnm>James Z.</fnm><snm>Zhang</snm></au>
<au><fnm>Chris R.</fnm><snm>Somerville</snm></au>
<fnr rid="FN150"><aff>Department of Plant Biology, Carnegie Institution of Washington,
290 Panama Street, Stanford CA 94305</aff>
</fnr></aug>
<acc>May 9, 1997</acc>
<con>Chris R. Somerville</con>
<pubfront>
<cpyrt><date><year>1997</year></date>
<cpyrtnme><collab>The National Academy of Sciences of the USA</collab></cpyrtnme></cpyrt>
<issn>0027-8424</issn><extent>7</extent><price>2.00/0</price>
</pubfront>
<fn id="FN150"><p>To whom reprint requests should be addressed. e-mail:
<email>crs@andrew.stanford.edu</email>.</p>
</fn>
<abs><p>The <emph>twn2</emph> mutant of <emph>Arabidopsis</emph>
exhibits a defect in early embryogenesis where, following one or two
divisions of the zygote, the decendents of the apical cell arrest. The
basal cells that normally give rise to the suspensor proliferate
abnormally, giving rise to multiple embryos. A high proportion of the
seeds fail to develop viable embryos, and those that do, contain a high
proportion of partially or completely duplicated embryos. The adult
plants are smaller and less vigorous than the wild type and have a
severely stunted root. The <emph>twn2-1</emph> mutation, which is the
only known allele, was caused by a T-DNA insertion in the 5′
untranslated region of a putative valyl-tRNA synthetase gene,
<it>valRS</it>. The insertion causes reduced transcription of the
<it>valRS</it> gene in reproductive tissues and developing seeds but
increased expression in leaves. Analysis of transcript initiation sites
and the expression of promoter–reporter fusions in transgenic plants
indicated that enhancer elements inside the first two introns interact
with the border of the T-DNA to cause the altered pattern of expression
of the <it>valRS</it> gene in the <emph>twn2</emph> mutant. The
phenotypic consequences of this unique mutation are interpreted in the
context of a model, suggested by Vernon and Meinke &amp;lsqbVernon, D. M. &amp;
Meinke, D. W. (1994) <emph>Dev. Biol.</emph> 165, 566–573&amp;rsqb, in
which the apical cell and its decendents normally suppress the
embryogenic potential of the basal cell and its decendents during early
embryo development.</p>
</abs>
</fm>

1 answer

I think the problem here is that you're trying to write bs4 code with bs3.

The obvious fix is to use bs4 instead.
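As a minimal sketch of what that might look like (assuming Beautiful Soup 4 is installed as the `bs4` package; the one-paragraph `sample` string is made up for illustration, and the built-in "html.parser" backend is used here only to avoid extra dependencies, so a real XML document may be better served by an XML-aware parser):

```python
from bs4 import BeautifulSoup  # Beautiful Soup 4, not the old BeautifulSoup 3 module

# Hypothetical one-paragraph sample modelled on the sentence in the question.
sample = "<abs><p>Because the <emph>raspberry</emph> and..</p></abs>"

soup = BeautifulSoup(sample, "html.parser")

# In bs4, get_text() joins the text nodes without stripping them by default,
# so the spaces around the <emph> element are kept.
texts = [para.get_text() for para in soup.find_all("p")]
print(texts)
```

With this sample the words on either side of the emph element should stay separated, because the whitespace inside the text nodes is left alone.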

But in bs3, the docs show two ways to recursively get all of the text from everything in a soup:

''.join(e for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
''.join(soup.findAll(text=True))

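Obviously, you can change either one to explicitly strip the whitespace at the edges and add exactly one space between each node, instead of relying on whatever spaces happen to be there: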

' '.join(e.strip() for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
# the text nodes are unicode subclasses, so call .strip() on each one rather than str.strip
' '.join(s.strip() for s in soup.findAll(text=True))

I don't want to guarantee that this is exactly the same as the bs4 text property... but I think it's what you want.
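To see why the joining step matters, here is a rough, standard-library-only sketch. It does not use BeautifulSoup at all: `TextNodeCollector` and `text_nodes` are hypothetical helpers built on `html.parser`, standing in for "collect every text node in document order", and the sample sentence is the one from the question.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Hypothetical helper: records every text node in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.nodes.append(data)


def text_nodes(markup):
    parser = TextNodeCollector()
    parser.feed(markup)
    parser.close()
    return parser.nodes


nodes = text_nodes("<p>Because the <emph>raspberry</emph> and..</p>")

# Stripping every node and joining with nothing glues the words together,
# which is the symptom described in the question.
squashed = "".join(n.strip() for n in nodes)

# Stripping every node and joining with one space keeps the words apart.
# Whitespace-only nodes are skipped here to avoid doubled spaces
# (an extra step that the one-liners above do not take).
spaced = " ".join(n.strip() for n in nodes if n.strip())

# Leaving the nodes untouched also works when the source already has spaces.
untouched = "".join(nodes)

print(squashed)
print(spaced)
print(untouched)
```

One trade-off of the strip-and-join approach: punctuation that directly follows an inline element, such as the period after <it>valRS</it> in the sample document, ends up with a space in front of it.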
