BeautifulSoup: when extracting text from a section, <emph> and other tags are ignored, causing adjacent words to run together

<!DOCTYPE art SYSTEM "keton.dtd"> <art jid="PNAS" aid="1436" vid="94" iss="14" date="07-08-1997" ppf="7349" ppl="7355"> <fm> <doctopic>Developmental Biology</doctopic> <dochead>Inaugural Article</dochead> <docsubj>Biological Sciences</docsubj> <atl>Suspensor-derived polyembryony caused by altered expression of valyl-tRNA synthetase in the <emph>twn2</emph> mutant of <emph>Arabidopsis</emph></atl> <prs>This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected on April 30, 1996.</prs> <aug> <au><fnm>James Z.</fnm><snm>Zhang</snm></au> <au><fnm>Chris R.</fnm><snm>Somerville</snm></au> <fnr rid="FN150"><aff>Department of Plant Biology, Carnegie Institution of Washington, 290 Panama Street, Stanford CA 94305</aff> </fnr></aug> <acc>May 9, 1997</acc> <con>Chris R. Somerville</con> <pubfront> <cpyrt><date><year>1997</year></date> <cpyrtnme><collab>The National Academy of Sciences of the USA</collab></cpyrtnme></cpyrt> <issn>0027-8424</issn><extent>7</extent><price>2.00/0</price> </pubfront> <fn id="FN150"><p>To whom reprint requests should be addressed. e-mail: <email>crs@andrew.stanford.edu</email>.</p> </fn> <abs><p>The <emph>twn2</emph> mutant of <emph>Arabidopsis</emph> exhibits a defect in early embryogenesis where, following one or two divisions of the zygote, the decendents of the apical cell arrest. The basal cells that normally give rise to the suspensor proliferate abnormally, giving rise to multiple embryos. A high proportion of the seeds fail to develop viable embryos, and those that do, contain a high proportion of partially or completely duplicated embryos. The adult plants are smaller and less vigorous than the wild type and have a severely stunted root. The <emph>twn2-1</emph> mutation, which is the only known allele, was caused by a T-DNA insertion in the 5′ untranslated region of a putative valyl-tRNA synthetase gene, <it>valRS</it>. 
The insertion causes reduced transcription of the <it>valRS</it> gene in reproductive tissues and developing seeds but increased expression in leaves. Analysis of transcript initiation sites and the expression of promoter–reporter fusions in transgenic plants indicated that enhancer elements inside the first two introns interact with the border of the T-DNA to cause the altered pattern of expression of the <it>valRS</it> gene in the <emph>twn2</emph> mutant. The phenotypic consequences of this unique mutation are interpreted in the context of a model, suggested by Vernon and Meinke &lsqbVernon, D. M. & Meinke, D. W. (1994) <emph>Dev. Biol.</emph> 165, 566–573&rsqb, in which the apical cell and its decendents normally suppress the embryogenic potential of the basal cell and its decendents during early embryo development.</p> </abs> </fm>

1 answer

User

#1 · Posted 2024-05-15 23:43:44

I think the problem here is that you are trying to write BeautifulSoup 4 (bs4) code while using BeautifulSoup 3 (bs3).

The obvious fix is to use bs4 instead.
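As a minimal sketch of what that could look like, assuming bs4 is installed and Python's built-in "html.parser" backend is acceptable for this markup (both are assumptions, not something stated in the question): bs4's get_text() accepts a separator and a strip flag, which keeps text from adjacent tags apart. The fragment is the author element from the sample above.

```python
from bs4 import BeautifulSoup

# Fragment taken from the sample document: two tags with no whitespace between them.
markup = "<au><fnm>James Z.</fnm><snm>Zhang</snm></au>"

# "html.parser" is an assumed choice of backend; it treats unknown tags as ordinary elements.
soup = BeautifulSoup(markup, "html.parser")

# Default behavior: text nodes are concatenated exactly as they appear.
plain = soup.get_text()

# With a separator and strip=True: each text node is stripped, empty nodes
# are dropped, and the remainder is joined with a single space.
spaced = soup.get_text(" ", strip=True)

print(plain)   # James Z.Zhang
print(spaced)  # James Z. Zhang
```

The first call reproduces the run-together words from the question; the second separates them.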

If you have to stay on bs3, its documentation shows two ways to recursively collect all the text from everything in the soup:

''.join(e for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
''.join(soup.findAll(text=True))

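You can adapt either one to explicitly strip whitespace from the edges of each text node and put a single space between nodes, instead of relying on whatever whitespace happens to be in the source: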

' '.join(e.strip() for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
' '.join(map(unicode.strip, soup.findAll(text=True)))  # text nodes are unicode subclasses, so use unicode.strip, not str.strip

I won't guarantee that this behaves exactly like the bs4 text property, but I think it is what you want.
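One place where the two can differ is punctuation that directly follows a tag. The comparison below is a sketch written against bs4 rather than bs3, assuming a bs4 version recent enough to accept find_all(string=True) (the bs4 counterpart of findAll(text=True)) and the "html.parser" backend; the fragment is shortened from the abstract in the question.

```python
from bs4 import BeautifulSoup

# Shortened from the abstract: a tag followed immediately by a period.
markup = "<p>a putative gene, <it>valRS</it>.</p>"
soup = BeautifulSoup(markup, "html.parser")

# bs4 default: text nodes are concatenated unchanged.
as_is = soup.get_text()

# Strip-and-join, as in the bs3 one-liners above, spelled with bs4 names.
joined = " ".join(s.strip() for s in soup.find_all(string=True))

print(as_is)   # a putative gene, valRS.
print(joined)  # a putative gene, valRS .
```

The strip-and-join version puts a space before the period, so it suits word extraction better than faithful reproduction of the sentence.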
