How do I get text from an HTML tag that is not included in .content?

from bs4 import BeautifulSoup
import requests

url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
data = r.content
soup = BeautifulSoup(data, "html.parser")
PageInfo = soup.find("pre", attrs={"class":"genbank"})
print(PageInfo)

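2 Answers

User

Answer 1 · edited 2024-06-07 00:30:36

The page makes a separate XHR call to fetch the information you are looking for, so it is not part of the HTML that requests downloads. The call is https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=135747&db=protein&report=genpept&conwithfeat=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000

It returns: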

<div class="sequence">
<a name="locus_P22217.3"></a><div class="localnav"><ul class="locals"><li><a href="#comment_P22217.3" title="Jump to the comment section of this record">Comment</a></li><li><a href="#feature_P22217.3" title="Jump to the feature table of this record">Features</a></li><li><a href="#sequence_P22217.3" title="Jump to the sequence of this record">Sequence</a></li></ul></div>
<pre class="genbank">LOCUS       TRX1_YEAST               103 aa            linear   PLN 18-SEP-2019
DEFINITION  RecName: Full=Thioredoxin-1; AltName: Full=Thioredoxin I;
            Short=TR-I; AltName: Full=Thioredoxin-2.
ACCESSION   P22217
VERSION     P22217.3
**DBSOURCE**    UniProtKB: locus TRX1_YEAST, accession <a href="https://www.uniprot.org/uniprot/P22217">P22217</a>;
            class: standard.
            extra accessions:D6VY45
            created: Aug 1, 1991.

...

So make that HTTP call from your own code to get the data.
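
A minimal sketch of that approach, using only the standard library so it runs without extra packages (this swaps in urllib and html.parser for requests and BeautifulSoup). The query parameters are copied from the XHR URL above. The names build_viewer_url, GenbankPreParser, extract_genbank_text and fetch_genbank_text are made up for illustration, and the endpoint's behaviour may change, so treat the network call as untested.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

VIEWER_URL = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"


def build_viewer_url(uid):
    """Rebuild the XHR URL quoted above for a numeric record id."""
    params = {
        "id": uid,
        "db": "protein",
        "report": "genpept",
        "conwithfeat": "on",
        "show-cdd": "on",
        "retmode": "html",
        "withmarkup": "on",
        "tool": "portal",
        "log$": "seqview",
        "maxdownloadsize": 1000000,
    }
    return VIEWER_URL + "?" + urlencode(params)


class GenbankPreParser(HTMLParser):
    """Collect the text inside <pre class="genbank">, dropping nested tags."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "pre" and "genbank" in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def extract_genbank_text(html):
    """Return the plain text of the genbank <pre> block ('' if absent)."""
    parser = GenbankPreParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def fetch_genbank_text(uid):
    """Network call (assumption: the endpoint still answers as shown above)."""
    with urlopen(build_viewer_url(uid), timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_genbank_text(html)
```

Calling fetch_genbank_text(135747) should then give the record text, provided the endpoint still responds the way it did when this was answered. With BeautifulSoup, the parsing step would be a soup.find("pre", attrs={"class": "genbank"}) on the XHR response instead of on the original page.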

User

Answer 2 · edited 2024-06-07 00:30:36

You can use the following, since the page relies on XMLHttpRequests.

Code:

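import re

import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/protein/P22217"

# Step 1: load the record page and read the numeric id from its meta tag.
r = requests.get(url)
soup = BeautifulSoup(r.content, features='html.parser')
pageId = soup.find('meta', attrs={'name': 'ncbi_uidlist'})['content']

# Step 2: request the record from the viewer endpoint that the page itself calls.
api = requests.get('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}'.format(pageId))

# Step 3: capture everything between DBSOURCE and KEYWORD.
data = re.search(r'DBSOURCE([\w\s\n\t.:,;()-_]*)KEYWORD', api.text)
print(data.group(1).strip())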

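Code demo: Here

Explanation:

The request to the URL retrieves the id of the record you asked for, which is stored in the page's meta tags.
With that id, a second request uses the website's API to fetch the full record. A regular expression then separates the wanted part from the rest.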
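
The first step fails with a TypeError if the meta tag is missing, because find() returns None. Here is a rough sketch of a more defensive lookup, written with the standard library's html.parser rather than BeautifulSoup. MetaContentParser and find_uid are hypothetical names, and the only thing taken from the page is the meta name ncbi_uidlist.

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Remember the content attribute of the first matching <meta name=...>."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.content = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.content is None:
            attributes = dict(attrs)
            if attributes.get("name") == self.name:
                self.content = attributes.get("content")


def find_uid(html, meta_name="ncbi_uidlist"):
    """Return the id stored in the meta tag, or raise a readable error."""
    parser = MetaContentParser(meta_name)
    parser.feed(html)
    if not parser.content:
        raise ValueError(f"no <meta name={meta_name!r}> with content found")
    return parser.content
```

Passing r.text from the first request to find_uid gives the id to put into the viewer.fcgi URL, and a missing tag shows up as a clear ValueError instead of a TypeError on None.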

Regex:

DBSOURCE([\w\s\n\t.:,;()-_]*)KEYWORD

Regex demo: Here
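
One thing to watch in that pattern: inside the brackets, ")-_" is a character range running from ")" to "_" in ASCII, so it also matches digits, uppercase letters and characters such as "<" and "/". It works, but possibly not for the reason intended. Below is a small sketch comparing it with a more explicit alternative on a shortened sample. The sample text is trimmed from the output shown in answer 1, the KEYWORDS line is a placeholder I made up, and dbsource is a hypothetical helper name.

```python
import re

# Shortened sample record; the KEYWORDS value is a placeholder, not real data.
sample = """VERSION     P22217.3
DBSOURCE    UniProtKB: locus TRX1_YEAST, accession P22217;
            class: standard.
            created: Aug 1, 1991.
KEYWORDS    placeholder.
"""

# Pattern from the answer: ")-_" inside the brackets is a character range.
original = re.compile(r'DBSOURCE([\w\s\n\t.:,;()-_]*)KEYWORD')

# Alternative: any text, as little as possible, up to a line starting with KEYWORDS.
explicit = re.compile(r'^DBSOURCE(.*?)^KEYWORDS', re.DOTALL | re.MULTILINE)


def dbsource(text, pattern=explicit):
    """Return the DBSOURCE field without surrounding whitespace, or None."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


print(dbsource(sample))
```

On this sample both patterns capture the same block. The explicit one does not depend on listing every character that may appear in the field, and checking for None avoids an AttributeError when the record has no DBSOURCE section.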
