BeautifulSoup: get text, create a dictionary

import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
for paper in soup.find_all("li", class_="list-group-item downfree"):
    print(paper.text)

2 Answers

User

#1 · edited 2024-05-15 17:00:15

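You can use regular expressions to match each part of the string.

[-\d]+ matches a run of digits and hyphens only (the paper number)
(?<=\s).*?(?=by) matches text that starts after a whitespace character and stops just before "by", which introduces the authors (the title)
(?<=by\s).* matches the authors: everything after "by " up to the end of the line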
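To see what each pattern picks out, here is a small sketch run on a made-up string. The single-line layout of `sample` (number, title, "by", authors) is an assumption for illustration; the real `paper.text` from the page may contain extra whitespace or line breaks.

```python
import re

# Hypothetical sample; the real paper.text may be laid out differently.
sample = (
    "2018-069 The Effect of Common Ownership on Profits : "
    "Evidence From the U.S. Banking Industry "
    "by Jacob P. Gramlich & Serafin J. Grundl"
)

# First run of digits and hyphens
number = re.findall(r"[-\d]+", sample)[0]
# Lazy match from just after the first whitespace up to the first "by"
title = re.findall(r"(?<=\s).*?(?=by)", sample)[0]
# Everything that follows "by " on the same line
authors = re.findall(r"(?<=by\s).*", sample)[0]

print(repr(number))
print(repr(title))    # note the trailing space left before "by"
print(repr(authors))
```

The title match keeps the whitespace that precedes "by", so calling `.strip()` on it is usually worthwhile.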

Full code

import re

import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
datas = []
for paper in soup.find_all("li", class_="list-group-item downfree"):
    data = dict()
    # Paper number: the first run of digits and hyphens
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]
    # Title: from just after a whitespace character up to the first "by"
    # (a title that itself contains the letters "by" is cut short)
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0]
    # Authors: everything after "by " up to the end of that line
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]
    print(data)
    datas.append(data)
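Because the title pattern stops at the first "by", a title containing a word such as "Lobbying" would be truncated. One possible way around this, sketched below, is a single pattern with named groups that only accepts "by" when it is surrounded by whitespace. The helper `parse_paper`, the pattern `PAPER_RE`, the assumed number format (four digits, a hyphen, more digits) and the sample title and authors are all hypothetical, not taken from the page.

```python
import re

# Assumed layout: "<number> <title> by <authors>". This is a guess based on
# the output shown in this thread, not a documented format of the page.
PAPER_RE = re.compile(
    r"(?P<number>\d{4}-\d+)\s+(?P<title>.*?)\s+by\s+(?P<authors>.+)",
    re.DOTALL,  # let "." cross line breaks inside the list item
)


def parse_paper(text):
    """Return a dict for one list item, or None if the text does not fit."""
    match = PAPER_RE.search(text)
    if match is None:
        return None
    return {
        "date": match.group("number"),
        # Collapse internal runs of whitespace to single spaces
        "Title": " ".join(match.group("title").split()),
        "Author(s)": " ".join(match.group("authors").split()),
    }


# Made-up entry whose title contains the letters "by" inside a word
sample = "2019-001 Lobbying and Bank Regulation\nby A. Author & B. Writer"
print(parse_paper(sample))
```

A title that contains the standalone word "by" would still be split at its first occurrence, so this is a mitigation rather than a complete fix.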

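User

#2 · edited 2024-05-15 17:00:15

Extracting all descendants and keeping only those that are NavigableStrings gives good results. Make sure to import NavigableString from bs4. I also used a list comprehension, but you could use a for loop instead.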

import requests
from bs4 import BeautifulSoup, NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

papers = []
for paper in soup.find_all("li", class_="list-group-item downfree"):
    # Keep only the plain text nodes under this <li>, stripped of whitespace
    info = [desc.strip() for desc in paper.descendants if type(desc) is NavigableString]
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})

print(papers[1])

{'Date': '2018-069',
 'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
 'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}
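The for-loop variant mentioned above could look roughly like the sketch below. It runs on an inline HTML snippet so it works offline; that markup is a simplified guess, not the page's real structure. Unlike the list comprehension, this version also skips text nodes that are empty after stripping, which is an added assumption.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, simplified markup; the real page's structure may differ.
html = (
    '<ul><li class="list-group-item downfree">'
    '<b>2018-069</b>'
    '<a href="#">The Effect of Common Ownership on Profits : '
    'Evidence From the U.S. Banking Industry</a><br/>'
    'by <i>Jacob P. Gramlich &amp; Serafin J. Grundl</i>'
    '</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

papers = []
for paper in soup.find_all("li", class_="list-group-item downfree"):
    info = []
    for desc in paper.descendants:
        # Tags are skipped; only plain text nodes are kept
        if type(desc) is NavigableString:
            text = desc.strip()
            if text:  # drop whitespace-only nodes
                info.append(text)
    # In this snippet info[2] is the literal "by", so the authors sit at index 3
    papers.append({"Date": info[0], "Title": info[1], "Author": info[3]})

print(papers[0])
```

The fixed indices depend entirely on the order of text nodes in the markup, so they are worth re-checking against the live page before relying on them.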
