How to iterate a web scraping script over a daily time series object in order to build a daily time series of data from a webpage



Thanks for looking at my question. I've created a script using BeautifulSoup and Pandas that scrapes projection data from the Federal Reserve's website. Projections come out once a quarter (roughly every 3 months). I'd like to write a script that creates a daily time series and checks the Fed's website once a day: if a new projection has been posted, the script adds it to the time series; if there is no update, the script simply appends the time series with the last valid, posted projection.

From my initial digging, it seems there are external tools I could use to "trigger" the script each day, but I'd prefer to keep everything in pure Python.
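For example, I imagine a simple loop that sleeps for a day between runs could serve as a pure-Python trigger. A minimal sketch, where run_scraper is a hypothetical wrapper around the scraping code below:

import time
import datetime

def run_scraper():
    # Hypothetical placeholder: wrap the BeautifulSoup/pandas code
    # below so that one call checks the Fed page once.
    pass

while True:
    print("Checking for new projections at", datetime.datetime.now())
    run_scraper()
    time.sleep(24 * 60 * 60)  # wait one day before checking again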

The code I've written to accomplish the scraping so far is:

from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd 

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm" 
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the list of links to projection pages
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# Create a list to store the projections
decfcasts = []
for i in projections:
    url = "https://www.federalreserve.gov{}".format(i)
    file = wget.download(url)
    df_list = pd.read_html(file)
    fcast = df_list[-1].iloc[:,0:2]
    fcast.columns = ['Target', 'Votes']
    fcast.fillna(0, inplace = True)
    decfcasts.append(fcast)

So far, the code I've written puts everything in a list, but there is no time/date index on the data. I've been thinking about how to structure this, and my pseudocode looks roughly like

Create daily time series object
for each day in time series:
    if day in time series == date in projection link:
        run web scraper
    otherwise, append time series with last available observation

At least, that's my thinking. The final time series will probably look fairly "clunky": there will be many days with the same observation, then a "jump" when a new projection comes out, followed by more repeated values until the next projection appears.
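For reference, the kind of stepped, forward-filled series I have in mind can be built with pandas by reindexing onto a daily range. A minimal sketch with made-up quarterly dates and values (not real Fed data):

import pandas as pd

# Made-up quarterly observations for illustration
quarterly = pd.DataFrame(
    {'target': [2.4, 2.1]},
    index=pd.to_datetime(['2019-03-20', '2019-06-19'])
)

# Reindex onto every calendar day and carry the last observation forward
daily = quarterly.reindex(
    pd.date_range(quarterly.index.min(), quarterly.index.max(), freq='D')
).ffill()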

Obviously, any help is greatly appreciated. Thanks in advance either way!


1 Answer

I've edited your code. Now it gets the date from the URL and stores it in the DataFrame as a pandas Period. A projection is processed and appended only if its date is not already present in the DataFrame (which is restored from a pickle file).

from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the list of links to projection pages
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# Restore past results from pickle; if no pickle exists yet, initialize an empty DataFrame
try:
    decfcasts = pd.read_pickle('decfcasts.pkl')
except FileNotFoundError:
    decfcasts = pd.DataFrame(columns=['target', 'votes', 'date'])


for i in projections:

    # parse date from url
    date = pd.Period(''.join(re.findall(r'\d+', i)), 'D')

    # process projection if it wasn't included in data from pickle
    if date not in decfcasts['date'].values:

        url = "https://www.federalreserve.gov{}".format(i)
        file = wget.download(url)
        df_list = pd.read_html(file)
        fcast = df_list[-1].iloc[:, 0:2]
        fcast.columns = ['target', 'votes']
        fcast.fillna(0, inplace=True)

        # set date time
        fcast.insert(2, 'date', date)
        decfcasts = pd.concat([decfcasts, fcast], ignore_index=True)

# save to pickle
pd.to_pickle(decfcasts, 'decfcasts.pkl')
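For clarity, here is what the date parsing does: the projection URLs embed the meeting date as a run of digits, so joining all digit groups in the link and passing them to pd.Period yields a daily period. The URL below is illustrative, not taken from the live page:

import re
import pandas as pd

link = '/monetarypolicy/fomcprojtabl20190320.htm'  # illustrative link
date = pd.Period(''.join(re.findall(r'\d+', link)), 'D')
print(date)  # 2019-03-20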
