Python: speeding up code that uses Pandas .append

/resources/archive/us/2007.html
/resources/archive/us/2008.html
/resources/archive/us/2009.html
/resources/archive/us/2010.html
/resources/archive/us/2011.html
/resources/archive/us/2012.html
/resources/archive/us/2013.html
/resources/archive/us/2014.html
/resources/archive/us/2015.html
/resources/archive/us/2016.html

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'

headlines = pd.DataFrame(columns=["date", "headline"])
for y in years:  # years holds the archive paths listed above
    yurl = "http://www.reuters.com" + str(y)
    response = requests.get(yurl, headers={'User-Agent': USER_AGENT})
    bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')
    # Collect the per-day links that follow each month heading.
    days = []
    links = bs.findAll('h5')
    for mon in links:
        for day in mon.next_sibling.next_sibling:
            days.append(day)
    days = [e for e in days if str(e) not in ('\n')]
    for ind in days:
        hlday = ind['href']
        # Turn the YYYYMMDD part of the link into MM-DD-YYYY.
        date = re.findall(r'(?!\/)[0-9].+(?=\.)', hlday)[0]
        date = date[4:6] + '-' + date[6:] + '-' + date[:4]
        print(date.split('-')[2])
        yurl = "http://www.reuters.com" + str(hlday)
        response = requests.get(yurl, headers={'User-Agent': USER_AGENT})
        if response.status_code == 404 or response.content == b'':
            print('')
        else:
            bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')
            lines = bs.findAll('div', {'class': 'headlineMed'})
            for h in lines:
                # This is the slow part: one DataFrame.append call per headline.
                headlines = headlines.append([{"date": date, "headline": h.text}], ignore_index=True)

1 answer

Anonymous user

#1 · posted 2024-06-16 09:54:26

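You are using this anti-pattern, in which every call to append copies the whole DataFrame into a new one, so each iteration gets slower as headlines grows: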

headlines = pd.DataFrame()
for y in years:
    for ind in days:
        headlines = headlines.append(blah)

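Instead, collect the pieces in a plain list and concatenate them once at the end. This also keeps working on pandas 2.0 and later, where DataFrame.append has been removed: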

headlines = []
for y in years:
    for ind in days:
        headlines.append(pd.DataFrame(blah))

# ignore_index=True renumbers the rows 0..n-1, as the original append call did
headlines = pd.concat(headlines, ignore_index=True)
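
A closely related variant, in case building many one-row DataFrames still feels heavy: append plain dicts to a list and construct the DataFrame a single time. This is only a sketch. The `scraped` list is made-up stand-in data for what the BeautifulSoup loop would yield, not real Reuters output.

```python
import pandas as pd

# Hypothetical stand-in for the scraped results: (date, headline) pairs.
# In the real script these values come from the nested scraping loops.
scraped = [
    ("01-01-2007", "First headline"),
    ("01-01-2007", "Second headline"),
    ("01-02-2007", "Third headline"),
]

rows = []
for date, headline in scraped:
    # Appending to a Python list is cheap; no DataFrame is copied here.
    rows.append({"date": date, "headline": headline})

# Build the DataFrame once, after the loop has finished.
headlines = pd.DataFrame(rows, columns=["date", "headline"])
print(headlines.shape)
```

Either way, the point is the same: the expensive DataFrame construction happens once instead of once per headline.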

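A second potential problem is that you are making 3650 web requests. If I ran a site like that, I would build in throttling to slow down scrapers like yours. You may find it better to collect the raw data once, store it on disk, and then process it in a second pass. That way you don't pay the cost of 3650 web requests every time you need to debug the program.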
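
One way the "download once, process later" idea could look is sketched below. Everything named here (`fetch_cached`, `download`, the `reuters_cache` directory, the one-second delay) is an assumption made for illustration and not part of the original script. The sketch uses urllib from the standard library to stay dependency-free; requests.get would serve equally well as the downloader.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

CACHE_DIR = Path("reuters_cache")  # hypothetical cache directory name


def download(url):
    """Fetch one page over HTTP and return the raw bytes.

    HTTP errors such as 404 raise an exception here; the caller decides
    whether to skip that page.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def fetch_cached(url, cache_dir=CACHE_DIR, downloader=download, delay=1.0):
    """Return the page body, hitting the network only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, unique file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_bytes()
    content = downloader(url)
    path.write_bytes(content)
    time.sleep(delay)  # be polite: pause only after a real request
    return content
```

With something like this in place, the parsing loop would call fetch_cached(yurl) where it currently calls requests.get, so a second run while debugging reads everything from disk.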
