I'm having trouble scraping a website: my script only extracts up to aria-rowindex 29, but I need it to reach aria-rowindex 2509
Here is my code. As you can see, I'm using Playwright and selectolax to scrape data from the site. Whenever I run this script, it extracts data from the site's table up to row 29 and then execution stops. No error is raised, but I need the script to keep going all the way to row 2509.
from playwright.sync_api import sync_playwright
from selectolax.parser import HTMLParser
import time
import pandas as pd

def extract_full_body_html(url):
    TIMEOUT = 30000  # Reduced timeout to prevent long waits
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Maximize the window
        page.set_viewport_size({'width': 1920, 'height': 1080})
        page.goto(url, wait_until='networkidle')
        # Wait for the initial dynamic content to load
        page.wait_for_selector('div[role="gridcell"]', timeout=TIMEOUT)  # Adjusted selector

        # Scroll down and periodically check for new content
        def load_more_content():
            last_row_index = 0
            while True:
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                time.sleep(10)  # Wait for the page to load more content
                # Check for new elements based on the aria-rowindex attribute
                new_last_row_index = int(page.evaluate('''() => {
                    const rows = document.querySelectorAll('div[role="gridcell"][aria-rowindex]');
                    return rows[rows.length - 1].getAttribute("aria-rowindex");
                }'''))
                if new_last_row_index <= last_row_index:
                    break  # No new data loaded, stop the process
                last_row_index = new_last_row_index
                # Small delay to ensure all data is loaded for the new rows
                time.sleep(2)

        load_more_content()
        return page.inner_html('body')

def extraction(html):
    tree = HTMLParser(html)
    data = []
    # Adjust the range if you expect more or fewer rows
    for i in range(1, 2510):  # Extract data up to aria row index 2509
        row_selector = f'div[role="gridcell"][aria-rowindex="{i}"]'
        company_div = tree.css_first(f'{row_selector}[aria-colindex="1"]')
        if company_div is None:
            break  # Exit if no more rows are found
        # Extracting data for each column in the row
        row_data = {
            'Company': company_div.text(deep=True, separator=' '),
            'Emails': tree.css_first(f'{row_selector}[aria-colindex="2"]').text(deep=True, separator=' '),
            'Addresses': tree.css_first(f'{row_selector}[aria-colindex="3"]').text(deep=True, separator=' '),
            'Urls': tree.css_first(f'{row_selector}[aria-colindex="4"]').text(deep=True, separator=' '),
            'Description': tree.css_first(f'{row_selector}[aria-colindex="5"]').text(deep=True, separator=' '),
            'Stage': tree.css_first(f'{row_selector}[aria-colindex="6"]').text(deep=True, separator=' '),
            'Number of Portfolio Organizations': tree.css_first(f'{row_selector}[aria-colindex="7"]').text(deep=True, separator=' '),
            'Number of Investments': tree.css_first(f'{row_selector}[aria-colindex="8"]').text(deep=True, separator=' '),
            'Accelerator Duration (in weeks)': tree.css_first(f'{row_selector}[aria-colindex="9"]').text(deep=True, separator=' '),
            'Number of Exits': tree.css_first(f'{row_selector}[aria-colindex="10"]').text(deep=True, separator=' '),
            'Linkedin': tree.css_first(f'{row_selector}[aria-colindex="11"]').text(deep=True, separator=' '),
            'Founders': tree.css_first(f'{row_selector}[aria-colindex="12"]').text(deep=True, separator=' '),
            'Twitter': tree.css_first(f'{row_selector}[aria-colindex="13"]').text(deep=True, separator=' ')
        }
        data.append(row_data)
    return data

if __name__ == '__main__':
    url = 'https://app.folk.app/shared/All-accelerators-rw0kuUNqtzl6j6dDQquoZTYF6MFKIQHo'
    html = extract_full_body_html(url)
    data = extraction(html)
    df = pd.DataFrame(data)
    df.to_excel('output.xlsx', index=False)
My suspicion is that the page's HTML content never fully loads, or that it stops being loaded/rendered while the script keeps running, so the remaining rows can't be scraped.
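One way to check this is to compare how many rows actually exist in the DOM with the row count the grid declares. This is only a minimal diagnostic sketch; it assumes the container exposes role="grid" with an aria-rowcount attribute, which I have not verified for this page:

# Diagnostic sketch: run after the initial wait_for_selector.
# Assumes a role="grid" container carrying an aria-rowcount attribute
# (an assumption about this page, not something verified).
counts = page.evaluate('''() => {
    const rows = document.querySelectorAll('div[role="row"]');
    const grid = document.querySelector('div[role="grid"]');
    return {
        rendered: rows.length,
        declared: grid ? grid.getAttribute('aria-rowcount') : null,
    };
}''')
print(counts)  # a rendered count far below the declared count means the grid virtualizes its rows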
1 Answer
I think this is roughly what you're looking for:
import time
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto('https://app.folk.app/shared/All-accelerators-rw0kuUNqtzl6j6dDQquoZTYF6MFKIQHo')

    # Click inside the table (otherwise we cannot scroll it)
    page.locator("//div[@data-testid='contact-table']").click()

    # Scroll to the end of the page
    for i in range(5):  # make the range as long as needed
        page.mouse.wheel(0, 150000)
        time.sleep(1)

    # Get the aria-rowindex of the last row of the table
    num_rows = page.locator("//div[@role='row'][last()]").get_attribute('aria-rowindex')
    print(num_rows)

    # Scroll back up to the top of the page
    for i in range(5):  # make the range as long as needed
        page.mouse.wheel(0, -150000)
        time.sleep(1)

    # Iterate over every row, using the row count we read above
    for i in range(1, int(num_rows) + 1):
        page.locator(f"//div[@class='c-klyBnI c-klyBnI-inIPuL-css']/div[@aria-rowindex='{i}']").scroll_into_view_if_needed()
        company = page.locator(f"//div[@class='c-klyBnI c-klyBnI-inIPuL-css']/div[@aria-rowindex='{i}']//span[2]").inner_text()
        email = page.locator(f"//div[@role='row' and @aria-rowindex='{i}']//div[@aria-colindex='2']/span").inner_text()
        print(f"{i} - {company} - {email}")
I added some comments to the code to explain what it is doing.
Basically, as you said, the page is loaded via JavaScript, so I think the key is to first get the index of the last row, then scroll row by row until all the data has been captured.
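If range(5) turns out not to be enough, you can also keep scrolling until the last aria-rowindex stops growing instead of using a fixed count. A rough sketch reusing the same row locator as above; the stall threshold of 3 and the 1-second sleep are arbitrary values you may need to tune:

# Sketch: scroll until the last visible aria-rowindex stops increasing.
last_seen = 0
stalled = 0
while stalled < 3:  # stop after 3 scrolls with no new rows (tune as needed)
    page.mouse.wheel(0, 150000)
    time.sleep(1)
    current = int(page.locator("//div[@role='row'][last()]").get_attribute('aria-rowindex'))
    if current > last_seen:
        last_seen = current
        stalled = 0
    else:
        stalled += 1
print(f"last row loaded: {last_seen}")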
I only extracted a couple of columns, but I think it should be straightforward for you to grab the others.
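For example, if you want all 13 columns plus the Excel export from your original script, the row loop could be extended along these lines. It is only a sketch: it assumes the aria-colindex values 1-13 still map to the same columns as in your selectolax version, and that each cell's text is readable with inner_text():

import pandas as pd

# Column names in aria-colindex order (1-13), copied from the original script.
COLUMNS = ['Company', 'Emails', 'Addresses', 'Urls', 'Description', 'Stage',
           'Number of Portfolio Organizations', 'Number of Investments',
           'Accelerator Duration (in weeks)', 'Number of Exits',
           'Linkedin', 'Founders', 'Twitter']

data = []
for i in range(1, int(num_rows) + 1):
    row_xpath = f"//div[@role='row' and @aria-rowindex='{i}']"
    # Bring the virtualized row into the DOM before reading its cells
    page.locator(row_xpath).scroll_into_view_if_needed()
    row = {}
    for col, name in enumerate(COLUMNS, start=1):
        cell = page.locator(f"{row_xpath}//div[@aria-colindex='{col}']").first
        row[name] = cell.inner_text() if cell.count() else ''  # missing cells become ''
    data.append(row)

pd.DataFrame(data).to_excel('output.xlsx', index=False)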
Good luck!