I have made a few attempts at getting my code to navigate to a web page, pull the data from a table into a dataframe, then move on to the next page and do the same thing again. Below is some sample code I have been testing. I am stuck now and don't know how to proceed.
# first attempt
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from time import sleep

lst = []
url = "https://www.nasdaq.com/market-activity/stocks/screener"

for numb in (1, 10):
    url = "https://www.nasdaq.com/market-activity/stocks/screener"
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all('table')
    df = pd.DataFrame(table)
    lst.append(df)

    def get_cpf():
        driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
        driver.get(url)
        driver.find_element_by_class('pagination__page" data-page="'' + numb + ''').click()
        sleep(10)
        text = driver.find_element_by_id('texto_cpf').text
        print(text)

    get_cpf()
    get_cpf.click
### second attempt
#import BeautifulSoup
from bs4 import BeautifulSoup
import pandas as pd
import requests
from selenium import webdriver
from time import sleep

lst = []

for numb in (1, 10):
    r = requests.get('https://www.nasdaq.com/market-activity/stocks/screener')
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find("table", {"class": "nasdaq-screener__table"})
    for row in table.findAll("tr"):
        for cell in row("td"):
            data = cell.get_text().strip()
    df = pd.DataFrame(data)
    lst.append(df)

    def get_cpf():
        driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
        driver.get(url)
        driver.find_element_by_class('pagination__page" data-page="'' + numb + ''').click()
        sleep(10)
        text = driver.find_element_by_id('texto_cpf').text
        print(text)

    get_cpf()
    get_cpf.click
### third attempt
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time
import requests
import pandas as pd

lst = []
url = "https://www.nasdaq.com/market-activity/stocks/screener"

driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#_evh-ric-c"))).click()

for pages in range(1, 9):
    try:
        print(pages)
        r = requests.get(url)
        html = r.text
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find_all('table')
        df = pd.DataFrame(table)
        lst.append(df)
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.pagination__next"))).click()
        time.sleep(1)
    except:
        break
Here is a screenshot of the HTML behind the table I am trying to scrape.

So on the first page, I want to scrape everything from:

AAPL Apple Inc. Common Stock $127.79 6.53 5.385% 2,215,538,678,600

to:

ASML ASML Holding N.V. New York Registry Shares $583.55 16.46 2.903% 243,056,764,541

Then move to page 2 and do the same, then page 3, and so on. I'm not sure whether this is doable with BeautifulSoup alone, or whether I need Selenium for the button-click events. I'm happy to do whatever is simplest here. Thanks.
This answer won't go through the API, since Nuran only wants to stick with what was asked.
Below is an example that walks through the first 10 pages. First we dismiss the notification, then wait for the "Next" button to become interactable and click it.
Imports:

You could do it like this:
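A sketch of that approach, combining the Selenium navigation from the question's third attempt with a BeautifulSoup/pandas parse of each rendered page. The selectors (`#_evh-ric-c` for the notice, `button.pagination__next`, and the `nasdaq-screener__table` class) are taken from the question itself; treat them as assumptions that may need updating against the live page:

```python
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.nasdaq.com/market-activity/stocks/screener"


def scrape_current_page(page_source):
    """Parse the screener table out of rendered page HTML into a DataFrame."""
    soup = BeautifulSoup(page_source, "html.parser")
    table = soup.find("table", {"class": "nasdaq-screener__table"})
    rows = []
    for tr in table.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    # First row is the header, the rest are data rows
    return pd.DataFrame(rows[1:], columns=rows[0])


def scrape_pages(n_pages=10):
    # Selenium is imported lazily so the parser above works without it
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    wait = WebDriverWait(driver, 10)
    driver.get(URL)
    # Dismiss the notification overlay before anything else
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#_evh-ric-c"))).click()
    frames = []
    for _ in range(n_pages):
        frames.append(scrape_current_page(driver.page_source))
        # Wait until "Next" is interactable, then click through to the next page
        wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "button.pagination__next"))
        ).click()
    driver.quit()
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    print(scrape_pages(10))
```

Note the key difference from the question's attempts: the page source is taken from the driver (`driver.page_source`) after each click, not from a fresh `requests.get`, which would always fetch page 1.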
Please note that you don't need to use selenium for a task like this, since it slows your process down. In real-world scenarios we only use selenium to bypass browser detection, then pass the cookies on to any HTTP module to continue the operation. Regarding your task: I noticed there is an API that actually feeds the HTML source. Here's a quick call:
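A sketch of that quick call, assuming the screener's backing endpoint is `https://api.nasdaq.com/api/screener/stocks` with `limit`/`offset` query parameters and a `data.table.rows` JSON path (all observed from the page's network traffic, so treat them as assumptions):

```python
import requests
import pandas as pd

API = "https://api.nasdaq.com/api/screener/stocks"  # assumed endpoint


def build_params(limit=300, offset=0):
    """Query parameters: one large limit can replace page-by-page calls."""
    return {"limit": limit, "offset": offset}


def fetch(limit=300, offset=0):
    # The API tends to reject requests without a browser-like User-Agent
    headers = {"User-Agent": "Mozilla/5.0"}
    r = requests.get(API, params=build_params(limit, offset),
                     headers=headers, timeout=30)
    r.raise_for_status()
    # JSON path assumed from an observed response
    rows = r.json()["data"]["table"]["rows"]
    return pd.DataFrame(rows)


if __name__ == "__main__":
    print(fetch(limit=300))
```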
You don't need to loop over pages here, because you can pull everything in one request by increasing the limit. But if you do want to use a for loop, you would loop over the following while maintaining an offset:
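A sketch of the offset loop, under the same assumed endpoint and JSON path as above: each page is `limit` rows, so page *i* starts at offset `i * limit`.

```python
import requests
import pandas as pd

API = "https://api.nasdaq.com/api/screener/stocks"  # assumed endpoint


def page_offsets(n_pages, limit=25):
    """Offsets for consecutive pages of `limit` rows each."""
    return [page * limit for page in range(n_pages)]


def fetch_all(n_pages=10, limit=25):
    headers = {"User-Agent": "Mozilla/5.0"}
    frames = []
    for offset in page_offsets(n_pages, limit):
        r = requests.get(API, params={"limit": limit, "offset": offset},
                         headers=headers, timeout=30)
        r.raise_for_status()
        # JSON path assumed from an observed response
        frames.append(pd.DataFrame(r.json()["data"]["table"]["rows"]))
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    print(fetch_all(n_pages=10, limit=25))
```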