Scraping with Selenium, PhantomJS and BS4
I'm on Windows 10 with Python 3.7, trying to work out how to scrape a set of URLs without opening a Firefox browser window for every one of them. The code below throws an error, which I suspect has to do with how I'm using PhantomJS, but I can't pin down the exact problem.
I've read that pairing PhantomJS with Selenium is one solution. I installed PhantomJS and added it to my system path, and it appears to run, but I'm not sure how to actually use it from my code.
driver = webdriver.PhantomJS(executable_path=r"C:\phantomjs")
This is the line that attempts to launch PhantomJS. The script worked fine before, when it used driver = webdriver.Firefox() instead.
import csv
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from fake_useragent import UserAgent

urls = ["https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=0&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=90&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=180&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=270&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=360&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=450&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=540&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=630&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=720&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=810&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=900&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD",
        "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=990&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD"]
#url = "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=180&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD"
user_agent = UserAgent()

# make the csv file
csv_file = open("gcscrape.csv", "w", newline='')  # newline='' added 5.17.20 to try to stop blank lines from being written
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name", "bass_price"])

for url in urls:
    web_r = requests.get(url)
    web_soup = BeautifulSoup(web_r.text, "html.parser")
    #print(web_soup.findAll("li", class_="product-container"))  # all of the grid items on the url above - price, photo, image, details and all
    #print(len(web_soup.findAll("li", class_="product-container")))  # the number of grid items found

    #driver = webdriver.Firefox()
    driver = webdriver.PhantomJS(executable_path=r"C:\phantomjs")
    driver.get(url)
    html = driver.execute_script("return document.documentElement.outerHTML")  # JavaScript call to get the outer HTML of the rendered page
    sel_soup = BeautifulSoup(html, "html.parser")

    for content in sel_soup.findAll("li", class_="product-container"):
        #print(content)
        bass_name = content.find("div", class_="productTitle").text.strip()  # pulls the bass guitar name
        print(bass_name)

        prices_new = []
        for i in content.find("span", class_="productPrice").text.split("$"):
            prices_new.append(i.strip())
        bp = prices_new[1]
        print(bp)

        # write row to the new csv file
        csv_writer.writerow([bass_name, bp])
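As a side note, the twelve URLs in the list differ only in their Nao offset (0, 90, …, 990, i.e. steps of recsPerPage), so the list can be generated instead of hand-typed; a small sketch:

```python
# All twelve page URLs share one template; only the Nao offset changes,
# in steps of 90 (the recsPerPage value).
base = ("https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page"
        "&N=18171+1076&Nao={offset}&recsPerPage=90&postalCode=02494"
        "&radius=100&profileCountryCode=US&profileCurrencyCode=USD")
urls = [base.format(offset=n) for n in range(0, 1080, 90)]
print(len(urls))  # 12 pages, offsets 0 through 990
```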
1 Answer
Make sure you download the version of PhantomJS that matches your operating system, from the PhantomJS download page.
If you are on Windows, the following line should work, provided the path points at the phantomjs.exe binary itself rather than the folder containing it:
driver = webdriver.PhantomJS(executable_path="C:/phantomjs.exe")
driver.get(url)
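A stdlib-only way to guard against exactly the mix-up in the question (passing the folder C:\phantomjs instead of the binary) is to resolve the path up front; the helper name resolve_phantomjs below is my own invention, not part of Selenium:

```python
import os

def resolve_phantomjs(path):
    """Return the full path to phantomjs.exe.

    Accepts either the binary itself or the folder that contains it,
    and fails loudly if no binary is found at the resulting path.
    """
    if os.path.isdir(path):
        path = os.path.join(path, "phantomjs.exe")
    if not os.path.isfile(path):
        raise FileNotFoundError("phantomjs.exe not found at %r" % path)
    return path
```

With that in place, the question's original call becomes driver = webdriver.PhantomJS(executable_path=resolve_phantomjs(r"C:\phantomjs")) and raises a clear error if the binary is missing.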