Web scraping with Selenium

Posted 2024-06-16 09:29:44


My goal is to scrape each review's author name, location, posting time, title, and full review text from this page (http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061).

My code:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    firefox_capabilities = DesiredCapabilities.FIREFOX
    firefox_capabilities['marionette'] = True
    firefox_capabilities['binary'] = '/etc/firefox'

    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
    soup = BeautifulSoup(driver.page_source,"lxml")
    for link in soup.select(".profile"):
        try:
            profile = link.select("p:nth-of-type(1) a")[0]
            profile1 = link.select("p:nth-of-type(2)")[0]
        except:pass
        print(profile.text,profile1.text)
    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
    soup1 = BeautifulSoup(driver.page_source,"lxml")
    for link in soup1.select(".col-10.review"):
        try:
            profile2 = link.select("small:nth-of-type(1)")[0]
            profile3 = link.select("span:nth-of-type(3)")[0]
            profile4 = link.select("a:nth-of-type(1)")[0]
        except:pass
        print(profile2.text,profile3.text,profile4.text)
    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
    soup2 = BeautifulSoup(driver.page_source,"lxml")
    for link in soup2.select(".more.review"):
        try:
            containers=page_soup.findAll("div",{"class":"more reviewdata"})
            count=len(containers)
            for index in range(count):
                count1=len(containers[index].p)
                for i in range(count1):
                    profile5 = link.select("p:nth-of-type(i)")[0]
        except:pass
        print(profile5.text)
    driver.quit()

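I get the name, location, time, and title of each review, but I cannot get a user's full review text. I would appreciate help getting that output, and also help optimizing my code (that is, I want it to load the page only once and extract everything it needs). It would also be very helpful if someone could show me how to extract all of Jio's customer reviews from every page of the site.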


1 answer
User
Reply #1 · posted 2024-06-16 09:29:44

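You can get the same result in just a few lines, and with far less pain. I have defined three main fields here (name, review_title, and review_data); the other fields can be swapped in just as easily.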

You could do it like this:

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061")
    wait = WebDriverWait(driver, 10)

    # Wait until the review blocks are present, then handle them one by one.
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".review-article"))):
        # Click the link inside the review body so the full text is shown.
        # find_element(By..., ...) replaces the find_element_by_* helpers,
        # which were removed in Selenium 4.
        link = item.find_element(By.CSS_SELECTOR, ".reviewdata a")
        link.click()
        time.sleep(2)  # give the expanded text time to load

        name = item.find_element(By.CSS_SELECTOR, "p a").text
        review_title = item.find_element(By.CSS_SELECTOR, "strong a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews]").text
        # Collapse whitespace inside each review block, then join the blocks.
        review_data = ' '.join(' '.join(block.text.split()) for block in item.find_elements(By.CSS_SELECTOR, ".reviewdata"))
        print("Name: {}\nReview_Title: {}\nReview_Data: {}\n".format(name, review_title, review_data))

    driver.quit()
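The loop above covers a single page, while the question also asks for the reviews on every page. One way to extend it, shown here only as a rough sketch, is to keep clicking a "next page" link until none is left. The helper name scrape_all_pages, its parameters, and the idea that the site exposes a clickable next-page link are all assumptions; the real locator has to be read from the page markup.

```python
import time


def scrape_all_pages(driver, scrape_page, next_locator, max_pages=50, pause=2.0):
    """Run scrape_page(driver) on each page, following the 'next' link.

    scrape_page  -- callable returning a list of records for the current page
    next_locator -- (by, value) tuple for the next-page link; hypothetical,
                    it must be taken from the real page markup
    max_pages    -- safety limit so a broken locator cannot loop forever
    pause        -- seconds to sleep after each click (0 disables the pause)
    """
    results = []
    for _ in range(max_pages):
        results.extend(scrape_page(driver))
        # find_elements returns an empty list when nothing matches,
        # which is the signal that the last page has been reached.
        next_links = driver.find_elements(*next_locator)
        if not next_links:
            break
        next_links[0].click()
        time.sleep(pause)
    return results
```

With a real driver the call could look like scrape_all_pages(driver, scrape_page, (By.CSS_SELECTOR, "li.next a")), where scrape_page wraps the body of the loop above and "li.next a" is a placeholder selector, not one verified against the site. A fixed pause mirrors the time.sleep(2) used above; an explicit WebDriverWait would be more robust.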

Or combine the two (Selenium + bs4):

[code block missing from the source]
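A minimal sketch of what the combined approach might look like; this is not the answerer's original code. It assumes Selenium is used only to load the page once and expand the reviews, after which BeautifulSoup parses driver.page_source. The class names .review-article and .reviewdata are the ones used in the Selenium example; the function names, the simplified "strong a" title selector, and the result layout are assumptions.

```python
import time


def normalize(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def parse_reviews(html):
    """Extract name, title and review text from each review block."""
    # Imported here so the helpers load even where bs4 is not installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for item in soup.select(".review-article"):
        name = item.select_one("p a")
        title = item.select_one("strong a")  # simplified selector (assumption)
        body = " ".join(
            normalize(block.get_text(" ")) for block in item.select(".reviewdata")
        )
        reviews.append({
            "name": normalize(name.get_text()) if name else "",
            "title": normalize(title.get_text()) if title else "",
            "review": body,
        })
    return reviews


def fetch_expanded_html(url, pause=2.0):
    """Load the page once, expand each review, and return the page source."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Click the link in each review body so the full text is rendered.
        for link in driver.find_elements(By.CSS_SELECTOR, ".review-article .reviewdata a"):
            link.click()
            time.sleep(pause)
        return driver.page_source
    finally:
        driver.quit()
```

Calling parse_reviews(fetch_expanded_html(url)) would then return one dictionary per review after a single page load. Keeping the parsing separate from the browser step also lets parse_reviews be tried on saved HTML without starting a browser.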
