My goal is to extract the reviewer's name, location, posting time, the review title, and the full review text from this page: http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061
My code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
firefox_capabilities['binary'] = '/etc/firefox'

driver = webdriver.Firefox(capabilities=firefox_capabilities)
driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
soup = BeautifulSoup(driver.page_source,"lxml")
for link in soup.select(".profile"):
    try:
        profile = link.select("p:nth-of-type(1) a")[0]
        profile1 = link.select("p:nth-of-type(2)")[0]
    except:pass
    print(profile.text,profile1.text)

driver = webdriver.Firefox(capabilities=firefox_capabilities)
driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
soup1 = BeautifulSoup(driver.page_source,"lxml")
for link in soup1.select(".col-10.review"):
    try:
        profile2 = link.select("small:nth-of-type(1)")[0]
        profile3 = link.select("span:nth-of-type(3)")[0]
        profile4 = link.select("a:nth-of-type(1)")[0]
    except:pass
    print(profile2.text,profile3.text,profile4.text)

driver = webdriver.Firefox(capabilities=firefox_capabilities)
driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
soup2 = BeautifulSoup(driver.page_source,"lxml")
for link in soup2.select(".more.review"):
    try:
        containers=page_soup.findAll("div",{"class":"more reviewdata"})
        count=len(containers)
        for index in range(count):
            count1=len(containers[index].p)
            for i in range(count1):
                profile5 = link.select("p:nth-of-type(i)")[0]
    except:pass
    print(profile5.text)
driver.quit()
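The output gives me the name, location, time, and title of each review, but I cannot get a user's full review text. I would be grateful if someone could help me get that, and also help me optimize my code so that it loads the page only once to extract all the required data. It would also be very helpful if someone could show me how to extract all customer reviews of Jio from every page of the site.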
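You can achieve the same result with just a few lines of code and far less pain. Here I have defined three main fields, name, review_title, and review_data; the other fields can easily be swapped in. You could do it like this: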
Or use a combination (Selenium + bs4):
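The answer's original code did not survive on this page, so what follows is only a minimal sketch of the Selenium + bs4 idea, not the answerer's actual code. It loads the page once and parses the same HTML for all three fields. The selectors `.profile`, `.col-10.review`, and `div.more.reviewdata` are taken from the question's code. Several things here are assumptions that may not match the live site: that the i-th `.profile` block belongs to the i-th `.col-10.review` block, that the first link in a review block is its title, and that the full review text is already present in the page source. The helper names `text_of`, `parse_reviews`, and `fetch_rendered_html` are hypothetical.

```python
from bs4 import BeautifulSoup

URL = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"


def text_of(node):
    """Return stripped text for a tag, or an empty string if the tag is missing."""
    return node.get_text(" ", strip=True) if node else ""


def parse_reviews(html):
    """Extract one dict per review from already-rendered page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: profile blocks and review blocks appear in the same order,
    # one of each per review, so they can be paired with zip().
    profiles = soup.select(".profile")
    bodies = soup.select(".col-10.review")
    reviews = []
    for profile, body in zip(profiles, bodies):
        # Join every paragraph of the review body instead of picking one by index.
        paragraphs = body.select("div.more.reviewdata p")
        reviews.append({
            "name": text_of(profile.select_one("p:nth-of-type(1) a")),
            # Assumption: the first link in the review block is the title.
            "review_title": text_of(body.select_one("a:nth-of-type(1)")),
            "review_data": "\n".join(text_of(p) for p in paragraphs),
        })
    return reviews


def fetch_rendered_html(url):
    """Load the page once in a browser and return its rendered HTML."""
    # Imported lazily so parse_reviews can be used without Selenium installed.
    from selenium import webdriver
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


if __name__ == "__main__":
    for review in parse_reviews(fetch_rendered_html(URL)):
        print(review)
```

Keeping the parsing separate from the browser call means the page is fetched a single time and the selectors can be adjusted against saved HTML. If the site only reveals the full text after a "read more" click, the rendered source would need that interaction first, which this sketch does not attempt.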