I am building a web crawler/scraper with Python, Scrapy, Selenium and PhantomJS (for dynamically loaded content). The program searches a website (http://www.pcworld.com/) for some user-supplied input, collects the links to all articles in the search results, and issues one request per article. For each article the code extracts the title, the article text, the URL and the publication date, and saves them to a MySQL database (dbname=pcworld, username=testuser, passwd=test123).
The problem is that after fetching a few articles the program simply stops and does nothing; there is no further output on the command line.
This is the spider:
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
@param: scrapy crawl pcworld_spider -a external_input="EXAMPLEINPUT"
'''
import scrapy
from scrapy.http.request import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
import time
import lxml.html
import json
import MySQLdb
from Thesis.items import ThesisItem

class PcworldSpider(scrapy.Spider):
    name = "pcworld_spider"
    custom_settings = {
        'ITEM_PIPELINES': {
            'Thesis.pipelines.PcworldPipeline': 50
        }
    }
    allowed_domains = ['pcworld.com']
    start_urls = ['http://www.pcworld.com/search?query=']
    global_next_page = ''

    def __init__(self, external_input):
        self.external_input = external_input
        super(PcworldSpider, self).__init__()  # called so that external_input is passed on to the pipeline
        self.start_urls[0] = self.start_urls[0] + external_input
        self.global_next_page = self.start_urls[0]
        # Open database connection
        self.db = MySQLdb.connect("localhost", "testuser", "test123", "pcworld")
        # prepare a cursor object using cursor() method
        self.cursor = self.db.cursor()
        # Drop the table if it already exists (whitespace stripped because of MySQL table-name rules)
        self.cursor.execute("DROP TABLE IF EXISTS %s" % external_input.replace(" ", ""))
        # Create the table as per requirement (whitespace stripped for MySQL)
        sql = """CREATE TABLE """ + external_input.replace(" ", "") + """ (
                 Ueberschrift varchar(1500),
                 Article text,
                 Datum date,
                 Original_URL varchar(1500))"""
        try:
            # Execute the SQL command
            self.cursor.execute(sql)
            # Commit your changes in the database
            self.db.commit()
        except:
            print(" ")
            print("IM ROLLING BACK in PARSE")
            print(" ")
            # Rollback in case there is any error
            self.db.rollback()
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)
        #self.driver = webdriver.Chrome("C:\Users\Daniel\Desktop\Sonstiges\chromedriver.exe")
        self.driver.wait = WebDriverWait(self.driver, 4)  # waits up to 4 seconds

    def parse(self, response):
        print("\n1\n")
        self.driver.get(self.global_next_page)
        print("\n2\n")
        # wait up to 4 seconds (configured in __init__()) for the condition; after that a TimeoutException is raised
        try:
            self.driver.wait.until(EC.presence_of_element_located(
                (By.CLASS_NAME, "excerpt-text")))
            print("Found : excerpt-text")
        except TimeoutException:
            #self.driver.close()
            print(" excerpt-text NOT FOUND IN PCWORLD !!!")
        print("\n3\n")
        # crawl JavaScript-generated content with Selenium
        ahref = self.driver.find_elements(By.XPATH, '//div[@class="excerpt-text"]/h3/a')
        print("\n4\n")
        hreflist = []
        # collect the links to the individual articles
        for elem in ahref:
            hreflist.append(elem.get_attribute("href"))
        print("\n5\n")
        for elem in hreflist:
            print(elem)
            #self.driver.implicitly_wait(2)
            yield scrapy.Request(url=elem, callback=self.parse_content)
        print("\n6\n")
        # fetch the link to the next page
        try:
            if self.driver.find_elements(By.XPATH, "//a[@rel='next']"):
                #print("next page exists")
                next = self.driver.find_element(By.XPATH, "//a[@rel='next']")
                self.global_next_page = next.get_attribute("href")
                yield scrapy.Request(url=self.global_next_page, callback=self.parse, dont_filter=True)  # this is the actual next-page request!
                print(" ")
            else:
                print("there is no next page!")
        except TimeoutException:
            print("TIMEOUTEXCEPTION WHILE SEARCHING FOR NEXT")
            #self.driver.close()
        print("\n7\n")

    def parse_content(self, response):
        print("\n8\n")
        self.driver.get(response.url)
        print("\n9\n")
        title = self.driver.find_element(By.XPATH, "//h1[@itemprop='headline']")
        titletext = title.get_attribute("innerHTML")
        titletext = titletext.replace('\n', ' ').replace('\r', '')  # strip newlines/carriage returns, otherwise lxml throws an error
        titletext = lxml.html.fromstring(titletext).text_content()
        date = self.driver.find_element(By.XPATH, "//meta[@name='date']")
        date_text = date.get_attribute("content")
        article = self.driver.find_elements(By.XPATH, "//div[contains(@itemprop, 'articleBody')]//p")
        article_list = []
        print("\n10\n")
        for elem in article:
            print(" ")
            elem_text = elem.get_attribute("innerHTML")
            elem_text = elem_text.replace('\n', ' ').replace('\r', '')  # strip newlines/carriage returns, otherwise lxml throws an error
            #print(elem_text.encode("utf-8"))
            article_list.append(elem_text)
        article_text = ' '.join(article_list)  # join the article list into a single string
        article_text = lxml.html.fromstring(article_text).text_content()
        pcworld_data = ThesisItem()
        pcworld_data['Ueberschrift'] = titletext
        pcworld_data['Article'] = article_text
        pcworld_data['Datum'] = date_text
        pcworld_data['Original_URL'] = response.url
        print("\n11\n")
        # return the item -> next thing called is the pipeline (PcworldPipeline)
        return pcworld_data
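For completeness: the spider never releases the PhantomJS driver or the database connection. A minimal cleanup hook that could be added to the spider class (not in my code above), using Scrapy's documented closed() shortcut:

    def closed(self, reason):
        # called once by Scrapy when the spider finishes or is interrupted
        self.driver.quit()  # terminates the PhantomJS process
        self.db.close()     # closes the MySQL connection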
The pipeline code:
^{pr2}$
My items:
import scrapy

class ThesisItem(scrapy.Item):
    Ueberschrift = scrapy.Field()
    Article = scrapy.Field()
    Datum = scrapy.Field()
    Original_URL = scrapy.Field()
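The pipeline code itself got replaced by the ^{pr2}$ placeholder above. For reference, here is a minimal sketch of a pipeline that inserts the items into the per-keyword table; the method bodies are assumptions for illustration, not the original code:

import MySQLdb

class PcworldPipeline(object):
    def open_spider(self, spider):
        # reuse the connection/cursor the spider opened in __init__ (assumption)
        self.db = spider.db
        self.cursor = spider.cursor
        self.table = spider.external_input.replace(" ", "")

    def process_item(self, item, spider):
        try:
            self.cursor.execute(
                "INSERT INTO " + self.table +
                " (Ueberschrift, Article, Datum, Original_URL) VALUES (%s, %s, %s, %s)",
                (item['Ueberschrift'], item['Article'], item['Datum'], item['Original_URL']))
            self.db.commit()
        except MySQLdb.Error:
            self.db.rollback()
        return item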
Changes I made in the settings:
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 200
CONCURRENT_ITEMS = 400
DOWNLOAD_DELAY = 4
COOKIES_ENABLED = False
What I have observed:
The program always stops right after the call to self.driver.get(response.url) in parse_content (see the sketch below).
The database is not the culprit (I tried removing all the DB code and writing the items to a JSON file instead).
I have similar code for other websites; on some of them it runs fine, but here it gets stuck.
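Since the hang happens inside the blocking self.driver.get() call, one way to at least turn the silent freeze into a visible error is to bound page loads with a timeout. A sketch (not in my code above; the 30-second value is an arbitrary assumption):

# in __init__, after creating the driver:
self.driver.set_page_load_timeout(30)  # assumption: 30 s is enough per page load

# in parse_content, guard the blocking call:
try:
    self.driver.get(response.url)
except TimeoutException:
    self.logger.warning("page load timed out: %s", response.url)
    return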
Any suggestions why the code stops after a few (4-10) article requests?
Output (the last lines before the hang):
6
2017-07-25 09:52:57 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:60768/wd/hub/session/31152680-710e-11e7-bb53-730b5237f77b/elements {"using": "xpath", "sessionId": "31152680-710e-11e7-bb53-730b5237f77b", "value": "//a[@rel='next']"}
2017-07-25 09:52:57 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-07-25 09:52:57 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:60768/wd/hub/session/31152680-710e-11e7-bb53-730b5237f77b/element {"using": "xpath", "sessionId": "31152680-710e-11e7-bb53-730b5237f77b", "value": "//a[@rel='next']"}
2017-07-25 09:52:57 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-07-25 09:52:57 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:60768/wd/hub/session/31152680-710e-11e7-bb53-730b5237f77b/element/:wdc:1500969177413/attribute/href {"sessionId": "31152680-710e-11e7-bb53-730b5237f77b", "name": "href", "id": ":wdc:1500969177413"}
2017-07-25 09:52:57 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
7
2017-07-25 09:53:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.pcworld.com/article/2142565/report-nsa-secretly-exploited-devastating-heartbleed-bug-for-years.html> (referer: http://www.pcworld.com/search?query=heartbleed&start=10)
8
2017-07-25 09:53:01 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:60768/wd/hub/session/31152680-710e-11e7-bb53-730b5237f77b/url {"url": "http://www.pcworld.com/article/2142565/report-nsa-secretly-exploited-devastating-heartbleed-bug-for-years.html", "sessionId": "31152680-710e-11e7-bb53-730b5237f77b"}
Another test I ran, output: https://pastebin.com/XAky7YJP (here it stops after the very first request).
I am using Python 2.7 and Scrapy 1.4.0.