Scraping to a text file, but nothing is written unless it is copied. Python 3.7, ChromeDriver, BS4

Posted 2024-05-16 20:04:10

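This code was working 4-5 hours ago; now it copies the content I want it to write to the file. The most obvious things I tried were commenting out either the file.write line or the print line below it, and the result was a blank text file. I also tried various mode arguments on the open line, such as a+, a, w, and w+, while commenting out one of the two lines mentioned above, but the file is still blank. Hopefully someone can spot where I went wrong and help me fix it.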

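My other question is how to navigate to the next chapter after copying the current one, though I will open a new question for that if I have to. Also, if you have any suggestions for improving the code, let me know (apart from the missing defs, which I will add later once the script is finished).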

Here is the code so far:

#! python3
import requests
import bs4 as BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options

def Close():
    driver.stop_client()
    driver.close()
    driver.quit()

# Raw string so the backslashes in the Windows path are not treated as escapes
CHROMEDRIVER_PATH = r'E:\Downloads\chromedriver_win32\chromedriver.exe'

# start raw html
NovelName = 'Novel/Isekai-Maou-to-Shoukan-Shoujo-Dorei-Majutsu'
BaseURL = 'https://novelplanet.com/'
url = BaseURL + NovelName  # BaseURL already ends with '/', so no extra separator

options = Options()
options.add_experimental_option("excludeSwitches",["ignore-certificate-errors"])
options.add_argument("--headless") # Runs Chrome in headless mode.
options.add_argument('--no-sandbox') # Bypass OS security model
options.add_argument('--disable-gpu')  # applicable to windows os only
options.add_argument('start-maximized') # 
options.add_argument('disable-infobars')
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)
driver.get(url)

# wait until the title is no longer "Please wait 5 seconds..."
wait = WebDriverWait(driver, 10)
wait.until(lambda driver: driver.title != "Please wait 5 seconds...")

soup = BeautifulSoup.BeautifulSoup(driver.page_source, 'html.parser')
# End raw html

# Start get first chapter html coded
# Take the last rowChapter entry, which is what the original loop ended on
chapterLinks = soup.find_all(class_='rowChapter')
cLink = chapterLinks[-1].find('a').contents[0].strip()
print(driver.title)
# end get first chapter html coded

# start navigate to first chapter
link = driver.find_element_by_link_text(cLink)
link.click()
# end navigate to first chapter

# start copy of chapter and add to a file
wait = WebDriverWait(driver, 10)
wait.until(lambda driver: driver.title != "Please wait 5 seconds...")
print(driver.title)
soup = BeautifulSoup.BeautifulSoup(driver.page_source, 'html.parser')
readables = soup.find(id='divReadContent')
text = readables.text.strip().replace('○','0').replace('×','x').replace('《',' <<').replace('》','>> ').replace('「','"').replace('」','"')
name = driver.title
file_name = (name.replace('Read ',"").replace(' - NovelPlanet',"")+'.txt')
print(file_name)

with open(file_name,'a+') as file:
    print(text,file=file)

lastURL = driver.current_url.replace('https://novelplanet.com','')
# end copy of chapter and add to a file

# start goto next chapter if exists then return to copy chapter else Close()

# end goto next chapter if exists then return to copy chapter else Close()

Close()
#EOF
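
A minimal sketch of the write step, assuming (the question does not confirm this) that the blank file comes from the platform's default text encoding being unable to represent some characters in the chapter. `save_chapter` is a hypothetical helper name, not part of the script above. It opens the file with an explicit `encoding="utf-8"` and returns the resulting file size, so an empty result is easy to spot.

```python
import os


def save_chapter(file_name, text):
    """Write text to file_name as UTF-8 and return the file size in bytes.

    Hypothetical helper for illustration; the name and the return value
    are assumptions, not part of the original script.
    """
    # An explicit encoding avoids depending on the platform default codec,
    # which may not cover characters such as the Japanese quotation marks.
    with open(file_name, "w", encoding="utf-8") as fh:
        fh.write(text + "\n")
    return os.path.getsize(file_name)
```

If the returned size is greater than zero but the file still looks empty, the cause is likely elsewhere, for example the file being written to a different working directory than expected.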

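Edit: Changed the code above to use the suggestions below. Since the documentation does not mention this, it took me about an hour to realize that modifiers could be used, which is also why I strayed from the easy path.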

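Now to figure out how to move between pages. There are 6 <div class="4u 12u(small)"> elements. The 2nd and 5th are combo/select boxes, which I doubt are easy to use. The 1st and 4th are the previous-chapter buttons, and the 3rd and 6th are the next-chapter buttons. When a previous or next button has nowhere to go, its element reads <div class="4u 12u(small)">&nbsp;</div>. Does anyone know how to pick the 3rd or 6th button out of all 6, and how to stop the program once the end is reached?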
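
One way the selection described above could be sketched with BeautifulSoup. This rests on assumptions the question does not confirm: each previous/next button is taken to be an `<a href="...">` inside its div, and `next_chapter_href` is a hypothetical helper name. The function picks the 3rd of the six divs and returns `None` when that div holds no link, which is the `&nbsp;` case.

```python
from bs4 import BeautifulSoup


def next_chapter_href(page_source):
    """Return the href of the next-chapter link, or None if there is none.

    Assumes the button is an <a> tag inside the 3rd navigation div.
    """
    soup = BeautifulSoup(page_source, "html.parser")

    # bs4 splits the class attribute on whitespace, so compare the list form.
    def is_nav_cell(tag):
        return tag.name == "div" and tag.get("class") == ["4u", "12u(small)"]

    cells = soup.find_all(is_nav_cell)
    if len(cells) < 3:
        return None  # page layout differs from the one described

    link = cells[2].find("a")  # 3rd cell: "next" in the first navigation bar
    return link.get("href") if link is not None else None
```

In the script this might be called as `next_chapter_href(driver.page_source)`. A result of `None` would be the signal to call `Close()`; otherwise the href could be passed to `driver.get` after joining it with the site root, assuming the hrefs are site-relative.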

