I have a short program intended to test all the links in my application and make sure they work. I have reached the stage where all the links in the application are extracted into a dictionary.

When I run the program once, it seems to work: the contents of the dictionary are what I expect.

However, when I add a while loop that checks the dictionary's values to see whether any value is 'False', everything falls apart. The dictionary's contents come out different, and entries that were correct before no longer seem to be.

I tried returning the dictionary from crawl and copying it back into the dictionary that acts as the source of truth, but I still get the problem.

When I then try to check a dictionary entry's value to see whether it is 'False', an error is thrown:
from copy import copy

import requests
from bs4 import BeautifulSoup


class ScrapeLinks:
    # base_path, cookie, and login() are defined elsewhere in the file
    unique_urls = set()
    visited_urls = {}

    def get_a_hrefs(self, response):
        content = response.content
        soup = BeautifulSoup(content, 'lxml')
        # Get the csrf
        locator = 'input[name="_csrf"]'
        links = 'a'
        csrf_html = soup.select_one(locator)
        csrf_value = csrf_html.attrs['value']
        ScrapeLinks.csrf_token = csrf_value
        for items in soup.find_all(links):
            if items.has_attr('href'):
                if items.attrs['href'].startswith('#') or not items.attrs['href'].startswith('/'):
                    pass
                else:
                    if '#' in items.attrs['href']:
                        sep = '#'
                        theUrl = items.attrs['href'].split(sep, 1)[0]
                        print(f'the split url is: {theUrl}')
                        ScrapeLinks.visited_urls[theUrl] = 'False'
                    ScrapeLinks.visited_urls[items.attrs['href']] = 'False'

    def split(self, key, value, sep):
        if '#' in key:
            print('in splitter')
            sep = '#'
            theUrl = key.split(sep, 1)[0]
            print(f'the split url is: {theUrl}')
            return theUrl

    def crawl(self):
        tempdict = {}
        dict_copy = copy(ScrapeLinks.visited_urls)
        print(dict_copy)
        for key, value in dict_copy.items():
            theUrl = self.split(key, value, '#')
            tempdict[key] = theUrl
            if 'False' in value:
                r = requests.get(ScrapeLinks.base_path + f'{key}', headers={
                    'content-type': 'application/x-www-form-urlencoded',
                    'X-Csrf-Token': f'{ScrapeLinks.csrf_token}',
                    'Cookie': f'SID/newtshirt={ScrapeLinks.cookie}'
                })
                tempdict[key] = 'True'
                print(r.status_code)
                print("**************************************")
                content = r.content
                soup = BeautifulSoup(content, 'lxml')
                links = 'a'
                for items in soup.find_all(links):
                    if items.has_attr('href'):
                        print(items.attrs["href"])
                        if items.attrs['href'] not in dict_copy:
                            if items.attrs['href'].startswith('#') or not items.attrs['href'].startswith('/'):
                                print(f'Skipping: {items.attrs["href"]}')
                                continue
                            print(f'Adding the link: {items.attrs["href"]}')
                            ScrapeLinks.unique_urls.add(items.attrs['href'])
                        else:
                            print('already in dictionary - do nothing')
        for i in ScrapeLinks.unique_urls:
            tempdict.update({i: 'False'})
        print('@@@@@@@@@@@@@@@@@')
        print(f'temp dict is : {tempdict}')
        # works when below recursion is commented out
        print('!!!!!')
        print(ScrapeLinks.visited_urls)
        return tempdict
        # iterations = 500
        # while 'False' in ScrapeLinks.visited_urls.values():
        #     .crawl()


scrape = ScrapeLinks()
session = scrape.login()
scrape.get_a_hrefs(session)
while 'False' in ScrapeLinks.visited_urls.values():
    updated_dict = scrape.crawl()
    ScrapeLinks.visited_urls = updated_dict
    print(ScrapeLinks.visited_urls)
When the crawl function is run repeatedly, I want the program to use the values stored in the visited_urls dictionary, so that I can keep a record of which pages I have already visited, and so on.
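The visited-page bookkeeping described above can be sketched in isolation with real booleans instead of the strings 'False'/'True' (a minimal standalone version; links_on_page is a hypothetical stand-in for the requests/BeautifulSoup work in the real program):

```python
# Hypothetical link graph standing in for the pages fetched over HTTP.
links_on_page = {
    '/home': ['/about', '/contact'],
    '/about': ['/home'],
    '/contact': [],
}

visited = {'/home': False}  # False = not yet crawled

while False in visited.values():
    # iterate over a snapshot so the dict is not mutated mid-iteration
    for url, seen in list(visited.items()):
        if seen:
            continue
        visited[url] = True  # mark the page as crawled
        for link in links_on_page.get(url, []):
            # register newly discovered links as not-yet-crawled
            visited.setdefault(link, False)

print(visited)  # every reachable page ends up True
```

With real booleans the loop condition is `False in visited.values()`, so there is no string membership test (`'False' in value`) that can blow up on a non-string value.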
The error message I receive is:
Traceback (most recent call last):
  File "essRequest.py", line 155, in <module>
    updated_dict = scrape.crawl()
  File "essRequest.py", line 83, in crawl
    if 'False' in value:
TypeError: argument of type 'NoneType' is not iterable
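The TypeError can be reproduced in isolation. A likely mechanism (an assumption from reading crawl, not confirmed) is that the split helper only returns a value when the key contains '#', and otherwise falls off the end and implicitly returns None; crawl then stores that None in tempdict, and the next pass runs a string membership test against it:

```python
def split(key, sep='#'):
    # same shape as the question's helper: returns only when '#' is present,
    # otherwise implicitly returns None
    if sep in key:
        return key.split(sep, 1)[0]

tempdict = {}
for key in ('/products#top', '/products'):
    tempdict[key] = split(key)

print(tempdict)  # {'/products#top': '/products', '/products': None}

try:
    'False' in tempdict['/products']  # membership test against None
except TypeError as exc:
    print(exc)  # argument of type 'NoneType' is not iterable
```

This matches the symptom that the dictionary's contents look wrong after crawl runs: the unconditional `tempdict[key] = theUrl` assignment overwrites the 'False' markers with either a split URL or None before the 'False' check ever happens.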