I don't know why my dictionary values are being set to None after multiple function calls

Posted 2024-06-10 23:24:14


I have a short program that is meant to test all the links in my application and make sure they work. I have got to the stage where all the links in the application are extracted into a dictionary.

When I run the program once, it seems to work; by that I mean the contents of the dictionary are what I expect.

However, when I add a while loop that checks the values in the dictionary to see whether any of them still contain 'False', the whole thing falls apart. The dictionary ends up with different contents: entries that were correct before now seem to be None.
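To illustrate the symptom, here is a tiny, self-contained example with hypothetical names (not my actual code), just to show what I mean by entries turning into None:

# Hypothetical illustration of the symptom: a helper that only returns
# a value on one branch implicitly returns None on the other, and that
# None then shows up as a dictionary value.
def strip_fragment(url):
    if '#' in url:
        return url.split('#', 1)[0]
    # no explicit return here, so callers receive None

pages = {'/home#top': 'False', '/about': 'False'}
cleaned = {key: strip_fragment(key) for key in pages}
print(cleaned)   # {'/home#top': '/home', '/about': None}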

I tried returning the dictionary from the function and copying it back into the dictionary that serves as the source of truth, but I still hit the problem.
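Concretely, the copy-back attempt looks roughly like this (a simplified sketch with made-up keys, not the real code):

# Simplified sketch of the copy-back attempt (made-up keys):
from copy import copy

source_of_truth = {'/home': 'False', '/about': 'False'}
working = copy(source_of_truth)   # shallow copy is fine for str values
working['/home'] = 'True'         # record a visit on the copy
source_of_truth = working         # copy the result back
print(source_of_truth)            # {'/home': 'True', '/about': 'False'}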

When I then try to check a dictionary entry's value to see whether it is 'False', an error is thrown. Here is the code:

from copy import copy

import requests
from bs4 import BeautifulSoup


class ScrapeLinks:
    unique_urls = set()
    visited_urls = {}

    def get_a_hrefs(self, response):
        content = response.content
        soup = BeautifulSoup(content, 'lxml')
        # Grab the CSRF token so later requests can authenticate
        locator = 'input[name="_csrf"]'
        links = 'a'
        csrf_html = soup.select_one(locator)
        csrf_value = csrf_html.attrs['value']
        ScrapeLinks.csrf_token = csrf_value
        # Record every internal link as unvisited ('False')
        for items in soup.find_all(links):
            if items.has_attr('href'):
                if items.attrs['href'].startswith('#') or not items.attrs['href'].startswith('/'):
                    # skip pure fragments and external links
                    pass
                else:
                    if '#' in items.attrs['href']:
                        # also record the URL with its fragment stripped
                        sep = '#'
                        theUrl = items.attrs['href'].split(sep, 1)[0]
                        print(f'the split url is: {theUrl}')
                        ScrapeLinks.visited_urls[theUrl] = 'False'

                    ScrapeLinks.visited_urls[items.attrs['href']] = 'False'


    def split(self, key, value, sep):
        # Strip the fragment from a URL; note there is no explicit
        # return when key has no '#', so that path yields None
        if '#' in key:
            print('in splitter')
            sep = '#'
            theUrl = key.split(sep, 1)[0]
            print(f'the split url is: {theUrl}')
            return theUrl

    def crawl(self):
        tempdict = {}
        # Work from a shallow copy so visited_urls is not mutated mid-loop
        dict_copy = copy(ScrapeLinks.visited_urls)
        print(dict_copy)
        for key, value in dict_copy.items():
            theUrl = self.split(key, value, '#')
            tempdict[key] = theUrl
            if 'False' in value:
                # Fetch any page that has not been visited yet
                r = requests.get(ScrapeLinks.base_path + f'{key}', headers={
                    'content-type': 'application/x-www-form-urlencoded',
                    'X-Csrf-Token': f'{ScrapeLinks.csrf_token}',
                    'Cookie': f'SID/newtshirt={ScrapeLinks.cookie}'
                })

                tempdict[key] = 'True'
                print(r.status_code)
                print("**************************************")

                content = r.content
                soup = BeautifulSoup(content, 'lxml')
                links = 'a'

                # Queue up any new internal links found on this page
                for items in soup.find_all(links):
                    if items.has_attr('href'):
                        print(items.attrs["href"])
                        if items.attrs['href'] not in dict_copy:
                            if items.attrs['href'].startswith('#') or not items.attrs['href'].startswith('/'):
                                print(f'Skipping: {items.attrs["href"]}')
                                continue
                            print(f'Adding the link: {items.attrs["href"]}')
                            ScrapeLinks.unique_urls.add(items.attrs['href'])
                        else:
                            print('already in dictionary - do nothing')


        # Links discovered this pass start out unvisited
        for i in ScrapeLinks.unique_urls:
            tempdict.update({i: 'False'})

        print('@@@@@@@@@@@@@@@@@')
        print(f'temp dict is : {tempdict}')


        # works when the recursion below is commented out
        print('!!!!!')
        print(ScrapeLinks.visited_urls)
        return tempdict
        # iterations = 500
        # while 'False' in ScrapeLinks.visited_urls.values():
        #     .crawl()


scrape = ScrapeLinks()
session = scrape.login()     # login() (not shown here) returns the initial response
scrape.get_a_hrefs(session)
# Keep crawling until no entry is still marked 'False'
while 'False' in ScrapeLinks.visited_urls.values():
    updated_dict = scrape.crawl()
    ScrapeLinks.visited_urls = updated_dict
print(ScrapeLinks.visited_urls)

When the crawl function runs over and over, I want the program to use the values stored in the visited_urls dictionary, so that I can keep a record of which pages I have already visited, and so on.
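In other words, the intended flow is roughly the following (a simplified sketch where the hypothetical fetch_links helper stands in for the real request-and-parse step):

# Simplified sketch of the intended loop: keep crawling until every
# URL in the record has been visited (fetch_links is a stand-in).
def fetch_links(url):
    return []   # stand-in for the real requests + BeautifulSoup parsing

visited_urls = {'/home': 'False', '/about': 'False'}
while 'False' in visited_urls.values():
    for url, status in list(visited_urls.items()):
        if status == 'False':
            for link in fetch_links(url):
                visited_urls.setdefault(link, 'False')
            visited_urls[url] = 'True'
print(visited_urls)   # every entry should end up 'True'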

The error message I receive is:

Traceback (most recent call last):
  File "essRequest.py", line 155, in <module>
    updated_dict = scrape.crawl()
  File "essRequest.py", line 83, in crawl
    if 'False' in value:
TypeError: argument of type 'NoneType' is not iterable
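
The error itself is easy to reproduce in isolation, since the in operator cannot be applied to None:

value = None
'False' in value   # TypeError: argument of type 'NoneType' is not iterable

So by the time crawl() runs again, at least one value in visited_urls must have become None instead of 'False' or 'True', and I don't see where that happens.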

Tags: key, in, false, if, value, items, urls, csrf