Can't tell apart two expressions that are supposed to work the same way

Posted 2024-04-19 20:45:22


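A few days ago I created an earlier post asking how to make my script loop over a few links so that, for each link, it checks up to four times whether my defined title (supposed to be extracted from that link) is empty. If the title is still empty, the script should break out of the loop and move on to the next link to repeat the same process.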

This is how I got it to work: by changing fetch_data(link) to return fetch_data(link), and by placing counter=0 outside the while loop but inside the if statement.

Working script:

import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3",
    "https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4"
]
counter = 0

def fetch_data(link):
    global counter
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    try:
        title = soup.select_one("p.tcode").text
    except AttributeError:
        title = ""

    if not title:
        while counter<=3:
            time.sleep(1)
            print("trying {} times".format(counter))
            counter += 1
            return fetch_data(link) #First fix
        counter=0 #Second fix

    print("tried with this link:",link)

if __name__ == '__main__':
    for link in links:
        fetch_data(link)

This is the output the above script produces (as desired):

trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=2
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=3
trying 0 times
trying 1 times
trying 2 times
trying 3 times
tried with this link: https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page=4

I deliberately used a wrong selector in my script so that it meets the condition I've defined above.

Why should I use return fetch_data(link) instead of fetch_data(link), given that the two expressions work identically most of the time?


1 answer
User
#1 · posted 2024-04-19 20:45:22

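If the function fails to get a title, it starts a recursive call from inside the while loop. With return fetch_data(link) this works because, whenever the counter is less than or equal to 3 (while counter<=3), the function exits immediately at the end of the first pass through the loop body, so it never reaches the next line, which resets the counter to 0 (counter=0). Because the counter is a global variable and increases by only 1 per level of recursion, there are at most 4 recursive calls: once counter is greater than 3, the function no longer enters the while loop that would call fetch_data(link) again.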

fetch_data (counter=0)
   > fetch_data (counter=1)
     > fetch_data (counter=2)
       > fetch_data (counter=3)
         > fetch_data (counter=4) 
        - skips the while loop, resets counter, prints the URL
        - return to above function
      - return to above function
    - return to above function
  - return to above function
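
The trace above can be reproduced without any network access. The following is a minimal sketch, assuming the title lookup always fails; fake_title() and the depth parameter are hypothetical additions used only to record how deep the recursion goes, and are not part of the original script.

```python
# Network-free model of the retry logic with `return fetch_data(link)`.
# fake_title() is a hypothetical stand-in for the failing
# soup.select_one("p.tcode") lookup; `depth` is added only for bookkeeping.
counter = 0
max_depth = 0


def fake_title():
    return ""  # always empty, like the deliberately wrong selector


def fetch_data(link, depth=1):
    global counter, max_depth
    max_depth = max(max_depth, depth)
    title = fake_title()
    if not title:
        while counter <= 3:
            counter += 1
            # `return` leaves this frame immediately, so the loop body runs
            # at most once per call and the reset below is skipped here.
            return fetch_data(link, depth + 1)
        counter = 0  # reached only in the deepest call, where counter == 4
    return depth


deepest = fetch_data("https://example.com/page")
print(deepest, max_depth, counter)
```

In this model the first call plus four recursive calls give five frames in total, and the counter is back at 0 when the outermost call returns, so the next link starts with a fresh count.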

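If you use plain fetch_data(link), the function still starts a recursive call inside the while loop, but it does not exit right after that call, and the deepest call resets the counter to 0. This is dangerous: after the counter reaches 4, control returns to the while loop of the previous call, and that loop does not stop. It keeps starting further recursive calls, because the counter is now 0, which is <= 3. This eventually hits the maximum recursion depth and crashes the program.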

fetch_data (counter=0)
   > fetch_data (counter=1)
     > fetch_data (counter=2)
       > fetch_data (counter=3)
         > fetch_data (counter=4) 
        - skips the while loop, !!!resets counter!!!, prints the URL
        - return to above function
      - does not return to the function above
      - since counter = 0, the while loop continues
         > fetch_data (counter=1)
           > fetch_data (counter=2)
             > fetch_data (counter=3)
...
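
The runaway behaviour in this trace can also be observed safely. The sketch below models the version without return, again assuming the title is always empty; DEPTH_CAP, TooDeep, and hits_cap() are hypothetical helpers added so the demonstration stops early instead of waiting for Python's own RecursionError.

```python
# Network-free model of the retry logic with a bare `fetch_data(link)`.
# DEPTH_CAP and TooDeep are hypothetical safety helpers for this demo only.
counter = 0
calls = 0
DEPTH_CAP = 50


class TooDeep(Exception):
    """Raised when the recursion grows past DEPTH_CAP."""


def fetch_data(link, depth=1):
    global counter, calls
    calls += 1
    if depth > DEPTH_CAP:
        raise TooDeep(depth)
    title = ""  # always empty, like the deliberately wrong selector
    if not title:
        while counter <= 3:
            counter += 1
            # No `return`: when this call comes back, the loop condition is
            # checked again, and the deepest call has already reset counter.
            fetch_data(link, depth + 1)
        counter = 0


def hits_cap(link):
    try:
        fetch_data(link)
    except TooDeep:
        return True
    return False


runaway = hits_cap("https://example.com/page")
print(runaway, calls)
```

Each time the deepest call resets the counter, the frame above it resumes its loop and recurses again, so the stack keeps growing rather than unwinding. That is why the cap is reached here, and why the real script would end in a RecursionError.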
