Scraping problem on whoscall.in with BeautifulSoup 4

Posted 2024-04-25 03:38:05


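My Python script uses BeautifulSoup, and it doesn't seem to pick up words on the page that sit outside of a div. Is there a specific reason for this? I can grab the profile pictures to count the number of reports, but not the text itself.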

(For reference, I am using the following page: http://whoscall.in/1/2392247496/)

if website == "1":
    reqInput = "http://whoscall.in/1/%s/" % (teleWho)
    urlfile = urllib2.Request(reqInput)
    print(reqInput)
    time.sleep(1)
    requestRec = requests.get(reqInput)
    soup = BeautifulSoup(requestRec.content, "lxml")
    noMatch = soup.find(text=re.compile(r"no reports yet on the phone number"))
    print(requestRec.content)  # only if needed
    type(noMatch) is str
    if noMatch is None:
        worksheet.write(idx+1, 2, "Got a hit")
        howMany = soup.find_all('img',{'src':'/default-avatar.gif'})
        howManyAreThere = len(howMany)
        worksheet.write(idx+1,1,howManyAreThere)
        print(howManyAreThere)
        scamNum = soup.find_all(text=("scam"),recursive=True)
        # ,'scam','Scammer','scammer'
        scamCount = len(scamNum)
        print(scamNum)
        searchTerms = {scamCount:scamCount}
        sentiment = max(searchTerms, key=searchTerms.get)
        worksheet.write(idx+1,3,sentiment)

I can't seem to pull the word "scam" off the page.

I'm not sure why it refuses to find the text, since the rest of my BeautifulSoup code works perfectly.

https://github.com/GarnetSunset/Haircuttery/


1 answer
User
#1 · Posted 2024-04-25 03:38:05

Change this line:

scamNum = soup.find_all(text=("scam"),recursive=True)

to:

scamNum = [ div.text for div in soup.find_all('div', {'style':'font-size:14px; margin:10px; overflow:hidden'}) if 'scam' in div.text.lower() ]  
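The original call finds nothing because a plain string passed to `text=` is compared against each whole text node, so it only matches a node whose entire content is exactly `scam`. Here is a minimal sketch of the difference. The HTML fragment is made up for illustration and is not the real whoscall.in markup, `html.parser` is used only to avoid the lxml dependency, and `string=` is the newer name for the `text=` argument in Beautiful Soup 4.4+.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the report list on the real page.
html = """
<div style="font-size:14px; margin:10px; overflow:hidden">This is a SCAM call</div>
<div style="font-size:14px; margin:10px; overflow:hidden">Telemarketer, hung up</div>
<div style="font-size:14px; margin:10px; overflow:hidden">scam</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string only matches text nodes that are exactly equal to it.
exact = soup.find_all(string="scam")

# A compiled regex is searched inside each text node, so substrings match too.
partial = soup.find_all(string=re.compile(r"scam", re.IGNORECASE))

print(len(exact), len(partial))
```

If the regex form is enough for your case, it avoids depending on the inline `style` value, which may change if the site is redesigned.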

For multiple words, try the following:

words = [ 'word1', 'word2', ... ]
scamNum = [ div.text for div in soup.find_all('div', {'style':'font-size:14px; margin:10px; overflow:hidden'}) if any( word in div.text.lower() for word in words ) ]
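The multi-word filter can be wrapped in a small helper so the count can be written to the worksheet directly. This is only a sketch under stated assumptions: `matching_reports`, the keyword list, and the sample HTML are hypothetical, and the `style` value is the one quoted in the lines above.

```python
from bs4 import BeautifulSoup

# Inline style that identifies a report div (value quoted from the answer).
REPORT_STYLE = "font-size:14px; margin:10px; overflow:hidden"


def matching_reports(soup, words):
    """Return the text of each report div that mentions any of the words."""
    lowered = [w.lower() for w in words]
    reports = soup.find_all("div", {"style": REPORT_STYLE})
    return [
        div.get_text(strip=True)
        for div in reports
        if any(w in div.get_text().lower() for w in lowered)
    ]


# Made-up sample page; the real site's markup may differ.
sample_html = """
<div class="ad">scam alert banner</div>
<div style="font-size:14px; margin:10px; overflow:hidden">Total Scammer, avoid</div>
<div style="font-size:14px; margin:10px; overflow:hidden">Possible FRAUD attempt</div>
<div style="font-size:14px; margin:10px; overflow:hidden">Wrong number</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
hits = matching_reports(sample_soup, ["scam", "fraud"])
print(len(hits))
```

Lower-casing both the keywords and the div text keeps the match case-insensitive, and the banner div is skipped because it lacks the report style.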
