How to get the links in a DataFrame

import pandas as pd
import requests

# Global variables
HEADS = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
dateiname = 'test.xlsx'
# Global variables

def get_response(url):
    # Perform the URL request
    try:
        response = requests.get(url, headers=HEADS)
    except AttributeError:
        print('AttributeError')
    return response

def scraping_kader(response):
    try:
        dfs = pd.read_html(response.text)
        #dfs = dfs.to_html(escape=False)
        print(dfs[1])
        print(dfs[1].iloc[0, :])
    except ImportError:
        print(' ImportError')
    except ValueError:
        print(' ValueError')
    except AttributeError:
        print(' AttributeError')

response = get_response('https://www.transfermarkt.de/tsg-1899-hoffenheim/kader/verein/533/saison_id/2019/plus/1')
scraping_kader(response)

2 Answers

User

Answer 1 · edited 2024-06-09 08:36:16

As far as I know, read_html only extracts the text from a table; it ignores links, hidden elements, attributes, and so on.
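One caveat worth checking against your installed version: pandas 1.5 and later accept an `extract_links` argument in `read_html`, which returns each affected cell as a `(text, href)` tuple. The following is only a minimal sketch, assuming pandas >= 1.5 with lxml installed. The HTML string is a made-up stand-in for the real page, with hrefs borrowed from the results in this thread.

```python
import io

import pandas as pd

# Made-up sample table (assumption); the real page has many more columns
SAMPLE_HTML = """
<table>
  <tr><th>Name</th></tr>
  <tr><td><a href="/oliver-baumann/profil/spieler/55089">Oliver Baumann</a></td></tr>
  <tr><td><a href="/kevin-vogt/profil/spieler/84435">Kevin Vogt</a></td></tr>
</table>
"""

# extract_links="body" turns every body cell into a (text, href) tuple
dfs = pd.read_html(io.StringIO(SAMPLE_HTML), extract_links="body")
df = dfs[0]

# Split the tuples into a text column and a link column
cells = list(df["Name"])
df["Name"] = [cell[0] for cell in cells]
df["Link"] = [cell[1] for cell in cells]

print(df)
```

On older pandas versions this argument does not exist, and the manual approach described next still applies.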

To work with the full HTML and pull out the information you need by hand, use a module such as BeautifulSoup or lxml.

   soup = BeautifulSoup(response.text, 'html.parser')
   
   all_tooltips = soup.find_all('td', class_='hauptlink')
   
   for item in all_tooltips:
       item = item.find('a', class_='spielprofil_tooltip')
       if item:
           print(item['href']) #, item.text)

This example only gets the links, but other elements can be retrieved the same way.
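To illustrate "the same way", here is a small sketch that collects the link text together with the href. It assumes the same `hauptlink` / `spielprofil_tooltip` class names as the snippet above; the HTML string and the `extract_players` helper are hypothetical and only there so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Made-up fragment (assumption) that mimics the structure targeted above
SAMPLE_HTML = """
<table>
  <tr><td class="hauptlink"><a class="spielprofil_tooltip" href="/oliver-baumann/profil/spieler/55089">Oliver Baumann</a></td></tr>
  <tr><td class="hauptlink"><a class="spielprofil_tooltip" href="/kevin-vogt/profil/spieler/84435">Kevin Vogt</a></td></tr>
  <tr><td class="hauptlink">cell without a link</td></tr>
</table>
"""


def extract_players(html):
    """Return a list of {'name', 'href'} dicts, one per player link found."""
    soup = BeautifulSoup(html, "html.parser")
    players = []
    for cell in soup.find_all("td", class_="hauptlink"):
        link = cell.find("a", class_="spielprofil_tooltip")
        if link:  # skip cells that contain no matching link
            players.append({
                "name": link.get_text(strip=True),
                "href": link["href"],
            })
    return players


players = extract_players(SAMPLE_HTML)
for player in players:
    print(player["name"], player["href"])
```

Any other attribute of the tag can be read like `link["href"]`, and `link.get(...)` avoids a `KeyError` when an attribute may be missing.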

import requests
from bs4 import BeautifulSoup
#import pandas as pd

HEADS = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
}

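def get_response(url):
    # Perform the URL request; return None if the request fails
    try:
        response = requests.get(url, headers=HEADS)
    except requests.RequestException as error:
        print('Request failed:', error)
        return None

    return response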

def scraping_kader(response):
    # Print the profile link of every player found in the page
    if response is None:
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Cells with the class "hauptlink" contain the player profile links
    all_tooltips = soup.find_all('td', class_='hauptlink')

    for cell in all_tooltips:
        link = cell.find('a', class_='spielprofil_tooltip')
        if link:
            print(link['href'])  # link.text holds the link text

# - main

response = get_response('https://www.transfermarkt.de/tsg-1899-hoffenheim/kader/verein/533/saison_id/2019/plus/1')
scraping_kader(response)

Result

/oliver-baumann/profil/spieler/55089
/philipp-pentke/profil/spieler/8246
/luca-philipp/profil/spieler/432671
/stefan-posch/profil/spieler/223974
/kevin-vogt/profil/spieler/84435
/benjamin-hubner/profil/spieler/52348
/kevin-akpoguma/profil/spieler/160241
/kasim-adams/profil/spieler/263801
/ermin-bicakcic/profil/spieler/51676
/havard-nordtveit/profil/spieler/42234
/melayro-bogarde/profil/spieler/476915
/konstantinos-stafylidis/profil/spieler/148967
/pavel-kaderabek/profil/spieler/143798
/joshua-brenet/profil/spieler/207006
/florian-grillitsch/profil/spieler/195736
/diadie-samassekou/profil/spieler/315604
/dennis-geiger/profil/spieler/251309
/ilay-elmkies/profil/spieler/443752
/christoph-baumgartner/profil/spieler/324278
/mijat-gacinovic/profil/spieler/215864
/jacob-bruun-larsen/profil/spieler/293281
/sargis-adamyan/profil/spieler/125614
/felipe-pires/profil/spieler/327911
/robert-skov/profil/spieler/270393
/ihlas-bebou/profil/spieler/237164
/andrej-kramaric/profil/spieler/46580
/ishak-belfodil/profil/spieler/111039
/munas-dabbur/profil/spieler/145866
/klauss/profil/spieler/498862
/maximilian-beier/profil/spieler/578392
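
The hrefs above are relative to the site root. If absolute URLs are needed, one option is the standard library's `urllib.parse.urljoin`. In this sketch `BASE_URL` is taken from the request URL used in the code above, and the variable names are illustrative.

```python
from urllib.parse import urljoin

# Site root of the page that was scraped
BASE_URL = "https://www.transfermarkt.de"

# First two hrefs from the result list
relative_links = [
    "/oliver-baumann/profil/spieler/55089",
    "/philipp-pentke/profil/spieler/8246",
]

# urljoin resolves each relative href against the base URL
absolute_links = [urljoin(BASE_URL, href) for href in relative_links]

for url in absolute_links:
    print(url)
```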

User

Answer 2 · edited 2024-06-09 08:36:16

That helped me.

I now copy the table with pandas and replace the column with the links obtained from the BS4 code. It works.
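
The answer does not show its code, so the following is only a guess at what such a combination could look like. The `kader` DataFrame is a hypothetical stand-in for the table returned by `pd.read_html`, and the sketch assumes the links come back in the same row order as the table.

```python
import pandas as pd

# Hypothetical stand-in for the squad table read with pandas (assumption)
kader = pd.DataFrame({"Name": ["Oliver Baumann", "Philipp Pentke", "Luca Philipp"]})

# Links as collected by the BeautifulSoup loop, in the same row order
links = [
    "/oliver-baumann/profil/spieler/55089",
    "/philipp-pentke/profil/spieler/8246",
    "/luca-philipp/profil/spieler/432671",
]

# Guard against a silent misalignment between rows and links
if len(links) != len(kader):
    raise ValueError("number of links does not match number of rows")

# Store the links next to the names; assign to "Name" instead to replace it
kader["Link"] = links

print(kader)
```

The length check matters because the real page may contain extra header or spacer rows that `read_html` keeps but the link search skips.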
