Create a list of links from the page https://www.stubhub.com/new-york-rangers-tickets/performer/2764/ that contain the text "New York Rangers"

Posted 2024-05-23 19:50:25


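I'm trying to use Python to build a list of all the links on a page that contain a specific string. For example, I want every link containing "New York Rangers @" on this page: https://www.stubhub.com/new-york-rangers-tickets/performer/2764/

Thanks for any help. Sorry if this is a silly question, but I couldn't find an answer anywhere.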


3 answers

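Using Selenium, to create a list of all the links (that is, the href attributes) on the page https://www.stubhub.com/new-york-rangers-tickets/performer/2764/ whose text contains New York Rangers, you need to induce WebDriverWait for visibility_of_all_elements_located(). You can use the following Locator Strategy:

  • Using XPATH: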

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    # configure the driver
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    # Selenium 4 takes the driver path through a Service object;
    # Selenium 3 accepted executable_path=... directly.
    driver = webdriver.Chrome(service=Service(r'C:\WebDrivers\chromedriver.exe'), options=options)
    driver.get("https://www.stubhub.com/new-york-rangers-tickets/performer/2764/")

    # wait up to 20 seconds for the matching anchors to become visible, then collect their href attributes
    links = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[./div[contains(., 'New York Rangers')]]")))
    print([my_elem.get_attribute("href") for my_elem in links])
    
  • Console output:

    ['https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-1-31-2020/event/104217508/', 'https://www.stubhub.com/detroit-red-wings-tickets-detroit-red-wings-detroit-little-caesars-arena-2-1-2020/event/104215245/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-3-2020/event/104212773/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-5-2020/event/104215469/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-7-2020/event/104217518/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-9-2020/event/104214839/', 'https://www.stubhub.com/winnipeg-jets-tickets-winnipeg-bell-mts-place-2-11-2020/event/104212882/', 'https://www.stubhub.com/minnesota-wild-tickets-minnesota-wild-saint-paul-xcel-energy-center-2-13-2020/event/104216234/', 'https://www.stubhub.com/columbus-blue-jackets-tickets-columbus-blue-jackets-columbus-nationwide-arena-2-14-2020/event/104212942/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-16-2020/event/104217520/', 'https://www.stubhub.com/chicago-blackhawks-tickets-chicago-blackhawks-chicago-united-center-2-19-2020/event/104213910/', 'https://www.stubhub.com/carolina-hurricanes-tickets-carolina-hurricanes-raleigh-pnc-arena-2-21-2020/event/104212812/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-22-2020/event/104217524/', 'https://www.stubhub.com/new-york-islanders-tickets-new-york-islanders-uniondale-nycb-live-home-of-the-nassau-veterans-memorial-coliseum-2-25-2020/event/104354662/', 'https://www.stubhub.com/montreal-canadiens-tickets-montreal-bell-centre-2-27-2020/event/104215418/', 
'https://www.stubhub.com/philadelphia-flyers-tickets-philadelphia-flyers-philadelphia-wells-fargo-center-philadelphia-2-28-2020/event/104212712/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-3-1-2020/event/104215027/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-3-3-2020/event/104217528/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-3-5-2020/event/104215030/', 'https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-3-7-2020/event/104215474/']
    

The data is embedded in the page inside a <script> tag. You can parse it with this example (using the re and json modules):

import re
import json
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'}
url = 'https://www.stubhub.com/new-york-rangers-tickets/performer/2764/'

txt = requests.get(url, headers=headers).text

data = json.loads(re.search(r'window.__INITIAL_STATE__\s*=\s*(.*})<', txt)[1])

# print(json.dumps(data, indent=4)) # <  uncomment to see all data (prices, dates, etc.)

for event in data['EVENT_SEO_LIST']['events']:
    if 'PARKING PASSES ONLY' in event['name']:
        continue
    print('{:<45} {}'.format(event['name'], 'https://www.stubhub.com/' + event['webURI']))

Prints:

Detroit Red Wings at New York Rangers         https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-1-31-2020/event/104217508/
New York Rangers at Detroit Red Wings         https://www.stubhub.com/detroit-red-wings-tickets-detroit-red-wings-detroit-little-caesars-arena-2-1-2020/event/104215245/
Dallas Stars at New York Rangers              https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-3-2020/event/104212773/
Toronto Maple Leafs at New York Rangers       https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-5-2020/event/104215469/
Buffalo Sabres at New York Rangers            https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-7-2020/event/104217518/
Los Angeles Kings at New York Rangers         https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-9-2020/event/104214839/
New York Rangers at Winnipeg Jets Tickets (Replica Hall of Fame Banner Giveaway) https://www.stubhub.com/winnipeg-jets-tickets-winnipeg-bell-mts-place-2-11-2020/event/104212882/
New York Rangers at Minnesota Wild            https://www.stubhub.com/minnesota-wild-tickets-minnesota-wild-saint-paul-xcel-energy-center-2-13-2020/event/104216234/
New York Rangers at Columbus Blue Jackets     https://www.stubhub.com/columbus-blue-jackets-tickets-columbus-blue-jackets-columbus-nationwide-arena-2-14-2020/event/104212942/
Boston Bruins at New York Rangers             https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-16-2020/event/104217520/
New York Rangers at Chicago Blackhawks        https://www.stubhub.com/chicago-blackhawks-tickets-chicago-blackhawks-chicago-united-center-2-19-2020/event/104213910/
New York Rangers at Carolina Hurricanes       https://www.stubhub.com/carolina-hurricanes-tickets-carolina-hurricanes-raleigh-pnc-arena-2-21-2020/event/104212812/
San Jose Sharks at New York Rangers           https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-2-22-2020/event/104217524/
New York Rangers at New York Islanders        https://www.stubhub.com/new-york-islanders-tickets-new-york-islanders-uniondale-nycb-live-home-of-the-nassau-veterans-memorial-coliseum-2-25-2020/event/104354662/
New York Rangers at Montreal Canadiens        https://www.stubhub.com/montreal-canadiens-tickets-montreal-bell-centre-2-27-2020/event/104215418/
New York Rangers at Philadelphia Flyers       https://www.stubhub.com/philadelphia-flyers-tickets-philadelphia-flyers-philadelphia-wells-fargo-center-philadelphia-2-28-2020/event/104212712/
Philadelphia Flyers at New York Rangers       https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-3-1-2020/event/104215027/
St. Louis Blues at New York Rangers           https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-3-3-2020/event/104217528/
Washington Capitals at New York Rangers       https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-3-5-2020/event/104215030/
New Jersey Devils at New York Rangers         https://www.stubhub.com/new-york-rangers-tickets-new-york-rangers-new-york-madison-square-garden-3-7-2020/event/104215474/
New York Rangers at Dallas Stars              https://www.stubhub.com/dallas-stars-tickets-dallas-stars-dallas-american-airlines-center-3-10-2020/event/104214902/
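
The extraction step above can be tried offline. The following is a minimal sketch, assuming a page that assigns a JSON object to window.__INITIAL_STATE__ inside a script tag; SAMPLE_HTML, its event entries, and the helper name extract_event_links are made up for illustration and are not StubHub's real markup. The regex here is a slightly stricter variant of the one in the answer (escaped dot, anchored on the closing script tag).

```python
import json
import re

# Hypothetical page source: a tiny stand-in, not the real StubHub HTML.
SAMPLE_HTML = """
<html><body>
<script>window.__INITIAL_STATE__ = {"EVENT_SEO_LIST": {"events": [
  {"name": "Dallas Stars at New York Rangers", "webURI": "rangers-2-3-2020/event/1/"},
  {"name": "PARKING PASSES ONLY New York Rangers", "webURI": "parking/event/2/"}
]}};</script>
</body></html>
"""

# Capture the JSON object assigned to window.__INITIAL_STATE__, up to the closing script tag.
STATE_RE = re.compile(
    r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;?\s*</script>", re.DOTALL
)


def extract_event_links(html, base="https://www.stubhub.com/"):
    """Return (name, url) pairs for events, skipping parking-only entries."""
    match = STATE_RE.search(html)
    if match is None:
        return []  # the page layout changed or the state object is absent
    data = json.loads(match.group(1))
    events = data.get("EVENT_SEO_LIST", {}).get("events", [])
    return [
        (event["name"], base + event["webURI"])
        for event in events
        if "PARKING PASSES ONLY" not in event["name"]
    ]


for name, url in extract_event_links(SAMPLE_HTML):
    print(name, url)
```

Returning an empty list when the pattern is missing keeps the caller from crashing on a None match, which is the likely failure mode if the site changes how it embeds its state.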

First, you need to fetch the content of the page you want to search for links. I strongly recommend requests, a simple HTTP library for Python:

import requests

response = requests.get('https://www.stubhub.com/new-york-rangers-tickets/performer/2764/')

For some reason, this particular URL requires a User-Agent header, so you should send one with the request:

url = 'https://www.stubhub.com/new-york-rangers-tickets/performer/2764/'
user_agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'
response = requests.get(url, headers={'User-Agent':user_agent})
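
If requests is not available, the same header can be attached with the standard library. This is a small sketch using urllib.request rather than requests; the helper name build_request is hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request carrying a User-Agent header; nothing is sent yet."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


url = "https://www.stubhub.com/new-york-rangers-tickets/performer/2764/"
user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
req = build_request(url, user_agent)

# urllib stores header names via str.capitalize(), hence the "User-agent" spelling.
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen would then perform the actual fetch.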

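Then you can use beautifulsoup4 to parse the page content. You can pass a compiled regular expression as the text argument of the find_all method to find all a tags that contain specific text: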

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(response.content, "html.parser")
rangers_anchor_tags = soup.find_all("a", text=re.compile(r".*\bNew York Rangers at\b.*"))
urls = [anchor["href"] for anchor in rangers_anchor_tags]

urls is then a list of URLs whose anchor tags' inner text contains the string in question.
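
The same idea can be tested without any third-party package. The sketch below swaps BeautifulSoup for the standard library's html.parser; the class LinkTextCollector, the helper links_matching, and SAMPLE_HTML are made-up names and data for illustration. Unlike the text argument above, it joins all text nested inside an anchor, so anchors that wrap their label in a div are matched as well.

```python
import re
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_matching(html, pattern):
    """Return the hrefs of anchors whose text matches the compiled pattern."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if pattern.search(text)]


# Hypothetical markup, not the real StubHub page.
SAMPLE_HTML = """
<a href="/event/1/"><div>Dallas Stars at New York Rangers</div></a>
<a href="/help/">Help center</a>
<a href="/event/2/">New York Rangers at Detroit Red Wings</a>
"""

urls = links_matching(SAMPLE_HTML, re.compile(r"\bNew York Rangers\b"))
print(urls)
```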
