Extracting the links that follow a th in BeautifulSoup

import urllib.request

import bs4 as bs


def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    tables = soup.find_all('table')
    listoflists = []  # one list of links per table
    for table in tables:
        listofmidis = []  # .mid links found in this table
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                listofmidis.append(archive_url + link['href'])
        if listofmidis:
            listoflists.append(listofmidis)
    # Flatten the per-table lists into a single list of URLs.
    midi_list = [item for sublist in listoflists for item in sublist]
    return midi_list

1 Answer

User

#1 · Posted 2024-06-16 09:12:41

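To get all of the "reels" links, do the following:

Collect the links between "REELS" and "SLIDES", as you said. To do that, first find the <tr> tag that contains <a name="reels">REELS</a>. This can be done with the find_parent() method.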

reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
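To see what that lookup returns, here is a minimal sketch that runs offline. The markup in SAMPLE_HTML is a hypothetical, simplified stand-in for the real page, and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may be laid out differently.
SAMPLE_HTML = """
<table>
  <tr><th><a name="reels">REELS</a></th></tr>
  <tr><td><a href="ashplant.mid">Ashplant</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The attrs dict is needed because the keyword "name" means the tag name in find().
anchor = soup.find("a", {"name": "reels"})

# find_parent() climbs the tree until it reaches the nearest enclosing <tr>.
reels_tr = anchor.find_parent("tr")

print(anchor.text)
print(reels_tr.name)
```

If the anchor does not exist, find() returns None, so the chained .find_parent() call would raise an AttributeError.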

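Now you can use the find_next_siblings() method to get all of the <tr> tags that come after "REELS". When we reach the <tr> tag that contains <a name="slides">SLIDES</a> (or for which .find('a').text == 'SLIDES'), we can break out of the loop.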
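The sibling walk can be tried on a small snippet before hitting the network. This is only a sketch: SAMPLE_HTML and the file name example.mid are made up for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may be laid out differently.
SAMPLE_HTML = """
<table>
  <tr><th><a name="reels">REELS</a></th></tr>
  <tr><td><a href="ashplant.mid">Ashplant</a></td></tr>
  <tr><td><a href="bashful.mid">Bashful Bachelor</a></td></tr>
  <tr><th><a name="slides">SLIDES</a></th></tr>
  <tr><td><a href="example.mid">Example Slide</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
reels_tr = soup.find("a", {"name": "reels"}).find_parent("tr")

hrefs = []
# find_next_siblings('tr') yields only the <tr> elements after reels_tr,
# in document order, skipping the whitespace text nodes between them.
for tr in reels_tr.find_next_siblings("tr"):
    link = tr.find("a")
    if link.text == "SLIDES":
        break  # reached the next section heading
    hrefs.append(link["href"])

print(hrefs)
```

Only the two rows between the headings are collected; the row after SLIDES is never visited because the loop has already stopped.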

Complete code:

import requests
from bs4 import BeautifulSoup


def import_midifiles():
    BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    midi_list = []
    # Row that holds the <a name="reels"> section heading.
    reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
    # Walk the rows after it and stop at the SLIDES heading row.
    for tr in reels_tr.find_next_siblings('tr'):
        if tr.find('a').text == 'SLIDES':
            break
        midi_list.append(BASE_URL + tr.find('a')['href'])
    return midi_list

print(import_midifiles())

Partial output:

['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid', 'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid', 'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid', 'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid', 'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid', 'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid', 'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid', 'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid', 'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
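The function above assumes every row between the two headings contains a link; a row without one would make tr.find('a') return None and raise an AttributeError. Below is a more defensive sketch of the same idea, not a replacement for the answer's code. The helper name collect_section_links, its parameters, and the sample markup are all hypothetical, and it takes the HTML as a string so it can be tried without a network request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_section_links(html, base_url, start="reels", stop="slides"):
    """Return absolute .mid URLs from the rows between two named anchors."""
    soup = BeautifulSoup(html, "html.parser")
    start_anchor = soup.find("a", {"name": start})
    if start_anchor is None:
        return []  # section heading not found
    start_tr = start_anchor.find_parent("tr")
    if start_tr is None:
        return []  # heading is not inside a table row
    links = []
    for tr in start_tr.find_next_siblings("tr"):
        # Stop at the row that carries the next section's named anchor.
        if tr.find("a", {"name": stop}) is not None:
            break
        # Rows without links simply contribute nothing.
        for a in tr.find_all("a", href=True):
            if a["href"].lower().endswith(".mid"):
                links.append(urljoin(base_url, a["href"]))
    return links


# Hypothetical, simplified markup; the real page may be laid out differently.
SAMPLE_HTML = """
<table>
  <tr><th><a name="reels">REELS</a></th></tr>
  <tr><td><a href="ashplant.mid">Ashplant</a></td></tr>
  <tr><td>no link in this row</td></tr>
  <tr><td><a href="bashful.mid">Bashful Bachelor</a></td></tr>
  <tr><th><a name="slides">SLIDES</a></th></tr>
  <tr><td><a href="example.mid">Example Slide</a></td></tr>
</table>
"""

BASE_URL = "http://www.tadpoletunes.com/tunes/celtic1/"
result = collect_section_links(SAMPLE_HTML, BASE_URL)
print(result)
```

Matching the stop row by its name attribute rather than its text avoids depending on the heading's capitalization, and urljoin handles hrefs that are already absolute.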
