Web scraping rap lyrics from Rap Genius with Python

Posted 2024-06-06 13:00:55


I'm new to programming, and I've been trying to scrape Andre 3000's lyrics from Rap Genius (http://genius.com/artists/Andre-3000) using Beautiful Soup, a Python library for pulling data out of HTML and XML files. My end goal is to save the data in string format. Here is what I have so far:

from bs4 import BeautifulSoup
from urllib2 import urlopen

BASE_URL = "http://rapgenius.com"
artist_url = "http://rapgenius.com/artists/Andre-3000"

def get_song_links(url):
    html = urlopen(url).read()
    # print html
    soup = BeautifulSoup(html, "lxml")
    container = soup.find("div", "container")
    song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]

    print song_links

    # print every link on the page while experimenting
    for link in soup.find_all('a'):
        print link.get('href')

get_song_links(artist_url)

So I need help with the rest of the code. How can I get his lyrics into string format? And how can I then use the Natural Language Toolkit (NLTK) to tokenize the sentences and words?


3 Answers

First, for each link you need to download that page and parse it with BeautifulSoup. Then look for an attribute on that page that distinguishes the lyrics from the rest of the page content. I found <a data-editorial-state="accepted" data-classification="accepted" data-group="0"> helpful. Then run a find_all on the lyrics page content to get all of the lyric lines. For each line you can call get_text() to pull out its text.
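A minimal sketch of that approach, assuming the song page markup marks lyric lines with the data attributes above (the song URL below is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Placeholder song URL; substitute one of the links collected from the artist page.
song_url = "http://genius.com/Andre-3000-some-song-lyrics"

html = requests.get(song_url).text
soup = BeautifulSoup(html, "lxml")

# Assumes lyric lines are <a> elements carrying the data attributes described above.
lyric_lines = soup.find_all("a", attrs={"data-editorial-state": "accepted"})
lyric_text = "\n".join(line.get_text() for line in lyric_lines)
print(lyric_text)

lyric_text is then a single string that can be handed to the NLTK tokenizers below.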

As for NLTK, once it is installed you can import it and tokenize sentences like this:

from nltk.tokenize import word_tokenize, sent_tokenize
words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]

This will give you a list of all the words in each sentence.
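For instance, a short sample string tokenizes roughly as follows (the exact output depends on your NLTK version, and the punkt tokenizer data must be downloaded once):

from nltk.tokenize import word_tokenize, sent_tokenize

# import nltk; nltk.download('punkt')  # one-time download of the tokenizer models
lyric_text = "I'm sorry Ms. Jackson. I am for real."  # sample text, not scraped data
words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]
print(words)
# [['I', "'m", 'sorry', 'Ms.', 'Jackson', '.'], ['I', 'am', 'for', 'real', '.']]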

Here is an example of how to get all of the song links on the artist page and follow them to get the song lyrics:

from urlparse import urljoin
from bs4 import BeautifulSoup
import requests


BASE_URL = "http://genius.com"
artist_url = "http://genius.com/artists/Andre-3000/"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
response = requests.get(artist_url, headers=headers)

soup = BeautifulSoup(response.text, "lxml")
for song_link in soup.select('ul.song_list > li > a'):
    link = urljoin(BASE_URL, song_link['href'])
    response = requests.get(link, headers=headers)  # reuse the same headers for the song pages
    soup = BeautifulSoup(response.text, "lxml")  # specify the parser explicitly
    lyrics = soup.find('div', class_='lyrics').text.strip()

    # tokenize `lyrics` with nltk

Note that this example uses the requests module. Also note that the User-Agent header is required; without it the site returns 403 - Forbidden.
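To fill in the tokenize placeholder above with the NLTK step from the first answer, something along these lines can go inside the loop (a sketch reusing the lyrics string built above):

from nltk.tokenize import word_tokenize, sent_tokenize

# inside the for-loop, after `lyrics` has been extracted:
words_per_sentence = [word_tokenize(s) for s in sent_tokenize(lyrics)]
all_words = [word for sentence in words_per_sentence for word in sentence]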

GitHub / jashanj0tsingh / LyricsScraper.py provides basic scraping of lyrics from genius.com into a text file, where each line represents one song. It takes the artist name as input. The resulting text file can then easily be fed into your custom NLTK or general-purpose parser to do what you want.

The code is as follows:

# A simple script to scrape lyrics from genius.com based on the artist name.

import re
import requests
import time
import codecs

from bs4 import BeautifulSoup
from selenium import webdriver

mybrowser = webdriver.Chrome(r"path\to\chromedriver\binary") # Browser to automate; point this at your ChromeDriver binary.

user_input = input("Enter Artist Name = ").replace(" ","+") # User_Input = Artist Name
base_url = "https://genius.com/search?q="+user_input # Append User_Input to search query
mybrowser.get(base_url) # Open in browser

t_sec = time.time() + 60*20 # seconds*minutes
while(time.time() < t_sec): # Keep scrolling until the time limit is reached. TODO: find a better way to detect the end of the page.
    mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    html = mybrowser.page_source
    soup = BeautifulSoup(html, "html.parser")
    time.sleep(5)

pattern = re.compile(r"[\S]+-lyrics$") # Keep only links whose href ends with "lyrics".
pattern2 = re.compile(r"\[(.*?)\]") # Remove section markers such as [Intro], [Chorus] etc. from the lyrics.

with codecs.open('lyrics.txt', 'a', 'utf-8-sig') as myfile:
    for link in soup.find_all('a', href=True):
        if pattern.match(link['href']):
            f = requests.get(link['href'])
            lyricsoup = BeautifulSoup(f.content, "html.parser")
            #lyrics = lyricsoup.find("lyrics").get_text().replace("\n","") # Each song on one line.
            lyrics = lyricsoup.find("lyrics").get_text() # Line by line
            lyrics = re.sub(pattern2, "", lyrics)
            myfile.write(lyrics + "\n")
mybrowser.close() # the with-block has already closed lyrics.txt
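Once the script has finished, the resulting lyrics.txt can be read back in and handed to NLTK, for example like this (a sketch, assuming the file was written with the utf-8-sig codec as above):

import codecs
from nltk.tokenize import word_tokenize, sent_tokenize

# Read the scraped lyrics back in and tokenize them.
with codecs.open('lyrics.txt', 'r', 'utf-8-sig') as f:
    text = f.read()

sentences = sent_tokenize(text)
words = [word_tokenize(s) for s in sentences]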
