识别特定列中最常见的词,但仅识别前10首歌曲(另一列)

2024-04-25 09:38:30 发布

您现在位置:Python中文网/ 问答频道 /正文

我在使用此代码时遇到一些问题。我应该从每年(1965-2015)的前10首歌曲中检索20个最常见的单词。有一个排名,所以我觉得我可以用排名来识别前10名<;=10但我对如何开始感到迷茫。这就是我目前所拥有的。我还没有收录排名前十的歌曲。另外,最常见的20个单词来自歌词栏(第4栏)

import collections
import csv
import re

words = re.findall(r'\w+', open('billboard_songs.csv').read().lower())
reader = csv.reader(words, delimiter=',')
csvrow = [row[4] for row in reader]
most_common = collections.Counter(words[4]).most_common(20)
print(most_common)

我的文件的第一行如下所示:

"Rank","Song","Artist","Year","Lyrics","Source"   
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs .....,3   

当它达到100(排名)时,第二年又从1开始,以此类推


Tags: csvtheimportremostsamcommon单词
1条回答
网友
1楼 · 发布于 2024-04-25 09:38:30

您可以使用csv.DictReader来解析文件,并从中获得可用的Python词典列表。然后,您可以使用for comprehensions和itertools.groupby()提取所需的歌曲信息。最后,您可以使用collections.Counter查找歌曲中最常见的单词

#!/usr/bin/env python

import collections
import csv
import itertools


def analyze_songs(songs):
    # Grouping songs by year (groupby can only be used with a sorted list)
    sorted_songs = sorted(songs, key=lambda s: s["year"])
    for year, songs_iter in itertools.groupby(sorted_songs, key=lambda s: s["year"]):
        # Extract lyrics of top 10 songs
        top_10_songs_lyrics = [
            song["lyrics"] for song in songs_iter if song["rank"] <= 10
        ]

        # Join all lyrics together from all songs, and then split them into
        # a big list of words.
        top_10_songs_words = (" ".join(top_10_songs_lyrics)).split()

        # Using Counter to find the top 20 words
        most_common_words = collections.Counter(top_10_songs_words).most_common(20)

        print(f"Year {year}, most common words: {most_common_words}")


with open("billboard_songs.csv") as songs_file:
    reader = csv.DictReader(songs_file)

    # Transform the entries to a simpler format with appropriate types
    songs = [
        {
            "rank": int(row["Rank"]),
            "song": row["Song"],
            "artist": row["Artist"],
            "year": int(row["Year"]),
            "lyrics": row["Lyrics"],
            "source": row["Source"],
        }
        for row in reader
    ]

analyze_songs(songs)

在这个回答中,我假设billboard_songs.csv的格式如下:

"Rank","Song","Artist","Year","Lyrics","Source"
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs","Source Evian"

我假设数据集是1965年到2015年,如问题中所述。如果没有,则应首先相应地过滤歌曲列表

相关问题 更多 >