Python: complex string algorithm

Posted 2024-06-16 10:01:42


I have a list:

listcdtitles = ["""    Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka)   """,
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro)  """,
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  """,
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams)   """,
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    """,
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """,
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """,
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)    """,
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """,
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """,
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above)  """,
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine)    """,
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """,
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """,
..............................]

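The list has about 14,000 elements.

I want to group together the strings that contain similar words.

Is there a way to do this? I don't think there is a single right or wrong approach.

Thanks a lot for any suggestions.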


2 Answers

First, parse all the strings and associate each token with its frequency. High-frequency tokens get blacklisted.
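A minimal sketch of that first step, not a definitive implementation. The helper names `tokenize` and `build_blacklist`, the `\w+` tokenization, and the `max_fraction` cutoff are all assumptions made for illustration; the third sample title is a shortened version of one from the question.

```python
import re
from collections import Counter

def tokenize(title):
    """Lowercase a title and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", title.lower())

def build_blacklist(titles, max_fraction=0.7):
    """Return the tokens that occur in more than max_fraction of the titles.

    max_fraction is a hypothetical tuning knob, not a recommended value.
    """
    doc_freq = Counter()
    for title in titles:
        # Count each token once per title, however often it repeats inside it.
        doc_freq.update(set(tokenize(title)))
    limit = max_fraction * len(titles)
    return {token for token, count in doc_freq.items() if count > limit}

sample = [
    "Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)",
    "Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71)",
    "Puccini, Verdi: Arias & Duets. (Peter Dvorsky w.Berlin Radio Symph./Paternostro)",
]
blacklist = build_blacklist(sample)
print(sorted(blacklist))
```

On this tiny sample only `symph` appears in every title, so it is the only token blacklisted. On the full list the cutoff would need tuning by hand, since composer names can be frequent too and are probably worth keeping.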

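Then compare the strings pairwise: iterate over them and associate each pair with a distance score. Based on that score, you either group the two strings together or you don't.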

That would be a simple approach.
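The comparison and grouping step could be sketched as follows. The answer does not name a particular score, so Jaccard similarity on word sets is used here as one possible choice, and titles are joined with a small union-find; `token_set`, `jaccard`, `group_titles`, and the `threshold` value are hypothetical names and settings, not part of the original answer.

```python
import re
from itertools import combinations

def token_set(title, blacklist=frozenset()):
    """Set of lowercase word tokens in a title, minus blacklisted tokens."""
    return {t for t in re.findall(r"\w+", title.lower()) if t not in blacklist}

def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_titles(titles, threshold=0.3, blacklist=frozenset()):
    """Group title indices whose pairwise similarity reaches the threshold."""
    sets = [token_set(t, blacklist) for t in titles]
    parent = list(range(len(titles)))  # union-find forest, one node per title

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(titles)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(i)] = find(j)  # join the two groups

    groups = {}
    for idx in range(len(titles)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())

sample = [
    "Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77)",
    "Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71)",
    "Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)",
]
print(group_titles(sample, threshold=0.2))
```

With `threshold=0.2` the two Bruckner titles end up in one group and the Lilien title stays alone. Because groups are merged transitively, a low threshold can chain unrelated titles together. Comparing every pair of 14,000 titles also means roughly 98 million comparisons, so this plain loop is only a starting point.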

I'm new to Python, but I have written some sample code that computes a similarity score between the entries in the list.

The code is below.

import re

listcdtitles = ["""    Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka)   """,
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro)  """,
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt)  """,
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams)   """,
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)    """,
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """,
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """,
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)    """,
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """,
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """,
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above)  """,
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine)    """,
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """,
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """]

entryDictionary = {}
i=0
for entry in listcdtitles:
    # remove punctuation from the lowercased string
    entry = re.sub(r'[^\w ]', '', entry.lower())
    # split the entry into words and store the word list in the dictionary
    entryDictionary[i] = entry.split()
    i = i + 1
# print the dictionary
print("Entries")
print(entryDictionary)

# define a score matrix, compare the words in each entry and if
# a word is same in both entries, that is one point
scoreMatrix = []
for k in range(i):
    scoreMatrix.append([])
    for j in range (i):
        if j>k:
            scoreMatrix[k].append(0)
        else:
            scoreMatrix[k].append("-")
k=0
j=0

for k in range(i-1):
    entry1 = entryDictionary[k]
    for j in range(k+1,i):
        entry2 = entryDictionary[j]
        for kk in range(len(entry1)):
            for jj in range(len(entry2)):
                if entry1[kk] != "" and entry1[kk] == entry2[jj]:
                    scoreMatrix[k][j] = scoreMatrix[k][j] + 1

print("Score Matrix (higher numbers denote higher similarity between two entries)")

# header row with the entry labels
print(repr("").rjust(10), end=" ")
for k in range(i-1):
    print(repr("Entry " + str(k)).rjust(10), end=" ")
print(repr("Entry " + str(i-1)).rjust(10))

# one row per entry; cells at or below the diagonal show "-"
for k in range(i):
    print(repr("Entry " + str(k)).rjust(10), end=" ")
    for j in range(i-1):
        print(repr(scoreMatrix[k][j]).rjust(10), end=" ")
    print(repr(scoreMatrix[k][i-1]).rjust(10))

The output is a score matrix in which higher numbers denote higher similarity between two entries.
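For a list of about 14,000 entries, the nested loops above visit every pair of entries and every pair of words. One possible alternative, sketched here under assumptions, is an inverted index from word to the entries containing it, so that only entries sharing at least one word are ever paired. The function name `shared_word_counts` is hypothetical. The sketch tokenizes with `\w+` and counts each distinct shared word once, so its scores can differ from those of the code above, which strips punctuation before splitting and counts repeated words multiple times.

```python
import re
from collections import defaultdict
from itertools import combinations

def shared_word_counts(titles):
    """Map each pair (i, j) with i < j to the number of distinct words shared."""
    index = defaultdict(set)  # word -> indices of the titles containing it
    for i, title in enumerate(titles):
        for word in set(re.findall(r"\w+", title.lower())):
            index[word].add(i)

    scores = defaultdict(int)
    for indices in index.values():
        # Every pair of titles listed under the same word shares that word.
        for i, j in combinations(sorted(indices), 2):
            scores[(i, j)] += 1
    return dict(scores)

sample = [
    "Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77)",
    "Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71)",
    "Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano)",
]
print(shared_word_counts(sample))
```

Pairs with no words in common never appear in the result. A word that occurs in very many titles still generates a large number of pairs, so dropping frequent words before building the index is likely to matter on the full list.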
