Grouping strings by substring in mysql/python or mysql/.n

Posted 2024-04-28 08:24:08

The data is stored in a MySQL database and looks like this:

5911    CD  $4.99   Eben, Landscapes of Patmos {w.Martin Lenniger, percussion}; 2 Choral Phantasies; Laudes. (All w.Sieglinde Ahrens, organ)
5913    CD  $5.99   Turina, Sevilliana; Rafaga; Hommage a Tarrega; Sonata. Rodrigo, 3 Piezas Espanolas; En Los Trigales; Sarabande Lointaine. (Eric Hill, guitar^)
145460  CD  $13.98  Wagner, The Flying Dutchman. (Hans Hotter, Astrid Varnay, Set Svanholm et al. Cond. Reiner. Rec.1950. PLEASE NOTE: Limited-pressing CDRs)
145461  CD  $13.98  Montemezzi, L'Amore dei Tre Re. (Virgilio Lazzari, Dorothy Kirsten, Charles Kullman, Robert Weede, Leslie Chabay et al. Cond. Giuseppe Antonicelli. Rec. 1949. PLEASE NOTE: Limited-pressing CDRs)
145462  CD  $13.98  Ponchielli, La Gioconda. (Zinka Milanov, Giacomo Vaghi, Leonard Warren, Rise Stevens, Richard Tucker, Margaret Harshaw et al. Cond. Emil Cooper. Rec. 1946. PLEASE NOTE: Limited-pressing CDRs)
145465  CD  $5.99   ' Yankele: Yiddish Songs'. (16 titles incl. Az der Rebe, Rozhinkes mit Mandlekh, Shabes, Yankele, Belz, Di Grine Kuzine. Moshe Leiser, voice and guitar. Ami Flammer, violin. Gerard Barreaux, accordion. Rec. 'live', Lyon Opera. Total time: 78')
145467  CD  $4.99   Brahms, Piano Trios 2 & 3. (Trio Bamberg: Evgeny Schuk, violin; Stephan Gerlinghaus, cello. Robert Benz, piano. Rec. Nuremberg, 4/7/2000. Total time: 51'45')
145468  CD  $4.99   Gaubert, Piece Romantique; Trois Aquarelles. Debussy, Premier Trio in G. Francaix, Trio. (Trio Cantabile: Hans-Jorg Wegner, flute. Guido Larisch, cello. Christiane Kroeker, piano. Rec. Hannover, 3/2001. Total time: 62'35')
145469  CD  $4.99   Gattermeyer, Heinrich [b.1923]: Ophelias Schattentheater [text by Michael Ende]. Matthias Drude [b.1960], Jorinde und Joringel. Christoph J. Keller [b.1959], Die Kristallkugel [both texts by Brother Grimm]. (Helmut Thiele, narrator w.Bernd-Christian Schulze, piano. Total time: 68'08')
145470  CD  $2.99   Morrill, Dexter [b.1938]: Dance Bagatelles for Viola & Piano; Three Lyric Pieces for Violin and Piano [Laura Klugherz, viola & violin. Jill Timmons, piano]; Fantasy for Solo Cello [James Kirkwood, cello]; String Quartet #2 [Tremont String Quartet]. (Total time: 51'03')
145471  CD  $2.99   Werntz, Julia: String Trio with Homage to Chopin [Curtis Macomber, violin. Lois Martin, viola. Ted Mook, cello]; 'To You Strangers'- Five Poems of Dylan Thomas for Mezzo-Soprano Solo [Christina Ascher]; Piano Piece [John McDonald]. John Mallia, Lock [Stephanie Kay, clarinet]; Poor Denizens of Hell [chamber ensemble/ Daniel Hosken]; Plexus 2. (Aura Group for New Music)
145472  CD  $2.99   Morrill, Dexter [b.1938]- 'Music for Trumpets': 'Ponzo' for Two Trumpets; 'Nine Pieces' for Solo Trumpet; 'TARR' for Four Trumpets & Computer; 'Studies' for Trumpet & Computer; 'Trumpet Concerto' for Trumpet & Piano. (Mark Ponzo, trumpet with Barbara Butler [trumpet] & William Koehler, piano. Total time: 52'02')
145473  CD  $2.99   Kallstrom, Michael [b.1956]: 'Stories'. (A chamber opera for solo performer with puppets and electronic tape based on Old Testament stories)
145474  CD  $2.99   Carosio, Vailati, Lechi, Ponchielli, D'Alessandro, Sterzati, Riva, Pucci, Casazza, Denti, Gnaga, Anelli, Feroldi: 'The Mandolins of Stradivari'. (16 pieces for mandolin ensemble et al. Ugo Orlandi, mandolin. Alessandro Bono, guitar. Maura Mazzonetto, piano. Giampaolo Baldin, baritone. Quartetto romantico a plettro 'Umbert Sterzati'. Orchestra di Mandolini e Chitarre 'Citta di Brescia'/ Mandonico. Total time: 77'19')
145475  CD  $3.99   Rachmaninov, Symphony #3; Symphonic Dances. (St. Petersburg Philharmonic/ Jansons. Total time: 72'16')

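I need each title to be grouped with 4 other titles that share words with it. For example, I might want to put 4 CDs that contain both of the words Beethoven and Mozart into one group.

However, I do not want to specify which words the grouping should be based on. I would like this to work in an AI-like way.

I think the algorithm should go something like this: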

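  1. Build a frequency distribution over all words
  2. Throw out the words that are common in English (such as if, or, the; where can I get a list of these?)
  3. Start grouping on the words that occur least frequently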

Does anyone know a clever way to do this grouping?


1 answer
User
Answer #1 · posted 2024-04-28 08:24:08

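Re (2): what you want are called "stopwords". NLTK, for example, provides them (it is Python, but I imagine there are C equivalents), as shown in its excellent online book: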

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always',
 ...]

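The book I cited can also help you with point 1, but point 3 really belongs to a different field: clustering. You need a very particular kind of clustering (a specified, equal size for every cluster), so existing algorithms may not suit you, but designing something along the lines you mention is not hard.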

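Basically, you want a "score" for each word that is higher for words that are rarer in English (NLTK, or any equally powerful C language-processing toolkit, can of course help you with that). The negative of the logarithm of the word's frequency could be a start.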
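A minimal sketch of that scoring idea follows. It makes two assumptions that are not in the answer: frequency is measured as document frequency inside your own collection rather than in general English, and `STOPWORDS` is a tiny hand-written stand-in for a real list such as NLTK's. The names `tokenize` and `word_scores` are made up for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stand-in for a real stopword list such as NLTK's.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "to", "with"}

def tokenize(title):
    """Lowercase a title and return its set of non-stopword tokens."""
    words = re.findall(r"[a-z]+", title.lower())
    # Drop stopwords and very short fragments such as single letters.
    return {w for w in words if w not in STOPWORDS and len(w) > 2}

def word_scores(titles):
    """Score each word as -log(document frequency); rarer words score higher."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(tokenize(title))  # each word counts once per title
    n = len(titles)
    return {w: -math.log(count / n) for w, count in doc_freq.items()}

titles = [
    "Brahms, Piano Trios 2 & 3",
    "Debussy, Premier Trio in G",
    "Rachmaninov, Symphony #3; Symphonic Dances",
    "Brahms, Symphony #1",
]
scores = word_scores(titles)
```

Here a word found in every title scores 0, and a word found in only one title gets the highest score. Swapping in frequencies from a reference English corpus would be closer to what is described above, at the cost of needing that corpus.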

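Given the specs you mention, you only need to score non-stopwords that appear in at least five documents, so the number of meaningful words should be fairly small, and an exhaustive search may even be feasible.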
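As a rough illustration, here is one possible greedy pass, not an exhaustive search. It only keeps words that occur in at least `group_size` titles, visits them from rarest to most common, and carves fixed-size groups out of the titles that are still unassigned. The function names, the single-shared-word criterion, and the small `STOPWORDS` set are all assumptions made for this sketch.

```python
import re
from collections import defaultdict

# Assumption: a tiny stand-in for a real stopword list such as NLTK's.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "to", "with"}

def tokenize(title):
    """Lowercase a title and return its set of non-stopword tokens."""
    words = re.findall(r"[a-z]+", title.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}

def group_titles(titles, group_size=5):
    """Greedily build groups of exactly group_size titles sharing one word.

    Returns (groups, leftovers): groups is a list of (word, [title ids]),
    leftovers is the sorted list of title ids that fit no group.
    """
    # Inverted index: word -> ids of the titles that contain it.
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for word in tokenize(title):
            index[word].add(doc_id)

    # Only words shared by at least group_size titles can define a group.
    # Rarest first; ties are broken alphabetically to stay deterministic.
    candidates = sorted(
        (w for w in index if len(index[w]) >= group_size),
        key=lambda w: (len(index[w]), w),
    )

    unassigned = set(range(len(titles)))
    groups = []
    for word in candidates:
        available = sorted(index[word] & unassigned)
        while len(available) >= group_size:
            members, available = available[:group_size], available[group_size:]
            groups.append((word, members))
            unassigned.difference_update(members)
    return groups, sorted(unassigned)

demo = [
    "Brahms Piano Trio",
    "Brahms Symphony",
    "Debussy Piano Trio",
    "Debussy Symphony",
    "Wagner Opera",
]
groups, leftovers = group_titles(demo, group_size=2)
```

With `group_size=2` the demo pairs the two Brahms titles and the two Debussy titles and leaves the Wagner title over. A greedy pass can miss groupings that an exhaustive search would find, so treat it as a baseline only.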

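In fact, the biggest problem may be a different one: what if you are left with a group of fewer than 5 documents that, between them, share no non-stopword with any of the other documents? The possibility of that happening suggests that you will have to relax your specs in some respect. Since I know nothing about your application I can't make specific suggestions, but this could range from allowing groups whose number of documents differs from 5 to loosening the criteria for grouping, and so on.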

Or would you rather diagnose that, in some cases, your strict constraints simply cannot be met, and give an error message instead of any result?
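If you go the strict route, the check could look roughly like the sketch below. `GroupingError` and `check_grouping` are hypothetical names, and the function assumes titles are identified by the integers 0 to `num_titles - 1` and that each group is a list of those ids.

```python
class GroupingError(ValueError):
    """Raised when the strict grouping constraints cannot be met (hypothetical name)."""

def check_grouping(num_titles, groups, group_size=5):
    """Raise GroupingError unless every title sits in exactly one full group."""
    seen = set()
    for members in groups:
        if len(members) != group_size:
            raise GroupingError(
                f"group {members} has {len(members)} titles, expected {group_size}"
            )
        overlap = seen.intersection(members)
        if overlap:
            raise GroupingError(
                f"titles {sorted(overlap)} appear in more than one group"
            )
        seen.update(members)
    # Any title id that never showed up in a group is reported as ungrouped.
    missing = sorted(set(range(num_titles)) - seen)
    if missing:
        raise GroupingError(f"titles {missing} could not be grouped")
```

Calling `check_grouping(5, [[0, 1], [2, 3]], group_size=2)` raises because title 4 is in no group, while the same call with `num_titles=4` passes silently.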
