Soundex algorithm in Python (homework help)

Asked 2025-04-15 15:23

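The US Census Bureau uses a special code called "soundex" to locate information about a person. The soundex code is derived from how a surname sounds rather than how it is spelled. Surnames that sound the same but are spelled differently, such as SMITH and SMYTH, have the same code and are filed together. The coding system was designed so that you can find a surname even if it was recorded under a different spelling.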

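In this lab you will design, code, and document a program that produces the soundex code for a surname. The user is prompted for a surname, and the program outputs the corresponding code.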

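Basic Soundex coding rules

Every soundex code for a surname consists of one letter and three digits. The letter is always the first letter of the surname. The remaining letters are assigned digits according to the soundex coding guide below. Zeros are appended at the end if necessary so that the code always has four characters. Any additional letters are ignored.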

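Soundex coding guide

Soundex assigns a digit to each consonant. Consonants that sound alike are assigned the same digit:

Number  Consonants
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

Soundex disregards the letters A, E, I, O, U, H, W, and Y.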
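The coding guide can be expressed as a lookup table. The following is a minimal sketch, not part of the assignment text; the names `SOUNDEX_GROUPS`, `LETTER_TO_DIGIT`, and `letter_code` are made up for illustration.

```python
# Digit groups copied from the coding guide; letters not listed get no digit.
SOUNDEX_GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the table into a per-letter lookup: {"B": "1", "F": "1", ...}.
LETTER_TO_DIGIT = {
    letter: digit
    for digit, letters in SOUNDEX_GROUPS.items()
    for letter in letters
}


def letter_code(letter):
    """Return the Soundex digit for a letter, or "" if the letter is disregarded."""
    return LETTER_TO_DIGIT.get(letter.upper(), "")


print(letter_code("k"))  # 2
print(letter_code("E") == "")  # True: vowels have no digit
```

Returning an empty string for disregarded letters keeps the caller free to decide how those letters interact with the additional rules.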

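There are three additional Soundex coding rules to follow. A good program design would implement them as one or more separate functions.

Rule 1. Names with double letters

If the surname has any double letters, they should be treated as one letter. For example:

Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).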

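Rule 2. Names with letters side by side that have the same Soundex digit

If the surname has different letters side by side that have the same digit in the soundex coding guide, they should be treated as one letter. Examples:

Pfister is coded P236 (P, F ignored because it has the same digit as P, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded J250 (J, 2 for the C, K ignored because it has the same digit as C, S ignored, 5 for the N, 0 added).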
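Rules 1 and 2 can be handled together, because a doubled letter is just two adjacent letters with the same digit. Here is a rough sketch under that assumption; `CODES` and `collapse_runs` are hypothetical names, and the letters H and W are not given any special treatment yet.

```python
from itertools import groupby

# Hypothetical lookup table built from the coding guide.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def collapse_runs(name):
    """Code every letter, then keep one entry per run of equal adjacent codes.

    Disregarded letters are coded as "" and so still separate the digits
    on either side of them.
    """
    codes = [CODES.get(ch, "") for ch in name.upper()]
    return [code for code, _run in groupby(codes)]


print(collapse_runs("Gutierrez"))  # ['2', '', '3', '', '6', '', '2']
print(collapse_runs("Pfister"))    # ['1', '', '2', '3', '', '6']
print(collapse_runs("Jackson"))    # ['2', '', '2', '', '5']
```

The first entry belongs to the initial letter, which is written as a letter rather than a digit. Dropping it and the empty entries leaves 362, 236, and 25, which match the worked examples once padded.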

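Rule 3. Consonant separators

3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex digit, the consonant to the right of the vowel is coded. Example:

Tymczak is coded T522 (T, 5 for the M, 2 for the C, Z ignored (see the "side by side" rule above), 2 for the K). Because the vowel "A" separates the Z and the K, the K is coded.

3.b. If "H" or "W" separates two consonants that have the same soundex digit, the consonant to the right is not coded. Example:

Ashcraft is coded A261 (A, 2 for the S, C ignored because it has the same digit as S with only an H in between, 6 for the R, 1 for the F). It is not coded A226.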
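One way to read rule 3 is that H and W are transparent (the previously seen digit is remembered across them), while a vowel resets the remembered digit. The sketch below illustrates that reading; `CODES` and `digits_after_initial` are invented names, and treating Y like a vowel is an assumption, since the rules above do not say how Y separates consonants.

```python
# Hypothetical lookup table built from the coding guide.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def digits_after_initial(name):
    """Return the digits for every letter after the first, before padding."""
    name = name.upper()
    previous = CODES.get(name[0], "")  # the initial's own digit counts too
    digits = ""
    for ch in name[1:]:
        if ch in "HW":
            continue  # rule 3.b: skip H and W without forgetting `previous`
        current = CODES.get(ch, "")
        if current and current != previous:
            digits += current
        previous = current  # rule 3.a: a vowel sets this to "", so the next
                            # consonant is coded even if its digit repeats
    return digits


print(digits_after_initial("Tymczak"))   # 522
print(digits_after_initial("Ashcraft"))  # 2613 (cut to 261 in the final code)
```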

This is my code so far:

surname = raw_input("Please enter surname:")
outstring = ""

outstring = outstring + surname[0]
for i in range (1, len(surname)):
        nextletter = surname[i]
        if nextletter in ['B','F','P','V']:
            outstring = outstring + '1'

        elif nextletter in ['C','G','J','K','Q','S','X','Z']:
            outstring = outstring + '2'

        elif nextletter in ['D','T']:
            outstring = outstring + '3'

        elif nextletter in ['L']:
            outstring = outstring + '4'

        elif nextletter in ['M','N']:
            outstring = outstring + '5'

        elif nextletter in ['R']:
            outstring = outstring + '6'

print outstring

It mostly does what is asked, but I am not sure how to code the three rules. That is where I need help, so any help is much appreciated.


3 Answers

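Programs often have to handle data that arrives from different places, such as user input, files, or network requests. Before a program can work with that data, it has to be converted into a format the program can process.

For example, a name read from a form arrives as a string (a sequence of characters), but the program may need to turn it into a specific object for later use.

This process is called "data conversion". It lets data of different types be combined so that the program runs smoothly.

In practice, tools or libraries can do these conversions, which saves time and reduces the chance of errors.

In short, data conversion is an important part of programming, and understanding it makes handling data much easier.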

import itertools  # groupby collapses runs of identical characters
import re  # re.sub performs the letter-to-digit substitutions

surname = input("Enter surname of the author: ")  # ask for the first surname

while surname != "":  # keep looping until the user submits an empty line

    name = surname.upper()  # Soundex is case-insensitive
    str_ini = name[0]  # the code always starts with the first letter

    # Substitute the digits in the whole name, first letter included, so
    # that a letter sharing the initial's digit (the F in Pfister) is caught.
    coded = re.sub(r'[BFPV]', '1', name)
    coded = re.sub(r'[CGJKQSXZ]', '2', coded)
    coded = re.sub(r'[DT]', '3', coded)
    coded = re.sub(r'L', '4', coded)
    coded = re.sub(r'[MN]', '5', coded)
    coded = re.sub(r'R', '6', coded)

    # Rule 3.b: H and W do not separate equal digits, so drop them first.
    coded = coded[0] + re.sub(r'[HW]', '', coded[1:])

    # Rules 1 and 2: replace each run of identical characters with one.
    coded = ''.join(char for char, rep in itertools.groupby(coded))

    # Rule 3.a: vowels (and Y) are removed only now, after they have kept
    # the equal digits on either side of them apart.
    digits = re.sub(r'[AEIOUY]', '', coded[1:])

    # Keep at most four characters and pad with trailing zeros.
    print((str_ini + digits + "000")[:4] + "\n")

    print("Press enter to exit")  # an empty line ends the while loop
    surname = input("Enter surname of the author: ")  # ask for the next surname

Answered 2025-04-15 by Python大师


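This isn't perfect (for example, it produces the wrong result if the input doesn't start with a letter), and it doesn't implement the rules as independently testable functions, so it doesn't really count as an answer to the homework question. But this is how I'd implement it: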

import re

def soundex_prepare(s):
    """Prepare string for Soundex encoding.

    Remove non-alpha characters (and the not-of-interest W/H/Y),
    convert to upper case, and remove all runs of repeated letters."""
    p = re.compile("[^a-gi-vxz]", re.IGNORECASE)
    s = re.sub(p, "", s).upper()
    for c in set(s):
        s = re.sub(c + "{2,}", c, s)
    return s

def soundex_encode(s):
    """Encode a name string using the Soundex algorithm."""
    result = s[0].upper()
    s = soundex_prepare(s[1:])
    letters = 'ABCDEFGIJKLMNOPQRSTUVXZ'
    codes   = '.123.12.22455.12623.122'
    d = dict(zip(letters, codes))
    # Start from the first letter's own code so that a following letter
    # with the same code (the F in Pfister) is skipped.
    prev_code = d.get(result, "")
    for c in s:
        code = d[c]
        if code != "." and code != prev_code:
            result += code
            if len(result) >= 4:
                break
        prev_code = code
    return (result + "0000")[:4]

Answered 2025-04-15 by Python大师


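I suggest you try the following.

Keep one variable for the current code and one for the previous code, so that both are available before you output anything.
Break the system down into useful functions, such as:
1. IsVowel(Char), which tells whether a character is a vowel
2. Coded(Char), which encodes a single character
3. IsRule1(Char, Char), which tells whether two characters fall under rule 1

Once it is broken down well, it becomes much easier to manage.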
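The decomposition suggested above might look roughly like this. It is only a sketch: the snake_case names `is_vowel`, `coded`, and `soundex` are adaptations of the suggested names, it assumes a non-empty surname that starts with a letter, and treating Y like a vowel is an assumption.

```python
def is_vowel(ch):
    """Tell whether a character is one of the vowels A, E, I, O, U."""
    return ch.upper() in "AEIOU"


def coded(ch):
    """Return the Soundex digit for a character, or "" if it has none."""
    groups = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
              "4": "L", "5": "MN", "6": "R"}
    for digit, letters in groups.items():
        if ch.upper() in letters:
            return digit
    return ""


def soundex(surname):
    """Build the four-character code while tracking current and previous codes."""
    name = surname.upper()
    result = name[0]
    previous = coded(name[0])
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W do not separate equal codes
        if is_vowel(ch) or ch == "Y":
            previous = ""  # a vowel separates equal codes
            continue
        current = coded(ch)
        if current and current != previous:
            result += current
        previous = current
    return (result + "000")[:4]  # pad with zeros, then cut to four characters


for surname in ["Gutierrez", "Pfister", "Jackson", "Tymczak", "Ashcraft"]:
    print(surname, soundex(surname))
```

Because `is_vowel` and `coded` take a single character, each can be tested on its own before the loop in `soundex` is written.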

Answered 2025-04-15 by Python大师
