Extracting names from a line

2 answers

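User

#1 · edited 2024-06-08 19:28:51

Not a code answer, but it looks like you can get most or all of the data you want from the licensing board at http://www.abec.alabama.gov/rostersearch2.asp?search=%25&submit1=Search. The names are easy to find there.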

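User

#2 · edited 2024-06-08 19:28:51

The best thing to do is find a different data source. Seriously. This one is badly broken.

If you can't do that, I would do something like this: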

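1. Replace all double spaces with single spaces.
2. Split the line on spaces.
3. Take the last two items in the list. They are the lat and lng.
4. Loop backwards through the list, looking each item up in a list of potential languages. When a lookup fails, you are done with languages.
5. Join the remaining list items back together with spaces.
6. In that line, find the first open paren. Read in about 13 or 14 characters, replace all punctuation with empty strings, and reformat the result as a normal phone number.
7. Split the rest of the line after the phone number on commas.
8. Using that split, loop through each item in the list. If the text starts with more than one capital letter, add it to certifications. Otherwise, add it to areas of practice.
9. Go back to the index you found in step 6 and take the line up to that point. Split it on spaces and take the last item. That is the state. All that is left is the name and the city!
10. Take the first 2 items in the space-split line. So far, that is your best guess for the name.
11. Look at the third item. If it is a single letter, add it to the name and remove it from the list.
12. Download the US zip codes from here: http://download.geonames.org/export/zip/US.zip
13. In the US data file, split each line on tabs. Take the data at index 2 and index 4, which are the city name and the state abbreviation. Loop through all the data and insert each row into a new list, joined as abbreviation + ":" + city name (i.e. AK:Sand Point).
14. Build every possible join of the items remaining in the line, in the same format as step 13. So you end up with AL:Brown-Birmingham and AL:Birmingham for the second line.
15. Loop through each combination and search for it in the list created in step 13. If you find it, remove it from the split list.
16. Add all items remaining in the split list to the person's name.
17. If desired, split the name on the comma. index[0] is the last name and index[1] is all the remaining names. Don't make any assumptions about middle names.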
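
Step 6 only describes the phone clean-up in words, so here is a minimal sketch of it. The helper name `normalize_phone`, the assumption that a valid number has exactly 10 digits, and the `205-555-0100` output format are all my own choices for illustration; the data set may well contain numbers that do not fit.

```python
import re


def normalize_phone(line):
    """Return (formatted_phone, index_of_open_paren).

    Returns (None, -1) when the line has no open paren, and
    (None, index) when the text after the paren is not a 10-digit number.
    """
    start = line.find("(")
    if start == -1:
        return None, -1

    # Read about 14 characters from the paren, e.g. "(205) 555-0100"
    chunk = line[start:start + 14]

    # Drop everything that is not a digit (parens, spaces, dashes, dots)
    digits = re.sub(r"\D", "", chunk)
    if len(digits) != 10:  # assumption: US numbers without a country code
        return None, start

    return "{}-{}-{}".format(digits[:3], digits[3:6], digits[6:]), start
```

The returned index is the same one step 9 needs, so the line can be cut into a "before phone" part and an "after phone" part without searching for the paren twice.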

Just for fun, I implemented this. Enjoy.

import itertools

# this list of languages could be longer and should read from a file
languages = ["English", "Spanish", "Italian", "Japanese", "French",
             "Standard Chinese", "Chinese", "Hindi", "Standard Arabic", "Russian"]

languages = [language.lower() for language in languages]

# Loop through US.txt and format each row as "STATE:City". Download from geonames.org.
# A set keeps the membership test done for every teacher line fast.
cities = set()
with open('US.txt', 'r') as us_data:
    for line in us_data:
        line_split = line.split("\t")
        cities.add("{}:{}".format(line_split[4], line_split[2]))

# This is the dataset
with open('state-teachers.txt', 'r') as teachers:
    next(teachers)  # skip header

    for line in teachers:
        # Replace all double spaces with single spaces
        while line.find("  ") != -1:
            line = line.replace("  ", " ")

        line_split = line.split(" ")

        # Lat/Lon are the last 2 items
        longitude = line_split.pop().strip()
        latitude = line_split.pop().strip()

        # Search for potential languages and trim off the line as we find them
        teacher_languages = []

        while line_split:  # stop cleanly if the line runs out of items
            language_check = line_split[-1]
            if language_check.lower().replace(",", "").strip() in languages:
                teacher_languages.append(language_check)
                del line_split[-1]
            else:
                break

        # Rejoin everything and then use phone number as the special key to split on
        line = " ".join(line_split)

        phone_start = line.find("(")
        phone = line[phone_start:phone_start+14].strip()

        after_phone = line[phone_start+15:]

        # Certifications can be recognized as acronyms
        # Anything else is assumed to be an area of practice
        certifications = []
        areas_of_practice = []

        specialties = after_phone.split(",")
        for specialty in specialties:
            specialty = specialty.strip()
            if not specialty:
                continue  # skip empty pieces, e.g. from a trailing comma
            if specialty[0:2].upper() == specialty[0:2]:
                certifications.append(specialty)
            else:
                areas_of_practice.append(specialty)

        before_phone = line[0:phone_start-1]
        line_split = before_phone.split(" ")

        # State is the last column before phone
        state = line_split.pop()

        # Name should be the first 2 columns, at least. This is a basic guess.
        name = line_split[0] + " " + line_split[1]

        line_split = line_split[2:]

        # Add a middle initial if the next item is a single character
        if line_split and len(line_split[0].strip()) == 1:
            name += " " + line_split[0].strip()
            line_split = line_split[1:]

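        # Every contiguous run of the remaining words, longest first, so that
        # multi-word cities such as "Sand Point" are tried before single words
        combos = sorted({" ".join(line_split[i:j])
                         for i in range(len(line_split))
                         for j in range(i + 1, len(line_split) + 1)},
                        key=lambda combo: (-len(combo), combo))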

        line = " ".join(line_split)
        city = ""

        # See if the state:city combo is valid. If so, set it and let everything else be the name
        for combo in combos:
            if "{}:{}".format(state, combo) in cities:
                city = combo
                line = line.replace(combo, "")
                break

        # Remaining data must be a name
        if line.strip() != "":
            name += " " + line

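        # Clean up names: "Last, First ..." becomes first and last name.
        # partition() also copes with names that have no comma or several.
        last_name, _, first_name = [piece.strip() for piece in name.partition(",")]

        print("{} {}".format(first_name, last_name))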
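
One thing to watch in the language step: because the line is split on single spaces, a two-word entry such as "Standard Chinese" is compared one word at a time, so it appears that only "Chinese" would be recognised and "Standard" would be left behind in the line. The sketch below is one possible way to handle that, by trying the longest run of trailing words first. The function name `pop_trailing_languages` and the shortened language set are assumptions made for the example.

```python
# Hypothetical, shortened language set; the real one should be read from a file
LANGUAGES = {"english", "spanish", "chinese", "standard chinese", "standard arabic"}

# Longest language name, counted in words
MAX_WORDS = max(len(name.split()) for name in LANGUAGES)


def pop_trailing_languages(tokens):
    """Remove language names from the end of tokens (in place).

    Returns the languages that were found, in their original order.
    """
    found = []
    while tokens:
        # Try the longest candidate first so "Standard Chinese" wins over "Chinese"
        for size in range(min(MAX_WORDS, len(tokens)), 0, -1):
            candidate = " ".join(tokens[-size:]).replace(",", "").strip()
            if candidate.lower() in LANGUAGES:
                found.append(candidate)
                del tokens[-size:]
                break
        else:
            # No run of trailing words is a language, so we are done
            break
    found.reverse()
    return found
```

It would be called on the split line right after the latitude and longitude have been popped, in place of the single-word `while` loop.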
