How do I count similar email domains and print each domain only once? [python]


Asked on 2025-04-18 14:56 by 数据工程师

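I have a dataset containing 10 hotmail addresses, 4 gmail addresses, and 3 mail.com addresses. I want to go through these addresses, count how many belong to each domain (hotmail, gmail, and so on), and print the result. My current approach is rather brute-force, though.

I know Python allows for concise, elegant code (for example with itertools, islice, xrange, and so on).

The result I want is: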

hotmail: 10
gmail: 4
mail.com: 3

But the result I get is:

hotmail
10
hotmail
10
...
hotmail
10
gmail
4
gmail
4
gmail
4
gmail
4
and so on.

def count_domains( emails):

    for email in emails:

        current_email = email.split("@", 2)[1] # splits at @, john@mail.com => mail.com, 
                                               #2nd index in the list
        print(current_email)
        current_domain_counter = 0
        for email2 in emails:
            if current_email == email2.split("@",2)[1]:
                current_domain_counter = current_domain_counter + 1
        #print(current_email current_domain_counter)
        print(current_domain_counter)


3 Answers

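I think you are doing more work than necessary. Splitting the string isn't needed. Just check whether each address contains a keyword such as "@gmail.com", "@hotmail.com", or "@mail.com", and increment a separate counter for each keyword.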

gmail_counter = 0
hotmail_counter = 0
mail_counter = 0
# Add as many counters as required
for email in emails:
    # The leading "@" keeps "@mail.com" from matching gmail or hotmail addresses
    if "@gmail.com" in email:
        gmail_counter += 1
    elif "@hotmail.com" in email:
        hotmail_counter += 1
    elif "@mail.com" in email:
        mail_counter += 1
    # ...
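
The same keyword check can be driven by a dictionary instead of one variable per domain. The following is only a minimal sketch: the function name `count_known_domains` and its default `domains` tuple are made up for illustration, and it assumes you know the domains of interest in advance.

```python
def count_known_domains(emails, domains=("gmail.com", "hotmail.com", "mail.com")):
    """Count addresses per known domain (`domains` is a hypothetical default)."""
    counts = {domain: 0 for domain in domains}
    for email in emails:
        for domain in domains:
            # Match "@" + domain at the end so "mail.com" does not match "gmail.com".
            if email.endswith("@" + domain):
                counts[domain] += 1
                break
    return counts


sample = ["a@gmail.com", "b@hotmail.com", "c@mail.com", "d@gmail.com"]
for domain, count in count_known_domains(sample).items():
    print("{}: {}".format(domain, count))
```

Addresses whose domain is not listed are simply skipped, so this approach only fits cases where the set of domains is fixed.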

Answered on 2025-04-18 by Python大师


You can use collections.Counter:

from collections import Counter

emails = ['me@mail.com', 'you@mail.com', 'me@gmail.com', 'you@gmail.com', 'them@gmail.com',
          'you@hotmail.com', 'me@hotmail.com', 'you@hotmail.com', 'them@hotmail.com']

def count_domains(emails):
    c = Counter()
    for email in emails:
        # Split at "@" and take the second element: john@mail.com => mail.com
        current_domain = email.split("@", 2)[1]
        # Wrap in a list, otherwise update() would count each character
        c.update([current_domain])
    print(c.most_common())  # domains ordered from most to least common
    print("gmail.com count = {}".format(c["gmail.com"]))
    print("mail.com count = {}".format(c["mail.com"]))
    print("hotmail.com count = {}".format(c["hotmail.com"]))

# The function prints its own output and returns None, so call it directly
count_domains(emails)

[('hotmail.com', 4), ('gmail.com', 3), ('mail.com', 2)]
gmail.com count = 3
mail.com count = 2
hotmail.com count = 4
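
To get closer to the `domain: count` layout asked for in the question, with each domain printed once, the Counter can be built in one pass and then iterated. This is a minimal sketch rather than a definitive version: the helper name `domain_counts` is made up for illustration, and it prints the full domain (for example `gmail.com`), not the shortened `gmail`.

```python
from collections import Counter


def domain_counts(emails):
    # Take everything after the last "@" as the domain.
    return Counter(email.rsplit("@", 1)[-1] for email in emails)


emails = ["me@mail.com", "you@mail.com", "me@gmail.com", "you@gmail.com", "them@gmail.com"]

# most_common() yields each domain exactly once, highest count first.
for domain, count in domain_counts(emails).most_common():
    print("{}: {}".format(domain, count))
```

Because a Counter stores one entry per distinct key, each domain appears only once in the loop no matter how many addresses share it.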

Answered on 2025-04-18 by Python大师


If you put all the strings into a list, say myList, you can make them unique, that is, remove the duplicate strings, like this:

uniqueList = list(set(myList))

After that, you can count how many times a string occurs, for example the first one:

countFirst = myList.count(uniqueList[0])

You can also combine the two, for example:

[[domain,myList.count(domain)] for domain in set(myList)]
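
Note that this only counts per domain if the list holds domains and not full addresses. As a rough sketch of how it might be applied to the question's data, assuming the list contains whole email addresses (the variable names `domains` and `pairs` are made up for illustration):

```python
emails = ["me@mail.com", "you@mail.com", "me@gmail.com"]

# Extract the domain part first so counting happens per domain, not per address.
domains = [email.split("@")[-1] for email in emails]

# sorted() makes the order deterministic; a bare set has no defined order.
pairs = [(domain, domains.count(domain)) for domain in sorted(set(domains))]

for domain, count in pairs:
    print("{}: {}".format(domain, count))
```

Each call to `count()` scans the whole list again, which is fine for a few dozen addresses but gets slow on large datasets.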

Answered on 2025-04-18 by Python大师

