在Python中从MySQL按计数查找top x?

2024-04-29 13:20:11 发布

您现在位置:Python中文网/ 问答频道 /正文

我有这样一个csv文件:

nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@gmail.com, 01-05-2014

我正在阅读上述csv文件和提取域名,也按域名和日期电子邮件地址计数以及。所有这些我需要插入到MySQL表名为domains,我能够成功地做到。你知道吗

问题声明:-现在我需要使用同一个表来报告前50个域,按最近30天的增长百分比排序。而这正是我无法理解的,我该怎么做呢?你知道吗

下面的代码是我能够成功地插入到MySQL数据库,但无法完成上述报告任务,因为我无法理解如何实现这项任务?你知道吗

#!/usr/bin/python
import fileinput
import csv
import os
import sys
import time
import MySQLdb

from collections import defaultdict, Counter

domain_counts = defaultdict(Counter)

# ======================== Defined Functions ======================
def get_file_path(filename):
    currentdirpath = os.getcwd()  
    # get current working directory path
    filepath = os.path.join(currentdirpath, filename)
    return filepath
# ===========================================================
def read_CSV(filepath):

    with open('emails.csv') as f:
        reader = csv.reader(f)
        for row in reader:
            domain_counts[row[0].split('@')[1].strip()][row[1]] += 1

    db = MySQLdb.connect(host="localhost", # your host, usually localhost
                         user="root", # your username
                         passwd="abcdef1234", # your password
                         db="test") # name of the data base
    cur = db.cursor()

    q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""


    for domain, data in domain_counts.iteritems():
        for email_date, email_count in data.iteritems():
             cur.execute(q, (domain, email_count, email_date))

    db.commit()

# ======================= main program =======================================
path = get_file_path('emails.csv') 
read_CSV(path) # read the input file

使用domains表时执行报告任务的正确方法是什么。你知道吗

更新:

这是我的域表:

mysql> describe domains;
+----------------+-------------+------+-----+---------+----------------+
| Field          | Type        | Null | Key | Default | Extra          |
+----------------+-------------+------+-----+---------+----------------+
| id             | int(11)     | NO   | PRI | NULL    | auto_increment |
| domain_name    | varchar(20) | NO   |     | NULL    |                |
| cnt            | int(11)     | YES  |     | NULL    |                |
| date_of_entry  | date        | NO   |     | NULL    |                |
+-------------+-------------+------+-----+---------+----------------+

这是我的数据:

mysql> select * from domains;
+----+---------------+-------+------------+
| id | domain_name   | count | date_entry |
+----+---------------+-------+------------+
|  1 | wawa.com      |     2 | 2014-04-30 |
|  2 | wawa.com      |     2 | 2014-05-01 |
|  3 | wawa.com      |     3 | 2014-05-31 |
|  4 | uwaterloo.ca  |     4 | 2014-04-30 |
|  5 | uwaterloo.ca  |     3 | 2014-05-01 |
|  6 | uwaterloo.ca  |     1 | 2014-05-31 |
|  7 | anonymous.com |     2 | 2014-04-30 |
|  8 | anonymous.com |     4 | 2014-05-01 |
|  9 | anonymous.com |     8 | 2014-05-31 |
| 10 | hotmail.com   |     4 | 2014-04-30 |
| 11 | hotmail.com   |     1 | 2014-05-01 |
| 12 | hotmail.com   |     3 | 2014-05-31 |
| 13 | gmail.com     |     6 | 2014-04-30 |
| 14 | gmail.com     |     4 | 2014-05-01 |
| 15 | gmail.com     |     8 | 2014-05-31 |
+----+---------------+-------+------------+

Tags: csvpathnameimportcomdbdateemail
2条回答

您需要的报告可以在MySQL端用SQL完成,Python可以用来调用查询、导入结果集和打印结果。你知道吗

考虑以下聚合查询,其中子查询和派生表遵循百分比增长公式:

((this month domain total cnt) - (last month domain total cnt))
 / (last month all domains total cnt)

SQL

SELECT  domain_name, pct_growth
FROM (

SELECT t1.domain_name,  
         # SUM OF SPECIFIC DOMAIN'S CNT BETWEEN TODAY AND 30 DAYS AGO  
        (Sum(CASE WHEN t1.date_of_entry >= (CURRENT_DATE - INTERVAL 30 DAY) 
                  THEN t1.cnt ELSE 0 END)               
         -
         # SUM OF SPECIFIC DOMAIN'S CNT AS OF 30 DAYS AGO
         Sum(CASE WHEN t1.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY) 
                  THEN t1.cnt ELSE 0 END) 
        ) /   
        # SUM OF ALL DOMAINS' CNT AS OF 30 DAYS AGO
        (SELECT SUM(t2.cnt) FROM domains t2 
          WHERE t2.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY))
         As pct_growth   

FROM domains t1
GROUP BY t1.domain_name
) As derivedTable

ORDER BY pct_growth DESC
LIMIT 50;

Python

cur = db.cursor()
sql = "SELECT * FROM ..."  # SEE ABOVE 

cur.execute(sql)

for row in cur.fetchall():
   print(row)

如果我没弄错的话,你只需要过去三十天和总数的比率。您可以使用条件聚合来获得它。所以,假设cnt总是大于0

select d.domain_name,
       sum(cnt) as CntTotal,
       sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) as Cnt30Days,
       (sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) / sum(cnt)) as Ratio30Days
from domains d
group by d.domain_name
order by Ratio30Days desc;

相关问题 更多 >