Removing blank lines from Hive query output saved to the local file system

0 votes

3 answers

936 views

Asked 2025-04-18 07:52

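I run a Python script on my development machine. It connects over SSH to a gateway server and launches another Python script there, which runs Hive queries and returns the results to me. I save the results on my development machine in a file named datestamp.tsv.

Some queries require me to loop over two clusters. The results are saved, but the output contains blank lines, and I would like the timestamp to come at the end of each row. My current output looks like this: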

2014_03_28 PT 588.12    396.73

2014_03_28 DB 0.17      0.0

A blank line appears after every query run in the loop.

How can I remove the blank line and move the timestamp to the end? The output format I want is:

PT 588.12    396.73 2014_03_28
DB 0.17      0.0  2014_03_28

Parent script:

def get_compute_resources():
  global output
  ensure_directory(pipeline_name, user, star_date, "daily_compute_resources")
  for grid in grids:
    cmd = 'ssh -2 -i /home/abcd/.ssh/id_dsa -l abcd -o StrictHostKeyChecking=no -o CheckHostIP=no hostname "python2.6 /homes/abcd/starling/fetch_daily_user_summary.py -u ' + user + ' -g ' + grid + ' -d ' + starling_date + '" >> /home/abcd/projects/starling/daily_compute_resources/'+ pipeline_name +'/'+ user +'/'+ starling_date +'.tsv'
    resources = make_call(cmd).rstrip()
    print resources

Remote machine script:

cmd = "/home/y/bin/hive -e 'use star; SELECT ROUND(SUM((map_slot_seconds)/3600/24/2),2), ROUND(SUM((reduce_slots_seconds)/3600/24/2),2) from starling_job_summary where user=%s and grid=%s and dt like %s group by dt;' -hiveconf mapred.job.queue.name=unfunded -hiveconf mapred.reduce.tasks=1" % (user, grid, date)
resources = Popen(cmd, shell=True, stdout=PIPE).communicate()[0]
output = output_date+' '+output_grid+' '+resources
print output

Thanks.


3 Answers

I think you need to change your print statement so that it ends with a comma:

print output,

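According to the official Python documentation:

If the print statement does not end with a comma, a newline character ('\n') is written at the end automatically.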
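
To see where the blank line comes from, the sketch below reproduces the effect in Python 3 syntax, where end="" plays roughly the role of the trailing comma in the Python 2 print statement. The sample resources string is an assumption about what the Hive command returns: one row of values terminated by a newline.

```python
import io
from contextlib import redirect_stdout

# Assumed shape of the Hive output captured by Popen: one row ending in "\n".
resources = "588.12\t396.73\n"
output = "2014_03_28" + " " + "PT" + " " + resources


def captured(print_call):
    """Run print_call and return everything it wrote to stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        print_call()
    return buffer.getvalue()


# A plain print appends its own "\n" after the one already in `output`,
# which shows up as an empty line in the saved file.
with_blank_line = captured(lambda: print(output))

# Suppressing print's newline leaves only the one that came from Hive.
without_blank_line = captured(lambda: print(output, end=""))

print(repr(with_blank_line))
print(repr(without_blank_line))
```

If this assumption holds, the first string ends with two newline characters and the second with one, which matches the blank line seen after each query.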

Answered 2025-04-18 by Python大师


The extra whitespace probably comes from leading or trailing newlines in output_date and resources. Try this:

print '{date} {grid} {res}'.format(date=output_date.strip(),
                                   grid=grid,
                                   res=resources.strip())

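In general, str.format is the conventional way to build strings that contain variable data. You do something similar with % in the child script, but using this approach would also make the parent script easier to read.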
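
The same idea can also put the date at the end of the row, as the question asks. This is a minimal sketch in Python 3 syntax; format_row is a hypothetical helper that is not part of either script, and the sample values are taken from the output shown in the question.

```python
def format_row(output_date, grid, resources):
    """Return 'grid values date' with stray whitespace and newlines removed."""
    return "{grid} {res} {date}".format(
        date=output_date.strip(),
        grid=grid.strip(),
        res=resources.strip(),
    )


# The trailing "\n" mimics what the Hive command line is assumed to emit.
row = format_row("2014_03_28\n", "PT", "588.12\t396.73\n")
print(row)
```

Stripping each piece before formatting means a single print adds the only newline, so no blank line should follow the row.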

Answered 2025-04-18 by Python大师


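This should work. It assumes you have a file named input.txt in the folder where you run Python, and it writes the data you want to a file named output.txt. The if line.strip() check skips lines that contain only whitespace. Beyond that, the only slightly clever part is the maxsplit argument to split(), which separates the date from the rest of the line.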

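infile = 'input.txt'
outfile = 'output.txt'

# Python 3 code: str.split accepts maxsplit as a keyword argument.
with open(infile) as f:
    with open(outfile, mode='w') as output:
        data = f.readlines()
        for line in data:
            # Skip lines that contain only whitespace.
            if line.strip():
                # Split once: the leading date, then the rest of the row.
                date, rest = line.split(maxsplit=1)
                date = date.strip()
                rest = rest.strip()
                output.write(rest + ' ' + date + "\n")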

The whitespace handling could probably be tightened a little, but this way is simpler and clearer.

Output:

PT 588.12    396.73 2014_03_28
DB 0.17      0.0 2014_03_28
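
Because the scripts in the question run under python2.6, where str.split takes no keyword arguments, a variant that passes the arguments positionally may be useful. This is only a sketch: move_date_last is a hypothetical name, and it works on a list of lines rather than on files so that it is easy to try out.

```python
def move_date_last(lines):
    """Drop blank lines and move the leading date to the end of each row."""
    cleaned = []
    for line in lines:
        if not line.strip():
            continue  # skip lines that contain only whitespace
        # split(None, 1): split on whitespace, at most once (positional form)
        date, rest = line.split(None, 1)
        cleaned.append(rest.strip() + " " + date)
    return cleaned


# Sample rows copied from the question, including the unwanted blank line.
sample = [
    "2014_03_28 PT 588.12    396.73\n",
    "\n",
    "2014_03_28 DB 0.17      0.0\n",
]
result = move_date_last(sample)
for row in result:
    print(row)
```

Like the file-based version, this assumes every non-blank line has at least two whitespace-separated fields; a line holding only a date would fail at the unpacking step.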

Answered 2025-04-18 by Python大师

