How to save files to Hadoop with Python


Question:

I'm starting to learn Hadoop, and I need to save many files into it using Python. I can't seem to figure out what I'm doing wrong. Can anyone help me?

Here is my code. I think HDFS_PATH is correct, since I didn't change it in the settings when installing. pythonfile.txt is on my desktop (as is the Python code I run from the command line).

Code:

import hadoopy
import os
hdfs_path = 'hdfs://localhost:9000/python'

def main():
    hadoopy.writetb(hdfs_path, [('pythonfile.txt',open('pythonfile.txt').read())])

main()

Output: when I run the code above, all I get is this entry for /python itself.

iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 Brian supergroup        236 2014-10-28 11:30 /python

3 Answers

I have a feeling you are writing to a file called '/python', whereas you intended it to be the directory the file is stored in.

What does

hdfs dfs -cat /python

show you?

If it shows the file contents, all you need to do is edit hdfs_path to include the file name (you should first delete the existing /python with -rm). Otherwise, use pydoop (pip install pydoop) and do the following:

import pydoop.hdfs as hdfs

from_path = '/tmp/infile.txt'
to_path = 'hdfs://localhost:9000/python/outfile.txt'
hdfs.put(from_path, to_path)
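
For comparison, here is roughly what the asker's original hadoopy approach would look like with the path fixed as suggested above. This is a minimal sketch, assuming /python has already been removed with -rm and that writetb, as in the question, writes the key/value pairs to a single file at the given path:

import hadoopy

# Target a file *inside* the /python directory rather than /python itself;
# writetb stores the (key, value) pairs in one file at this exact path.
hdfs_path = 'hdfs://localhost:9000/python/pythonfile.txt'

def main():
    with open('pythonfile.txt') as f:
        hadoopy.writetb(hdfs_path, [('pythonfile.txt', f.read())])

if __name__ == '__main__':
    main()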

I found this answer here:

import subprocess

def run_cmd(args_list):
    """
    Run a system command and return (return_code, stdout, stderr).
    """
    print('Running system command: {0}'.format(' '.join(args_list)))
    # universal_newlines=True makes stdout/stderr str rather than bytes,
    # so out.split('\n') below also works under Python 3
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err

#Run Hadoop ls command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
lines = out.split('\n')


#Run Hadoop get command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])


#Run Hadoop put command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])


#Run Hadoop copyFromLocal command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])

#Run Hadoop copyToLocal command in Python
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])


#Run Hadoop remove file command in Python
#Usage: hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])


#rm -r
#HDFS command to remove the entire directory and all of its content from HDFS.
#Usage: hdfs dfs -rm -r <path>

(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
(ret, out, err)= run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])




#Check if a file exist in HDFS
#Usage: hadoop fs -test -[defsz] URI


#Options:
#-d: if the path is a directory, return 0.
#-e: if the path exists, return 0.
#-f: if the path is a file, return 0.
#-s: if the path is not empty, return 0.
#-z: if the file is zero length, return 0.
#Example:

#hadoop fs -test -e filename

hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
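
As a small end-to-end sketch (the paths here are hypothetical), the helpers above can be combined to upload a local file only when the destination does not already exist:

local_file = 'pythonfile.txt'
hdfs_file_path = '/python/pythonfile.txt'

# -test -e returns 0 if the path exists, non-zero otherwise
ret, out, err = run_cmd(['hdfs', 'dfs', '-test', '-e', hdfs_file_path])
if ret:
    ret, out, err = run_cmd(['hdfs', 'dfs', '-put', local_file, hdfs_file_path])
    if ret:
        print('upload failed: {0}'.format(err))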

This is a pretty typical task for the subprocess module. The solution looks like this:

put = Popen(["hadoop", "fs", "-put", <path/to/file>, <path/to/hdfs/file>], stdin=PIPE, bufsize=-1)
put.communicate()

Complete example

Suppose you are on a server and have an authenticated connection to HDFS (e.g., you have already run kinit with your .keytab).

You have just created a CSV from a pandas.DataFrame and want to put it into HDFS.

You can then upload the file to HDFS as follows:

import os
import pandas as pd
from subprocess import PIPE, Popen


# define path to saved file
file_name = "saved_file.csv"

# create a pandas.DataFrame
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'], 'Jan': [150, 200, 50]}
df = pd.DataFrame.from_dict(sales)

# save your pandas.DataFrame to csv (this could be anything, not necessarily a pandas.DataFrame)
df.to_csv(file_name)

# create path to your username on hdfs
hdfs_path = os.path.join(os.sep, 'user', '<your-user-name>', file_name)

# put csv into hdfs
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()

The CSV file will then live at /user/<your-user-name>/saved_file.csv

Note: if you created this file from a Python script called within Hadoop, the intermediate CSV file may be stored on some random node. Since this file is (presumably) no longer needed, it is best practice to remove it so that you don't pollute the nodes every time the script is called. You can do this simply by adding os.remove(file_name) as the last line of the script above.
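
For example, a slightly more defensive variant of that cleanup (a sketch, reusing file_name and hdfs_path from the script above) uses try/finally so the temporary CSV is removed even if the upload fails:

try:
    put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
    put.communicate()
finally:
    # delete the intermediate CSV so repeated runs do not litter the node
    os.remove(file_name)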
