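I'm using Python (3.5) to loop through some .msg files and extract data from them, including a URL for a file to download and the folder the file should go into. I've successfully extracted the data from the .msg files, but now, when I try to piece together the absolute file path for the downloaded file, the format comes out weird, with backslashes and \t\r.
Here's an abridged view of the code: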
import os
import re
import shutil
import string
import unicodedata
import urllib.request

import win32com.client

# `files` and `script_dir` are defined elsewhere in the full script.
for file in files:
    file_abs_path = script_dir + '/' + file
    print(file_abs_path)
    outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    msg = outlook.OpenSharedItem(file_abs_path)
    pattern = re.compile(r'(?:^|(?<=\n))[^:<\n]*[:<]\s*([^>\n]*)', flags=re.DOTALL)
    results = pattern.findall(msg.Body)
    # results[0] -> eventID
    regexID = re.compile(r'^[^\/\s]*', flags=re.DOTALL)
    filtered = regexID.findall(results[0])
    eventID = filtered[0]
    # print(eventID)
    # results[1] -> title (punctuation removed, spaces replaced with underscores)
    title = results[1].translate(str.maketrans('', '', string.punctuation)).replace(' ', '_')
    title = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
    title = title.decode('UTF-8')
    print(title)
    # results[2] -> account
    regexAcc = re.compile(r'^[^\(\s]*', flags=re.DOTALL)
    filtered = regexAcc.findall(results[2])
    account = filtered[0]
    account = unicodedata.normalize('NFKD', account).encode('ascii', 'ignore')
    account = account.decode('UTF-8')
    # print(account)
    # results[3] -> downloadURL
    downloadURL = results[3]
    # print(downloadURL)
    rel_path = account + '/' + eventID + '_' + title + '.mp4'
    rel_path = unicodedata.normalize('NFKD', rel_path).encode('ascii', 'ignore')
    rel_path = rel_path.decode('UTF-8')
    filename_abs_path = os.path.join(script_dir, rel_path)
    # Download .mp4 from a url and save it locally under `filename_abs_path`:
    with urllib.request.urlopen(downloadURL) as response, open(filename_abs_path, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    # print item [ID - Title] when done
    print('[Complete] ' + eventID + ' - ' + title)
    del outlook, msg
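As you can see, I have a regex that extracts 4 pieces of data from the .msg. I then have to go through each one and do some further fine-tuning, but after that I have what I need: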
eventID
# 123456
title
# Name_of_item_with_underscord_no_punctuation
account
# nameofaccount
downloadURL
# http://download.com/basicurlandfile.mp4
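That is the data I end up with. I've printed it out to check, and it doesn't contain any weird characters. But when I try to build the path for the .mp4 (file name and directory):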
downloadURL = results[3]
# print(downloadURL)
rel_path = account + '/' + eventID + '_' + title + '.mp4'
rel_path = unicodedata.normalize('NFKD', rel_path).encode('ascii', 'ignore')
rel_path = rel_path.decode('UTF-8')
filename_abs_path = os.path.join(script_dir, rel_path)
# Download .mp4 from a url and save it locally under `file_name`:
with urllib.request.urlopen(downloadURL) as response, open(filename_abs_path, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
After doing this, the output I get from running the code is:
Traceback (most recent call last):
  File "sfaScript.py", line 65, in <module>
    with urllib.request.urlopen(downloadURL) as response, open(filename_abs_path, 'wb') as out_file:
OSError: [Errno 22] Invalid argument: 'C:/Users/Kenny/Desktop/sfa_kenny_batch_1\\accountnamehere/123456_Name_of_item_with_underscord_no_punctuation\t\r.mp4'
TL;DR - The problem
So filename_abs_path somehow turns into
C:/Users/Kenny/Desktop/sfa_kenny_batch_1\\accountnamehere/123456_Name_of_item_with_underscord_no_punctuation\t\r.mp4
and I need it to be
C:/Users/Kenny/Desktop/sfa_kenny_batch_1/accountnamehere/123456_Name_of_item_with_underscord_no_punctuation.mp4
Thanks for your help!
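It looks like your regex captured a tab character (\t) and a carriage return (\r) in title. A quick way to fix this is to strip title before composing the file name, for example with title = title.strip(), which removes all leading and trailing "whitespace" characters, including tabs and carriage returns.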
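To illustrate why the stray characters appear and how stripping removes them, here is a minimal sketch. It reuses the pattern from the question, but the message body, the `clean_title` helper, and the sample values are assumptions made up for the demonstration. They are not taken from the real .msg files. The idea is that the capture group `[^>\n]*` stops at `\n` but not at `\t` or `\r`, so a body with Windows (CRLF) line endings leaves a trailing `\r` in every captured value.

```python
import re
import string

# Same pattern as in the question: the capture group stops at '>' or '\n',
# but it happily includes '\t' and '\r'.
pattern = re.compile(r'(?:^|(?<=\n))[^:<\n]*[:<]\s*([^>\n]*)', flags=re.DOTALL)

# Hypothetical message body with CRLF line endings and a trailing tab
# after the title (an assumption, not the real .msg content).
body = (
    "Event ID: 123456/abc\r\n"
    "Title: Name of item\t\r\n"
    "Account: nameofaccount (x)\r\n"
)

results = pattern.findall(body)
raw_title = results[1]  # still ends with the tab and carriage return


def clean_title(raw):
    """Hypothetical helper: drop punctuation, strip surrounding whitespace,
    then replace the remaining spaces with underscores."""
    no_punct = raw.translate(str.maketrans('', '', string.punctuation))
    return no_punct.strip().replace(' ', '_')


title = clean_title(raw_title)
file_name = '123456' + '_' + title + '.mp4'

print(repr(raw_title))
print(repr(title))
print(file_name)
```

Calling `strip()` before replacing spaces keeps the underscores between words while dropping the trailing `\t\r`. As for the mixed separators in the error message, Windows generally accepts both `/` and `\` in a path, so the control characters are the likely cause of the `OSError`. If a uniform look is wanted, passing the joined path through `os.path.normpath` should tidy the separators.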