How do I load downloaded and unzipped text files into a dataframe?

1 Answer

User

#1 · Posted on 2024-05-26 16:28:29

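See the inline comments in the code for an explanation
The code uses the pathlib module to find the unzipped files
There are 20 article types, so the dictionary of dataframes has 20 keys
The value for each key is a dataframe containing all the articles of that type.
- Each dataframe has 1000 rows, one row per article
In total there are 20000 articles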
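The grouping described above can be tried on a tiny stand-in dataset before running it on the real download. This is only a sketch: the temporary folder layout, the file names, and the helper `load_by_type` are hypothetical, and it builds one dataframe per type from a list of rows rather than concatenating one-row dataframes as the answer's code does.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

import pandas as pd


def load_by_type(root: Path) -> dict:
    """Group every file under root by the name of its parent folder."""
    rows = defaultdict(list)
    for file in sorted(root.rglob('*')):
        if not file.is_file():
            continue
        # the parent folder name is the article type, e.g. 'sci.space' -> 'sci_space'
        article_type = file.parts[-2].replace('.', '_')
        text = file.read_text(encoding='utf-8', errors='ignore')
        rows[article_type].append({'article': text, 'type': article_type})
    # one dataframe per article type
    return {k: pd.DataFrame(v) for k, v in rows.items()}


# hypothetical miniature layout: two type folders, three files in total
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = [('sci.space', '1', 'orbit'), ('sci.space', '2', 'moon'), ('rec.autos', '3', 'engine')]
    for folder, name, body in samples:
        (root / folder).mkdir(exist_ok=True)
        (root / folder / name).write_text(body, encoding='utf-8')
    demo = load_by_type(root)

print(sorted(demo))
print({k: len(v) for k, v in demo.items()})
```

With the real dataset, the same idea should give 20 keys with 1000 rows each, as stated above.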
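This implementation preserves the shape of each article.
- When a row of a dataframe is printed, the article appears in readable form, with its line breaks and punctuation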
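A small sketch of why the shape survives, using made-up lines rather than a real article: blank lines are dropped, but every kept line still ends in its own newline, so joining with a space leaves the line breaks in place (and puts one leading space in front of each line after the first).

```python
# made-up lines standing in for what reading a file line by line returns
lines = ["Subject: hello\n", "\n", "First line\n", "   \n", "Second line\n"]

# keep only non-blank lines; each kept line retains its trailing newline
article = ' '.join(line for line in lines if line.strip())

print(repr(article))
print(article)
```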
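To create a single dataframe from the individual dataframes:
- dfc = pd.concat(dd.values()).reset_index(drop=True)
- This is why the 'type' column is added when each dataframe is first created. In the combined dataframe, the article type remains identifiable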
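As an illustration of the combining step, here is a minimal sketch with a toy `dd` holding two small dataframes (the article texts are placeholders, not real data). It shows that the 'type' column still identifies each row after concatenation.

```python
import pandas as pd

# toy stand-in for the dict of dataframes; contents are placeholders
dd = {
    'sci_space': pd.DataFrame({'article': ['orbit', 'moon'], 'type': 'sci_space'}),
    'rec_autos': pd.DataFrame({'article': ['engine'], 'type': 'rec_autos'}),
}

# stack the per-type dataframes and renumber the rows from 0
dfc = pd.concat(dd.values()).reset_index(drop=True)

# the 'type' column lets the rows be counted or filtered per article type
counts = dfc['type'].value_counts().to_dict()
print(dfc)
print(counts)
```

Filtering works the same way, for example `dfc[dfc['type'] == 'sci_space']`.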
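This answers the question of how to load all the files into dataframes
For further questions about processing the text, please open a new question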

from pathlib import Path
from io import BytesIO
import requests
import pandas as pd
from collections import defaultdict
from zipfile import ZipFile

######################################################################
# download and save zipped files

# location to save files; this creates a pathlib object for the path, and pathlib objects have methods and attributes such as rglob, parts, and is_file
save_path = Path('data/zipped')

zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"
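res = requests.get(zip_file_url)  # the whole archive is read into memory below, so streaming is not needed
res.raise_for_status()  # stop here if the download failed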

with ZipFile(BytesIO(res.content), 'r') as zip_ref:
    zip_ref.extractall(save_path)
######################################################################

# find all the files; the methods in this list comprehension are pathlib methods
files = [file for file in save_path.rglob('*') if file.is_file()]

# dict to save dataframes for each file
dd = defaultdict(list)
for file in files:
    
    # extract the type of article from the path
    article_type = file.parts[-2].replace('.', '_')
    
    # open the file
    with file.open(mode='r', encoding='utf-8', errors='ignore') as f:
        # drop blank lines and join the rest into one string; each kept line retains its trailing newline
        text = ' '.join(line for line in f if line.strip())
        
    # create a one-row dataframe from the article text
    df = pd.DataFrame([text], columns=['article'])
    
    # add a column for the article type
    df['type'] = article_type
    
    # add the dataframe to the default dict
    dd[article_type].append(df.copy())

# each value of the dict is a list of dataframes, iterate through all keys and create a single dataframe for each key
for k, v in dd.items():
    # for each article type, combine its list of dataframes into a single dataframe
    dd[k] = pd.concat(v).reset_index(drop=True)

print(dd.keys())
[out]:
dict_keys(['alt_atheism', 'comp_graphics', 'comp_os_ms-windows_misc', 'comp_sys_ibm_pc_hardware', 'comp_sys_mac_hardware', 'comp_windows_x', 'misc_forsale', 'rec_autos', 'rec_motorcycles', 'rec_sport_baseball', 'rec_sport_hockey', 'sci_crypt', 'sci_electronics', 'sci_med', 'sci_space', 'soc_religion_christian', 'talk_politics_guns', 'talk_politics_mideast', 'talk_politics_misc', 'talk_religion_misc'])

# print the first article for the alt_atheism key
print(dd['alt_atheism'].iloc[0, 0])
[out]:
Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
 Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
 From: mathew <mathew@mantis.co.uk>
 Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
 Subject: Alt.Atheism FAQ: Atheist Resources
 Summary: Books, addresses, music   anything related to atheism
 Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
 Message-ID: <19930329115719@mantis.co.uk>
 Date: Mon, 29 Mar 1993 11:57:19 GMT
 Expires: Thu, 29 Apr 1993 11:57:19 GMT
 Followup-To: alt.atheism
 Distribution: world
 Organization: Mantis Consultants, Cambridge. UK.
 Approved: news-answers-request@mit.edu
 Supersedes: <19930301143317@mantis.co.uk>
 Lines: 290
 Archive-name: atheism/resources
...
