Importing a .txt file into a DataFrame using multiple delimiters

Published 2024-06-16 14:54:46


I want to import a .txt file into a pandas DataFrame. My .txt file looks like this:

Ann   Gosh  1234567892008-12-15Irvine                CA45678A9Z5Steve        Ryan      
Yosh   Dave    9876543212009-04-18St. Elf              NY12345P8G0Brad      Tuck     
Clair   Simon    3245674572008-12-29New Jersey             NJ56789R9B3Dan     John

The DataFrame should look like this:

FirstN    LastN       SID        Birth        City     States    Postal    TeacherFirstN  TeacherLastN
   Ann     Gosh   123456789  2008-12-15     Irvine       CA        A9Z5           Steve           Ryan 
  Yosh     Dave   987654321  2009-04-18    St. Elf       NY        P8G0            Brad           Tuck
 Clair    Simon   324567457  2008-12-29   New Jersey     NJ        R9B3             Dan           John

I have tried several approaches, including:

df = pd.read_csv('student.txt', sep=r'\s+', engine='python', header=None, index_col=False)

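The idea was to load the raw file into a DataFrame and then clean up each column, but that turned out to be too complicated. Can you help? (The postal code here is just the 4 characters immediately before TeacherFirstN.)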


1 answer
User
#1 · Posted 2024-06-16 14:54:46

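You can first assign names to the columns produced by the initial split, then apply regular expressions to those columns to create the new ones.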

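Some values contain a single space (for example the city name St. Elf), so splitting on every run of whitespace would break them apart. To avoid this, define the separator as "at least two whitespace characters", for example [\s]{2,} (equivalently \s{2,}).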
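To see why the separator matters, here is a minimal sketch that splits the second sample line with both patterns, using only Python's `re` module. The variable names are illustrative and the trailing spaces of the line are trimmed for the demo.

```python
import re

# Second sample line from the question (trailing spaces trimmed).
line = "Yosh   Dave    9876543212009-04-18St. Elf              NY12345P8G0Brad      Tuck"

# Splitting on any whitespace run cuts "St. Elf" into two fields.
fields_any = re.split(r"\s+", line)

# Splitting on two or more whitespace characters keeps "St. Elf" together.
fields_wide = re.split(r"\s{2,}", line)

print(len(fields_any), fields_any)
print(len(fields_wide), fields_wide)
```

The first split produces one field more than the second, which is why the original attempt with `\s+` ends up with misaligned columns.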

A complete example:

import re

import pandas as pd

# Split on runs of two or more whitespace characters so that single spaces
# inside a value (e.g. "St. Elf") are preserved. Regex separators require
# engine='python'.
df = pd.read_csv(
    'test.txt',
    sep=r'\s{2,}',
    engine='python',
    header=None,
    index_col=False,
    names=["FirstN", "LastN", "FULLSID", "TeacherData", "TeacherLastN"],
)

# FULLSID, e.g. "1234567892008-12-15Irvine": 9-digit SID, birth date, city.
sid_parts = df['FULLSID'].str.extract(r'(\d{9})(\d+-\d+-\d+)(.*)')
df['SID'] = sid_parts[0]
df['Birth'] = sid_parts[1]
df['City'] = sid_parts[2]

# TeacherData, e.g. "CA45678A9Z5Steve": 2-character state, an alphanumeric
# run ending in a digit, then the teacher's first name.
teacher_parts = df['TeacherData'].str.extract(
    r'(.{2})([\dA-Z]+\d)(.*)', flags=re.IGNORECASE
)
df['States'] = teacher_parts[0]
df['Postal'] = teacher_parts[1].str[-4:]  # keep only the last 4 characters
df['TeacherFirstN'] = teacher_parts[2]

# Drop the intermediate columns now that they have been split up.
df = df.drop(columns=['FULLSID', 'TeacherData'])

print(df)

Output:

  FirstN  LastN TeacherLastN        SID       Birth        City States Postal TeacherFirstN
0    Ann   Gosh         Ryan  123456789  2008-12-15      Irvine     CA   A9Z5         Steve
1   Yosh   Dave         Tuck  987654321  2009-04-18     St. Elf     NY   P8G0          Brad
2  Clair  Simon         John  324567457  2008-12-29  New Jersey     NJ   R9B3           Dan
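To try this without creating a file, the following is a self-contained sketch that feeds the three sample rows through `io.StringIO` and uses named capture groups so that `str.extract` labels the new columns directly. It rests on a few assumptions that hold for the sample but may not hold for real data: the SID is always 9 digits, the state is two uppercase letters, and the code between the state and the teacher's name ends in a digit. The group name `code` and the variable names are hypothetical, chosen for illustration.

```python
import io

import pandas as pd

# The three sample rows from the question (trailing spaces trimmed).
SAMPLE = """\
Ann   Gosh  1234567892008-12-15Irvine                CA45678A9Z5Steve        Ryan
Yosh   Dave    9876543212009-04-18St. Elf              NY12345P8G0Brad      Tuck
Clair   Simon    3245674572008-12-29New Jersey             NJ56789R9B3Dan     John
"""

raw = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=r"\s{2,}",      # two or more whitespace characters
    engine="python",    # needed for a regex separator
    header=None,
    names=["FirstN", "LastN", "FULLSID", "TeacherData", "TeacherLastN"],
)

# Named groups become the column names of the frame returned by str.extract.
student = raw["FULLSID"].str.extract(
    r"(?P<SID>\d{9})(?P<Birth>\d{4}-\d{2}-\d{2})(?P<City>.*)"
)
teacher = raw["TeacherData"].str.extract(
    r"(?P<States>[A-Z]{2})(?P<code>[\dA-Z]+\d)(?P<TeacherFirstN>.*)"
)
# Assumption from the question: the postal code is the last 4 characters.
teacher["Postal"] = teacher.pop("code").str[-4:]

result = pd.concat(
    [raw[["FirstN", "LastN"]], student, teacher, raw[["TeacherLastN"]]],
    axis=1,
)
# Reorder the columns to match the layout requested in the question.
result = result[
    ["FirstN", "LastN", "SID", "Birth", "City",
     "States", "Postal", "TeacherFirstN", "TeacherLastN"]
]
print(result)
```

Compared with the row-wise `apply` calls, `str.extract` runs each pattern once per column, and the final column selection puts the fields in the order the question asks for.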
