Is it possible to convert specific text data to CSV format and assign header names with Python?

Posted 2024-06-08 23:30:41


I have a dataset in a text file in the following format.

The dataset is available here: https://drive.google.com/file/d/1RqU2s0dqjd60dcYlxEJ8vnw9_z2fWixd/view?usp=sharing

PMID- 20301691
STAT- Publisher
DA  - 20100320
DRDT- 20210311
CTDT- 20000204
PB  - University of Washington, Seattle
DP  - 1993
TI  - Classic Galactosemia and Clinical Variant Galactosemia
BTI - GeneReviews((R))
AB  - CLINICAL CHARACTERISTICS: The term "galactosemia" refers to disorders of
      galactose metabolism that include classic galactosemia, clinical variant
      galactosemia, and biochemical variant galactosemia (not covered in this chapter).
      This GeneReview focuses on: Classic galactosemia, which can result in
      life-threatening complications including feeding problems, failure to thrive,
      hepatocellular damage, bleeding, and E coli sepsis in untreated infants. If a
      lactose-restricted diet is provided during the first ten days of life, the
      neonatal signs usually quickly resolve and the complications of liver failure,
      sepsis, and neonatal death are prevented; however, despite adequate treatment
      from an early age, children with classic galactosemia remain at increased risk
      for developmental delays, speech problems (termed childhood apraxia of speech and
      dysarthria), and abnormalities of motor function. Almost all females with classic
      galactosemia manifest hypergonadatropic hypogonadism or premature ovarian
      insufficiency (POI). Clinical variant galactosemia, which can result in
      life-threatening complications including feeding problems, failure to thrive,
      hepatocellular damage including cirrhosis, and bleeding in untreated infants.
      This is exemplified by the disease that occurs in African Americans and native
      Africans in South Africa. Persons with clinical variant galactosemia may be
      missed with newborn screening as the hypergalactosemia is not as marked as in
      classic galactosemia and breath testing is normal. If a lactose-restricted diet
      is provided during the first ten days of life, the severe acute neonatal
      complications are usually prevented. African Americans with clinical variant
      galactosemia and adequate early treatment do not appear to be at risk for
      long-term complications, including POI. DIAGNOSIS/TESTING: The diagnosis of
      classic galactosemia and clinical variant galactosemia is established by
      detection of elevated erythrocyte galactose-1-phosphate concentration, reduced
      erythrocyte galactose-1-phosphate uridylyltranserase (GALT) enzyme activity,
      and/or biallelic pathogenic variants in GALT. In classic galactosemia,
      erythrocyte galactose-1-phosphate is usually >10 mg/dL and erythrocyte GALT
      enzyme activity is absent or barely detectable. In clinical variant galactosemia,
      erythrocyte GALT enzyme activity is close to or above 1% of control values but
      probably never >10%-15%. However, in African Americans with clinical variant
      galactosemia, the erythrocyte GALT enzyme activity may be absent or barely
      detectable but is often much higher in liver and in intestinal tissue (e.g., 10% 
      of control values). Virtually 100% of infants with classic galactosemia or
      clinical variant galactosemia can be detected in newborn screening programs that 
      include testing for galactosemia in their panel. However, infants with clinical
      variant galactosemia may be missed if the program only measures blood total
      galactose level and not erythrocyte GALT enzyme activity. MANAGEMENT: Treatment
      of manifestations: Standard of care in any newborn who is "screen-positive" for
      galactosemia is immediate dietary intervention while diagnostic testing is under 
      way. Once a diagnosis is confirmed, restriction of galactose intake is continued 
      and all milk products are replaced with lactose-free formulas (e.g., Isomil((R)) 
      or Prosobee((R))) containing non-galactose carbohydrates; dietary restrictions on
      all lactose-containing foods and other dairy products should continue throughout 
      life, although management of the diet becomes less important after infancy and
      early childhood. In rare instances, cataract surgery may be needed in the first
      year of life. Childhood apraxia of speech and dysarthria require expert speech
      therapy. Developmental assessment at age one year by a psychologist and/or
      developmental pediatrician is recommended in order to formulate a treatment plan 
      with the speech therapist and treating physician. For school-age children, an
      individual education plan and/or professional help with learning skills and
      special classrooms as needed. Hormone replacement therapy as needed for delayed
      pubertal development and/or primary or secondary amenorrhea. Stimulation with
      follicle-stimulating hormone may be useful in producing ovulation in some women. 
      Prevention of secondary complications: Recommended calcium, vitamin D, and
      vitamin K intake to help prevent decreased bone mineralization; standard
      treatment for gastrointestinal dysfunction. Surveillance: Biochemical genetics
      clinic visits every three months for the first year of life or as needed
      depending on the nature of the potential acute complications; every six months
      during the second year of life; yearly thereafter. Routine monitoring for: the
      accumulation of toxic analytes (e.g., erythrocyte galactose-1-phosphate and
      urinary galactitol); cataracts; speech and development; movement disorder; POI;
      nutritional deficiency; and osteoporosis. Agents/circumstances to avoid: Breast
      milk, proprietary infant formulas containing lactose, cow's milk, dairy products,
      and casein or whey-containing foods; medications with lactose and galactose.
      Evaluation of relatives at risk: To allow for earliest possible diagnosis and
      treatment of at-risk sibs: Perform prenatal diagnosis when the GALT pathogenic
      variants in the family are known; or If prenatal testing has not been performed, 
      test the newborn for either the family-specific GALT pathogenic variants or
      erythrocyte GALT enzyme activity. Pregnancy management: Women with classic
      galactosemia should maintain a lactose-restricted diet during pregnancy. GENETIC 
      COUNSELING: Classic galactosemia and clinical variant galactosemia are inherited 
      in an autosomal recessive manner. Couples who have had one affected child have a 
      25% chance of having an affected child in each subsequent pregnancy. Molecular
      genetic carrier testing for at-risk sibs and prenatal testing for pregnancies at 
      increased risk are an option if the GALT pathogenic variants in the family are
      known. If the GALT pathogenic variants in a family are not known, prenatal
      testing can rely on assay of GALT enzyme activity in cultured amniotic fluid
      cells.
CI  - Copyright (c) 1993-2021, University of Washington, Seattle. GeneReviews is a
      registered trademark of the University of Washington, Seattle. All rights
      reserved.
FED - Adam, Margaret P
ED  - Adam MP
FED - Ardinger, Holly H
ED  - Ardinger HH
FED - Pagon, Roberta A

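I want the tags on the left to become the column names and the values on the right to become the rows.

The output should look like this: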

PMID       STAT        DA         CTDT
33237688   Publisher   20201126   20201125

I tried converting the text to CSV like this, but it does not work:

  import pandas as pd

  medical = pd.read_csv("sepsis2015.txt",
                         sep="\n")
  print(medical)

2 Answers

Perhaps something like this. Given a file containing the following text:

PMID- 20301691 
STAT- Publisher
DA  - 20100320
DRDT- 20210311
CTDT- 20000204
PB  - University of Washington, Seattle
DP  - 1993
TI  - Classic Galactosemia and Clinical Variant Galactosemia
BTI - GeneReviews((R))


PMID- 33237688
STAT- Publisher
DA  - 20201126
CTDT- 20201125
PB  - University of Washington, Seattle
DP  - 1993
TI  - MIRAGE Syndrome
BTI - GeneReviews((R))

Try this:

import pandas as pd

df = pd.read_csv('text.csv', sep='-', header=None)

# clean up
df[0] = df[0].str.strip()
df[1] = df[1].str.strip()

# create a dictionary
data = df.groupby(0)[1].apply(list).to_dict()

# create a dataframe and make sure the arrays are equal length
# borrowed from https://stackoverflow.com/a/19736406/9192284
df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in data.items() ]))

print(df)

Output:

                BTI      CTDT        DA    DP      DRDT  \
0  GeneReviews((R))  20000204  20100320  1993  20210311   
1  GeneReviews((R))  20201125  20201126  1993       NaN   

                                  PB      PMID       STAT  \
0  University of Washington, Seattle  20301691  Publisher   
1  University of Washington, Seattle  33237688  Publisher   

                                                  TI  
0  Classic Galactosemia and Clinical Variant Gala...  
1                                    MIRAGE Syndrome 
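
A note on the full dataset from the question: values such as the AB abstract wrap onto indented continuation lines and contain hyphens (for example "life-threatening"), and tags such as FED and ED repeat within a record. Reading the file with sep='-' does not cope with that. Below is a minimal sketch of a line-by-line parser, assuming the layout seen in the sample (an uppercase tag of two to four characters, padded with spaces, then "- " and the value, with records separated by blank lines). The helper name parse_records, the regular expression, and the "; " separator for repeated tags are my own choices, not part of any library.

```python
import re
import pandas as pd

# Assumed layout, inferred from the sample: an uppercase tag padded with
# spaces, then "- " and the value; continuation lines are indented.
TAG_LINE = re.compile(r"^([A-Z]{2,4})\s*- (.*)$")

def parse_records(text):
    """Return one dict per blank-line-separated record (hypothetical helper)."""
    records = []
    current = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, last_tag = {}, None
            continue
        match = TAG_LINE.match(line)
        if match:
            last_tag, value = match.group(1), match.group(2).strip()
            if last_tag in current:
                # Repeated tags such as FED or ED are joined, not overwritten.
                current[last_tag] += "; " + value
            else:
                current[last_tag] = value
        elif last_tag is not None and line[0].isspace():
            # Indented line: continuation of the previous field's value.
            current[last_tag] += " " + line.strip()
    if current:
        records.append(current)
    return records

sample = """PMID- 20301691
STAT- Publisher
AB  - CLINICAL CHARACTERISTICS: The term "galactosemia" refers to disorders of
      galactose metabolism that can cause life-threatening complications.
FED - Adam, Margaret P
FED - Ardinger, Holly H

PMID- 33237688
STAT- Publisher
TI  - MIRAGE Syndrome
"""

df = pd.DataFrame(parse_records(sample))
print(df[["PMID", "STAT"]])
```

Because each record becomes its own dict, a tag that is missing from one record shows up as NaN in that record's row and cannot shift values between rows, which can happen when values are grouped by tag first.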

The simplest way I know is:

  • Read the data file with:

    with open("sepsis2015.txt") as file:
        lines = file.readlines()
    lines = ''.join(lines).split('\n\n')
    

    This gives you a list of records:

    ['PMID- 20301691 \nSTAT- Publisher\nDA  - 20100320\nDRDT- 20210311\nCTDT- 20000204\nPB  - University of Washington, Seattle\nDP  - 1993\nTI  - Classic Galactosemia and Clinical Variant Galactosemia\nBTI - GeneReviews((R))', '\nPMID- 33237688\nSTAT- Publisher\nDA  - 20201126\nCTDT- 20201125\nPB  - University of Washington, Seattle\nDP  - 1993\nTI  - MIRAGE Syndrome\nBTI - GeneReviews((R))']
    
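  • Convert the records stored in the lines list into the data dictionary, splitting each line on its first hyphen only so that hyphens inside a value are kept:

    data = {i: {item.split('-', 1)[0].strip(): item.split('-', 1)[1].strip() for item in row.split('\n') if '-' in item} for i, row in enumerate(lines)}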
    

    So you have:

    {0: {'PMID': '20301691', 'STAT': 'Publisher', 'DA': '20100320', 'DRDT': '20210311', 'CTDT': '20000204', 'PB': 'University of Washington, Seattle', 'DP': '1993', 'TI': 'Classic Galactosemia and Clinical Variant Galactosemia', 'BTI': 'GeneReviews((R))'}, 1: {'PMID': '33237688', 'STAT': 'Publisher', 'DA': '20201126', 'CTDT': '20201125', 'PB': 'University of Washington, Seattle', 'DP': '1993', 'TI': 'MIRAGE Syndrome', 'BTI': 'GeneReviews((R))'}}
    
  • Finally, convert this dictionary into a pandas.DataFrame with:

    df = pd.DataFrame.from_dict(data, orient = 'index')
    

Full code

import pandas as pd

with open(r'data/data.csv') as file:
    lines = file.readlines()
lines = ''.join(lines).split('\n\n')

# Split each line on the first hyphen only, so hyphens inside a value survive
data = {i: {item.split('-', 1)[0].strip(): item.split('-', 1)[1].strip() for item in row.split('\n') if '-' in item} for i, row in enumerate(lines)}
print(data)
df = pd.DataFrame.from_dict(data, orient = 'index')
print(df)

Output:

       PMID       STAT        DA      DRDT      CTDT                                 PB    DP                                                      TI               BTI
0  20301691  Publisher  20100320  20210311  20000204  University of Washington, Seattle  1993  Classic Galactosemia and Clinical Variant Galactosemia  GeneReviews((R))
1  33237688  Publisher  20201126       NaN  20201125  University of Washington, Seattle  1993                                         MIRAGE Syndrome  GeneReviews((R))
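
Since the question asks for CSV output with particular header names, the DataFrame can be narrowed to those columns and written out with to_csv. This is only a sketch: the two records are shortened copies of the dictionary shown above, and the column list is taken from the expected output in the question.

```python
import pandas as pd

# Two parsed records, shortened from the dictionary shown above.
data = {
    0: {"PMID": "20301691", "STAT": "Publisher", "DA": "20100320",
        "DRDT": "20210311", "CTDT": "20000204"},
    1: {"PMID": "33237688", "STAT": "Publisher", "DA": "20201126",
        "CTDT": "20201125"},
}
df = pd.DataFrame.from_dict(data, orient="index")

# Columns from the question's expected output. reindex also tolerates a
# column that is absent from every record by filling it with NaN.
wanted = ["PMID", "STAT", "DA", "CTDT"]
subset = df.reindex(columns=wanted)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = subset.to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument, for example subset.to_csv("output.csv", index=False) with a file name of your choosing, writes the same text to disk instead of returning it.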
