Pandas: split a string column into component columns using a list of start/end split-point strings (which may overlap)

Posted 2024-04-28 03:42:24


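In my pandas DataFrame of strings, one column holds a large string that I want to split into separate strings, each on its own row of a new DataFrame. The second column is a label, and the same label should appear on every component string.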

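The start and end split points should be determined by a set of strings. Each component string begins where one of the strings in this set is encountered. The set string that starts a component should go in its own column of that row, and should not be part of the split string itself.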

Here is an example.

I have this DataFrame:

import pandas as pd

testdf = pd.DataFrame([
    [ 'BACKGROUND\nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n\nMETHODS\nData from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n\nRESULTS\nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n\nDISCUSSION\nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications.' , 'Entry1'], 
                       [ '\nProblem statement: The industrialization of the world from whole to s ite as a result of technological innovation made many industries adopt ing Information and Communication Technology (ICT) for processing of all their activities from i nception to completion, especially in the developed nations. But, the developing nations appear to make sluggish progress towards ICT adoption due to apprehensiveness that their fraudulent activities c an easily be traced. \nApproach: The purpose of this study was to evaluate the contractor’s perception t oward ICT innovation acceptance for construction site management and the effectiveness of the innova tion. A 519 questionnaire survey was employed for the data collection, while SPSS version 17.0 wa s used for the descriptive statistic and factorial analysis of the data. \nResults: The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view. \nConclusion: By evaluating the ICT innovation, empirical eviden c has been provided for the ‘wait and see contractors’ to adopt ICT in construction site management and by making adequate provisions against the negative factors. ' , 'Entry2'], 
                       ['BACKGROUND AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n\nMETHODS\nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n\nRESULTS\nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears.' ,  'Entry3']
] )
testdf.columns = ['A', 'B']
testdf.head(10)

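This gives a three-row DataFrame with the abstract text in column A and the labels Entry1, Entry2 and Entry3 in column B.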

I have a set of strings:

listStrings = { 
'\nIntroduction' , '\nCase' , 
'\nLiterature' , '\nBackground',  '\nRelated' , 
'\nMethods' , '\nMethod',
'\nTechniques', '\nMethodology',
'\nResults', '\nResult', '\nExperimental',
'\nExperiments', '\nExperiment',
'\nDiscussion' , '\nLimitations',
'\nConclusion' , '\nConclusions',
'\nConcluding' ,
'Introduction\n' , 'Case\n' , 
'Literature\n' , 'Background\n',  'Related\n' , 
'Methods\n' , 'Method\n',
'Techniques\n', 'Methodology\n',
'Results\n', 'Result\n', 'Experimental\n',
'Experiments\n', 'Experiment\n',
'Discussion\n' , 'Limitations\n',
'Conclusion\n' , 'Conclusions\n',
'Concluding\n' ,
'INTRODUCTION' , 'CASE' , 
'LITERATURE' , 'BACKGROUND',  'RELATED' , 
'METHODS' , 'METHOD',
'TECHNIQUES', 'METHODOLOGY',
'RESULTS', 'RESULT', 'EXPERIMENTAL',
'EXPERIMENTS', 'EXPERIMENT',
'DISCUSSION' , 'LIMITATIONS',
'CONCLUSION' , 'CONCLUSIONS',
'CONCLUDING' ,
'Introduction:' , 'Case:' , 
'Literature:' , 'Background:',  'Related:' , 
'Methods:' , 'Method:',
'Techniques:', 'Methodology:',
'Results:', 'Result:', 'Experimental:',
'Experiments:', 'Experiment:',
'Discussion:' , 'Limitations:',
'Conclusion:' , 'Conclusions:',
'Concluding:' ,
}

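Nothing from the string in column A should be saved until one of the strings in listStrings is reached. Once a string from listStrings is reached, put that listStrings string into its own column in a row of the new DataFrame. Then put everything after the listStrings string into that row, until the text reaches another string from listStrings. Then repeat the process: put that string into the keyword column of a new row, put the segment that follows it into that row, and so on.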
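The procedure described above can be sketched in plain Python as follows. This is only an illustration under some assumptions: the helper name `split_sections` and the small `demo` text are made up for the example, and keywords are tried longest-first so that, for instance, `RESULTS` is preferred over `RESULT`.

```python
import re

def split_sections(text, keywords):
    """Return (keyword, segment) pairs; text before the first keyword is dropped."""
    # Try longer keywords first so 'RESULTS' is preferred over 'RESULT'.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(k) for k in ordered) + ")"
    # With a capturing group, re.split keeps the delimiters:
    # [before, kw1, seg1, kw2, seg2, ...]
    parts = re.split(pattern, text)
    return list(zip(parts[1::2], parts[2::2]))

# Hypothetical miniature input, not the real abstracts.
demo = "ignored intro\nMETHODS\nWe did X.\n\nRESULTS\nIt worked."
pairs = split_sections(demo, {"METHODS", "METHOD", "RESULTS", "RESULT"})
print(pairs)
```

Each pair holds the keyword that opened the segment and the text up to the next keyword, which is the row layout asked for in columns C and D.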

Here is an example of the desired result:

testdf2 = pd.DataFrame([
    [ 'BACKGROUND' , '\nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n', 'Entry1'],
    ['METHODS', 'Data from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n', 'Entry1'],
    ['RESULTS', '\nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n', 'Entry1'],
    ['DISCUSSION', '\nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications.' , 'Entry1'], 
                        ['\nResults:', ' The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view. ', 'Entry2'],
                         ['\nConclusion:',' By evaluating the ICT innovation, empirical eviden c has been provided for the wait and see contractors to adopt ICT in construction site management and by making adequate provisions against the negative factors.', 'Entry2'], 
                       ['BACKGROUND',  'AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n', 'Entry3'],
                      [ 'METHODS', '\nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n',  'Entry3'],
                      [ 'RESULTS', '\nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears.', 'Entry3']
])
testdf2.columns = ['C' , 'D', 'E']
testdf2.head(20)

which results in:

C   D   E
0   BACKGROUND  \nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n    Entry1
1   METHODS Data from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n Entry1
2   RESULTS \nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n  Entry1
3   DISCUSSION  \nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications.  Entry1
4   \nResults:  The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view.  Entry2
5   \nConclusion:   By evaluating the ICT innovation, empirical eviden c has been provided for the wait and see contractors to adopt ICT in construction site management and by making adequate provisions against the negative factors.    Entry2
6   BACKGROUND  AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n    Entry3
7   METHODS \nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n Entry3
8   RESULTS \nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears. Entry3

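The original label from column B, which belongs to each original string in the first DataFrame, is assigned to every row of string segments in the new DataFrame.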

The strings from listStrings are excluded from the segments and are instead extracted into their own column.

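Finally, the strings in listStrings may overlap. It does not matter which side of an overlap is matched first, which of the overlapping strings is chosen for the column, or whether a combination is used (e.g. '\nResults:'), as shown in rows 4 and 5 of the second example DataFrame.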
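One way to get the combined form for overlapping entries is to describe the set as base words with an optional leading newline and an optional trailing colon or newline. The sketch below assumes that reading of the set; `base_words` is a hypothetical shortened list, the all-caps variants are left out for brevity, and the pattern requires at least one of the prefix or suffix so that a bare word inside a sentence is not treated as a heading.

```python
import re

# Hypothetical shortened word list; the real set has many more entries.
base_words = ["Results", "Result", "Conclusion", "Conclusions"]
# Longest first, so 'Results' is tried before 'Result'.
words = "|".join(re.escape(w) for w in sorted(base_words, key=len, reverse=True))

# Either a newline-prefixed word with an optional ':' or newline suffix,
# or an unprefixed word that must carry one of those suffixes.
combined = re.compile(rf"\n(?:{words})[:\n]?|(?:{words})[:\n]")

text = "intro \nResults: it worked. \nConclusion: done."
found = combined.findall(text)
print(found)
```

Here the overlapping entries '\nResults' and 'Results:' are captured as the single token '\nResults:', which matches the form used in rows 4 and 5 of the desired output.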

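Edit:

For a solution, I need column B carried over into column E as in the example result above. Each entry in column B is an important label, and every component string needs its original label in column E.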
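A possible way to carry the label along is to build a list of (keyword, segment) pairs per row and let `DataFrame.explode` repeat the other columns. This is a sketch on made-up miniature data; the helper `pairs` and the two-keyword set are assumptions for the example, not part of the original data.

```python
import re
import pandas as pd

keywords = {"METHODS", "RESULTS"}  # hypothetical two-keyword set
ordered = sorted(keywords, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(k) for k in ordered) + ")"

def pairs(text):
    # re.split with a capturing group yields [before, kw1, seg1, kw2, seg2, ...]
    parts = re.split(pattern, text)
    return list(zip(parts[1::2], parts[2::2]))

df = pd.DataFrame({
    "A": ["x METHODS a RESULTS b", "RESULTS c"],
    "B": ["Entry1", "Entry2"],
})

# One row per (keyword, segment) pair; column B is repeated automatically.
exploded = df.assign(pair=df["A"].map(pairs)).explode("pair", ignore_index=True)
result = pd.DataFrame({
    "C": exploded["pair"].str[0],  # keyword
    "D": exploded["pair"].str[1],  # segment
    "E": exploded["B"],            # original label
})
print(result)
```

Because `explode` duplicates the remaining columns for every list element, no separate bookkeeping is needed to match labels to segments.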


1 answer

Answer 1 · posted 2024-04-28 03:42:24

Here is one approach; I am not sure how efficient it is on large datasets:

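# first we build a big regex pattern
# (a set has no fixed order, so for overlapping keywords such as
# 'RESULT' and 'RESULTS' either one may be the one that matches)
pat = '|'.join(listStrings)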

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text with the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True) 
             for i in range(len(testdf))]).stack()

# stack the keywords:
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

# build the output DataFrame
# note that we shift the chunks to match the keywords
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})

This gives the keywords in column D and the text that follows each keyword in column E.
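The shift step may be easier to see on a toy example. Splitting a text on n keywords yields n + 1 chunks, the first of which is the text before the first keyword, so shifting by -1 within each original row lines chunk j + 1 up with keyword j. The Series below is hand-built to mimic the shape of the stacked chunks and is not the real data.

```python
import pandas as pd

# Stand-in for the stacked chunks: level 0 is the original row,
# level 1 is the chunk position within that row.
chunks = pd.Series(
    ["pre", "seg1", "seg2", "pre", "seg1"],
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]),
)

# Shift up by one inside each original row; the last slot of each row becomes NaN.
shifted = chunks.groupby(level=0).shift(-1)
print(shifted)
```

After the shift, position j of each row holds the segment that follows keyword j, and the leading "pre" chunk has dropped out.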

Edit:

Here is a version of the solution that gives the exact output specified in the question:

import numpy as np

# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text with the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True) 
             for i in range(len(testdf))]).stack()

# flatten the keywords into one array
keys = np.concatenate(new_df.values)
# shift the chunks to match the keywords and drop the empty last slot of each row
values = chunks.groupby(level=0).shift(-1).dropna().values
# repeat each label from column B once per keyword found in its row
labels = np.concatenate([len(val) * [testdf['B'][ind]] for ind, val in enumerate(new_df.values)])
# build the output DataFrame
pd.DataFrame({'C': keys, 'D': values, 'E': labels})

The result has the keywords in column C, the text segments in column D, and the original labels in column E.
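As an alternative to splitting and shifting, the keyword and its segment can be captured together with `Series.str.extractall`. The following is a sketch on made-up data rather than the abstracts from the question; the keyword set is shortened, and the named groups `C` and `D` are chosen only to match the column names in the desired output.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "A": ["skip\nMETHODS\nWe did X.\n\nRESULTS\nIt worked.",
          "BACKGROUND\nSome context."],
    "B": ["Entry1", "Entry2"],
})
keywords = {"BACKGROUND", "METHODS", "METHOD", "RESULTS", "RESULT"}

# Longest keywords first so 'RESULTS' is preferred over 'RESULT'.
alt = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
# A keyword, then the shortest text up to the next keyword or the end of the string.
pat = rf"(?P<C>{alt})(?P<D>.*?)(?={alt}|\Z)"

# DOTALL lets '.' cross newlines; the result is indexed by (original row, match number).
out = df["A"].str.extractall(pat, flags=re.DOTALL)
# Look up the label of the original row for every match.
out["E"] = df["B"].loc[out.index.get_level_values(0)].to_numpy()
out = out.reset_index(drop=True)
print(out)
```

Text before the first keyword is never matched, so it is discarded without an explicit shift step.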
