Pandas: split a string column into component columns based on a list of (overlapping) start/end split-point strings

import pandas as pd

testdf = pd.DataFrame([
    ['BACKGROUND\nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n\nMETHODS\nData from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n\nRESULTS\nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n\nDISCUSSION\nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications.', 'Entry1'],
    ['\nProblem statement: The industrialization of the world from whole to s ite as a result of technological innovation made many industries adopt ing Information and Communication Technology (ICT) for processing of all their activities from i nception to completion, especially in the developed nations. But, the developing nations appear to make sluggish progress towards ICT adoption due to apprehensiveness that their fraudulent activities c an easily be traced. \nApproach: The purpose of this study was to evaluate the contractor’s perception t oward ICT innovation acceptance for construction site management and the effectiveness of the innova tion. A 519 questionnaire survey was employed for the data collection, while SPSS version 17.0 wa s used for the descriptive statistic and factorial analysis of the data. \nResults: The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view. \nConclusion: By evaluating the ICT innovation, empirical eviden c has been provided for the ‘wait and see contractors’ to adopt ICT in construction site management and by making adequate provisions against the negative factors. ', 'Entry2'],
    ['BACKGROUND AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n\nMETHODS\nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n\nRESULTS\nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears.', 'Entry3'],
])
testdf.columns = ['A', 'B']
testdf.head(10)

listStrings = { '\nIntroduction' , '\nCase' , '\nLiterature' , '\nBackground', '\nRelated' , '\nMethods' , '\nMethod', '\nTechniques', '\nMethodology', '\nResults', '\nResult', '\nExperimental', '\nExperiments', '\nExperiment', '\nDiscussion' , '\nLimitations', '\nConclusion' , '\nConclusions', '\nConcluding' , 'Introduction\n' , 'Case\n' , 'Literature\n' , 'Background\n', 'Related\n' , 'Methods\n' , 'Method\n', 'Techniques\n', 'Methodology\n', 'Results\n', 'Result\n', 'Experimental\n', 'Experiments\n', 'Experiment\n', 'Discussion\n' , 'Limitations\n', 'Conclusion\n' , 'Conclusions\n', 'Concluding\n' , 'INTRODUCTION' , 'CASE' , 'LITERATURE' , 'BACKGROUND', 'RELATED' , 'METHODS' , 'METHOD', 'TECHNIQUES', 'METHODOLOGY', 'RESULTS', 'RESULT', 'EXPERIMENTAL', 'EXPERIMENTS', 'EXPERIMENT', 'DISCUSSION' , 'LIMITATIONS', 'CONCLUSION' , 'CONCLUSIONS', 'CONCLUDING' , 'Introduction:' , 'Case:' , 'Literature:' , 'Background:', 'Related:' , 'Methods:' , 'Method:', 'Techniques:', 'Methodology:', 'Results:', 'Result:', 'Experimental:', 'Experiments:', 'Experiment:', 'Discussion:' , 'Limitations:', 'Conclusion:' , 'Conclusions:', 'Concluding:' , }
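The four spellings of each heading in `listStrings` follow a regular pattern (leading newline, trailing newline, upper case, trailing colon), so as a rough sketch the same set could be generated from the base words instead of being typed out. The helper name `build_heading_variants` is hypothetical and not part of the original question.

```python
# Base heading words taken from listStrings above
base_words = [
    "Introduction", "Case", "Literature", "Background", "Related",
    "Methods", "Method", "Techniques", "Methodology", "Results", "Result",
    "Experimental", "Experiments", "Experiment", "Discussion",
    "Limitations", "Conclusion", "Conclusions", "Concluding",
]

def build_heading_variants(words):
    """Hypothetical helper: return the four spellings used for each heading."""
    variants = set()
    for w in words:
        variants.update({"\n" + w, w + "\n", w.upper(), w + ":"})
    return variants

generated = build_heading_variants(base_words)
print(len(generated))  # 19 words x 4 spellings = 76
```

Generating the set this way keeps the four spellings in sync when a heading word is added or removed.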

testdf2 = pd.DataFrame([
    ['BACKGROUND', '\nDiagnostic uncertainty in ALS has serious management implications and delays recruitment into clinical trials. Emerging evidence of presymptomatic disease-burden provides the rationale to develop diagnostic applications based on the evaluation of in-vivo pathological patterns early in the disease.\n\nOBJECTIVES\nTo outline and test a diagnostic classification approach based on an array of complementary imaging metrics in key disease-associated anatomical structures.\n\n', 'Entry1'],
    ['METHODS', 'Data from 75 ALS patients and 75 healthy controls were randomly allocated in a training and validation cohort. Spatial masks were created for anatomical foci which best discriminate patients from controls in the training sample. In a virtual brain biopsy, data was then retrieved from these key disease-associated brain regions. White matter diffusivity indices, grey matter T1-signal intensity values and basal ganglia volumes were evaluated as predictor variables in a canonical discriminant function.\n\n', 'Entry1'],
    ['RESULTS', '\nFollowing predictor variable selection, a classification specificity of 85.5% and sensitivity of 89.1% was achieved in the training sample and 90% specificity and 90% sensitivity in the validation sample.\n\n', 'Entry1'],
    ['DISCUSSION', '\nThis study evaluates disease-associated imaging measures in a dummy diagnostic application. Although larger samples will be required for robust validation, the study confirms the potential of multimodal quantitative imaging in future clinical applications.', 'Entry1'],
    ['\nResults:', ' The findings show ICT innovation was effective for site management but there were positive and negative factors that affec t the ICT innovation based on the contractors view. ', 'Entry2'],
    ['\nConclusion:', ' By evaluating the ICT innovation, empirical eviden c has been provided for the wait and see contractors to adopt ICT in construction site management and by making adequate provisions against the negative factors.', 'Entry2'],
    ['BACKGROUND', 'AND PURPOSE\nRotator cuff tears are associated with secondary rotator cuff muscle pathology, which is definitive for the prognosis of rotator cuff repair. There is little information regarding the early histological and immunohistochemical nature of these muscle changes in humans. We analyzed muscle biopsies from patients with supraspinatus tendon tears.\n\n', 'Entry3'],
    ['METHODS', '\nSupraspinatus muscle biopsies were obtained from 24 patients undergoing arthroscopic repair of partial- or full-thickness supraspinatus tendon tears. Tissue was formalin-fixed and processed for histology (for assessment of fatty infiltration and other degenerative changes) or immunohistochemistry (to identify satellite cells (CD56+), proliferating cells (Ki67+), and myofibers containing predominantly type 1 or 2 myosin heavy chain (MHC)). Myofiber diameters and the relative content of MHC1 and MHC2 were determined morphometrically.\n\n', 'Entry3'],
    ['RESULTS', '\nDegenerative changes were present in both patient groups (partial and full-thickness tears). Patients with full-thickness tears had a reduced density of satellite cells, fewer proliferating cells, atrophy of MHC1+ and MHC2+ myofibers, and reduced MHC1 content.\n\n\nINTERPRETATION\nFull-thickness tears show significantly reduced muscle proliferative capacity, myofiber atrophy, and loss of MHC1 content compared to partial-thickness supraspinatus tendon tears.', 'Entry3'],
])
testdf2.columns = ['C', 'D', 'E']
testdf2.head(20)

Display of testdf2, the desired result (column D truncated here; the full text is in the definition of testdf2):

   C              D                                                  E
0  BACKGROUND     \nDiagnostic uncertainty in ALS has serious ...    Entry1
1  METHODS        Data from 75 ALS patients and 75 healthy ...       Entry1
2  RESULTS        \nFollowing predictor variable selection, a ...    Entry1
3  DISCUSSION     \nThis study evaluates disease-associated ...      Entry1
4  \nResults:     The findings show ICT innovation was ...           Entry2
5  \nConclusion:  By evaluating the ICT innovation, empirical ...    Entry2
6  BACKGROUND     AND PURPOSE\nRotator cuff tears are ...            Entry3
7  METHODS        \nSupraspinatus muscle biopsies were ...           Entry3
8  RESULTS        \nDegenerative changes were present in both ...    Entry3

1 Answer

User

#1 · Posted on 2024-05-16 03:05:02

Here is one approach; I am not sure how efficient it is on large datasets:

import re
import pandas as pd

# first we build a big regex pattern; listStrings is a set, so its order is
# arbitrary: put longer keywords first so 'RESULTS' wins over its prefix 'RESULT'
pat = '|'.join(re.escape(s) for s in sorted(listStrings, key=lambda s: (-len(s), s)))

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULTS, DISCUSSION]
# 1                     [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULTS]
# Name: A, dtype: object

# find all the chunks by splitting each text on the keywords found in it
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(map(re.escape, new_df.iloc[i])), expand=True)
             for i in range(len(testdf))]).stack()

# stack the keywords (this works because no keyword contains a space):
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

# our return dataframe
# note that we shift the chunks to match the keywords
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})

Output: [output table not preserved]
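A pattern built by joining an unordered set can list `RESULT` before `RESULTS`, and Python's regex alternation takes the first alternative that matches rather than the longest one. The sketch below illustrates this on a made-up string (the text and the four keywords are assumptions chosen for the demonstration, not data from the question).

```python
import re

text = "METHODS\nWe did X.\n\nRESULTS\nIt worked."

# Shorter keyword listed first: the engine stops at the first alternative that matches
short_first = re.findall("RESULT|RESULTS|METHOD|METHODS", text)

# Longest keyword first: sort by descending length before joining
keywords = {"RESULT", "RESULTS", "METHOD", "METHODS"}
ordered = sorted(keywords, key=lambda k: (-len(k), k))
long_first = re.findall("|".join(re.escape(k) for k in ordered), text)

print(short_first)  # ['METHOD', 'RESULT']
print(long_first)   # ['METHODS', 'RESULTS']
```

With the shorter alternative matched, the leftover `S` ends up at the start of the following chunk, which is why the ordering of the joined pattern matters for the split step.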

Edit:

Below is a version of the solution that gives the exact output specified in the question:

import re
import numpy as np
import pandas as pd

# first we build a big regex pattern; listStrings is a set, so its order is
# arbitrary: put longer keywords first so 'RESULTS' wins over its prefix 'RESULT'
pat = '|'.join(re.escape(s) for s in sorted(listStrings, key=lambda s: (-len(s), s)))

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULTS, DISCUSSION]
# 1                     [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULTS]
# Name: A, dtype: object

# find all the chunks by splitting each text on the keywords found in it
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(map(re.escape, new_df.iloc[i])), expand=True)
             for i in range(len(testdf))]).stack()

# flatten the keywords into one array
keys = np.concatenate(new_df.values)
# shift the chunks to match the keywords, then drop the NaN left at the end of each row
values = chunks.groupby(level=0).shift(-1).dropna().values
# repeat each row's label (column B) once per keyword found in that row
labels = np.concatenate([len(val) * [testdf['B'][ind]] for ind, val in enumerate(new_df.values)])

# our return dataframe
pd.DataFrame({'C': keys, 'D': values, 'E': labels})

Output: [output table not preserved]
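As an alternative sketch (not the answerer's method), the find-then-split steps can be collapsed into a single pass with `re.split` and a capturing group, which keeps the matched keywords in the result list. The function name `split_sections`, the small `demo` frame, and `demo_keywords` are hypothetical and only serve the illustration; the real `testdf` and `listStrings` could be passed in the same way.

```python
import re
import pandas as pd

def split_sections(df, text_col, label_col, keywords):
    """Hypothetical helper: one output row per (keyword, following text, label)."""
    ordered = sorted(keywords, key=lambda k: (-len(k), k))  # longest keyword first
    # the capturing group makes re.split keep the matched keywords in its result
    pattern = re.compile("(" + "|".join(re.escape(k) for k in ordered) + ")")
    rows = []
    for text, label in zip(df[text_col], df[label_col]):
        parts = pattern.split(text)
        # parts is [text before first keyword, key1, chunk1, key2, chunk2, ...]
        for key, chunk in zip(parts[1::2], parts[2::2]):
            rows.append((key, chunk, label))
    return pd.DataFrame(rows, columns=["C", "D", "E"])

demo = pd.DataFrame({
    "A": ["BACKGROUND\nSome context.\n\nRESULTS\nIt worked.",
          "\nResults: fine. \nConclusion: done."],
    "B": ["Entry1", "Entry2"],
})
demo_keywords = {"BACKGROUND", "RESULT", "RESULTS",
                 "\nResults", "\nConclusion", "Results:", "Conclusion:"}

out = split_sections(demo, "A", "B", demo_keywords)
print(out)
```

Text before the first keyword of a row is discarded here, and each keyword is paired directly with the chunk that follows it, so no shifting or index alignment is needed.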
