I have a data frame with a column of titles (see the example below):
import numpy as np
import pandas as pd
Fairytales_in = {'Titles': ['Fairy Tales',
'Tales.3.2.Dancing Shoes, ballgowns and frogs',
'Tales.2.4.6.Red Riding Hood',
'Fairies.1Your own Fairy godmother',
'Ogres-1.1.The wondrous world of Shrek',
'Witches-1-4Maleficient and the malicious curse',
'Tales.2.1.The big bad wolf',
'Tales.2.Little Red riding Hood',
'Tales.2.4.6.1.Why the huntsman is underrated',
'Tales.5.f.Cinderella and the pumpkin carriage',
'Ogres-1.Best Ogre in town',
'No.3.Great Expectations']}
Fairytales_in = pd.DataFrame.from_dict(Fairytales_in)
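I would like to create a new column that contains exactly the same string as the Titles column, but only when the title is a subheading (for example Tales.3.2., Ogres-1.1., Witches-1-4 or Tales.5.f).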
This would be my expected output:
Fairytales_expected_output = {'Titles': ['Fairy Tales',
'Tales.3.2.Dancing Shoes, ballgowns and frogs',
'Tales.2.4.6.Red Riding Hood',
'Fairies.1Your own Fairy godmother',
'Ogres-1.1.The wondrous world of Shrek',
'Witches-1-4Maleficient and the malicious curse',
'Tales.2.1.The big bad wolf',
'Tales.2.Little Red riding Hood',
'Tales.2.4.6.1.Why the huntsman is underrated',
'Tales.5.f.Cinderella and the pumpkin carriage',
'Ogres-1.Best Ogre in town',
'No.3.Great Expectations'],
'Subheading': ['NaN',
'Tales.3.2.Dancing Shoes, ballgowns and frogs',
'NaN',
'NaN',
'Ogres-1.1.The wondrous world of Shrek',
'Witches-1-4Maleficient and the malicious curse',
'Tales.2.1.The big bad wolf',
'NaN',
'NaN',
'Tales.5.f.Cinderella and the pumpkin carriage',
'NaN',
'NaN']}
Fairytales_expected_output = pd.DataFrame.from_dict(Fairytales_expected_output)
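I have been struggling to find a way to make my pattern match only the subheadings. Whatever I try, first-level or third-level headings are still included. This question asks for more or less the same thing, but it is written in C#, and I could not get it to work for my use case.
This is what I have tried so far: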
Fairytales_in['Subheading'] = Fairytales_in.Titles.str.extract(r'(^(?:\w+\.|\-\d{1}\.\d{1}\.)\W*(?:\w+\b\W*){1,100})$')
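But as you can see, it does not produce the expected result. I have been experimenting on regex101.com, but I have been stuck on this for two days now. I would greatly appreciate any help fixing the pattern.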
You can use
^\w+(?:[.-](?:\d+|[a-zA-Z]\b)){2}(?![.-]?\d).*
See the regex demo.
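Details
^ - start of string
\w+ - one or more word characters
(?:[.-](?:\d+|[a-zA-Z]\b)){2} - two occurrences of
    [.-] - a dot or -
    (?:\d+|[a-zA-Z]\b) - one or more digits, or an ASCII letter followed by a word boundary
(?![.-]?\d) - a negative lookahead: an optional . or - followed by a digit is not allowed immediately to the right of the current location
.* - any zero or more characters other than line break characters, as many as possible
Pandas test: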
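The pattern described above can be tried in pandas roughly as follows. This is a minimal sketch, not necessarily the exact code from the original answer: the variable names `titles`, `df` and `pattern` are assumptions, and the outer capturing group is added because `Series.str.extract` needs at least one capture group to return anything.

```python
import pandas as pd

# Sample titles copied from the question.
titles = [
    'Fairy Tales',
    'Tales.3.2.Dancing Shoes, ballgowns and frogs',
    'Tales.2.4.6.Red Riding Hood',
    'Fairies.1Your own Fairy godmother',
    'Ogres-1.1.The wondrous world of Shrek',
    'Witches-1-4Maleficient and the malicious curse',
    'Tales.2.1.The big bad wolf',
    'Tales.2.Little Red riding Hood',
    'Tales.2.4.6.1.Why the huntsman is underrated',
    'Tales.5.f.Cinderella and the pumpkin carriage',
    'Ogres-1.Best Ogre in town',
    'No.3.Great Expectations',
]
df = pd.DataFrame({'Titles': titles})

# The whole pattern is wrapped in one capturing group (an assumption made
# here so that str.extract returns the full matching title).
pattern = r'^(\w+(?:[.-](?:\d+|[a-zA-Z]\b)){2}(?![.-]?\d).*)'

# expand=False returns a Series; rows that do not match become NaN.
df['Subheading'] = df['Titles'].str.extract(pattern, expand=False)
print(df)
```

Rows whose prefix has exactly two numbering levels keep their full title in Subheading; every other row is left as a real missing value (NaN) rather than the string 'NaN' used in the expected-output dictionary.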