如何从列表中过滤特定的POS标签并将其分离为列表?

2024-04-29 14:11:26 发布

您现在位置:Python中文网/ 问答频道 /正文

我有大量的产品描述数据,需要将产品名称和意图与描述分开,我发现在用POS标记标记文本之后分离NNP标记对进一步清理有一定帮助。在

我有以下类似的数据,我只想过滤NNP标记,并希望它们在各自的列表中被过滤,但不能这样做。在

 data = [[('User', 'NNP'),
  ('is', 'VBZ'),
  ('not', 'RB'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('iShopCatalog', 'NN'),
  ('Coala', 'NNP'),
  ('excluding', 'VBG'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('VWR', 'NNP')],
 [('Arfter', 'NNP'),
  ('transferring', 'VBG'),
  ('the', 'DT'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('COALA', 'NNP'),
  ('to', 'TO'),
  ('SRM', 'VB'),
  ('the', 'DT'),
  ('Category', 'NNP'),
  ('S9901', 'NNP'),
  ('Dummy', 'NNP'),
  ('is', 'VBZ'),
  ('maintained', 'VBN')],
 [('Due', 'JJ'),
  ('to', 'TO'),
  ('this', 'DT'),
  ('the', 'DT'),
  ('user', 'NN'),
  ('is', 'VBZ'),
  ('not', 'RB'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('the', 'DT'),
  ('product', 'NN')],
 [('All', 'DT'),
  ('other', 'JJ'),
  ('users', 'NNS'),
  ('can', 'MD'),
  ('order', 'NN'),
  ('these', 'DT'),
  ('articles', 'NNS')],
 [('She', 'PRP'),
  ('can', 'MD'),
  ('order', 'NN'),
  ('other', 'JJ'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('a', 'DT'),
  ('POETcatalog', 'NNP'),
  ('without', 'IN'),
  ('any', 'DT'),
  ('problems', 'NNS')],
 [('Furtheremore', 'IN'),
  ('she', 'PRP'),
  ('is', 'VBZ'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('the', 'DT'),
  ('Vendor', 'NNP'),
  ('VWR', 'NNP'),
  ('through', 'IN'),
  ('COALA', 'NNP')],
 [('But', 'CC'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('all', 'DT'),
  ('other', 'JJ'),
  ('suppliers', 'NNS'),
  ('are', 'VBP'),
  ('not', 'RB'),
  ('orderable', 'JJ')],
 [('I', 'PRP'),
  ('already', 'RB'),
  ('spoke', 'VBD'),
  ('to', 'TO'),
  ('anic', 'VB'),
  ('who', 'WP'),
  ('maintain', 'VBP'),
  ('the', 'DT'),
  ('catalog', 'NN'),
  ('COALA', 'NNP'),
  ('and', 'CC'),
  ('they', 'PRP'),
  ('said', 'VBD'),
  ('that', 'IN'),
  ('the', 'DT'),
  ('reason', 'NN'),
  ('should', 'MD'),
  ('be', 'VB'),
  ('the', 'DT'),
  ('assignment', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('plant', 'NN')],
 [('User', 'NNP'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('assinged', 'JJ'),
  ('to', 'TO'),
  ('Universitaet', 'NNP'),
  ('Regensburg', 'NNP'),
  ('in', 'IN'),
  ('Scout', 'NNP'),
  ('but', 'CC'),
  ('in', 'IN'),
  ('P17', 'NNP'),
  ('table', 'NN'),
  ('YESRMCDMUSER01', 'NNP'),
  ('she', 'PRP'),
  ('is', 'VBZ'),
  ('assigned', 'VBN'),
  ('to', 'TO'),
  ('company', 'NN'),
  ('001500', 'CD'),
  ('Merck', 'NNP'),
  ('KGaA', 'NNP')],
 [('Please', 'NNP'),
  ('find', 'VB'),
  ('attached', 'JJ'),
  ('some', 'DT'),
  ('screenshots', 'NNS')]]

我写了以下代码:

^{pr2}$

其输出如下:

    ['User',
     'Coala',
     'VWR',
     'Arfter',
     'COALA',
     'Category',
     'S9901',
     'Dummy',
     'POETcatalog',
     'Vendor',
     'VWR',
     'COALA',
     'COALA',
     'User',
     'Universitaet',
     'Regensburg',
     'Scout',
     'P17',
     'YESRMCDMUSER01',
     'Merck',
     'KGaA',
     'Please']

我想得到的输出是:

[['User',
  'Coala',
  'VWR']
['Arfter',
 'COALA',
 'Category',
 'S9901',
 'Dummy']
[],
[],
['POETcatalog'],
['Vendor',
 'VWR',
 'COALA'],
[],
['COALA'],
['User',
 'Universitaet',
 'Regensburg',
 'Scout',
 'P17',
 'YESRMCDMUSER01',
 'Merck',
'KGaA'],
['Please']]

还试图使用[[] for i in range(len(data)]将其附加到各自的列表中,但无法这样做。在


Tags: thetoinfromcoalaisdtnn
2条回答

您可以使用以下列表理解:

[[j[0] for j in i if j[-1]=="NNP"] for i in data]

输出:

^{pr2}$

列表理解是一种方法。但是@麦迪的回答可能有点难以理解。在

下面是一个更易于阅读的解决方案:

document = [[('User', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('iShopCatalog', 'NN'), ('Coala', 'NNP'), ('excluding', 'VBG'), ('articles', 'NNS'), ('from', 'IN'), ('VWR', 'NNP')], [('Arfter', 'NNP'), ('transferring', 'VBG'), ('the', 'DT'), ('articles', 'NNS'), ('from', 'IN'), ('COALA', 'NNP'), ('to', 'TO'), ('SRM', 'VB'), ('the', 'DT'), ('Category', 'NNP'), ('S9901', 'NNP'), ('Dummy', 'NNP'), ('is', 'VBZ'), ('maintained', 'VBN')], [('Due', 'JJ'), ('to', 'TO'), ('this', 'DT'), ('the', 'DT'), ('user', 'NN'), ('is', 'VBZ'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('the', 'DT'), ('product', 'NN')], [('All', 'DT'), ('other', 'JJ'), ('users', 'NNS'), ('can', 'MD'), ('order', 'NN'), ('these', 'DT'), ('articles', 'NNS')], [('She', 'PRP'), ('can', 'MD'), ('order', 'NN'), ('other', 'JJ'), ('products', 'NNS'), ('from', 'IN'), ('a', 'DT'), ('POETcatalog', 'NNP'), ('without', 'IN'), ('any', 'DT'), ('problems', 'NNS')], [('Furtheremore', 'IN'), ('she', 'PRP'), ('is', 'VBZ'), ('able', 'JJ'), ('to', 'TO'), ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('Vendor', 'NNP'), ('VWR', 'NNP'), ('through', 'IN'), ('COALA', 'NNP')], [('But', 'CC'), ('articles', 'NNS'), ('from', 'IN'), ('all', 'DT'), ('other', 'JJ'), ('suppliers', 'NNS'), ('are', 'VBP'), ('not', 'RB'), ('orderable', 'JJ')], [('I', 'PRP'), ('already', 'RB'), ('spoke', 'VBD'), ('to', 'TO'), ('anic', 'VB'), ('who', 'WP'), ('maintain', 'VBP'), ('the', 'DT'), ('catalog', 'NN'), ('COALA', 'NNP'), ('and', 'CC'), ('they', 'PRP'), ('said', 'VBD'), ('that', 'IN'), ('the', 'DT'), ('reason', 'NN'), ('should', 'MD'), ('be', 'VB'), ('the', 'DT'), ('assignment', 'NN'), ('of', 'IN'), ('the', 'DT'), ('plant', 'NN')], [('User', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('assinged', 'JJ'), ('to', 'TO'), ('Universitaet', 'NNP'), ('Regensburg', 'NNP'), ('in', 'IN'), ('Scout', 'NNP'), ('but', 'CC'), ('in', 'IN'), ('P17', 'NNP'), ('table', 'NN'), ('YESRMCDMUSER01', 'NNP'), ('she', 'PRP'), ('is', 'VBZ'), ('assigned', 'VBN'), ('to', 'TO'), ('company', 'NN'), ('001500', 'CD'), ('Merck', 'NNP'), ('KGaA', 'NNP')], [('Please', 'NNP'), ('find', 'VB'), ('attached', 'JJ'), ('some', 'DT'), ('screenshots', 'NNS')]]
output = [[word for word, pos in sentence if pos=='NNP'] for sentence in document]

如果您喜欢更简洁的代码,并且可以围绕嵌套列表理解进行思考,https://stackoverflow.com/a/3633145/610569

^{pr2}$

相关问题 更多 >