Writing filtered ngrams from a list of lists to an outfile

Published 2024-04-25 07:44:12


I extracted trigrams matching a certain pattern from a bunch of HTML files. When I print them I get a list of lists (where each row is one trigram). I want to write them to an outfile for further text analysis, but when I try, only the first trigram is printed. How can I get all of the trigrams (the list of trigram lists) into the output file? Ideally I'd like to merge all the trigrams into a single list, rather than having one list per trigram. Thanks a lot for your help.

My code so far looks like this:

from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys
punctuation_set = set(punctuation) 

# Open and read files
filenames = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')
for filename in filenames:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()  # note: only the text of the last file survives this loop

# Extract text from HTML using BeautifulSoup
soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
extracted_text = extracted_text.replace('\n', '')

# Split the text in sentences (using the NLTK sentence splitter) 
sentences = sent_tokenize(extracted_text)

# Create list of tokens with their POS tags (after pre-processing: punctuation removal, tokenization, POS tagging)
all_tokens = []

for sent in sentences:
    sent = "".join([char for char in sent if not char in punctuation_set]) # remove punctuation from sentence (optional; comment out if necessary)
    tokenized_sent = word_tokenize(sent) # split sentence into tokens (using NLTK word tokenization)
    all_tokens.extend(tokenized_sent) # add tagged tokens to list

n=3
threegrams = ngrams(all_tokens, n)


# Find ngrams with specific pattern
for (first, second, third) in threegrams: 
    if first == "a":
        if second.endswith("bb") and second.startswith("leg"):
            print(first, second, third)

1 Answer

Answered 2024-04-25 07:44:12

Firstly, the punctuation removal can be simpler; see Removing a list of characters in string:

>>> from string import punctuation
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> text.translate(str.maketrans('', '', punctuation))
'The lazy birds flew over the rainbow Well not have known'

But removing punctuation before you tokenize is problematic: you can see that We'll -> Well, which is probably not what you want.

This is probably a better approach:

>>> from nltk import sent_tokenize, word_tokenize
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]

But note that the idiom above doesn't handle multi-character punctuation tokens.

For example, we see that word_tokenize() converts " into `` and '', and the idiom above doesn't remove those:

>>> sent = 'He said, "There is no room for room"'
>>> word_tokenize(sent)
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
>>> [word for word in word_tokenize(sent) if word not in punctuation]
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]

To handle this, explicitly put punctuation into a list and append the multi-character punctuation tokens to it:

>>> sent = 'He said, "There is no room for room"'
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> list(punctuation)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> list(punctuation) + ['...', '``', "''"]
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"]
>>> p = list(punctuation) + ['...', '``', "''"]
>>> [word for word in word_tokenize(sent) if word not in p]
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room']
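Instead of enumerating every multi-character token by hand, a more general approach is to drop any token that consists entirely of punctuation characters. This is just a sketch; `is_punct` is a hypothetical helper name, not part of NLTK:

```python
from string import punctuation

punct_chars = set(punctuation)

def is_punct(token):
    # True if every character in the token is a punctuation character,
    # so multi-character tokens like '...', '``' and "''" are caught too
    return all(ch in punct_chars for ch in token)

tokens = ['He', 'said', ',', '``', 'There', 'is', 'no', 'room', "''", '...']
print([tok for tok in tokens if not is_punct(tok)])
# ['He', 'said', 'There', 'is', 'no', 'room']
```

This way you never need to know in advance which multi-character tokens the tokenizer produces.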

As for getting the document stream (what you called all_tokens), here's a simple way:

>>> from nltk import sent_tokenize, word_tokenize
>>> from string import punctuation
>>> p = list(punctuation) + ['...', '``', "''"]
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
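The nested list above can be flattened into a single token stream (the single merged list the question asks about) with itertools.chain.from_iterable — a small sketch:

```python
from itertools import chain

tokenized_sents = [['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'],
                   ['We', "'ll", 'not', 'have', 'known']]

# Flatten the list of per-sentence token lists into one flat token list
all_tokens = list(chain.from_iterable(tokenized_sents))
print(all_tokens)
# ['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow', 'We', "'ll", 'not', 'have', 'known']
```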

Now to your actual question.

What you really need is not to check strings inside ngrams but regex pattern matching.

To find the pattern \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b, see https://regex101.com/r/zBVgp4/4

>>> import re
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha")
['a legobatmanbb cave']
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha")
[]
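If you still want the matches as (first, second, third) tuples rather than single strings, each match can be split on whitespace — a small sketch:

```python
import re

pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"
text = "This is a legobatmanbb cave hahaha"

# Each regex match is one string; splitting recovers the trigram tuple
trigrams = [tuple(match.split()) for match in re.findall(pattern, text)]
print(trigrams)
# [('a', 'legobatmanbb', 'cave')]
```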

Now, to write strings to a file, you can use this idiom; see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function

with open('filename.txt', 'w') as fout:
    print('Hello World', end='\n', file=fout)

In fact, if you're only interested in the ngrams themselves (without tags), you don't need to filter or tokenize the text at all ;P

You can just reduce your code to:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"

with open('filename.txt', 'w') as fout:
    for interesting_ngram in re.findall(pattern, extracted_text):
        print(interesting_ngram, end='\n', file=fout)
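To cover all the HTML files (not just the last one read in the loop) and end up with the single flat list of matches the question asks for, the matching step can be factored into a helper that takes the already-extracted text of each file. This is a sketch; `collect_matches` is a hypothetical helper name:

```python
import re

PATTERN = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"

def collect_matches(texts, pattern=PATTERN):
    """Collect every pattern match from each text into one flat list."""
    all_matches = []
    for text in texts:
        all_matches.extend(re.findall(pattern, text))
    return all_matches

# Stand-in strings; in the question's setting these would be the
# soup.getText() results for each file in the glob loop
texts = ["This is a legobatmanbb cave hahaha", "nothing here", "a legobb day ends"]
print(collect_matches(texts))
# ['a legobatmanbb cave', 'a legobb day']
```

Writing the returned list to an outfile then follows the same print(..., file=fout) idiom shown above.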
