Writing filtered ngrams from a list of lists to an outfile

Published 2024-04-25 07:44:12


I extracted trigrams matching a certain pattern from a bunch of HTML files. When I print them I get a list of lists (where each row is one trigram). I want to write them to an outfile for further text analysis, but when I try, only the first trigram is printed. How can I get all of the trigrams (the list of trigram lists) into the output file? Ideally I'd like to merge all the trigrams into a single list, rather than having one list per trigram. Thanks a lot for your help.

My code so far looks like this:

from nltk import sent_tokenize, word_tokenize
from nltk import ngrams
from bs4 import BeautifulSoup
from string import punctuation
import glob
import sys
punctuation_set = set(punctuation) 

# Open and read files
filenames = glob.glob('C:/Users/dell/Desktop/python-for-text-analysis-master/Notebooks/TEXTS/*')
for filename in filenames:
    with open(filename, encoding='ISO-8859-1', errors="ignore") as f:
        mytext = f.read()  # note: only the text of the last file survives this loop

# Extract text from HTML using BeautifulSoup
soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
extracted_text = extracted_text.replace('\n', '')

# Split the text in sentences (using the NLTK sentence splitter) 
sentences = sent_tokenize(extracted_text)

# Create list of tokens with their POS tags (after pre-processing: punctuation removal, tokenization, POS tagging)
all_tokens = []

for sent in sentences:
    sent = "".join([char for char in sent if not char in punctuation_set]) # remove punctuation from sentence (optional; comment out if necessary)
    tokenized_sent = word_tokenize(sent) # split sentence into tokens (using NLTK word tokenization)
    all_tokens.extend(tokenized_sent) # add tagged tokens to list

n=3
threegrams = ngrams(all_tokens, n)


# Find ngrams with specific pattern
for (first, second, third) in threegrams: 
    if first == "a":
        if second.endswith("bb") and second.startswith("leg"):
            print(first, second, third)

1 Answer

Answered 2024-04-25 07:44:12

Firstly, the punctuation removal can be simpler; see Removing a list of characters in string:

>>> from string import punctuation
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> text.translate(str.maketrans('', '', punctuation))
'The lazy birds flew over the rainbow Well not have known'

But removing punctuation before you tokenize is problematic: you can see that We'll -> Well, which is probably not what you want.

This is probably a better approach:

>>> from nltk import sent_tokenize, word_tokenize
>>> [[word for word in word_tokenize(sent) if word not in punctuation] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]

But note that the idiom above doesn't handle multi-character punctuation tokens.

For example, we see that word_tokenize() converts " into `` and '', and the idiom above doesn't remove those:

>>> sent = 'He said, "There is no room for room"'
>>> word_tokenize(sent)
['He', 'said', ',', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]
>>> [word for word in word_tokenize(sent) if word not in punctuation]
['He', 'said', '``', 'There', 'is', 'no', 'room', 'for', 'room', "''"]

To handle this, explicitly put punctuation into a list and append the multi-character punctuation tokens to it:

>>> sent = 'He said, "There is no room for room"'
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> list(punctuation)
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
>>> list(punctuation) + ['...', '``', "''"]
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''"]
>>> p = list(punctuation) + ['...', '``', "''"]
>>> [word for word in word_tokenize(sent) if word not in p]
['He', 'said', 'There', 'is', 'no', 'room', 'for', 'room']
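Instead of enumerating every multi-character token by hand, a more general approach is to drop any token that consists entirely of punctuation characters. This is just a sketch; `is_punct` is a hypothetical helper name, not part of NLTK:

```python
from string import punctuation

punct_chars = set(punctuation)

def is_punct(token):
    # True if every character in the token is a punctuation character,
    # so multi-character tokens like '...', '``' and "''" are caught too
    return all(ch in punct_chars for ch in token)

tokens = ['He', 'said', ',', '``', 'There', 'is', 'no', 'room', "''", '...']
print([tok for tok in tokens if not is_punct(tok)])
# ['He', 'said', 'There', 'is', 'no', 'room']
```

This way you never need to know in advance which multi-character tokens the tokenizer produces.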

As for getting the document stream (what you called all_tokens), here's a simple way:

>>> from nltk import sent_tokenize, word_tokenize
>>> from string import punctuation
>>> p = list(punctuation) + ['...', '``', "''"]
>>> text = "The lazy bird's flew, over the rainbow. We'll not have known."
>>> [[word for word in word_tokenize(sent) if word not in p] for sent in sent_tokenize(text)]
[['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'], ['We', "'ll", 'not', 'have', 'known']]
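The nested list above can be flattened into a single token stream (the single merged list the question asks about) with itertools.chain.from_iterable — a small sketch:

```python
from itertools import chain

tokenized_sents = [['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow'],
                   ['We', "'ll", 'not', 'have', 'known']]

# Flatten the list of per-sentence token lists into one flat token list
all_tokens = list(chain.from_iterable(tokenized_sents))
print(all_tokens)
# ['The', 'lazy', 'bird', "'s", 'flew', 'over', 'the', 'rainbow', 'We', "'ll", 'not', 'have', 'known']
```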

Now to your actual question.

What you really need is not to check strings inside ngrams but regex pattern matching.

To find the pattern \ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b, see https://regex101.com/r/zBVgp4/4

>>> import re
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This is a legobatmanbb cave hahaha")
['a legobatmanbb cave']
>>> re.findall(r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b", "This isa legobatmanbb cave hahaha")
[]
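If you still want the matches as (first, second, third) tuples rather than single strings, each match can be split on whitespace — a small sketch:

```python
import re

pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"
text = "This is a legobatmanbb cave hahaha"

# Each regex match is one string; splitting recovers the trigram tuple
trigrams = [tuple(match.split()) for match in re.findall(pattern, text)]
print(trigrams)
# [('a', 'legobatmanbb', 'cave')]
```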

Now, to write strings to a file, you can use this idiom; see https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function

with open('filename.txt', 'w') as fout:
    print('Hello World', end='\n', file=fout)

In fact, if you're only interested in the ngrams themselves (without tags), you don't need to filter or tokenize the text at all ;P

You can just reduce your code to:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(mytext, "lxml")
extracted_text = soup.getText()
pattern = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"

with open('filename.txt', 'w') as fout:
    for interesting_ngram in re.findall(pattern, extracted_text):
        print(interesting_ngram, end='\n', file=fout)
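To cover all the HTML files (not just the last one read in the loop) and end up with the single flat list of matches the question asks for, the matching step can be factored into a helper that takes the already-extracted text of each file. This is a sketch; `collect_matches` is a hypothetical helper name:

```python
import re

PATTERN = r"\ba\b\s\bleg[\w]+bb\b\s\b[\w]+\b"

def collect_matches(texts, pattern=PATTERN):
    """Collect every pattern match from each text into one flat list."""
    all_matches = []
    for text in texts:
        all_matches.extend(re.findall(pattern, text))
    return all_matches

# Stand-in strings; in the question's setting these would be the
# soup.getText() results for each file in the glob loop
texts = ["This is a legobatmanbb cave hahaha", "nothing here", "a legobb day ends"]
print(collect_matches(texts))
# ['a legobatmanbb cave', 'a legobb day']
```

Writing the returned list to an outfile then follows the same print(..., file=fout) idiom shown above.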
