如何在文本文件中提取行并查找重复项

2024-05-15 11:08:12 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个写了很多行的文本文件,在文本文件中有一个叫“@Testrun”的词很多次,把“@Testrun”当作起始点,把它当作一个“@Testrun”,把这两个“@Testrun”之间的行作为一个部分,这些文本可以有3-4个部分。我的问题是如何在这些部分中提取这些线并在这些部分中找到重复的线:

我的文本文件如下所示:

@TestRun
    And user validate message on screen "Switch to paperless" 
    And user click on "Manage accounts" label 
    And user click link with label "View all online services" 
    And user waits for 10 seconds 
    Then page is successfully launched 
    And user click link with label "Go paperless for complete convenience" 
    Then page is successfully launched 
    And user validate message on screen "#EmailAddress" 
    And user clicks on the button "Confirm" 
    Then page is successfully launched 
    And user validate message on screen "#MessageValidate" 
    Then page is successfully launched 
    And user click on "menu open user preferences" label 
    And user clicks on the link "Statement and letter preferences" 
    Then page is successfully launched 
    And user validate "Switch to paperless" button is disabled 
    And user validate message on screen "Online only" 
    When user click on "Log out" label 
    Then page is successfully launched

@TestRun 
    And user click on link "Mobile site" 
    And user set text "#Surname" on textbox name "surname" 
    Then page is successfully launched 
    And user click on link "#Account" 
    Then page is successfully launched 
    And user verify message on screen "#Account" 
    And user verify message on screen "Manage statements" 
    And user verify message on screen "Step 1 of 3" 
    Then page is successfully launched 
    And user verify message on screen "Current format type"  
    And user verify message on screen "Online" 
    When user selects the radio button "Paper" 


@TestRun
Then user wait for page load
And user click on button "Continue to Online Banking"
Then user wait for page load
    And user click on "menu open user preferences" label 
    And user clicks on the link "Statement and letter preferences" 
    Then page is successfully launched 
    And page is successfully launched 
    And user waits for 10 seconds 
@TestRun 
    Then page is successfully launched 
    And user waits for 10 seconds 
    And user click checkbox "Telephone" 
    And user click checkbox "Post" 
    And user clicks on the button "Save" 
    Then page is successfully launched 

我尝试了以下代码,但不起作用:

with open('CustPref.txt') as input_data:
    for line in input_data:
        if line.strip() == '@TestRun ':  
            break
    for line in input_data: 
        if line.strip() == '@TestRun ':
            break
        print line

我得到输出,但它是完全不正确的。 我只得到一行作为输出,这不是预计。怎么可能我能解决这个问题吗


Tags: andmessageforisonpagelinkscreen
3条回答

使用^{}第三方库,我们可以在所需目标之前分割文本。你知道吗

更新:我们可以使用^{}在第一个目标之前删除行。你知道吗

import itertools as it
import more_itertools as mit


with open("CustPref.txt", "r") as f:
    lines = f.readlines()

    pred = lambda x: x.startswith("@TestRun")      # trailing-space protection
    inv_pred = lambda x: not pred(x)

    lines = it.dropwhile(inv_pred, lines)          # optional
    chunks = list(mit.split_before(lines, pred))

print(chunks)

输出(缩写)

[['@TestRun\n',
  '    And user validate message on screen "Switch to paperless" \n',
  ...],
 ['@TestRun \n',
  '    And user click on link "Mobile site" \n',
  ...],
 ['@TestRun\n',
  'Then user wait for page load\n',
  ...],
 ...]

你要解决两个问题:

  • 把它分成相应的部分
  • 删除重复项

拆分:

第一个选项

逐行分析文件:

parts = [] # all lines between 2 @TestRun's 
chunks = [] # all chunks of lines between 2 @TestRun's 

startNow = False # wait till first @TestRun before keeping anything

for line in Text(): # see definition for Text() below - it mimics your open('...')
    if line.strip() == '@TestRun':
        startNow = True
        if len(parts) > 0: # found a Testrun, if parts contains lines append to chunks
            chunks.append(parts)
            parts = []
    elif startNow == True: # check if first TestRun hit, if so append line to parts
        parts.append(line)


print(chunks) # done -> list of list of lines between chunks. 

第二选项

不要将文本逐行拆分,将其作为完整文本阅读,并使用列表理解来拆分:

biggerChunks = [x.strip() for x in TextTT().split("@TestRun") ]
chunkified = [x.splitlines() for x in biggerChunks if len(x.strip()) > 0 ]

首先在@TestRun上拆分,得到一个大文本块列表,然后按行拆分每个文本块。结果大致相同:[[2@TestRun之间的所有行]]


删除重复项(保持顺序)

这里的答案是:how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order-这是一个SO链接,所以不会在这里重复:)


助手 Text()用于替换打开的文件,TestTT()是整个文本块:

def Text(): # instead of file open, returns list of lines
    return TextTT().splitlines() 

def TextTT(): # unsplit text
    return '''
@TestRun
    And user validate message on screen "Switch to paperless" 
    And user click on "Manage accounts" label 
    And user click link with label "View all online services" 
    And user waits for 10 seconds 
    Then page is successfully launched 
    And user click link with label "Go paperless for complete convenience" 
    Then page is successfully launched 
    And user validate message on screen "#EmailAddress" 
    And user clicks on the button "Confirm" 
    Then page is successfully launched 
    And user validate message on screen "#MessageValidate" 
    Then page is successfully launched 
    And user click on "menu open user preferences" label 
    And user clicks on the link "Statement and letter preferences" 
    Then page is successfully launched 
    And user validate "Switch to paperless" button is disabled 
    And user validate message on screen "Online only" 
    When user click on "Log out" label 
    Then page is successfully launched

@TestRun 
    And user click on link "Mobile site" 
    And user set text "#Surname" on textbox name "surname" 
    Then page is successfully launched 
    And user click on link "#Account" 
    Then page is successfully launched 
    And user verify message on screen "#Account" 
    And user verify message on screen "Manage statements" 
    And user verify message on screen "Step 1 of 3" 
    Then page is successfully launched 
    And user verify message on screen "Current format type"  
    And user verify message on screen "Online" 
    When user selects the radio button "Paper" 


@TestRun
Then user wait for page load
And user click on button "Continue to Online Banking"
Then user wait for page load
    And user click on "menu open user preferences" label 
    And user clicks on the link "Statement and letter preferences" 
    Then page is successfully launched 
    And page is successfully launched 
    And user waits for 10 seconds 
@TestRun 
    Then page is successfully launched 
    And user waits for 10 seconds 
    And user click checkbox "Telephone" 
    And user click checkbox "Post" 
    And user clicks on the button "Save" 
    Then page is successfully launched 
'''

请参阅注释以获取解释-您可以使用f.e。itertools.chain公司如果需要的话重新组合内线

一个简单的方法是记住你已经看到的台词。您可以将它们收集到一个列表中,但使用字典或集合会更有效。你知道吗

一次读一行。如果这一行(不是一个新的TestRun头并且)以前已经看到过,不要打印它。如果它是一个TestRun头文件,请忘记您看到的内容。打印所有在循环中走得这么远的东西。从下一行开始。你知道吗

with open('CustPref.txt') as input_data:
    seen = set()
    for line in input_data:
        # trim trailing newline
        line = line.rstrip('\n')
        if line == '@TestRun ':   # really sure about the trailing space?
            seen = set()          # who am I? what day is it?
        elif line in seen:
            # skip the rest of the for loop and start over
            continue
        else:
            seen.add(line)
        print(line)

在编程上,按此顺序检查“if it is@TestRun,else if already seen,else add to seen”是有意义的,这样您就不必两次检查它是否是@TestRun。我想在上面的论述中保持更自然的秩序,使之更简单。你知道吗

相关问题 更多 >