Extract the text between specific delimiters from a large text file with custom delimiters and write it to another file using Python

-CITE-
13 USC Sec. 1 1/15/2013
-EXPCITE-
TITLE 13 - CENSUS CHAPTER 1 - ADMINISTRATION SUBCHAPTER I - GENERAL PROVISIONS
-HEAD-
Sec. 1. Definitions
-STATUTE-
As used in this title, unless the context requires another meaning or unless it is otherwise provided - (1) "Bureau" means the Bureau of the Census; (2) "Secretary" means the Secretary of Commerce; and (3) "respondent" includes a corporation, company, association, firm, partnership, proprietorship, society, joint stock company, individual, or other organization or entity which reported information, or on behalf of which information was reported, in response to a questionnaire, inquiry, or other request of the Bureau.
-SOURCE-
(Aug. 31, 1954, ch. 1158, 68 Stat. 1012; Pub. L. 94-521, Sec. 1, Oct. 17, 1976, 90 Stat. 2459.)
-MISC1-
<some text>
-End-
-CITE-
13 USC Sec. 2 1/15/2013
-EXPCITE-
TITLE 13 - CENSUS CHAPTER 1 - ADMINISTRATION SUBCHAPTER I - GENERAL PROVISIONS
-HEAD-
Sec. 2. Bureau of the Census
-STATUTE-
The Bureau is continued as an agency within, and under the jurisdiction of, the Department of Commerce.
-SOURCE-
(Aug. 31, 1954, ch. 1158, 68 Stat. 1012.)
-MISC1-
<some text>
-End-

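2 Answers

User

#1 · Edited 2024-04-24 03:06:13

So, for each line: if it starts with a hyphen, followed by some uppercase text, followed by another hyphen, then it is a marker telling us that we are in a new kind of section. This can be done with a regular expression: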

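import re

current_section_type = None
# A marker line looks like -NAME-; digits and lowercase are allowed so that
# -MISC1- and -End- are also recognized as markers.
r = re.compile(r"^-([A-Za-z0-9]+)-")
with open("input.txt") as f:  # "input.txt" is a placeholder file name
  for line in f:
    m = r.match(line)
    if m:
      current_section_type = m.group(1)
    elif current_section_type == "STATUTE":
      print(line.strip())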
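To write the extracted text to another file instead of printing it, the same idea can be wrapped in a generator. This is only a minimal sketch: the names `extract_sections` and `copy_sections` are made up for illustration, and it assumes each marker sits alone on its own line, as in the sample data.

```python
import re

# Assumption: a marker is a whole line of the form -NAME- (letters or digits).
MARKER = re.compile(r"^-([A-Za-z0-9]+)-\s*$")

def extract_sections(lines, wanted="STATUTE"):
    """Yield the lines that follow a -<wanted>- marker, up to the next marker."""
    current = None
    for line in lines:
        m = MARKER.match(line)
        if m:
            current = m.group(1)   # entering a new section
        elif current == wanted:
            yield line             # body line of a wanted section

def copy_sections(src_path, dst_path, wanted="STATUTE"):
    """Stream the wanted sections from src_path into dst_path."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(extract_sections(src, wanted))
```

Because the file object is iterated lazily, only one line is held in memory at a time, so this should also work for large inputs.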

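User

#2 · Edited 2024-04-24 03:06:13

I would read the text line by line and parse it myself. That way you can process large inputs as a stream. There are nicer solutions using multi-line regexps, but those cannot process the input as a stream.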

#!/usr/bin/env python

import sys, re

# states for our state machine:
OUTSIDE = 0
INSIDE = 1
INSIDE_AFTER_STATUTE = 2

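def eachCite(stream):
  # Yields (withStatute, text) for every -CITE- ... -End- block in the stream.
  state = OUTSIDE
  for lineNumber, line in enumerate(stream):
    # While inside a block, every line (markers included) is collected.
    if state in (INSIDE, INSIDE_AFTER_STATUTE):
      capture += line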
    if re.match('^-CITE-', line):
      if state == OUTSIDE:
        state = INSIDE
        capture = line
      elif state in (INSIDE, INSIDE_AFTER_STATUTE):
        raise Exception("-CITE- in -CITE-??", lineNumber)
      else:
        raise NotImplementedError(state)
    elif re.match('^-End-', line):
      if state == OUTSIDE:
        raise Exception("-End- without -CITE-??", lineNumber)
      elif state == INSIDE:
        yield False, capture
        state = OUTSIDE
      elif state == INSIDE_AFTER_STATUTE:
        yield True, capture
        state = OUTSIDE
      else:
        raise NotImplementedError(state)
    elif re.match('^-STATUTE-', line):
      if state == OUTSIDE:
        raise Exception("-STATUTE- without -CITE-??", lineNumber)
      elif state == INSIDE:
        state = INSIDE_AFTER_STATUTE
      elif state == INSIDE_AFTER_STATUTE:
        raise Exception("-STATUTE- after -STATUTE-??", lineNumber)
      else:
        raise NotImplementedError(state)
  if state != OUTSIDE:
    raise Exception("EOF in -CITE-??")

for withStatute, cite in eachCite(sys.stdin):
  if withStatute:
    print("found cite with statute:")
    print(cite)
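For comparison, a multi-line regex version might look like the sketch below. It is not part of the original answer: the pattern and the function name `cites_with_statute` are assumptions, and it needs the whole input in memory as one string, which is exactly the trade-off against the streaming approach.

```python
import re

# Assumption: -CITE- and -End- each sit alone at the start of a line.
CITE_BLOCK = re.compile(r"^-CITE-[ \t]*\n.*?^-End-[ \t]*$",
                        re.DOTALL | re.MULTILINE)

def cites_with_statute(text):
    """Return every -CITE- ... -End- block in text that has a -STATUTE- marker."""
    blocks = CITE_BLOCK.findall(text)
    return [b for b in blocks if re.search(r"^-STATUTE-", b, re.MULTILINE)]
```

This is shorter, but unlike the state machine it does not report malformed input such as a missing -End-.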

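If you don't want eachCite to read from sys.stdin, you can pass it any other iterable of lines instead, such as an open file object.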
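One way to read from a file and write the matching blocks to another file is sketched below. So that the example runs on its own, `iter_cites` is a condensed, hypothetical stand-in for `eachCite`; `write_statute_cites` and its path arguments are likewise illustrative names, not part of the original answer.

```python
def iter_cites(lines):
    """Yield (has_statute, text) for each -CITE- ... -End- block in lines."""
    capture = None        # None means we are outside a block
    has_statute = False
    for number, line in enumerate(lines, start=1):
        if line.startswith("-CITE-"):
            if capture is not None:
                raise ValueError(f"line {number}: -CITE- inside -CITE-")
            capture, has_statute = [], False
        if capture is None:
            if line.startswith("-End-"):
                raise ValueError(f"line {number}: -End- without -CITE-")
            continue      # ignore anything between blocks
        capture.append(line)
        if line.startswith("-STATUTE-"):
            has_statute = True
        elif line.startswith("-End-"):
            yield has_statute, "".join(capture)
            capture = None
    if capture is not None:
        raise ValueError("end of input inside -CITE-")

def write_statute_cites(src_path, dst_path):
    """Copy every block that has a -STATUTE- section; return how many were copied."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for has_statute, cite in iter_cites(src):
            if has_statute:
                dst.write(cite)
                count += 1
    return count
```

A call such as `write_statute_cites("input.txt", "output.txt")` (placeholder file names) would then stream the source file once and keep at most one block in memory.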
