Search text from a web scrape and convert the following 4 lines into a Python DataFrame

Example text paragraph. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Section 1 - "Summary"
1 - sdgge
2 - hjsdhdc
3 - sahdfda
4 - sahfdfds
Example text paragraph. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Section 2 - "Introduction"
1 - abcdef
2 - jhfdgsa
3 - sadgffe
4 - sdjddasd

1 Answer

User

#1 · Posted on 2024-05-13 20:20:45

I would do this by iterating over each line and applying regular-expression logic:

import re
import pandas as pd

# my_text is assumed to hold the scraped text, with each item on its own line
my_lines = my_text.split("\n")

# instantiate an empty list of records
records = []

# each pattern captures the value to store in group 1
my_patterns = {
    "section": re.compile(r"(section\s*[0-9]+)", re.I),
    "1": re.compile(r"1 - ([a-z]+)"),
    "2": re.compile(r"2 - ([a-z]+)"),
    "3": re.compile(r"3 - ([a-z]+)"),
    "4": re.compile(r"4 - ([a-z]+)"),
}
rec = {}

# loop through each line and test it against every pattern
for x in my_lines:
    for key, pattern in my_patterns.items():
        match = pattern.search(x)
        if match is None:
            continue

        if key == "section":
            # a new section starts: save the record collected so far
            if rec:
                records.append(rec)

            # start a new record
            rec = {}

        # always add the captured value to the current record
        rec[key] = match.group(1)

# when done looping, add the last record (if there is one)
if rec:
    records.append(rec)

# convert to a dataframe
df = pd.DataFrame(records)


     section       1        2        3         4
0  Section 1   sdgge  hjsdhdc  sahdfda  sahfdfds
1  Section 2  abcdef  jhfdgsa  sadgffe  sdjddasd
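
The loop above relies on each item sitting on its own line. If the scrape instead returns everything as one flat string with no line breaks, a variation that scans the whole text with a single combined pattern may be easier. The following is only a sketch under that assumption: the shortened sample string, the `token` pattern and its group names (`section`, `num`, `value`) are made up for illustration and are not part of the original answer.

```python
import re
import pandas as pd

# Shortened sample text, deliberately kept as one flat string (assumption:
# the scraper delivers the page text without line breaks).
my_text = (
    'Example text paragraph. Lorem ipsum dolor sit amet. '
    'Section 1 - "Summary" 1 - sdgge 2 - hjsdhdc 3 - sahdfda 4 - sahfdfds '
    'Example text paragraph. Lorem ipsum dolor sit amet. '
    'Section 2 - "Introduction" 1 - abcdef 2 - jhfdgsa 3 - sadgffe 4 - sdjddasd'
)

# One alternation: either a section header or a numbered item (1 to 4).
# Only the word "section" is matched case-insensitively.
token = re.compile(
    r'(?P<section>(?i:section)\s*[0-9]+)'
    r'|\b(?P<num>[1-4]) - (?P<value>[a-z]+)'
)

records = []
rec = {}

# finditer walks the matches left to right, so items follow their header
for m in token.finditer(my_text):
    if m.group("section"):
        # a new section starts: save the previous record, if any
        if rec:
            records.append(rec)
        rec = {"section": m.group("section")}
    elif rec:
        # numbered items that appear before any section header are ignored
        rec[m.group("num")] = m.group("value")

# add the last record
if rec:
    records.append(rec)

df = pd.DataFrame(records)
print(df)
```

Because the section header is consumed before the numbered items are scanned, the `1` in `Section 1 - "Summary"` is not mistaken for item 1. The resulting frame has the same columns as the line-based version: `section`, `1`, `2`, `3`, `4`.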
