How can I use a regular expression to extract (speaker, text) tuples from call transcripts?

Posted 2024-04-20 12:02:20


For my master's thesis, I need to extract (speaker, text) tuples from transcripts of corporate earnings calls.

The transcripts have the following form:

OPERATOR: Some text with numbers, special characters and linebreaks.

NAME, COMPANY, POSITION: Some text with numbers, special characters and linebreaks.

NAME: Some text with numbers, special characters and linebreaks.

I want to extract all (speaker, text) tuples from the document. For example:

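[("OPERATOR", "Some text with numbers, special characters and linebreaks."),
 ("NAME, COMPANY, POSITION", "Some text with numbers, special characters and linebreaks."),
 ("NAME", "Some text with numbers, special characters and linebreaks.")]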

So far, I have tried several regular expressions with Python's re.findall function.

Here is an example excerpt:

example = """OPERATOR: Good day, ladies and gentlemen, and welcome to the first-quarter 2012
Agilent Technologies earnings conference call. My name is Keith, and I will be
your operator for today. At this time, all participants are in a listen-only
mode. Later on, we will have a question and answer session. (Operator
Instructions) As a reminder, today's conference is being recorded for replay
purposes.

And I would now like to turn the conference over to your host for today, Ms.
Alicia Rodriguez, Vice President of Investor Relations. Please go ahead, ma'am.

ALICIA RODRIGUEZ, VP - IR, AGILENT TECHNOLOGIES INC: Thank you, Keith, and
welcome, everyone, to Agilent's first quarter conference call for fiscal-year
2012. With me are Agilent's President and CEO, Bill Sullivan, as well as Senior
Vice President and CFO, Didier Hirsch. Joining in the Q&A after Didier's
comments will be Agilent's Chief Operating Officer, Ron Nersesian, and the
Presidents of our Electronic Measurement, Life Sciences, and Chemical Analysis
Groups -- Guy Sene, Nick Roelofs, and Mike McMullen.

You can find the press release and information to supplement today's discussion
on our website at www.investor.agilent.com. While there, please click on the
link for financial results, where you will find revenue breakouts and historical
financials for Agilent's operations. We will also post a copy of the prepared
remarks following this call. For any non-GAAP financial measures, you will find
the most directly comparable GAAP financial metrics and reconciliations on our
website.

We will make forward-looking statements about the financial performance of the
Company. These statements are subject to risks and uncertainties, and are only
valid as of today. The Company assumes no obligation to update them. Please look
at the Company's recent SEC filings for a more complete picture of our risks and
other factors.

Before turning the call over to Bill, I would like to remind you that Agilent
will host its annual analysts meeting in New York City on March 8. Details about
the meeting and webcast will be available on the Agilent investor relations
website two weeks prior.

And now, I'd like to turn the call over to Bill.

BILL SULLIVAN, PRESIDENT AND CEO, AGILENT TECHNOLOGIES INC: Thanks, Alicia, and
hello, everyone. Agilent's Q1 orders of $1.62 billion were flat versus last
year. Q1 revenues of $1.64 billion were up 7% year-over-year. Non-GAAP EPS was
$0.69 per share, and operating margin was 19%."""

My code is:

import re

# First approach: non-greedy text group ([\s\S]+?)
r = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
re.findall(r, example)

# Second approach: greedy text group ([\s\S]+)
r = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)
re.findall(r, example)

The problem with the first (non-greedy) approach is that it does not capture the speaker's full text.

The problem with the second (greedy) approach is that it does not stop when the next speaker appears.
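Both behaviours can be reproduced on a shortened, made-up sample. The `sample` string and the variable names below are illustrative only and are not taken from the real transcripts. With the lazy `+?` at the very end of the pattern, the text group is satisfied after a single character. With the greedy `+`, it runs to the end of the string and swallows the next speaker.

```python
import re

# Hypothetical, shortened stand-in for a real transcript.
sample = (
    "OPERATOR: Good day.\n"
    "Please go ahead.\n"
    "\n"
    "ALICIA RODRIGUEZ, VP - IR: Thank you, Keith.\n"
)

# The two patterns from the question, unchanged.
lazy = re.compile(r"^([^a-z:]+?):([\s\S]+?)", flags=re.MULTILINE)
greedy = re.compile(r"^([^a-z:]+?):([\s\S]+)", flags=re.MULTILINE)

lazy_matches = lazy.findall(sample)
greedy_matches = greedy.findall(sample)

# Lazy: nothing after the text group forces it to grow, so each text
# group contains exactly one character.
print(lazy_matches)

# Greedy: a single match whose text group also contains the second speaker.
print(greedy_matches)
```

Note that `[^a-z:]` also matches newlines, so a speaker group can pick up the blank line that precedes it.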

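Edit: additional information

  • The text group can also contain colons. In some cases, a colon appears right after the first word of a line, e.g. "For\example:…"
  • The speaker group can also span multiple lines, e.g. when the company name and position description are long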
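One possible direction that takes both points into account is a lazy text group followed by a lookahead for the next speaker label (or the end of the string). This is only a sketch under stated assumptions: a speaker label starts a line with an uppercase letter, contains no lowercase letters or colons, and may continue over following non-blank lines. The names `SPEAKER`, `TURN`, `extract_turns` and the `sample` string are hypothetical.

```python
import re

# Assumption: a speaker label starts a line with an uppercase letter, has no
# lowercase letters or colons, and may continue over following non-blank lines.
SPEAKER = r"(?:[A-Z][^a-z:\n]*(?:\n[^a-z:\n]+)*)"

# Lazy text group that stops right before the next speaker label or at the
# end of the string. DOTALL lets "." cross line breaks.
TURN = re.compile(
    rf"^(?P<speaker>{SPEAKER}):\s*(?P<text>.*?)(?=^{SPEAKER}:|\Z)",
    flags=re.MULTILINE | re.DOTALL,
)


def extract_turns(transcript):
    """Return a list of (speaker, text) tuples (hypothetical helper)."""
    turns = []
    for match in TURN.finditer(transcript):
        # Collapse line breaks inside a multi-line speaker label.
        speaker = " ".join(match.group("speaker").split())
        text = match.group("text").strip()
        turns.append((speaker, text))
    return turns


# Made-up sample with a colon inside the text and a two-line speaker label.
sample = (
    "OPERATOR: Good day, and welcome.\n"
    "For\n"
    "example: this line has a colon.\n"
    "\n"
    "BILL SULLIVAN, PRESIDENT AND CEO,\n"
    "AGILENT TECHNOLOGIES INC: Thanks, Alicia.\n"
)

turns = extract_turns(sample)
for speaker, text in turns:
    print(repr(speaker), repr(text))
```

A line inside the text that begins with an all-uppercase word directly followed by a colon (for instance "Q: ...") would still be mistaken for a speaker, so the label assumption may need tightening for real transcripts.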
