Parsing text with Python: unstructured but similar information in different formats

4 votes
1 answer
1801 views
Asked 2025-04-16 15:25

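I'm trying to use Python to parse thousands of spec-sheet text files containing company, material, and chemical property information (specifically, Material Safety Data Sheets). The files carry similar information in similar layouts, but the structure is loose: a human can read them, yet they are not easy to parse programmatically (they are not XML or CSV, for example). Simply put, the information is a mess.

The data was originally entered by hand by different people at different companies. Another group of people then transcribed it into text files (using optical character recognition to produce the txt files).

Are there any parsing libraries or patterns for extracting this kind of information? (This seems like a "common" data-entry problem.) A lot of regular expressions will certainly be involved. I have no experience with natural language processing libraries. Are they a good fit for this problem?

My first thought was to sort the files into categories and then write a set of parsing functions for each format. Unfortunately, that would probably only cover a small part of the problem, and the number of distinct cases could get complicated quickly.

Since this question is fairly broad, here are some examples to illustrate the problem.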
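
To make that idea concrete, here is a minimal sketch of a classify-then-dispatch layout. Everything in it is an assumption made for illustration: the format names, the detection patterns, and the `classify`, `parse_labelled`, and `parse_unknown` helpers are hypothetical and have not been run against the real files.

```python
import re

# Hypothetical format detectors: (format name, pattern that identifies it).
FORMAT_PATTERNS = [
    ("labelled", re.compile(r"^\s*MANUFACTURER('S NAME)?\s*:", re.I | re.M)),
    ("roman_sections", re.compile(r"^\s*SECTION\s+[IVX]+\b", re.I | re.M)),
]

def classify(text):
    """Return the name of the first format whose pattern matches, else 'unknown'."""
    for name, pattern in FORMAT_PATTERNS:
        if pattern.search(text):
            return name
    return "unknown"

def parse_labelled(text):
    # Placeholder parser: collect "LABEL: value" lines into a dict.
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            fields[label.strip().upper()] = value.strip()
    return fields

def parse_unknown(text):
    # Fallback: keep the non-empty lines so they can be reviewed by hand.
    return {"RAW": [ln.strip() for ln in text.splitlines() if ln.strip()]}

# Formats without a dedicated parser fall through to parse_unknown.
PARSERS = {"labelled": parse_labelled}

def parse(text):
    return PARSERS.get(classify(text), parse_unknown)(text)
```

The fallback matters as much as the parsers: files that match no known format are kept in raw form, which gives a running list of the cases still to be handled.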

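Address information
Every file contains company information such as a name and address. The values may or may not be labelled, may or may not sit on a single line, and so on. In short, nearly every combination seems to occur.

Example (with field labels):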

MANUFACTURER: Foo Bar Inc.  
ADDRESS: 123 Foo St.  
Bar, CA 90012

Example (without field labels):

Foo Bar Inc.  
123 Foo St.  
Bar, CA 90012

Example (sometimes with extra blank lines between entries):

FOO BAR INC.

123 FOO ST.

BAR, CA 90012

Example (inconsistent field names):

MANUFACTURER'S NAME: FOO BAR INC.  
CREATIVE DIVISION  
ADDRESS: 123 FOO ST.  
CITY, STATE & ZIP: BAR, CALIFORNIA 90012  
PHONE NUMBER: 310-111-2222
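
One possible way to cope with the inconsistent labels is to map every variant to a canonical name and attach unlabelled lines to the most recent field. The sketch below assumes an alias table built by hand from the examples above; `ALIASES`, `LABEL_RE`, and `parse_labelled_block` are illustrative names, not part of any library.

```python
import re

# Assumed alias table: label variants seen in the examples -> canonical name.
ALIASES = {
    "MANUFACTURER": "manufacturer",
    "MANUFACTURER'S NAME": "manufacturer",
    "ADDRESS": "address",
    "CITY, STATE & ZIP": "city_state_zip",
    "PHONE NUMBER": "phone",
}

# A label is letters, spaces, apostrophes, commas or ampersands before a colon.
LABEL_RE = re.compile(r"^\s*([A-Z][A-Z ',&]*?)\s*:\s*(.*)$", re.I)

def parse_labelled_block(lines):
    """Group values under canonical names; unlabelled lines continue the previous field."""
    record = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between entries
        m = LABEL_RE.match(line)
        key = ALIASES.get(m.group(1).upper()) if m else None
        if key:
            current = key
            value = m.group(2).strip()
        else:
            value = line  # no known label: keep the whole line
        if current is None:
            current = "unlabelled"
        if value:
            record.setdefault(current, []).append(value)
    return record
```

With the second labelled example, "CREATIVE DIVISION" ends up under `manufacturer` because it follows that label. A block with no labels at all lands entirely under `unlabelled`, so it still needs a positional rule (name first, then street, then city) that this sketch does not attempt.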

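Section information
The spec sheets also have similar sections, but the order, headings, numbering style, and separators are all inconsistent.

Example: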

========================================
SECTION 1 -- MATERIALS
========================================

Example:

Section I. Materials
------------------------------------------

Example:

----- Section 3       Materials
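
The header variants above can probably be covered by a couple of regular expressions plus a small Roman-numeral converter. This is a sketch under the assumption that headers either contain the word "Section" or start with a number followed by a period; `SECTION_RE`, `NUMBERED_RE`, `roman_to_int`, and `parse_header` are names invented for this example.

```python
import re

# "Section" keyword, an Arabic or Roman numeral, an optional separator, a title.
SECTION_RE = re.compile(
    r"^[\s=\-]*SECTION\s+(?P<num>\d+|[IVX]+)\b\s*(?:--|[.\-:])?\s*(?P<title>\S.*?)\s*$",
    re.IGNORECASE,
)
# Bare numbered heading such as "1.    Materials".
NUMBERED_RE = re.compile(r"^\s*(?P<num>\d+)\.\s+(?P<title>\S.*?)\s*$")

def roman_to_int(text):
    """Convert a Roman numeral using only I, V and X (enough for section numbers)."""
    values = {"I": 1, "V": 5, "X": 10}
    total, prev = 0, 0
    for ch in reversed(text.upper()):
        value = values[ch]
        # A smaller value to the left of a larger one is subtracted (IV, IX).
        total += value if value >= prev else -value
        prev = value
    return total

def parse_header(line):
    """Return (section_number, title) if the line looks like a header, else None."""
    m = SECTION_RE.match(line) or NUMBERED_RE.match(line)
    if not m:
        return None
    num = m.group("num")
    number = int(num) if num.isdigit() else roman_to_int(num)
    return number, m.group("title")
```

Requiring either the keyword or the "number, period, space" shape keeps ordinary lines such as `Vapor Pressure (mm Hg): N/A` or `12.34` from being mistaken for headers.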

Sometimes the width of the file changes, which causes a line to wrap onto the next one.

Example:

===================================================
1.    Materials
===================================================

becomes:

=========================================
==========
1.    Materials
=========================================
==========
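
Wrapped separator lines like these can be stitched back together before any other parsing. The sketch below assumes that a separator is a line made only of `=` or `-` and that two such lines in a row, using the same character, are one wrapped separator; `RULE_RE` and `merge_wrapped_rules` are illustrative names.

```python
import re

# A line consisting only of at least four '=' or four '-' characters.
RULE_RE = re.compile(r"^\s*([=\-])\1{3,}\s*$")

def merge_wrapped_rules(lines):
    """Join a separator line with the separator line directly above it."""
    merged = []
    for line in lines:
        cur = RULE_RE.match(line)
        prev = RULE_RE.match(merged[-1]) if merged else None
        if cur and prev and cur.group(1) == prev.group(1):
            # Same rule character on consecutive lines: treat as one wrapped rule.
            merged[-1] = merged[-1].rstrip() + line.strip()
        else:
            merged.append(line)
    return merged
```

The trade-off is that two separators written on consecutive lines on purpose would also be merged, which seems harmless if separators are only used to find section boundaries.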

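Here is a complete example:
Hopefully this clarifies the problems that come up when parsing these files. Note the wrapped lines, the values scattered across several lines, and so on. Not every file has exactly the same structure; some use a different layout and put the information in different places. Here is a link to a copy of the paper version.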

MATERIAL SAFETY DATA SHEET

=================================================================
=========
SECTION I-PRODUCT AND PREPARATION INFORMATION
=================================================================
=========

MANUFACTURER:         Some Company Inc     EMERGENCY AND
INFORMATION
TELEPHONE
(111)222-3333
ADDRESS:              Some Road
City, ST
12346

IDENTITY (AS USED ON
LABEL AND LIST):      Some Identity

PREPARATION DATE:     Some Date

=================================================================
=========
SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
=================================================================
=========

OSHA
ACGIH
HAZARDOUS COMPONENTS             CAS#       PEL   TWA        TLV
%
(SPECIFIC CHEMICAL IDENTITY;
COMMON NAME(S)
-----------------------------------------------------------------
---------

Some Chemical             111-22-3   15    10         10
12.34


=================================================================
=========
SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS
=================================================================
=========

Boiling Point:              N/A  Specific Gravity (H20=1):   N/A
Vapor Pressure (mm Hg):     N/A  Melting Point:              N/A
Vapor Density (AIR=1)       N/A  Evaporation Rate
(Butyl Acetate=1)           N/A
Solubility in Water:        None

Appearance:  Solid, various colors, may have slight
odor.

N/A = Not applicable

=================================================================
=========
SECTION IV-FIRE AND EXPLOSION HAZARD DATA
=================================================================
=========

FLASH POINT (METHOD USED):  None
FLAMMABLE LIMITS:  None          LEL:  N/A        UEL:  N/A
EXTINGUISHING MEDIA:  None
SPECIAL FIRE FIGHTING PROCEDURES:  None required.
UNUSUAL FIRE AND EXPLOSION HAZARDS:  None.

=================================================================
=========
SECTION V-REACTIVITY DATA
=================================================================
=========

STABILITY:  Stable
CONDITIONS TO AVOID:  None
INCOMPATIBILITY (MATERIALS TO AVOID):  None
HAZARDOUS POLYMERIZATION:  Will not occur

=================================================================
=========
SECTION VI-HEALTH HAZARD DATA
=================================================================
=========

ROUTES OF ENTRY:

INHALATION:  Yes
SKIN:  Possibly
INGESTION:  Possibly
EYES:  Possibly

HEALTH HAZARDS (ACUTE AND CHRONIC):  Pneumoconiosis, silicosis,
emphysema,
nose and throat irritation, eye irritation, skin irritation in
some.

CARCINOGENICITY:  No applicable information found.

SIGNS AND SYMPTOMS OF EXPOSURE:  Coughing, sneezing; irritation
of the
mucous membranes; eye irritation; skin irritation or rash, dry
throat.

MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE:  Nasal,
bronchial or
pulmonary conditions which tend to restrict breathing, skin
abrasions.

EMERGENCY AND FIRST AID PROCEDURES:  Remove to fresh air,
irrigate eyes,
wash with soap and water, contact physician if necessary.

=================================================================
=========
SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE
=================================================================
=========

STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED:
Normal clean-up
procedures.

WASTE DISPOSAL METHOD:  Standard landfill methods consistent with
applicable state and federal regulations.

PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING:  Use caution not
to drop,
crush, break or chip.

OTHER PRECAUTIONS:  Do not use at speeds greater than the
not-to-exceed
speed printed on the hub assembly.

=================================================================
=========
SECTION VIII-CONTROL MEASURES
=================================================================
=========

RESPIRATORY PROTECTION (SPECIFY TYPE):  OSHA or NIOSH approved
respirators
may be required.

VENTILATION:  Local exhaust recommended.  Special:  N/A.
Mechanical:  Useful.  Other:  N/A.

PROTECTIVE GLOVES:  May be useful.

EYE PROTECTION:  Recommended.

OTHER PROTECTIVE CLOTHING OR EQUIPMENT:  Not required.

WORK/HYGIENIC PRACTICES:  Keep clothing and area clean.  Wash to
remove

1 Answer

3

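I would write a loop with a number of state variables that processes the file one line at a time and uses those variables to keep track of where it is. The conditionals (the if statements) inside the loop are the same questions a human would ask when parsing the file by hand.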

"
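for line in file:
    if ":" in line:  # a colon suggests "FIELD NAME: value"
        name, _, data = line.partition(":")
        field_name = normalize(name)  # normalize() is a placeholder helper
    else:
        # no label: assume the next field in the expected order (placeholder helper)
        field_name = next_field_in_list(previous_field)
        data = line
    previous_field = field_name  # state carried over to the next line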
"

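And so on. From your examples, I can't tell whether you have at least a fixed field order, an upper bound on the number of fields per record, or a clear record delimiter. Without any of those, I think this will be harder to write.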
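
For what it's worth, here is a slightly fuller, runnable sketch of that kind of loop, tried only against the sample sheet in the question. It is a variation, not the exact pseudocode: since no fixed field order is known, a line without a colon is treated as a wrapped continuation of the previous field rather than looked up with `next_field_in_list`. The regular expressions and the `parse_sheet` name are assumptions for illustration.

```python
import re

RULE_RE = re.compile(r"^\s*[=\-]{4,}\s*$")  # separator lines
SECTION_RE = re.compile(r"^\s*SECTION\s+([IVX]+|\d+)\b[\s.\-:]*(.*)$", re.I)
FIELD_RE = re.compile(r"^([A-Z][^:]*?):\s*(.*)$")  # "Label: value"

def normalize(name):
    """Collapse whitespace and upper-case a label (an assumed normalization)."""
    return " ".join(name.split()).upper()

def parse_sheet(lines):
    """Line-by-line state machine that tracks the current section and field."""
    sections = {}
    section = "PREAMBLE"  # state: which section we are in
    field = None          # state: which field continuation lines belong to
    for raw in lines:
        line = raw.strip()
        if not line or RULE_RE.match(line):
            continue  # skip blanks and separators
        m = SECTION_RE.match(line)
        if m:
            section, field = normalize(m.group(2)), None
            continue
        m = FIELD_RE.match(line)
        if m:
            field = normalize(m.group(1))
            sections.setdefault(section, {})[field] = m.group(2).strip()
        elif field is not None:
            # No colon: treat as a wrapped continuation of the previous field.
            rec = sections[section]
            rec[field] = (rec[field] + " " + line).strip()
        else:
            # Nothing to attach to yet: keep the line for later inspection.
            sections.setdefault(section, {}).setdefault("UNLABELLED", []).append(line)
    return sections
```

This handles wrapped values such as the HEALTH HAZARDS entry, but not the two-column layout in Section III or the table in Section II; those would need extra states of their own.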
