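Parsing text with Python: unstructured but similar information in different formats
I'm trying to parse thousands of spec-sheet text files with Python. They contain information about companies, materials, chemical properties, and so on (specifically, they are Material Safety Data Sheets). The information in these files follows similar formats but is loosely structured: humans can read it, but it isn't easy to parse programmatically (for example, it isn't XML or CSV). In short, the information is a mess.
The data was originally entered by hand by different people at different companies. Another group of people then transcribed it into text files (using OCR to turn it into .txt files).
Are there any parsing libraries or patterns for extracting this kind of information? (This seems like a "common" data-entry problem.) There will surely be a lot of regular expressions involved. I have no experience with natural language processing libraries. Are they appropriate for this problem?
My initial thought was to sort the files into categories and then write a set of parsing functions for each format. Unfortunately, that may only cover a small part of the problem, and the different cases could quickly become complicated.
Because the question is fairly general, I'll provide some examples to illustrate it.
Address information
Each file contains company information such as name and address. This information may or may not have identifiers, may or may not be on a single line, and so on. In short, there seem to be all kinds of combinations.
Example (with field labels):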
MANUFACTURER: Foo Bar Inc.
ADDRESS: 123 Foo St.
Bar, CA 90012
Example (without field labels):
Foo Bar Inc.
123 Foo St.
Bar, CA 90012
Example (sometimes with extra lines between the pieces of information):
FOO BAR INC.
123 FOO ST.
BAR, CA 90012
Example (inconsistent field names):
MANUFACTURER'S NAME: FOO BAR INC.
CREATIVE DIVISION
ADDRESS: 123 FOO ST.
CITY, STATE & ZIP: BAR, CALIFORNIA 90012
PHONE NUMBER: 310-111-2222
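One way to cope with the inconsistent labels above is to map every label variant to a canonical field name and to treat unlabeled lines as a continuation of the previous field. The following is only a minimal sketch under those assumptions: `LABEL_ALIASES`, `LABEL_RE`, and `parse_labeled_block` are hypothetical names, and the alias table covers only the labels shown in these examples.

```python
import re

# Hypothetical alias table: label variants from the examples -> canonical name.
LABEL_ALIASES = {
    "MANUFACTURER": "manufacturer",
    "MANUFACTURER'S NAME": "manufacturer",
    "ADDRESS": "address",
    "CITY, STATE & ZIP": "city_state_zip",
    "PHONE NUMBER": "phone",
}

# A label is a run of letters, spaces, apostrophes, commas, or ampersands before a colon.
LABEL_RE = re.compile(r"^\s*([A-Za-z][A-Za-z' ,&]*?)\s*:\s*(.*)$")


def parse_labeled_block(lines):
    """Collect 'LABEL: value' pairs; unlabeled lines continue the previous field."""
    fields = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between pieces of information
        match = LABEL_RE.match(line)
        label = match.group(1).upper() if match else None
        if label in LABEL_ALIASES:
            current = LABEL_ALIASES[label]
            fields.setdefault(current, []).append(match.group(2))
        elif current is not None:
            fields[current].append(line)  # continuation of the previous field
        else:
            fields.setdefault("unlabeled", []).append(line)
    # Join the collected pieces of each field, dropping empty values.
    return {name: " ".join(v for v in values if v) for name, values in fields.items()}
```

With the "inconsistent field names" example, `CREATIVE DIVISION` ends up appended to the manufacturer field. A block with no labels at all lands under the `unlabeled` key and would still need a separate heuristic, for instance one based on line position.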
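Section information
The spec sheets also have similar sections, but the order, headings, numeral type, and separators are inconsistent.
Example: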
========================================
SECTION 1 -- MATERIALS
========================================
Example:
Section I. Materials
------------------------------------------
Example:
----- Section 3 Materials
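The three heading styles above could plausibly be recognized by a single tolerant regular expression that accepts Arabic or Roman numerals and ignores rule characters around the text. This is a rough sketch, not a complete solution: `SECTION_RE`, `roman_to_int`, and `parse_section_heading` are hypothetical names, and headings that omit the word "Section" would need a further pattern.

```python
import re

SECTION_RE = re.compile(
    r"""
    ^[\s=\-]*                      # optional leading rule characters
    SECTION\s+                     # the word Section, in any case
    (?P<number>\d+|[IVXLC]+)\b     # Arabic or Roman numeral
    [\s.\-:]*                      # separators such as dashes, dots, spaces
    (?P<title>.*?)                 # the heading text
    [\s=\-]*$                      # optional trailing rule characters
    """,
    re.IGNORECASE | re.VERBOSE,
)

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(text):
    """Convert a Roman numeral (I, V, X, L, C only) to an integer."""
    values = [ROMAN[ch] for ch in text.upper()]
    total = 0
    for i, value in enumerate(values):
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def parse_section_heading(line):
    """Return (section_number, upper-cased title), or None if not a heading."""
    match = SECTION_RE.match(line)
    if not match:
        return None
    number = match.group("number")
    index = int(number) if number.isdigit() else roman_to_int(number)
    return index, match.group("title").strip().upper()
```

Normalizing the numeral to an integer and the title to upper case makes the three variants compare equal, so later code can key on the section number regardless of which style a file uses.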
Sometimes the width of the file changes, causing a line to wrap onto the next line.
Example:
===================================================
1. Materials
===================================================
Becomes:
=========================================
==========
1. Materials
=========================================
==========
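For the wrapped separator lines in particular, one possible preprocessing step is to glue consecutive rule lines made of the same character back together before any other parsing. The sketch below assumes that a rule is at least three repeated `=` or `-` characters; `RULE_RE` and `merge_wrapped_rules` are hypothetical names. It would also merge two genuinely separate adjacent rules, which may or may not matter for these files.

```python
import re

# A rule line: optional whitespace around three or more identical '=' or '-' characters.
RULE_RE = re.compile(r"^\s*([=\-])\1{2,}\s*$")


def merge_wrapped_rules(lines):
    """Join a rule line onto the previous line when both use the same rule character."""
    merged = []
    for line in lines:
        current = RULE_RE.match(line)
        previous = RULE_RE.match(merged[-1]) if merged else None
        if current and previous and current.group(1) == previous.group(1):
            merged[-1] = merged[-1].rstrip() + line.strip()
        else:
            merged.append(line.rstrip("\n"))
    return merged
```

Wrapped prose lines are a harder case, since a continuation line cannot be told apart from a new unlabeled field by its characters alone.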
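Here is a complete example:
Hopefully this clarifies the problems encountered when parsing the files. You'll notice line wrapping, information spread across different lines, and so on. Not all files have exactly the same structure; some have a different format, with information placed in different locations. Here is a link to a copy of the paper version.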
MATERIAL SAFETY DATA SHEET
=================================================================
=========
SECTION I-PRODUCT AND PREPARATION INFORMATION
=================================================================
=========
MANUFACTURER: Some Company Inc EMERGENCY AND
INFORMATION
TELEPHONE
(111)222-3333
ADDRESS: Some Road
City, ST
12346
IDENTITY (AS USED ON
LABEL AND LIST): Some Identity
PREPARATION DATE: Some Date
=================================================================
=========
SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
=================================================================
=========
OSHA
ACGIH
HAZARDOUS COMPONENTS CAS# PEL TWA TLV
%
(SPECIFIC CHEMICAL IDENTITY;
COMMON NAME(S)
-----------------------------------------------------------------
---------
Some Chemical 111-22-3 15 10 10
12.34
=================================================================
=========
SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS
=================================================================
=========
Boiling Point: N/A Specific Gravity (H20=1): N/A
Vapor Pressure (mm Hg): N/A Melting Point: N/A
Vapor Density (AIR=1) N/A Evaporation Rate
(Butyl Acetate=1) N/A
Solubility in Water: None
Appearance: Solid, various colors, may have slight
odor.
N/A = Not applicable
=================================================================
=========
SECTION IV-FIRE AND EXPLOSION HAZARD DATA
=================================================================
=========
FLASH POINT (METHOD USED): None
FLAMMABLE LIMITS: None LEL: N/A UEL: N/A
EXTINGUISHING MEDIA: None
SPECIAL FIRE FIGHTING PROCEDURES: None required.
UNUSUAL FIRE AND EXPLOSION HAZARDS: None.
=================================================================
=========
SECTION V-REACTIVITY DATA
=================================================================
=========
STABILITY: Stable
CONDITIONS TO AVOID: None
INCOMPATIBILITY (MATERIALS TO AVOID): None
HAZARDOUS POLYMERIZATION: Will not occur
=================================================================
=========
SECTION VI-HEALTH HAZARD DATA
=================================================================
=========
ROUTES OF ENTRY:
INHALATION: Yes
SKIN: Possibly
INGESTION: Possibly
EYES: Possibly
HEALTH HAZARDS (ACUTE AND CHRONIC): Pneumoconiosis, silicosis,
emphysema,
nose and throat irritation, eye irritation, skin irritation in
some.
CARCINOGENICITY: No applicable information found.
SIGNS AND SYMPTOMS OF EXPOSURE: Coughing, sneezing; irritation
of the
mucous membranes; eye irritation; skin irritation or rash, dry
throat.
MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE: Nasal,
bronchial or
pulmonary conditions which tend to restrict breathing, skin
abrasions.
EMERGENCY AND FIRST AID PROCEDURES: Remove to fresh air,
irrigate eyes,
wash with soap and water, contact physician if necessary.
=================================================================
=========
SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE
=================================================================
=========
STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED:
Normal clean-up
procedures.
WASTE DISPOSAL METHOD: Standard landfill methods consistent with
applicable state and federal regulations.
PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING: Use caution not
to drop,
crush, break or chip.
OTHER PRECAUTIONS: Do not use at speeds greater than the
not-to-exceed
speed printed on the hub assembly.
=================================================================
=========
SECTION VIII-CONTROL MEASURES
=================================================================
=========
RESPIRATORY PROTECTION (SPECIFY TYPE): OSHA or NIOSH approved
respirators
may be required.
VENTILATION: Local exhaust recommended. Special: N/A.
Mechanical: Useful. Other: N/A.
PROTECTIVE GLOVES: May be useful.
EYE PROTECTION: Recommended.
OTHER PROTECTIVE CLOTHING OR EQUIPMENT: Not required.
WORK/HYGIENIC PRACTICES: Keep clothing and area clean. Wash to
remove
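1 Answer
I would write a loop with lots of state variables that processes each line, using those state variables to keep track of where things currently stand. The conditionals (if) inside the loop are the same questions a human would ask when parsing the file by hand.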
"
for line in file:
    if there is a colon in line:
        field_name = normalize(information before the colon)
        data = information after the colon
    else:
        field_name = next_field_in_list(previous_field)
        data = line
"
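And so on from there. From the examples, I can't tell whether you at least have a fixed field order, an upper bound on the number of fields per record, or a clear record delimiter. Without those, I think this will be harder to write.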
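To make the loop described above concrete, here is a small runnable sketch. It assumes exactly the thing in doubt, namely a fixed field order, which is hard-coded in the hypothetical `FIELD_ORDER` list. The bodies of `normalize` and `next_field_in_list` are illustrative guesses at what those helpers might do, not a definitive implementation.

```python
# Assumed fixed field order for an address block (an assumption, not from the data).
FIELD_ORDER = ["manufacturer", "address", "city_state_zip"]


def normalize(label):
    """Map a raw label to a canonical key (hypothetical alias rules)."""
    key = label.strip().lower()
    aliases = {
        "manufacturer's name": "manufacturer",
        "city, state & zip": "city_state_zip",
    }
    return aliases.get(key, key.replace(" ", "_"))


def next_field_in_list(previous_field):
    """Return the field expected after previous_field in FIELD_ORDER."""
    if previous_field is None:
        return FIELD_ORDER[0]
    if previous_field in FIELD_ORDER:
        position = FIELD_ORDER.index(previous_field) + 1
        if position < len(FIELD_ORDER):
            return FIELD_ORDER[position]
    return previous_field  # unknown field or end of list: stay on the same field


def parse_record(lines):
    """Walk the lines once, tracking the previous field as state."""
    record = {}
    previous_field = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if ":" in line:
            label, _, data = line.partition(":")
            field_name = normalize(label)
            data = data.strip()
        else:
            field_name = next_field_in_list(previous_field)
            data = line
        record.setdefault(field_name, []).append(data)
        previous_field = field_name
    return record
```

This handles the labeled and unlabeled three-line address examples, but it misfiles the `CREATIVE DIVISION` line from the fourth example under the address field, because that line breaks the assumed order. That is the kind of case where extra state variables or a clear record delimiter become necessary.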