Pig:Python UDF搜索文本中的关键字/字符串

0 投票
1 回答
1432 浏览
提问于 2025-04-18 00:47

我有两个文件,一个里面是一些关键词或字符串的列表:

blue fox
the
lazy dog
orange
of
file

另一个文件里面是一些文本:

The blue fox jumped
over the lazy dog
this file has nothing important
lines repeat
this line does not match

我想把第一个文件里的字符串拿出来,去找第二个文件中和这些字符串匹配的行。所以我写了一个Pig脚本,并用到了一个Python的自定义函数:

register match.py using jython as match;
A = LOAD 'words.txt' AS (word:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate match.match(C.$1,line);
dump D;

#match.py
@outputSchema("str:chararray")
def match(wordlist,line):
    linestr = str(line)
    for word in wordlist:
            wordstr = str(word)
            if re.search(wordstr,linestr):
                    return line

结果出现了错误:

"2014-04-01 06:22:34,775 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function"

Detailed Error log:

Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
        at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
        at o

Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function
        at org.apache.pig.PigServer.openIterator(PigServer.java:828)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:538)
        at org.apache.pig.Main.main(Main.java:157)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
        at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
================================================================================

1 个回答

1

我怀疑在我的CDH4.x集群中,"re"模块在jython里是不可用的。我在Python的用户自定义函数(UDF)上没花太多时间,最后是通过写Java的用户自定义函数解决了这个问题。请原谅我的Java代码,因为我还是个新手,可能不是最有效率或者最美观的代码(里面肯定还有一些bug):

package pigext;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.io.IOException;
import java.util.*;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;

public class matchList extends EvalFunc<String> {

  public String exec(Tuple input) throws IOException {
try {
        String line = (String)input.get(0);
        DataBag bag = (DataBag)input.get(1);
        Iterator it = bag.iterator();
        String output = "";
        while (it.hasNext()){
                Tuple t = (Tuple)it.next();
                if (t != null && t.size() > 0 && t.get(0) != null && line != null ) 
                        {
                          String cmd = t.get(0).toString();
                          if ( line.toLowerCase().matches(cmd.toLowerCase()) ) {
                                return (line + "," + cmd);
                                }                         
                        }
         }
        return output;
        } catch (Exception e) {
           throw new IOException("Failed to process row", e);
        }

} }

使用的方法是准备一个文件,里面写满了你想要搜索的正则表达式,每行一个,当然还需要你的目标文本文件。所以,正则表达式文件叫做"wordstext.txt",内容是:

.*?this +blah.*?

而你的文本文件叫"text.txt",内容是:

this blah starts with blah
this    blah has way too many spaces
that won't match
thisblahshouldnotmatch
thisblah should not match either
the line here is this blah
line here has this blah in the middle
line here has this    blah with extra spaces
only has blah
only has this

然后,Pig脚本会是:

REGISTER pigext.jar;
A = LOAD 'wordstest.txt' AS (cmd:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate pigext.matchList(line,C.$1);
dump D;

撰写回答