How do I merge small ORC files into a larger ORC file in Java?

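Most questions and answers on SO and the web discuss using Hive to combine a bunch of small ORC files into one larger file. However, my ORC files are log files separated by day, and I need to keep them separate. I only want to "roll up" the ORC files per day (each day is a directory in HDFS).

I will most likely need to write the solution in Java, and I have come across OrcFileMergeOperator, which may be what I need to use, but it is still too early to tell.

What is the best approach to solving this problem?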


2 answers

  1. # Answer 1

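    You do not need to reinvent the wheel.

    ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE has been able to merge small ORC files into a larger file since Hive 0.14.0. The merge happens at the stripe level, which avoids decompressing and decoding the data, so it is fast. I suggest creating an external table partitioned by day (the partitions are the directories) and then merging them, specifying PARTITION (day_column) as the partition spec.

    See here: LanguageManual+ORC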
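
    Because the question asks for something that can be driven from Java, here is a minimal sketch that only builds the CONCATENATE statement text for one day's partition. The table name `logs`, the partition column `day`, and the ISO date format of the partition value are assumptions for illustration; the sketch does not connect to Hive.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch only: builds the Hive statement text; it does not connect to Hive.
final class ConcatenateStatement {

  // Hypothetical names: replace with the real external table and partition column.
  static final String TABLE = "logs";
  static final String PARTITION_COLUMN = "day";

  /** Returns the CONCATENATE statement for one day's partition. */
  static String forDay(LocalDate day) {
    // Assumption: partition values are stored as ISO dates such as 2018-04-24.
    String value = day.format(DateTimeFormatter.ISO_LOCAL_DATE);
    return String.format("ALTER TABLE %s PARTITION (%s='%s') CONCATENATE",
        TABLE, PARTITION_COLUMN, value);
  }

  public static void main(String[] args) {
    // prints: ALTER TABLE logs PARTITION (day='2018-04-24') CONCATENATE
    System.out.println(forDay(LocalDate.of(2018, 4, 24)));
  }
}
```

    The resulting string could then be submitted through whatever Hive client is already in use, for example from a scheduled job.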

  2. # Answer 2

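    There are good answers here, but none of them let me run a cron job so that the roll-up happens daily. We have log files written to HDFS every day, and I do not want to run a query in Hive every day when I come in.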

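    What I ended up doing seemed more straightforward to me. I wrote a Java program that uses the ORC libraries to scan all the files in a directory and build a list of them. It then opens a new Writer for the "combined" file (its name starts with a "." so that it is hidden from Hive; otherwise Hive would fail). The program then opens each file in the list, reads its contents, and writes them to the combined file. Once all files have been read, it deletes them. I also added the ability to run it on one directory at a time when needed.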
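
    The hide-then-publish flow described above can be sketched with plain `java.nio.file` on a local directory standing in for HDFS. This is an illustration under assumptions, not the program below: the `Merger` interface and the `rollUp` method are hypothetical names, and the actual row copying (which needs the ORC Reader and Writer) is left to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class HiddenRollUpSketch {

  /** Hypothetical hook: a real version would copy rows with the ORC Reader/Writer. */
  interface Merger {
    void merge(List<Path> inputs, Path output) throws IOException;
  }

  /** Combines every *.orc file in dir into finalName and returns the published path. */
  static Path rollUp(Path dir, String finalName, Merger merger) throws IOException {
    // 1. Snapshot the inputs before the working file exists.
    List<Path> inputs;
    try (Stream<Path> entries = Files.list(dir)) {
      inputs = entries
          .filter(p -> p.getFileName().toString().endsWith(".orc"))
          .sorted()
          .collect(Collectors.toList());
    }

    // 2. Write into a dot-prefixed file; its ".working" suffix also keeps it
    //    out of the *.orc filter on a later run.
    Path working = dir.resolve("." + finalName + ".working");
    merger.merge(inputs, working);

    // 3. Delete the inputs only after the combined file is complete.
    for (Path input : inputs) {
      Files.delete(input);
    }

    // 4. Publish by renaming: drop the leading "." and the ".working" suffix.
    return Files.move(working, dir.resolve(finalName));
  }
}
```

    Deleting the inputs before the rename means an earlier combined file with the same final name is simply treated as one more input on a re-run.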

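    NOTE: You will need a schema file. Journald logs can be output as JSON with "journalctl -o json", and you can then either use the Apache ORC tools to generate a schema file or write one by hand. The schema auto-generated by the ORC tools is good, but a hand-written one is always better.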

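    NOTE: To use this code as is, you need a valid keytab, and you must pass -Dkeytab= on the java command line. The code reads it as a JVM system property via System.getProperty("keytab"), so it is a JVM option rather than a classpath entry.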

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.List;
    
    import org.apache.commons.io.IOUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.hadoop.security.UserGroupInformation;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.RecordReader;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;
    
    import com.cloudera.org.joda.time.LocalDate;
    
    public class OrcFileRollUp {
    
      private final static String SCHEMA = "journald.schema";
      private final static String UTF_8 = "UTF-8";
      private final static String HDFS_BASE_LOGS_DIR = "/<baseDir>/logs";
      private static final String keytabLocation = System.getProperty("keytab");
      private static final String kerberosUser = "<userName>";
      private static Writer writer;
    
      public static void main(String[] args) throws IOException {
    
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "Kerberos");
    
        InetAddress myHost = InetAddress.getLocalHost();
        String kerberosPrincipal = String.format("%s/%s", kerberosUser, myHost.getHostName());
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(kerberosPrincipal, keytabLocation);
    
        //Read the date once so that day, month and year cannot disagree around midnight
        LocalDate today = LocalDate.now();
        int currentDay = today.getDayOfMonth();
        int currentMonth = today.getMonthOfYear();
        int currentYear = today.getYear();
    
        Path path = new Path(HDFS_BASE_LOGS_DIR);
    
        FileSystem fileSystem = path.getFileSystem(conf);
        System.out.println("The URI is: " + fileSystem.getUri());
    
    
        //Get Hosts:
        List<String> allHostsPath = getHosts(path, fileSystem);
    
        TypeDescription schema = TypeDescription.fromString(getSchema(SCHEMA)
            .replaceAll("\n", ""));
    
        //Open each file for reading and write contents
        for(int i = 0; i < allHostsPath.size(); i++) {
    
          //filename, e.g. .2018_4_24.orc.working (month and day are not zero-padded)
          String outFile = "." + currentYear + "_" + currentMonth + "_" + currentDay + ".orc.working";
    
          //Create list of files from directory and today's date OR pass a directory in via the command line in format 
          //hdfs://<namenode>:8020/HDFS_BASE_LOGS_DIR/<hostname>/2018/4/24/
          String directory = "";
          Path outFilePath;
          Path argsPath;
          List<String> orcFiles;
    
          if(args.length == 0) {
            directory = currentYear + "/" + currentMonth + "/" + currentDay;
            outFilePath = new Path(allHostsPath.get(i) + "/" + directory + "/" + outFile);
            try {
              orcFiles = getAllFilePath(new Path(allHostsPath.get(i) + "/" + directory), fileSystem);
            } catch (Exception e) {
              continue;
            }
          } else {
            outFilePath = new Path(args[0] + "/" + outFile);
            argsPath = new Path(args[0]);
            try {
              orcFiles = getAllFilePath(argsPath, fileSystem);
            } catch (Exception e) {
              continue;
            }
          }
    
          //Create List of files in the directory
    
          FileSystem fs = outFilePath.getFileSystem(conf);
    
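          //Writer MUST be below ^^ or the combination file will be deleted as well.
          if(fs.exists(outFilePath)) {
            //No new writer can be created, so skip this directory instead of
            //falling through with a null or already closed writer.
            System.out.println(outFilePath + " exists, delete before continuing.");
            if(args.length != 0)
              break;
            continue;
          } else {
           writer = OrcFile.createWriter(outFilePath, OrcFile.writerOptions(conf)
                .setSchema(schema));
          }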
    
          for(int j = 0; j < orcFiles.size(); j++ ) {
            Reader reader = OrcFile.createReader(new Path(orcFiles.get(j)), OrcFile.readerOptions(conf));

            VectorizedRowBatch batch = reader.getSchema().createRowBatch();
            RecordReader rows = reader.rows();

            //nextBatch refills the same batch object until the file is exhausted
            while (rows.nextBatch(batch)) {
              writer.addRowBatch(batch);
            }
            rows.close();
          }
          //Close File
          writer.close();

          //Delete the inputs only after the combined file has been written completely
          for(int j = 0; j < orcFiles.size(); j++ ) {
            fs.delete(new Path(orcFiles.get(j)), false);
          }
    
          //Remove leading "." from ORC file to make visible to Hive
          outFile = fileSystem.getFileStatus(outFilePath)
                                          .getPath()
                                          .getName();
    
          if (outFile.startsWith(".")) {
            outFile = outFile.substring(1);
    
            int lastIndexOf = outFile.lastIndexOf(".working");
            outFile = outFile.substring(0, lastIndexOf);
          }
    
          Path parent = outFilePath.getParent();
    
          fileSystem.rename(outFilePath, new Path(parent, outFile));
    
          if(args.length != 0)
            break;
        }
      }
    
      private static String getSchema(String resource) throws IOException {
        try (InputStream input = OrcFileRollUp.class.getResourceAsStream("/" + resource)) {
          return IOUtils.toString(input, UTF_8);
        }
      }
    
      public static List<String> getHosts(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
        List<String> hostsList = new ArrayList<String>();
        FileStatus[] fileStatus = fs.listStatus(filePath);
        for (FileStatus fileStat : fileStatus) {
          hostsList.add(fileStat.getPath().toString());
        }
        return hostsList;
      }
    
      private static List<String> getAllFilePath(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
        List<String> fileList = new ArrayList<String>();
        FileStatus[] fileStatus = fs.listStatus(filePath);
        for (FileStatus fileStat : fileStatus) {
          if (fileStat.isDirectory()) {
            fileList.addAll(getAllFilePath(fileStat.getPath(), fs));
          } else {
            fileList.add(fileStat.getPath()
                                 .toString());
          }
        }
        //Keep only *.orc files; removing by index inside a forward loop would skip elements
        fileList.removeIf(file -> !file.endsWith(".orc"));
    
        return fileList;
      }
    
    }
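
    The program above reads the keytab location from the `keytab` system property. A small fail-fast check before the Kerberos login can make a missing or unreadable keytab obvious, which helps when the job runs unattended from cron. The class and method names here are hypothetical and are not part of the program above; this is a sketch, not a required step.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

final class KeytabOption {

  /** Returns the value of -Dkeytab=..., or fails with a readable message. */
  static String requireKeytab() {
    String keytab = System.getProperty("keytab");
    if (keytab == null || keytab.trim().isEmpty()) {
      throw new IllegalStateException(
          "Missing keytab: pass its location as the 'keytab' JVM system property");
    }
    if (!Files.isReadable(Paths.get(keytab))) {
      throw new IllegalStateException("Keytab is not readable: " + keytab);
    }
    return keytab;
  }
}
```

    Calling `KeytabOption.requireKeytab()` at the start of `main` and passing the result to the login call would surface a configuration problem before any HDFS work begins.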