Java Camel: splitting a large XML file with a header using a field condition


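I'm trying to set up an Apache Camel route that takes a large XML file as input and then splits the payload into two different files using a field condition. That is, if an ID field starts with 1 it goes into one output file, otherwise it goes into the other. Using Camel is not a requirement, and I have also looked into XSLT and plain Java options, but I feel this should be doable with Camel.

I have covered splitting the actual payload, but I'm having trouble making sure the parent nodes, including the header, are also included in each file. Since the file can be large, I want to make sure the payload is streamed. I feel like I have read hundreds of different questions, blog entries and so on about this, and almost every one of them involves loading the whole file into memory, splitting the file evenly into parts, or using the payload nodes on their own.

My prototype XML file looks like this: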

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>11</id>
            <stuff>One</stuff>
        </order>
        <order>
            <id>20</id>
            <stuff>Two</stuff>
        </order>
        <order>
            <id>12</id>
            <stuff>Three</stuff>
        </order>
    </orders> 
</root>

The result should be two files. Condition true (id starts with 1):

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>11</id>
            <stuff>One</stuff>
        </order>
        <order>
            <id>12</id>
            <stuff>Three</stuff>
        </order>
    </orders> 
</root>

Condition false:

<root>
    <header>
        <title>Testing</title>
    </header>
    <orders>
        <order>
            <id>20</id>
            <stuff>Two</stuff>
        </order>
    </orders> 
</root>

My prototype route:

from("file:" + inputFolder)
.log("Processing file ${headers.CamelFileName}")
.split()
    .tokenizeXML("order", "*") // Includes parent in every node
    .streaming()
    .choice()
        .when(body().contains("id>1"))
            .to("direct:ones")
            .stop()
        .otherwise()
            .to("direct:others")
            .stop()
    .end()
.end();

from("direct:ones")
//.aggregate(header("ones"), new StringAggregator()) // missing end condition
.to("file:" + outputFolder + "?fileName=ones-${in.header.CamelFileName}&fileExist=Append");

from("direct:others")
//.aggregate(header("others"), new StringAggregator()) // missing end condition
.to("file:" + outputFolder + "?fileName=others-${in.header.CamelFileName}&fileExist=Append");

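This works as intended, except that the parent tags (header and footer, if you will) are added for every node. Using only the node name in tokenizeXML returns just the node itself, but then I can't figure out how to add the header and footer. Ideally I would stream the parent tags into header and footer properties and add them before and after the split.

How can I do this? Do I need to tokenize the parent tags first, and would that mean streaming the file twice?

Finally, you may notice the commented-out aggregation at the end. I don't want to aggregate every node before writing to the file, since that defeats the purpose of streaming and of keeping the whole file out of memory. Still, I figure I could gain some performance by aggregating a number of nodes before each write, reducing the cost of hitting the drive once per node. I'm not sure whether that makes sense.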

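# Answer 1

I couldn't get it to work with Camel. Or rather, once I was using plain Java to extract the header, I already had everything I needed to keep going, and doing the split there and then switching back to Camel seemed cumbersome. There are most likely ways to improve on this, but this is my solution for splitting the XML payload.

Switching between the two types of output writers (the XMLEventWriter and the underlying FileWriter) isn't great, but it simplifies everything else. Also note that I chose equalsIgnoreCase to check tag names even though XML is normally case sensitive; for me this reduces the risk of errors. Finally, as with any ordinary String regex match, make sure the regex matches the entire string by using wildcards.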
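As a side note on that last point, here is a minimal sketch of what "matching the entire string" means for `String.matches`. The class and method names (`FullMatchDemo`, `idStartsWith`) are made up for illustration and are not part of the solution below. It assumes the payload element has been serialized to a string that may contain line breaks, which is why the `(?s)` flag is added.

```java
import java.util.regex.Pattern;

final class FullMatchDemo {

    /**
     * Hypothetical helper: true when the serialized payload contains an
     * {@code <id>} element whose text starts with the given prefix.
     */
    static boolean idStartsWith(String payloadXml, String prefix) {
        // String.matches only succeeds if the regex covers the ENTIRE input,
        // so the actual condition is wrapped in ".*" on both sides.
        // (?s) makes "." match line terminators too, which matters when the
        // payload element is pretty-printed over several lines.
        String regex = "(?s).*<id>" + Pattern.quote(prefix) + ".*";
        return payloadXml.matches(regex);
    }
}
```

With a pretty-printed `<order>` element, a regex such as `.*<id>1.*` without `(?s)` would not match, because `.` stops at line breaks by default.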

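import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringWriter;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Note: `logger` used below is assumed to be a class-level logger field; its declaration is not shown here.

/**
 * Splits an XML file's payload into two new files based on a regex condition. The payload is a specific XML tag in the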
 * input file that is repeated a number of times. All tags before and after the payload are added to both files in order
 * to keep the same structure.
 * 
 * The content of each payload tag is compared to the regex condition and if true, it is added to the primary output file.
 * Otherwise it is added to the secondary output file. The payload can be empty and an empty payload tag will be added to
 * the secondary output file. Note that the output will not be an unaltered copy of the input as self-closing XML tags are
 * altered to corresponding opening and closing tags.
 * 
 * Data is streamed from the input file to the output files, keeping memory usage small even with large files.
 * 
 * @param inputFilename Path and filename for the input XML file
 * @param outputFilenamePrimary Path and filename for the primary output file
 * @param outputFilenameSecondary Path and filename for the secondary output file
 * @param payloadTag XML tag name of the payload
 * @param payloadParentTag XML tag name of the payload's direct parent
 * @param splitRegex The regex split condition used on the payload content
 * @throws Exception On invalid filenames, missing input, incorrect XML structure, etc.
 */
public static void splitXMLPayload(String inputFilename, String outputFilenamePrimary, String outputFilenameSecondary, String payloadTag, String payloadParentTag, String splitRegex) throws Exception {

    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newInstance();
    XMLEventReader xmlEventReader = null;
    FileInputStream fileInputStream = null;
    FileWriter fileWriterPrimary = null;
    FileWriter fileWriterSecondary = null;
    XMLEventWriter xmlEventWriterSplitPrimary = null;
    XMLEventWriter xmlEventWriterSplitSecondary = null;

    try {
        fileInputStream = new FileInputStream(inputFilename);
        xmlEventReader = xmlInputFactory.createXMLEventReader(fileInputStream);

        fileWriterPrimary = new FileWriter(outputFilenamePrimary);
        fileWriterSecondary = new FileWriter(outputFilenameSecondary);
        xmlEventWriterSplitPrimary = xmlOutputFactory.createXMLEventWriter(fileWriterPrimary);
        xmlEventWriterSplitSecondary = xmlOutputFactory.createXMLEventWriter(fileWriterSecondary);

        boolean isStart = true;
        boolean isEnd = false;
        boolean lastSplitIsPrimary = true;

        while (xmlEventReader.hasNext()) {
            XMLEvent xmlEvent = xmlEventReader.nextEvent();

            // Check for start of payload element
            if (!isEnd && xmlEvent.isStartElement()) {
                StartElement startElement = xmlEvent.asStartElement();
                if (startElement.getName().getLocalPart().equalsIgnoreCase(payloadTag)) {
                    if (isStart) {
                        isStart = false;
                        // Flush the event writers as we'll use the file writers for the payload
                        xmlEventWriterSplitPrimary.flush();
                        xmlEventWriterSplitSecondary.flush();
                    }

                    String order = getTagAsString(xmlEventReader, xmlEvent, payloadTag, xmlOutputFactory);
                    if (order.matches(splitRegex)) {
                        lastSplitIsPrimary = true;
                        fileWriterPrimary.write(order);
                    } else {
                        lastSplitIsPrimary = false;
                        fileWriterSecondary.write(order);
                    }
                }
            }
            // Check for end of parent tag
            else if (!isStart && !isEnd && xmlEvent.isEndElement()) {
                EndElement endElement = xmlEvent.asEndElement();
                if (endElement.getName().getLocalPart().equalsIgnoreCase(payloadParentTag)) {
                    isEnd = true;
                }
            }
            // Is neither start or end and we're handling payload (most often white space)
            else if (!isStart && !isEnd) {
                // Add to last split handled
                if (lastSplitIsPrimary) {
                    xmlEventWriterSplitPrimary.add(xmlEvent);
                    xmlEventWriterSplitPrimary.flush();
                } else {
                    xmlEventWriterSplitSecondary.add(xmlEvent);
                    xmlEventWriterSplitSecondary.flush();
                }
            }

            // Start and end is added to both files
            if (isStart || isEnd) {
                xmlEventWriterSplitPrimary.add(xmlEvent);
                xmlEventWriterSplitSecondary.add(xmlEvent);
            }
        }

    } catch (Exception e) {
        logger.error("Error in XML split", e);
        throw e;
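    } finally {
        // Close the streams; any of them may still be null if setup failed early
        if (xmlEventReader != null) {
            try {
                xmlEventReader.close();
            } catch (XMLStreamException e) {
                // ignore
            }
        }
        // XMLEventReader.close() does not close the underlying input stream
        if (fileInputStream != null) {
            try {
                fileInputStream.close();
            } catch (IOException e) {
                // ignore
            }
        }
        if (xmlEventWriterSplitPrimary != null) {
            try {
                xmlEventWriterSplitPrimary.close();
            } catch (XMLStreamException e) {
                // ignore
            }
        }
        if (xmlEventWriterSplitSecondary != null) {
            try {
                xmlEventWriterSplitSecondary.close();
            } catch (XMLStreamException e) {
                // ignore
            }
        }
        if (fileWriterPrimary != null) {
            try {
                fileWriterPrimary.close();
            } catch (IOException e) {
                // ignore
            }
        }
        if (fileWriterSecondary != null) {
            try {
                fileWriterSecondary.close();
            } catch (IOException e) {
                // ignore
            }
        }
    }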
}

/**
 * Loops through the events in the {@code XMLEventReader} until the specific XML end tag is found and returns everything
 * contained within the XML tag as a String.
 * 
 * Data is streamed from the {@code XMLEventReader}, however the String can be large depending on the number of children
 * in the XML tag.
 * 
 * @param xmlEventReader The already active reader. The starting tag event is assumed to have already been read
 * @param startEvent The starting XML tag event already read from the {@code XMLEventReader}
 * @param tag The XML tag name used to find the starting XML tag
 * @param xmlOutputFactory Convenience include to avoid creating another factory
 * @return String containing everything between the starting and ending XML tag, the tags themselves included
 * @throws Exception On incorrect XML structure
 */
private static String getTagAsString(XMLEventReader xmlEventReader, XMLEvent startEvent, String tag, XMLOutputFactory xmlOutputFactory) throws Exception {
    StringWriter stringWriter = new StringWriter();
    XMLEventWriter xmlEventWriter = xmlOutputFactory.createXMLEventWriter(stringWriter);

    // Add the start tag
    xmlEventWriter.add(startEvent);

    // Add until end tag
    while (xmlEventReader.hasNext()) {
        XMLEvent xmlEvent = xmlEventReader.nextEvent();

        // End tag found
        if (xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().getLocalPart().equalsIgnoreCase(tag)) {
            xmlEventWriter.add(xmlEvent);
            xmlEventWriter.close();
            stringWriter.close();

            return stringWriter.toString();
        } else {
            xmlEventWriter.add(xmlEvent);
        }
    }

    xmlEventWriter.close();
    stringWriter.close();
    throw new Exception("Invalid XML, no closing tag for <" + tag + "> found!");
}
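
For trying the idea out without touching the file system, the following is a condensed, in-memory sketch of the same single-pass approach. It is a variation and not the code above: it routes everything through `XMLEventWriter` instead of switching to the raw writer, it checks the text of an id element with `startsWith` instead of applying a regex to the serialized tag, and it writes every event outside a payload element (including the whitespace between payload elements) to both outputs. The names `XmlSplitSketch`, `split`, `idTag` and `idPrefix` are made up for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class XmlSplitSketch {

    /** Returns {primary, secondary}: payloads whose id starts with idPrefix go to primary. */
    static String[] split(String xml, String payloadTag, String idTag, String idPrefix)
            throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        StringWriter primaryOut = new StringWriter();
        StringWriter secondaryOut = new StringWriter();
        XMLEventWriter primary = outputFactory.createXMLEventWriter(primaryOut);
        XMLEventWriter secondary = outputFactory.createXMLEventWriter(secondaryOut);

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (isStart(event, payloadTag)) {
                // Buffer the events of this ONE payload element only, so memory
                // use is bounded by the largest payload, not by the whole file.
                List<XMLEvent> payload = new ArrayList<>();
                payload.add(event);
                StringBuilder id = new StringBuilder();
                boolean inId = false;
                while (reader.hasNext()) {
                    XMLEvent inner = reader.nextEvent();
                    payload.add(inner);
                    if (isStart(inner, idTag)) {
                        inId = true;
                    } else if (isEnd(inner, idTag)) {
                        inId = false;
                    } else if (inId && inner.isCharacters()) {
                        id.append(inner.asCharacters().getData());
                    } else if (isEnd(inner, payloadTag)) {
                        break;
                    }
                }
                // Route the complete payload element to exactly one output.
                XMLEventWriter target =
                        id.toString().trim().startsWith(idPrefix) ? primary : secondary;
                for (XMLEvent buffered : payload) {
                    target.add(buffered);
                }
            } else {
                // Header, parent tags and footer are copied to both outputs.
                primary.add(event);
                secondary.add(event);
            }
        }
        primary.close();
        secondary.close();
        reader.close();
        return new String[] { primaryOut.toString(), secondaryOut.toString() };
    }

    private static boolean isStart(XMLEvent event, String tag) {
        return event.isStartElement()
                && event.asStartElement().getName().getLocalPart().equals(tag);
    }

    private static boolean isEnd(XMLEvent event, String tag) {
        return event.isEndElement()
                && event.asEndElement().getName().getLocalPart().equals(tag);
    }
}
```

Calling `XmlSplitSketch.split(xml, "order", "id", "1")` on the prototype document from the question should put orders 11 and 12 in the first string and order 20 in the second, with the header present in both. For real files, the `StringReader` and `StringWriter` would be swapped for file-based streams, as in the method above. Because whitespace between payload elements is copied to both outputs, the results may contain extra blank lines.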
