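Extracting a substring between XML tags in a Java string with regex

I have a large string that is a representation of an XML document. I am trying to extract the data of one node, like this: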

        String textToExtract = "<FnAnno>\r\n" + 
                "   <PropDesc F_ANNOTATEDID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_BACKCOLOR=\"0\" F_BORDER_BACKMODE=\"2\" F_BORDER_COLOR=\"0\" F_BORDER_STYLE=\"0\" F_BORDER_WIDTH=\"1\" F_CLASSID=\"{5CF11941-018F-11D0-A87A-00A0246922A5}\" F_CLASSNAME=\"Text\" F_CREATOR=\"req92333\" F_ENTRYDATE=\"2018-06-19T13:15:43.0000000-05:00\" F_FONT_BOLD=\"true\" F_FONT_ITALIC=\"false\" F_FONT_NAME=\"arial\" F_FONT_SIZE=\"12\" F_FONT_STRIKETHROUGH=\"false\" F_FONT_UNDERLINE=\"false\" F_FORECOLOR=\"0\" F_HASBORDER=\"true\" F_HEIGHT=\"0\" F_ID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_LEFT=\"3.430379746835443\" F_MODIFYDATE=\"2018-06-19T13:15:49.0000000-05:00\" F_MULTIPAGETIFFPAGENUMBER=\"1\" F_NAME=\"-1-1\" F_PAGENUMBER=\"1\" F_TEXT_BACKMODE=\"2\" F_TOOLTIP=\"0043007200650061007400650064002000420079003A002000720065007100390032003300330033002C0020002000430072006500610074006500640020004F006E003A002000320030003100380020004A0075006E0065002000310039002C002000310033003A00310035003A00340033002C0020005500540043002D0035\" F_TOOLTIPTRANSFERENCODING=\"hex\" F_TOP=\"1.3291139240506329\" F_WIDTH=\"0\">\r\n" + 
                "       <F_CUSTOM_BYTES/>\r\n" + 
                "       <F_POINTS/>\r\n" + 
                "       <F_TEXT Encoding=\"unicode\">005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029</F_TEXT>\r\n" + 
                "   </PropDesc>\r\n" + 
                "</FnAnno>";
        String extractedString = textToExtract
                .substring(textToExtract.indexOf("=\"unicode\">"), textToExtract.indexOf("</F_TEXT>"))
                .replaceFirst("=\"unicode\">", "");

The result is 005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029

To be more efficient, I would like to use Pattern and Matcher to extract the substring. Here is the code I am struggling with:

    Pattern pattern = Pattern.compile("\\bEncoding=.*?\\.*F_TEXT\\b");
    Matcher matcher = pattern.matcher(textToExtract);
    while (matcher.find()){
        extractedString = (matcher.group());
    }   

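The result of the above starts with Encoding="unicode">005400680069007, which I would have to trim again.

How can I get only the data between <F_TEXT Encoding="unicode"> and </F_TEXT>? I had trouble with regular expressions when I learned them in school, and I still have trouble with them at work :( I guess I need more practice.

Thanks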


2 answers

  1. # Answer 1

    Don't use regex to parse XML. Use an XML parser.

    To be "efficient", use SAX, for example:

    String textToExtract = "<FnAnno>\r\n" + 
                           "   <PropDesc F_ANNOTATEDID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_BACKCOLOR=\"0\" F_BORDER_BACKMODE=\"2\" F_BORDER_COLOR=\"0\" F_BORDER_STYLE=\"0\" F_BORDER_WIDTH=\"1\" F_CLASSID=\"{5CF11941-018F-11D0-A87A-00A0246922A5}\" F_CLASSNAME=\"Text\" F_CREATOR=\"req92333\" F_ENTRYDATE=\"2018-06-19T13:15:43.0000000-05:00\" F_FONT_BOLD=\"true\" F_FONT_ITALIC=\"false\" F_FONT_NAME=\"arial\" F_FONT_SIZE=\"12\" F_FONT_STRIKETHROUGH=\"false\" F_FONT_UNDERLINE=\"false\" F_FORECOLOR=\"0\" F_HASBORDER=\"true\" F_HEIGHT=\"0\" F_ID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_LEFT=\"3.430379746835443\" F_MODIFYDATE=\"2018-06-19T13:15:49.0000000-05:00\" F_MULTIPAGETIFFPAGENUMBER=\"1\" F_NAME=\"-1-1\" F_PAGENUMBER=\"1\" F_TEXT_BACKMODE=\"2\" F_TOOLTIP=\"0043007200650061007400650064002000420079003A002000720065007100390032003300330033002C0020002000430072006500610074006500640020004F006E003A002000320030003100380020004A0075006E0065002000310039002C002000310033003A00310035003A00340033002C0020005500540043002D0035\" F_TOOLTIPTRANSFERENCODING=\"hex\" F_TOP=\"1.3291139240506329\" F_WIDTH=\"0\">\r\n" + 
                           "       <F_CUSTOM_BYTES/>\r\n" + 
                           "       <F_POINTS/>\r\n" + 
                           "       <F_TEXT Encoding=\"unicode\">005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029</F_TEXT>\r\n" + 
                           "   </PropDesc>\r\n" + 
                           "</FnAnno>";
    
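    // Required imports: java.io.StringReader, javax.xml.parsers.SAXParser,
    // javax.xml.parsers.SAXParserFactory, org.xml.sax.Attributes, org.xml.sax.InputSource,
    // org.xml.sax.SAXException, org.xml.sax.helpers.DefaultHandler

    // Collects the character data found inside <F_TEXT>
    StringBuilder buf = new StringBuilder();
    
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser parser = factory.newSAXParser();
    parser.parse(new InputSource(new StringReader(textToExtract)), new DefaultHandler() {
        // true only while the parser is inside an <F_TEXT> element
        private boolean captureText;
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            this.captureText = qName.equals("F_TEXT");
        }
        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            this.captureText = false;
        }
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // may be called more than once for a single text node, so append
            if (this.captureText)
                buf.append(ch, start, length);
        }
    });
    
    System.out.println(buf.toString());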
    

    Output

    005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029
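
    If brevity matters more than raw speed, the same value can probably also be read with the JDK's XPath API. This is only a sketch and not part of the original answer; the class name `XPathExtract`, the method name `extractFText`, and the shortened sample XML are assumptions made for illustration.

    ```java
    import java.io.StringReader;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathExpressionException;
    import javax.xml.xpath.XPathFactory;
    import org.xml.sax.InputSource;

    public class XPathExtract {
        // Hypothetical helper: returns the text of the first <F_TEXT> element,
        // or an empty string if the document has no such element.
        static String extractFText(String xml) throws XPathExpressionException {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, InputSource) parses the input and returns the
            // string value of the first node selected by the expression
            return xpath.evaluate("//F_TEXT", new InputSource(new StringReader(xml)));
        }

        public static void main(String[] args) throws Exception {
            // Shortened stand-in for the XML in the question
            String xml = "<FnAnno><PropDesc F_NAME=\"-1-1\">"
                       + "<F_POINTS/>"
                       + "<F_TEXT Encoding=\"unicode\">00540068</F_TEXT>"
                       + "</PropDesc></FnAnno>";
            System.out.println(extractFText(xml)); // prints 00540068
        }
    }
    ```

    XPath builds a full in-memory tree before evaluating, so for very large documents the SAX handler above is likely the cheaper option.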
    
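  2. # Answer 2

    If you will always be retrieving the data between the same XML tags, you don't need to bother parsing it into a data structure. You have the right idea. If speed is what you are after, just grab the text between the tags you know will be there.

    However, your approach wastes some cycles: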

    textToExtract.substring(textToExtract.indexOf("=\"unicode\">"),textToExtract.indexOf("</F_TEXT>")).replaceFirst("=\"unicode\">", "");

    Let's break it down:

    // loops through the array until "=\"unicode\">" is found
    int startIndex = textToExtract.indexOf("=\"unicode\">");
    // loops through the array again, until "</F_TEXT>" is found
    int endIndex = textToExtract.indexOf("</F_TEXT>");
    //loop through the array, copying the bytes to a new array to form a new String
    String substr = textToExtract.substring(startIndex,endIndex);
    //loop through the array to find and replace "=\"unicode\">" with nothing
    String data = substr.replaceFirst("=\"unicode\">", "");
    

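    You are looping through the same array over and over.

    Once you know where the start point is, there is no need to search from the beginning again. Instead, continue looking from that start point. Then, once you have the start and end of the substring, you can simply take it: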

    // we know what precedes the substring we want
    String anchor = "<F_TEXT Encoding=\"unicode\">";
    // so we use it to get the start point, looping once, up to that point
    int start = textToExtract.indexOf(anchor)+anchor.length();
    // we know the end point won't be before the start point, so start where it left off
    int end = start;
    // count each character from that point until the next XML tag starts
    while (textToExtract.charAt(end) != '<') { end++; }
    // now we have what we need to simply get the substring
    String data = textToExtract.substring(start,end);
    

    This improves performance by roughly 60%.
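
    The manual scan above assumes that both tags are always present. A slightly more defensive sketch of the same idea uses `String.indexOf(String, int)`, which starts searching at a given index, so the second search still does not rescan the prefix. The class name `TagExtract`, the method name `extractFText`, and the choice to return `null` when a tag is missing are assumptions for illustration, not part of the original answer.

    ```java
    public class TagExtract {
        private static final String OPEN = "<F_TEXT Encoding=\"unicode\">";
        private static final String CLOSE = "</F_TEXT>";

        // Hypothetical helper: returns the text between OPEN and CLOSE,
        // or null if either tag cannot be found.
        static String extractFText(String xml) {
            int open = xml.indexOf(OPEN);
            if (open < 0) {
                return null; // opening tag not present
            }
            int start = open + OPEN.length();
            // search for the closing tag from 'start' onward, not from index 0
            int end = xml.indexOf(CLOSE, start);
            if (end < 0) {
                return null; // closing tag not present
            }
            return xml.substring(start, end);
        }

        public static void main(String[] args) {
            String xml = "<PropDesc>\r\n  <F_TEXT Encoding=\"unicode\">00540068</F_TEXT>\r\n</PropDesc>";
            System.out.println(extractFText(xml)); // prints 00540068
        }
    }
    ```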

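    Edit: for completeness, let's talk about regex.

    Regex is great, and it is fun to script with, but it is inefficient for something like this. If you can avoid using regex, do so. I tend to use it only for "quick and dirty" work, quick in terms of coding time, not execution time. Learn how regex engines work. It is really interesting, and you will see why regex is a last resort.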

        /* this pattern will look for the XML tag.
        ** then, it will match [^>]+
        ** [...] will match a single character that matches SOMETHING inside the "character class."
        ** [^...] will match a single character that is NOT something inside the character class.
        ** [^>]+ will match as many characters as it can that do not match '>'
        ** putting this expression inside parentheses tells the engine we want to capture it to be referenced later.
        ** '<' at the end just ensures we capture up until that point.
        */
        // create the pattern
        Pattern pattern = Pattern.compile("<F_TEXT Encoding=\"unicode\">([^>]+)<");
        // get a matcher for it
        Matcher matcher = pattern.matcher(textToExtract);
        // if we find a match
        if (matcher.find()) {
            // we can use group(1) to refer to our first capture group
            // group(0) will always return the full string matched, but we don't want the tags.
            String data= matcher.group(1);
    
        }
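
    As a self-contained sketch of the regex approach, the variant below compiles the pattern once into a static field, which may help if the extraction runs many times. The class name `RegexExtract`, the method name `extractFText`, the use of `[^<]*` with an explicit closing tag, and returning `null` when nothing matches are assumptions for illustration rather than the code from this answer.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexExtract {
        // Compiled once and reused; group 1 is everything up to the next '<'
        private static final Pattern F_TEXT =
                Pattern.compile("<F_TEXT Encoding=\"unicode\">([^<]*)</F_TEXT>");

        // Hypothetical helper: returns the captured text, or null if there is no match.
        static String extractFText(String xml) {
            Matcher matcher = F_TEXT.matcher(xml);
            return matcher.find() ? matcher.group(1) : null;
        }

        public static void main(String[] args) {
            String xml = "<PropDesc>\r\n  <F_TEXT Encoding=\"unicode\">00540068</F_TEXT>\r\n</PropDesc>";
            System.out.println(extractFText(xml)); // prints 00540068
        }
    }
    ```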