HtmlUnit (Java): how do I get content rendered by JavaScript? Also an HtmlUnit error

This is the page I want to scrape: https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads

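I want to scrape the comment text under "ulasan terbaru". I think it is generated by JavaScript (I may be wrong; I'm not entirely sure how to verify that with Inspect Element). Beyond that, there are a few things about HtmlUnit that I'm unsure of.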

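I have read that getting JavaScript-generated content requires HtmlUnit rather than Jsoup. I have also read http://htmlunit.10904.n7.nabble.com/Selecting-a-div-by-class-name-td25787.html and tried to scrape the comment div by its class, but I get no output.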

    public static void comment(String url) throws IOException{

        WebClient client = new WebClient();
        client.setCssEnabled(true);
        client.setJavaScriptEnabled(true);
        
        try {
            HtmlPage page = client.getPage(url);
            List<?> date = page.getByXPath("//div/@class='list-box-comment'");
            System.out.println(date.size());
            for(int i =0 ; i<date.size();i++){
                System.out.println(date.get(i).asText());
            }
        }
        catch(Exception e){
                e.printStackTrace();
            }

    }

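This is the part of my code that handles comment scraping. Am I doing it right? I have two problems:

  1. On "asText()", the IDE reports "Cannot resolve method asText()".
  2. Even when I run it without "asText()", I get this error: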
com.gargoylesoftware.htmlunit.ObjectInstantiationException: unable to create HTML parser
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:418)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:342)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:203)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at ReviewScraping.comment(ReviewScraping.java:86)
    at ReviewScraping.main(ReviewScraping.java:108)
Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe' is not recognized.
    at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:411)
    ... 11 more

I would like to be able to print all of the comments.

/edit: I use IntelliJ as my IDE, and I used Maven to include the HtmlUnit dependency in my IntelliJ project structure.


1 answer

  1. # Answer 1

    Regarding your code:

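    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNode;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    import java.io.IOException;
    import java.util.List;

    public class ReviewScraping {
        public static void main(String[] args) throws IOException {
            final String url = "https://www.tokopedia.com/berkahcell2/promo-termurah-vr-virtual-reality-box-v-2-0-remote-bluetooth-gamepad/review?src=topads";

            // WebClient is AutoCloseable, so try-with-resources releases it.
            try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
                // Keep going when the page's own scripts throw errors.
                webClient.getOptions().setThrowExceptionOnScriptError(false);

                HtmlPage page = webClient.getPage(url);
                // Give background JavaScript up to 40 seconds to finish.
                webClient.waitForBackgroundJavaScript(40_000);

                // Dump the resulting DOM to see what HtmlUnit actually rendered.
                System.out.println(page.asXml());

                // A predicate ([...]) selects the matching div elements.
                List<DomNode> comments = page.getByXPath("//div[@class='list-box-comment']");
                System.out.println(comments.size());

                for (DomNode comment : comments) {
                    System.out.println(comment.asText());
                }
            }
        }
    }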
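
The two XPath expressions in this thread differ in kind, and that can be checked without HtmlUnit. In XPath 1.0, "//div/@class='list-box-comment'" is a comparison that yields a boolean, while "//div[@class='list-box-comment']" uses a predicate and yields the matching div nodes. The following is a minimal sketch using the JDK's javax.xml.xpath. The markup and the class name XPathPredicateDemo are made up for illustration and are not taken from the real page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPredicateDemo {
    // Hypothetical markup for illustration only; not the real Tokopedia page.
    static final String MARKUP =
            "<html><body>"
            + "<div class='list-box-comment'>First review</div>"
            + "<div class='other'>Not a review</div>"
            + "<div class='list-box-comment'>Second review</div>"
            + "</body></html>";

    // Evaluates an XPath expression against MARKUP with the requested result type.
    private static Object evaluate(String expression, QName resultType) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(MARKUP)), resultType);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    // The expression from the question: a comparison, so the result is a boolean.
    static boolean comparisonResult() {
        return (Boolean) evaluate("//div/@class='list-box-comment'", XPathConstants.BOOLEAN);
    }

    // The expression from the answer: a predicate, so the result is a node set.
    static List<String> commentTexts() {
        NodeList nodes = (NodeList) evaluate("//div[@class='list-box-comment']", XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(comparisonResult());
        for (String text : commentTexts()) {
            System.out.println(text);
        }
    }
}
```

Only the predicate form gives you elements to iterate over. Typing the result list (for example as a list of DomNode in HtmlUnit) rather than as List<?> is also what lets the compiler resolve methods such as asText() on each element.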
    

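    Now, the problem with the page itself:

    I ran a few tests, and the page appears to produce errors in real browsers too (check the browser console). With HtmlUnit you will run into even more problems, probably because some JavaScript features are not supported. Pages like this usually use many, many lines of JS code, so finding out where things go wrong would be very time consuming for me. If you want this fixed, try to find the real cause of the problem (see http://htmlunit.sourceforge.net/submittingJSBugs.html for some hints) and file a bug report.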
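
Separately, the "Feature ... is not recognized" exception in the question's stack trace is thrown while the HTML parser is being set up. One plausible cause (an assumption, not something confirmed in this thread) is that mismatched versions of HtmlUnit and its parser dependencies ended up on the classpath. A quick way to check is to print which jar each suspect class was loaded from. The helper below is a hypothetical sketch that uses only the JDK; the class name ClassOrigin is made up for illustration.

```java
import java.security.CodeSource;

class ClassOrigin {
    // Describes where the named class was loaded from, without running its static initializers.
    static String describe(String className) {
        try {
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return className + " -> (no code source, typically a JDK class)";
            }
            return className + " -> " + source.getLocation();
        } catch (ClassNotFoundException e) {
            return className + " -> (not on the classpath)";
        }
    }

    public static void main(String[] args) {
        // Pass fully qualified class names as arguments, one per suspect class.
        for (String name : args) {
            System.out.println(describe(name));
        }
    }
}
```

Run with the project's classpath, it could be given class names that appear in the stack trace, such as org.apache.xerces.parsers.AbstractSAXParser and com.gargoylesoftware.htmlunit.html.HTMLParser. If the reported jars are not the versions that the HtmlUnit release in your pom.xml expects, aligning them in Maven would be the next thing to try.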