Web scraping with Java and Jsoup

I am trying to get output from this code in the following form:

title: Ben 10 Ultimate Alien

comment:taseen_shafquattaseen_shafquat : is there go na a season 4 for this series

title: Akira

comment: dragon3476dragon3476 : one of my most fav animations excellent bit o work and about my 300th watch , i still got the orginal poster from when it came out + dvd and vid and even the t-shirt so yeah i couldn't say anything bad about such a great animation 5/5

However, this is what I get:

title: Ben 10 Ultimate Alien

title: taseen_shafquattaseen_shafquat : is there go na a season 4 for this series

title: Akira

title: dragon3476dragon3476 : one of my most fav animations excellent bit o work and about my 300th watch , i still got the orginal poster from when it came out + dvd and vid and even the t-shirt so yeah i couldn't say anything bad about such a great animation 5/5

Code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

import java.io.*;
import java.util.List;

public class WebScraper {

    public static void main(String[] args) throws Exception {
        String url = "http://www.1channel.ch/latest_comments.php";
        Document doc = Jsoup.connect(url).get();
        for (Element E : doc.select("div.latest_comments > a, div.latest_comments > p")) {

            System.out.print("title: " + E.getElementsByTag("a").text());
            System.out.println(E.getElementsByTag("p").text());
            //    System.out.println(T);
            System.out.print("\n");

            try {
                PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("/Users/samualdoku/Desktop/Twitter/scraped.txt", true)));
                out.println(E.text());
                out.close();
            } catch (IOException e) {
            }
        }

    }

}

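This is the HTML I am trying to scrape. I think the problem is the span tag that wraps a link of its own: it holds the commenter's username. I called getElementsByTag("a") for the title because the title is in an anchor tag. How do I exclude the span tag? As it stands, the username is printed with the "title:" prefix, which it should not be.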

 <div class="latest_comments com_class_tv">
    <a href="/tv-2733767-Dallas/season-1-episode-3">Dallas</a>
    ( 6 minutes ago )
    <p>
        <span class="latest_comments_poster">
          <a href="/profile/jowar">jowar</a>
          :
        </span>
        i just started watchin...eeing as its 34nyrs old
    </p>
</div>

1 answer

  1. Answer #1

    Try this:

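    public static void main(String[] args) throws Exception {
        String url = "http://www.1channel.ch/latest_comments.php";
        Document doc = Jsoup.connect(url).get();
        // Iterate over each comment block instead of over its <a> and <p> children.
        for (Element E : doc.select("div.latest_comments")) {
            // "> a" matches only the direct-child title link, not the profile link inside <p>.
            System.out.println("title: " + E.select("> a").text());
            System.out.println("comment: " + E.select("p").text());
        }
    }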
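
To check the selectors without depending on the live site, the sketch below parses the HTML fragment from the question as an inline string. The class name `CommentExtractor`, the method `extract`, and the constant `SAMPLE_HTML` are made up for illustration, and the markup is assumed to match the fragment shown in the question; the real page may differ. It selects each `div.latest_comments` block, takes the title from the direct-child link (`> a`), and builds the comment from the poster link plus the paragraph's own text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class CommentExtractor {

    // Assumption: markup copied from the question so the sketch runs offline.
    static final String SAMPLE_HTML =
            "<div class=\"latest_comments com_class_tv\">"
          + "<a href=\"/tv-2733767-Dallas/season-1-episode-3\">Dallas</a>"
          + " ( 6 minutes ago ) "
          + "<p><span class=\"latest_comments_poster\">"
          + "<a href=\"/profile/jowar\">jowar</a> : </span>"
          + "i just started watchin...eeing as its 34nyrs old</p>"
          + "</div>";

    // Returns one "title: ..." line and one "comment: ..." line per comment block.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element block : doc.select("div.latest_comments")) {
            // Direct children only, so the profile link inside <p> is not picked up as a title.
            Element titleLink = block.select("> a").first();
            Element paragraph = block.select("> p").first();
            if (titleLink == null || paragraph == null) {
                continue; // skip blocks that do not have the expected shape
            }
            Element poster = paragraph.select("span.latest_comments_poster a").first();
            String user = (poster != null) ? poster.text() : "";
            // ownText() returns only the text directly inside <p>, excluding the <span>.
            String comment = paragraph.ownText();
            lines.add("title: " + titleLink.text());
            lines.add("comment: " + user + " : " + comment);
        }
        return lines;
    }

    public static void main(String[] args) {
        extract(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Separating the username from the comment text this way is a design choice rather than a requirement: calling `text()` on the whole `<p>` also yields the "user : comment" form, but keeping the parts separate makes it easier to change the output format later.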