How can I modify my Java code to parse article titles, previews, and URLs from a Google News search?


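I want to parse a Google News search for: 1) the article title, 2) the preview, 3) the URL.

To achieve this, I need to change the part of my code that matches the page structure: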

Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");

Mainly this part:

( ".g>.r>.a")

How should I modify it?

Full code:

public static void main(String[] args) throws UnsupportedEncodingException, IOException {
    String google = "http://www.google.com/search?q=";
    String search = "stackoverflow";
    String charset = "UTF-8";
    String news = "&tbm=nws";
    // Change this to your company's name and bot homepage!
    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)";

    Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset) + news)
            .userAgent(userAgent)
            .get()
            .select(".g>.r>.a");

    for (Element link : links) {
        String title = link.text();
        // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        String url = link.absUrl("href");
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");

        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }

        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}
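
The substring call in the code above relies on the target being the first query parameter and on an `&` following it; if either assumption fails, it throws or returns the wrong text. A slightly more defensive way to unwrap the redirect might look like the sketch below. It assumes the `/url?q=<url>&sa=U&ei=<someKey>` format mentioned in the code comment; the class and method names (`RedirectUrl`, `extractTarget`) are made up for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class RedirectUrl {

    /**
     * Returns the decoded value of the "q" query parameter,
     * or null when the link has no query string or no "q" parameter.
     */
    static String extractTarget(String href) {
        int queryStart = href.indexOf('?');
        if (queryStart < 0) {
            return null; // plain link, nothing to unwrap
        }
        for (String pair : href.substring(queryStart + 1).split("&")) {
            if (pair.startsWith("q=")) {
                try {
                    return URLDecoder.decode(pair.substring(2), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    // UTF-8 is required on every JVM, so this should not happen
                    throw new IllegalStateException(e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String href = "http://www.google.com/url?q=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1&sa=U&ei=abc";
        System.out.println(extractTarget(href));
    }
}
```

Returning null for links without a `q` parameter lets the caller skip them, which plays the same role as the `startsWith("http")` check in the question.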

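# Answer 1

How to pick the right elements (using Chrome)

Step one: disable JavaScript in your browser (an add-on such as uMatrix makes this convenient) so that you see the same result Jsoup receives.

Now right-click an element and choose Inspect, or press Ctrl+Shift+I to open the developer tools. When you hover over the source in the Elements tab, the corresponding element is highlighted in the rendered page. Right-clicking an element in the source offers Copy -> Copy selector. This is a good starting point, but the result is sometimes too strict. Here it yields the selector #rso > div:nth-child(3), i.e. the third direct child div of the element with id rso. That is too specific, so we generalize:

We select all direct child divs of the element with id rso: #rso > div

Then we grab the title anchor with h3 > a; its text node and its href attribute give the result's title and URL.

Next, we get the inner div with class st (div.st), whose text node contains the preview. If that div is missing, we skip the element.

By passing the parameters with .data("key","value") on the request, we don't have to URL-encode them manually.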
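
For comparison, this is roughly the manual work that .data() takes off your hands. The sketch uses only JDK classes and is not Jsoup's actual implementation; `QueryString` and `build` are hypothetical names, and the search term is just sample input.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryString {

    /** Appends the URL-encoded parameters to a base URL that already ends with '?'. */
    static String build(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        boolean first = true;
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (!first) {
                sb.append('&');
            }
            first = false;
            sb.append(encode(entry.getKey())).append('=').append(encode(entry.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("q", "stack overflow news");
        params.put("tbm", "nws");
        params.put("start", "10");
        System.out.println(build("https://www.google.com/search?", params));
        // prints https://www.google.com/search?q=stack+overflow+news&tbm=nws&start=10
    }
}
```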

Example code

String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
String searchTerm = "stackoverflow";
int numberOfResultpages = 2; // grabs first two pages of search results
String searchUrl = "https://www.google.com/search?";

Document doc;

for (int i = 0; i < numberOfResultpages; i++) {

    try {
        doc = Jsoup.connect(searchUrl)
                .userAgent(userAgent)
                .data("q", searchTerm)
                .data("tbm", "nws")
                .data("start", String.valueOf(i * 10)) // "start" is a result offset, not a page number; assumes 10 results per page
                .method(Method.GET)
                .referrer("https://www.google.com/").get();

        for (Element result : doc.select("#rso > div")) {

            if(result.select("div.st").size()==0) continue;

            Element h3a = result.select("h3 > a").first();
            if (h3a == null) continue; // no title anchor in this block, so skip it

            String title = h3a.text();
            String url = h3a.attr("href");
            String preview = result.select("div.st").first().text();

            // just printing out title, link and preview to demonstrate the approach
            System.out.println(title + " -> " + url + "\n\t" + preview);
        }

    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Output

Stack Overflow: Movie Magic -> https://geekdad.com/2016/09/stack-overflow-movie-magic-2/
    I got to visit the set of Kubo and the Two Strings and see some of the amazing work that went into creating the film. But well before the ...
Will StackOverflow Documentation Realize Its Lofty Goal? -> https://dzone.com/articles/will-stackoverflow-documentation-realize-its-lofty
    With the StackOverflow Documentation project now in beta, how close is it to realizing the lofty goals it has set forth for itself? Can it ever ...
Stack Overflow: Progress Report -> https://geekdad.com/2016/09/stack-overflow-progress-report/
    Of the books on my list, the only one I totally finished so far is Kidding Ourselves, which I included in this Stack Overflow. And that perhaps is an ...
....
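
Because Google's markup changes over time, it can help to try the selectors offline before sending any request. The sketch below runs the same three selectors (#rso > div, h3 > a, div.st) against a small hand-written fixture. The HTML is an assumption that only imitates the structure described in this answer, not real Google output, and `SelectorCheck` is a hypothetical name. It needs the Jsoup library on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SelectorCheck {

    // Hand-written fixture (assumption): two results with a preview, one without.
    static final String HTML =
            "<div id=\"rso\">"
          + "<div><h3><a href=\"https://example.com/a\">First title</a></h3>"
          + "<div class=\"st\">First preview</div></div>"
          + "<div><h3><a href=\"https://example.com/b\">No preview here</a></h3></div>"
          + "<div><h3><a href=\"https://example.com/c\">Third title</a></h3>"
          + "<div class=\"st\">Third preview</div></div>"
          + "</div>";

    /** Returns one "title -> url | preview" line per result block that has all three parts. */
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element result : doc.select("#rso > div")) {
            Element anchor = result.select("h3 > a").first();
            Element preview = result.select("div.st").first();
            if (anchor == null || preview == null) {
                continue; // incomplete block, skip it
            }
            lines.add(anchor.text() + " -> " + anchor.attr("href") + " | " + preview.text());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : extract(HTML)) {
            System.out.println(line);
        }
    }
}
```

If the live page stops returning results, saving its HTML and feeding it to `extract` in the same way shows quickly which of the three selectors no longer matches.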
