Reading information from a website in C# or Java


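For this project, I want to be able to look at a website, retrieve text from it, and process that information later.

My question is: what is the best way to retrieve data (text) from a website? I am not sure how to do this for static pages versus dynamic pages.

From searching, I found this: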

        WebRequest request = WebRequest.Create("http://anysite.com");
        // If required by the server, set the credentials.
        request.Credentials = CredentialCache.DefaultCredentials;
        // Get the response.
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        // Display the status.
        Console.WriteLine(response.StatusDescription);
        Console.WriteLine();

        // Get the stream containing content returned by the server.
        using (Stream dataStream = response.GetResponseStream())
        {
            // Open the stream using a StreamReader for easy access.
            StreamReader reader = new StreamReader(dataStream, Encoding.UTF8);
            // Read the content. 
            string responseString = reader.ReadToEnd();
            // Display the content.
            Console.WriteLine(responseString);
            reader.Close();
        }

        response.Close();

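Running this program myself, I can see that it returns the HTML code of the website, which is not what I want. Ultimately I would like to be able to enter a site (a news article, for example) and get back the content of the article. Is this possible in C# or Java?

Thanks

4 answers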

# Answer 1

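What you are describing is called web scraping, and there are many libraries for both Java and C# that do it. Whether the target site is static or dynamic does not matter much, since both ultimately output HTML. On the other hand, sites that rely heavily on JavaScript or Flash tend to be problematic, because much of their content is produced in the browser and is not present in the HTML the server returns.

# Answer 2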

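I hate to break it to you, but that is what a web page looks like: one long stream of HTML markup and content. The browser renders it into what you see on screen. The only way I can think of is to parse the HTML yourself.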

After a quick search on Google, I found this Stack Overflow post: What is the best way to parse html in C#?
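Since the question also asks about Java, here is a minimal sketch of what "parsing the HTML yourself" could look like in its crudest form: dropping `<script>` and `<style>` blocks, stripping the remaining tags, and decoding a few entities. The class `HtmlText` and the method `extractText` are hypothetical names made up for this illustration. Regular expressions are not a real HTML parser and will not isolate the article body from menus and footers, so treat this only as a starting point; a dedicated HTML parsing library is the more robust choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only: a crude tag stripper, not a real HTML parser.
final class HtmlText {
    // Matches <script>...</script> and <style>...</style> including their contents.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Matches any remaining tag, such as <p> or </div>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extractText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; "&amp;" goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>News</title><style>p{color:red}</style></head>"
                + "<body><h1>Headline</h1><p>First &amp; second paragraph.</p>"
                + "<script>var x = 1;</script></body></html>";
        System.out.println(extractText(html));
    }
}
```

Note that the page title and every other visible string end up in the result, which is why real article extraction usually needs to target specific elements rather than strip everything.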

I bet you expected this to be a bit easier, but that is the fun of programming: there is always a problem to challenge you.

# Answer 3

You can simply use WebClient:
```
using(var webClient = new WebClient())
{
   string htmlFromPage = webClient.DownloadString("http://myurl.com");
}
```
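In the example above, htmlFromPage will contain the HTML, which you can then parse to find the data you are looking for.

# Answer 4

Try this:
```
System.Net.WebClient wc = new System.Net.WebClient();
// The URL needs a scheme such as http:// to be a valid absolute address.
string webData = wc.DownloadString("http://anysite.com");
```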
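For the Java side of the question, a rough counterpart to `DownloadString` can be sketched with the `java.net.http.HttpClient` API that ships with Java 11 and later. The class `PageDownloader` and its methods are hypothetical names chosen for this example, and the 20-second timeout and the redirect policy are arbitrary assumptions, not recommendations from the answers above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper for illustration: downloads a page body as a string.
final class PageDownloader {
    // Build a GET request; the URL must include a scheme such as http:// or https://.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20)) // assumed value, tune as needed
                .GET()
                .build();
    }

    static String downloadString(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PageDownloader <url>");
            return;
        }
        // Prints the raw HTML, which still has to be parsed afterwards.
        System.out.println(downloadString(args[0]));
    }
}
```

As with the C# version, the result is the raw HTML of the page; extracting the article text is a separate parsing step.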
