java.util.Scanner and Wikipedia


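I'm trying to use java.util.Scanner to fetch Wikipedia content and use it for word-based searches. It mostly works, but for some words it gives me errors. After looking at the code and running some checks, it turns out that for some words the encoding is not recognized, or something like that, and the content is no longer readable. This is the code used to fetch the page: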

// Start

try {
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    scanner.useDelimiter("\\Z");
    content = scanner.next();
//  if (word.equals("pubblico"))
//      System.out.println(content);
    System.out.println("Doing: " + word);
// End

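The problem shows up with the word "pubblico" on the Italian Wikipedia. The println output for pubblico looks like this (truncated): èèè½]Ksr>；èè½~E ï½1Aï½ï½ï½ï½Eï½ER3tHZï½4vï½ï；PZjtcè½è½è½è½è½è½è½=8è½è½è

Do you know why? Judging by the page source and the headers, this page looks the same as the others and uses the same encoding.

It turns out the content is gzipped. Can I tell Wikipedia not to send me their pages compressed, or is decompressing the only way? Thanks, everyone.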

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
String ctype = connection.getContentType();
int csi = ctype.indexOf("charset=");
Scanner scanner;
if (csi > 0)
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8)));
else
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = scanner.next();
if (word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: " + word);

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
//connection.addRequestProperty("Accept-Encoding", "");
//System.out.println(connection.getContentEncoding());
InputStream resultingInputStream = null;            // Stream the downloaded page is read from
String encoding = connection.getContentEncoding();  // Content encoding of the response (identity, gzip, deflate)
// Pick the appropriate decompressor for reading the source
if (encoding != null && encoding.equals("gzip")) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
} else if (encoding != null && encoding.equals("deflate")) {
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
} else {
    resultingInputStream = connection.getInputStream();
}
// Scanner that pulls the whole page out of the stream and into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = scanner.next();
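The decompress-then-read path can be checked without touching the network. The following is only a sketch under stated assumptions: the class `GzipBodyDemo` and its helpers `gzip` and `decode` are hypothetical names, the "response body" is built in memory, and UTF-8 is assumed as the page charset. It shows that reading gzip bytes directly yields unreadable text, while wrapping the stream in a `GZIPInputStream` restores it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBodyDemo {

    // Hypothetical helper: compresses text the way a server might send it.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Hypothetical helper: reads a body as text, decompressing first when
    // the Content-Encoding value says "gzip". UTF-8 is an assumption here.
    static String decode(byte[] body, String contentEncoding) {
        try {
            InputStream in = new ByteArrayInputStream(body);
            if ("gzip".equalsIgnoreCase(contentEncoding)) {
                in = new GZIPInputStream(in);
            }
            try (Scanner scanner = new Scanner(in, "UTF-8")) {
                scanner.useDelimiter("\\Z");
                return scanner.hasNext() ? scanner.next() : "";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = gzip("pubblico");
        // Decompressed first: the original text comes back.
        System.out.println(decode(body, "gzip"));
        // Read as if it were plain text: compressed bytes decoded as characters.
        System.out.println(decode(body, null).equals("pubblico"));
    }
}
```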

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
connection.addRequestProperty("Accept-Encoding", "");
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

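5 answers

# Answer 1

Try wrapping the stream in a Reader instead of passing the raw InputStream to Scanner; I think that should work.

You can also pass the charset directly to the Scanner constructor, as shown in another answer.

# Answer 2
Try using a Scanner with an explicit charset: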
```
public Scanner(InputStream source, String charsetName)
```
The documentation for the default constructor says:

Bytes from the stream are converted into characters using the underlying platform's default charset.

Scanner on java.sun.com
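To make the difference concrete, here is a minimal sketch that feeds the same UTF-8 bytes to `Scanner(InputStream, String)` twice, once with the matching charset and once with a mismatched one. The class name `ScannerCharsetDemo` and the helper `readAll` are hypothetical, and the sample text is an assumption chosen because it contains accented Italian characters.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScannerCharsetDemo {

    // Hypothetical helper: reads the whole stream as a single token,
    // decoding the bytes with the named charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\Z");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // "perché è pubblico", written with escapes so the source file
        // encoding does not matter.
        String original = "perch\u00e9 \u00e8 pubblico";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String matching = readAll(new ByteArrayInputStream(utf8Bytes), "UTF-8");
        String mismatched = readAll(new ByteArrayInputStream(utf8Bytes), "ISO-8859-1");

        System.out.println(matching.equals(original));   // charset matches the bytes
        System.out.println(mismatched.equals(original)); // each accented letter becomes two characters
    }
}
```

Naming the charset explicitly removes the dependency on the platform default, which is what the quoted documentation warns about.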
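
# Answer 3

You need to use a URLConnection so that you can inspect the content-type header of the response. That should tell you which character encoding to use when you create your Scanner.

Specifically, look at the "charset" parameter of the content-type header.

To suppress gzip compression, set the accept-encoding header to "identity". See the HTTP specification for details.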
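The two steps described above could look roughly like this. It is a sketch, not a definitive implementation: `ContentTypeCharset`, `charsetOf`, and `requestUncompressed` are hypothetical names, the fallback charset is the caller's choice, and a server is free to ignore the `Accept-Encoding` request, so checking `getContentEncoding()` on the response is still worthwhile.

```java
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: extracts the "charset" parameter from a
    // Content-Type value such as "text/html; charset=UTF-8".
    // Returns the fallback when the parameter is missing or unusable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Hypothetical helper: asks the server for an uncompressed body.
    // Must be called before the connection is opened (before getInputStream()).
    static void requestUncompressed(URLConnection connection) {
        connection.setRequestProperty("Accept-Encoding", "identity");
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=UTF-8", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

The resulting `Charset` name can then be passed to `new Scanner(connection.getInputStream(), charset.name())`.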

# Answer 4

This really works.

# Answer 5

The encoding doesn't change. Why?
