One of the Major Pitfalls of Java's InputStream
InputStream can be used to read from a URL, but it has a serious flaw: the call can easily hang and never return, and often the only remedy is to restart the JVM. By default no timeout is set, so the read will wait forever. Sometimes a server inexplicably stops responding; the data may in fact be almost fully transferred, with only the last packet missing, but that packet may never arrive.
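To see how this bites, here is a minimal sketch contrasting a bare read with one that sets explicit timeouts (the URL is a placeholder; setConnectTimeout and setReadTimeout are the standard java.net.URLConnection methods):

;; A bare read: java.net.URL/openStream sets no timeout at all, so this
;; .read can block forever if the server stops sending data.
(with-open [in (.openStream (java.net.URL. "http://example.com/file.bin"))]
  (.read in))

;; With timeouts set on the URLConnection, the same read throws a
;; java.net.SocketTimeoutException instead of hanging:
(let [con (.openConnection (java.net.URL. "http://example.com/file.bin"))]
  (.setConnectTimeout con 10000) ; max 10s to establish the connection
  (.setReadTimeout con 10000)    ; max 10s waiting for data on a read
  (with-open [in (.getInputStream con)]
    (.read in)))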
This happens unpredictably and completely at random: a request that fails one second may succeed the next, much like a web page that errors out on one load and comes back fine after a refresh.
Therefore it is never advisable to read network resources with Java's raw API directly. At the very least, wrap it yourself and set a timeout. Better still, add a retry mechanism: when the timeout is exceeded, resend the request, with a cap on the number of retries. Three strikes is a reasonable rule; if three requests to a URL still fail to return the data, you can usually give up.
The Clojure code below comes from https://github.com/overtone/overtone/blob/master/src/overtone/helpers/file.clj, with a few modifications. It can be copied and used directly, with no external dependencies.
;; Download helpers adapted from overtone (GitHub).
(require 'clojure.java.io)

(defn pretty-file-size
  "Takes a number of bytes and returns a prettied string with an
  appropriate unit: B, KB or MB."
  [n-bytes]
  (let [n-kb (int (/ n-bytes 1024))
        n-mb (int (/ n-kb 1024))]
    (cond
      (< n-bytes 1024) (str n-bytes " B")
      (< n-kb 1024)    (str n-kb " KB")
      :else            (str n-mb " MB"))))

;; The two helpers below come from the original overtone source, where
;; they drive progress reporting. This trimmed copy does not call them,
;; but they are kept in case you want to restore that behaviour.

(defn print-file-copy-status
  "Print copy status in percentage - granularity times - evenly as
  num-copied-bytes approaches file-size."
  [num-copied-bytes buf-size file-size slices]
  (let [min num-copied-bytes
        max (+ buf-size num-copied-bytes)]
    (when-let [slice (some (fn [slice]
                             (when (and (> (:val slice) min)
                                        (< (:val slice) max))
                               slice))
                           slices)]
      (println (str (:perc slice) "% ("
                    (pretty-file-size num-copied-bytes) ") completed")))))

(defn percentage-slices
  "Returns a seq of maps of length num-slices where each map represents
  a percentage and the associated portion of size.

  usage:
  (percentage-slices 1000 2) ;=> ({:perc 50N, :val 500N} {:perc 100, :val 1000})"
  [size num-slices]
  (map (fn [slice]
         (let [perc (/ (inc slice) num-slices)]
           {:perc (* 100 perc)
            :val  (* size perc)}))
       (range num-slices)))

(defn remote-file-copy
  "Similar to the corresponding implementation of #'do-copy in
  clojure.java.io. This trimmed copy simply transfers bytes in 2 KB
  chunks; the original overtone version also prints progress when
  *verbose-overtone-file-helpers* is bound to true."
  [in-stream out-stream]
  (let [buf-size 2048
        buffer   (make-array Byte/TYPE buf-size)]
    (loop [bytes-copied 0]
      (let [size (.read in-stream buffer)]
        (when (pos? size)
          (.write out-stream buffer 0 size)
          (recur (+ size bytes-copied)))))
    (println "--> Download successful")))

(defn download-file-with-timeout
  "Downloads the remote file at url to the local file specified by
  target-path. If data transfer stalls for more than timeout ms, throws
  a java.net.SocketTimeoutException."
  [url target-path timeout]
  (let [url (java.net.URL. url)
        con (.openConnection url)]
    (.setReadTimeout con timeout)
    (with-open [in  (.getInputStream con)
                out (clojure.java.io/output-stream target-path)]
      (remote-file-copy in out))
    target-path))

;(download-file-with-timeout "http://zhangley.com/images/2016-04-19_9-10-15.jpg" "c:\\tmp\\aa.jpg" (* 5 60 1000)) ; 5 minutes is usually enough for an ordinary file

(defn download-file*
  ([url path timeout]
   (download-file-with-timeout url path timeout))
  ([url path timeout n-retries]
   (download-file* url path timeout n-retries 5000))
  ([url path timeout n-retries wait-t]
   (download-file* url path timeout n-retries wait-t 0))
  ([url path timeout n-retries wait-t attempts-made]
   (when (>= attempts-made n-retries)
     (throw (Exception. (str "Aborting! Download failed after " n-retries
                             " attempts. URL attempted to download: " url))))
   (try
     (download-file-with-timeout url path timeout)
     (catch Exception e
       (Thread/sleep wait-t)
       (println (str "Download timed out. Retry " (inc attempts-made) ": " url))
       (download-file* url path timeout n-retries wait-t (inc attempts-made))))))

(defn download-file
  "Downloads the file pointed to by url to local path. If no timeout is
  specified, blocking IO is used to transfer the data. If timeout is
  specified, the transfer will block for at most timeout ms before
  throwing a java.net.SocketTimeoutException if data transfer has
  stalled.

  It's also possible to specify n-retries to determine how many attempts
  to make to download the file, and also the wait-t between attempts in
  ms (defaults to 5000 ms)."
  ([url path timeout]
   (download-file* url path timeout))
  ([url path timeout n-retries]
   (download-file* url path timeout n-retries))
  ([url path timeout n-retries wait-t]
   (download-file* url path timeout n-retries wait-t)))
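Putting it together, a usage sketch matching the "three strikes" rule from above (the URL and local path here are placeholders):

;; 5-minute stall timeout per attempt, at most 3 attempts,
;; 5 seconds between retries. Throws if all three attempts fail.
(download-file "http://example.com/big-file.zip"
               "/tmp/big-file.zip"
               (* 5 60 1000) ; timeout per attempt, in ms
               3             ; n-retries
               5000)         ; wait-t between attempts, in ms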