Closed as not planned
While trying to decode the Common Crawl index files, GzDecoder stops after about 1.8M of input from a 690M file. The file is too large to use `.read_to_end` (i.e. read it entirely into memory). If you download the file and run `gzip -d cdx-00010.gz`, the whole file is expanded.

How do you use GzDecoder to get the same behavior as `gzip -d`?

The code below exits early because `decoder.read` returns 0 bytes, whereas reading from the underlying stream (`input_stream.read`) would continue. So I assume there is some format feature in the file that gzip handles and GzDecoder does not. It prints `Read 0 (...)` before exiting, so I assume there are no errors.
Thanks
Example input:
https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00010.gz
Example code:
```rust
use std::io::prelude::*;
use std::fs::File;
use flate2::read::GzDecoder;

pub fn decode_to_stream(input_stream: &mut dyn Read) {
    let mut output_file = File::create("decoded").expect("Could not create output file.");
    let mut decoder = GzDecoder::new(input_stream);
    let mut buffer = [0u8; 65536];
    let mut total_read = 0;
    while let Ok(read_size) = decoder.read(&mut buffer[..]) {
        println!("Read {} ({}).", read_size, total_read);
        // `read` returning Ok(0) means the decoder reached end of stream.
        if read_size == 0 {
            break;
        }
        output_file
            .write_all(&buffer[..read_size])
            .expect("Could not write output.");
        total_read += read_size;
    }
}

fn main() {
    let mut file = File::open("cdx-00010.gz").expect("Could not open index file.");
    decode_to_stream(&mut file);
}
```