GzDecoder stops decoding file toward the start. #339

@nschuessler

Description

While trying to decode the Common Crawl index files, GzDecoder stops after about 1.8 MB of input from a 690 MB file.
The file is too large to use .read_to_end (i.e. to read it into memory).

If you download the file and use gzip -d cdx-00010.gz the whole file is expanded.
How do you use GzDecoder to get the same behavior as gzip -d?

The code exits early because decoder.read returns 0 bytes, whereas reading directly from the stream (input_stream.read) continues to return data. So I assume there is some format feature in the file that GzDecoder does not handle but gzip does. It prints 'Read 0 x' before exiting, so I assume there are no errors.

Thanks

Example input:
https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00010.gz

Example code:

 let mut file = File::open("cdx-00010.gz").expect("Could not open index file.");
 decode_to_stream(&mut file);

use std::io::prelude::*;
use std::io;
use std::io::BufReader;
use std::fs::File;
use flate2::read::GzDecoder;


pub fn decode_to_stream(input_stream: &mut dyn Read)
{
    let mut output_file = File::create("decoded").expect("Could not create output file.");
    let mut decoder = GzDecoder::new(input_stream);
    let mut buffer = [0; 65536];
    let mut total_read = 0;
    while let Ok(read_size) = decoder.read(&mut buffer[..])
    {
        println!("Read {} ({}).", read_size, total_read);
        // read() returns a usize, so 0 is the only possible "end of stream" value.
        if read_size == 0 {
            break;
        }

        // write_all retries short writes and surfaces errors, unlike an ignored write().
        output_file
            .write_all(&buffer[..read_size])
            .expect("Could not write to output file.");
        total_read += read_size;
    }
}
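For reference, a likely explanation (not confirmed against this particular file): large Common Crawl .gz files are typically concatenated multi-member gzip streams. flate2's GzDecoder decodes only the first member and then reports end-of-stream, which matches the early 0-byte read described above, while gzip -d keeps going through every member. flate2 provides MultiGzDecoder for exactly this case. A minimal sketch of the same function using it (decode_all_to_stream is a hypothetical name, and the hard-coded "decoded" output path is kept from the original):

```rust
use std::fs::File;
use std::io::{self, Read};
use flate2::read::MultiGzDecoder;

// Like decode_to_stream above, but MultiGzDecoder continues past the first
// gzip member instead of stopping, matching the behavior of `gzip -d`.
pub fn decode_all_to_stream(input_stream: &mut dyn Read) -> io::Result<u64> {
    let mut output_file = File::create("decoded")?;
    let mut decoder = MultiGzDecoder::new(input_stream);
    // Stream decompressed bytes straight to the output file;
    // returns the total number of bytes written.
    io::copy(&mut decoder, &mut output_file)
}
```

If you want to keep the explicit 64 KiB read loop for the progress printout, the only change needed should be swapping GzDecoder for MultiGzDecoder.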
