Extract file header from stream instead of physical file

I've been looking through your library and I found the `extractFileHeader` and it works great.
My only issue is that we are running in a cloud environment and dealing with rather large avro files (48gb).
Having to download that file onto the docker image and inspect it is rather in efficient.

I've been trying to modify your method to allow me to take a `Readable` instead of the actual `path` but it turns out too many inner methods are being used in the `extractFileHeader` for it to be feasible. I've done something like this:
```
import { Injectable } from '@nestjs/common';
import * as avro from 'avsc';
import { Readable } from 'stream';

@Injectable()
export class AvroSchemaFileExtractorService {
  MAGIC_BYTES;
  HEADER_TYPE;

  constructor() {
    this.MAGIC_BYTES = Buffer.from('Obj\x01');
    const OPTS = { namespace: 'org.apache.avro.file' };
    const MAP_BYTES_TYPE = avro.Type.forSchema({ type: 'map', values: 'bytes' }, OPTS);
    this.HEADER_TYPE = avro.Type.forSchema(
      {
        name: 'Header',
        type: 'record',
        fields: [
          { name: 'magic', type: { type: 'fixed', name: 'Magic', size: 4 } },
          { name: 'meta', type: MAP_BYTES_TYPE },
          { name: 'sync', type: { type: 'fixed', name: 'Sync', size: 16 } },
        ],
      },
      OPTS,
    );
  }

  async get(fileStream: Readable, opts: any = {}): Promise<string | null> {
    // const decode = opts.decode === undefined ? true : !!opts.decode;
    const size = Math.max(opts.size || 4096, 4);
    let buf = Buffer.alloc(size);
    for await (const chunk of fileStream) {
      if (chunk.length > size) {
        buf = chunk;
        break;
      }
    }
    try {
      if (buf.length < 4 || !this.MAGIC_BYTES.equals(buf.slice(0, 4))) {
        return null;
      }

      // Here it starts to break down.
      const tap = new (avro as any).utils.Tap(buf);
      let header = null;
      do {
        header = (this.HEADER_TYPE as any)._read(tap);
      } while (!(avro as any).isValid());
      // if (decode !== false) {
      //   const meta = header.meta;
      //   meta['avro.schema'] = JSON.parse(meta['avro.schema'].toString());
      //   if (meta['avro.codec'] !== undefined) {
      //     meta['avro.codec'] = meta['avro.codec'].toString();
      //   }
      // }
      return header;
    } finally {
      if (opts.destroy) {
        fileStream.destroy();
      }
    }
  }
}

```

But again the inner methods are not exposed and I cannot access them.
Would it be possible to include a more cloud friendly version that accepts a stram instead of a path?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Extract file header from stream instead of physical file #396

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Extract file header from stream instead of physical file #396

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions