This repository was archived by the owner on Nov 30, 2017. It is now read-only.

arfon/gh_archive_parser


GitHub Archive Parser

Some rather unpleasant code that can be used to normalize the structure of GitHub public events across the years.


Usage

Assuming you have a folder called files containing compressed GitHub Archive JSON files (e.g. http://data.githubarchive.org/2015-01-01-15.json.gz), the following code (also in parse.rb) should extract the JSON and normalize each event to the schema expected by the GitHub Archive (schema.js).

require 'zlib'
require 'yajl'
require_relative 'event_transform'

incoming = Dir.glob('files/*.json.gz')
parse_error_count = 0

incoming.each do |file|
  puts "*********************"
  puts "Working with #{file}"
  puts "*********************"
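  # Read the whole archive; the block form closes the file handle for us.
  begin
    js = Zlib::GzipReader.open(file) { |gz| gz.read }
  rescue Zlib::GzipFile::Error
    # Raised when the file is empty or is not valid gzip data.
    puts "Empty file, no events"
    next
  end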

  begin
    Yajl::Parser.parse(js) do |event|

      transformer = EventTransform.new(event)
      transformer.process
      encoded = Yajl::Encoder.encode(transformer.parsed_event)

      puts encoded
    end
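  rescue Yajl::ParseError
    # Parsing stops at the first malformed event, so this counts files, not events.
    parse_error_count += 1
  end
end

# Report on stderr so the summary does not mix with the JSON on stdout.
warn "Files with parse errors: #{parse_error_count}"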
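
The script above reads each archive fully into memory and gives up on a file at its first parse error. As a rough alternative, the sketch below streams a gzipped file line by line and skips only the malformed lines. It relies on assumptions that are not part of this repository: it uses Ruby's standard json library rather than yajl, it assumes one JSON object per line (which may not hold for every archive file), and each_archive_event is a hypothetical helper name. It does not call EventTransform; the yielded hash is where that step would go.

```ruby
require 'zlib'
require 'json'
require 'tempfile'

# Hypothetical helper (not part of this repository).
# Yields each event from a gzipped file holding one JSON object per line.
# Returns the number of lines that could not be parsed.
def each_archive_event(path)
  bad_lines = 0
  Zlib::GzipReader.open(path) do |gz|
    gz.each_line do |line|
      next if line.strip.empty?
      begin
        event = JSON.parse(line)
      rescue JSON::ParserError
        bad_lines += 1
        next
      end
      yield event
    end
  end
  bad_lines
rescue Zlib::GzipFile::Error
  # Raised for empty or non-gzip files; treat them as containing no events.
  0
end

# Demo with a throwaway file: two valid events and one truncated line.
Tempfile.create(['events', '.json.gz']) do |tmp|
  tmp.close
  Zlib::GzipWriter.open(tmp.path) do |gz|
    gz.puts '{"type":"PushEvent"}'
    gz.puts '{"type":'
    gz.puts '{"type":"WatchEvent"}'
  end

  types = []
  bad = each_archive_event(tmp.path) { |event| types << event['type'] }
  puts "types=#{types.inspect} bad=#{bad}"
end
```

Parsing per line trades yajl's ability to handle objects that are not newline-separated for finer-grained error recovery, so it is only a fit when the one-object-per-line assumption holds.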
