NOC Analyst | Observability Engineer | SRE Enthusiast
📍 Nairobi County, Kenya
Welcome to my GitHub! I'm a Network Operations Center (NOC) Analyst with 5+ years in financial services, specializing in observability, monitoring, and reliability engineering. I transform complex system data into actionable insights using modern monitoring stacks, automation, and SRE principles to ensure high-availability infrastructure.
- System Observability: Building comprehensive monitoring solutions with ELK, Zabbix, and Grafana
- Site Reliability Engineering: Implementing SLOs, error budgets, and automation for infrastructure resilience
- Security Operations: Integrating security monitoring into observability pipelines
- Infrastructure as Code: Automating deployment and monitoring with Python and Bash scripting
1. Dead File Detection Tool
▶ Python-based dead code detector for identifying unused and orphaned files in repositories
▶ Multi-language support including Python, JavaScript, Java, C++, Go, and 20+ file types
▶ Smart categorization of unreferenced, orphaned, and suspicious files with heuristic analysis
▶ CI/CD integration with JSON output and configurable rules for automated code maintenance
▶ Production-ready tool with comprehensive documentation and example configurations
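The core idea behind unreferenced-file detection can be sketched in a few lines. This is a simplified illustration, not the tool's actual implementation: the function name `find_unreferenced`, the `ignore_stems` parameter, and the "file stem appears as a whole word in some other file" heuristic are all assumptions made for the example.

```python
import re
from pathlib import Path


def find_unreferenced(root, extensions=(".py", ".js"), ignore_stems=()):
    """Return relative paths of files whose stem is never mentioned elsewhere.

    Hypothetical heuristic: a file counts as referenced if its stem
    (name without extension) appears as a whole word in any other scanned file.
    """
    root = Path(root)
    files = sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix in extensions
    )
    # Read every file once so each candidate can be checked against the rest.
    contents = {p: p.read_text(encoding="utf-8", errors="ignore") for p in files}

    unreferenced = []
    for candidate in files:
        if candidate.stem in ignore_stems:
            continue  # e.g. known entry points that nothing imports
        pattern = re.compile(r"\b" + re.escape(candidate.stem) + r"\b")
        referenced = any(
            pattern.search(text)
            for other, text in contents.items()
            if other != candidate
        )
        if not referenced:
            unreferenced.append(candidate.relative_to(root).as_posix())
    return unreferenced
```

Entry points such as `main.py` are never imported by anything, so a name-based heuristic like this will flag them unless they are listed in `ignore_stems`. Wrapping the result in `json.dumps({"unreferenced": ...})` is one way to feed it to a CI step.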
▶ Production-grade Streamlit application for comprehensive uptime monitoring and SLA management
▶ Advanced analytics engine with adaptive spike detection using Z-score and MAD algorithms
▶ Automated SLA reporting with daily uptime calculations pushed back to Zabbix via trapper items
▶ Cross-host correlation analysis for identifying infrastructure-wide performance patterns
▶ Smart caching system with configurable TTL and auto history/trends switching for optimal performance
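For readers unfamiliar with MAD-based spike detection, here is a minimal sketch of the technique using the modified z-score. The function name `mad_spikes` is hypothetical, and the 0.6745 scaling constant and 3.5 threshold are the commonly cited defaults for this method, not necessarily the values used in the application.

```python
from statistics import median


def mad_spikes(values, threshold=3.5):
    """Return indices of points whose modified z-score exceeds `threshold`.

    Modified z-score: 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half the points equal the median; scores are undefined.
        return []
    return [
        i
        for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]
```

Because the median and MAD are barely moved by a single extreme value, one large spike does not inflate the baseline the way it would with a mean and standard deviation, which is why this approach tends to produce fewer missed or spurious alerts on bursty metrics.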
▶ Python-based infrastructure provisioning and configuration management
▶ Automated remediation scripts for common NOC incidents (reduced MTTR by 35%)
▶ Health check orchestration with self-healing capabilities
▶ Compliance monitoring with automated security posture assessments
▶ Real-time network telemetry collection from Cisco and SD-WAN infrastructure
▶ Predictive analytics for capacity planning and anomaly detection
▶ Integration with threat intelligence feeds for security-aware monitoring
▶ Custom Grafana panels for executive-level reporting
▶ Incident response automation with Slack/Teams integration
▶ Post-incident analysis templates and blameless culture documentation
▶ Chaos engineering experiments for system resilience testing
▶ Toil identification and elimination tracking
- 99.8% average uptime across monitored services (tracked via custom Zabbix analytics)
- 12 minutes Mean Time to Detection (MTTD) using adaptive spike detection algorithms
- 22 minutes Mean Time to Resolution (MTTR) with automated correlation analysis
- 65% reduction in manual operational tasks through Python automation
- Zero false positive alerts through intelligent filtering and MAD-based anomaly detection
- Production-grade SLA monitoring with automated daily uptime reporting to stakeholders
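To put the uptime figure in perspective, the arithmetic behind an availability percentage can be sketched as below. The function names and the 30-day window are assumptions for illustration; under that window, 99.8% uptime corresponds to roughly 86.4 minutes of allowed downtime.

```python
MINUTES_PER_DAY = 24 * 60


def daily_uptime_percent(down_minutes):
    """Uptime for a single day, given the minutes a service was down."""
    return (MINUTES_PER_DAY - down_minutes) / MINUTES_PER_DAY * 100


def downtime_budget_minutes(target_percent, days=30):
    """Minutes of downtime allowed over `days` while still meeting the target."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100 - target_percent) / 100


if __name__ == "__main__":
    # Assumed 30-day reporting window.
    print(round(downtime_budget_minutes(99.8), 1))
    print(round(daily_uptime_percent(14.4), 2))
```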
| Certification | Status |
|---|---|
| ISC2 Certified in Cybersecurity (CC) | ✅ Active |
| CompTIA Linux+ | ✅ Active |
| Cisco Cybersecurity Essentials | ✅ Active |
| AWS Solutions Architect | 🎯 In Progress |
- Advanced Analytics: Implementing Z-score and MAD algorithms for infrastructure anomaly detection
- SLA Engineering: Building comprehensive uptime calculation engines with automated reporting
- API Integration: Developing robust API clients with OAuth2, caching, and rate limiting
- Statistical Analysis: Cross-host correlation matrices for infrastructure pattern identification
- Performance Optimization: Auto-switching between Zabbix history/trends for optimal query performance
- Production Systems: Deploying enterprise-grade monitoring dashboards with 14-day log retention
- Code Quality Automation: Building dead code detection tools for automated repository maintenance
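As an illustration of the caching and history/trends switching mentioned above, here is a small sketch. `history.get` and `trend.get` are real Zabbix API methods, but the `TTLCache` class, the `pick_zabbix_method` helper, and the 7-day cutoff are hypothetical choices for this example rather than the dashboard's actual code.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now, value)
        return value


def pick_zabbix_method(time_from, time_till, history_days=7):
    """Choose raw history for short ranges and aggregated trends for long ones.

    `time_from` and `time_till` are Unix timestamps; the cutoff is an assumption.
    """
    span_days = (time_till - time_from) / 86400
    return "history.get" if span_days <= history_days else "trend.get"
```

Trends are hourly aggregates, so querying them for long ranges returns far fewer rows than raw history, at the cost of per-sample detail. The cache key would typically combine the item ID, the time range, and the chosen method.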
I'm passionate about advanced observability engineering, statistical monitoring algorithms, and production-grade SLA systems. Whether you're interested in anomaly detection, automated uptime reporting, or building enterprise monitoring dashboards, I'd love to share insights and collaborate!
🔗 LinkedIn: josephkibaki
📧 Email: kibaki.joseph1@gmail.com
🐦 Twitter: @J_Kibaki
💬 Open to: Mentoring, Knowledge Sharing, SRE Discussions
"Observability is not about collecting data; it's about understanding your systems well enough to ask the right questions when things go wrong."
⚡ Building reliable systems, one metric at a time ⚡