The goal of this repository is to use the data from the Kaggle "Microsoft Malware Prediction" competition and apply data science techniques to predict whether a machine will be hit by malware. The biggest challenges in this competition are the huge dataset, finding ways to process it on a Kaggle kernel, Google Colab, or a local machine without running out of memory, and the high number of features.
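Since memory was one of the main obstacles, a common workaround is to shrink the dataset right after loading it. Below is a minimal sketch of that idea, assuming pandas and the competition's `train.csv`; the helper name `reduce_mem_usage` is illustrative and not a function from this repository.

```python
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns and convert object columns to pandas categoricals."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif df[col].dtype == object:
            df[col] = df[col].astype("category")
    return df

# Passing an explicit dtype dict to read_csv avoids the initial memory peak entirely;
# downcasting afterwards is the simpler, more generic option shown here.
train = pd.read_csv("train.csv")
train = reduce_mem_usage(train)
print(train.memory_usage(deep=True).sum() / 1024**2, "MB")
```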
- Malware Detection - EDA and LGBM
- Malware Detection - Extended EDA
- Microsoft Malware Prediction - EDA with Tableau
- Malware Prediction - Adversarial Validation
- Deep_learning [link]
- Documentation [link]
- Project working cycle and effort, relevant content and insights [link]
- EDA [link]
- Analysis of Train Dataset Distribution [link]
- Analysis of the Distribution Between Test and Train [link]
- Encoding evaluation with datasets distribution [link]
- Encoding for binary features [link]
- Encoding for features with high cardinality [link]
- Encoding for features with low cardinality [link]
- Malware Detection - EDA and LGBM [link]
- Malware Detection - Extended EDA [link]
- Feature type and cardinality [link]
- Missing study (high) [link]
- Missing study (low) [link]
- Version and Build features [link]
- Binary features EDA [link]
- Numerical features EDA [link]
- Train validation split [link]
- Model backlog [link]
- Utils [link]
- Auxiliary script to merge data sets [link]
Kaggle competition: https://www.kaggle.com/c/microsoft-malware-prediction

Our insights are here.
The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways.
With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.
As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.
Can you help protect more than one billion machines from damage BEFORE it happens?
This competition is hosted by Microsoft, Windows Defender ATP Research, Northeastern University College of Computer and Information Science, and Georgia Tech Institute for Information Security & Privacy.
- General
- Deep_learning
- EDA
- Model backlog
- Google colab
- Kaggle
- The TensorFlow end-to-end part was stopped because it took too much effort; I was able to iterate faster with Keras, but it may still be a good option for longer neural network training runs.
- Try model stacking; it may improve predictions, but it requires a different training and validation setup for the models (see the stacking sketch below).
- Features like AvSigVersion and OSBuild may be mapped to datetime, as pointed out by other competitors, to build a timeline; this may help with train/validation splits (see the timeline sketch below).
- Try dimensionality reduction on high-cardinality features; techniques like PCA, t-SNE, or autoencoders may help (see the sketch after this list).
- Other approaches to hyperparameter tuning, like Bayesian optimization, may help as well (see the tuning sketch below).
- Model explainability techniques like SHAP or ELI5 may help to fine-tune models (see the SHAP sketch below).
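A minimal stacking sketch using scikit-learn's `StackingClassifier`, assuming numeric feature matrices `X_train`/`X_valid` and target `y_train` are already prepared; the base models and their parameters are placeholders, not the ones used in this project.

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("lgbm", LGBMClassifier(n_estimators=500, learning_rate=0.05)),
        ("lgbm_deep", LGBMClassifier(n_estimators=300, num_leaves=255)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # the meta-model learns from base-model probabilities
    cv=5,                          # out-of-fold predictions avoid leaking labels into the meta-model
)
stack.fit(X_train, y_train)
valid_pred = stack.predict_proba(X_valid)[:, 1]
```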
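A sketch of the timeline idea for AvSigVersion, assuming `train` is already loaded and that a lookup table from signature version to release date is available (competitors shared such tables in the competition forums); the two dictionary entries below are hypothetical placeholders, and a full table is needed in practice.

```python
import pandas as pd

# Hypothetical lookup; in practice this comes from an external AvSigVersion -> date table.
sig_to_date = {
    "1.273.1420.0": "2018-08-01",
    "1.281.501.0": "2018-10-15",
}

train["AvSigDate"] = pd.to_datetime(train["AvSigVersion"].map(sig_to_date))

# Time-based split: train on older signatures, validate on the most recent ones,
# which mirrors the suspicion that the test data is newer than the train data.
cutoff = train["AvSigDate"].quantile(0.9)
train_part = train[train["AvSigDate"] <= cutoff]
valid_part = train[train["AvSigDate"] > cutoff]
```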
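For dimensionality reduction on high-cardinality columns, one option is to hash the categorical values into a sparse matrix and compress it with `TruncatedSVD` (essentially PCA for sparse inputs), a stand-in here for the PCA/t-SNE/autoencoder ideas above. The column selection, hash size, and component count below are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import FeatureHasher

# Example high-cardinality columns from the dataset; adjust to the real selection.
high_card_cols = ["AvSigVersion", "Census_OEMModelIdentifier", "CityIdentifier"]

# Tokenize each row as "column=value" strings so identical values in different
# columns do not collide in the hash space.
tokens = train[high_card_cols].astype(str).apply(
    lambda row: [f"{col}={val}" for col, val in row.items()], axis=1
)

hashed = FeatureHasher(n_features=2**14, input_type="string").transform(tokens)
svd = TruncatedSVD(n_components=20, random_state=42)
reduced = svd.fit_transform(hashed)  # dense (n_rows, 20) matrix to append as new features
```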
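A tuning sketch with Optuna, whose default TPE sampler is one readily available Bayesian-style optimizer; it assumes `X` and `y` are prepared, and the parameter ranges are arbitrary starting points rather than values used in this project.

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 31, 255),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 20, 500),
        "n_estimators": 500,
    }
    model = LGBMClassifier(**params)
    # ROC AUC is the competition metric.
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```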
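A SHAP sketch for a fitted tree model, assuming a trained LightGBM classifier `model` and a (possibly sampled) validation frame `X_valid`; the names are assumptions, not objects from this repository.

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

# Depending on the shap version, binary classifiers may return a per-class list.
values = shap_values[1] if isinstance(shap_values, list) else shap_values

# Global view of which features drive the HasDetections predictions.
shap.summary_plot(values, X_valid)
```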