The goal of this repository is to use the data from the Kaggle "Microsoft Malware Prediction" competition and apply data science techniques to predict whether a machine will be hit by malware. The biggest challenges in this competition are the huge dataset, finding ways to process it on a Kaggle kernel, Google Colab, or a local machine without running out of memory, and the high number of features.
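Since memory was one of the main obstacles, a common workaround is to shrink the dataset right after loading it. Below is a minimal sketch of that idea, assuming pandas and the competition's `train.csv`; the helper name `reduce_mem_usage` is illustrative and not a function from this repository.

```python
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns and convert object columns to pandas categoricals."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif df[col].dtype == object:
            df[col] = df[col].astype("category")
    return df

# Passing an explicit dtype dict to read_csv avoids the initial memory peak entirely;
# downcasting afterwards is the simpler, more generic option shown here.
train = pd.read_csv("train.csv")
train = reduce_mem_usage(train)
print(train.memory_usage(deep=True).sum() / 1024**2, "MB")
```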
- Malware Detection - EDA and LGBM
- Malware Detection - Extended EDA
- Microsoft Malware Prediction - EDA with Tableau
- Malware Prediction - Adversarial Validation
- Deep_learning [link]
- Documentation [link]
- Project working cycle and effort, relevant content and insights [link]
- EDA [link]
- Analysis of Train Dataset Distribution [link]
- Analysis of the Distribution Between Test and Train [link]
- Encoding evaluation with datasets distribution [link]
- Encoding for binary features [link]
- Encoding for features with high cardinality [link]
- Encoding for features with low cardinality [link]
- Malware Detection - EDA and LGBM [link]
- Malware Detection - Extended EDA [link]
- Feature type and cardinality [link]
- Missing study (high) [link]
- Missing study (low) [link]
- Version and Build features [link]
- Binary features EDA [link]
- Numerical features EDA [link]
- Train validation split [link]
- Model backlog [link]
- Utils [link]
- Auxiliary script to merge data sets [link]
Kaggle competition: https://www.kaggle.com/c/microsoft-malware-prediction

Our insights are here.
The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways.
With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving security.
As one part of their overall strategy for doing so, Microsoft is challenging the data science community to develop techniques to predict if a machine will soon be hit with malware. As with their previous Malware Challenge (2015), Microsoft is providing Kagglers with an unprecedented malware dataset to encourage open-source progress on effective techniques for predicting malware occurrences.
Can you help protect more than one billion machines from damage BEFORE it happens?
This competition is hosted by Microsoft, Windows Defender ATP Research, Northeastern University College of Computer and Information Science, and Georgia Tech Institute for Information Security & Privacy.
- General
- Deep_learning
- EDA
- Model backlog
- Google colab
- Kaggle
- The TensorFlow end-to-end part was stopped because it took too much effort; I was able to iterate faster with Keras, but it may still be a good option for longer neural network training runs.
- Try model stacking; it may improve predictions, but it requires a different training and validation setup for the models (see the stacking sketch below).
- Features like AvSigVersion and OSBuild may be mapped to datetime, as pointed out by other competitors, to build a timeline; this may help with train/validation splits (see the timeline sketch below).
- Try dimensionality reduction on high-cardinality features; techniques like PCA, t-SNE, or autoencoders may help (see the sketch after this list).
- Other approaches to hyperparameter tuning, like Bayesian optimization, may help as well (see the tuning sketch below).
- Model explainability techniques like SHAP or ELI5 may help to fine-tune models (see the SHAP sketch below).
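A minimal stacking sketch using scikit-learn's `StackingClassifier`, assuming numeric feature matrices `X_train`/`X_valid` and target `y_train` are already prepared; the base models and their parameters are placeholders, not the ones used in this project.

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("lgbm", LGBMClassifier(n_estimators=500, learning_rate=0.05)),
        ("lgbm_deep", LGBMClassifier(n_estimators=300, num_leaves=255)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # the meta-model learns from base-model probabilities
    cv=5,                          # out-of-fold predictions avoid leaking labels into the meta-model
)
stack.fit(X_train, y_train)
valid_pred = stack.predict_proba(X_valid)[:, 1]
```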
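A sketch of the timeline idea for AvSigVersion, assuming `train` is already loaded and that a lookup table from signature version to release date is available (competitors shared such tables in the competition forums); the two dictionary entries below are hypothetical placeholders, and a full table is needed in practice.

```python
import pandas as pd

# Hypothetical lookup; in practice this comes from an external AvSigVersion -> date table.
sig_to_date = {
    "1.273.1420.0": "2018-08-01",
    "1.281.501.0": "2018-10-15",
}

train["AvSigDate"] = pd.to_datetime(train["AvSigVersion"].map(sig_to_date))

# Time-based split: train on older signatures, validate on the most recent ones,
# which mirrors the suspicion that the test data is newer than the train data.
cutoff = train["AvSigDate"].quantile(0.9)
train_part = train[train["AvSigDate"] <= cutoff]
valid_part = train[train["AvSigDate"] > cutoff]
```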
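For dimensionality reduction on high-cardinality columns, one option is to hash the categorical values into a sparse matrix and compress it with `TruncatedSVD` (essentially PCA for sparse inputs), a stand-in here for the PCA/t-SNE/autoencoder ideas above. The column selection, hash size, and component count below are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import FeatureHasher

# Example high-cardinality columns from the dataset; adjust to the real selection.
high_card_cols = ["AvSigVersion", "Census_OEMModelIdentifier", "CityIdentifier"]

# Tokenize each row as "column=value" strings so identical values in different
# columns do not collide in the hash space.
tokens = train[high_card_cols].astype(str).apply(
    lambda row: [f"{col}={val}" for col, val in row.items()], axis=1
)

hashed = FeatureHasher(n_features=2**14, input_type="string").transform(tokens)
svd = TruncatedSVD(n_components=20, random_state=42)
reduced = svd.fit_transform(hashed)  # dense (n_rows, 20) matrix to append as new features
```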
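A tuning sketch with Optuna, whose default TPE sampler is one readily available Bayesian-style optimizer; it assumes `X` and `y` are prepared, and the parameter ranges are arbitrary starting points rather than values used in this project.

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 31, 255),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 20, 500),
        "n_estimators": 500,
    }
    model = LGBMClassifier(**params)
    # ROC AUC is the competition metric.
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```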
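A SHAP sketch for a fitted tree model, assuming a trained LightGBM classifier `model` and a (possibly sampled) validation frame `X_valid`; the names are assumptions, not objects from this repository.

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

# Depending on the shap version, binary classifiers may return a per-class list.
values = shap_values[1] if isinstance(shap_values, list) else shap_values

# Global view of which features drive the HasDetections predictions.
shap.summary_plot(values, X_valid)
```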