This project covers data wrangling, which involves gathering, assessing, and cleaning data from different sources, e.g. a Twitter account. It is part of the requirements for the Udacity Data Analyst Nanodegree Program.
This repository contains the project files (pdf, png, and ipynb), the datasets (tsv and csv), and the README (md). The packages used in this project include:
- NumPy Version 1.23.1 - For arrays
- pandas Version 1.4.3 - For 2D data structures
- Matplotlib Version 3.5.2 - For Visualization
- tweepy - For Twitter API
- requests - For HTTP requests
- json - For parsing JSON objects in Python
Using Python and its libraries, I gathered data from three different sources, assessed it to identify quality and tidiness issues, and ultimately took control of the wild data through cleaning. Other tasks included storing the cleaned data, completing the analysis, presenting at least one visualization of the insights, and writing a report.
The steps include:
- Data Gathering - Done programmatically for reproducibility
  - Files on hand, using pandas
  - Web scraping, using Requests to get an HTTP response
  - Twitter API, using tweepy
- Data Assessment - Done visually and programmatically to check for quality and tidiness issues
- Data Cleaning - Code and Test
- Data Analysis and Visualization
- Conclusions
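
As a rough illustration of the gathering step, here is a minimal sketch of parsing tweet data with `json`. It assumes the tweets returned by the API were saved as one JSON object per line; the sample lines, the `parse_tweet_lines` helper, and the choice of fields are hypothetical and not taken from the project notebook.

```python
import json

# Hypothetical sample: two tweets saved as one JSON object per line.
raw_lines = [
    '{"id": 1001, "retweet_count": 12, "favorite_count": 40}',
    '{"id": 1002, "retweet_count": 3, "favorite_count": 9}',
]


def parse_tweet_lines(lines):
    """Parse JSON lines into dicts that keep only the fields of interest."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            # Fall back to 0 if a count is missing from the object.
            "retweet_count": tweet.get("retweet_count", 0),
            "favorite_count": tweet.get("favorite_count", 0),
        })
    return records


records = parse_tweet_lines(raw_lines)
print(records)
```

A list of flat dicts like this can be passed straight to `pandas.DataFrame(records)` for assessment and cleaning.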
At this stage, the datasets were merged into a master file and stored programmatically. The data was then good enough for a mini analysis, in which I was able to identify:
- the most prominent dog breeds
- the most prominent dog stage
- the most prominent source
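
The merge and the three findings can be sketched roughly as follows. This is a small illustration under stated assumptions, not the notebook's actual code: the two tables, their column names (`tweet_id`, `dog_stage`, `source`, `breed`), and all values are made up.

```python
import pandas as pd

# Hypothetical cleaned tables; column names and values are for illustration only.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "dog_stage": ["pupper", "doggo", "pupper", None],
    "source": ["iPhone", "iPhone", "Web", "iPhone"],
})
predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "breed": ["golden_retriever", "pug", "golden_retriever", "chihuahua"],
})

# Merge on the shared key to build a single master table.
master = archive.merge(predictions, on="tweet_id", how="inner")

# Serialize the master table to CSV text; pass a file path to to_csv to store it on disk.
csv_text = master.to_csv(index=False)

# Most frequent value per column (value_counts ignores missing values by default).
top = {
    col: master[col].value_counts().idxmax()
    for col in ["breed", "dog_stage", "source"]
}
print(top)
```

Using `value_counts()` keeps the full frequency table available, so the same call can feed a bar chart in Matplotlib if a visualization is needed.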
I was able to complete this project and go through the three stages of data wrangling. I'm now very comfortable integrating information from multiple data sources, checking for structural and content issues, and treating these issues, all programmatically. With the help of some Python libraries, I was able to meet the requirements for the project. The activities in each stage are summarized in the paragraphs above.