ICS054 Unit 1
Definition: Data analytics is the process of examining, cleaning, transforming, and modeling data to
discover useful insights, inform conclusions, and support decision-making.
Purpose: It helps organizations to make data-driven decisions, identify patterns, and forecast trends.
Data analytics is a rapidly growing field that empowers businesses, governments, and organizations to
understand and make informed decisions based on vast amounts of data. With the evolution of AI,
machine learning, and cloud technologies, the future of data analytics is poised for even more
advancements, making it a crucial tool for success across industries.
A. INTERNAL SOURCES
Data generated from within an organization during its routine operations.
● Transactional Databases: Records of business transactions (e.g., sales, purchases,
payments).
o Example: A Sales table in a company's SQL database.
● Customer Relationship Management (CRM) Systems: Contain customer interaction data,
support tickets, and lead information.
o Example: Salesforce or HubSpot data.
● Enterprise Resource Planning (ERP) Systems: Integrated management of core business
processes (e.g., finance, supply chain, manufacturing).
● Application Logs: Records of events generated by software applications (e.g., web server
logs, application error logs).
● Company Documents: Internal reports, emails, and presentations.
B. EXTERNAL SOURCES
Data obtained from outside the organization.
● Public Government Datasets: Open data provided by governments (e.g., census data,
health statistics, economic indicators).
o Example: data.gov, data.gov.uk, Eurostat.
● Commercial Data Providers: Companies that collect and sell specialized data.
o Example: Nielsen for market research, Thomson Reuters for financial data.
● APIs (Application Programming Interfaces): Structured channels to access data from other
web services and platforms.
o Example: Using the Twitter API to get tweets, or the Google Maps API to get location
data (see the first sketch after this list).
● Web Scraping/Crawling: The process of using code to automatically extract data from
websites.
o Example: Scraping product prices and reviews from e-commerce sites for competitive
analysis (see the second sketch after this list).
● Social Media: A vast source of unstructured public opinion, trends, and user-generated
content.
● Third-Party Surveys: Data collected by research firms.
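To make the API route concrete, here is a minimal Python sketch using the requests library (the first example referenced above). The endpoint, parameters, and response fields are invented for illustration; real services such as the Twitter or Google Maps APIs require authentication and define their own schemas.

```python
import requests

# Hypothetical REST endpoint -- substitute a real provider's URL and API key.
URL = "https://api.example.com/v1/economic-indicators"
params = {"country": "US", "indicator": "gdp", "format": "json"}

response = requests.get(URL, params=params, timeout=10)
response.raise_for_status()        # stop early on HTTP errors
payload = response.json()          # most APIs return JSON (semi-structured)

# "data" is an assumed field in this invented response.
for record in payload.get("data", []):
    print(record)
```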
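And a minimal web-scraping sketch with requests and BeautifulSoup (the second example referenced above). The URL and CSS class names are placeholders; real pages differ, and a site's terms of service should be checked before scraping.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder e-commerce page -- the URL and class names are hypothetical.
html = requests.get("https://shop.example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Walk the parsed HTML tree and pull out each product's name and price.
for product in soup.select(".product"):
    name = product.select_one(".name").get_text(strip=True)
    price = product.select_one(".price").get_text(strip=True)
    print(name, price)
```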
C. OPEN DATA
Open data is accessible to everyone and free to use. However, if the data is high-level, or heavily
summarized and aggregated, it may not be very relevant to your problem. It may also not be in the
format you need, or it may be difficult to interpret. All of these challenges can require significant
time to make the data usable.
● Government data
● Health and scientific data: World Health Organization (WHO), Nature.com scientific data, etc.
● Social media: Google Trends (e.g., national trends in search terms), Twitter (allows you to
search by tags and users; results can be downloaded via the Twitter APIs).
SUMMARY

| SOURCE | STRUCTURED | SEMI-STRUCTURED | UNSTRUCTURED |
| --- | --- | --- | --- |
| EXTERNAL | Public economic data in CSV | Data from a Twitter API call | News articles, YouTube videos |
CLASSIFICATION OF DATA
1- STRUCTURED DATA
Structured data is highly organized and fits neatly into traditional data models such as tables, rows, and
columns. It is often stored in relational databases or spreadsheets and is easy to enter, store, query,
and analyze.
CHARACTERISTICS
Fixed Format: Data follows a predefined schema (e.g., tables with rows and columns).
Easily Searchable: Structured data can be easily accessed using standard database tools (e.g.,
SQL).
High Consistency: Data types are consistent (e.g., dates, numbers, text).
EXAMPLES
Databases: SQL databases like MySQL, PostgreSQL, and Oracle.
Spreadsheets: Excel files or CSV files.
CRM Systems: Customer databases containing structured fields (e.g., name, age, email).
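Because the schema is fixed, structured data can be loaded and queried directly with standard tools. A small pandas sketch; the file name and columns are invented for illustration:

```python
import pandas as pd

# Hypothetical sales extract with a fixed schema: date, product, amount.
sales = pd.read_csv("sales.csv", parse_dates=["date"])

print(sales.dtypes)                              # consistent, predefined types
print(sales[sales["amount"] > 1000])             # filter, like SQL's WHERE
print(sales.groupby("product")["amount"].sum())  # aggregate, like GROUP BY
```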
2. SEMI-STRUCTURED DATA
Semi-structured data does not conform to a rigid schema like structured data, but it contains some
organizational properties that allow for easier analysis compared to unstructured data. It often
includes tags or markers that separate elements.
CHARACTERISTICS
Flexible Format: Data has a loose structure (e.g., key-value pairs, tags, or metadata).
Self-Describing: Data contains metadata that describes the data itself, making it easier to
interpret.
Can Be Nested: Data may have nested elements (e.g., JSON or XML).
EXAMPLES
JSON (JavaScript Object Notation): A flexible format often used in web APIs.
XML (eXtensible Markup Language): Used for storing data in a hierarchical format.
NoSQL Databases: Document-based stores (e.g., MongoDB) or key-value stores (e.g., Redis).
Email Metadata: Subject, sender, timestamp, etc., that are structured but have unstructured
content.
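A short sketch of working with self-describing, nested semi-structured data in Python; the JSON record below is invented:

```python
import json

# Invented JSON record: keys describe the data, and elements nest freely.
raw = """
{
  "user": {"name": "Ada", "email": "ada@example.com"},
  "orders": [
    {"id": 1, "total": 99.5},
    {"id": 2, "total": 45.0}
  ]
}
"""

doc = json.loads(raw)                           # text -> nested dicts/lists
print(doc["user"]["name"])                      # navigate nested elements
print(sum(o["total"] for o in doc["orders"]))   # aggregate a nested list
```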
3. UNSTRUCTURED DATA
Unstructured data is raw data that doesn't have a predefined format or structure, making it difficult to
store, process, or analyze using traditional tools and methods.
CHARACTERISTICS
Lack of Organization: No predefined structure; it can be varied and messy.
Diverse Formats: It could be text, images, audio, video, or even social media posts.
Hard to Analyze Directly: Requires advanced techniques such as Natural Language Processing (NLP),
machine learning, or image recognition to extract insights.
EXAMPLES
Text Documents: Word files, PDFs, email bodies.
Multimedia: Images, videos, audio recordings.
Social Media Posts: Tweets, Instagram posts, YouTube videos, comments.
Web Content: Blogs, websites, news articles.
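Unstructured text offers no schema to query, so even a basic question like "which words come up most?" requires processing first. A minimal sketch using only the standard library, as a crude first step before real NLP:

```python
import re
from collections import Counter

# Free-form text: no rows, columns, or tags to query directly.
text = """Customers love the new app, but several reviews mention
slow loading times and login problems on older phones."""

# Tokenize into lowercase words, then count term frequencies.
words = re.findall(r"[a-z']+", text.lower())
print(Counter(words).most_common(5))
```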
| ASPECT | STRUCTURED DATA | SEMI-STRUCTURED DATA | UNSTRUCTURED DATA |
| --- | --- | --- | --- |
| Organization | Fixed schema (tables, rows, columns) | Flexible schema (e.g., JSON, XML) | No fixed schema or organization |
| Storage | Relational databases (SQL) | NoSQL databases, files | Data lakes, object storage (e.g., Hadoop) |
| Processing | SQL queries, BI tools | Parsing, transformation, NoSQL queries | NLP, machine learning, image processing |
| Tools | SQL, Excel, BI tools (Tableau, Power BI) | NoSQL (MongoDB, Cassandra), Hadoop | Hadoop, Spark, TensorFlow, NLTK, OpenCV |
| Examples | Customer databases, financial records | JSON, XML, NoSQL databases | Social media posts, images, videos, emails |
| Advantages | Easy to manage and analyze | Flexible, scalable, evolving schema | Rich insights from diverse data types |
| Challenges | Limited to predefined formats | Requires preprocessing | Difficult to analyze without advanced methods |
CHARACTERISTICS OF DATA
BIG DATA
Big Data refers to extremely large and complex datasets that exceed the processing capabilities of
traditional data management and analysis tools. These datasets come in various forms (structured,
semi-structured, unstructured) and are generated rapidly from sources like sensors, social media,
transactions, and devices.
CHARACTERISTICS
VOLUME: The sheer amount of data, often measured in terabytes or petabytes, generated
every second.
VELOCITY: The speed at which new data is produced and needs to be managed or analyzed.
VARIETY: Multiple data formats including text, images, audio, video, logs, and sensor data.
VERACITY: The accuracy, reliability, and trustworthiness of data sources.
VALUE: The potential to extract useful insights or benefits from data analysis.
ANALYTICS CAPABILITIES OF BIG DATA PLATFORMS
Platforms like Apache Hadoop and Apache Spark enable parallel computing across clusters,
allowing complex analytical tasks such as batch processing, real-time stream processing, and
machine learning on huge datasets.
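For a flavour of that parallelism, a minimal PySpark sketch; it assumes a local Spark installation, and the file path and column name are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Spark splits the input into partitions and aggregates them in parallel
# across cores or cluster nodes; the same code scales to huge datasets.
events = spark.read.json("events.json")          # illustrative path
counts = events.groupBy("event_type").agg(F.count("*").alias("n"))
counts.show()

spark.stop()
```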
REAL-TIME ANALYTICS
Many platforms support the ingestion and analysis of streaming data for immediate
insights, useful for applications like fraud detection, recommendation engines, and
operational monitoring.
MACHINE LEARNING INTEGRATION
Built-in machine learning libraries and support for frameworks let users build, train, and
deploy predictive models directly within the platform, accelerating AI and advanced
analytics workflows.
SQL AND QUERYING ENGINES
Big Data platforms provide fast SQL-based querying support (e.g., Google BigQuery, Amazon
Redshift) on structured and semi-structured data, making analytics accessible to data
analysts and business users familiar with relational querying.
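As one Python-accessible example of this idea, Spark also exposes a SQL interface over files; a sketch with an invented file, table, and columns (BigQuery and Redshift have their own client libraries, not shown here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Register a semi-structured file as a SQL-queryable view (names invented).
spark.read.json("orders.json").createOrReplaceTempView("orders")

# Analysts can now use familiar SQL over large, distributed data.
spark.sql("""
    SELECT customer_id, SUM(total) AS revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10
""").show()

spark.stop()
```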
DATA VISUALIZATION AND REPORTING
Integrated tools or third-party integrations like Power BI and Tableau enable visualization of
big data analytics results, supporting data-driven decision-making via interactive dashboards
and reports.
SCALABILITY AND FLEXIBILITY
Cloud-based Big Data platforms (AWS, Azure, Google Cloud) provide elastic scalability to
handle varying workloads and data volumes, supporting diverse analytics needs without
infrastructure constraints.
DATA INTEGRATION
These platforms consolidate data from disparate sources, enabling comprehensive analytics
that span operational systems, customer data, IoT, social media, and more, providing 360-degree
insights.
EXAMPLES OF BIG DATA PLATFORMS
Apache Hadoop & Spark: Open-source platforms for distributed processing and analytics.
Google BigQuery: Serverless data warehouse with fast SQL analytics and built-in ML.
Amazon EMR & Redshift: Cloud-native analytics and data warehousing solutions.
Microsoft Azure HDInsight & Synapse Analytics: Managed cloud services with strong ML and BI
integration.
Databricks: Unified platform for large-scale data engineering, analytics, and machine learning.
ANALYTIC PROCESS
The analytic process in data analytics is a structured methodology that transforms raw data into
meaningful insights to aid decision-making. It typically follows these key steps:
1. DEFINE THE PROBLEM
Clearly state the business question or objective that the analysis must answer. This frames
which data to collect and how success will be judged.
2. COLLECT DATA
Gather relevant data from diverse sources such as databases, APIs, surveys, web logs, and
IoT devices. The quality and relevance of collected data are vital for accurate analysis.
3. DATA CLEANING AND PREPARATION
Process the raw data to correct errors, handle missing values, remove duplicates, and
transform data into suitable formats. This step ensures the reliability of subsequent
analyses.
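A typical cleaning pass in pandas might look like the sketch below; the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("raw_survey.csv")                 # hypothetical raw extract

df = df.drop_duplicates()                          # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # impute missing values
df = df.dropna(subset=["customer_id"])             # drop rows missing a key
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["country"] = df["country"].str.strip().str.title()  # fix inconsistencies

df.to_csv("clean_survey.csv", index=False)         # ready for analysis
```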
4. DATA ANALYSIS
Apply statistical methods, machine learning algorithms, or other analytical techniques to
explore, model, and identify patterns or relationships in the data. This can include
descriptive, diagnostic, predictive, and prescriptive analytics.
5. INTERPRETATION AND REPORTING
Interpret the results within the context of the original problem, translating data patterns
into actionable insights. Present findings using visualizations, dashboards, or reports
tailored for stakeholders to support decision-making.
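Reporting often ends in a chart; a minimal matplotlib sketch, with invented quarterly figures:

```python
import matplotlib.pyplot as plt

# Invented quarterly results, purely for illustration.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 172]     # in $k

plt.bar(quarters, revenue)
plt.title("Revenue by Quarter")
plt.ylabel("Revenue ($k)")
plt.tight_layout()
plt.savefig("revenue_report.png")  # embed in a report or dashboard
```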
BENEFITS OF DATA ANALYTICS
INFORMED DECISION-MAKING
Data analytics empowers organizations to make decisions based on facts and trends rather
than intuition, reducing guesswork and increasing accuracy in business strategies and
operations.
IMPROVED CUSTOMER UNDERSTANDING
It helps businesses deeply understand customer behavior, preferences, and needs, enabling
personalized marketing and improved customer experiences, which enhance loyalty and
satisfaction.
OPERATIONAL EFFICIENCY
Analytics identifies inefficiencies, bottlenecks, and process optimizations to reduce costs
and improve productivity, resulting in streamlined business operations.
RISK MANAGEMENT
By analyzing historical data and identifying warning signs, data analytics proactively helps
manage risks related to markets, fraud, security, and compliance.
COMPETITIVE ADVANTAGE
Analytics enables companies to analyze competitors' strategies and market trends, allowing
them to innovate and position themselves better in the market.
FASTER INNOVATION AND PRODUCT DEVELOPMENT
Data-driven insights inform product development, helping businesses design products that
meet customer demands and respond quickly to market changes.
COST REDUCTION AND REVENUE GROWTH
Companies using data analytics often experience significant cost savings through optimized
resource allocation and increased revenues through targeted marketing and operational
improvements.
REAL-TIME INSIGHTS
Real-time data analytics provides up-to-date information that helps companies adapt
quickly to changes, seize opportunities, and mitigate issues immediately.
TYPES OF DATA ANALYTICS
1. DESCRIPTIVE ANALYTICS - "WHAT HAPPENED?"
● Purpose: To summarize and describe past data to understand what has occurred. It is the most
common and foundational type of analytics, providing a starting point for all further
investigation.
● Methods: Data aggregation, data mining, and summary statistics (e.g., mean, median, mode).
● Tools: Dashboards, reports, and Business Intelligence (BI) tools.
○ "Sales increased by 15% last quarter."
○ "The website had 100,000 visitors last month."
○ "The customer churn rate was 5%."
2. DIAGNOSTIC ANALYTICS - "WHY DID IT HAPPEN?"
● Purpose: To examine past data to understand the causes of outcomes, typically through
drill-down, correlation analysis, and comparison across segments.
3. PREDICTIVE ANALYTICS - "WHAT WILL HAPPEN?"
● Purpose: To use historical data to forecast future outcomes and trends. It uses statistical
models and machine learning to identify the likelihood of future results. It is about managing
uncertainty.
● Methods: Statistical modeling, machine learning, forecasting, and pattern matching.
● Tools: Python (scikit-learn, TensorFlow), R, SAS, and other ML platforms.
○ "Based on historical trends & current market data, we predict a 20% increase in
customer churn next month."
○ "This customer has a 92% probability of clicking on the promotional email."
○ "We forecast that demand for this product will peak in the first week of December."
4. PRESCRIPTIVE ANALYTICS - "WHAT SHOULD WE DO?"
● Purpose: To recommend actions you can take to affect desired outcomes or avoid future
problems. It is the most advanced type, often incorporating predictive models and simulating
the consequences of various choices to determine the best course of action.
● Methods: Optimization, simulation, recommendation engines, and complex machine learning
algorithms.
● Tools: Advanced ML, AI, and complex computational modeling software.
○ "To reduce the predicted customer churn, the model recommends offering a 15%
discount and a free service upgrade to these specific high-risk customers."
○ "To optimize delivery routes and minimize fuel costs, the system prescribes the
following schedule for drivers."
○ Netflix recommends a specific movie to you (prescription) based on what it predicts you
will enjoy.
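Prescriptive analytics often reduces to an optimization problem. A sketch with SciPy's linear-programming solver, using invented costs and constraints:

```python
from scipy.optimize import linprog

# Invented problem: pick quantities x1, x2 of two products to minimize
# cost 2*x1 + 3*x2, subject to x1 + x2 >= 10 (demand) and x1 <= 8
# (capacity). linprog minimizes c @ x subject to A_ub @ x <= b_ub.
c = [2, 3]
A_ub = [[-1, -1],   # -(x1 + x2) <= -10  <=>  x1 + x2 >= 10
        [1, 0]]     # x1 <= 8
b_ub = [-10, 8]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)        # prescribed action: optimal quantities, here [8, 2]
```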
DATA ANALYTICS LIFECYCLE
PHASE 1: DISCOVERY
Understand the business problem and define what the project must achieve.
● Key Activities:
○ Identify the Business Problem: What are we trying to solve or achieve?
○ Define Objectives: How will we measure success?
○ Identify Key Stakeholders: Who is the project for? Who will use the results?
● Key Outputs:
○ Clear project charter with defined objectives and success metrics.
○ A list of data sources needed and timeline.
PHASE 2: DATA PREPARATION
Acquire, clean, and transform the data into a consolidated, analysis-ready form.
● Key Activities:
○ Data Acquisition: Gathering data from identified sources (SQL databases, APIs, CRM, flat
files, IoT sensors, etc.).
○ Data Cleaning (Data Cleansing):
■ Handling missing values (impute or remove).
■ Correcting errors and inconsistencies.
○ Data Transformation:
■ Normalizing/scaling values.
■ Joining tables from different sources.
■ Formatting data types (e.g., converting strings to dates).
● Key Outputs:
○ A clean, consolidated, and well-structured dataset ready for analysis.
PHASE 3: DATA EXPLORATION & MODEL PLANNING (EDA)
Deeply understand data patterns and relationships to inform choice of analytical techniques.
● Key Activities:
○ Exploratory Data Analysis (EDA):
■ Calculating descriptive statistics (mean, median, standard deviation).
■ Creating visualizations: histograms, box plots, scatter plots, correlation matrices.
○ Identify Patterns & Anomalies: Find trends, outliers, and relationships between
variables.
○ Choose Modeling Techniques: Based on EDA, select appropriate algorithms (e.g., linear
regression, classification, clustering).
● Key Outputs:
○ Insights about data structure and relationships.
○ A plan for which models or analytical approaches to use.
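A compact EDA pass in pandas and matplotlib; the file and column names are hypothetical, and the numeric_only flag assumes pandas 1.5+:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clean_survey.csv")   # hypothetical cleaned dataset

print(df.describe())                   # mean, std, quartiles per column
print(df.corr(numeric_only=True))      # correlation matrix

df.hist(figsize=(8, 6))                # distribution of each variable
plt.tight_layout()
plt.savefig("eda_histograms.png")

df.boxplot(column="age")               # spot outliers in a numeric column
plt.savefig("age_boxplot.png")
```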
PHASE 4: MODEL BUILDING
Build, train, tune, and test the models planned during exploration.
● Key Activities:
○ Data Splitting: Dividing the dataset into training, validation, and testing sets.
○ Model Development: Building the selected models (e.g., using Python's scikit-learn, R,
TensorFlow).
○ Model Training: Feeding the training data to the algorithm to let it learn patterns.
○ Model Validation & Tuning: Using the validation set to test performance and adjust
hyperparameters to avoid overfitting.
○ Model Testing: Evaluating the final model's performance on the unseen test set to get
an unbiased estimate of its accuracy.
● Key Outputs:
○ A trained, validated, and tested analytical model.
○ Model performance metrics (e.g., Accuracy, Precision, Recall, RMSE, R²).
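A sketch of the split/train/test flow with scikit-learn on synthetic data; a real project would also carve out a validation set (or use cross-validation) for the hyperparameter tuning described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # synthetic data

# Hold out a test set the model never sees during development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)            # training

pred = model.predict(X_test)           # unbiased final evaluation
print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))
```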
Keeps the "How" on track. Manages All Phases. Their role is continuous,
timelines, resources, budget, and ensuring the project moves
Project Manager /
scope. Facilitates communication efficiently from discovery through to
Scrum Master between technical and business teams operationalization without going
and mitigates risks. over budget or missing deadlines.
The Storyteller & Explorer. Bridges the Phase 1 (Discovery): Helps clarify
gap between business and technical data requirements. Phase 2
teams. Uses SQL, statistics, and (Preparation): Often heavily involved
Data Analyst
visualization to find patterns, generate in cleaning. Phase 3 (Exploration):
reports, and translate data into Core role. Phase 5 (Communicate):
actionable business insights. Creates dashboards and reports.