Unit 1 - Basics
What is Data?
 Data is defined as raw, unorganized facts that must be processed to become meaningful. Data typically
 consists of facts, numbers, symbols, images, observations, perceptions, characters, etc.
 On its own, data carries little meaning; it acquires meaning only when a machine or a human interprets it.
 Data comprises statements, characters and numbers in raw form. Examples of data: the number of visitors
 to a website by country, or the history of temperature readings around the globe for the past 100 years.
 What is Information?
 Information is defined as a set of data that has been processed in a meaningful way according to a given
 requirement. To be useful and meaningful, it must be processed, structured and presented in a given
 context.
 Information is derived from data and possesses context, purpose and relevance. Producing it involves
 manipulating the raw data.
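 A minimal sketch of the step from data to information, assuming a handful of made-up temperature
 readings (the city names, values, and the names `readings` and `information` are illustrative only):

```python
from statistics import mean

# Raw data: (city, temperature in degrees Celsius) pairs with no context of their own.
readings = [("Pune", 31.0), ("Pune", 29.0), ("Delhi", 40.0), ("Delhi", 38.0)]

# Processing: group the readings by city, then average each group.
by_city = {}
for city, temp in readings:
    by_city.setdefault(city, []).append(temp)

# Information: an average per city, which has context and can support a decision.
information = {city: mean(temps) for city, temps in by_city.items()}
print(information)
```

 The list of pairs is the data; the per-city average is the information derived from it.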
 Optimization
    o   The query optimizer (also known simply as the optimizer) is the database software component that
        identifies the most efficient way (for example, the one taking the least time) for a SQL statement
        to access data.
    o   Database optimization involves maximizing the speed and efficiency with which data is retrieved.
    o   The process of selecting an efficient execution plan for processing a query is known as query
        optimization.
    o   Query optimization is used to access and modify the database in the most efficient way possible.
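 As a rough illustration of an optimizer choosing an execution plan, the sketch below uses SQLite (through
 Python's built-in sqlite3 module) as a stand-in DBMS. The `students` table, its rows, the index name
 `idx_students_class`, and the helper `plan` are assumptions made for this example. It asks SQLite for its
 plan before and after an index is created on the filtered column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (roll_no INTEGER PRIMARY KEY, name TEXT, class TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Ram", "FY"), (2, "Jai", "TY"), (3, "Om", "SY"), (4, "Sai", "FY")],
)

query = "SELECT name FROM students WHERE class = 'FY'"

def plan(sql):
    # EXPLAIN QUERY PLAN returns one row per plan step; the last column describes the step.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)   # no index yet: the whole table has to be scanned
conn.execute("CREATE INDEX idx_students_class ON students(class)")
plan_after = plan(query)    # the optimizer can now search via the new index

print("before:", plan_before)
print("after: ", plan_after)
```

 The exact wording of the plan text differs between SQLite versions, so only the presence of the index name
 in the second plan should be relied on.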
 Data Preprocessing
 Data preprocessing is a process of preparing the raw data and making it suitable for a machine learning
 model. It is the first and crucial step while creating a machine learning model.
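 A minimal sketch of two common preprocessing steps, assuming a made-up list of ages with missing entries
 (the data and the choice of mean imputation followed by min-max scaling are illustrative, not prescribed
 by these notes):

```python
# Hypothetical raw values; None marks a missing entry.
raw_ages = [18, None, 22, 20, None]

# Step 1: fill missing values with the mean of the known values.
known = [a for a in raw_ages if a is not None]
mean_age = sum(known) / len(known)
filled = [a if a is not None else mean_age for a in raw_ages]

# Step 2: min-max scale every value into the range [0, 1].
lo, hi = min(filled), max(filled)
scaled = [(a - lo) / (hi - lo) for a in filled]
print(scaled)
```

 After these steps the column has no gaps and a fixed numeric range, which many learning algorithms expect.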
Query processing refers to the range of activities involved in extracting data from a database: the query is
processed and a result is generated.
        Query (SQL)  -->  Query Processing  -->  Result
 Before processing a query (here, an SQL query), the system must translate it into a usable internal form
 (a representation that the system can understand).
What is a Database?
A database is a collection of inter-related data organized so that the data can be retrieved, inserted and
deleted efficiently. The data is organized in the form of tables, schemas, views, reports, etc.
For example: a college database organizes data about the admin, staff, students, faculty, etc.
Using the database, you can easily retrieve, insert, and delete information.
Database Management System (DBMS)
   o   A database management system is software used to manage a database. For example, MySQL and Oracle
       are widely used DBMSs found in many different applications.
   o   In the early 1960s, Charles Bachman designed the Integrated Data Store (IDS), generally regarded as one
       of the first DBMSs.
   o   The term database management system combines two parts:
       Database + Management System = DBMS
   o   DBMS provides an interface to perform various operations like database creation, storing data in it, updating
       data, creating a table in the database and a lot more.
   o   Database management system is a collection of programs that enables users to create and maintain the
       database.
           Application  -->  DBMS  -->  Operating system (OS)  -->  Database
   o   A database management system (DBMS) can also be defined as an interface between an application and the
       operating system for accessing the database.
   o   It provides protection and security to the database.
Types of DBMS
   o   Relational DBMS
   o   Non - Relational DBMS
Relational DBMS :- In this DBMS, data is stored in table format (rows and columns).
        Roll No          Name           Class
            1            Ram              FY
            2             Jai             TY
            3            Om               SY
            4             Sai             FY
For example: MySQL, Oracle
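A minimal sketch of the table above in a relational DBMS, using SQLite through Python's built-in sqlite3
module as a stand-in for MySQL or Oracle. The table name `students` and the column names `roll_no`, `name`
and `class` are assumptions chosen to mirror the headings above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column of the table becomes a typed column in the schema.
conn.execute("CREATE TABLE students (roll_no INTEGER PRIMARY KEY, name TEXT, class TEXT)")

# Each row of the table becomes one record.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Ram", "FY"), (2, "Jai", "TY"), (3, "Om", "SY"), (4, "Sai", "FY")],
)

# Retrieve only the first-year (FY) students.
fy_students = conn.execute(
    "SELECT roll_no, name FROM students WHERE class = ? ORDER BY roll_no", ("FY",)
).fetchall()
print(fy_students)
```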
Non - Relational DBMS :- In this DBMS, data is stored as key-value pairs (documents) rather than in tables.
{
        Roll No: 1,
        Name: 'Om',
        Class: 'FY'
}
For example: MongoDB
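A minimal sketch of the key-value idea, using a plain Python dictionary as a stand-in for a document store.
This is not the MongoDB API; the names `collection` and `student` and the second document are hypothetical.

```python
import json

# A tiny in-memory "collection": each key maps to one document.
collection = {}

# The document shown above, stored under its roll number.
student = {"roll_no": 1, "name": "Om", "class": "FY"}
collection[student["roll_no"]] = student

# Unlike rows in a table, documents need not share the same fields.
collection[2] = {"roll_no": 2, "name": "Jai", "hobbies": ["chess"]}

# Look a document up by key and show it as JSON text.
found = collection[1]
print(json.dumps(found))
```

The second document has no "class" field at all, which a fixed table schema would not allow without a NULL
column.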
Characteristics of DBMS
    o   It uses a digital repository established on a server to store and manage the information.
    o   It can provide a clear and logical view of the process that manipulates data.
    o   DBMS contains automatic backup and recovery procedures.
    o   It can simplify complex relationships between data.
    o   It is used to support manipulation and processing of data.
    o   It is used to provide security of data.
    o   It lets users view the database from different viewpoints according to their requirements.
Applications of DBMS
    o   Banking - For maintaining customer information, accounts, loans and banking transactions.
    o   Universities - For maintaining student information, records, course registrations and grades.
    o   Railway Reservation - For checking the availability of reservations on different trains and managing tickets.
    o   Airlines - For reservation and schedule information.
    o   Telecommunication - For keeping records of calls made, generating monthly bills, etc.
    o   Finance - For storing information about holdings, sales and purchases of financial instruments.
    o   Sales – For customer, product and purchase information.
Advantages of DBMS
   o   Controls database redundancy: It can control data redundancy because all the data is stored in
       one single database instead of being duplicated across separate files.
   o   Data sharing: In a DBMS, the authorized users of an organization can share data among themselves.
   o   Easy maintenance: It is easy to maintain due to the centralized nature of the database system.
   o   Reduced time: It reduces development time and maintenance needs.
   o   Backup: It provides backup and recovery subsystems which automatically back up data to guard
       against hardware and software failures and restore the data if required.
   o   Multiple user interfaces: It provides different types of user interfaces, such as graphical user
       interfaces and application program interfaces.
Disadvantages of DBMS
   o   Cost of Hardware and Software: It requires a high-speed processor and a large memory
       size to run DBMS software.
   o   Size: It occupies a large amount of disk space and memory to run efficiently.
   o   Complexity: A database system creates additional complexity and requirements.
   o   Higher impact of failure: A failure has a high impact because, in most organizations, all the data
       is stored in a single database; if that database is damaged due to a power failure or database
       corruption, the data may be lost forever.
There are four types of database languages:
       1.   Data Definition Language (DDL)
       2.   Data Manipulation Language (DML)
       3.   Data Control Language (DCL)
       4.   Transaction Control Language (TCL)
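To make these categories concrete, here is a minimal sketch that runs statements from three of them (DDL,
DML and TCL) against SQLite via Python's built-in sqlite3 module. SQLite is only a stand-in DBMS here and
has no GRANT or REVOKE, so DCL is left out. The table `students` and the savepoint name `before_delete` are
assumptions made for the example.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / COMMIT / ROLLBACK statements below are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)

# DDL: define the structure of the data.
conn.execute("CREATE TABLE students (roll_no INTEGER PRIMARY KEY, name TEXT, class TEXT)")

# DML inside a transaction, made permanent with COMMIT (TCL).
conn.execute("BEGIN")
conn.execute("INSERT INTO students VALUES (1, 'Ram', 'FY')")
conn.execute("INSERT INTO students VALUES (2, 'Jai', 'TY')")
conn.execute("COMMIT")

# TCL: a savepoint allows part of a transaction to be undone.
conn.execute("BEGIN")
conn.execute("UPDATE students SET class = 'SY' WHERE roll_no = 1")
conn.execute("SAVEPOINT before_delete")
conn.execute("DELETE FROM students")          # no WHERE clause: removes every row
conn.execute("ROLLBACK TO before_delete")     # undoes only the DELETE
conn.execute("COMMIT")                        # keeps the UPDATE

# DML: retrieve the final state.
rows = conn.execute("SELECT roll_no, name, class FROM students ORDER BY roll_no").fetchall()
print(rows)
```

Both rows survive because the DELETE was rolled back to the savepoint, while the UPDATE before the
savepoint was committed.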
DDL is short for Data Definition Language, which deals with database schemas and
descriptions of how the data should reside in the database.
   o   CREATE: creates a database and its objects (tables, indexes, views, stored
       procedures, functions, and triggers)
   o   ALTER: alters the structure of an existing database object
   o   DROP: deletes objects from the database
   o   TRUNCATE: removes all records from a table and releases the space allocated
       for those records
   o   COMMENT: adds comments to the data dictionary
   o   RENAME: renames an object
DML is short for Data Manipulation Language, which deals with data manipulation
and includes the most common SQL statements such as SELECT, INSERT, UPDATE and DELETE.
It is used to store, modify, retrieve, delete and update data in a database.
   o   SELECT: retrieves data from a database
   o   INSERT: inserts data into a table
   o   UPDATE: updates existing data within a table
   o   DELETE: deletes records from a table (all records if no condition is given)
   o   MERGE: UPSERT operation (insert or update)
   o   CALL: calls a PL/SQL or Java subprogram
   o   EXPLAIN PLAN: shows the data access path chosen for a statement
   o   LOCK TABLE: concurrency control
DCL is short for Data Control Language, which controls access to the database
(basically, it grants and revokes user permissions in the database).
   o   GRANT: grants a user permission to run DML commands (SELECT,
       INSERT, DELETE, …) on a table
   o   REVOKE: revokes a user's permission to run DML commands (SELECT,
       INSERT, DELETE, …) on the specified table
TCL is short for Transaction Control Language, which manages the transactions
performed on the database. Some of the TCL commands are:
   o   ROLLBACK: cancels or undoes the changes made in the current transaction
   o   COMMIT: applies (permanently saves) the changes made in the current transaction
   o   SAVEPOINT: marks a point within a transaction to which later changes can be rolled back
Database Management System: Software used to manage
databases is called a Database Management System (DBMS). For example, MySQL, Oracle,
etc. are popular DBMSs used in different applications.
A DBMS allows users to perform the following tasks:
   o   Data Definition: It helps in the creation, modification, and removal of the definitions
       that describe the organization of data in the database.
   o   Data Updation: It helps in the insertion, modification, and deletion of the actual
       data in the database.
   o   Data Retrieval: It helps in the retrieval of data from the database, which can then be
       used by applications for various purposes.
   o   User Administration: It helps in registering and monitoring users, enforcing data
       security, monitoring performance, maintaining data integrity, dealing with concurrency
       control, and recovering information corrupted by unexpected failure.
What is Data Quality?
Data quality is defined as:
the degree to which data meets a company’s expectations of accuracy, validity,
completeness, and consistency.
By tracking data quality, a business can pinpoint potential issues harming quality,
and ensure that shared data is fit to be used for a given purpose.
When collected data fails to meet the company's expectations of accuracy,
validity, completeness, and consistency, it can have massive negative impacts on
customer service, employee productivity, and key strategies.
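As a rough illustration of tracking data quality, the sketch below measures two of the expectations named
above, completeness and validity, on a few made-up records. The records, the field names, the helper
functions `completeness` and `validity`, and the 0 to 120 age rule are all assumptions for the example.

```python
# Hypothetical customer records: one has a missing email, one has an impossible age.
records = [
    {"id": 1, "email": "ram@example.com", "age": 19},
    {"id": 2, "email": None, "age": 21},
    {"id": 3, "email": "om@example.com", "age": -4},
]

def completeness(rows, field):
    """Fraction of rows in which the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def validity(rows, field, is_valid):
    """Fraction of the non-missing values of a field that pass a validation rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in values) / len(values)

email_completeness = completeness(records, "email")
age_validity = validity(records, "age", lambda a: 0 <= a <= 120)

print(f"email completeness: {email_completeness:.2f}")
print(f"age validity:       {age_validity:.2f}")
```

Scores like these can be tracked over time; a drop points to a potential issue in how the data is collected
or shared.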
Why Is Data Quality Important?
Quality data is key to making accurate, informed decisions. And while all data
has some level of “quality,” a variety of characteristics and factors determines the
degree of data quality (high-quality versus low-quality). Furthermore, different
data quality characteristics will likely be more important to various stakeholders
across the organization.
Popular data quality characteristics and dimensions include:
   o   Accuracy
   o   Completeness
   o   Consistency
   o   Integrity
   o   Reasonability
   o   Timeliness
   o   Uniqueness/Deduplication
   o   Validity
   o   Accessibility