Internship Report
CHAPTER 1
                           COMPANY PROFILE
        As part of the Computer Science and Engineering degree course prescribed by
 Visvesvaraya Technological University, an internship is undertaken. The details of the
 company that provided the internship are given below.
 1.1 About Company
        Board Infinity is a career-tech platform based in Mumbai, India, founded in 2017
 by Abhay Gupta and Sumesh Nair. The company aims to enhance career development and
 job readiness for students and professionals through personalized learning experiences
 and mentorship. Board Infinity offers a wide range of courses in fields such as data
 science, digital marketing, software development, and business management. These programs
 are designed to be flexible and adaptive to individual learner needs, incorporating
 one-on-one mentoring sessions with industry experts, hands-on projects, and career coaching.
 The platform has raised approximately $3.2 million in funding and connects learners with
 over 2,000 industry experts to ensure a focused and practical learning approach. Board
 Infinity operates with the vision of bridging the gap between academic knowledge and
 industry requirements, thereby improving employability and career growth for its users.
 Board Infinity also provides career transition support, helping learners shift from their
 current roles to new and more desirable positions within the industry. The company
 emphasizes outcomes, aiming to offer tangible improvements in job placements and career
 advancement for its users.
Dept of C.S.E
                                       “ Report on DataScience”
INSERT, UPDATE, and DELETE, which are used to retrieve, add, modify, and remove data,
respectively. These commands are fundamental for interacting with databases and managing
data efficiently. I also explored advanced SQL concepts like joins, subqueries, and
functions, which enable efficient data manipulation and retrieval from multiple tables.
Understanding joins allowed me to combine data from different tables, while subqueries
and functions provided advanced ways to handle complex queries and data transformations.
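The commands described above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and rows are hypothetical and chosen only for illustration; the course itself may have used a different database system such as MySQL.

```python
import sqlite3

# Hypothetical schema and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        salary REAL,
        dept_id INTEGER REFERENCES departments(id)
    );
""")

# INSERT: add rows.
conn.executemany(
    "INSERT INTO departments (id, name) VALUES (?, ?)",
    [(1, "Analytics"), (2, "Engineering")],
)
conn.executemany(
    "INSERT INTO employees (name, salary, dept_id) VALUES (?, ?, ?)",
    [("Asha", 50000, 1), ("Ravi", 65000, 2), ("Meena", 72000, 2)],
)

# UPDATE: modify an existing row.
conn.execute("UPDATE employees SET salary = 55000 WHERE name = 'Asha'")

# DELETE: remove a row.
conn.execute("DELETE FROM employees WHERE name = 'Ravi'")

# SELECT with a join: combine data from both tables.
joined = conn.execute("""
    SELECT e.name, d.name
    FROM employees AS e
    JOIN departments AS d ON d.id = e.dept_id
    ORDER BY e.name
""").fetchall()

# SELECT with a subquery and an aggregate function:
# employees earning more than the average salary.
above_avg = conn.execute("""
    SELECT name FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
""").fetchall()

print(joined)
print(above_avg)
```

Running everything against an in-memory database keeps the sketch self-contained, so it can be executed without installing or configuring a database server.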
 In the data science segment, I was introduced to the entire data science lifecycle,
 encompassing data collection, cleaning, analysis, and visualization. I learned about
 various types of data, including structured, unstructured, and semi-structured data, and
 the appropriate methods for handling each type. The course emphasized the importance of
 data cleaning, as it is a critical step to ensure accuracy and reliability in subsequent
 analyses. I gained proficiency in data preprocessing techniques such as handling missing
 values, outlier detection, and data normalization.
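The three preprocessing techniques named above can be sketched as follows. This is a minimal illustration using only Python's standard library; the readings are hypothetical, and median imputation, the 1.5 × IQR rule, and min-max scaling are assumed choices rather than the specific methods taught in the course (which used Pandas and NumPy).

```python
from statistics import median, quantiles

# Hypothetical readings; None marks a missing value.
raw = [12.0, 15.0, None, 14.0, 13.0, 95.0, None, 16.0]

# 1. Handle missing values: impute with the median of the observed values.
observed = [x for x in raw if x is not None]
fill = median(observed)
filled = [fill if x is None else x for x in raw]

# 2. Outlier detection: flag points more than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in filled if x < lower or x > upper]
cleaned = [x for x in filled if lower <= x <= upper]

# 3. Normalization: min-max scale the cleaned values into [0, 1].
smallest, largest = min(cleaned), max(cleaned)
normalized = [(x - smallest) / (largest - smallest) for x in cleaned]

print(outliers)
print(normalized)
```

The median is used for imputation because, unlike the mean, it is not pulled toward the extreme reading that the outlier step later removes.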
 The course provided hands-on experience with key programming languages and tools like
 Python, Pandas, NumPy, and Matplotlib. Python served as the primary programming language,
 and I learned how to use it for data manipulation, analysis, and visualization. Pandas
 was particularly useful for data wrangling, allowing me to work with large datasets
 efficiently. NumPy provided capabilities for numerical computing, making it easier to
 perform mathematical operations on arrays and matrices. Matplotlib, along with other
 visualization libraries, enabled me to create insightful charts and graphs to communicate
 data findings effectively.
1.1 Objectives
    The primary objective of the data science course was to equip students with the
 necessary skills to analyze, interpret, and leverage data for informed decision-making. Key
 learning outcomes included:
    ● Master essential SQL commands: SELECT, INSERT, UPDATE, DELETE.
    ● Learn data filtering and sorting techniques using WHERE, ORDER BY, GROUP BY.
    ● Explore advanced SQL concepts like joins, subqueries, and functions.
    ● Gain proficiency in data collection, cleaning, and preprocessing.
    ● Develop skills in data analysis using statistical methods.
    ● Learn to visualize data effectively using Python libraries like Matplotlib.
CHAPTER 2
TASK PERFORMED
   The weekly tasks performed are explained in detail below, giving an overview of the
   concepts covered during the internship.
Data Science is a multidisciplinary field that combines elements of computer science, statistics,
mathematics, and domain-specific knowledge to extract insights and knowledge from data. It
involves using various techniques, tools, and methods to collect, process, analyze, and interpret
large amounts of data to gain a deeper understanding of the underlying patterns, trends, and
correlations.
The importance of Data Science cannot be overstated, as it has become a crucial aspect of various
industries. Its applications are numerous, including extracting insights from large datasets,
identifying patterns and trends, making predictions and recommendations, informing business
decisions, and driving innovation and growth. Data Science is applied in healthcare for
personalized medicine, disease diagnosis, and treatment; in finance for risk management, portfolio
optimization, and fraud detection; in marketing for customer segmentation, targeted advertising,
and campaign optimization; and in environmental science for climate modeling, predictive
analytics, and sustainability. By leveraging Data Science, organizations can gain a competitive
edge, improve decision-making, and drive business success.
    ● Detailed description of various Data Science roles such as Data Scientist, Data Analyst,
      Data Engineer, etc.
    ● Introduction to essential tools and technologies used in Data Science (e.g., Python, R,
      SQL).
Data Science encompasses various job roles, including Data Scientist, Data Analyst, Data
Engineer, and more. A Data Scientist extracts insights from data, develops predictive models, and
informs business decisions. A Data Analyst interprets data to identify trends and patterns, while a
Data Engineer designs and implements data pipelines. Other roles include Data Architect,
Business Analyst, and Machine Learning Engineer.
Essential tools and technologies in Data Science include Python, R, SQL, and more. Python is a
popular programming language used for data analysis, machine learning, and visualization. R is a
language and environment for statistical computing and graphics. SQL is a language for managing
and analyzing relational databases. Additionally, tools like Tableau, Power BI, and Excel are used
for data visualization and analysis.
    ● Discussion on the motivations and inspirations for pursuing a career in this field.
   4. Traditional Approach vs. Data Science Approach
Pursuing a career in Data Science/AI/ML requires motivation and inspiration. Many are drawn to
the field by the opportunity to work with data, drive business decisions, and innovate. Others are
inspired by the potential to solve complex problems and make a meaningful impact.
Traditional approaches often rely on intuition and experience, whereas Data Science-driven
approaches rely on data-driven insights and statistical analysis. The traditional approach may lead
to biased decision-making, whereas Data Science approaches provide objective, data-backed
solutions. By embracing Data Science, organizations can unlock new opportunities, drive
innovation, and gain a competitive edge.
   Data Science has numerous real-world applications, and one such example is predicting
   customer churn for a telecom company. By analyzing customer data, such as usage patterns
   and billing information, a Data Science model can identify high-risk customers and enable
   targeted retention strategies. This case study demonstrates how Data Science can solve a
   complex business problem and drive significant revenue savings.
MODULE - 3: Business Analytics with Microsoft Excel
   2. Understanding Business Metrics
    Business analytics with Excel enables users to analyze and visualize data, making it easier to
understand business performance. Excel provides various tools and functions, such as pivot tables,
charts, and formulas, to facilitate data analysis. By learning business analytics with Excel, users
can unlock insights, identify trends, and drive business growth. This module introduces the
fundamentals of business analytics using Excel, empowering users to make data-driven decisions.
This module delves deeper into business analytics concepts in Excel, covering basics, functions,
pivot tables, dashboard creation, and statistical analysis. Students will learn to leverage Excel's
capabilities to analyze and visualize data, creating informative dashboards and reports. Topics
include data manipulation, chart creation, and advanced functions like VLOOKUP and INDEX-
MATCH. Pivot tables will be explored in depth, enabling students to summarize and analyze large
datasets efficiently.
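To make the idea behind a pivot table and a lookup function concrete, the following is a rough Python analogue of what Excel does in each case. It is a sketch only: the sales records and the price list are hypothetical, and the module itself worked in Excel rather than in code.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, amount).
sales = [
    ("North", "Laptop", 1200),
    ("North", "Phone", 800),
    ("South", "Laptop", 1500),
    ("South", "Phone", 700),
    ("North", "Laptop", 1000),
]

# Pivot: rows = region, columns = product, values = sum of amount.
pivot = defaultdict(lambda: defaultdict(int))
for region, product, amount in sales:
    pivot[region][product] += amount

# Lookup in the spirit of VLOOKUP / INDEX-MATCH: fetch a value by its key.
price_list = {"Laptop": 1100, "Phone": 750}  # hypothetical lookup table
laptop_price = price_list.get("Laptop")      # returns None if the key is missing

for region in sorted(pivot):
    print(region, dict(pivot[region]))
```

A pivot table performs the same grouping and summing interactively, which is why it is convenient for summarizing large worksheets without writing formulas for every combination.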
Advanced data visualization techniques will also be covered, allowing students to create
interactive and dynamic visualizations in Excel. This includes using tools like Power BI, Power
Pivot, and D3.js to create cutting-edge visualizations. Students will learn to effectively
communicate insights and trends to stakeholders, enhancing their ability to drive business
decisions with data-driven storytelling. By mastering advanced data visualization in Excel,
students will be able to unlock new insights, identify areas for improvement, and drive
  business growth.
    ● Overview of MySQL and NoSQL databases and their relevance to Data Science.
    ● SQL joins and techniques for combining data from multiple tables.
This module dives deeper into SQL and Python for data science, building on the foundational
knowledge gained earlier. Students will solidify their understanding of database concepts,
including data types, schema design, and normalization.
The module covers basic and advanced SQL queries, enabling students to extract insights from
databases effectively. Topics include filtering data with the WHERE clause and sorting results
with the ORDER BY clause, as well as techniques for aggregating and grouping data. Students
will learn to write efficient SQL queries, including subqueries, window functions, and common
table expressions.
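The query features listed above can be sketched with Python's built-in sqlite3 module. The orders table and its rows are hypothetical, and the window function assumes the bundled SQLite library is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
# Hypothetical rows, used only for illustration.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("A", 100), ("A", 250), ("B", 40), ("B", 60), ("C", 500)],
)

# WHERE filters rows; ORDER BY sorts the result.
large = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount >= 100 ORDER BY amount DESC"
).fetchall()

# GROUP BY aggregates per customer; HAVING filters the resulting groups.
totals = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY customer
""").fetchall()

# A common table expression (WITH) feeding a window function (RANK ... OVER).
ranked = conn.execute("""
    WITH customer_totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total, RANK() OVER (ORDER BY total DESC) AS rnk
    FROM customer_totals
    ORDER BY rnk
""").fetchall()

print(large)
print(totals)
print(ranked)
```

The common table expression names an intermediate result so the ranking query stays readable, which is the main reason to prefer it over a nested subquery here.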
Additionally, the module explores SQL joins and techniques for combining data from multiple
tables. Students will learn to use INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL
OUTER JOIN to integrate data from different tables, enabling them to analyze complex
relationships and extract valuable insights. By mastering advanced SQL techniques, students
will be able to work with large datasets and extract meaningful information to inform data-
driven decisions.
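The difference between join types can be seen in a small sketch, again using Python's built-in sqlite3 module with hypothetical tables. Only INNER JOIN and LEFT JOIN are shown, because RIGHT JOIN and FULL OUTER JOIN are not available in older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: student 3 has no score, and one score row
# refers to a student id (4) that is absent from the students table.
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE scores (student_id INTEGER, subject TEXT, marks INTEGER);
    INSERT INTO students VALUES (1, 'Anu'), (2, 'Bala'), (3, 'Chitra');
    INSERT INTO scores VALUES (1, 'SQL', 88), (2, 'SQL', 75), (4, 'SQL', 60);
""")

# INNER JOIN keeps only rows that have a match in both tables.
inner = conn.execute("""
    SELECT s.name, sc.marks
    FROM students AS s
    INNER JOIN scores AS sc ON sc.student_id = s.id
    ORDER BY s.id
""").fetchall()

# LEFT JOIN keeps every student; missing scores appear as NULL (None in Python).
left = conn.execute("""
    SELECT s.name, sc.marks
    FROM students AS s
    LEFT JOIN scores AS sc ON sc.student_id = s.id
    ORDER BY s.id
""").fetchall()

print(inner)
print(left)
```

A RIGHT JOIN would be the mirror image of the LEFT JOIN (keeping the unmatched score row instead), and a FULL OUTER JOIN would keep the unmatched rows from both sides.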
CHAPTER 3
                                 PROJECT DETAILS
3.1 Project Overview
3.2 Types of Diabetes:
1) In type 1 diabetes, the immune system attacks the insulin-producing cells of the pancreas,
   so the body cannot make enough insulin. No studies have convincingly demonstrated the
   causes of type 1 diabetes, and no effective preventive measures are known so far.
2) Type 2 diabetes is characterised by either insufficient insulin production by the cells or
   improper insulin use by the body. About 90% of people with diabetes have this kind of
   diabetes, making it the most prevalent type. Both genetic and lifestyle factors contribute
   to its occurrence.
3) Gestational diabetes occurs in pregnant women who unexpectedly develop high blood sugar
   levels. It returns in two-thirds of patients during subsequent pregnancies. There is a
   high likelihood that type 1 or type 2 diabetes will develop during a pregnancy affected by
   gestational diabetes.
   Diabetes can also be caused by genetic conditions. It is associated with at least two
   defective genes on chromosome 6, the chromosome that controls the body's response to
   numerous antigens. The incidence of type 1 and type 2 diabetes may also be influenced by
   viral infection. Infection with viruses such as rubella, mumps, hepatitis B virus, and
   cytomegalovirus increases the risk of developing diabetes.
   The goal of this study is to create a system that, by fusing the findings of several
   machine learning approaches, can more accurately conduct early diabetes prediction for a
   patient. To predict diabetes, we use a variety of machine learning classification and
   ensemble techniques. Machine learning is a technique used to train computers or other
   machines to learn from data. By creating various classification and ensemble models from
   the obtained dataset, machine learning techniques efficiently capture knowledge. Many
   machine learning techniques are capable of making predictions, but selecting the right
   method can be challenging. Therefore, we apply well-known classification and ensemble
   algorithms to the dataset to make predictions.
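The idea of fusing the findings of several classifiers can be sketched with a simple majority vote. This is an illustrative assumption, not the project's actual model: the patient records, the three one-feature rules, and their thresholds are all hypothetical and carry no clinical meaning.

```python
from collections import Counter

# Hypothetical patient records: (glucose, bmi, age, has_diabetes).
# Values and thresholds are illustrative only, not clinical guidance.
patients = [
    (148, 33.6, 50, 1),
    (85, 26.6, 31, 0),
    (183, 23.3, 32, 1),
    (89, 28.1, 21, 0),
    (137, 43.1, 33, 1),
    (116, 25.6, 30, 0),
]

# Three deliberately simple base classifiers, each looking at one feature.
def by_glucose(glucose, bmi, age):
    return 1 if glucose >= 130 else 0

def by_bmi(glucose, bmi, age):
    return 1 if bmi >= 30 else 0

def by_age(glucose, bmi, age):
    return 1 if age >= 45 else 0

def majority_vote(classifiers, features):
    """Return the label predicted by most of the base classifiers."""
    votes = Counter(clf(*features) for clf in classifiers)
    return votes.most_common(1)[0][0]

ensemble = [by_glucose, by_bmi, by_age]
predictions = [majority_vote(ensemble, p[:3]) for p in patients]
accuracy = sum(pred == p[3] for pred, p in zip(predictions, patients)) / len(patients)

print(predictions, accuracy)
```

An odd number of binary base classifiers is used so that a vote can never tie. In practice the base models would be trained algorithms rather than fixed rules, but the voting step works the same way.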
3.4 Objectives:
The purpose of this study is to evaluate the diabetes dataset and to develop and implement a
diabetes prediction and recommendation system built on machine learning classification
algorithms. Many blood-related illnesses exist, including leukemia, anaemia, diabetes,
haemophilia, high blood cholesterol, cancer, and HIV/AIDS. Diabetes mellitus, a chronic
illness, affects around 400 million individuals worldwide. These technologies are intended
to detect such medical issues.
1.5.1 Goals:
    ● To raise awareness about the significance of diabetes as a worldwide public health
   concern.
    ● To examine the literature on diabetes diagnosis and prediction and create a model using
   machine learning techniques.
    ● To promote diabetes prevention and management in underserved populations.
    ● To diagnose diabetes at an early stage using food intake.
    ● To highlight the importance of lifestyle, especially health and food, in identifying
   individuals with diabetes and avoiding complications.
   Serious action must be taken to reduce the impact of diabetes at an initial stage, which
   also helps to reduce the number of diabetic patients. Aside from that, anyone who believes
   they have diabetes should focus on preventing complications such as blindness, kidney
   disease requiring dialysis, amputation, or even death. A balanced diet is therefore
   necessary to prevent the progression of diabetes. Accurate classification of diabetes is a
   fundamental step towards diabetes prevention and control in healthcare, and early
   identification is even more beneficial in controlling the disease. Identifying diabetes at
   an early stage is tedious because a patient has to visit a physician regularly. Advances
   in machine learning address this critical healthcare problem by predicting disease.
   Several techniques have been proposed in the literature for diabetes prediction.
3.5 Motivation and Problem Definition
The problem of diabetes prediction using
machine learning revolves around accurately identifying individuals at risk of developing
diabetes before clinical symptoms appear. Given the increasing prevalence of diabetes
worldwide, early detection is crucial for effective intervention and management. Machine
learning offers a sophisticated approach to analyze complex and multifaceted health data,
including genetic, metabolic, and lifestyle factors, which traditional methods may overlook.
The motivation behind employing machine learning in this context is driven by the potential to
enhance predictive accuracy, enable personalized healthcare strategies, and ultimately reduce
the global burden of diabetes. By leveraging advanced algorithms, healthcare providers can
identify high-risk individuals early, tailor preventative measures, and improve patient
outcomes, thereby addressing a significant public health challenge.
3.7 Motivation
   1. Early Detection: Early identification of at-risk individuals can significantly reduce
   the incidence of diabetes and its complications. Machine learning models can predict
   diabetes before the onset of symptoms, allowing for timely intervention.
   2. Personalized Medicine: Machine learning enables the development of personalized
   risk profiles by considering a wide range of variables, including genetic information, blood
   biomarkers, dietary habits, and physical activity levels. This personalized approach can
   lead to more effective prevention and treatment strategies.
   3. Efficiency and Scalability: Machine learning models can analyze large datasets
   quickly and accurately, making them scalable solutions for healthcare systems. This
   efficiency can help in managing the growing number of diabetes cases worldwide.
   4. Cost-Effectiveness: Early prediction and intervention can reduce healthcare costs
   by preventing severe diabetes-related complications that require expensive treatments.
   5. Data Utilization: The increasing availability of health data from electronic health
   records, wearable devices, and genetic testing presents an opportunity to leverage machine
   learning for comprehensive diabetes prediction. This data-rich environment enhances the
   predictive power of machine learning models.
   Applications
   1. Risk Assessment Tools: Developing user-friendly tools for clinicians and patients that
   provide real-time risk assessments based on machine learning predictions.
   2. Preventive Programs: Designing targeted preventive programs for high-risk individuals
   identified by machine learning models, focusing on lifestyle modifications and regular
   monitoring.
   3. Clinical Decision Support: Integrating machine learning models into clinical decision
   support systems to aid healthcare providers in making informed decisions about diabetes
   prevention and management.
   4. Research and Development: Using machine learning to identify new biomarkers and
   risk factors for diabetes, contributing to the ongoing research and development in the field.
   3.8 Challenges
   1. Data Quality and Integration: Ensuring the quality and consistency of data from multiple
   sources is crucial for accurate predictions. Integrating heterogeneous data types (e.g.,
   genetic, clinical, lifestyle) poses a significant challenge.
   2. Model Interpretability: Developing models that are not only accurate but also
   interpretable to healthcare providers is essential for gaining trust and facilitating clinical
   adoption.
   3. Privacy and Security: Protecting patient data privacy and ensuring secure data handling
   are critical when dealing with sensitive health information.
   4. Bias and Fairness: Addressing potential biases in machine learning models to ensure fair
   and equitable predictions across diverse populations.
   By addressing these challenges and leveraging the capabilities of machine learning,
   significant advancements can be made in the early prediction and management of diabetes,
   ultimately imp
   3. Data Input: The pre-processed data is fed into the system for further processing.
   4. Data Division: The data is split into two sets:
      Training data: Used to train the forecasting model.
      Testing data: Used to evaluate the model’s performance on unseen data.
   5. Forecasting Model: A forecasting model is built using the training data. This model
   learns patterns and relationships within the data to predict future electricity consumption.
   6. Hyperparameter Tuning: The model’s parameters are optimized to achieve the best
   possible performance. This involves adjusting settings that control the model’s behavior.
   7. Is Forecasting Accurate? The model’s predictions are compared to the actual values in
   the testing data. If the accuracy is not satisfactory, the process may iterate back to
   hyperparameter tuning or model selection.
   8. Forecasted Output: Once the model’s accuracy is deemed acceptable, it generates
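The data division, model building, hyperparameter tuning, and accuracy check steps listed above can be sketched end to end. This is a minimal illustration under stated assumptions: the data is synthetic, a k-nearest-neighbours regressor stands in for the unspecified forecasting model, k is the only hyperparameter tuned, and the acceptance threshold is hypothetical.

```python
import random

# Synthetic data, purely illustrative: y is roughly 2*x plus noise.
rng = random.Random(0)
data = [(x, 2.0 * x + rng.uniform(-1.0, 1.0)) for x in range(40)]

# Data division: shuffle, then hold out 25% as unseen testing data.
rng.shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

def knn_predict(train_set, x, k):
    """Predict y as the mean target of the k nearest training points."""
    nearest = sorted(train_set, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_abs_error(train_set, eval_set, k):
    """Average absolute difference between predictions and actual values."""
    errors = [abs(knn_predict(train_set, x, k) - y) for x, y in eval_set]
    return sum(errors) / len(errors)

# Hyperparameter tuning: compare candidate k values on a validation slice
# carved out of the training data, so the testing data stays unseen.
val_split = int(0.8 * len(train))
fit, val = train[:val_split], train[val_split:]
candidates = [1, 3, 5, 7]
best_k = min(candidates, key=lambda k: mean_abs_error(fit, val, k))

# Accuracy check: evaluate the tuned model once on the testing data.
test_error = mean_abs_error(train, test, best_k)
ACCEPTABLE_ERROR = 5.0  # hypothetical threshold
accurate_enough = test_error <= ACCEPTABLE_ERROR

print(best_k, round(test_error, 3), accurate_enough)
```

If the accuracy check fails, the loop described in step 7 would return to the tuning step with a wider set of candidates or a different model.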
   4. Requirements
    • Software Requirement:
   The following tools and libraries are needed to build the machine learning model for
   diabetes prediction, listed with their respective purposes:
    1. Python 3.7 or higher (core programming language): Python is essential for writing and
   executing the code for data manipulation, model building, and interface creation.
    2. Streamlit (web application interfaces): Streamlit allows for the creation of
   interactive and user-friendly web applications to visualize data and model predictions.
    3. NumPy (numerical operations and array handling): NumPy is used for performing
   efficient numerical computations, handling arrays, and performing mathematical
   operations.
                                                                             CHAPTER 4
                    CONCLUSION AND FUTURE SCOPE
CERTIFICATION