Repository for final project allocation and submission for CS60013: Programming & Data Structures, offered in Autumn 2022 at IIT Kharagpur and taught by Prof. Subhamoy Mandal.
Deadline for project submission is 10th November 2022 at 23:59 IST.
This repository might be updated with new projects and/or changes to existing projects. Please check back regularly.
Final projects are crafted by Vinay and Sai Pavan and approved by Prof Subhamoy Mandal.
- Final Projects for CS60013 : Programming and Data Structures
- The project is to be done in groups of 3 students (except the 5th group, which has 2 students). Students are expected to work collaboratively.
- The choice of programming language is left to the students. However, the most common languages used are Python and C/C++.
- Each group will be assigned a mentor TA who will be responsible for guiding the group throughout the project.
- Meetings with the mentor TA will be scheduled at the beginning of the project and at regular intervals.
- Each student will be evaluated based on the contribution towards the project. Make sure you are contributing equally to the project.
- Code plagiarism will not be tolerated. Any submission found to be plagiarized will be awarded a zero grade.
- Late submissions will not be accepted.
- The final project evaluation is based on the following criteria:
  - Continuous Evaluation (CE): 40%
  - Code Quality and Documentation: 20%
  - Final Submission and Report: 40%
- Continuous Evaluation (CE): 40%
  - The CE will be based on the following criteria:
    - Your participation in the weekly meetings with your mentor TA.
    - Your weekly progress and updates on the project.
- Code Quality and Documentation: 20%
  - This will be based on the following criteria:
    - Code Quality: 10% (based on the code quality and readability)
    - Documentation: 10% (based on the documentation of the code and the project)
- Final Submission and Report: 40%
  - This will be based on the following criteria:
    - Final Submission: 20% (based on the final submission of the project)
    - Final Report: 20% (based on the final report of the project)
- CE will be evaluated only if you have attended at least 75% of the weekly meetings with your mentor TA.
- Fork the `github.com/ummadiviany/pds_final_projects` repository.
- Clone the forked repository to your local machine using the following command: `git clone github.com/{your_username}/pds_final_projects`
- Your projects are in the `submissions` directory. You can find the project description in the `README.md` file of the respective project directory.
- Work on the project, make regular commits to your local repository, and push them to your forked repository.
- Your mentor TA will review your code and provide feedback.
- You have to submit the following:
  - Final Code: The final code of your project in the respective project directory.
    - Code should be highly readable and well documented.
    - Try to write efficient code and avoid unnecessary code.
  - Final Report: The final report of your project in the respective project directory. The report should be a markdown file named `report.md`. The report should contain the following:
    - Introduction: A brief introduction of the project.
    - Data: A brief description of the data used in the project.
    - Questions & Answers: The questions and their respective answers. Also include the code snippets used to answer the questions and who solved each question.
    - References: The references used in the project.
- Submission of the final project will be done via GitHub Pull Requests.
- Once you are done with the project, create a Pull Request to the `main` branch of the `github.com/ummadiviany/pds_final_projects` repository.
- We will review your pull request and provide feedback. You can make changes to your code and update the pull request. If accepted, your project will be merged into the `main` branch of the `github.com/ummadiviany/pds_final_projects` repository.
- That's it! Congratulations! You have successfully submitted your final project.
The deadline for the final project submission is 10th November 2022, 23:59 IST.
| Students | Project | Mentor TA |
|---|---|---|
| Amar Majhi, Mamta Rani, Reflex Kumar Patel | Project 4 : Medical Image Visualization and Analysis | Sai Pavan |
| Bhanu Kumar Meena, Syeda Najafara Fathima, Kavin Puri | Project 1 : Medical Transcription Analysis | Vinay |
| Pooja P Jain, Sathishkumar S, P.V.Kamlesh | Project 3 : ISBI 2022 Accepted Submissions Analysis | Vinay |
| Ramkumar K, Chaudhari Saurabh Santosh, Samriddha Das | Project 2 : Agriculture Crop Production Analysis | Vinay |
| Prabhukalyan Dash, Soumita Guria | Project 5 : Patient Health Statistical Analysis | Sai Pavan |
- The project aims to analyse the medical transcription dataset. The dataset is located at `data/medical_transcriptions/mtsamples.csv`.
- The dataset is a `csv` file. CSV stands for Comma Separated Values. It is a simple file format used to store tabular data, such as a spreadsheet or database. Each line of the file is a data record. Each record consists of one or more fields, separated by commas, which is where the format gets its name.
- The dataset contains the following fields:
  - `description`: Short brief of the interaction between the patient and the doctor.
  - `medical_specialty`: Medical specialty of the issue discussed in the transcription.
  - `sample_name`: Medical samples used for the diagnosis.
  - `transcription`: Full transcription of the interaction between the patient and the doctor.
  - `keywords`: Keywords of the transcription.
- The project can be divided into sub-areas as follows:
  - Data Preprocessing
    - Write functions to read the csv file. Suggestion: use the `pandas` library.
    - This dataset needs a bit of pre-processing. The `medical_specialty` field contains multiple values. You need to split the values and create a list of values. For example, if the `medical_specialty` field contains `Orthopedics, Neurology`, then you need to split it into `['Orthopedics', 'Neurology']`.
    - The `keywords` field contains multiple values. You need to split the values and transform them into a list of values. For example, if the `keywords` field contains `'pain, headache, migraine'`, then you need to split it into `['pain', 'headache', 'migraine']`.
    - Look into the dataset and find out if there are any other fields that need to be pre-processed.
  - Data Analysis
    - In this part, prepare a set of at least 10 questions and answer them using the dataset.
    - Some example questions to get you started:
      - What is the most common medical specialty?
      - What is the most common medical sample?
      - What is the most common keyword?
      - What is the average length of the transcription?
      - What is the average length of the description?
      - What is the average length of the keywords?
      - And so on... Get creative and come up with your own questions.
  - Data Visualization
    - In this part you can make use of the `matplotlib` and `seaborn` libraries to visualize the answers to the questions you asked in the previous part.
    - Everyone likes to see results in the form of graphs and charts, so make sure you visualize the answers to the questions you asked in the previous part.
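The preprocessing and the first analysis questions for the medical transcription project could look roughly like the sketch below. It assumes `pandas` is installed; the two-row inline CSV only stands in for `mtsamples.csv`, and the helper name `split_field` is hypothetical, not part of the project materials.

```python
import io
from collections import Counter

import pandas as pd

# Tiny inline sample standing in for data/medical_transcriptions/mtsamples.csv
# (assumption: the real file uses the same column names; the rows are made up).
SAMPLE_CSV = io.StringIO(
    'description,medical_specialty,sample_name,transcription,keywords\n'
    '"Knee pain","Orthopedics, Neurology","Knee exam","Patient reports pain.","pain, headache, migraine"\n'
    '"Follow-up","Neurology","Migraine visit","Headache persists.","headache, nausea"\n'
)


def split_field(value):
    """Split a comma-separated string into a list of stripped, non-empty values."""
    if pd.isna(value):
        return []  # missing cells become an empty list
    return [part.strip() for part in str(value).split(",") if part.strip()]


df = pd.read_csv(SAMPLE_CSV)
for column in ("medical_specialty", "keywords"):
    df[column] = df[column].apply(split_field)

# Example question: what is the most common keyword?
keyword_counts = Counter(kw for kws in df["keywords"] for kw in kws)
most_common_keyword, count = keyword_counts.most_common(1)[0]

# Example question: what is the average transcription length (in characters)?
avg_transcription_len = df["transcription"].str.len().mean()

print(most_common_keyword, count, avg_transcription_len)
```

To run this on the real data, replace `SAMPLE_CSV` with the path to the csv file. Whether length should be measured in characters or words is a choice each group can make and should state in the report.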
- This project aims to analyse crop production data from 2006 to 2011 for all the states of India. The dataset is located in the `data/crop_production/` directory.
- The data directory contains 5 csv files. Go through the data files and understand the data.
- Different data files contain different types of data. For example, `datafile_1.csv` contains the following fields:
  - `Crop`: Name of the crop
  - `State`: Name of the state
  - `Cost of Cultivation (/Hectare) A2+FL`: Cost of cultivation per hectare
  - `Cost of Cultivation (/Hectare) C2`: Cost of cultivation per hectare
  - `Cost of Production (/Quintal) C2`: Cost of production per quintal
  - `Yield (Quintal/ Hectare)`: Yield per hectare
- `datafile_2.csv` contains the following fields:
  - `Crop`: Name of the crop
  - `Production (YYYY - YY)`: Production of the crop between two consecutive years
  - `Area (YYYY - YY)`: Area of the crop between two consecutive years
  - `Yield (YYYY - YY)`: Yield of the crop between two consecutive years
- You can use the `pandas` library to read the csv files and perform analysis on the data.
- The data files are not clean. You need to clean the data before you start analysing it.
- The project can be divided into the following parts:
  - Data Processing
    - Write functions for reading the data files.
    - Once you have read the data files, clean the data. You can use the `pandas` library for this.
    - Keep only the data that is relevant to the analysis and drop the rest.
  - Data Analysis
    - In this part, you need to prepare a set of questions and answer them using the data provided.
    - Answer at least 15 questions using the data provided.
    - A few example questions to get you started:
      - Which crop has the highest production in the country?
      - What are the major states where rice is grown?
      - What is the average cost of cultivation of rice in the country?
      - In which seasons is Sunflower grown? (data available in `datafile_5.csv`)
      - What is the average crop duration for Paddy, Wheat and Maize?
    - You can come up with your own questions and answer them using the data provided.
  - Data Visualization
    - Visualize the data using the `matplotlib` or `seaborn` library.
    - Visualizing the data will help you understand the data better and answer the questions.
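As a rough illustration of the cleaning step for the crop production data, the sketch below tidies headers, drops empty rows, and coerces numeric columns with `pandas`. The inline rows are invented, the kinds of mess shown (padded headers, blank rows, text in numeric columns) are assumptions about what "not clean" might mean, and `clean_crop_frame` is a hypothetical helper name.

```python
import io

import pandas as pd

# Inline stand-in for datafile_1.csv; the values are made up for illustration,
# and the kinds of mess shown here are assumptions, not taken from the real files.
RAW = io.StringIO(
    "Crop, State ,Cost of Cultivation (/Hectare) C2,Yield (Quintal/ Hectare)\n"
    "RICE , Punjab,30000,40\n"
    "rice,Bihar,20000,20\n"
    ",,,\n"
    "Wheat,Punjab,not available,35\n"
)


def clean_crop_frame(df):
    """Return a cleaned copy: tidy headers, drop empty rows, normalise text, coerce numbers."""
    df = df.copy()
    df.columns = [str(col).strip() for col in df.columns]
    df = df.dropna(how="all")  # remove rows where every cell is missing
    for col in ("Crop", "State"):
        df[col] = df[col].str.strip().str.title()
    for col in df.columns.difference(["Crop", "State"]):
        # Non-numeric entries become NaN instead of raising an error.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.reset_index(drop=True)


crops = clean_crop_frame(pd.read_csv(RAW))

# Example question: what is the average cost of cultivation of rice?
cost_col = "Cost of Cultivation (/Hectare) C2"
rice_avg_cost = crops.loc[crops["Crop"] == "Rice", cost_col].mean()
print(crops)
print(rice_avg_cost)
```

Coercing bad numeric cells to `NaN` keeps the rows available for other columns; dropping such rows entirely is an equally valid choice, as long as the report says which was used.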
- The project aims to analyse the accepted submissions of ISBI 2022. The dataset is located in the `data/isbi2022/` directory.
- The dataset comprises multiple `json` files. JSON stands for JavaScript Object Notation. It is a lightweight data-interchange format that is easy for humans to read and write.
- Each json file contains information about multiple papers (about 100 papers in each). The information about a paper is stored in the form of key-value pairs. JSON is all about key-value pairs (aka dictionaries in Python).
- Each paper contains more than 20 attributes, but the most useful attributes are as follows:
  - `articleTitle`: Title of the paper
  - `authors`: List of authors of the paper
  - `citationCount`: Number of citations of the paper
  - `downloadCount`: Number of downloads of the paper
  - `startPage`: Starting page of the paper
  - `endPage`: Ending page of the paper
  - `abstract`: Stripped abstract of the paper
- The project can be divided into sub-areas as follows:
  - Data Preprocessing
    - Write functions to read multiple json files and concatenate them into a single dataframe.
    - Keep only the useful attributes mentioned above and drop the rest.
  - Data Analysis
    - In this part, prepare a set of at least 15 questions and answer them using the dataset.
    - Some example questions to get you started:
      - In which area of ISBI 2022 were the most papers submitted?
      - Which are the top 10 downloaded papers and what are they about?
      - Which are the top 10 cited papers and what are they about?
      - What are the mean and median number of authors per paper?
      - What are the most common words in the abstracts of the papers? Form a word cloud.
      - What is the average number of pages per paper?
      - And so on... Get creative and come up with your own questions.
  - Data Visualization
    - In this part you can make use of the `matplotlib` and `seaborn` libraries to visualize the answers to the questions you asked in the previous part.
    - Everyone likes to see results in the form of graphs and charts, so make sure you visualize the answers to the questions you asked in the previous part.
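For the ISBI 2022 project, reading several json files into one dataframe might be sketched as below. Two things here are assumptions rather than facts about the dataset: that each file holds a JSON array of paper objects, and that `startPage` and `endPage` may be stored as strings. The helper name `load_papers` and the sample papers are hypothetical; the demo writes two small files to a temporary directory instead of touching `data/isbi2022/`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

USEFUL = ["articleTitle", "authors", "citationCount", "downloadCount",
          "startPage", "endPage", "abstract"]


def load_papers(directory):
    """Read every *.json file in `directory` and return one DataFrame of useful attributes."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            records.extend(json.load(handle))  # assumes a top-level list of papers
    frame = pd.DataFrame.from_records(records)
    # Keep only the useful attributes that are actually present.
    return frame[[col for col in USEFUL if col in frame.columns]]


# Demo with made-up papers written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = [
        [{"articleTitle": "Paper A", "authors": ["X", "Y"], "citationCount": 3,
          "downloadCount": 120, "startPage": "1", "endPage": "5",
          "abstract": "Segmentation of MRI.", "publisher": "dropped"}],
        [{"articleTitle": "Paper B", "authors": ["Z"], "citationCount": 0,
          "downloadCount": 80, "startPage": "6", "endPage": "9",
          "abstract": "Registration of CT.", "publisher": "dropped"}],
    ]
    for index, papers_in_file in enumerate(sample):
        Path(tmp, f"part_{index}.json").write_text(
            json.dumps(papers_in_file), encoding="utf-8")
    papers = load_papers(tmp)

# Example questions: mean number of authors and average number of pages per paper.
mean_authors = papers["authors"].apply(len).mean()
pages = pd.to_numeric(papers["endPage"]) - pd.to_numeric(papers["startPage"]) + 1
average_pages = pages.mean()
print(mean_authors, average_pages)
```

If the real files wrap the papers in an outer object, only the line that extends `records` needs to change.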
- The project aims to read, visualize and analyze medical images. The dataset is located in the `data/medical_images/` directory.
- The dataset contains medical images of MRI and CT scans for different anatomical parts of the body. It also contains the segmentation masks for the images.
- The dataset has:
  - Hippocampus MRI images and segmentation masks
  - Heart MRI images and segmentation masks
  - Prostate MRI images and segmentation masks
  - Abdomen CT images and segmentation masks
- These scans are used to diagnose diseases of the body. The segmentation masks are used to identify the different parts of the body in the images.
- Scans are in NIFTI format. NIFTI is a standard format for storing medical images.
- All the scans are 3D volumes. Each 3D volume is a stack of 2D images. Each 2D image is called a slice.
- Your first task is to read the images and visualize them. You can use the `nibabel` library to read the images.
- Visualizing the images is important to understand the data. You can use the `matplotlib` library to visualize the images. Visualization can be done in multiple ways:
  - Visualize the slices of the images and the segmentation masks.
  - Visualize the 3D volumes of the images and the segmentation masks.
- The next task is to analyze the images. You can use the `numpy` library to analyze the images. The analysis part is open ended.
- You can perform simple statistical analysis on the images. You can also perform more complex analysis like image segmentation and image classification.
- Statistical analysis may include the following:
  - Calculate the mean, median, standard deviation, minimum and maximum for the whole image and for the segmented image.
  - Now compare the statistics of the segmented image with the whole image. What do you observe?
- Complex analysis may include the following:
  - Perform image segmentation on the images. You can use the `scikit-image` library to perform image segmentation.
  - Perform image classification on the images. You can use the `scikit-learn` library to perform image classification.
  - You can also perform image registration on the images. You can use the `SimpleITK` library to perform image registration.
- Try statistical analysis first and then move on to more complex analysis. We do not expect you to perform complex analysis, but you can try it if you want to.
- Remember, the analysis part is open ended. You can come up with your own analysis ideas and implement them.
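The statistical comparison between a whole image and its segmented region can be sketched with `numpy` alone. The volume below is synthetic so the example runs without the dataset; with real scans the arrays would typically come from `nibabel.load(path).get_fdata()`. Treating every non-zero mask voxel as "segmented" is an assumption, and `summarize` is a hypothetical helper name.

```python
import numpy as np


def summarize(values):
    """Return mean, median, standard deviation, minimum and maximum of an array."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": float(values.mean()),
        "median": float(np.median(values)),
        "std": float(values.std()),
        "min": float(values.min()),
        "max": float(values.max()),
    }


# Synthetic 3D volume standing in for a scan (assumption: real data would be
# loaded with nibabel, which is not imported or run here).
volume = np.zeros((4, 4, 4))
volume[1:3, 1:3, 1:3] = 100.0  # bright cube in the centre of the volume
mask = volume > 0              # stand-in segmentation mask (non-zero = segmented)

whole_stats = summarize(volume)
segment_stats = summarize(volume[mask])  # only voxels inside the mask

# A 3D volume is a stack of 2D slices; take the middle slice along the last axis.
middle_slice = volume[:, :, volume.shape[2] // 2]

print(whole_stats)
print(segment_stats)
print(middle_slice.shape)
```

A slice such as `middle_slice` is a plain 2D array, so it can be passed to `matplotlib.pyplot.imshow` for the visualization task.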
- Create a `.csv` file which contains the following information:
  - Create an attribute named `patient` holding ten names of your friends or random names, in string format.
  - Add the attribute `patient Identifier` and assign the numbers 1 to 10, one per person, in integer format.
  - Add the attribute `Height` and add the respective heights in float format: {5.5, 5.6, 6.1, 6.1, 6.0, 5.9, 5.8, 5.8, 5.8, 9.1}
  - Add the attribute `Temperature` and add the respective temperatures in float format: {97.2, 97.3, 97.8, 98, 98.1, 98.2, 97.3, 98, 101, 102}
  - Add the attribute `disease` and randomly assign one of the following diseases to each patient identifier: {Headache, cold, fever}
  - Add the attribute `Hospital` and assign hospitals randomly per patient identifier.
  - Add the attribute `Cost` and assign the following values randomly per patient identifier, in float format: {20.0, 1000.0, 800.0, 910.0, 950.0, 980.0, 990.0, 890.0, 880.0, 930.0}
- Obtain the statistics from the dataset you created.
- Now create a class to represent the same data.
  - Create methods to calculate the statistics:
    - Mean, median and mode of `Height`
    - Mean, median and mode of `Cost`
    - Mean, median and mode of `Temperature`
- Comment on the statistic calculations and clearly state your observations.

Note: This section may carry high weightage, so keep the observations short and to the point.
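One possible shape for the patient statistics class is sketched below, using only the standard library. The heights, temperatures and costs are the values listed above; the placeholder names, the class names `Patient` and `PatientRecords`, and the omission of the `disease` and `Hospital` columns are simplifications made for this illustration. The csv is written to an in-memory buffer rather than a file.

```python
import csv
import io
import statistics
from dataclasses import dataclass

HEIGHTS = [5.5, 5.6, 6.1, 6.1, 6.0, 5.9, 5.8, 5.8, 5.8, 9.1]
TEMPS = [97.2, 97.3, 97.8, 98, 98.1, 98.2, 97.3, 98, 101, 102]
COSTS = [20.0, 1000.0, 800.0, 910.0, 950.0, 980.0, 990.0, 890.0, 880.0, 930.0]


@dataclass
class Patient:
    patient: str
    identifier: int
    height: float
    temperature: float
    cost: float


class PatientRecords:
    """Holds patient rows and computes mean, median and mode of a numeric attribute."""

    def __init__(self, patients):
        self.patients = list(patients)

    @classmethod
    def from_csv(cls, handle):
        """Build the records from an open csv file (or any text stream)."""
        return cls(
            Patient(row["patient"], int(row["patient Identifier"]),
                    float(row["Height"]), float(row["Temperature"]), float(row["Cost"]))
            for row in csv.DictReader(handle)
        )

    def stats(self, attribute):
        values = [getattr(p, attribute) for p in self.patients]
        return {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "mode": statistics.mode(values),
        }


# Write the csv to memory; "Patient 1".."Patient 10" are placeholder names.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["patient", "patient Identifier", "Height", "Temperature", "Cost"])
for i, (h, t, c) in enumerate(zip(HEIGHTS, TEMPS, COSTS), start=1):
    writer.writerow([f"Patient {i}", i, h, t, c])
buffer.seek(0)

records = PatientRecords.from_csv(buffer)
height_stats = records.stats("height")
cost_stats = records.stats("cost")
print(height_stats)
print(cost_stats)
```

With these values the mean cost falls below the median cost because of the single 20.0 entry, which is the kind of observation the last task asks for. Note that `statistics.mode` returns the first value encountered when every value is unique, so the mode of `Cost` is not very informative.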
- Python Documentation
- Class Code Materials
- Introduction to Computation and Programming Using Python
- Elements of Programming Interviews in Python
- Python Libraries