DATA WAREHOUSING
AND MINING COURSEWORK
PDAM – 2018/2019
Unit Name: Data Warehousing and Mining Coursework
Unit Code: PDAM - U25764
Unit Coordinator:
Submission Deadline: Part I: Friday 14th December 2018 5:00pm
Part II: Friday 10th May 2019 5:00pm
Weight: Part I 40% - Part II 60%
Assesses: All learning outcomes.
IMPORTANT NOTE
The individual components in this coursework are NOT group work.
For individual and group components of this project, any unacknowledged copying of
either printed material or software from any other person or source (including
electronic media) constitutes plagiarism, which is a serious disciplinary offence; any
cases of plagiarism will be handled using the University disciplinary procedures.
Please ensure that your coursework is anonymous. Your NAME must not appear
anywhere on the coursework or the coversheet. Please use your ID only. Work to be
submitted online on Moodle.
First Submission: Due on Friday 14th December 2018 5:00pm
Designing and implementing a data warehouse for NHS
The aim of this project is to design and implement a data warehouse for the National Health Service
NHS using MySQL. The NHS consists of a number of treatment centers (e.g. hospitals, surgeries,
walk-in centers, etc.). Each center has a number of staff members (doctors, consultants, nurses etc.).
Each member has one occupation and one or specialties (e,g Occupation: Consultant, Specialties:
orthopedic, joint replacement). The NHS keeps track of its patients over time; diagnosis, the kind of
treatment they have received, the cost of each treatment etc. There are two types of patients;
Outpatient: a patient who is not hospitalized but visited a GP, hospital etc. Inpatient: a patient who has
been admitted to the hospital for day or more. Also, the NHS keeps track of the drugs/operations used
in each treatment (if any).
The following is a sample for the main tables in the operational database; this is just to show the core of
the case study. You should expand and improve the model where required.
Patient (Pa ent_ID, Pa ent_DoB, Pa ent_County, Pa ent_City, Pa ent_Ocupa on..)
Treatment_Unit (Unit_Code, Unit Type, Unit_County, Unit_City ….)
Staff (Staff ID, Staff_Level, Staff_Specializa on, )
Drug (Drug_Code, Drug_Cost, Drug_Type ….. )
Operation (Op_Code, OP_Type ….)
Treatment (Treat_ID, Treat_Type, Pa ent_ID, Diagnosis, Unit_Code, Staff_ID)
Treatment_Drug (Treat_ID, Drug_ID, TD_Date, TD_Dura on, TD_Cost).
Treatment_Operation (Treat_ID, Opera on_Code, TD_Date, TD_Dura on, TD_Cost).
Your task: Your task is to design and implement a warehouse for the NHS using dummy data. The aim
of this DW demo is to demonstrate to the NHS decision makers and consultants the benefits they can
gain from investing in a DW.
Step 1: Define a Star/Snowflake schema (i.e. normalized star schema). Your schema should cover
around two subjects.
Step 2: Create the tables using MySQL (You can use other DBMS but you will need to get approval
from the unit coordinator Dr. Mohamed Bader first).
Step 3: Populate the tables with some dummy data.
Step 4: Write and run 6 SQL queries. Your queries must be meaningful, serve at least 3 different
stakeholders and demonstrate the strength of DW in supporting decision makers. Also, your queries
must cover all subjects in your DW (at least 2 to 3 queries per subject). You should provide a short
description of each query.
Deliverables of this component is a report that contains:
a. Short description of your project: the report should explain your design decisions.
[500 words max].
b. Your Star/Snowflake schema.
c. Screenshots of each query and its output along with a short description of each query.
Your screenshots must be clear and readable. If the query and/or the result of the
query are not clear and readable, zero will be awarded.
The submission is online through Moodle (the submission details will be available on Moodle).
This component of your coursework contributes 40% of the total mark for the unit assessment. The
marking criteria [in 100% breakup of marks] for this component are as follows:
25% Justification of design decisions and project explanation. Also, for organisation, language
style and clarity of the report.
35% The correctness, coverage, quality and novelty of the design and star/snowflake schema.
40% The quality and the coverage of the advanced SQL queries
Second Submission: due on Friday 10th May 2019 5:00pm (Two Tasks)
Task I 40%: using WEKA-Classification software and critical thinking
You are required to search for at least 20 different datasets with a maximum of 40 datasets. You are
then required to apply the following classification techniques using the WEKA software on the chosen
datasets:
(1) Decision tree (J48),
(2) Random Forest, and
(3) K-NN (IBk) (with K taking the value of 1 up to the number of class labels in the dataset)
Note: Random Forest was not covered in lecture, part of this CW is to self-learn about other
classification methods beyond what was covered in the lectures.
Once you have applied the algorithms on all the datasets, it is required to accomplish the following
tasks:
1. Compare the performance of the applied techniques in terms of accuracy.
2. Analyse the results with regards to the dataset properties.
3. Write a report of no longer than 1000 words detailing the results you have reached in (1) and (2)
with recommendations on the choice of the data mining technique according to the features of the
datasets.
Task II 20%: using WEKA-Clustering software and critical thinking
You are required to apply the following clustering techniques using the WEKA software on only 10 of
the datasets you selected in task 1:
(1) K-means,
(2) Agglomeration method
Remove the class attribute before applying the above clustering methods. Once you have applied the
clustering techniques on all the datasets, it is required to accomplish the following tasks:
1. Use the clustering evaluation methods to compare the performance of the above algorithms
2. Write a report of no longer than 500 words detailing the results you have reached in (1) and (2)
with recommendations on the choice of the data mining technique according to the features of the
datasets.
Deliverables of this component of the coursework are:
You are to write a report addressing the aforementioned tasks in no more than 1500 (1000 classification
+ 500 clustering) words excluding figures and tables. Your report must cover the following areas:
(1) A short summary of the datasets you used and the justification of choice.
(2) A detailed analysis of your results when comparing the different classification/clustering
techniques.
The submission is online through Moodle (the submission details will be available on Moodle).
This component (task 1 and 2) of your coursework contributes 60% of the total mark (40% Task I and
20% Task 2) for the unit assessment. The marking criteria [in 100% breakup of marks] for this
component are as follows:
20% Justification of choice and number of the datasets used
20% Appropriate use of tables and figures when reporting the results
30% Analysis of the results of the experiments you have conducted
20% Conclusion with recommendations on how to match a dataset to a technique
10% Organisation, language style and clarity