Project Scenario
You have been approached by a client who analyses atmospheric science and climate model data. They
have developed a new analysis technique, but it takes too long to run for them to use it. They have
asked you to investigate the use of big data techniques to reduce the processing time.
They have a large volume of data to process, and the analysis needs to be repeated frequently. They
have the following basic requirements:
1. Current analysis time is approximately 2.5 hours to analyse the climate model output data for a
1-hour time period.
2. The data for a single day of model output is approximately 250 MB. However, they have over
100 years’ worth of data to analyse, making a total of over 9 TB.
3. Each day, they need to analyse the new data set for that day, so they wish to complete the
analysis of the data for a 24-hour period (25 data sets) in under 2 hours.
4. It is not possible to hold all of this in memory at one time, so the new process should load only 1
hour of data for processing at a time. If parallel processing is to occur, then 1 hour of data per
worker can be loaded as needed.
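The figures in the requirements above imply a minimum speed-up, which can be worked out as in the sketch below. It is written in Python purely for illustration (the marking rubric implies that the project itself uses Matlab), and all variable names are hypothetical. It assumes that analysis time scales linearly with the number of data sets, that the work can be divided evenly between workers, and that 1 TB is 1,000,000 MB.

```python
import math

# Figures taken from the client's requirements.
HOURS_PER_DATA_SET = 2.5   # current analysis time for 1 hour of model data
DATA_SETS_PER_DAY = 25     # data sets in one 24-hour period
TARGET_HOURS = 2.0         # the daily analysis must finish in under this time
MB_PER_DAY = 250           # approximate size of one day of model output
YEARS_OF_DATA = 100        # the client holds over 100 years of data

# Assumption: analysis time scales linearly with the number of data sets.
sequential_hours = HOURS_PER_DATA_SET * DATA_SETS_PER_DAY

# Speed-up needed to bring the sequential time down to the target.
required_speedup = sequential_hours / TARGET_HOURS

# Assumption: perfect linear scaling with evenly divisible work.
# Real systems have overheads, so this is a lower bound only.
min_workers = math.ceil(required_speedup)

# Assumption: 365.25 days per year and 1 TB = 1,000,000 MB.
archive_tb = MB_PER_DAY * 365.25 * YEARS_OF_DATA / 1_000_000

print(f"Sequential time for one day: {sequential_hours} hours")
print(f"Required speed-up: {required_speedup}x")
print(f"Lower bound on workers: {min_workers}")
print(f"Approximate archive size: {archive_tb:.2f} TB")
```

Measured timings are needed to replace the ideal-scaling assumption, because loading data and starting workers add time that does not shrink as workers are added.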
You have been tasked with investigating the use of parallel processing to achieve the analysis speed
required, with the following expectations:
1. Test and compare the processing speed of sequential and parallel processing
2. Extrapolate your findings to indicate the number of processors required to achieve the target
processing time.
3. Test how your code responds to common errors, e.g. data that is text instead of numeric, use of
NaN in the data as an error code.
4. Run automated tests that allow your client to set the tests running and return later to see the
results, without user intervention.
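One possible way to approach the extrapolation in expectation 2 is Amdahl's law, which models the run time on n workers as T(n) = T(1) × (s + (1 − s) / n), where s is the fraction of the work that cannot be parallelised. The sketch below is illustrative Python, not a prescribed method. The function name `workers_needed` and the serial fractions in the usage lines are hypothetical; in practice s would be estimated from measured timings.

```python
import math


def workers_needed(sequential_hours, target_hours, serial_fraction):
    """Smallest worker count that meets the target under Amdahl's law.

    Returns None when the serial part alone already exceeds the target,
    so that no number of workers would be enough.
    """
    if not 0.0 <= serial_fraction < 1.0:
        raise ValueError("serial_fraction must be in the range [0, 1)")

    # Fraction of the sequential run time that the target allows.
    target_ratio = target_hours / sequential_hours
    if target_ratio >= 1.0:
        return 1  # the sequential code is already fast enough
    if target_ratio <= serial_fraction:
        return None  # target unreachable: serial work dominates

    # Rearranged from: serial_fraction + (1 - serial_fraction) / n <= target_ratio
    workers = (1.0 - serial_fraction) / (target_ratio - serial_fraction)
    return math.ceil(workers)


# Hypothetical serial fractions, for illustration only.
print(workers_needed(62.5, 2.0, 0.0))   # ideal scaling
print(workers_needed(62.5, 2.0, 0.02))  # 2% of the work is serial
print(workers_needed(62.5, 2.0, 0.05))  # 5% of the work is serial
```

The usage lines show why the serial fraction matters: a small increase in serial work raises the worker count sharply, and beyond a certain point the target cannot be met by adding workers alone.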
The data has been provided by the European Centre for Medium-Range Weather Forecasts (ECMWF).
Project Deliverables
Your project should deliver the following:
1. Working code that demonstrates:
a. Loading of only the data required for the processing taking place
b. Sequential processing of the data
c. Parallel processing of the data
d. Plots of the comparisons between sequential processing and parallel processing with
different numbers of workers
e. Automated testing of your code to deal with pre-defined data error types.
2. A formal project report for your client covering:
a. Comparisons between parallel and sequential data processing
b. Estimated number of processors required to achieve the goal of processing 24-hours of
data in under 2 hours.
c. Testing the code to see how it deals with:
i. Text instead of numeric values
ii. NaN values indicating data errors.
iii. Note: it is not necessary to solve these problems to pass, but you should be
able to suggest methods of dealing with these problems so code will not crash.
d. A summary of the evidence generated during your project and how it helps you arrive
at your conclusions
e. Recommendations
f. References
g. Appendices containing:
i. Code flow charts
ii. Gantt chart for your project
iii. Logbook
iv. Specification items
3. VIVA / presentation. You will be expected to present your work in a formal presentation / VIVA.
Details of this can be found in the VIVA assessment brief.
This assessment brief covers only parts 1 and 2. The assessment brief for part 3, VIVA, is found in a
separate document.
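The two error types named in deliverable 2c (text instead of numeric values, and NaN used as an error code) can be detected before the analysis runs, so that the code reports the problem instead of crashing. The sketch below is illustrative Python operating on plain lists; the function names are hypothetical, and reading values from the NetCDF files is deliberately not shown.

```python
import math


def classify_value(value):
    """Label one data value as 'ok', 'nan' or 'text'."""
    if isinstance(value, str):
        return "text"  # numeric value stored as text
    if isinstance(value, float) and math.isnan(value):
        return "nan"   # NaN used as an error code
    return "ok"


def summarise_errors(values):
    """Count how many values fall into each category."""
    counts = {"ok": 0, "nan": 0, "text": 0}
    for value in values:
        counts[classify_value(value)] += 1
    return counts


# Hypothetical sample containing both error types.
sample = [1.0, float("nan"), "3.2", 4]
print(summarise_errors(sample))
```

A summary of this kind could be written to a log file by the automated tests, so that the client can review which data sets were rejected and why.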
Additional Information
1. You will be provided with NetCDF data files:
a. One complete, correct data file
b. One file containing instrument errors, recorded as NaN.
c. One file containing a data storage error, where the numerical values have been saved as
text
2. You are provided with code files for the analysis technique. You should not edit these files in any
way. You are required to run the analysis, for timing purposes, but are not expected to analyse,
display, report on, or deal with the results of the analysis in any way.
3. You are expected to define your project by means of a list of 5 SMART specification items.
These should be included in an appendix.
4. You are expected to plan the work required for this project and provide a complete Gantt
chart, including identifying the critical path. This should be included in an appendix.
5. This is a formal report and it is expected that appropriate formal grammar and language are
to be used. Where this is not the case, a penalty of up to 10% may be applied to the marks for
the report structure. For help with formal writing, please contact the Centre for Academic
Writing.
Notes:
1. You are expected to use the Coventry University APA style for referencing. For support and
advice on this, students can contact the Centre for Academic Writing (CAW).
2. Please notify your registry course support team and module leader for disability support.
3. Any student requiring an extension or deferral should follow the university process as outlined
here.
4. The University cannot take responsibility for any coursework lost or corrupted on disks, laptops
or personal computers. Students should therefore regularly back up any work and are advised to
save it on the University system.
5. If there are technical or performance issues that prevent students submitting coursework through
the online coursework submission system on the day of a coursework deadline, an appropriate
extension to the coursework submission deadline will be agreed. This extension will normally be
24 hours or the next working day if the deadline falls on a Friday or over the weekend period.
This will be communicated via your Module Leader.
6. You are encouraged to check the originality of your work by using the draft Turnitin links on
Aula.
7. Collusion between students (where sections of your work are similar to the work submitted by
other students in this or previous module cohorts) is taken extremely seriously and will be
reported to the academic conduct panel. This applies to both coursework and exam answers.
8. A marked difference between your writing style, knowledge and skill level demonstrated in class
discussion, any test conditions and that demonstrated in a coursework assignment may result in
you having to undertake a Viva Voce in order to prove the coursework assignment is entirely
your own work.
9. If you make use of the services of a proof reader in your work you must keep your original
version and make it available as a demonstration of your written efforts.
10. You must not submit work for assessment that you have already submitted (partially or in full),
either for your current course or for another qualification of this university, with the exception of
resits, where for the coursework, you may be asked to rework and improve a previous attempt.
This requirement will be specifically detailed in your assignment brief or specific course or
module information. Where earlier work by you is citable, i.e. it has already been
published/submitted, you must reference it clearly. Identical pieces of work submitted
concurrently may also be considered to be self-plagiarism.
Marking Rubric
Each topic below lists its total section marks, followed by a description and breakdown of how the
marks are allocated.
Total marks available: 150
Report
This is a formal report and it is expected that appropriate formal grammar and language are to be used. Where this is not the case, a
penalty of up to 10% may be applied to the marks for the report structure. For help with formal writing, please contact the Centre for
Academic Writing.
Report Structure (max 30 marks)
a. Introduction (5 marks)
This should be clear and concise, introduce the project, the aims and how the report is
structured.
b. Code description (5 marks)
Describe the functionality of the code files, what they are used for and how they achieve their
tasks, including testing. Do not describe syntax.
c. Comparisons of parallel and sequential timing (10 marks)
Detailed explanation of the meanings of the results, how parallel processing achieves higher
speeds, detailed analysis and extrapolation to achieving the processing goal, makes good use of
figures and visual aids.
d. Summary (5 marks)
Pull together the key information well, present the information clearly.
e. Conclusions and recommendations (5 marks)
Make clear references to the report content to recommend the number of processors that may be
required, describe limitations of the analysis, how different systems may perform differently etc.
Additional research, information and understanding of e.g. HPCs, cloud computing etc. will be a
benefit.
Report Figures and Diagrams (max 25 marks)
a. Code flow charts showing processes (appendices): sequential processing, parallel processing,
testing
b. Plot graphs of worker / processing speed and extrapolated graph to show processing in the
required time.
5 marks per figure: mark allocation for each of the 5 plots and charts in parts a and b (no marks if
not present):
Clarity of main plot, colours, line styles, markers etc, title, axis labels, legend, caption
References and Appendices (max 30 marks)
a. References (5 marks)
Appropriate use of references. References should be in a standard format, e.g. APA 7th Edition.
No penalty will be applied for using another common standard, e.g. IEEE. A high number of
references is not required for this report, but the references should include as a minimum Matlab,
non-standard Matlab toolboxes and the data provider's website.
b. Gantt Chart (5 marks)
Complete, detailed, including sub-tasks, critical path identified
c. SMART targets (15 marks)
5 SMART targets
d. Log Book (5 marks)
Provide a detailed log book, with many detailed entries, total time should add up to ~150 hours or
more
Code
Analysis Code (max 20 marks)
a. Sequential processing code (10 marks)
b. Parallel processing code (10 marks)
Mark allocation for clearly structured code, clear and detailed annotations, block or function
descriptions, clear variable names, consistent structure and formatting, breaking your code into
functions
Automated Testing Code (max 20 marks)
a. Automating all speed comparisons (10 marks)
b. Automating the 'break tests' (10 marks)
Mark allocation for clearly structured code, clear and detailed annotations, block or function
descriptions, clear variable names, consistent structure and formatting, breaking your code into
functions
Results Display Code (max 15 marks)
Displaying the results automatically in the analysis code will be eligible for full marks. Using a
separate file with manually entered results will be capped at 5 marks.
a. Comparison plot of sequential vs parallel (5 marks)
b. Sequential vs parallel on same plot (5 marks)
c. Mean processing time per datum (5 marks)
Marks will be allocated for the clarity of the data and graph, colours, symbols etc, title and axis
labels and a legend
Version Control (max 10 marks)
Marks will be allocated for sufficient versions overall, separation of function and versions for
each, detailed commit notes and readme