Overall Project Brief
Note: This is NOT the brief you are working to for your project. This document contains an over-arching project that requires the analysis of big data and the display of the results. The project that you select for this module should contribute to the overall success of this project.
The European Centre for Medium-Range Weather Forecasts (ECMWF) uses data output from a number of climate models.
There are 14 models that could potentially be used; we will be using only 7.
The models provide data on a grid and cover a variety of climate chemistry species, e.g. CO2, H2O and O3, as well as some pollutants such as PM2.5 and PM10. For this project we will be looking at O3 only, but the technique could be extended across all chemical and pollutant species.
In total there can be up to 250 different values output for each grid space. The grids are spaced at 0.1° latitude and longitude. This results in a 700 x 400 grid to cover Europe. The data is provided in NetCDF format, a standard format frequently used in climate science. Libraries for reading these files are available in most common languages.
We have been provided with 25 hourly values, to cover a single day, for each model.
In total that results in 700 x 400 x 25 x 7 = 49 million data points, occupying 196 MB. This might not sound big, but it is the number of calculations that brings this into the ‘big data’ realm: 7 million calculations!
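The figures quoted above follow from the grid dimensions. As a quick check, the sketch below reproduces them, assuming each value is stored as a 4-byte float (an assumption; the brief does not state the storage type). The constant names are illustrative only.

```python
# Dimensions taken from the brief.
GRID_SHAPE = (700, 400)   # grid spaces covering Europe at 0.1 degree spacing
HOURS = 25                # hourly values provided per model
MODELS = 7                # models used in this project
BYTES_PER_VALUE = 4       # ASSUMPTION: single-precision floats

cells_per_hour = GRID_SHAPE[0] * GRID_SHAPE[1]   # grid spaces in one hourly field
ensemble_values = cells_per_hour * HOURS         # one ensemble calculation per grid space per hour
total_points = ensemble_values * MODELS          # data points across all models
total_bytes = total_points * BYTES_PER_VALUE     # approximate storage required
```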
These model outputs are used to generate an overall estimate of the climate. Currently the overall estimate is calculated by taking the mean of the model values at each grid space, i.e. at each grid space, sum the model outputs, calculate the mean and save to a new file. Taking a simple mean of 7 values, 7 million times, is relatively quick (<30 s).
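A minimal sketch of the simple mean ensemble described above, using only the Python standard library. It assumes each model's field for one hour has already been read into a list of rows and that all models share the same grid shape; the function name is illustrative, and an array library would normally be used for data of this size.

```python
from statistics import fmean


def simple_mean_ensemble(model_grids):
    """Return the per-grid-space mean across models.

    model_grids: one 2D grid (list of rows) per model.
    ASSUMPTION: every grid has the same number of rows and columns.
    """
    n_rows = len(model_grids[0])
    n_cols = len(model_grids[0][0])
    return [
        [fmean(grid[r][c] for grid in model_grids) for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

Calling `simple_mean_ensemble` once per hourly field gives the ensemble values that would then be saved to a new file.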
Recent research by Hyde et al. (Hyde, Hossaini, and Leeson 2018), using the DDC clustering algorithm (Hyde and Angelov 2014), found that the accuracy of climate model ensembles could be improved by up to 18%.
This involves running the clustering algorithm at each grid location, finding the most populous cluster, and using the centre of this cluster as the new ensemble value. This technique is known as a Cluster-Based-Ensemble (CBE). Clearly this process will take considerably longer than calculating a simple mean value. In the original research the values at each grid location were calculated sequentially; implementing the technique in this manner has been estimated to take several days to process our 24-hour data period. This is not an acceptable time-scale.
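The selection step can be sketched as below. This is not the DDC algorithm itself: it assumes the clustering has already produced a cluster label for each sample and an ozone-axis centre for each cluster (both hypothetical inputs), and it only picks the CBE value. Where several clusters tie for most populous, it takes the mean of their centres, as specified in the key steps of this brief.

```python
from collections import Counter
from statistics import fmean


def cbe_value(labels, centres):
    """Pick the ensemble value from the output of a clustering run.

    labels:  cluster id assigned to each sample (hypothetical clustering output)
    centres: dict mapping cluster id -> cluster centre on the ozone axis
    """
    counts = Counter(labels)
    largest = max(counts.values())
    # All clusters that share the highest population.
    winners = [cid for cid, n in counts.items() if n == largest]
    # One winner: its centre. Several winners: the mean of their centres.
    return fmean(centres[cid] for cid in winners)
```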
This project involves the use of ‘big data’ techniques to reduce the processing time to <2 hours. These techniques include:
1. Sub-spacing of data. It has been found that the clustering at each location is rarely affected by data >1 grid space distant. Thus, the clustering at each location can be carried out on a small subset of the data between -3 and +3 grid spaces. The radius for the clustering should be set to 1.5 by default on the geographic axes. The code must allow the number of grid spaces and the cluster radius to be altered, i.e. they must be stored as variables and must not be hard coded.
2. The geographic spacing (and therefore the grid distance) is to be scaled to 0-1 for use with the clustering algorithm.
3. To calculate the appropriate radius for the ozone axis:
a. Calculate the mean of the minimum ozone value each month
b. Calculate the mean of the range of ozone values each month
c. Calculate the mean of the standard deviation of the ozone values each month
d. The radius is calculated by dividing (c) by (b)
e. All the ozone values must then be scaled by subtracting (a) and then dividing by (b)
4. Parallel processing. The sequential processing of each grid location contributes the most, by far, to the overall processing time. Parallel processing of each grid space will reduce the overall time.
5. Plots of a large number of data points can be difficult to understand. A suitable format for visualizing the results is required.
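Steps (a) to (e) for the ozone axis can be sketched as follows. The sketch assumes the ozone values have already been grouped by month, and it uses the population standard deviation; the brief does not say whether the population or sample form is intended, so that choice is an assumption. The function names are illustrative.

```python
from statistics import fmean, pstdev


def ozone_scaling(monthly_values):
    """Return (mean_min, mean_range, radius) for the ozone axis.

    monthly_values: one list of ozone values per month (assumed grouping).
    """
    mean_min = fmean(min(m) for m in monthly_values)              # (a)
    mean_range = fmean(max(m) - min(m) for m in monthly_values)   # (b)
    mean_std = fmean(pstdev(m) for m in monthly_values)           # (c) ASSUMPTION: population std
    radius = mean_std / mean_range                                # (d) = (c) / (b)
    return mean_min, mean_range, radius


def scale_ozone(values, mean_min, mean_range):
    """(e) Subtract (a), then divide by (b)."""
    return [(v - mean_min) / mean_range for v in values]
```

Because the scaling uses means of the monthly minimum and range, individual scaled values can fall slightly outside 0-1.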
There are many steps to this project, which are detailed below. It is not expected that your final output will complete the calculation within the required time, nor is it a requirement that you complete the whole project; you can focus on specific sub-projects. Suggestions of how to break out smaller sub-projects are provided in a separate document. However, this module is ‘Big Data Programming Project’, so your work will be expected to demonstrate some use of big data techniques such as sub-spacing, parallel processing, display of big data etc.
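As an illustration of two of these techniques, the sketch below combines sub-spacing with parallel processing using the standard library. It makes several assumptions: the data is a single 2D field held as a list of rows, each row is treated as one parallel task, and `window_mean` is a hypothetical stand-in for the per-location clustering. The half-width of the window is a parameter rather than a hard-coded value.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from statistics import fmean

HALF_WIDTH = 3  # grid spaces either side of the target location; kept as a variable


def window(grid, row, col, half_width=HALF_WIDTH):
    """Return the cells within half_width of (row, col), clipped at the grid edges."""
    r0, r1 = max(0, row - half_width), min(len(grid), row + half_width + 1)
    c0, c1 = max(0, col - half_width), min(len(grid[0]), col + half_width + 1)
    return [grid_row[c0:c1] for grid_row in grid[r0:r1]]


def window_mean(win):
    """Hypothetical placeholder for the clustering run on one sub-space."""
    return fmean(v for win_row in win for v in win_row)


def process_row(task):
    """Worker: compute every location in one row from the band of rows it needs."""
    band, local_row, half_width, cell_fn = task
    return [cell_fn(window(band, local_row, col, half_width))
            for col in range(len(band[0]))]


def run_rows(grid, cell_fn, half_width=HALF_WIDTH,
             executor_cls=ProcessPoolExecutor, max_workers=None):
    """Process each row as an independent task.

    Pass executor_cls=ThreadPoolExecutor for quick debugging without
    starting worker processes.
    """
    tasks = []
    for row in range(len(grid)):
        r0 = max(0, row - half_width)
        band = grid[r0:row + half_width + 1]  # only the rows this task needs
        tasks.append((band, row - r0, half_width, cell_fn))
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(process_row, tasks))
```

With the default process-based executor, `cell_fn` must be a top-level function so that it can be sent to the workers, and on platforms that start workers by spawning a new interpreter the call to `run_rows` should sit under an `if __name__ == "__main__":` guard. Whether one row per task is the right granularity for the full grid is something to measure rather than assume.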
The following are the key steps required in this project:
1. Read the data from the provided file
2. Generate a CBE:
a. Divide the data into suitable sub-sets for parallel processing
b. Parallel process the clustering
i. Run the clustering algorithm at each data location
ii. Select the most populous cluster and use the centre as the new CBE value
1. If more than one cluster is most populous, take the mean of their centres
c. Save CBE to file
3. Generate a simple mean ensemble
a. Calculate the simple mean values for the ensemble
b. Save the simple mean to a file
4. Compare the ensemble to the observations:
a. Read the observations from file
b. Calculate difference (bias) between simple ensemble and observations
c. Calculate difference (bias) between CBE and observations
d. Calculate the difference between the CBE and simple ensemble bias
5. Plot results:
a. Plot the observations over a map
b. Plot the CBE over a map
c. Plot the simple ensemble over a map
d. Plot the CBE bias over a map
e. Plot the simple ensemble bias over a map
f. Plot the difference between CBE and simple ensemble biases over a map
Note: Observation data are not currently available.
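The comparison in step 4 can be sketched as below. The brief does not fix a sign convention, so this assumes bias means ensemble minus observation, and that the final comparison is the CBE bias minus the simple ensemble bias; both are assumptions, as are the function names. Since observations are not yet available, any values passed in would have to be synthetic.

```python
def subtract_grids(a, b):
    """Element-wise a - b for two grids of the same shape (lists of rows)."""
    return [[x - y for x, y in zip(a_row, b_row)] for a_row, b_row in zip(a, b)]


def compare_ensembles(simple_mean, cbe, observations):
    """Return (simple_bias, cbe_bias, bias_difference).

    ASSUMPTION: bias = ensemble - observation, and
    bias_difference = cbe_bias - simple_bias.
    """
    simple_bias = subtract_grids(simple_mean, observations)   # step 4b
    cbe_bias = subtract_grids(cbe, observations)              # step 4c
    bias_difference = subtract_grids(cbe_bias, simple_bias)   # step 4d
    return simple_bias, cbe_bias, bias_difference
```

The three returned grids correspond to the fields plotted in steps 5e, 5d and 5f.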