**Problem 1: Data Analysis**

### My Solutions

**ATTENTION**: Do not change the names of the given variables that are being assessed, since this will cause the test to classify the result as being incorrect.

**Problem**: You have been provided with a dataset called “HursleyHouseSubset.csv”. You have also been provided with an incomplete MATLAB script.

**Referenced files and functions**: For this problem, these include: “HursleyHouseSubset.csv” and “FeatureNamesWithIndices.png”. Both these files can be found on the Moodle page. There are also three function templates within the script which you will need to complete to generate the desired results.

**Aim**: Carry out basic exploratory data analysis to understand the structure and format of the given dataset and features.

**Objective**: Write MATLAB code where necessary to complete the given tasks, as listed in the provided script. You might need to uncomment certain lines of code for the script to function correctly.

**Assessment**: Your solutions to the different tasks will be assessed once you submit your script. For the tasks marked as (Pretest), you can check whether your solution is correct before submitting your script for grading by clicking on ‘Pretest’.

**Weighting**: This problem will count towards 4% of the final module grade for ELEC0032.

**TIPS and NOTES**

1) When running the script, you will get the following warning. This is fine for this problem and will not cause any issues so do __not__ try to fix it.

*Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property. Set 'PreserveVariableNames' to true to use the original column headers as table variable names.*

2) Read the comments in the given script carefully. They provide invaluable information and will help you understand what is required from each task.

3) Remember to Run Script first and check the Output before looking at the Test results. If the Output is returning errors, then you must fix these before addressing the Test results.

4) A common issue is that the variable names have been changed or entered incorrectly. Hence, read the warnings and error messages carefully to ensure that the variable names have been spelled correctly. Variable names are **case-sensitive** in MATLAB.

5) If you wish to explore the dataset and the workspace variables, you need to copy (and further develop) the provided script and given dataset in a proper MATLAB instance using either MATLAB Online or a local version of MATLAB running on your computer. You cannot look at the contents of the dataset or variables within MATLAB Grader. The referenced files, which you can import into the workspace of your MATLAB instance, can be found on the Moodle page.

6) Do not uncomment all lines of code simultaneously since this might result in many errors when running the script. Uncomment lines as you progress through the script.

% October 2019
% ELEC0032 assignment
% Problem 1: Data Analysis
% ATTENTION:
% 1) When referring to the column indices of the dataset's features,
% make sure you use the screenshot called "FeatureNamesWithIndices" as your
% reference for matching Feature Names to their Column Indices.
% 2) The function templates are provided at the end of this script. Do not
% change the location of these functions. They must appear at the end of
% the script.
% *************************************************************************
% T1: Complete the following line of code in order to import the dataset with suitable data types.

RawData = [];
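As a hedged illustration of importing with explicit data types, the sketch below writes a small hypothetical CSV (its file name, columns, and values are invented for this example and are not the assignment's dataset), lets `detectImportOptions` propose a type per column, and then overrides one of them before calling `readtable`. Which types count as "suitable" for the real file is for you to decide.

```matlab
% Build a tiny throwaway CSV; all names and values here are hypothetical.
toyFile = [tempname '.csv'];
fid = fopen(toyFile, 'w');
fprintf(fid, 'Date,Room,Temperature\n');
fprintf(fid, '2019-10-01,Kitchen,21.5\n');
fprintf(fid, '2019-10-02,Lounge,\n');   % empty field becomes a missing value
fclose(fid);

% detectImportOptions inspects the file and proposes one type per column.
opts = detectImportOptions(toyFile);

% Override a proposed type where the automatic guess is not what you want.
opts = setvartype(opts, 'Room', 'categorical');

toyData = readtable(toyFile, opts);
disp(class(toyData.Room))          % type chosen above
disp(class(toyData.Temperature))   % numeric column, missing entry stored as NaN

delete(toyFile);                   % clean up the temporary file
```

Inspecting `opts.VariableTypes` before calling `readtable` is a quick way to see what MATLAB would infer without any overrides.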

% *************************************************************************
% T2: How many observations are there in this dataset?

nbr_observations = []

`% T3: How many features are there in this dataset?`

nbr_features = []

% *************************************************************************
% T4: Extract the names of the features in the order that they are given in the dataset.

feature_names = []
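For a MATLAB table, observations correspond to rows and features to columns. A minimal sketch on a made-up table (the variable names `toy`, `nRows`, `nCols`, and `names` are hypothetical) shows the relevant built-ins:

```matlab
% Hypothetical three-row, two-column table.
toy = table([21.5; 22.0; NaN], {'a'; 'b'; 'c'}, ...
    'VariableNames', {'Temperature', 'Label'});

nRows = height(toy);                    % number of observations (rows)
nCols = width(toy);                     % number of features (columns)
names = toy.Properties.VariableNames;   % 1-by-nCols cell array, in column order
```

`size(toy)` returns both counts at once if you prefer a single call.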

% *************************************************************************
% T5-T7: Complete the function to determine the features that have missing data
% (their column indices) and the quantity of missing data for every feature
% REMINDER: Scroll to the end of the script to find the list of functions to be completed
% UNCOMMENT THE FOLLOWING LINE
%[features_with_missing_data, missing_data_per_feature] = fun_findmissingdata(RawData)
% *************************************************************************
% T8-T9: Complete the function to count the number of missing observations in
% the given dataset.
% UNCOMMENT THE FOLLOWING LINE
%nbr_of_missing_observations = fun_countmissingobservations(RawData)
% *************************************************************************
% T10: How many observations have missing data across more than one feature?

nbr_of_observations_with_multiple_missing_data = []
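One way to reason about missing data is through the logical mask returned by `ismissing`: summing down columns counts missing values per feature, and summing across rows counts them per observation. The sketch below uses an invented table and assumes that a "missing observation" means a row with at least one missing value; check that reading against the task's tests.

```matlab
% Hypothetical table with NaN marking missing entries.
toy = table([1; NaN; 3; NaN], [10; 20; NaN; NaN], [5; 6; 7; 8], ...
    'VariableNames', {'A', 'B', 'C'});

missingMask = ismissing(toy);                        % true where a value is missing

missingPerFeature   = sum(missingMask, 1);           % count per column
featuresWithMissing = find(missingPerFeature > 0);   % column indices with gaps

missingPerRow           = sum(missingMask, 2);       % count per row
rowsWithAnyMissing      = sum(missingPerRow > 0);    % rows with at least one gap
rowsWithMultipleMissing = sum(missingPerRow > 1);    % rows with two or more gaps
```

Here columns `A` and `B` each have two gaps, three rows are incomplete, and one row is missing more than one value.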

% *************************************************************************
% T11-T12: Complete the function to determine the features (their column indices)
% which are strictly quantitative in MATLAB terms?
% UNCOMMENT THE FOLLOWING LINE
%IndicesOfQuantFeatures = fun_determinequantfeatures(RawData)
% *************************************************************************
% T13: Create a Clean dataset with the missing observations removed and
% call it "CleanData".

CleanData = [];
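The following sketch shows one possible reading of these two steps on a made-up table. It assumes that "quantitative in MATLAB terms" can be tested with `isnumeric`, which may or may not match the grader's definition, and it uses `rmmissing` to drop incomplete rows. The names `toy`, `numericIdx`, and `cleanToy` are hypothetical.

```matlab
% Hypothetical table mixing numeric and categorical columns.
toy = table([1; NaN; 3], categorical({'x'; 'y'; 'z'}), [0.5; 0.6; 0.7], ...
    'VariableNames', {'A', 'Kind', 'B'});

% Apply isnumeric to every column; 'uniform' returns a plain logical vector.
isNum      = varfun(@isnumeric, toy, 'OutputFormat', 'uniform');
numericIdx = find(isNum);      % column indices of numeric features

% Remove every row that contains at least one missing value.
cleanToy = rmmissing(toy);
```

Note that `isnumeric` is false for `logical`, `categorical`, and `datetime` columns, so those are excluded under this assumption.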

% T14: Create a table summarising the minimum, maximum, range, median, average and
% standard deviation values for all quantitative predictors in this newly created Clean
% dataset.
% UNCOMMENT THE FOLLOWING LINE
%subset_with_quantitative_predictors = CleanData(:,IndicesOfQuantFeatures);
% UNCOMMENT THE FOLLOWING LINES
% START - DO NOT UNCOMMENT THIS LINE
%quantitative_predictor_stats = zeros(6,size(subset_with_quantitative_predictors,2));
% We can use a loop to create an array storing the statistics for each feature of the new subset.
%for i = 1:size(subset_with_quantitative_predictors,2)
%    quantitative_predictor_stats(1,i) = min(subset_with_quantitative_predictors{:,i});
%    % ENTER YOUR CODE HERE
%end
%quantitative_predictor_stats_table = array2table(quantitative_predictor_stats,...
%    'VariableNames',subset_with_quantitative_predictors.Properties.VariableNames,'RowNames',{'Min','Max','Range','Median','Average','Std Dev'})
% END - DO NOT UNCOMMENT THIS LINE
% *************************************************************************
% T15: Simply looking at the quantitative predictor statistics table, and using the
% information provided in the article below, which feature is most likely
% to have outliers (enter the column index of this feature)?
% Copy and paste the link below into your web browser to access the
% article:
% https://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+measures+of+central+tendency

feature_with_potential_outliers = []
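The six summary statistics can all be computed with base MATLAB functions. The sketch below uses an invented vector with one extreme value to show how an outlier pulls the mean away from the median, which is one signal worth looking for in a statistics table. The data and variable names are hypothetical, and a gap between mean and median is a hint rather than proof of outliers.

```matlab
% Hypothetical feature with a single extreme value.
x = [1; 2; 3; 4; 100];

xMin    = min(x);
xMax    = max(x);
xRange  = xMax - xMin;   % range computed without toolbox functions
xMedian = median(x);     % robust to the extreme value
xMean   = mean(x);       % pulled upwards by the extreme value
xStd    = std(x);

% A large gap relative to the spread of typical values suggests skew or outliers.
gapMeanMedian = abs(xMean - xMedian);
```

For this vector the median is 3 while the mean is 22, so the single value of 100 dominates the average.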

% *************************************************************************
% T16: With the aid of a scatterplot, determine the features (enter the column indices)
% whose distribution is most akin to a Gaussian distribution.

features_with_Gaussian_distribution = []

% LIST OF FUNCTIONS
% UNCOMMENT EACH FUNCTION AS YOU COMPLETE IT
% Test T7
% UNCOMMENT THE FOLLOWING LINES
%function [y1,y2] = fun_findmissingdata(x)
%    % ENTER YOUR CODE HERE
%    y1 = [];
%    y2 = [];
%end
% Test T9
% UNCOMMENT THE FOLLOWING LINES
%function y = fun_countmissingobservations(x)
%    % ENTER YOUR CODE HERE
%    y = [];
%end
% Test T12
% UNCOMMENT THE FOLLOWING LINES
%function y = fun_determinequantfeatures(x)
%    % ENTER YOUR CODE HERE
%    y = [];
%end

**Problem 2: Statistical Learning**

### My Solutions

**ATTENTION**: Do not change the names of the given variables that are being assessed, since this will cause the test to classify the result as being incorrect.

**Problem**: Problem 2 continues from Problem 1. In this case, however, a Clean dataset is used as the starting point.

**Referenced files and functions**: For this problem, these include: “DataForProblem2.mat” which can be found on the Moodle page.

**Aim**: Analyse the relationship between different features and predict response values.

**Objective**: Write MATLAB code where necessary to complete the given tasks, as listed in the provided script. You might need to uncomment certain lines of code for the script to function correctly.

**Assessment**: Your solutions to the different tasks will be assessed once you submit your script. For the tasks marked as (Pretest), you can check whether your solution is correct before submitting your script for grading by clicking on ‘Pretest’.

**Weighting**: This problem will count towards 5% of the final module grade for ELEC0032.

**TIPS and NOTES**

Refer to the tips and notes for Problem 1.

% October 2019
% ELEC0032 assignment
% Problem 2: Statistical Learning
% In this part, we wish to analyse the relationship between different features
% and predict response values. First, let us load the dataset for this problem.

`load('DataForProblem2.mat')`

`% T1`

ImportedDataset = CleanData;

% *************************************************************************
% T2: Using simple linear regression, what is the predicted temperature for a
% pressure equal to 980 units considering only these two variables in the model
% (round your result to two decimal points precision)
% ENTER CODE HERE

predicted_temperature = []

% *************************************************************************
% T3: What percentage of the variance in the temperature is explained by
% the linear model in T2 (express it using two decimal points precision)?

percentage_variability = []
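A simple linear regression with one predictor can be sketched with `fitlm` from the Statistics and Machine Learning Toolbox (assumed to be available here). The data below are invented and unrelated to the assignment's temperature and pressure columns. The sketch fits the model, predicts at a new input, and reads off R-squared, the share of response variance explained by the model.

```matlab
% Hypothetical data: roughly y = 2x + 1 with small deterministic noise.
x = (1:10)';
noise = [0.1; -0.1; 0.2; -0.2; 0.1; -0.1; 0.2; -0.2; 0.1; -0.1];
y = 2 * x + 1 + noise;

mdl = fitlm(x, y);                     % model: y ~ 1 + x1

xNew = 12;                             % hypothetical new predictor value
yPred = round(predict(mdl, xNew), 2);  % prediction rounded to two decimals

% Ordinary R-squared as a percentage of explained variance.
pctExplained = round(100 * mdl.Rsquared.Ordinary, 2);
```

When working from a table, the formula form `fitlm(T, 'Response ~ Predictor')` avoids having to extract columns first.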

% *************************************************************************
% T4-T7: Suppose now that we wish to investigate the differences in temperature (response)
% based on occupancy status (input) across all rooms but ignoring the other variables.
% Using a linear regression model, answer the following questions (round your results using two decimal points precision).
% ENTER CODE HERE
% T4
% What is the average temperature when a room is occupied?

average_temp_when_occupied = []

% T5
% What is the average temperature when a room is unoccupied?

average_temp_when_unoccupied = []

% T6
% What is the average difference in temperature between an occupied and unoccupied room (enter the absolute value)

average_diff_in_temp = []

% T7
% Does the model indicate that there is statistical evidence of a difference in average temperature based on the occupancy status (enter 1 if true or 0 if false)?

is_there_statistical_evidence = []
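With a single 0/1 predictor, the intercept of a linear regression equals the mean response of the 0 group, and the intercept plus the slope equals the mean of the 1 group. The slope's p-value then tests whether the two group means differ. The sketch below illustrates this on invented data; the 0.05 significance threshold is an assumption, not something the task states, and all variable names are hypothetical.

```matlab
% Hypothetical data: three unoccupied (0) and three occupied (1) readings.
occupied = [0; 0; 0; 1; 1; 1];
temp     = [20; 21; 22; 23; 24; 25];

mdl = fitlm(occupied, temp);
b = mdl.Coefficients.Estimate;         % [intercept; slope]

meanWhenZero = b(1);                   % mean response of the 0 group
meanWhenOne  = b(1) + b(2);            % mean response of the 1 group
absDiff      = abs(b(2));              % difference between the group means

pSlope = mdl.Coefficients.pValue(2);   % tests H0: slope = 0
hasEvidence = double(pSlope < 0.05);   % 0.05 is an assumed threshold
```

The fitted means match `mean(temp(occupied == 0))` and `mean(temp(occupied == 1))`, which is a useful sanity check on the coefficients.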

% *************************************************************************
% We now wish to develop a model which predicts whether a room (or more
% precisely a location within a room) has high or low temperature.
% *************************************************************************
% *************************************************************************
% T8: First, we need to carry out some feature engineering to remove the
% unwanted features and add new features which could be of use in our model
% Create a new data subset that includes features 6, 8, 9, 10, 11
% ENTER CODE HERE
% Now extract the day (in numeric format) from the date in the dataset
% and store it in a new variable called DayNum. Add this new variable to your new subset called CleanDataSubset.
% ENTER CODE HERE
% Finally, create a binary variable called tempHL which contains a 1 if the temperature
% is above the median and a 0, if the temperature is below or equal to the median.
% Add this variable to your new CleanDataSubset.
% ENTER CODE HERE
% Check that the newly created subset is correct.
% The new column indices should be as follows:
%        1               2                  3                 4              5            6          7
% {'Temperature'} {'LightLevel'} {'RelativeHumidity'} {'NoiseLevel'} {'Pressure'} {'DayNum'} {'tempHL'}

NewDataSubset = [];
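The two engineered features described above can be sketched on a made-up table. The example assumes the date column is stored as a `datetime` (the real dataset may need converting first) and uses hypothetical names and values throughout.

```matlab
% Hypothetical table with a datetime column and a numeric column.
dates = datetime(2019, 10, [1; 15; 22; 30]);
temps = [18; 21; 21; 25];
toy = table(dates, temps, 'VariableNames', {'Date', 'Temperature'});

% day() returns the day of the month as a number.
toy.DayNum = day(toy.Date);

% 1 if strictly above the median, 0 if below or equal to it.
toy.tempHL = double(toy.Temperature > median(toy.Temperature));
```

Values equal to the median fall into the 0 class, which is why the two readings of 21 are both labelled 0 here.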

% *************************************************************************
% T9-T10: Which features are reasonably correlated/anti-correlated with tempHL?
% (in this problem, by reasonable we mean where the correlation coefficient is equal to or above 0.3)?
% Hint: Convert the dataset to a numerical array to compute the correlation matrix.
%T9: First compute the correlation coefficients

cormat = []

% UNCOMMENT THE FOLLOWING LINES
%cormat_table = array2table(cormat,...
%    'VariableNames', CleanDataSubset.Properties.VariableNames,...
%    'RowNames',CleanDataSubset.Properties.VariableNames)
% T10: Then indicate which variables show correlation with tempHL.
% Enter 1 if strongly correlated, enter -1 if strongly anti-correlated,
% and enter 0 if not strongly correlated.
% Replace the feature names in the vector below with 1, 0 or -1. For example:
% vector_showing_strong_corr = [roTemp roLightLevel roRH roNoiseLevel roPressure roDayNum]
% vector_showing_strong_corr = [-1 -1 -1 -1 -1 -1]
% ENTER YOUR ANSWER HERE

vector_showing_strong_corr = []
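The hint about converting to a numerical array can be illustrated as follows. The table is invented, the last column plays the role of the binary target, and the 0.3 threshold is applied to the magnitude of the coefficient, which is one reading of the task statement. All names are hypothetical.

```matlab
% Hypothetical all-numeric table; the last column is the binary target.
toy = table([1; 2; 3; 4; 5], [5; 4; 3; 2; 1], [2; 1; 2; 1; 2], [0; 0; 0; 1; 1], ...
    'VariableNames', {'Up', 'Down', 'Flat', 'tempHL'});

% corrcoef needs a numeric matrix, so convert the table first.
cormatToy = corrcoef(table2array(toy));

% Correlation of every other column with the target (last row of the matrix).
corrWithTarget = cormatToy(end, 1:end-1);

% Flag each feature as 1, -1 or 0 against the threshold.
threshold = 0.3;
flags = zeros(size(corrWithTarget));
flags(corrWithTarget >=  threshold) =  1;
flags(corrWithTarget <= -threshold) = -1;
```

In this toy case `Up` is flagged 1, `Down` is flagged -1, and `Flat` stays 0 because its coefficient is about -0.17.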

% *************************************************************************
% T11-T14: Using the new subset CleanDataSubset with the added DayNum and tempHL variables,
% we now wish to split the data into training and test sets. The training
% set should comprise data for all features for the first 21 days in a month (inclusive of the 21st day),
% while the test set should comprise the remaining data.
% How many observations are in the training set and how many in the test
% set?
% ENTER CODE HERE
% T11 - Training data
trainData = [];
% T12 - Test data

testData = [];

```
% T13
nbr_training_observations = []
```

```
% T14
nbr_test_observations = []
```
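Splitting a table by a day-of-month cut-off amounts to logical indexing on the rows. A minimal sketch on made-up data follows; the names `toy`, `isTrain`, `trainToy`, and `testToy` are hypothetical.

```matlab
% Hypothetical table with a day-of-month column.
toy = table([1; 10; 21; 22; 30], [18; 19; 20; 21; 22], ...
    'VariableNames', {'DayNum', 'Temperature'});

isTrain  = toy.DayNum <= 21;    % days 1 to 21 inclusive
trainToy = toy(isTrain, :);     % rows for the training set
testToy  = toy(~isTrain, :);    % all remaining rows

nTrain = height(trainToy);
nTest  = height(testToy);
```

Reusing the same logical mask for any later subset keeps the training and test rows aligned across tables.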

% ************************************************************************* % T15-T22: Now that we have identified the training and test sets, we wish to % create yet a further reduced subset which in turn will be used for our KNN models.

% T15
% UNCOMMENT THE FOLLOWING LINE
%DataKNN = NewDataSubset(:,[1 5:7]);
% Now split the new KNN subset (which we have called DataKNN) into training
% and test subsets.
% T16

trainKNN = [];

```
% T17
testKNN = [];
```

% We also need to extract the response variable into separate training and test
% variables themselves for the KNN model
% T18
train_tempHL = [];

```
% T19
test_tempHL = [];
```

% ATTENTION: Any code for the KNN model must appear after the following locked lines
% referring to the random number generator. Otherwise, you risk obtaining incorrect results.
% To ensure that results are reproduced the same each time, a random number
% generator is reset each time a new variant of the KNN model (with
% different number of neighbours) is run.

rng('default'); rng(1,'twister');

% T20 - Check that the correct KNN variant is being used.
% Use K = 1 and configure the KNN model to break the ties at random.
knnfit1 = [];

% Now we wish to use the test data to generate (predict) KNN estimates for 100
% different KNN models, where the K for each model varies between K = 1 and
% K = 100.
% NOTE: For this particular task, you may treat the test set as a form of
% validation set, that is to say, you can use the same test set for each of
% the 100 KNN models.

maxK = 100;

```
j = zeros(1,maxK);
for i = 1:maxK
    rng(1,'twister');
    % Use the same KNN variant as for knnfit1
    knnfit = [];
    knnpred = [];
    j(i) = mean(knnpred ~= test_tempHL);
end
```

```
% T21 - What is the minimum error amongst the 100 KNN models?
minerror = []
```

```
% T22 - What is the value of K that corresponds to this minimum error?
bestK = []
```
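The overall pattern of fitting one KNN classifier per value of K and picking the K with the lowest misclassification rate can be sketched with `fitcknn` from the Statistics and Machine Learning Toolbox (assumed to be available). The data below are synthetic, the loop only runs to 15 to keep the example short, and every variable name is hypothetical. Whether predictors should be standardised first is a separate choice that the sketch leaves out.

```matlab
% Synthetic two-class data: two Gaussian clouds, purely illustrative.
rng(1, 'twister');
Xtrain = [randn(30, 2); randn(30, 2) + 3];
ytrain = [zeros(30, 1); ones(30, 1)];
Xtest  = [randn(10, 2); randn(10, 2) + 3];
ytest  = [zeros(10, 1); ones(10, 1)];

maxKToy = 15;
errToy = zeros(1, maxKToy);
for k = 1:maxKToy
    rng(1, 'twister');   % reset so random tie-breaking is reproducible
    fit = fitcknn(Xtrain, ytrain, 'NumNeighbors', k, 'BreakTies', 'random');
    pred = predict(fit, Xtest);
    errToy(k) = mean(pred ~= ytest);   % misclassification rate for this K
end

% min returns the smallest error and the first K at which it occurs.
[minErrToy, bestKToy] = min(errToy);
```

If several values of K tie for the lowest error, `min` reports the smallest such K, which is worth bearing in mind when reading the plotted error curve.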

% Plot the error for different values of K
figure()
plot(j)
title('Error versus K')

% T23 - Have linear regression models been used?
% T24 - Have KNN models been used?