APS1070

Basic Principles and Models – Project 1

This project is individual – it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).

Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.

## Marking Scheme

This project is worth 14 marks of your final grade.

Draw a plot or table where necessary to summarize your findings.

Practice vectorized coding: if you need to write a loop in your solution, think about how you could implement the same functionality with vectorized operations. Avoid loops as much as possible (in some cases, loops are inevitable).
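As a quick illustration of what "vectorized" means here, the sketch below computes the same element-wise squares twice, once with a Python loop and once with a single NumPy operation (the array and its size are arbitrary choices for the example):

```python
import numpy as np

x = np.arange(1_000_000, dtype=float)

# Loop version: squares each element one at a time in Python.
squares_loop = np.empty_like(x)
for i in range(x.size):
    squares_loop[i] = x[i] ** 2

# Vectorized version: one NumPy operation over the whole array,
# which runs in optimized C code instead of the Python interpreter.
squares_vec = x ** 2
```

Both arrays hold identical values, but the vectorized version is typically orders of magnitude faster.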

 

Project 1 [14 Marks]

Let’s apply the tools we have learned in the tutorial to a new dataset.

We’re going to work with a breast cancer dataset. Download it using the cell below:

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
```

 

2.1 Part 1: Getting started [4 Marks]

First off, take a look at the data, target and feature_names entries in the dataset dictionary. They contain the information we’ll be working with here. Then, create a Pandas DataFrame called df containing the data and the targets, with the feature names as column headings. If you need help, see here for more details on how to achieve this. [1]

  • How many features do we have in this dataset?
  • What are the target classes?
  • What do these target classes signify?
  • How many participants tested Malignant?
  • How many participants tested Benign?
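One way to assemble such a DataFrame (a sketch; the column name `"target"` for the labels is our choice):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Feature matrix with the feature names as column headings,
# plus a "target" column holding the class labels.
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target

print(df.shape)  # (569, 31): 30 features plus the target column
```

Inspecting `dataset.target_names` tells you which integer label corresponds to malignant and which to benign, and `df["target"].value_counts()` gives the participant counts per class.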

 ### YOUR CODE HERE ###

Use seaborn.lmplot (help here) to visualize a few features of the dataset. Draw a plot where the x-axis is “mean radius”, the y-axis is “mean texture,” and the color of each datapoint indicates its class. Do this once again for different features for the x- and y-axis and see how the data is distributed. [1]
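A minimal sketch of such a plot, assuming the DataFrame `df` built above with a `"target"` column for the class:

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target

# Scatter plot coloured by class; fit_reg=False suppresses
# the regression line that lmplot would otherwise draw.
grid = sns.lmplot(x="mean radius", y="mean texture", hue="target",
                  data=df, fit_reg=False)
```

Swapping other feature names into `x=` and `y=` produces the additional plots the question asks for.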

Standardizing the data is often critical in machine learning. Show a plot as above, but with two features with very different scales. Standardize the data and plot those features again. What’s different? Why? [1]

It is best practice to have a training set (from which there is a rotating validation subset) and a test set. Our aim here is to (eventually) obtain the best accuracy we can on the test set (we’ll do all our tuning on the training/validation sets, however). To tune k (our hyperparameter), we employ cross-validation (Help). Cross-validation automatically selects validation subsets from the data that you provided. Split the dataset into a train and a test set “70:30”, use random_state=0. The test set is set aside (untouched) for final evaluation, once hyperparameter optimization is complete. [1]
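The split itself can be done with `train_test_split`; a sketch with the requested 70:30 ratio and `random_state=0`:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()

# 70:30 split; X_test/y_test are set aside untouched until Part 4.
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)
```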

### YOUR CODE HERE ###

2.2 Part 2: KNN Classifier without Standardization [3 Marks]

Normally, standardizing data is a key step in preparing data for a KNN classifier. However, for educational purposes, let’s first try to build a model without standardization. Let’s create a KNN classifier to predict whether a patient has a malignant or benign tumor.

Follow these steps:

  1. Train a KNN classifier using cross-validation on the dataset. Sweep k (number of neighbours) from 1 to 100, and show a plot of the mean cross-validation accuracy vs k. [1]
  2. What is the best k? Comment on which ks lead to underfitted or overfitted models. [1]
  3. Can you get the same accuracy (roughly) with fewer features using a KNN model? You’re free to use trial-and-error to remove features (try at least 5 combinations), or use a more sophisticated approach like Backward Elimination. Describe your findings using a graph or table (or multiple!). [1]
### YOUR CODE HERE ###
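A sketch of one way to run the sweep in step 1, assuming 5-fold cross-validation on the training set (the fold count is our choice, not mandated by the handout):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.3, random_state=0)

# Mean 5-fold CV accuracy on the training set for each k from 1 to 100.
ks = range(1, 101)
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k),
                    X_train, y_train, cv=5).mean()
    for k in ks
]

best_k = ks[int(np.argmax(mean_scores))]
```

Plotting `mean_scores` against `ks` with matplotlib gives the accuracy-vs-k curve the question asks for, and `best_k` is the k that maximizes mean CV accuracy.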
 

2.3 Part 3: Standardization [2 Marks]

Standardizing the data usually means scaling our data to have a mean of zero and a standard deviation of one.

Note: When we standardize a dataset, do we care if the data points are in our training set or test set? Yes! The training set is available for us to train a model – we can use it however we want.

The test set, however, represents data that is not available to us during training. For example, the test set can represent the data that someone who bought our model would use to see how it performs (data they are not willing to share with us). Therefore, we cannot compute the mean or standard deviation of the whole dataset to standardize it – we can only calculate the mean and standard deviation of the training set. However, when we sell the model, we can also report our scaler’s parameters (the mean and standard deviation of our training set), and the buyer can scale their data (the test set) with them. Of course, there is no guarantee that the test set will then have a mean of zero and a standard deviation of one, but it should work fine.

To summarize: We fit the StandardScaler only on the training set. We transform both training and test sets with that scaler.
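The fit-on-train, transform-both pattern above can be sketched as follows (reusing the 70:30 split from Part 1):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.3, random_state=0)

# Fit the scaler on the training set ONLY...
scaler = StandardScaler().fit(X_train)

# ...then apply the training set's mean and std to both sets.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```

After this, `X_train_std` has mean 0 and standard deviation 1 per feature, while `X_test_std` is close to, but not exactly, standardized, as discussed above.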

  1. Create a KNN classifier with standardized data (Help), and reproduce all steps in Part 2. [1]
  2. Does standardization lead to better model performance? Is performance better or worse? Discuss. [1]
### YOUR CODE HERE ###

2.4 Part 4: Test Data [1 Mark]

Now that you’ve created several models, pick your best one (highest accuracy) and apply it to the test dataset you had initially set aside. Discuss. [1]

### YOUR CODE HERE ###

2.5 Part 5: New Dataset [4 Marks]

Find an appropriate classification dataset online and train a KNN model to make predictions.

  • Introduce your dataset. [1]
  • Create a KNN classifier using the tools you’ve learned. [2]
  • Present your results. [1]

Hint: you can find various datasets here: https://www.kaggle.com/datasets and here: https://scikit-learn.org/stable/datasets/index.html#toy-datasets.

To use a dataset in Colab, you can upload it to your Google Drive and access it in Colab (help here), or you can download the dataset to your local machine and upload it directly to Colab using the following script.

```python
from google.colab import files

uploaded = files.upload()
```

When submitting your project on Quercus, please make sure you are also uploading your dataset so we can fully run your notebook.

### YOUR CODE HERE ###
