Homework 2: K-means clustering via scikit-learn
First, please accept the assignment on GitHub Classroom by clicking the link on Piazza to create the assignment repository. Please see Homework 1 instructions for how to clone your repository and submit. Remember to tag your submission with a “submission” tag and update the tag as necessary.
For each part below, we have provided skeleton code in part1.py, part2.py, and part3.py. We have also included one or two example outputs in the files part1.txt, etc., so that you can check the first few lines of your program output and verify the output format.
NOTE 1: The first part of your program output should EXACTLY match the example output files, character by character, including the output format. This is important so that we can automatically grade your submissions.
NOTE 2: Please make sure to use print statements to write to stdout instead of writing to the text files. When we run your program, it should actually print to stdout. For example, when run from the command line, the output should look like the following:
$ python part1.py
2
[53 97]
3
[62 50 38]
Note that this output is produced simply by calling print.
If you want to save the output to a file (optional), you can redirect stdout to the corresponding text file like so:
$ python part1.py > part1.txt
You may receive 0 points for a part if you do not follow these instructions since we are using automatic grading.
Part 1: Different number of clusters
We have provided code in part1.py to load the classic Iris flowers dataset.
For this part, run k-means via scikit-learn’s sklearn.cluster.KMeans estimator. Use all the default parameter settings except set random_state=0 so that everyone’s code produces the same output. Vary the number of clusters (the n_clusters parameter of KMeans) from 2 to 10 and fit the estimator to the dataset.
For each number of clusters,
- Print the number of clusters.
- Print the number of points in each cluster.
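The loop above can be sketched as follows. This is a minimal sketch that assumes Iris is loaded with sklearn.datasets.load_iris; your part1.py already contains the actual loading code, so use that instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: the skeleton loads Iris roughly like this.
X, _ = load_iris(return_X_y=True)

for n_clusters in range(2, 11):  # 2 through 10 inclusive
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    print(n_clusters)
    # np.bincount counts how many points received each cluster label
    print(np.bincount(kmeans.labels_))
```

The cluster sizes come from counting the entries of `labels_`, the per-point cluster assignments produced by `fit`.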
Part 2: Different random seeds
For this part, we have loaded a sample double-circles dataset. This part illustrates that the algorithm does not always converge to the same solution and that data may not fall into nice round clusters.
Run KMeans with default parameters except with the number of clusters set to 3.
Vary the random state from 0 to 5 inclusive (i.e. [0, 1, 2, 3, 4, 5]) and fit the estimator to the dataset. For each random state,
- Print the random state.
- Print the cluster centers as a 3 x 2 matrix.
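A minimal sketch of this part follows. The actual double-circles data comes from the skeleton in part2.py; make_circles below, with made-up parameters, is only a stand-in so the sketch is self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Assumption: stand-in data; part2.py provides the real dataset.
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

for random_state in range(6):  # 0 through 5 inclusive
    kmeans = KMeans(n_clusters=3, random_state=random_state).fit(X)
    print(random_state)
    # cluster_centers_ is an (n_clusters, n_features) array: 3 x 2 here
    print(kmeans.cluster_centers_)
```

Because the data are 2-D and n_clusters=3, `cluster_centers_` prints as a 3 x 2 matrix, and different random states can yield different centers.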
Part 3: Different maximum iteration and confusion matrix
For this last part, we will use the digits dataset, which consists of small 8x8 grayscale images of the handwritten numbers 0-9.
We will be using the scikit-learn function sklearn.metrics.confusion_matrix to compare our clustering labels to the true class labels (i.e. the true digit each image represents). See scikit-learn’s confusion matrix documentation for more information at:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
Essentially, we will treat our clustering algorithm as doing classification and then evaluate via standard classification evaluation metrics. Because clustering labels are just dummy indices and don’t correspond to class labels, we will need to permute the clustering labels so that they (approximately) align with the true class labels. Note that in general this is not necessarily possible and may not be obvious, but in this simple case, the results are reasonable. We have already provided the function for you to permute (or map) your clustering labels to predicted class labels. Thus, when computing the confusion matrix you will use the true class labels and the permuted labels.
(If a dataset has class labels, this is one way to evaluate new clustering methods, but it is not useful in real-world applications since class labels are, by the definition of clustering, not available.)
For this part, use default parameters for KMeans except set the random state to 0 and the number of clusters to 10.
Vary the maximum number of iterations (the max_iter parameter) over 1, 5, 10, and 50 (in that order). For each value, fit the estimator, permute the cluster labels, and compute the confusion matrix from the true labels and the permuted labels. Then:
- Print the maximum number of iterations.
- Print the confusion matrix.
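A sketch of this part, assuming the digits dataset is loaded with sklearn.datasets.load_digits. The permute_labels function below is only a hypothetical stand-in for the mapping helper already provided in part3.py; this stand-in maps each cluster to its most frequent true digit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix

X, y_true = load_digits(return_X_y=True)

def permute_labels(cluster_labels, y_true):
    # Stand-in for the helper in part3.py: map each cluster index
    # to the most frequent true label among its points.
    mapping = {}
    for c in np.unique(cluster_labels):
        mapping[c] = np.bincount(y_true[cluster_labels == c]).argmax()
    return np.array([mapping[c] for c in cluster_labels])

for max_iter in [1, 5, 10, 50]:
    kmeans = KMeans(n_clusters=10, random_state=0, max_iter=max_iter).fit(X)
    y_pred = permute_labels(kmeans.labels_, y_true)
    print(max_iter)
    # Rows are true digits, columns are predicted (permuted) labels
    print(confusion_matrix(y_true, y_pred))
```

With more iterations, the clustering typically stabilizes and the diagonal of the confusion matrix grows heavier, which is the point of varying max_iter.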