Project 3 – Ensemble Methods and Unsupervised Learning
In this project you will explore some techniques in unsupervised learning
as well as ensemble methods. It is important to realize that understanding
an algorithm or technique requires understanding how it behaves under a
variety of circumstances. You will go through the process of choosing and
exploring two classification datasets, tuning the algorithms you have
learned about, writing a thorough analysis of your findings, and presenting
your findings. The most crucial part of this assignment is the analysis and
your ability to explain and justify your results.
I. Choosing Datasets
The first task in this assignment is choosing two interesting classification
datasets; these can be binary or multiclass. The features can be of any
type, and it is recommended that you choose datasets with diverse feature
sets. I don’t care where you get the data from. You can download some,
take some from your own research, or make some up on your own. What I
do care about is that the datasets must be interesting. They should
contain a decent number of features and a sufficiently large number of
examples. Do not choose an "easy" dataset; however, don't go crazy trying
to find the perfect one either. Your two datasets should also differ in some
way such that you can compare and contrast your results between the
two. You should also be following standard machine learning practice by
splitting your dataset into training and testing, and only touching the
testing dataset at the very end, when you are ready to report results.
(Cross-validation is highly recommended.)
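As a minimal sketch of this workflow (using scikit-learn and a built-in placeholder dataset; substitute your own chosen data), the idea is to hold out the test set once up front, tune with cross-validation on the training split only, and score the test set exactly once at the end:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder dataset; swap in one of your two chosen datasets.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set once, up front; do not touch it until final reporting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Tune and compare models with cross-validation on the training split only.
model = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Only at the very end: fit on all training data, score the held-out test set.
model.fit(X_train, y_train)
print("Test accuracy: %.3f" % model.score(X_test, y_test))
```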
II. Coding (10%)
After choosing your datasets you will now be tasked with writing code to apply
the machine learning algorithms you have learned about. Your code must be
written in Python, but you may use any libraries that already implement the
machine learning algorithms (e.g., scikit-learn). You are not expected to code
the algorithms from scratch, and in fact I would highly discourage it. What you
may not do is copy code from the internet. Below are the analyses you are
required to run.
1) Run K-means and Hierarchical Clustering on your datasets and analyze
what you observe.
2) Run two dimensionality reduction algorithms (PCA and UMAP) on your
datasets. Observe and analyze the results.
3) Re-run K-means and Hierarchical Clustering on your dimensionality-
reduced datasets and compare the results to part (1).
4) Tune and train two ensemble models (AdaBoost and Random Forests) on
both your original and dimensionality-reduced datasets. Compare and
analyze the results.
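A rough sketch of steps (1)–(4) with scikit-learn, on a built-in placeholder dataset (UMAP comes from the separate umap-learn package and would be applied analogously to PCA below; all choices here, such as the number of clusters and components, are illustrative, not prescribed):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Placeholder dataset; swap in your own. Clustering and PCA are
# scale-sensitive, so standardize the features first.
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# (1) K-means and hierarchical clustering on the original features.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("ARI (K-means, original):", adjusted_rand_score(y, km_labels))

# (2) Dimensionality reduction (UMAP would plug in here the same way).
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# (3) Re-run clustering on the reduced data and compare to (1).
km_pca = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)
print("ARI (K-means, PCA):", adjusted_rand_score(y, km_pca))

# (4) Ensembles on original vs. reduced features (tune properly with CV).
for name, data in [("original", X), ("PCA", X_pca)]:
    for clf in (AdaBoostClassifier(random_state=0),
                RandomForestClassifier(random_state=0)):
        scores = cross_val_score(clf, data, y, cv=5)
        print(name, type(clf).__name__, round(scores.mean(), 3))
```

Comparing cluster assignments against the true labels with a metric like adjusted Rand index is one way to quantify the "compare the results" steps, alongside visual inspection of the reduced data.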
Your code does not have to be pretty or well written. However, it must be written
in Python, and I must be able to run one script (main.py) that produces all the
results and figures in your report.
III. Report (80%)
You will then produce a report describing and analyzing your methods and
results. Here you will describe the datasets you have chosen and why they are
interesting. You will then provide an analysis on how the different machine
learning algorithms performed on each dataset. The report must be limited to 10
pages maximum. Plots and figures are highly recommended. It is up to you
how you wish to demonstrate your understanding of the machine learning
algorithms you have explored, but below I have listed some potential ideas for
analysis and items you may wish to include in the report.
• A description of your two datasets and why you feel that they are interesting.
• Hypotheses on how you believe the learning algorithms will perform on each
dataset and why.
• How you handled different feature types in your datasets, as well as missing
data and differing feature scales
• Training and testing error rates you obtained for your various learning
algorithms (some form of cross-validation is highly recommended)
• The effect of hyperparameters on performance
• Comparing and contrasting results between datasets
• Comparing and contrasting results between learning algorithms
• Training and testing error rates as a function of training dataset size
• Timing analysis of how long it takes to train/test each algorithm
• Conclusions
• Ideas for future analyses
• What you may have done differently
• References
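For the learning-curve and timing bullets above, scikit-learn's learning_curve utility is one option; a sketch on a placeholder dataset (the training-size fractions and classifier are just examples):

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(random_state=0)

# Cross-validated train/validation scores at increasing training-set sizes;
# these are the raw numbers behind an error-vs-dataset-size plot.
sizes, train_scores, val_scores = learning_curve(
    clf, X, y, train_sizes=[0.2, 0.5, 1.0], cv=5)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.3f}, val={va:.3f}")

# Simple wall-clock timing of a single fit, for the timing analysis.
start = time.perf_counter()
clf.fit(X, y)
print(f"fit time: {time.perf_counter() - start:.2f}s")
```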