Take Hank's course here: https://www.datacamp.com/courses/unsupervised-learning-in-r
Many times in machine learning, the goal is to find patterns in data without trying to make predictions. This is called unsupervised learning. One common use case of unsupervised learning is grouping consumers based on demographics and purchasing history to deploy targeted marketing campaigns. Another example is wanting to describe the unmeasured factors that most influence crime differences between cities. This course provides a basic introduction to clustering and dimensionality reduction in R from a machine learning perspective, so that you can get from data to insights as quickly as possible.
Hi! I'm Hank Roark, I'm a long-time data scientist and user of the R language, and I'll be your instructor for this course on unsupervised learning in R.
In this first chapter I will define ‘unsupervised learning’, provide an overview of the three major types of machine learning, and you will learn how to execute one particular type of unsupervised learning using R.
There are three major types of machine learning. The first type is unsupervised learning. The goal of unsupervised learning is to find structure in unlabeled data. Unlabeled data is data without a target, without labeled responses.
Contrast this with supervised learning. Supervised learning is used when you want to make predictions on labeled data, on data with a target.
Types of predictions include regression, predicting how much of something there is or could be, and classification, predicting what type or class something is or could be.
The final type is reinforcement learning, where a computer learns from feedback by operating in a real or synthetic environment.
Here is a quick example of the difference between labeled and unlabeled data. The table on the left contains three observations about shapes, each shape with three features represented by the three columns; this is an example of unlabeled data. If an additional vector of labels is added, like the column on the right-hand side, assigning each observation to one of two groups, then we would have labeled data.
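The distinction can be sketched in R with a small data frame; the feature names and values here are hypothetical placeholders, not the ones from the video's table:

```r
# Unlabeled data: three observations, three features, no target column
shapes <- data.frame(
  feature1 = c(1.2, 0.8, 3.5),
  feature2 = c(0.5, 0.4, 2.1),
  feature3 = c(2.0, 1.9, 0.7)
)

# Labeled data: the same observations plus a vector of group labels
shapes_labeled <- cbind(shapes, group = c("A", "A", "B"))
```

The only difference between the two data frames is the `group` column, the target; supervised learning would predict it, unsupervised learning ignores labels entirely.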
Within unsupervised learning there are two major goals. The first goal is to find homogeneous subgroups within a population. As an example let us pretend we have a population of six people. Each member of this population might have some attributes, or features — some examples of features for a person might be annual income, educational attainment, and gender. With those three features one might find there are two homogeneous subgroups, or groups where the members are similar by some measure of similarity. Once the members of each group are found, we might label one group subgroup A and the other subgroup B. The process of finding homogeneous subgroups is referred to as clustering.
There are many possible applications of clustering. One use case is segmenting a market of consumers or potential consumers. This is commonly done by finding groups, or clusters, of consumers based on demographic features and purchasing history. Another example of clustering would be to find groups of movies based on features of each movie and the reviews of the movies. One might do this to find movies most like another movie.
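One common clustering algorithm, k-means, is available in base R as `kmeans()`. A minimal sketch on simulated data follows; the simulated two-group data and the choice of `centers = 2` are assumptions for illustration, not part of the video:

```r
set.seed(42)  # reproducible random data and cluster starts

# Simulate a population with two homogeneous subgroups of 10 members each,
# described by two numeric features (e.g., standardized income and education)
population <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),  # subgroup centered near (0, 0)
  matrix(rnorm(20, mean = 3), ncol = 2)   # subgroup centered near (3, 3)
)

# k-means assigns each observation to one of two clusters
km <- kmeans(population, centers = 2)
km$cluster  # cluster membership: one integer (1 or 2) per observation
```

The `km$cluster` vector plays the role of the "subgroup A / subgroup B" labels described above, except that here the algorithm discovered them rather than being given them.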
The second goal of unsupervised learning is to find patterns in the features of the data. One way to do this is through ‘dimensionality reduction’. Dimensionality reduction is a method to decrease the number of features to describe an observation while maintaining the maximum information content under the constraints of lower dimensionality.
Beyond finding patterns in the features of the data, dimensionality reduction is often used for two further purposes.
Dimensionality reduction allows one to visually represent high dimensional data while maintaining much of the data variability. This is done because visually representing and understanding data with more than 3 or 4 features can be difficult for both the producer and consumer of the visualization.
The third major reason for dimensionality reduction is as a preprocessing step for supervised learning. More on this usage will be covered later.
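Principal component analysis (PCA), available in base R as `prcomp()`, is one widely used dimensionality-reduction technique. Here is a minimal sketch using the built-in `iris` measurements; choosing `iris` as the example data is my assumption, not the video's:

```r
# Reduce four numeric features to principal components,
# scaling each feature to unit variance first
pr <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of the data's variability each component retains
summary(pr)$importance["Proportion of Variance", ]

# Plotting the first two components visualizes the
# four-dimensional data in two dimensions
plot(pr$x[, 1], pr$x[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "Iris measurements projected onto two principal components")
```

For these data, the first two components retain most of the variability, which is what makes a two-dimensional plot a faithful summary of the original four features.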
Finally, a few words about the challenges and benefits typically encountered in performing unsupervised learning.
In unsupervised learning there is often no single goal of the analysis. This can be presented as someone asking you, the analyst, “to find some patterns in the data.” With that challenge, unsupervised learning often demands and brings out the deep creativity of the analyst.
There is also much more unlabeled data in the world than labeled data. This means there are more opportunities to apply unsupervised learning in your work.
Now it's your turn to practice what you've learned.