The main goal of this reading is to understand enough statistical methodology to be able to leverage the machine learning algorithms in Python’s scikit-learn library and then apply this knowledge to solve a classic machine learning problem.
The first stop of our journey will take us through a brief history of machine learning. Then we will dive into different algorithms. On our final stop, we will use what we learned to solve the Titanic Survival Rate Prediction Problem.
Some disclaimers:
- I am a full-stack software engineer, not a machine learning algorithm expert.
- I assume you know some basic Python.
- This is exploratory, so not every detail is explained like it would be in a tutorial.
With that noted, let’s dive in!
A Quick Introduction to Machine Learning Algorithms
As soon as you venture into this field, you realize that machine learning is less romantic than you may think. Initially, I was full of hope that after I learned more, I would be able to construct my own Jarvis AI, which would spend all day coding software and making money for me so that I could spend whole days outdoors reading books, driving a motorcycle, and enjoying a reckless lifestyle while my personal Jarvis made my pockets deeper. However, I soon realized that the foundation of machine learning algorithms is statistics, which I personally find dull and uninteresting. Fortunately, it did turn out that “dull” statistics have some very fascinating applications.
You will soon discover that to get to those fascinating applications, you need to understand statistics very well. One of the goals of machine learning algorithms is to find statistical dependencies in supplied data.
The supplied data could be anything from checking blood pressure against age to recognizing handwritten text based on the color of various pixels.
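As a tiny, concrete illustration of what a “statistical dependency” can mean, here is a minimal sketch; the numbers below are entirely made up:

```python
# A minimal sketch of "finding a statistical dependency":
# the data points here are invented purely for illustration.
import numpy as np

age = np.array([25, 35, 45, 55, 65, 75])                    # years
blood_pressure = np.array([118, 121, 127, 131, 140, 146])   # systolic, mmHg

# Pearson correlation coefficient: a value close to +1 indicates
# a strong positive linear dependency between the two variables.
r = np.corrcoef(age, blood_pressure)[0, 1]
print(f"correlation between age and blood pressure: {r:.2f}")
```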
That said, I was curious to see whether I could use machine learning algorithms to find dependencies in cryptographic hash functions (SHA, MD5, etc.). However, you can’t really do that, because proper crypto primitives are constructed in such a way that they eliminate dependencies and produce output that is extremely hard to predict. I believe that, given an infinite amount of time, machine learning algorithms could crack any crypto model.
Unfortunately, we don’t have that much time, so we need to find another way to efficiently mine cryptocurrency. How far have we gotten up until now?
A Brief History of Machine Learning Algorithms
The roots of machine learning algorithms come from Thomas Bayes, an English statistician who lived in the 18th century. His paper An Essay Towards Solving a Problem in the Doctrine of Chances underpins Bayes’ Theorem, which is widely applied in the field of statistics.
In the 19th century, Pierre-Simon Laplace published Théorie analytique des probabilités, expanding on the work of Bayes and defining what we know of today as Bayes’ Theorem. Shortly before that, Adrien-Marie Legendre had described the “least squares” method, also widely used today in supervised learning.
The 20th century is the period when the majority of publicly known discoveries were made in this field. Andrey Markov invented Markov chains, which he used to analyze poems. Alan Turing proposed a learning machine that could become artificially intelligent, basically foreshadowing genetic algorithms. Frank Rosenblatt invented the Perceptron, sparking huge excitement and great coverage in the media.
But then the 1970s saw a lot of pessimism around the idea of AI—and thus, reduced funding—so this period is called an AI winter. The rediscovery of backpropagation in the 1980s caused a resurgence in machine learning research. And today, it’s a hot topic once again.
The late Leo Breiman distinguished between two statistical modeling paradigms: data modeling and algorithmic modeling. “Algorithmic modeling” refers, more or less, to machine learning algorithms such as the random forest.
Machine learning and statistics are closely related fields. According to Michael I. Jordan, the ideas of machine learning, from methodological principles to theoretical tools, have had a long prehistory in statistics. He also suggested data science as a placeholder term for the overall problem that machine learning specialists and statisticians are both implicitly working on.
Categories of Machine Learning Algorithms
The machine learning field stands on two main pillars called supervised learning and unsupervised learning. Some people also consider a new field of study—deep learning—to be separate from the question of supervised vs. unsupervised learning.
Supervised learning is when a computer is presented with examples of inputs and their desired outputs. The goal of the computer is to learn a general formula that maps inputs to outputs (see the sketch after this list). This can be further broken down into:
- Semi-supervised learning, which is when the computer is given an incomplete training set with some outputs missing
- Active learning, which is when the computer can only obtain training labels for a very limited set of instances. When used interactively, the training examples can be presented to a human user for labeling.
- Reinforcement learning, which is when the training data is only given as feedback to the program’s actions in the dynamic environment, such as driving a vehicle or playing a game against an opponent
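To make the supervised idea concrete, here is a minimal sketch, assuming scikit-learn is installed; the inputs and labels are made-up toy data (loosely Titanic-flavored), not a real dataset:

```python
# A minimal supervised learning sketch (hypothetical toy data):
# the computer sees example inputs with their desired outputs and
# learns a general mapping from one to the other.
from sklearn.tree import DecisionTreeClassifier

# Inputs: [age, fare paid]; outputs: 0 = did not survive, 1 = survived.
X_train = [[22, 7.25], [38, 71.28], [26, 7.93], [35, 53.10]]
y_train = [0, 1, 0, 1]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn the input-output mapping

print(model.predict([[30, 60.0]]))   # predict the output for a new input
```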
In contrast, unsupervised learning is when no labels are given at all and it’s up to the algorithm to find the structure in its input. Unsupervised learning can be a goal in itself when we only need to discover hidden patterns.
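By contrast, here is a minimal unsupervised sketch, again with made-up points, using scikit-learn’s KMeans to group unlabeled inputs:

```python
# A minimal unsupervised learning sketch (hypothetical toy data):
# no labels are given; the algorithm groups similar inputs on its own.
from sklearn.cluster import KMeans

X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],   # one cloud of points
     [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]]   # another cloud of points

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)              # discover structure without labels

print(labels)   # e.g., [0 0 0 1 1 1] or [1 1 1 0 0 0]
```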
Deep learning is a new field of study which is inspired by the structure and function of the human brain and based on artificial neural networks rather than just statistical concepts. Deep learning can be used in both supervised and unsupervised approaches.
In this article, we will only go through some of the simpler supervised machine learning algorithms and use them to calculate the survival chances of an individual in the tragic sinking of the Titanic. But in general, if you’re not sure which algorithm to use, a nice place to start is scikit-learn’s machine learning algorithm cheat-sheet.
Basic Supervised Machine Learning Models
Perhaps the easiest possible algorithm is linear regression. Sometimes this can be graphically represented as a straight line, but despite its name, if there’s a polynomial hypothesis, this line could instead be a curve. Either way, it models the relationship between a scalar dependent variable y and one or more explanatory variables denoted by x.
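As a minimal sketch, assuming scikit-learn is installed, here is how a linear regression could be fit to the made-up blood-pressure-versus-age numbers from earlier; for a polynomial hypothesis, you would first expand the inputs with something like sklearn.preprocessing.PolynomialFeatures:

```python
# A minimal sketch of fitting a linear regression with scikit-learn.
# The data points are invented purely for illustration.
from sklearn.linear_model import LinearRegression

# Explanatory variable x (age) and dependent variable y (blood pressure).
X = [[25], [35], [45], [55], [65], [75]]
y = [118, 121, 127, 131, 140, 146]

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[50]]))           # predicted blood pressure at age 50
```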