# Machine Learning: introduction to predictive algorithms

Francesca Bigardi, senior project manager at EDALAB, during the internal course on Machine Learning and advanced predictive techniques

# Machine Learning: Introduction

Artificial intelligence is essentially a marketing concept; what computer science and statistics actually implement is **machine learning** and its branch, **deep learning**. Machine learning is the ability of a computer to learn from experience (or rather, to learn and improve automatically without being explicitly programmed), for example by modifying its processes based on the data it acquires.

The three main ingredients of machine learning are **data**, **software** and **hardware**, which combined create predictive (and non-predictive) models. Machine learning algorithms use mathematical and computational methods to extract information directly from data, without relying on predetermined equations as a model. They improve their performance in an "automatic and adaptive" way as they are exposed to more of the data to be learnt.

On the software side, it is worth stressing that Python remains the main programming language for machine learning, with many algorithms already implemented.

N.B. There are no machine learning models already configured or standardised for particular cases: the value of machine learning derives from the quality of the available data, which varies from scenario to scenario. What can be found ready-made are tools (such as TensorFlow or PyTorch) which, however, do not create value on their own, because the machine learning must be tailored to the specific application in question.

What are we talking about?

- **Autonomous driving**? The essence of machine learning.
- The suggestions of online services like **Amazon** or **Netflix**? The application of machine learning to everyday life.
- Knowing what your customers say about **your company on Twitter**? Machine learning combined with the creation of language rules.
- **Detection** of **fraud**? One of its less obvious but increasingly frequent uses.
- **Energy efficiency** and **sustainability**? What we at EDALAB are studying and analyzing.
- Improvement of process quality, reduction of inefficiencies, saving time and cost, identification and isolation of anomalies? **What we at EDALAB propose to do.**

What are the **basic elements** to create machine learning?

- Ability to prepare and select heterogeneous data
- Ability to implement basic and advanced algorithms
- Automated, iterative processes.
- Scalability.
- Ensemble modelling.

Some **tools** to start understanding machine learning

- In machine learning, a target is called a label.
- In statistics, a target is called a dependent variable.
- A statistical variable is called a feature in machine learning.
- A statistical transformation is called feature creation in machine learning.

**Deep learning** combines increasingly powerful computers with special neural networks to learn the patterns present in large volumes of data. Deep learning techniques are currently state of the art at identifying objects in images and words in audio. Researchers are now trying to apply these successes in pattern recognition to more complex tasks, such as automatic language translation, medical diagnosis and many other important areas, both social and business.

**Machine Learning: meaning**

**Machine Learning** refers to the ability of machines (understood as computers) to learn without having been explicitly programmed beforehand (at least in the traditional sense of computer science).

The term was coined in 1959 by Arthur Lee Samuel, an American pioneer in the field of Artificial Intelligence, although, to date, the definition most widely accepted by the scientific community is the one provided by another American, **Tom Michael Mitchell**, director of the Machine Learning department of Carnegie Mellon University:

**"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E".**

In other words: Machine Learning allows computers to learn from experience (not in the "human" sense, but as it applies to a computer program); learning takes place when the performance of the program improves after carrying out a task or completing an action (even a wrong one, on the assumption that the principle of "learning from mistakes" also holds for humans).

From the computing point of view, Machine Learning is not a matter of writing code that tells the machine, step by step, what to do. Instead, the program is provided with data sets which, processed through specific algorithms, let it develop its own logic to perform a function oriented to a specific objective.

**How Machine Learning works: major categories**

Machine Learning works on the basis of two different approaches, identified by Samuel in the late 1950s, which split it into two sub-categories depending on whether the computer is given complete examples to use as guidance for the required task (**supervised learning**) or the software is left to work without any "help" (**unsupervised learning**).

In reality things are more complicated, and there are numerous subsets that allow machine learning to be classified in more detail according to how it operates. Let us go into more detail with some practical examples.

**A) Supervised Learning**

In this machine learning technique we have a problem that can be formalised by defining a target variable, which consists of the information we want to obtain. The function f(x) is the machine learning model that relates x (input) to y (output).

In this case the computer is given both the input data sets and the information about the desired results, with the objective that the system identify a general rule able to connect the input data to the output data. The aim is to then reuse this rule for other similar tasks.

There are two further subdivisions:

– **regression**: if the output takes one or more continuous values

– **classification**: if the output takes one of a finite set of values

Regression analysis is a technique used to analyse a data set consisting of a dependent variable and one or more independent variables. The aim is to estimate a possible functional relationship between the dependent variable and the independent variables.

Now let's talk about the main steps in carrying out machine learning with a supervised approach.

**Preprocessing**

To start building your own machine learning model you first have to carry out the process called **data screening**, which defines the statistics and quantities to be included in the dataset. This study involves splitting/clustering the data in order to analyse its quality, reliability and accuracy. The data is then cleaned to make it representative, through the statistical calculation of deviance, covariance, correlation, deviation, interpolation and several other factors.

A crucial phase in this process is **data reduction**, which guides the choice of modelling approach recommended for the data set:

- Dimensionality reduction followed by modelling: neural networks, ordinary least squares
- Direct modelling: stepwise selection, ridge regression, LASSO
- Deep learning (large amounts of data)

There are several techniques for performing data reduction; we report the two main ones:

- Correlation analysis, by which redundant variables are removed
- PCA (Principal Component Analysis), by which derived variables are created through linear transformations, keeping only the principal components
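The two data reduction techniques above can be sketched with NumPy as follows. This is only an illustrative sketch: the synthetic data, the 0.95 correlation cut-off and the choice of two components are assumptions made for the example, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features; the third is almost a copy of the first.
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 2.0 * x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Correlation analysis: keep a feature only if it is not highly
# correlated with a feature that has already been kept.
corr = np.corrcoef(X, rowvar=False)
threshold = 0.95  # assumed cut-off
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < threshold for k in keep):
        keep.append(j)
X_reduced = X[:, keep]

# PCA: centre the data, then project it onto the directions of largest variance.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # share of variance per component
n_components = 2                  # assumed number of components to keep
X_pca = Xc @ Vt[:n_components].T

print("kept features:", keep)
print("explained variance ratio:", explained.round(4))
```

Correlation analysis discards original columns, whereas PCA builds new derived columns as linear combinations of all of them.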

**OLS – Ordinary Least Squares**

Ordinary least squares (OLS) is a form of statistical regression used in machine learning to predict unknown values from an existing dataset. Let's take an example of where this technique can be used: assume a basic scenario of predicting shoe size from a data set that includes height and shoe size. Given the data, you can use the ordinary least squares formula to estimate a rate of change (the slope) and predict a subject's shoe size from their height. In short, OLS takes an input, the independent variable, and produces an output, the dependent variable. In machine learning this can be relevant in some cases.

N.B. With large data sets, the problem is often poorly conditioned.

Because ordinary least squares is a form of regression, used to produce forecasts from sample data, it is widely used in machine learning. Using the above example, a machine learning algorithm can process and analyse sample data that includes both height and shoe size. Given the data points and using ordinary least squares, the algorithm can start making predictions about an individual's shoe size.
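As a minimal sketch of the shoe-size example, the single-variable OLS line can be fitted with the textbook closed-form formulas. The heights and sizes below are made-up values, and `fit_ols` is a hypothetical helper name used only for this illustration.

```python
# Hypothetical sample: heights in cm and EU shoe sizes (made-up values).
heights = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
sizes = [37.0, 38.5, 40.0, 41.0, 43.0, 44.5]

def fit_ols(x, y):
    """Return (slope, intercept) minimising the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx                       # the "rate of change"
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_ols(heights, sizes)

# Predict the shoe size of a subject who is 172 cm tall.
predicted = slope * 172.0 + intercept
residuals = [y - (slope * x + intercept) for x, y in zip(heights, sizes)]
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction={predicted:.2f}")
```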

### Machine Learning Ridge Regression

Let us take a simple example to see how this statistical technique works and how it can be useful for machine learning.

The linear regression model is given by the following equation:

$$Y = \sum_{j=0}^{D} W_j H_j(X_i)$$

The sum Σ runs from j = 0 to j = D, where D is the total number of features.

Wⱼ is the j-th coefficient.

Hⱼ is the j-th feature function, which takes the observation Xᵢ.

Xᵢ is the i-th observation.

The above equation gives the predicted value, provided that the values of the coefficients W are known.

For simplicity, we denote the above equation by F(X), where X is an observation.

The cost function of the linear regression model is given by the following equation:

$$\text{Cost function} = \mathrm{RSS}(W) = \sum_{i=0}^{N} \left[Y_i - F(X_i)\right]^2$$

The sum Σ runs from i = 0 to i = N, where N is the total number of observations.

Yᵢ is the known value of the i-th observation.

F(Xᵢ) gives the predicted value of the i-th observation.

RSS stands for Residual Sum of Squares.

Keep in mind that the cost function always works on the training data set.

The whole idea of the linear regression model revolves around minimizing the value of the cost function above.

In general, to decrease the cost function, we increase the number of features in our model. As we keep adding features, the model begins to fit the training data set well and the value of the cost function begins to decrease.

But as the number of features increases, our equation becomes a higher-order polynomial, and this leads to overfitting the data.

**Why is overfitting negative?**

In an overfitted model the training error becomes almost zero, so the model appears to work perfectly on the training data set. But does that model work as well on data sets other than the training one, such as real data from the outside world?

Generally, an overfitted model performs worse on the test data set, and worse again on any new, additional test data set.

An overfitted model works well on the training data: the cost function is zero for the training data.

But when we evaluate this model on the test data set, the model does not work well. For the test data, the model produces values that are far from the correct ones. This is enough to label the model as unfit for real use.
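The gap between training and test error can be reproduced with a small experiment. The sketch below assumes a made-up linear ground truth with Gaussian noise and compares a degree-1 fit with a degree-9 polynomial on ten training points; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    # Hypothetical ground truth used only to generate data.
    return 2.0 * x + 1.0

# Ten noisy training points, and test points halfway between them.
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = (x_train[:-1] + x_train[1:]) / 2.0
y_test = true_fn(x_test) + rng.normal(scale=0.2, size=x_test.size)

def rss(coeffs, x, y):
    """Residual sum of squares of a polynomial with the given coefficients."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = {
        "train_rss": rss(coeffs, x_train, y_train),
        "test_rss": rss(coeffs, x_test, y_test),
        "max_abs_coeff": float(np.max(np.abs(coeffs))),
    }

for degree, r in results.items():
    print(degree, r)
```

The degree-9 polynomial passes through every training point, so its training RSS is essentially zero, while its test RSS and the magnitude of its coefficients are typically much larger than those of the straight line.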

**How do we identify overfitting?**

By plotting the model, you can easily see overfitting (observing how closely the model follows the training data). But as the complexity of our model increases, it moves into higher dimensions, which makes it difficult to visualise on charts (or with other tools).

Instead of always trying to plot the model, we can also detect overfitting by looking at the values of the coefficients (W). Generally, when overfitting occurs, the values of these coefficients become very large.

Ridge regression is used to quantify overfitting of the data by measuring the magnitude of the coefficients.

To solve the overfitting problem, we need to balance two things:

1. How well the model function fits the data.

2. Magnitude of the coefficients.

Ridge Regression Cost = RSS(W) + λ*||W||²

We added λ to the total cost function as a tuning parameter to balance the fit to the data against the magnitude of the coefficients.

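In matrix form, the ridge regression cost is:

$$\text{Ridge Regression Cost} = \mathrm{RSS}(W) + \lambda \lVert W \rVert^2 = (Y - HW)^\top (Y - HW) + \lambda W^\top W$$

Taking the gradient of the above equation (differentiation):

$$\nabla\left[\mathrm{RSS}(W) + \lambda \lVert W \rVert^2\right] = \nabla\left\{(Y - HW)^\top (Y - HW)\right\} + \lambda \nabla\left\{W^\top W\right\} = -2H^\top (Y - HW) + 2\lambda W$$

Setting the gradient above to 0 you get:

$$W = (H^\top H + \lambda I)^{-1} H^\top Y$$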

We therefore know the values of the coefficients W.
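The closed-form solution can be checked numerically. The sketch below is a minimal NumPy illustration on synthetic data; `ridge_fit` is a hypothetical helper name, and, as in the derivation above, every coefficient is penalised (no separate unpenalised intercept).

```python
import numpy as np

def ridge_fit(H, Y, lam):
    """Solve (H^T H + lam * I) W = H^T Y for the ridge coefficients W."""
    n_features = H.shape[1]
    A = H.T @ H + lam * np.eye(n_features)
    # Solving the linear system is numerically safer than forming the inverse.
    return np.linalg.solve(A, H.T @ Y)

# Synthetic data (assumed for illustration): 50 observations, 3 features.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 3))
true_W = np.array([1.5, -2.0, 0.5])
Y = H @ true_W + 0.1 * rng.normal(size=50)

lam = 10.0
W_ols = ridge_fit(H, Y, lam=0.0)    # lam = 0 reduces to ordinary least squares
W_ridge = ridge_fit(H, Y, lam=lam)

# The gradient -2 H^T (Y - H W) + 2 lam W should vanish at the ridge solution.
gradient = -2 * H.T @ (Y - H @ W_ridge) + 2 * lam * W_ridge

print("OLS coefficients:  ", W_ols.round(3))
print("ridge coefficients:", W_ridge.round(3))
```

Increasing λ shrinks the norm of W, which is exactly the balance between data fit and coefficient magnitude described above.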

**How to choose the suitable λ value for machine learning?**

The data set for machine learning is divided into three main sets:

**1. Training set**

This data set will be used to obtain the value of the coefficients W for each value of λ. Suppose that the value of the coefficients W for each value of λ is Wλ.

**2. Validation set**

The different values of Wλ are evaluated on the validation set. The one with the lowest error is selected.

**3. Test set**

The selected coefficients Wλ are then re-evaluated on the test data set.

The above method can only be used in machine learning if there is sufficient data.

And this is how the value of λ is finally selected. The process is a kind of brute force, but with educated guesses and experience the number of iterations needed to identify λ can be reduced.
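The three-set procedure can be sketched as a simple grid search. The 60/20/20 split, the candidate λ values and the synthetic data below are assumptions chosen for the example.

```python
import numpy as np

def ridge_fit(H, Y, lam):
    """Closed-form ridge solution W = (H^T H + lam * I)^-1 H^T Y."""
    A = H.T @ H + lam * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ Y)

def rss(H, Y, W):
    """Residual sum of squares of the predictions H @ W."""
    return float(np.sum((Y - H @ W) ** 2))

# Synthetic data (assumed): 300 observations, 5 features, two of them irrelevant.
rng = np.random.default_rng(42)
H = rng.normal(size=(300, 5))
Y = H @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.5 * rng.normal(size=300)

# 1. Training set, 2. validation set, 3. test set (assumed 60/20/20 split).
H_train, Y_train = H[:180], Y[:180]
H_val, Y_val = H[180:240], Y[180:240]
H_test, Y_test = H[240:], Y[240:]

candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {}
for lam in candidate_lambdas:
    W_lam = ridge_fit(H_train, Y_train, lam)     # fit on the training set
    val_errors[lam] = rss(H_val, Y_val, W_lam)   # score on the validation set

# Select the lambda with the lowest validation error, then re-evaluate on the test set.
best_lambda = min(val_errors, key=val_errors.get)
W_best = ridge_fit(H_train, Y_train, best_lambda)
test_error = rss(H_test, Y_test, W_best)

print("best lambda:", best_lambda, "test RSS:", round(test_error, 3))
```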

In this simple example we have seen a linear regression cost function that takes overfitting into account through the tuning parameter λ.

**Limits of machine learning**

Machine learning predictions can change dramatically with small perturbations of the input.

**LASSO**

**Least Absolute Shrinkage and Selection Operator regression**, known as **LASSO regression**, is a regularised version of *Linear Regression*: by adding a regularisation term, weighted by a parameter called *alpha* (written λ in the formula below), to the **cost function**, the learning algorithm is forced to keep the *weights* as low as possible.

Unlike **Ridge regression**, which shrinks the *weights* of some *features* and reduces their contribution to the model, **LASSO regression** performs a real **selection** of the independent variables (*feature selection*): it drives the weights of the others to **zero**, generating a *sparse model* (with few *nonzero features*).

The lasso cost function has the form:

$$L = \sum_{i} (\hat{Y}_i - Y_i)^2 + \lambda \sum_{j} |\beta_j|$$

The only difference from Ridge regression is that the regularisation term uses the absolute value rather than the square. But this difference has a huge impact on the trade-off we discussed earlier. The lasso method overcomes the disadvantage of Ridge regression by not only penalising high values of the β coefficients, but also setting them to zero if they are not relevant. Therefore, you may end up with fewer features in the model than you started with, which is a huge advantage.
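To see the zeroing effect in practice, the sketch below minimises the lasso cost with coordinate descent and soft-thresholding. This is one common way to solve the problem, not necessarily the method used in the course; the data, the λ value and the function names are assumptions.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for: sum((X @ beta - y)**2) + lam * sum(|beta|)."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    z = np.sum(X ** 2, axis=0)   # squared norm of each column
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual computed as if feature j were excluded from the model.
            residual_without_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual_without_j
            beta[j] = soft_threshold(rho, lam / 2.0) / z[j]
    return beta

# Synthetic data (assumed): only the first and last features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)

beta_ols = lasso_fit(X, y, lam=0.0)     # no penalty: ordinary least squares
beta_lasso = lasso_fit(X, y, lam=20.0)  # penalty large enough to drop weak features

print("without penalty:", beta_ols.round(3))
print("with lasso:     ", beta_lasso.round(3))
```

With the penalty switched on, the coefficients of the two irrelevant features become exactly zero, while the relevant ones are only slightly shrunk.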

There is always a need to validate the stability of a machine learning model. In other words, it is not enough to fit the model to the training data and hope that it will work accurately on real data it has never seen before.

There must be some guarantee that the model has captured the patterns in the data correctly without being excessively affected by noise: bias and variance must be as low as possible.

### Machine Learning Cross Validation

In this machine learning process it is decided whether the numerical results quantifying the assumed relationships between the variables are acceptable as a description of the data. Generally an error estimate for the model is made after training, better known as the evaluation of residuals.

In this process, a numerical estimate is made of the difference between the predicted and the original responses, also called the training error. However, this only gives us an idea of how well our model does on the data used to train it. The model may still be underfitting or overfitting the data.

**Machine Learning Cross Validation k-Fold**

Cross-validation is a re-sampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k, which refers to the number of groups into which a given data sample is to be split. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it can be used in place of k in the name of the procedure, such as k=10 becoming 10-fold cross-validation.

Cross-validation is mainly used in machine learning to estimate the skill of a model on unseen data. That is, a limited sample is used to estimate how the model is expected to perform in general when making predictions on data not used during training.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model's skill than other methods, such as a simple train/test split.

The general procedure is as follows:

- Shuffle the data set randomly.
- Split the data set into k groups.
- For each individual group:
  - Take the group as the test (hold-out) data set.
  - Take the remaining groups as the training data set.
  - Fit a model on the training set and evaluate it on the test set.
  - Retain the evaluation score and discard the model.
- Summarise the skill of the model using the sample of model evaluation scores.

It is important to note that each observation in the data sample is assigned to a single group and stays in that group for the duration of the procedure. This means that each sample is used in the hold-out set once and used to train the model k-1 times.

This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the model is fit on the remaining k-1 folds.

It is also important that any data preparation prior to fitting the model takes place on the training data set assigned by cross-validation within the loop, rather than on the larger data set. This also applies to any hyperparameter tuning. Failure to perform these operations within the loop may result in data leakage and an optimistic estimate of the model's skill.

The results of a k-fold cross-validation run are often summarised with the mean of the model's skill scores. It is also good practice to include a measure of the variance of the skill scores, such as the standard deviation or standard error.
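The k-fold procedure described above can be sketched in plain Python. The function names, the toy "predict the training mean" model and the data are hypothetical and serve only to show the loop structure.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, score, seed=0):
    """Return the mean and standard deviation of the k hold-out scores."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on k-1 folds, evaluate on the held-out fold, then discard the model.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return statistics.mean(scores), statistics.stdev(scores)

def fit_mean(xs, ys):
    """Toy model (assumed for illustration): always predict the training mean."""
    return statistics.mean(ys)

def mse(model, xs, ys):
    """Mean squared error of the constant prediction on the hold-out fold."""
    return statistics.mean((y - model) ** 2 for y in ys)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
mean_score, std_score = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse)
print(f"MSE: {mean_score:.2f} +/- {std_score:.2f}")
```

Any preprocessing or hyperparameter tuning would go inside `fit`, so that it only ever sees the training folds.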

**B) Unsupervised Learning**

This category of Machine Learning covers the case in which only data sets are supplied to the system, without any indication of the desired result. The objective in this case is to uncover hidden patterns and models, or to identify a logical structure in the inputs, without these having been labelled beforehand.

**C) Reinforcement Learning**

In this method the system interacts with a dynamic environment (from which it obtains its input data) to achieve a goal, for which it receives a "reward", and it also learns from its errors. The behaviour and performance of the system are determined by a learning routine based on rewards and error detection.

This is the model by which, for example, a computer learns to beat a human at a game, concentrating its efforts on carrying out a set of tasks aimed at maximising its objective value. The system learns while performing its tasks and making errors, improving its performance on the basis of previously achieved results.
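One widely used reward-based technique is tabular Q-learning, which the text does not name; it is shown here only as a small illustration of learning from rewards. The five-state corridor, the reward of 1 at the goal and the learning parameters are all assumptions made for the example.

```python
import random

N_STATES = 5               # states 0..4; state 4 is the goal
ACTIONS = (0, 1)           # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9    # assumed learning rate and discount factor

def step(state, action):
    """Deterministic toy environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action] value estimates

for _ in range(500):                 # episodes
    state = 0
    for _ in range(1000):            # safety cap on steps per episode
        action = rng.choice(ACTIONS)             # purely random exploration
        next_state, reward, done = step(state, action)
        # Move the estimate towards reward + discounted best future value.
        target = reward if done else reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state
        if done:
            break

# Greedy policy learnt from the rewards: best action in each non-goal state.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print("policy (1 = right):", policy)
```

The agent is never told that moving right is correct; it infers this from the reward received at the goal and propagated back through the value estimates.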

**D) Semi-supervised**

In this case the model is a "hybrid": the computer is provided with an incomplete set of data for training/learning; some of these inputs come with their respective output examples (as in supervised learning), others do not (as in unsupervised learning). The basic objective is always the same: to identify rules and functions for problem solving, as well as models and data structures useful for achieving certain objectives.

**E) Other Practical Approaches to Machine Learning: from Probabilistic Models to Deep Learning**

Clearly there are other sub-categories of Machine Learning, which serve as a practical "classification" identifying more hands-on approaches to the application of Machine Learning algorithms.

One example is decision trees, which are based on graphs.

A decision tree is a system with n input variables and m output variables. The input variables (attributes) are derived from observation of the environment. The final output variables, instead, identify the decision/action.

Note. In very deep decision trees, the intermediate output variables coming out of the parent nodes coincide with the input variables of the child nodes. Intermediate output variables determine the path to the final decision.

Each node checks a condition (test) on a particular property of the environment (variable) and has two or more branches leading downward, one per outcome. The process consists of a sequence of tests. It always starts from the root node, the parent node located at the top of the structure, and proceeds downward.
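The traversal from the root node down to a decision can be sketched with nested dictionaries. The attributes (`temperature_high`, `vibration_high`) and the decisions are invented for this illustration and are not taken from an EDALAB model.

```python
# Each internal node tests one attribute; each leaf holds a decision.
# The tree below is a hand-written, hypothetical example.
tree = {
    "test": "temperature_high",
    "branches": {
        True: {
            "test": "vibration_high",
            "branches": {
                True: {"decision": "stop machine"},
                False: {"decision": "schedule maintenance"},
            },
        },
        False: {"decision": "continue"},
    },
}

def decide(node, observation):
    """Start at the root and follow one branch per test until a leaf is reached."""
    while "decision" not in node:
        outcome = observation[node["test"]]
        node = node["branches"][outcome]
    return node["decision"]

observation = {"temperature_high": True, "vibration_high": False}
print(decide(tree, observation))
```

Because every step is an explicit test on a named attribute, a person can read off exactly why a given decision was reached.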

Decision trees have the undisputed advantage of simplicity. They are easy to understand and execute. Compared to a neural network, a decision tree is easily understood by humans, so a person can verify how the machine arrives at its decision.

In addition, Boolean decision trees are easy to turn into program code, because they can be represented in any propositional language.

**The disadvantages**

Decision tree representation is ill-suited to complex problems, because the hypothesis space becomes too large. The space complexity of the algorithm could become prohibitive.

**Other categories of Machine Learning**

There are also other sub-categories of Machine Learning that serve as a sort of "practical" classification because, in fact, they identify practical approaches to the application of Machine Learning algorithms (from which categories of "learning" systems may in turn be derived).

Another concrete example is "clustering", that is, the mathematical models that make it possible to group "similar" data, information, objects, etc.; it is a practical application of Machine Learning behind which there are different learning models, ranging from the identification of structures (what defines a cluster and what its nature is) to the recognition of the "objects" that must belong to one group rather than another.
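A classic clustering algorithm is k-means, used here only as an example since the text does not name a specific method. The sketch below groups six made-up 2-D points into two clusters, starting from two hand-picked initial centroids.

```python
import math

def k_means(points, centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid            # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = [points[0], points[3]]   # assumed starting centroids

centroids, clusters = k_means(points, initial)
print("centroids:", centroids)
print("clusters:", clusters)
```

No labels are supplied: the groups emerge purely from the distances between the points.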

Then there is the sub-category of "probabilistic models", which base the system's learning process on probability calculations (the best known is perhaps the "Bayesian network", a probabilistic model that represents in a graph a set of random variables and their conditional dependencies).

Finally, there are the well-known artificial neural networks, whose learning algorithms are inspired by the structure, operation and connections of biological neural networks (i.e. those of human beings). In the case of so-called multi-layer neural networks, one then enters the field of **Deep Learning**.

**Machine Learning: Applications**

The applications of Machine Learning are already very numerous today, and some of them have entered our daily lives without us actually realising it.

Think for example of the use of **search engines**: given one or more keywords, these engines return lists of results (the so-called **SERP – Search Engine Results Page**) which are the product of machine learning algorithms with unsupervised learning (they output the information considered relevant to the search performed, based on the analysis of patterns, models and structures in the data).

Another common example is the **spam filters of e-mail services**, based on machine learning systems that continuously learn both to intercept suspicious or fraudulent email messages and to act accordingly (for example by deleting them before they are delivered to users' inboxes). Similar systems, with even greater sophistication, are also used in the finance sector for the **prevention of fraud** (such as the cloning of credit cards) and of **data and identity theft**; the algorithms learn to act by linking events, user habits, spending preferences, etc., information through which they are then able to identify in real time any abnormal behaviour that could indicate a theft or a fraud.

We at EDALAB are studying Machine Learning models to reduce inefficiencies in industrial processes, improve the final quality of products by acting at intermediate stages, and increase the energy efficiency of plants, machinery and buildings. The challenge is certainly fascinating, and thanks also to the Computer Science Department of the University of Verona we think we can take these challenges as an opportunity to keep improving.