When I started my journey into Python, I didn’t know about Data Science. As I explained in another article, Python was at first a way for me to scrape and retrieve useful information for my work. Then I used it to automate my tasks, because I was alone in handling the implementation for 33 websites. As I automated more and more things, I also wanted to analyze what I was actually doing and get some meaning out of the data I was helping to process… and so came the interest in Data Science.

As I knew Python pretty well (or so I thought… ahahah…), I was like: “OK, let’s import scikit-learn and run some analysis”. **SPOILER ALERT: This is not that easy!**

Improving algorithms and doing pure data science is only about 10% of the job when you are into advanced data analysis (aka data science). Cleaning and preprocessing the data represents the other 90%. This is a common saying, but it is true and it needs to be highlighted: it may be the most important part of the job.

In this article, I will cover non-algorithmic methods that (mostly) need to be applied to your data set before you run any algorithm.

Most of the explanations come from the scikit-learn documentation.

You can have a look: https://scikit-learn.org/stable/index.html

## Data Cleaning

This part will be pretty short because I already tackled this topic in an article, when I was dealing with the Munich House Market data: Cleaning Scraping Data

This is clearly a part where you have to be inventive, because your ideas determine how much data you keep and how accurate it is. It really highlights the fact that Data Science still requires creativity and an understanding of the data itself. You need to be good with your programming language, but without a clear understanding of the data you are dealing with, and a good strategy to clean the noise, you will handicap your analysis from the beginning. What is lost at this stage cannot be recovered later on.

We also tend to leave the data cleaning part alone once it has been done, because it can be complex to change: every later step of the process can be impacted by it. So if you only do half the job at this stage, you will probably live with this legacy for quite some time. It is worth planning more hours for this part than for any other when you are at the beginning of your project.

## Data Wrangling

Data wrangling is the practice of changing or transforming the data so that it can be processed later on.

When you are doing basic data analysis, you may not need to do a lot of it. As long as you do a proper job during the data cleaning process, you don’t need to do much more. Analyses done with pandas work as long as the data is clean enough.

However, if you would like more advanced usage of that data, such as machine learning, you will need to do some additional work.

### Convert Categorical data

Let’s imagine that you have a data set about the housing market, and you would like your predictive algorithm to take into consideration whether the apartment has a balcony or not. You have a column, balcony, with the possible values “yes” and “no”. You will have to replace them with numbers so that the algorithm can process them.

This is an example of categorical data, and it is a pretty easy one. As it is just a binary value, you can replace “yes” and “no” with 1 and 0.

However, sometimes you have categorical values that are non-numerical and have more than two possible values. Let’s say for a car, the price is probably higher for a white or black car than for a green one. So you would need to take the color into consideration.

There are 2 main methods to deal with that type of issue.
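With pandas, this replacement can be sketched as follows. The `houses` dataframe and the `balcony_num` column name are made up for the illustration; they do not come from a real data set.

```python
import pandas as pd

# Hypothetical housing data: the column name and values are illustrative only.
houses = pd.DataFrame({"balcony": ["yes", "no", "no", "yes"]})

# Map each label to an integer. Any value missing from the mapping becomes NaN,
# which is a cheap way to spot unexpected labels left over from the cleaning step.
houses["balcony_num"] = houses["balcony"].map({"yes": 1, "no": 0})

print(houses)
```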

The following section requires version 0.20 or later of scikit-learn. Don’t hesitate to do a *pip install --upgrade scikit-learn*

Also, a good tip: sklearn (or scikit-learn) does not automatically import its subpackages. So you need to import each subpackage when you want to use it (or do a from sklearn import * ==> **Not recommended**).

#### LabelEncoder

This method assigns a numerical value to each of the categorical values. In the case of the sex column, it will use 0 and 1, but it could use 2, 3, 4, … if the number of categories increases. *(which is mostly the case in 2019, as there are more than 2 genders now)*

```python
from sklearn import preprocessing
import pandas as pd

df = pd.DataFrame({'sex':['M','F','M','F','M','M','F','F'],'eyes':['blue','brown','green','blue','blue','green','brown','brown']})
## create the encoder and encode the sex column
enc = preprocessing.LabelEncoder()
df['sex_enc'] = enc.fit_transform(df['sex'])
## result
df.head(2).T
##             0      1
## sex         M      F
## eyes     blue  brown
## sex_enc     1      0
## if you want to actually inverse the process:
list(enc.inverse_transform(df['sex_enc']))
```

#### Get_dummies() / OneHotEncoder

You could do the same with the eyes, but sometimes it is better to pass binary values into your algorithm, especially in this case. There is no ordering between “blue” and “green”, and one is not better than the other.

With LabelEncoder, you could end up creating a relation such as “blue is better than brown” if blue is turned into 2 and brown into 1 (2 > 1).

In order to fix that, we have to encode each value as 0 or 1 and expand the number of columns. As you are mostly storing your data in a dataframe, you can use the get_dummies method from pandas to do that:
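To make the ordering issue concrete, here is a small sketch on a short list of eye colors (the `eyes` list and variable names are only for illustration). LabelEncoder sorts the labels before numbering them, so the integers reflect alphabetical order and nothing else.

```python
from sklearn import preprocessing

# Illustrative values only
eyes = ['blue', 'brown', 'green', 'blue']

enc_eyes = preprocessing.LabelEncoder()
codes = enc_eyes.fit_transform(eyes)

# classes_ holds the sorted labels; the position of a label is its code
print(list(enc_eyes.classes_))
print(list(codes))
```

An algorithm that treats these codes as quantities would see green (2) as "more" than brown (1), which is exactly the artificial relation we want to avoid.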

```python
## same dataframe as before
## with the pandas get_dummies() function
## dtype=int keeps the 0/1 output shown below (recent pandas versions return booleans by default)
df2 = pd.get_dummies(df['eyes'],prefix='eyes',dtype=int)
df = df.merge(df2,left_index=True,right_index=True)
df.head(2).T
##                0      1
## sex            M      F
## eyes        blue  brown
## sex_enc        1      0
## eyes_blue      1      0
## eyes_brown     0      1
## eyes_green     0      0
```

But as we use numpy arrays from time to time for better performance, let’s see how we can do it with a numpy array and scikit-learn.

```python
import numpy as np
from sklearn import preprocessing

npArray = np.column_stack((['M','F','M','F','M','M','F','F'],['blue','brown','green','blue','blue','green','brown','brown']))
onehotencoder = preprocessing.OneHotEncoder()
df3 = onehotencoder.fit_transform(npArray).toarray()
## result: 5 columns, 2 for the 2 possible sex values and 3 for the 3 possible eyes values.
## [[0. 1. 1. 0. 0.]
##  [1. 0. 0. 1. 0.]
##  [0. 1. 0. 0. 1.]
##  [1. 0. 1. 0. 0.]
##  [0. 1. 1. 0. 0.]
##  [0. 1. 0. 0. 1.]
##  [1. 0. 0. 1. 0.]
##  [1. 0. 0. 1. 0.]]
```
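If you wonder which output column corresponds to which value, the fitted encoder keeps that information. Here is a minimal sketch on a smaller, made-up array (the names `X`, `ohe` and `encoded` are mine), assuming scikit-learn 0.20 or later:

```python
import numpy as np
from sklearn import preprocessing

# Small illustrative array: first column is sex, second is eye color
X = np.column_stack((['M', 'F', 'M'], ['blue', 'brown', 'green']))

ohe = preprocessing.OneHotEncoder()
encoded = ohe.fit_transform(X).toarray()

# categories_ lists, per input column, the values behind each group of output columns
print(ohe.categories_)

# inverse_transform maps the one-hot rows back to the original labels
print(ohe.inverse_transform(encoded))
```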

Another possible way to change categorical data to numerical, similar to one-hot encoding, works with dictionaries, so you’ll need to convert your dataframe. This technique uses the feature_extraction module (and not preprocessing) and requires you to create an instance of the class to use it on your data frame.

Class name: DictVectorizer

```python
from sklearn.feature_extraction import DictVectorizer

vector = DictVectorizer(sparse=False, dtype=int)
## the dataframe is passed as a list of dictionaries (one per row);
## only the 2 original categorical columns are used, which is what the output below corresponds to
df_vec = vector.fit_transform(df[['sex','eyes']].to_dict(orient="records"))
## the data (the exact integer dtype depends on your platform)
#array([[1, 0, 0, 0, 1],
#       [0, 1, 0, 1, 0],
#       [0, 0, 1, 0, 1],
#       [1, 0, 0, 1, 0],
#       [1, 0, 0, 0, 1],
#       [0, 0, 1, 0, 1],
#       [0, 1, 0, 1, 0],
#       [0, 1, 0, 1, 0]], dtype=int32)
```

However, this method mixes all of the different columns together, and knowing exactly what has been transformed into what is quite hard to determine (here the sex encoding ended up in the last 2 columns).

Also, you would have trouble changing it back to the way it was.
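One way to find out what ended up where is to look at the `feature_names_` attribute of the fitted vectorizer. A minimal sketch on a 2-row, made-up dataframe (`people`, `vec` and `matrix` are illustrative names):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Illustrative data only
people = pd.DataFrame({'sex': ['M', 'F'], 'eyes': ['blue', 'brown']})

vec = DictVectorizer(sparse=False, dtype=int)
matrix = vec.fit_transform(people.to_dict(orient="records"))

# feature_names_ gives the label of each output column, in column order
print(vec.feature_names_)
print(matrix)
```

The names are sorted alphabetically, which explains why the columns of the different features get mixed together.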

```python
df_inverse = vector.inverse_transform(df_vec)
df_inverse = pd.DataFrame.from_dict(df_inverse)
```

In the next sections, I will assume that we have only numerical data in our data set. All the categorical, labeled data has been transformed, and the data set contains only integers or floats.

```python
## in our case:
df_numeric = df.drop(['sex','eyes'],axis=1)
```

### Standardization

Another problem you may encounter when dealing with data is that the features do not look like standard normally distributed data (zero mean and unit variance). A feature may be over-represented at the low end or the high end, with extreme values, and this messes with the calculations of various algorithms. For example, Support Vector Machines or the l1 and l2 regularizers of linear models assume that all features are centered around zero and have variance of the same order.

One of the best practices is to standardize your data in order to avoid this effect.

Scikit-learn provides different methods to achieve this and I’ll present some of them.

The first one is **scale**. It standardizes a dataset along any axis. Pretty basic and efficient.

```python
df_numeric_scale = preprocessing.scale(df_numeric)
## most likely, your dataframe will be changed into an array
## your integers will be changed to floats.
```

The second one is StandardScaler. It does exactly the same thing, but it is a class that you fit on your data. It is quite useful if you want to invert the transformation at some point.

```python
standardScaler = preprocessing.StandardScaler()
df_numeric_standard = standardScaler.fit_transform(df_numeric)

## To invert the transformation
standardScaler.inverse_transform(df_numeric_standard)
```
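To see what standardization does to the numbers, here is a minimal sketch that redoes the computation by hand with numpy and compares it with `preprocessing.scale`. The `values` array is made up for the illustration.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative data: 2 features on very different scales
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Standardizing = subtracting the column mean and dividing by the column standard deviation
manual = (values - values.mean(axis=0)) / values.std(axis=0)
scaled = preprocessing.scale(values)

print(np.allclose(manual, scaled))
# After scaling, each column has (approximately) mean 0 and standard deviation 1
print(scaled.mean(axis=0), scaled.std(axis=0))
```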

In any case, you will lose your header when you transform your data (because scikit-learn uses numpy arrays). As it keeps the order of the columns, you can just save the column names beforehand with a simple:

```python
## save the names of the columns that are passed to the scaler
columns = list(df_numeric.columns)
```

The third one is my favorite: the **MinMaxScaler**.

This scaler uses the minimum and the maximum of each feature to fit all of your values in between (0 and 1 by default). You can also use MaxAbsScaler, which divides each feature by its maximum absolute value so that the data fits between -1 and 1. It does not shift the data, so zeros stay zeros. That helps for certain algorithms.

In our code, these methods won’t change much because our data is already scaled between 0 and 1. But for the sake of the example, I’ll show you the code.
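Once the column names are saved, putting them back is a one-liner. A minimal sketch, using a small made-up dataframe (`df_example`) rather than the one from this article:

```python
import pandas as pd
from sklearn import preprocessing

# Illustrative data only
df_example = pd.DataFrame({'men': [5, 10, 6], 'women': [5, 10, 8]})

scaler = preprocessing.StandardScaler()
scaled_values = scaler.fit_transform(df_example)  # this is a plain numpy array

# Rebuild a dataframe with the original header and index
df_scaled = pd.DataFrame(scaled_values, columns=df_example.columns, index=df_example.index)
print(df_scaled)
```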

```python
## example for MinMax
minMax = preprocessing.MinMaxScaler()
df_numeric_minMax = minMax.fit_transform(df_numeric)

## example for maxAbs
maxAbs = preprocessing.MaxAbsScaler()
df_numeric_maxabs = maxAbs.fit_transform(df_numeric)
```

It is interesting to know that if you want to be quick and don’t want to create instances, you can use the equivalent functions (the same as for scale):

- minmax_scale()
- maxabs_scale()
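As a quick sanity check of what these two functions compute, here is a sketch that redoes both scalings by hand with numpy, using the default settings. The `values` array is made up for the illustration.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative data, including negative values
values = np.array([[10.0, -4.0], [20.0, 0.0], [40.0, 2.0]])

# MinMax: (x - min) / (max - min), computed per column, gives values in [0, 1]
manual_minmax = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

# MaxAbs: x / max(|x|), computed per column, gives values in [-1, 1] and keeps zeros at zero
manual_maxabs = values / np.abs(values).max(axis=0)

print(np.allclose(manual_minmax, preprocessing.minmax_scale(values)))
print(np.allclose(manual_maxabs, preprocessing.maxabs_scale(values)))
```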

### Shuffling the data

On your journey to feed your algorithm with your data, there is a specific aspect of your data set that you may have overlooked: your data set may be sorted.

This can be an issue because the ordering can give your algorithm a signal that it will never receive in real life.

A simple way to avoid this pitfall is to shuffle your dataset before feeding it to your algorithm.

This can be done by executing this code:

```python
from sklearn.utils import shuffle

df_shuffled = shuffle(df)
```
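If your features and your target live in separate arrays, they have to be shuffled together so that each row keeps its label. A minimal sketch with made-up arrays (`X` and `y` are illustrative names); `random_state` is set only to make the shuffle reproducible:

```python
import numpy as np
from sklearn.utils import shuffle

# Illustrative, deliberately sorted data: row i of X is [2*i, 2*i + 1] and its label is i
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 2, 3, 4])

# Passing both arrays applies the same permutation to each of them
X_shuffled, y_shuffled = shuffle(X, y, random_state=42)

print(X_shuffled)
print(y_shuffled)
```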

### Polynomial Feature

This one is not a must and is not always applied, but I think it is worth including as it is part of the preprocessing package.

For some problems you are trying to solve, you will see that basic linear regression is not enough. Your problem is not linear, and therefore the features that explain it cannot be limited to simple linear terms.

This is where polynomial features come into practice. This method enables you to increase the degree of complexity of your data set in order to solve non-linear problems with linear algorithms.

Example:

You have 2 variables (a, b) that could explain your target, but when you look at your data, it is unlikely that those 2 variables alone will fit the pattern.

What you could do to increase the complexity of your model is to expand them with polynomial features.

For **a** and **b**, you create **1**, **a**, **b**, **a²**, **ab**, **b²**.

For **a**, **b**, **c**, you create **1**, **a**, **b**, **c**, **a²**, **b²**, **c²**, **ab**, **ac**, **bc**.

This will give you the degree of complexity that helps you best fit your data points.

This functionality is meant for truly numerical features, not for transformed categorical values. In our previous example, we set all the values to 0 and 1. If you know how exponents and multiplication work, you can imagine that 1² and 1×1 will not give us any additional information.

Let’s create a new dataframe with numerical values of this type.
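To check the expansion on a single point, here is a minimal sketch with a = 2 and b = 3 (the values and variable names are only for illustration). The `powers_` attribute shows the exponent applied to each input for every generated column.

```python
import numpy as np
from sklearn import preprocessing

# One illustrative sample: a = 2, b = 3
point = np.array([[2, 3]])

poly = preprocessing.PolynomialFeatures(degree=2)
expanded = poly.fit_transform(point)

# Columns are generated in this order: 1, a, b, a², ab, b²
print(expanded)

# powers_[i, j] is the exponent of input j in output column i
print(poly.powers_)
```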

```python
df = pd.DataFrame({'nb_population':[10,20,12,22,13,4],'men':[5,10,6,14,7,2],'women':[5,10,6,8,6,2]})
## when creating the instance of PolynomialFeatures, you need to set the degree (the highest exponent) you want.
## You can always start with 2 and see if more is required.
poly_feature = preprocessing.PolynomialFeatures(degree=2)
df_poly = poly_feature.fit_transform(df)
#result will be
#array([[  1.,  10.,   5.,   5., 100.,  50.,  50.,  25.,  25.,  25.],
#       [  1.,  20.,  10.,  10., 400., 200., 200., 100., 100., 100.],
#       [  1.,  12.,   6.,   6., 144.,  72.,  72.,  36.,  36.,  36.],
#       [  1.,  22.,  14.,   8., 484., 308., 176., 196., 112.,  64.],
#       [  1.,  13.,   7.,   6., 169.,  91.,  78.,  49.,  42.,  36.],
#       [  1.,   4.,   2.,   2.,  16.,   8.,   8.,   4.,   4.,   4.]])
```

I wanted to discuss more methods that need to be mastered and are not algorithm-related (pipeline, cross_validation, …). However, given the length of this post, I will probably write a second post to discuss them.

I hope this post was useful to learn what needs to be known before running your first algorithm. Have fun coding!
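To illustrate why this helps a linear algorithm, here is a small sketch on synthetic data that I made up for the occasion: the target follows y = 3x² + 2, which a straight line cannot fit. It compares the R² score of a plain LinearRegression with the same model fed with degree 2 polynomial features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data: y = 3x² + 2 (an assumption made for the illustration)
x = np.arange(-5, 6).reshape(-1, 1).astype(float)
y = 3 * x.ravel() ** 2 + 2

# A linear model on the raw feature cannot capture the curve
linear_score = LinearRegression().fit(x, y).score(x, y)

# The same linear model on the expanded features (1, x, x²) can
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
poly_score = LinearRegression().fit(x_poly, y).score(x_poly, y)

print(linear_score, poly_score)
```

The model is still linear in its coefficients; only the inputs have been enriched.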