The 3 steps to carry out a proper data analysis
Language to choose for Data Analyst
When we talk about data analysis, many different languages can be used to carry out the tasks the analysis requires. In general, every data analysis process involves three steps, and each step requires different skills:
- Data retrieval
- Data Cleaning
- Data Processing
We are going to go through those steps and see which languages can be used in each.
1. Data retrieval
Here it really depends on what kind of Data Analyst you are: are you only going to use and process the data, or also collect it? In any case, I find it very useful to know how the data you are going to work with are retrieved. All data are limited or biased, and knowing how they were retrieved helps you understand where the flaws may be and how far you can trust them. Data retrieval is mostly about web services, servers or machines. The data could come from surveys or from automated machines/algorithms sending information (e.g. in finance). So what you mostly need here is basic knowledge of the languages that run on those systems:
- Java: Most of what runs on servers is coded in Java (at big companies).
- PHP: Most CMSs (Content Management Systems) are written in PHP.
- And so on… (C/C++/C# / Swift /…)
I think those three entries cover a lot of systems, but I would not put the same effort into learning each of them.
Java: This is not an easy programming language to learn, so you would need to spend a lot of time learning to code in Java, and finding a use for it is not that easy. Java is for specialists, and it is very unlikely that you would do data analysis and web programming at the same time. I think it is enough to scratch the surface and see what it looks like and what its logic is. Not going too deep there should be OK. If you want to, or can, learn it better than the rest, that is great. It runs in many places: servers, applications (Android), etc. Knowing Java also makes it easier to switch to most other data-related languages (Scala in particular).
PHP: PHP is more or less the cheap Java of mid-size websites. I do not mean that as a criticism; it is just an easier language to learn that can run websites efficiently. You do not need to compile a script before running it (although you do need a web server application to execute the PHP). If you want to go into consulting, it is definitely a good language to have. Many websites use PHP, and for that kind of company it would be very valuable to have someone who can both retrieve and analyse the data.
Other languages are also involved in the data retrieval process, but they are less common, do not play a big role in data analysis, and are considerably tougher to learn.
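To make the point about knowing where your data come from a bit more concrete, here is a minimal Python sketch. It does not call a real web service: the payload, the server name and the field names ("source", "records", "price") are all made up for illustration. It only shows the kind of quick check I would run on retrieved data to see where the flaws may be.

```python
import json

# Hypothetical payload, standing in for what a web service might return.
# The server name and field names are assumptions made for this sketch.
RAW_RESPONSE = """
{
  "source": "survey-server-01",
  "records": [
    {"id": 1, "price": 10.5},
    {"id": 2, "price": null},
    {"id": 3}
  ]
}
"""


def summarize_payload(raw: str) -> dict:
    """Parse a JSON payload and report which records lack a usable price."""
    payload = json.loads(raw)
    records = payload.get("records", [])
    # A price is unusable when the key is absent or its value is null.
    missing = [r["id"] for r in records if r.get("price") is None]
    return {
        "source": payload.get("source"),
        "total": len(records),
        "missing_price_ids": missing,
    }


summary = summarize_payload(RAW_RESPONSE)
print(summary)
```

Even a small report like this tells you which server the data came from and how many records you cannot rely on before you start the analysis.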
2. Data cleaning
The most important part of a Data Analyst's job is data cleaning! As with data retrieval, you may not be part of this process yourself, but data cleaning is a mandatory step towards data analysis. Coming from data retrieval, there are 2 possibilities: either the person who fetched the data did the job well and you have nicely structured data (mostly in a database), or the data you receive are not so clean and you need to go through several steps to make them “look better”. Here I will discuss 4 types of languages that can be used for data cleaning:
- SQL
- R
- Python
- Linux (Bash)
SQL: Yes, SQL can be used to clean data, especially if you want to work with only a subset of the data rather than the full database each time you want to do something. This is not hard-core SQL, but a few temporary tables will really help your code run faster and use fewer resources.
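As a rough illustration of the temporary-table idea, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names (sales, sales_fr, country, amount) are invented for the example, and a real database server may use slightly different syntax.

```python
import sqlite3

# In-memory database so the sketch runs without any setup.
conn = sqlite3.connect(":memory:")

# Hypothetical "full" table; names and values are assumptions for this example.
conn.execute("CREATE TABLE sales (id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "FR", 120.0), (2, "DE", None), (3, "FR", 80.0), (4, "FR", None)],
)

# Temporary table holding only the clean subset we want to work with:
# one country, and no rows with a missing amount.
conn.execute(
    """
    CREATE TEMPORARY TABLE sales_fr AS
    SELECT id, amount
    FROM sales
    WHERE country = 'FR' AND amount IS NOT NULL
    """
)

rows = conn.execute("SELECT id, amount FROM sales_fr ORDER BY id").fetchall()
print(rows)
```

Later queries can then run against the small sales_fr table instead of scanning the full sales table every time.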
R: This language is not primarily known for its cleaning capabilities, but it can be used for that purpose. It has a good library for the task (tidyr) that does the job. I am not an expert on the subject, but you can get expert-level support here
Python: Python is the multi-purpose language, so it is no surprise that it is up to the task. For this purpose it is even better than R, thanks to its data exploration library (pandas), which was designed with cleaning functionality in mind. Parts of the pandas library are also written in C, so it is considered fast for the everyday programmer.
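Here is a minimal sketch of what cleaning with pandas can look like. The column names and values are invented, and the steps shown (dropping duplicates, dropping rows without a name, trimming whitespace, converting text to numbers) are just a few common ones, not a complete recipe.

```python
import pandas as pd

# Hypothetical messy data; the column names and values are made up.
raw = pd.DataFrame(
    {
        "name": [" Alice", "Bob ", "Bob ", None],
        "age": ["34", "n/a", "n/a", "51"],
    }
)

clean = (
    raw.drop_duplicates()          # remove exact duplicate rows
    .dropna(subset=["name"])       # drop rows that have no name at all
    .assign(
        # trim stray spaces around the names
        name=lambda df: df["name"].str.strip(),
        # turn age into a number; unparseable values such as "n/a" become NaN
        age=lambda df: pd.to_numeric(df["age"], errors="coerce"),
    )
    .reset_index(drop=True)
)

print(clean)
```

Whether you then drop or fill the remaining missing ages depends on the analysis you want to run afterwards.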
Linux: The Bash shell is often not considered a real language in its own right, at least in terms of data analysis. It is rarely taught in the courses I have looked at, but it is definitely one of the most powerful languages in terms of speed, ease of learning and ease of putting into practice in a company. Most companies use Linux-based servers, so it is very easy to set up an environment for it if necessary.
Using those different languages should leave you with data that are ready for data processing.
3. Data processing
I will not spend long on this part, as it is the very reason for this blog. I can summarize it as the languages that can be used to actually process the data:
- Python (mostly used on this website)
- R (mostly for academic purposes)
- Scala (for distributed system)
- Octave (academic also)
- SPSS / SAS / etc…
We will come back to these later in the different articles and videos on the subject.