Fluent Python | Datanalyst

In this article, I will share some of the interesting things I learned while reading the book “Fluent Python” by Luciano Ramalho.
I will mostly focus on the first chapters.

The book is often recommended within the Python community as one of the few that teach important concepts without being too basic or too advanced. Python is so popular that many of the books about it only cover how to get started. That is a good thing, as it lets lots of people like me, without a computer science degree, learn to code and get very interesting jobs.

However, when you want to go deeper into the language, the discussion very quickly becomes advanced and moves to C / C++ and other low-level topics. I was looking for a book that would improve my skills without forcing me to learn C / C++.

On that note, Fluent Python does the job perfectly. It can be tough if you read it too early in your journey, as you may be tempted to over-engineer your code. But if you apply the methods you learn in the right context, it gives you a nice overview of how code optimization can be done in Python.

Python is slow

Python has a reputation for being slow and, from what I have read on the subject, that is largely true if you write huge programs that do lots of calculations and generate or use lots of I/O operations. I am talking about millions of operations per minute here. To be honest, this is not the use case of mere mortals.

If you want to develop a very performance-oriented application that deals with dozens of gigabytes of data, you probably need something else. However, for most use cases it is possible to do the work in Python, even if you have to deal with gigabytes of data. It mostly depends on your machine (and the RAM it has).

As a motto often attributed to Google goes: “Python where we can, C++ where we must”.

Also, the most common implementation of Python, CPython, is written in C, and C is known for being pretty fast. Obviously Python will never beat C, but it has a strong foundation. New versions of Python keep trying to get faster, and many built-in functions and methods are implemented directly in C.

This book gives you opportunities to optimize your code, as some of the methods it shows are implemented almost directly in C inside CPython.
Note: If you wish to do fast computation with Python, you should start learning numpy. Most of its array methods are implemented in C.
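
To make the "implemented in C" point concrete, here is a minimal sketch that compares a hand-written loop with the built-in sum(). The function name manual_sum and the list size are my own choices for illustration, and the timings will vary from one machine to another.

```python
import timeit

numbers = list(range(100_000))

def manual_sum(values):
    """Add the values one by one in a pure-Python loop."""
    total = 0
    for value in values:
        total += value
    return total

# Both approaches give the same answer...
assert manual_sum(numbers) == sum(numbers)

# ...but the built-in sum() runs its loop in C inside CPython.
loop_time = timeit.timeit(lambda: manual_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)
print(f"manual loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

On CPython the built-in version is usually noticeably faster, but measure on your own data before drawing conclusions.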

Lists are not Arrays

I am also a heavy user of JavaScript, as I need it to develop other things at work. When I explain Python to my colleagues, I have to explain lists. They are one of the main data structures available in Python, so of course they come up.

However, lists are not arrays. It is a common mistake to treat them as the same thing, but they are different. A list can contain any type of object.

myList = ["a",1,2,3.4,[5,6]]

The list I wrote above contains different types of elements:

string
integer
float
list

Yes, you can have a list of lists! This works because a list does not store the values themselves: each slot holds a reference (a pointer) to a separate Python object with its own “id”. This makes a list slower to build and more memory-hungry than a compact block of raw values.

I didn’t know it before, but the book introduced me to the array type, which is the real deal. An array holds elements of a single data type (typically numbers).
If you wish to store only numbers in a list-like object, use an array instead of a list: it is smaller in memory and faster to process.

import array
myArray = array.array('b',[-2,-1,0,1,2])

You can find more details here: https://docs.python.org/3/library/array.html

It has several nice methods:

tofile()
tolist()
tobytes() (named tostring() in older versions, which was removed in Python 3.9)
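
Here is a small sketch of the memory difference and of the conversion methods. It assumes CPython (sys.getsizeof reports implementation-specific sizes) and uses the 'd' type code for C doubles; the variable names are mine.

```python
import array
import sys

values = [float(i) for i in range(1000)]
float_list = list(values)
float_array = array.array('d', values)  # 'd' = C double

# A list stores references; each float is a separate Python object.
list_bytes = sys.getsizeof(float_list) + sum(sys.getsizeof(v) for v in float_list)
# An array stores the raw values in one contiguous block.
array_bytes = sys.getsizeof(float_array)
print(f"list: {list_bytes} bytes, array: {array_bytes} bytes")

# Round trip through the conversion methods.
assert float_array.tolist() == float_list
raw = float_array.tobytes()          # raw machine values as bytes
restored = array.array('d')
restored.frombytes(raw)
assert restored == float_array
```

The exact byte counts depend on the platform, but the array should come out several times smaller than the list plus its float objects.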

There is more than one type of list

Reading the book made me understand the underlying mechanics of each of the basic object types.
I also discovered some hidden gems that are available in the standard library.

If you want to create a FIFO (First In, First Out) queue, you can use deque, a double-ended queue.
It works like a list, but it is faster to pop from the beginning (list.pop(0)) or to insert anything at the beginning.

from collections import deque
myFifo = deque(['a','b','c'])

myFifo.appendleft('0')
## myFifo is now deque(['0', 'a', 'b', 'c'])
myFifo.popleft()
## returns '0' and removes it from the deque

The good thing is that a deque can be limited to a maximum number of items. Once it is full, appending on one end discards an item from the other end.

dq = deque(range(10), maxlen=10)  ## holds 0 to 9
dq.append(10)  ## now holds 1 to 10, the 0 was dropped

This gives you out-of-the-box functionality that is really handy for some projects.
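
As an illustration of a bounded deque, here is a minimal sketch that keeps only the most recent items of a stream. The helper last_n and the sample readings are made up for the example.

```python
from collections import deque

def last_n(readings, n=3):
    """Return the last n readings seen, oldest first."""
    window = deque(maxlen=n)  # older items fall off the left automatically
    for reading in readings:
        window.append(reading)
    return list(window)

recent = last_n([12, 15, 11, 18, 20, 17])
print(recent)  # [18, 20, 17]
```

There is no manual trimming to write: the maxlen argument does the bookkeeping.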

To stay with performance optimization on lists, you may want to look at the bisect module. It uses binary search, so it is really fast at finding where an element belongs in a sorted list, and it can insert the element there for you.
When you are dealing with very large lists, this can be very interesting, as it saves you some (milli)seconds.

import bisect
myList = [0,2,5,3,4,6,7,8,7,8,9]
mySortedList = sorted(myList)
##[0, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9]

bisect.insort_right(mySortedList,1)
## my new list [0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9]

There are two functions you can call: insort_right and insort_left. The difference only matters if the list already contains an element equal to the one you want to insert: insort_left puts the new element before the existing ones and insort_right puts it after them. If no equal element exists, the choice doesn’t matter, as my example shows.
You can check the position that would be used with the bisect_left / bisect_right functions.

bisect.bisect_left(mySortedList,0) ## returns 0
bisect.bisect_right(mySortedList,0)## returns 1
## in both cases, it doesn't insert it
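
To see the left/right difference when an equal element already exists, here is a small sketch. It inserts the float 1.0 into a list that already contains the integer 1 (they compare equal), so you can tell the new element from the old one; the list itself is just an illustration.

```python
import bisect

scores = [0, 1, 2]

left = list(scores)
bisect.insort_left(left, 1.0)    # goes before the existing 1
right = list(scores)
bisect.insort_right(right, 1.0)  # goes after the existing 1

print(left)   # [0, 1.0, 1, 2]
print(right)  # [0, 1, 1.0, 2]

# The matching search functions report those positions without inserting.
print(bisect.bisect_left(scores, 1))   # 1
print(bisect.bisect_right(scores, 1))  # 2
```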

Dictionaries, Sets and Tuples

Thanks to this book, I understood how dictionaries are built on hash tables, and I also learned about their performance characteristics.

Dictionaries are really fast for fetching and setting data, but they have some caveats that you may not know about:

Their hash table adds a memory overhead
They are not sorted: since Python 3.7 they keep insertion order, and before that no order was guaranteed

There are other forms of dictionaries that deal with ordering, and you may want to check them out if the order of your dictionary matters.

from collections import OrderedDict ## this dictionary remembers the order in which keys were inserted
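
Here is a minimal sketch of what OrderedDict offers on top of a plain dict. The keys and values are invented for the example; move_to_end() and popitem(last=False) are standard OrderedDict methods.

```python
from collections import OrderedDict

steps = OrderedDict()
steps['load'] = 1
steps['clean'] = 2
steps['export'] = 3

steps.move_to_end('load')          # push an existing key to the end
print(list(steps))                 # ['clean', 'export', 'load']

first = steps.popitem(last=False)  # remove and return the oldest entry
print(first)                       # ('clean', 2)

# Unlike plain dicts, two OrderedDicts are equal only if the order matches.
a = OrderedDict([('x', 1), ('y', 2)])
b = OrderedDict([('y', 2), ('x', 1)])
print(a == b, dict(a) == dict(b))  # False True
```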

Another useful dictionary is Counter. It builds a dict whose keys are the elements and whose values are the number of times each element appears. It works with any iterable of hashable items.

from collections import Counter
countMe = Counter([1,2,3,2,3,4,1,2,3,4,5,3,5,6,3,2,3])
## countMe now holds the counts 1: 2, 2: 4, 3: 6, 4: 2, 5: 2, 6: 1

countMe.most_common() ## returns a list of (element, count) pairs, most common first
## returns [(3, 6), (2, 4), (1, 2), (4, 2), (5, 2), (6, 1)]
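
Since Counter works with any iterable, here is a short sketch that counts the letters of a string. The word used is arbitrary.

```python
from collections import Counter

letters = Counter("mississippi")
print(letters.most_common(2))      # [('i', 4), ('s', 4)]

letters.update("pi")               # counts can be updated with more data
print(letters['p'], letters['i'])  # 3 5
print(letters['z'])                # 0, missing elements do not raise KeyError
```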

Another thing with dictionaries is that it is quite annoying to set up a key and then add values to it, especially when you create a dictionary of lists, for pandas purposes for example.
It is also quite annoying to always wrap a lookup in a try, and handle the missing key in an except.

## this kind of code:
myDict = {}
myDict['key'] = []
## ... do something
myDict['key'].append('new_value')

## 2nd example
try:
    valueIwant = myDict['lookUpValue']
except KeyError:  ## catch only the missing-key error
    valueIwant = 'NAN'

Thanks to the book, I can now better handle those cases :

## new way
my_dict = {}
my_dict.setdefault('key',[]).append('new_value') ## if not present, set an empty list, and then append the new value

## new way 2 
my_dict.get('lookUpValue','NAN') ## get 'NAN' if the key doesn't exist.
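
Here is a slightly larger sketch of the same idea: grouping words by their first letter. The word list is invented. I also show defaultdict from the collections module, which is a different tool from setdefault but solves the same problem.

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

# Grouping with setdefault: one line per item, no key check needed.
by_letter = {}
for word in words:
    by_letter.setdefault(word[0], []).append(word)

# Same grouping with defaultdict, which creates the empty list on first access.
by_letter_dd = defaultdict(list)
for word in words:
    by_letter_dd[word[0]].append(word)

print(by_letter)
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
print(by_letter == by_letter_dd)  # True
print(by_letter.get('z', 'NAN'))  # NAN
```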

Tuple

I barely used tuples before because I found their immutability quite annoying. Immutable means that you cannot modify them once they are created.
Thanks to the book, I now know that they are quite powerful in terms of performance and memory efficiency.
They are quite useful when you want to keep records of things.

The other problem I always had with tuples is that their data has no labels, which makes them quite annoying to work with. You have to remember the position of each item and what it means.

But that was what I thought! Meet the namedtuple from the collections module. It helps you create tuple records whose fields are easy to access by label.

from collections import namedtuple
Test = namedtuple('Something',['myFirst','mySecond','myThird']) ##define my labels
test1 = Test('val1','val2','val3')## set the value of each label
### test1 is now : Something(myFirst='val1', mySecond='val2', myThird='val3')
## you can access the values by label
test1.myFirst
##returns 'val1'

This makes keeping records and accessing their values pretty easy, doesn’t it?
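
A namedtuple is still a regular tuple, with a few extra helpers. In the sketch below the City record and its values are made up for illustration; _asdict() and _replace() are standard namedtuple methods.

```python
from collections import namedtuple

City = namedtuple('City', ['name', 'country', 'population'])
paris = City('Paris', 'FR', 2_100_000)  # illustrative figures only

name, country, population = paris       # still unpacks like a tuple
print(paris[0], paris.name)             # Paris Paris

as_dict = paris._asdict()               # labels and values as a dict
updated = paris._replace(population=2_200_000)  # new record, original unchanged
print(updated.population, paris.population)     # 2200000 2100000
```

Because the record is immutable, _replace() returns a new tuple instead of modifying the existing one.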

Sets

Last but not least are sets.
If you are familiar with Python, you probably already know about sets.
A set takes any list or tuple (or other iterable) and keeps only the unique values.

That is a pretty powerful feature if you want to keep only unique values.
But what is even more impressive, and what you may not know, is that you can combine 2 sets to find their union or their intersection.
And these operations are pretty damn fast.

If you want to find the number of unique values that 2 lists have in common, sets are the tool for you.
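
Here is a minimal sketch of those set operations on two small lists. The values are arbitrary, and sorted() is only used to print the results in a stable order.

```python
list_a = [1, 2, 2, 3, 4, 4, 5]
list_b = [4, 5, 5, 6, 7]

set_a, set_b = set(list_a), set(list_b)

common = set_a & set_b        # intersection: values present in both
everything = set_a | set_b    # union: values present in either
only_a = set_a - set_b        # difference: values only in the first

print(sorted(common))         # [4, 5]
print(len(common))            # 2
print(sorted(everything))     # [1, 2, 3, 4, 5, 6, 7]
print(sorted(only_a))         # [1, 2, 3]
```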

I hope this overview gave you some nice tips on Python and on the book itself.
I would highly recommend this book to anyone who has some experience with Python programming but wants to get serious and optimize their code.

It is not to be followed blindly: as the book states, optimization comes with the drawback of poorer readability. You should apply those techniques only when it matters.
