Basic to Extra: Making Graphs

Anila Qureshi
5 min read · Apr 9, 2020

I barely knew any Python when I started at Flatiron. One of the first things we learn to do is make basic graphs. Luckily, this is the easiest part, which is exactly why graphs should not be overlooked.

With all of the information we learn, I have fallen victim to not going beyond the basics for data visualisations. When I feel like my graph is not good enough, I google prettier graphs, copy the bits I like the most and apply them to my own. (This comes with a lot of trial and error and can become time-consuming!)

Although this is useful, it’s better to have a good formula or checklist to work from. This saves time when making pretty graphs and brings uniformity to your work. These graphs are your mark: they are what people will remember and refer to when they look at your work. We as data scientists use data to help make informed decisions (next steps) or to test hypotheses. There are some really good blog posts and dictionaries I’ve used as references, but I will suggest some key takeaways to give you a strong framework to work from.

The one thing I’ve learned so far is that visualisations are vital for bringing everything together. They not only show everything you’ve done technically, but also help to distill it in a non-technical manner.

Data Operationalisation:

Matplotlib vs Seaborn

Matplotlib, as we all know, is a plotting library for Python. It is a lot easier to customise (colours, titles, etc.) than seaborn, but its visuals require a lot of lines of code.

Seaborn is an extension with better themes. It’s very easy to produce a simple graph with just one line, but customising it is a bit foggier. I’ve personally tried and failed a bunch of times, which leads me to stick to matplotlib.

Here’s a walkthrough of a distribution plot in matplotlib, followed by a walkthrough of the same type of graph in seaborn.

import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

np.random.seed(500)  # fix the seed so the random data for this graph is reproducible

mu = 45     # mean of the distribution
sigma = 10  # standard deviation
x = mu + sigma * np.random.randn(300)
num_bins = 50
fig, ax = plt.subplots(figsize=(10, 10))
# Histogram of the data. It's important to know exactly what you're plotting here;
# I've personally gotten confused at this step, mostly by leaving arguments out.
n, bins, patches = ax.hist(x, num_bins, density=True)
# Overlay the normal probability density curve for the chosen mu and sigma
y = (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * ((bins - mu) / sigma) ** 2)
ax.plot(bins, y, '--')
ax.set_xlabel('Salary')
ax.set_ylabel('Probability Density')
ax.set_title('Data Science Salaries')
plt.show()

Here we can assume the income is represented in thousands. The density is scaled so that the areas of all the bars add up to 1, so the area of the bars inside an income bracket gives the share of the sample that falls within it.
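To make the density axis a little less abstract, here is a small check of my own (not part of the original walkthrough). It rebuilds the same histogram with `np.histogram`, which is what `ax.hist` relies on for the binning, and confirms that the bar areas add up to 1. The 40 to 50 bracket is just an illustrative choice.

```python
import numpy as np

# Same synthetic data as in the matplotlib walkthrough.
np.random.seed(500)
mu, sigma = 45, 10
x = mu + sigma * np.random.randn(300)

# density=True rescales the counts so the bars enclose a total area of 1.
heights, edges = np.histogram(x, bins=50, density=True)
widths = np.diff(edges)

# Area under the whole histogram: bar height times bar width, summed.
total_area = np.sum(heights * widths)

# Fraction of the sample inside one (arbitrarily chosen) income bracket.
share_40_to_50 = np.mean((x >= 40) & (x < 50))

print(f"total area under the bars: {total_area:.3f}")
print(f"share of sample between 40 and 50: {share_40_to_50:.3f}")
```

The first number comes out as 1 (up to rounding), which is the whole point of a density scale: bar heights alone are not shares, but areas are.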

Here you can see we do not have a legend, but we do have the basics: a title, labels and colour differentiation.
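If you do want a legend on the matplotlib version, one way is to give each plotted element a `label` and then call `ax.legend()`. This is a minimal sketch under my own assumptions: the label texts and the legend position are illustrative choices, not something from the original graph.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(500)
mu, sigma = 45, 10
x = mu + sigma * np.random.randn(300)

fig, ax = plt.subplots(figsize=(10, 10))
# label= is the text the legend will show for each element.
n, bins, patches = ax.hist(x, 50, density=True, label='Sampled salaries')
y = np.exp(-0.5 * ((bins - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
ax.plot(bins, y, '--', label='Normal curve (mu=45, sigma=10)')
ax.set_xlabel('Salary')
ax.set_ylabel('Probability Density')
ax.set_title('Data Science Salaries')
# Collect every labelled element into a legend box.
legend = ax.legend(loc='upper right')
```

In a notebook with `%matplotlib inline` the figure renders on its own; in a plain script, finish with `plt.show()`.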

Now for the same in seaborn.

sns.distplot(x);

Here we see just a distribution plot. Unlike the one above, its colours and sleek theme make it more appealing.

Let’s now go from basic to extra!

Let’s add labels, a title, change the size and learn about getting a legend.

plt.figure(figsize=(10,10))
sns.set_context('poster', font_scale=.9)
sns.set_style('darkgrid')
sns.distplot(x);
plt.xlabel('Income', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title('Income Distribution in Data Science')

I haven’t quite figured out how to change the numbers here. Will edit this after the project.

Analysis

Matplotlib seems very simple, but I personally find that it involves a lot of trial and error. Making an impeccable visualisation also requires a lot of lines of code.

Seaborn is beautiful and is probably the most widely used library for displaying professional graphs, but it gets really confusing when it comes to customisation. For a beginner, it’s also important to realise that you cannot use seaborn on its own. It has been built on top of matplotlib as a complement, not an alternative.

I have also found that customisation works differently in seaborn, even though you still need a bit of matplotlib to do it. It’s important to pay attention to this.

For this reason, I believe there are some good takeaways for becoming better at visualisations.

  1. Always start with the right imports.
  2. With the amount of data we see, be sure to properly identify what you’re going to represent.
  3. Keep a good framework! Memorise it and implement it on every graph you make. It’s good to set your figure size at the beginning, followed by a good style and context. Most people use the poster context because it works best for presentations. You MUST have indicators, label things appropriately, and make sure the scale is correct on your graph.
  4. Keep trying to integrate new graphs into your work.
  5. The one thing I still struggle the most with is colour, regardless of which library I’m using. The best place to go for seaborn colour is here.
  6. Evolution of tech: how to keep on top of features. We can’t possibly know all of the newest trends. A good place to start is the news. The FT and other reputable news sources that depend on compiling new summaries of data will always give good, up-to-date examples of visualisations. Chances are these will come with new features. These sources help to set the standard for visualisations, so always pay attention to them.
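The framework in point 3 can be wrapped in a small helper so that every graph starts the same way. This is only a sketch: `framed_axes` is a hypothetical name of mine, and the defaults (a 10 by 10 figure, the poster context, the darkgrid style and seaborn's built-in `colorblind` palette) are assumptions you should swap for your own house style.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def framed_axes(title, xlabel, ylabel, figsize=(10, 10), palette='colorblind'):
    """Hypothetical helper: apply one consistent style, return labelled axes."""
    # Context, style and palette first, so they apply to the new figure.
    sns.set_context('poster', font_scale=0.9)
    sns.set_style('darkgrid')
    sns.set_palette(palette)
    fig, ax = plt.subplots(figsize=figsize)
    # The indicators every graph must have.
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return fig, ax


# Usage: the checklist is done before any data is plotted.
np.random.seed(500)
x = 45 + 10 * np.random.randn(300)
fig, ax = framed_axes('Income Distribution in Data Science',
                      'Income', 'Probability Density')
counts, edges, bars = ax.hist(x, 50, density=True)
```

Because the helper returns a normal matplotlib figure and axes, anything you plot afterwards, with matplotlib or with seaborn's `ax=` argument, inherits the same size, style and labels.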
