Guided Python Exercise

This exercise is inspired by part of a bootcamp assessment. I thought that, with some added guidance, it might make an interesting task for Python beginners. I have also added suggestions for more advanced Python techniques to apply to the completed task, which I hope makes it educational for intermediate Python users as well.

If you would like to do some basic exercises first, I recommend the old Google Edu Python Class. It explains the basics of Python well and also comes with lecture videos. Sadly, it has not been updated for Python 3, but for the most part that only means that print 'hello' now has to be print('hello'). If you download the exercises (which I would recommend), you can run 2to3 -w ./ inside the exercise folder to convert the code to Python 3. Without the -w flag, 2to3 only prints the proposed changes instead of writing them to the files. Note that the 2to3 tool only ships with Python versions before 3.13.

Task outline

Fit a line through a given set of datapoints.

The following equation can be used to draw a line on a two-dimensional plot. a controls the line's slope and b the place where it intersects the y-axis.

y = a * x + b

The goal is to adjust a and b so that the line approximates a set of points. The quality of the fit will be determined using the mean squared error.
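To get a feeling for what a and b do, here is a minimal sketch that evaluates the equation for a single x value. The function name line_y and the numbers are made up for illustration.

```python
def line_y(x, a, b):
    """Return the y-value of the line y = a * x + b at a single x."""
    return a * x + b


# With a = 2 and b = 1 the line crosses the y-axis at 1 ...
print(line_y(0, 2, 1))  # 1
# ... and rises by 2 for every step of 1 along the x-axis.
print(line_y(3, 2, 1))  # 7
```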

Exercise Structure

The exercise is broken down as follows:

Each task comes with a set of suggested subtasks. Feel free to read ahead and deviate from them; the subtasks are only one way of approaching the task.

The suggested syntax sections are both collections of hints and best-practice educational content. Feel free to copy and run the code to get started.

Use of Python Libraries

While the original task allows the use of numpy and/or pandas, I would strongly recommend using only pure Python. The goal is to hone your Python skills, not to solve the problem as fast as possible by using data-science tools designed exactly for this purpose.

The only external library you need is pygal, for plotting, so import pygal should be your only import statement. I recommend completing all tasks without adding any further imports.

Task 1 – Load Data

Get the x and y values from the file into a Python data structure.

Download the csv file here

Example data structures you could use:

Which data structure you use is up to you. It is relatively easy to convert from one to the other, and writing conversion functions is also a useful skill to practice.
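As a rough sketch, here are three structures that could hold the data, plus one way of converting between them. The values and variable names are made up and are not taken from the csv file.

```python
# Made-up sample values, not taken from the csv file.
xs = [1.0, 2.0, 3.0]
ys = [2.5, 4.1, 6.2]

# Option 1: two parallel lists (xs and ys above).
# Option 2: a list of (x, y) tuples.
points = list(zip(xs, ys))
# Option 3: a dictionary with one list per column.
columns = {"x": xs, "y": ys}

# Converting back from a list of tuples to two separate lists.
xs_again = [x for x, y in points]
ys_again = [y for x, y in points]

print(points)  # [(1.0, 2.5), (2.0, 4.1), (3.0, 6.2)]
```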

Suggested Subtasks

Relevant syntax

Reading Files

The open function returns a filehandle pointing at a file. You can iterate over it in a for loop, much like you would over range(42). Each iteration gives you one line of the file as a single string, including the newline character at its end.

with open('./example_data.csv') as filehandle_to_our_csv:
    for line in filehandle_to_our_csv:
        # each line still ends with '\n', so tell print() not to add another one
        print(line, end='')

Filehandles should be closed when you are done with them. This can be done by calling their close() method. Here we instead use a with ... as ...: block, which automatically closes the filehandle once we leave the block.
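To show what the with block saves you from, here is a sketch that closes the filehandle manually and turns each line into a pair of numbers. It writes its own tiny stand-in file so that it runs anywhere. The file name, header and values are invented, and the real csv may be laid out differently. The tempfile and os imports are only there to make the example self-contained; you do not need them for the exercise.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Invented stand-in data; the real csv may look different.
    path = os.path.join(folder, "tiny_example.csv")
    with open(path, "w") as f:
        f.write("x,y\n1.0,2.5\n2.0,4.1\n")

    rows = []
    filehandle = open(path)
    try:
        next(filehandle)  # skip the header line
        for line in filehandle:
            # strip() removes the trailing newline, split(",") cuts at the comma
            x_text, y_text = line.strip().split(",")
            rows.append((float(x_text), float(y_text)))
    finally:
        filehandle.close()  # the step a with block would do for us

print(rows)  # [(1.0, 2.5), (2.0, 4.1)]
```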

Other Syntax you might need

Task 2 – Scatterplot

Plot the x and y values as a scatterplot. You should end up with a PNG, SVG or similar file containing the scatterplot.

Ideally/optionally the plot should have:

Suggested Subtasks

Relevant Syntax

This is an example of using the pygal plotting library to create a bar chart and save it as an SVG file.

import pygal
my_bar_chart = pygal.Bar()
my_bar_chart.title = 'Browser usage evolution (in %)'
my_bar_chart.x_labels = map(str, range(2002, 2013))
my_bar_chart.add('Firefox', [None, None, 0, 16.6,   25,   31, 36.4, 45.5, 46.3, 42.8, 37.1])
my_bar_chart.add('Chrome',  [None, None, None, None, None, None,    0,  3.9, 10.8, 23.8, 35.3])
my_bar_chart.add('IE',      [85.8, 84.6, 84.7, 74.5,   66, 58.6, 54.7, 44.8, 36.2, 26.6, 20.1])
my_bar_chart.add('Others',  [14.2, 15.4, 15.3,  8.9,    9, 10.4,  8.9,  5.8,  6.7,  6.8,  7.5])
my_bar_chart.render_to_file('./example_bar_chart.svg')

This code instantiates, i.e. creates, a new Bar object called my_bar_chart. The object has attributes that you can assign values to, e.g. title and x_labels. It also has methods for adding datasets and for rendering the chart as an SVG file.

Note how Python's map function is used to elegantly turn a range of numbers into strings. In Python 3, map returns a lazy iterator rather than a list. To learn more about this, have a look at this tutorial.
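A small sketch of what map does on its own, using a shorter range than the chart example. The variable names are made up.

```python
numbers = range(2002, 2005)

# map applies str to every element; in Python 3 the result is a lazy map object.
as_strings = map(str, numbers)

# Passing it to list() consumes the map object and gives an actual list.
labels = list(as_strings)
print(labels)  # ['2002', '2003', '2004']

# A list comprehension produces the same result.
same_labels = [str(n) for n in numbers]
print(same_labels == labels)  # True
```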

Take a look at the other chart types in the pygal documentation to figure out how to turn this into a scatterplot without any lines connecting the dots.

Task 3 – A Line Function

Generate and draw a second dataset that follows a line with a slope of 10 and a y-intercept of 0.

The dataset should use the same x values as the dataset from the csv but have y-values so that: y = x * 10 + b where b = 0. The goal is to generate the y-values by passing a list of x-values to a function that you will write yourself.

Suggested Subtasks

Manually plot a line:

Create a line function:

Use the x values from the csv:

Relevant Syntax

The zip() function takes several iterables and pairs up their elements into tuples. In Python 3 it returns an iterator over these tuples rather than a list.

spam = ['spam', 'spamspam', 'spamspamspam', 'spamspamspamspam']
eggs = ['eggs', 'eggseggs', 'eggseggseggs', 'eggseggseggseggs']
ands = ['and', 'and', 'and', 'and']

spam_eggs = zip(spam, eggs)
spam_and_eggs = zip(spam, ands, eggs)
print(spam_eggs)
print(list(spam_eggs))
print(list(spam_and_eggs))

print("This loop won't print anything because spam_eggs was already consumed by list()")
for single_tuple in spam_eggs:
    print(single_tuple)
print("See? Nothing to see here?")

for single_tuple in zip(spam, ands, eggs):
    print(single_tuple)

for s, a, e in zip(spam, ands, eggs):
    print(e)
    # Using Python3's fancy new f-strings to insert variables into a string
    print(f"{s} {a} {e}")

Note how the first print only prints something like <zip object at 0x7f22561bec00> when you use Python 3. This is because zip() returns an iterator, which has to be consumed, for example by passing it to list() or by looping over it with a for loop.

Task 4 – Mean Squared Error

Write a function to calculate the mean squared error.

The mean squared error works by:

This should be implemented as a function that takes two lists of numbers (for example the y-values from the csv and the y-values generated by your line function) and returns the MSE as a single number.
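For reference, this is the standard definition of the mean squared error written as a formula. The symbols are my own choice: y_i stands for the i-th y-value from the data, \hat{y}_i for the corresponding y-value from the line, and n for the number of points.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

As a made-up example, the data values 1, 2, 3 compared with the line values 1, 3, 5 give the differences 0, -1, -2, the squares 0, 1, 4, and therefore an MSE of 5 / 3, or roughly 1.67.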

Suggested Subtasks

Relevant Syntax

Task 5 – Optimize the Slope

Find the a value that results in the lowest MSE with b = 0.

Generate y-values for 100 different slope values and determine which one has the lowest MSE when compared with the data from the csv. Keep b = 0 and try a values between 0 and 10 in 0.1 increments. Plot the data that has the lowest MSE.
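One stumbling block here is that range() only accepts integers. The sketch below shows one way of getting 0.1 steps and of picking the candidate with the lowest score. The toy_cost function is only a made-up stand-in for your MSE calculation. Note that going from 0 to 10 inclusive gives 101 values; range(100) would stop at 9.9 instead.

```python
# range() only works with integers, so scale an integer range down to 0.1 steps.
candidates = [i / 10 for i in range(101)]  # 0.0, 0.1, ..., 10.0


def toy_cost(a):
    """Made-up stand-in for the real MSE calculation: smallest at a = 3.3."""
    return (a - 3.3) ** 2


# min() with key= returns the candidate whose cost is the lowest.
best_a = min(candidates, key=toy_cost)

print(len(candidates))  # 101
print(best_a)           # 3.3
```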

Suggested Subtasks

Make the line function variable:

Calculate lots of MSE:

Determine the best MSE:

Plot the best fit:

Task 6 – Optimize the Intercept

Find the best combination of a and b.

In addition to a, now also vary b: try values for b between 0 and 10 in 0.1 increments.
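Trying every combination of a and b can be sketched with a nested comprehension, without any extra imports. As before, toy_cost is a made-up stand-in for the real MSE calculation and the names are only for illustration.

```python
steps = [i / 10 for i in range(101)]  # 0.0, 0.1, ..., 10.0


def toy_cost(a, b):
    """Made-up stand-in for the real MSE: smallest at a = 2.0, b = 0.5."""
    return (a - 2.0) ** 2 + (b - 0.5) ** 2


# Every combination of a and b: 101 * 101 = 10201 pairs.
pairs = [(a, b) for a in steps for b in steps]

# Pick the pair with the lowest cost and unpack it into two variables.
best_a, best_b = min(pairs, key=lambda pair: toy_cost(*pair))

print(len(pairs))      # 10201
print(best_a, best_b)  # 2.0 0.5
```

With the real MSE this means roughly ten thousand evaluations over the whole dataset, which is still quick for a small csv file.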

Suggested Subtasks

Things to Read About, Learn and Try