Guided Python Exercise

This exercise is inspired by part of a bootcamp assessment. With more guidance added, I thought it might make an interesting task for Python beginners. I hope the suggestions for advanced Python techniques to apply to the completed task also make it educational for intermediate Python users.

If you would like to do some basic exercises first, I would recommend the old Google Edu Python Class. It explains the basics of Python well and also comes with lecture videos. Sadly they haven't updated it for Python 3, but for the most part that should only mean that print 'hello' now has to be print('hello'). If you download their exercises (which I would recommend), you can run 2to3 -w ./ inside the exercise folder to convert the code to Python 3. The -w flag writes the changes back to the files; without it, 2to3 only prints a diff. Note that 2to3 no longer ships with Python 3.13 and later.

Task outline

Fit a line through a given set of datapoints.

The following equation can be used to draw a line on a two-dimensional plot. a controls the line's slope and b the point where it intersects the y-axis.

y = a * x + b

The goal is to adjust a and b so that the line approximates a set of points. The quality of the fit will be determined using the mean squared error.
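As a quick worked example of the equation, the sketch below plugs in arbitrary values for a, b and x. The numbers are chosen purely for illustration and have nothing to do with the exercise data.

```python
# Illustrative values only; they are not taken from the exercise data.
a = 2.0  # slope: y rises by 2 for every step of 1 in x
b = 1.0  # the value of y where the line crosses the y-axis (x = 0)
x = 3.0

y = a * x + b
print(y)  # 7.0
```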

Exercise Structure

The exercise is broken down as follows:

importing the csv file (Task 1)
plotting x,y data as a scatterplot (Task 2)
function for the line (Task 3)
function for mean squared error (Task 4)
brute-force optimize a and b (Tasks 5 and 6)
apply advanced techniques from the list at the end

Each task comes with a set of suggested subtasks. Feel free to read ahead and deviate; the subtasks are just one way of approaching the task.

The suggested syntax sections are both collections of hints and best-practice educational content. Feel free to copy and run the code to get started.

Use of Python Libraries

While the original task allows the use of numpy and/or pandas I would strongly recommend only using pure Python. The goal is to hone your Python skills, not to solve the problem as fast as possible by using data-science tools designed exactly for this purpose.

The only external library you need is pygal, which is used for plotting, so import pygal is the only import statement you need. I recommend completing all tasks without adding any further import statements.

Task 1 – Load Data

Get the x and y values from the file into a python data structure.

Download the csv file here

Example data structures you could use:

Two lists: [x1, x2, x3, ...], [y1, y2, y3, ...]
List of tuples: [(x1, y1), (x2, y2), ...]
List of dictionaries: [{'x': x1, 'y': y1}, {'x': x2, 'y': y2}, ...]

Which data structure you use is up to you. It is relatively easy to convert from one to the other, and writing conversion functions is also a useful skill to practice.
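As a rough sketch of what such conversions can look like, the snippet below moves a few made-up values between the three structures. The variable names are only for illustration, and it uses zip() and list comprehensions, which are introduced further down; plain for-loops work just as well.

```python
# Made-up sample values, used only to show the conversions.
xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]

# Two lists -> list of tuples
points = list(zip(xs, ys))

# List of tuples -> list of dictionaries
records = [{'x': x, 'y': y} for x, y in points]

# List of dictionaries -> two lists again
xs_again = [record['x'] for record in records]
ys_again = [record['y'] for record in records]

print(points)
print(records)
```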

Suggested Subtasks

print every line in the csv
print every line except the one containing x,y
save the x and y values into a data structure of your choosing
make sure that the x and y are floats and not strings

Relevant Syntax

Reading Files

The open function returns a file handle pointing at a file. You can iterate over it in a for loop, just like you would over e.g. range(42). When iterating over it, you get one line at a time as a single string, including the line break at its end.

with open('./example_data.csv') as filehandle_to_our_csv:
    for line in filehandle_to_our_csv:
        print(line)

File handles should be closed when you are done with them. This can be done by calling their close() method. Here we instead use a with ... as ...: block, which automatically closes the file once we leave the block.
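To see the difference between the two ways of closing a file, here is a small self-contained sketch. It first writes a throwaway file (the name demo.csv and its contents are made up for this demonstration) and then checks the closed attribute of each file handle.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(path, 'w') as out:
    out.write('x,y\n1,2\n')

# Manual version: we are responsible for calling close() ourselves.
handle = open(path)
first_line = handle.readline()
handle.close()
print(handle.closed)  # True

# with-version: the file is closed as soon as the block ends.
with open(path) as handle_in_with:
    lines = list(handle_in_with)
print(handle_in_with.closed)  # True
```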

Other Syntax you might need

.split(',') to split a csv line at the commas
.strip() to remove line break characters (\n)
float() to turn strings into floating-point numbers
continue to skip the first line of the csv, which contains the headers
.startswith() to check what a line (i.e. a string) starts with
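The following sketch tries these building blocks on a single made-up line instead of a real file, so you can see what each call returns. Wiring them into the file-reading loop is left to you.

```python
# A made-up line, shaped like one row of a two-column csv file.
line = '1.5,20.25\n'

stripped = line.strip()       # '1.5,20.25' (trailing '\n' removed)
parts = stripped.split(',')   # ['1.5', '20.25'] (still strings)
x = float(parts[0])           # 1.5
y = float(parts[1])           # 20.25
print(x + y)                  # 21.75

header = 'x,y\n'
print(header.startswith('x'))  # True

# continue skips the rest of the current loop iteration.
for text in [header, line]:
    if text.startswith('x'):
        continue
    print(text.strip())       # only prints '1.5,20.25'
```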

Task 2 – Scatterplot

Plot the x and y values as a scatterplot. You should end up with a PNG, SVG or similar file containing the scatterplot.

Ideally/optionally the plot should have:

a title of your choosing
a legend indicating which color/shape of dots belongs to which set of data
(for now there is only one set of data)

Suggested Subtasks

plot the example bar chart
modify the code so that it is a scatterplot
modify the code to plot the data from task one instead

Relevant Syntax

This is an example of using the Pygal plotting library to create a bar chart and save it as an SVG file.

import pygal
my_bar_chart = pygal.Bar()
my_bar_chart.title = 'Browser usage evolution (in %)'
my_bar_chart.x_labels = map(str, range(2002, 2013))
my_bar_chart.add('Firefox', [None, None, 0, 16.6,   25,   31, 36.4, 45.5, 46.3, 42.8, 37.1])
my_bar_chart.add('Chrome',  [None, None, None, None, None, None,    0,  3.9, 10.8, 23.8, 35.3])
my_bar_chart.add('IE',      [85.8, 84.6, 84.7, 74.5,   66, 58.6, 54.7, 44.8, 36.2, 26.6, 20.1])
my_bar_chart.add('Others',  [14.2, 15.4, 15.3,  8.9,    9, 10.4,  8.9,  5.8,  6.7,  6.8,  7.5])
my_bar_chart.render_to_file('./example_bar_chart.svg')

This code instantiates, i.e. creates, a new Bar object called my_bar_chart. This object has attributes that you can assign values to, e.g. title and x_labels. It also has methods for adding datasets and rendering the chart as an SVG.

Note how Python's map function is used to elegantly turn a range of numbers into strings. In Python 3, map returns an iterator that produces the strings on demand rather than a list. To learn more about this, have a look at this tutorial.
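Here is a small sketch of what map does on its own, using a shorter range than the chart example. It also shows that the result has to be passed to list() if you want to look at all values at once.

```python
numbers = range(2002, 2005)

labels = map(str, numbers)
print(labels)        # a map object, not yet a list

label_list = list(labels)
print(label_list)    # ['2002', '2003', '2004']

# The map object is now used up, so a second list() call comes back empty.
print(list(labels))  # []
```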

Take a look at the other chart types in the pygal documentation to figure out how to turn this into a scatterplot without any lines connecting the dots.

Task 3 – A Line Function

Generate and draw a second dataset that follows a line with a slope of 10 and a y-intercept of 0.

The dataset should use the same x values as the dataset from the csv but have y-values so that: y = x * 10 + b where b = 0. The goal is to generate the y-values by passing a list of x-values to a function that you will write yourself.

Suggested Subtasks

Manually plot a line:

manually add a dataset to the plot with two points, one for x=0 and one for x=20
(calculate y by hand)
tell the plot to draw a line connecting the points
(you can pass stroke=True to an individual .add() call on your pygal chart object)

Create a line function:

write a python function that takes a single x value, calculates y and returns it
(it can help to do this in a separate file before integrating it with your existing code)
modify the function to take a list of x values and return a list of y values

Use the x values from the csv:

extract the x values from the csv into a list by e.g.:
- copy pasting and modifying your existing csv extraction code
- iterating over the list of tuples you are putting into the pygal chart to create a new list
generate a list of y values from the x values using the function you wrote
combine the x and y lists into a new list of tuples with the zip() function
plot the dataset together with the dataset from the csv

Relevant Syntax

The zip() function takes several iterables and returns an iterator of tuples.

spam = ['spam', 'spamspam', 'spamspamspam', 'spamspamspamspam']
eggs = ['eggs', 'eggseggs', 'eggseggseggs', 'eggseggseggseggs']
ands = ['and', 'and', 'and', 'and']

spam_eggs = zip(spam, eggs)
spam_and_eggs = zip(spam, ands, eggs)
print(spam_eggs)
print(list(spam_eggs))
print(list(spam_and_eggs))

print("This loop won't print anything because spam_eggs was already consumed by list()")
for single_tuple in spam_eggs:
    print(single_tuple)
print("See? Nothing to see here?")

for single_tuple in zip(spam, ands, eggs):
    print(single_tuple)

for s, a, e in zip(spam, ands, eggs):
    print(e)
    # Using Python3's fancy new f-strings to insert variables into a string
    print(f"{s} {a} {e}")

Note how the first print only prints something like <zip object at 0x7f22561bec00> when you use Python 3. This is because zip() returns an iterator, which has to be consumed, for example by passing it to list() or by looping over it in a for loop.

Task 4 – Mean Squared Error

Write a function to calculate the mean squared error.

The mean squared error works by:

subtracting the two y-values that belong to the same x-value from each other (e.g. the y-value from the csv and the y-value of the line)
squaring the differences so that all values are positive
(this also punishes large differences more)
summing the squared differences and dividing by the number of point pairs you are comparing

This should be implemented as a function that takes two lists of numbers and returns the MSE (a single number).
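The steps above can be traced on three made-up point pairs. The sketch below deliberately spells out every step with plain loops and does not wrap them in a function; the names measured and predicted are only illustrative, and turning this into a reusable mse function is still up to you.

```python
# Made-up y-values for three points that share the same x-values.
measured = [1.0, 2.0, 3.0]
predicted = [1.0, 3.0, 5.0]

squared_differences = []
for m, p in zip(measured, predicted):
    difference = m - p                           # 0.0, -1.0, -2.0
    squared_differences.append(difference ** 2)  # 0.0, 1.0, 4.0

total = 0.0
for value in squared_differences:
    total = total + value                        # ends up at 5.0

mean_squared_error = total / len(squared_differences)
print(mean_squared_error)                        # 5 / 3, about 1.67
```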

Suggested Subtasks

write an mse function that takes two floats and returns their difference, i.e. subtracts one from the other
extend mse to take two lists of floats and return a list of their differences
have mse return the sum of all differences instead of a list of them
square the diffs before summing them
divide the sum by the number of point pairs

Relevant Syntax

** raises a number to a power, e.g. 5**2 returns 25
len() counts how many items are in a list or how many characters are in a string

Task 5 – Optimize the Slope

Find the a value that results in the lowest MSE with b = 0.

Generate y-values for 100 different slope values and determine which one has the lowest MSE when compared with the data from the csv. Keep b = 0 and try a values between 0 and 10 in 0.1 increments. Plot the data that has the lowest MSE.

Suggested Subtasks

Make the line function variable:

extend the line generating function so that it takes a list and a float as input
use the float passed to the function to set the slope i.e. the a variable

Calculate lots of MSE:

a for-loop that prints the numbers 0.1 through 0.5 in 0.1 steps
print y-values using these numbers as slopes
calculate and print the MSE for each set of y-values
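One stumbling block with the first subtask is that range() only accepts whole numbers. A possible workaround, sketched below, is to count in integers and divide afterwards. The second half shows why repeatedly adding 0.1 is less reliable.

```python
# range() only accepts integers, so range(0, 1, 0.1) raises a TypeError.
# One workaround: count in whole numbers and divide afterwards.
slopes = []
for i in range(1, 6):
    slopes.append(i / 10)
print(slopes)  # [0.1, 0.2, 0.3, 0.4, 0.5]

# Repeatedly adding 0.1 instead accumulates small rounding errors.
accumulated = 0.0
for _ in range(3):
    accumulated += 0.1
print(accumulated == 0.3)  # False
```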

Determine the best MSE:

create a best_mse variable and set it to 1000
in the for loop, update best_mse if the current MSE is better i.e. lower
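The "remember the best value so far" pattern can be sketched with a stand-in function. Here cost() is a hypothetical placeholder for your MSE calculation. As a variation on the subtask above, the sketch starts from float('inf') instead of 1000, which avoids having to guess a starting value that is larger than every MSE you will see.

```python
def cost(value):
    # Hypothetical stand-in for an MSE calculation; smallest at value == 3.
    return (value - 3) ** 2

best_cost = float('inf')  # any real cost is lower than infinity
best_value = None

for candidate in range(0, 7):
    current_cost = cost(candidate)
    if current_cost < best_cost:
        # Found a better candidate: remember both the cost and the value.
        best_cost = current_cost
        best_value = candidate

print(best_value, best_cost)  # 3 0
```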

Plot the best fit:

generate 100 slope values and print the resulting MSE
print only the best slope value
draw the line for the best slope value in the scatterplot

Task 6 – Optimize Intersection

Find the best combination of a and b.

In addition to a, now also try values for b between 0 and 10 in 0.1 increments.

Suggested Subtasks

nested for-loop that iterates through all slope-intersect combinations
print the best slope-intersect combination
plot the best slope-intersect combination

Things to Read About, Learn and Try

identify areas of your code that could be isolated into functions
add a main() function and call it from an if __name__ == '__main__': block
sum() could be useful in your mse function
map to generate y-values from x-values without using for-loops
csv.DictReader to ingest the data
list/dict comprehensions to replace for-loops
test your functions with pytest
replace the nested for loops in the optimization by using itertools.product() (docs, tutorial)
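For the last item, the standard-library function that pairs every a with every b is itertools.product(). The sketch below compares it with a nested loop on a deliberately tiny grid. The cost() function is a hypothetical stand-in for the MSE of a given a and b, and using min() with a key is just one possible way to pick the winner.

```python
import itertools

def cost(a, b):
    # Hypothetical stand-in for the MSE; smallest at a == 0.2, b == 0.1.
    return (a - 0.2) ** 2 + (b - 0.1) ** 2

a_values = [i / 10 for i in range(4)]  # 0.0, 0.1, 0.2, 0.3
b_values = [i / 10 for i in range(3)]  # 0.0, 0.1, 0.2

# Nested version: collect every (a, b) combination.
nested_pairs = []
for a in a_values:
    for b in b_values:
        nested_pairs.append((a, b))

# product() yields the same pairs in the same order with a single call.
product_pairs = list(itertools.product(a_values, b_values))
print(nested_pairs == product_pairs)  # True

# Pick the pair with the lowest cost.
best_a, best_b = min(product_pairs, key=lambda pair: cost(*pair))
print(best_a, best_b)  # 0.2 0.1
```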