Guided Python Exercise
This exercise is inspired by part of a bootcamp assessment. With some added guidance, I thought it might make an interesting task for Python beginners. I also added suggestions for advanced Python techniques to apply to the completed task, which I hope makes it educational for intermediate Python users as well.
If you would like to do some basic exercises first, I would recommend the old Google Edu Python Class.
It explains the basics of Python well and also comes with lecture videos.
Sadly they haven't updated it for Python 3, but that should only mean that print 'hello' now has to be print('hello').
If you download their exercises (which I would recommend), you can simply run 2to3 -w ./ inside the exercise folder to convert the code to Python 3 (without -w, 2to3 only prints the changes it would make).
Task outline
Fit a line through a given set of datapoints.
The following equation can be used to draw a line on a two-dimensional plot:
y = a * x + b
Here a controls the line's slope and b the place where it intersects the y-axis.
The goal is to adjust a and b so that the line approximates a set of points.
The quality of the fit will be determined using the mean squared error.
Exercise Structure
The exercise is broken down as follows:
- importing the csv-file (Task 1)
- plotting x,y data as a scatterplot (Task 2)
- function for the line (Task 3)
- function for mean squared error (Task 4)
- brute force optimize a and b (Tasks 5 and 6)
- apply advanced techniques from the list at the end
Each task comes with a set of suggested subtasks. Feel free to read ahead and deviate; the subtasks are only one way of approaching the task.
The suggested syntax sections are both collections of hints and best-practice educational content. Feel free to copy and run the code to get started.
Use of Python Libraries
While the original task allows the use of numpy and/or pandas, I would strongly recommend only using pure Python.
The goal is to hone your Python skills, not to solve the problem as fast as possible by using data-science tools designed exactly for this purpose.
The only external library you need is pygal for plotting things, so the only import you need is pygal.
I recommend completing all tasks without adding any additional import statements.
Task 1 – Load Data
Get the x and y values from the file into a Python data structure.
Download the csv file here
Example data structures you could use:
- Two lists: [x1, x2, x3, ...], [y1, y2, y3, ...]
- List of tuples: [(x1, y1), (x2, y2), ...]
- List of dictionaries: [{'x': x1, 'y': y1}, {'x': x2, 'y': y2}, ...]
Which data structure you use is up to you. It is relatively easy to convert from one to the other, and writing conversion functions is also a useful skill to practice.
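To give an idea of what such conversion functions could look like, here is a minimal sketch. The numbers are made up for illustration and are not taken from the csv, and the variable names are just my suggestions.

```python
# Hypothetical sample values, not taken from the exercise csv.
xs = [1.0, 2.0, 3.0]
ys = [2.5, 4.1, 6.2]

# Two lists -> list of tuples
pairs = []
for index in range(len(xs)):
    pairs.append((xs[index], ys[index]))

# List of tuples -> list of dictionaries
records = []
for x, y in pairs:
    records.append({'x': x, 'y': y})

# List of dictionaries -> back to two lists
xs_again = []
ys_again = []
for record in records:
    xs_again.append(record['x'])
    ys_again.append(record['y'])

print(pairs)    # [(1.0, 2.5), (2.0, 4.1), (3.0, 6.2)]
print(records)
```

Going around the full circle and ending up with the original lists is a quick way to check that each conversion does what you expect.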
Suggested Subtasks
- print every line in the csv
- print every line except the one containing x,y
- save the x and y values into a data structure of your choosing
- make sure that the x and y are floats and not strings
Relevant Syntax
Reading Files
The open function returns a filehandle pointing at a file.
This can be iterated over in a for loop, just like you would e.g. use range(42).
When iterating over it, you get one line at a time as a single string.
with open('./example_data.csv') as filehandle_to_our_csv:
    for line in filehandle_to_our_csv:
        print(line)
Filehandles should be closed when you are done with them.
This can be done by calling their close() method.
Here we are instead using a with ... as ...: statement, which automatically closes the file once we leave the with block.
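To make the difference between the two approaches concrete, here is a small sketch that reads the same file once with a manual close() and once with a with block. It writes its own throwaway file (the name demo.txt is just an assumption for this example) so that it runs without the csv.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on the csv.
temp_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(temp_path, 'w') as output_handle:
    output_handle.write('first line\nsecond line\n')

# Manual version: we have to remember to call close() ourselves.
manual_handle = open(temp_path)
manual_lines = []
try:
    for line in manual_handle:
        manual_lines.append(line)
finally:
    manual_handle.close()

# with-version: the file is closed as soon as the block is left.
managed_lines = []
with open(temp_path) as managed_handle:
    for line in managed_handle:
        managed_lines.append(line)

print(manual_handle.closed)   # True
print(managed_handle.closed)  # True
```

Both versions read the same lines, but the with version is shorter and cannot forget the close() call.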
Other Syntax you might need
- .split() to split the lines of the csv
- .strip() to remove linebreak characters (\n)
- float() to turn strings into float numbers
- continue to skip the first line of the csv containing headers
- .startswith() to check what a line i.e. string starts with
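Here is a minimal sketch of how these pieces could fit together. To keep it self-contained it works on a hard-coded list of strings instead of a file; the values are made up and only mimic what iterating over a csv might give you.

```python
# Hypothetical lines, shaped like what iterating over a csv file returns.
raw_lines = ['x,y\n', '1.0,2.5\n', '2.0,4.1\n']

points = []
for line in raw_lines:
    # Skip the header line.
    if line.startswith('x'):
        continue
    # Remove the trailing linebreak, then split at the comma.
    cleaned = line.strip()
    parts = cleaned.split(',')
    # Convert both strings to floats and store them as a tuple.
    points.append((float(parts[0]), float(parts[1])))

print(points)  # [(1.0, 2.5), (2.0, 4.1)]
```

This stores the data as a list of tuples, but the same loop works for any of the data structures listed in this task.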
Task 2 – Scatterplot
Plot the x and y values as a scatterplot. You should end up with a PNG, SVG or similar file containing the scatterplot.
Ideally/optionally the plot should have:
- a title of your choosing
- a legend indicating which color/shape of dots belongs to which set of data
(for now there is only one set of data)
Suggested Subtasks
- plot the example bar chart
- modify the code so that it is a scatterplot
- modify the code to plot the data from task one instead
Relevant Syntax
This is an example of using the Pygal plotting library to create a bar chart and save it as an SVG file.
import pygal
my_bar_chart = pygal.Bar()
my_bar_chart.title = 'Browser usage evolution (in %)'
my_bar_chart.x_labels = map(str, range(2002, 2013))
my_bar_chart.add('Firefox', [None, None, 0, 16.6, 25, 31, 36.4, 45.5, 46.3, 42.8, 37.1])
my_bar_chart.add('Chrome', [None, None, None, None, None, None, 0, 3.9, 10.8, 23.8, 35.3])
my_bar_chart.add('IE', [85.8, 84.6, 84.7, 74.5, 66, 58.6, 54.7, 44.8, 36.2, 26.6, 20.1])
my_bar_chart.add('Others', [14.2, 15.4, 15.3, 8.9, 9, 10.4, 8.9, 5.8, 6.7, 6.8, 7.5])
my_bar_chart.render_to_file('./example_bar_chart.svg')
This code instantiates, i.e. creates, a new Bar object called my_bar_chart.
This object has data fields that you can assign values to, e.g. title and x_labels.
It also has functions for adding datasets and rendering the object as an SVG.
Note how Python's map function is used to elegantly turn a range of numbers into strings.
To learn more about this, have a look at this tutorial.
Take a look at the other chart types in the pygal documentation to figure out how to turn this into a scatterplot without any lines connecting the dots.
Task 3 – A Line Function
Generate and draw a second dataset that follows a line with a slope of 10 and intersect of 0.
The dataset should use the same x values as the dataset from the csv but have y-values so that y = x * 10 + b, where b = 0.
The goal is to generate the y-values by passing a list of x-values to a function that you will write yourself.
Suggested Subtasks
Manually plot a line:
- manually add a dataset to the plot with two points, one for x=0 and one for x=20 (calculate y by hand)
- tell the plot to draw a line connecting the points (you can add stroke=True to an individual .add() call of your pygal chart object)
Create a line function:
- write a Python function that takes a single x value, calculates y and returns it (it can help to do this in a separate file before integrating it with your existing code)
- modify the function to take a list of x values and return a list of y values
Use the x values from the csv:
- extract the x values from the csv into a list by e.g.:
  - copy pasting and modifying your existing csv extraction code
  - iterating over the list of tuples you are putting into the pygal chart to create a new list
- generate a list of y values from the x values using the function you wrote
- combine the two x and y lists into a new list of tuples with the zip() function
- plot the dataset together with the dataset from the csv
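If you get stuck on the line function, the following sketch shows one possible shape for it. The function names line_y and line_ys and the example x values are my own picks for illustration; your solution may well look different.

```python
def line_y(x, slope=10, intersect=0):
    """Return the y value of the line y = slope * x + intersect."""
    return slope * x + intersect


def line_ys(x_values, slope=10, intersect=0):
    """Return a list with one y value for every x value."""
    y_values = []
    for x in x_values:
        y_values.append(line_y(x, slope, intersect))
    return y_values


# Hypothetical x values; in the exercise these come from the csv.
example_xs = [0, 5, 20]
example_ys = line_ys(example_xs)

# Combine both lists into (x, y) tuples that can be handed to the chart.
line_points = list(zip(example_xs, example_ys))
print(line_points)  # [(0, 0), (5, 50), (20, 200)]
```

Keeping the single-value function around and calling it from the list version means the formula only lives in one place.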
Relevant Syntax
The zip() function takes several iterables and returns an iterator of tuples.
spam = ['spam', 'spamspam', 'spamspamspam', 'spamspamspamspam']
eggs = ['eggs', 'eggseggs', 'eggseggseggs', 'eggseggseggseggs']
ands = ['and', 'and', 'and', 'and']
spam_eggs = zip(spam, eggs)
spam_and_eggs = zip(spam, ands, eggs)
print(spam_eggs)
print(list(spam_eggs))
print(list(spam_and_eggs))
print("This loop won't print anything because spam_eggs was already consumed by list()")
for single_tuple in spam_eggs:
    print(single_tuple)
print("See? Nothing to see here?")
for single_tuple in zip(spam, ands, eggs):
    print(single_tuple)
for s, a, e in zip(spam, ands, eggs):
    print(e)
    # Using Python3's fancy new f-strings to insert variables into a string
    print(f"{s} {a} {e}")
Note how the first print only prints <zip object at 0x7f22561bec00> when you use Python 3.
This is because zip() returns an iterator, which has to be consumed, for example by passing it into list() or by looping over it in a for loop.
Task 4 – Mean Squared Error
Write a function to calculate the mean squared error.
The mean squared error works by:
- subtracting two values for the same x-value from each other
- squaring the differences so that all values are positive (this also punishes large differences more)
- summing all of this and dividing by the number of point pairs you are comparing
This should be implemented as a function that takes two lists of numbers and returns the MSE (a single number).
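Written as a formula, the three steps above could look like this. The symbols are my own choice and do not appear in the original task: y_i is the y-value from the csv, \hat{y}_i is the y-value of the line at the same x-value, and n is the number of point pairs.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Because the difference is squared, it does not matter which of the two values is subtracted from the other.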
Suggested Subtasks
- write an mse function that takes two floats and returns their difference i.e. subtracts them
- extend mse to take two lists of floats and return a list of their differences
- have mse return the sum of all differences instead of a list of them
- square the diffs before summing them
- divide the sum by the number of point pairs
Relevant Syntax
- ** lets you raise a number to a power, e.g. 5**2 will return 25
- len() counts how many items are in a list or how many characters are in a string
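For comparison with your own solution, here is one way the finished function could look. This is only a sketch: the example numbers are made up, and it assumes that both lists have the same length and are not empty (an empty list would cause a division by zero).

```python
def mse(values_a, values_b):
    """Return the mean squared error between two equally long lists."""
    total = 0
    for a, b in zip(values_a, values_b):
        # Squaring makes every difference positive.
        total += (a - b) ** 2
    return total / len(values_a)


# Hypothetical values, not taken from the csv.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 5.0, 1.0]

# Squared differences are 1, 0, 4 and 9, so the mean is 14 / 4.
print(mse(measured, predicted))  # 3.5
```

An MSE of 0 means that both lists are identical, which makes for an easy first test of your function.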
Task 5 – Optimize the Slope
Find the a value that results in the lowest MSE with b = 0.
Generate y-values for 100 different slope values and determine which one has the lowest MSE when compared with the data from the csv.
Keep b = 0 and try a values between 0 and 10 in 0.1 increments.
Plot the data that has the lowest MSE.
Suggested Subtasks
Make the line function variable:
- extend the line generating function so that it takes a list and a float as input
- use the float passed to the function to set the slope i.e. the a variable
Calculate lots of MSE:
- for-loop that prints the numbers 0.1 through 0.5 in 0.1 steps
- print y-values using these numbers as slopes
- calculate and print the MSE for each set of y-values
Determine the best MSE:
- create a best_mse variable and set it to 1000
- in the for loop, update best_mse if the current MSE is better i.e. lower
Plot the best fit:
- generate 100 slope values and print the resulting MSE
- print only the best slope value
- draw the line of the best slope value in the scatterplot
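The search loop described in these subtasks could be sketched as follows. The data is made up (it roughly follows y = 3 * x) and the helper names are my own. Two choices deviate slightly from the subtasks and are assumptions on my part: the slopes are generated as step / 10 from whole numbers, because repeatedly adding 0.1 accumulates floating point errors, and best_mse starts at float('inf') instead of 1000 so that any real MSE is lower.

```python
def line_ys(x_values, slope):
    """Return the y values of the line y = slope * x (b is kept at 0)."""
    y_values = []
    for x in x_values:
        y_values.append(slope * x)
    return y_values


def mse(values_a, values_b):
    """Return the mean squared error between two equally long lists."""
    total = 0
    for a, b in zip(values_a, values_b):
        total += (a - b) ** 2
    return total / len(values_a)


# Hypothetical data that roughly follows y = 3 * x.
data_x = [1.0, 2.0, 3.0, 4.0]
data_y = [3.1, 5.9, 9.2, 11.8]

best_slope = None
best_mse = float('inf')
# range(101) covers 0.0 to 10.0 including both ends.
for step in range(101):
    slope = step / 10
    current_mse = mse(data_y, line_ys(data_x, slope))
    if current_mse < best_mse:
        best_mse = current_mse
        best_slope = slope

print(best_slope)  # 3.0
```

The y values for the winning slope can then be generated once more with line_ys and added to the chart.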
Task 6 – Optimize Intersection
Find the best combination of a and b.
In addition to a, now also try values for b between 0 and 10 in 0.1 increments.
Suggested Subtasks
- nested for-loop that iterates through all slope-intersect combinations
- print the best slope-intersect combination
- plot the best slope-intersect combination
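A nested loop over both parameters could be sketched like this. The data points are made up and lie exactly on y = 2 * x + 1, so the expected result is known in advance. The function name mse_for_line is my own, and generating the candidates as step / 10 is an assumption I make to avoid floating point drift.

```python
# Hypothetical data lying exactly on y = 2 * x + 1.
data_x = [0.0, 1.0, 2.0, 3.0]
data_y = [1.0, 3.0, 5.0, 7.0]


def mse_for_line(slope, intersect):
    """Return the MSE between the data and the line y = slope * x + intersect."""
    total = 0
    for x, y in zip(data_x, data_y):
        total += (y - (slope * x + intersect)) ** 2
    return total / len(data_x)


best_combination = None
best_mse = float('inf')
for slope_step in range(101):
    for intersect_step in range(101):
        slope = slope_step / 10
        intersect = intersect_step / 10
        current_mse = mse_for_line(slope, intersect)
        if current_mse < best_mse:
            best_mse = current_mse
            best_combination = (slope, intersect)

print(best_combination)  # (2.0, 1.0)
```

Note that this already means 101 * 101 = 10201 MSE calculations, which shows why brute force only works for a small number of parameters.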
Things to Read About, Learn and Try
- identify areas of your code that could be isolated into functions
- add a main
- sum() could be useful in your mse function
- map to generate y-values from x-values without using for-loops
- csv.DictReader to ingest the data
- list/dict comprehensions to replace for-loops
- test your functions with pytest
- replace the nested for loops in the optimization by using itertools.product() (docs, tutorial)
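To give a taste of how several of these techniques combine, here is a compact sketch of the whole optimization. It is not meant as the reference solution. The csv content is made up and fed in through io.StringIO, which stands in for an opened file. For the grid I use itertools.product, on the assumption that every slope-intersect pair should be tried, since that is the itertools function that mirrors a nested for loop. Using min() with a key function is also my own choice.

```python
import csv
import io
import itertools

# Hypothetical csv content; io.StringIO stands in for an opened file.
csv_text = 'x,y\n0,1\n1,3\n2,5\n3,7\n'
reader = csv.DictReader(io.StringIO(csv_text))

# List comprehension: one (x, y) tuple of floats per csv row.
points = [(float(row['x']), float(row['y'])) for row in reader]


def mse_for_line(slope, intersect):
    """Return the MSE between the points and y = slope * x + intersect."""
    squared_diffs = [(y - (slope * x + intersect)) ** 2 for x, y in points]
    return sum(squared_diffs) / len(squared_diffs)


# Candidate values 0.0, 0.1, ..., 10.0 for both parameters.
candidates = [step / 10 for step in range(101)]

# itertools.product yields every (slope, intersect) pair, like a nested loop.
best_slope, best_intersect = min(
    itertools.product(candidates, candidates),
    key=lambda pair: mse_for_line(pair[0], pair[1]),
)

print(best_slope, best_intersect)  # 2.0 1.0
```

Compared with the nested loop version there is no best_mse bookkeeping left, because min() keeps track of the best candidate for you.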