# Guided Python Exercise

This exercise is inspired by part of a bootcamp assessment. With some added guidance, I thought it might make an interesting task for Python beginners. The suggestions for advanced Python techniques to apply to the completed task will hopefully make it educational for intermediate Python users as well.

If you would like to do some basic exercises first, I would recommend the old Google Edu Python Class.
It explains the basics of Python well and also comes with lecture videos.
Sadly they haven't updated it for Python 3, but that should only mean that `print 'hello'` now has to be `print('hello')`.
If you download their exercises (which I would recommend) you can simply run `2to3 -w ./` inside the exercise folder to convert the code to Python 3 (without the `-w` flag, `2to3` only prints the changes it would make).

## Task outline

Fit a line through a given set of datapoints.

The following equation can be used to draw a line on a two-dimensional plot.
`a` controls the line's slope and `b` the place where it intersects the y-axis.

`y = a * x + b`

The goal is to adjust `a` and `b` so that the line approximates a set of points.
The quality of the fit will be determined using the mean squared error.
The quality of the fit will be determined using the mean squared error.

### Exercise Structure

The exercise is broken down as follows:

- importing the csv-file (Task 1)
- plotting x,y data as a scatterplot (Task 2)
- function for the line (Task 3)
- function for mean squared error (Task 4)
- brute force optimize `a` and `b` (Tasks 5 and 6)
- apply advanced techniques from the list at the end

Each task comes with a set of suggested subtasks. Feel free to read ahead and deviate; the subtasks are just one way of approaching the task.

The suggested syntax sections are both collections of hints and best-practice educational content. Feel free to copy and run the code to get started.

### Use of Python Libraries

While the original task allows the use of `numpy` and/or `pandas`, I would strongly recommend only using pure Python.
The goal is to hone your Python skills, not to solve the problem as fast as possible by using data-science tools designed exactly for this purpose.

The only external library you need is `pygal` for plotting things.
So the only `import` you need is `pygal`.
I recommend completing all tasks without adding any additional `import` statements.

## Task 1 – Load Data

Get the `x` and `y` values from the file into a Python data structure.

Download the csv file here

Example data structures you could use:

- Two lists: `[x1, x2, x3, ...], [y1, y2, y3, ...]`
- List of tuples: `[(x1, y1), (x2, y2), ...]`
- List of dictionaries: `[{'x': x1, 'y': y1}, {'x': x2, 'y': y2}, ...]`

Which data structure is up to you. It is relatively easy to convert from one to the other and writing conversion functions is also a useful skill to practice.
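As a rough illustration of such conversions, here is a minimal sketch using plain for-loops. The numbers and variable names are made up for the example and are not taken from the csv.

```python
# Made-up sample values, not the contents of the exercise csv.
xs = [1.0, 2.0, 3.0]
ys = [10.5, 19.0, 31.2]

# Two lists -> list of tuples
pairs = []
for index in range(len(xs)):
    pairs.append((xs[index], ys[index]))

# List of tuples -> list of dictionaries
records = []
for x, y in pairs:
    records.append({'x': x, 'y': y})

# List of dictionaries -> back to a plain list of x values
xs_again = []
for record in records:
    xs_again.append(record['x'])

print(pairs)
print(records)
print(xs_again)
```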

### Suggested Subtasks

- print every line in the csv
- print every line except the one containing `x,y`
- save the x and y values into a data structure of your choosing
- make sure that the x and y are floats and not strings

### Relevant Syntax

#### Reading Files

The `open` function returns a `filehandle` pointing at a file.
This can be used as an `iterator`, just like you would e.g. use `range(42)`.
When iterating over it, you get one line at a time as a single `string`.

```
with open('./example_data.csv') as filehandle_to_our_csv:
    # Each iteration yields one line of the file as a string
    for line in filehandle_to_our_csv:
        print(line)
```

File handles should be closed when you are done with them.
This can be done by calling their `close()` method.
Here we are instead using a `with ... as ... :` statement, which automatically closes the file handle once we leave the `with` block.
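To make the difference concrete, here is a small sketch comparing the two approaches. The file name `with_demo.txt` is a throwaway name invented for this example; the snippet writes the file itself so that it can be run on its own.

```python
# Write a tiny throwaway file so the example is self-contained.
# ('./with_demo.txt' is a made-up name, not part of the exercise.)
with open('./with_demo.txt', 'w') as demo_file:
    demo_file.write('first line\nsecond line\n')

# Manual version: you are responsible for calling close() yourself.
manual_handle = open('./with_demo.txt')
first_line = manual_handle.readline()
manual_handle.close()
print(manual_handle.closed)   # True

# with-version: the file is closed automatically when the block ends.
with open('./with_demo.txt') as auto_handle:
    lines = auto_handle.readlines()
print(auto_handle.closed)     # True
```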

#### Other Syntax you might need

- `.split()` to split the lines of csv
- `.strip()` to remove linebreak characters (`\n`)
- `float()` to turn strings into float numbers
- `continue` to skip the first line of the csv containing headers
- `.startswith()` to check what a line i.e. string starts with
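The snippet below tries each of these on a hard-coded string rather than on the real file. The string and the word list are made-up stand-ins, so treat it as a sketch of the syntax and not as a solution to the task.

```python
# A hard-coded stand-in for one line read from a file (made-up values).
line = '3.5,42.0\n'

print(line.startswith('x'))   # False, this is not a header line
cleaned = line.strip()        # trailing linebreak removed
parts = cleaned.split(',')    # a list of two strings
number = float(parts[0])      # the first string turned into a float

print(cleaned)
print(parts)
print(number)

# `continue` jumps straight to the next loop iteration.
kept_words = []
for word in ['skip me', 'keep me']:
    if word.startswith('skip'):
        continue
    kept_words.append(word)
print(kept_words)
```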

## Task 2 – Scatterplot

Plot the x and y values as a scatterplot. You should end up with a PNG, SVG or similar file containing the scatterplot.

Ideally/optionally the plot should have:

- a title of your choosing
- a legend indicating which color/shape of dots belongs to which set of data (for now there is only one set of data)

### Suggested Subtasks

- plot the example bar chart
- modify the code so that it is a scatterplot
- modify the code to plot the data from task one instead

### Relevant Syntax

This is an example of using the Pygal plotting library to create a bar chart and save it as an SVG file.

```
import pygal
my_bar_chart = pygal.Bar()
my_bar_chart.title = 'Browser usage evolution (in %)'
my_bar_chart.x_labels = map(str, range(2002, 2013))
my_bar_chart.add('Firefox', [None, None, 0, 16.6, 25, 31, 36.4, 45.5, 46.3, 42.8, 37.1])
my_bar_chart.add('Chrome', [None, None, None, None, None, None, 0, 3.9, 10.8, 23.8, 35.3])
my_bar_chart.add('IE', [85.8, 84.6, 84.7, 74.5, 66, 58.6, 54.7, 44.8, 36.2, 26.6, 20.1])
my_bar_chart.add('Others', [14.2, 15.4, 15.3, 8.9, 9, 10.4, 8.9, 5.8, 6.7, 6.8, 7.5])
my_bar_chart.render_to_file('./example_bar_chart.svg')
```

This code instantiates i.e. creates a new `Bar` object called `my_bar_chart`.
This object has data fields that you can assign stuff to e.g. `title` and `x_labels`.
It also has functions for adding datasets and rendering the object as an svg.

Note how the `map` function of Python is used to elegantly turn a range of numbers into strings.
To learn more about this have a look at this tutorial.
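As a quick illustration of what `map` does here, the sketch below uses a shorter range of years and compares it with an explicit loop. The variable names are invented for the example.

```python
# map() applies str() to every number in the range.
years = map(str, range(2002, 2005))
print(years)              # a lazy map object, not a list yet

year_labels = list(years)
print(year_labels)        # ['2002', '2003', '2004']

# The same result written as an explicit loop:
loop_labels = []
for year in range(2002, 2005):
    loop_labels.append(str(year))
print(loop_labels == year_labels)   # True
```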

Take a look at the other chart types in the pygal documentation to figure out how to turn this into a scatterplot without any lines connecting the dots.

## Task 3 – A Line Function

Generate and draw a second dataset that follows a line with a slope of 10 and intersect of 0.

The dataset should use the same x values as the dataset from the csv but have y-values so that: `y = x * 10 + b` where `b = 0`.
The goal is to generate the y-values by passing a list of x-values to a function that you will write yourself.

### Suggested Subtasks

Manually plot a line:

- manually add a dataset to the plot with two points, one for `x=0` and one for `x=20` (calculate y by hand)
- tell the plot to draw a line connecting the points (you can add a `stroke=True` to an individual `.add()` call of your pygal chart object)

Create a line function:

- write a Python function that takes a single `x` value, calculates `y` and returns it (it can help to do this in a separate file before integrating it with your existing code)
- modify the function to take a `list` of `x` values and return a `list` of `y` values

Use the x values from the csv:

- extract the `x` values from the csv into a list by e.g.:
  - copy pasting and modifying your existing csv extraction code
  - iterating over the list of tuples you are putting into the pygal chart to create a new list
- generate a list of `y` values from the `x` values using the function you wrote
- combine the two x and y lists into a new list of tuples with the `zip()` function
- plot the dataset together with the dataset from the csv
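The step from a single-value function to a list function can be sketched on an unrelated toy problem. `double_one` and `double_all` are hypothetical names made up for this illustration; the line function itself is left for you to write.

```python
def double_one(value):
    """Return twice the given number."""
    return value * 2


def double_all(values):
    """Return a new list with every number doubled."""
    doubled = []
    for value in values:
        doubled.append(double_one(value))
    return doubled


print(double_one(4))           # 8
print(double_all([1, 2, 3]))   # [2, 4, 6]
```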

### Relevant Syntax

The `zip()` function takes several iterables and returns an iterator of tuples.

```
spam = ['spam', 'spamspam', 'spamspamspam', 'spamspamspamspam']
eggs = ['eggs', 'eggseggs', 'eggseggseggs', 'eggseggseggseggs']
ands = ['and', 'and', 'and', 'and']
spam_eggs = zip(spam, eggs)
spam_and_eggs = zip(spam, ands, eggs)
print(spam_eggs)
print(list(spam_eggs))
print(list(spam_and_eggs))
print("This loop won't print anything because spam_eggs was already consumed by list()")
for single_tuple in spam_eggs:
    print(single_tuple)
print("See? Nothing to see here?")
for single_tuple in zip(spam, ands, eggs):
    print(single_tuple)
for s, a, e in zip(spam, ands, eggs):
    print(e)
    # Using Python3's fancy new f-strings to insert variables into a string
    print(f"{s} {a} {e}")
```

Note how the first print only prints `<zip object at 0x7f22561bec00>` when you use Python 3.
This is because `zip()` returns an iterator which has to be consumed, for example by passing it into `list()` or using it as the iterator in a for loop.

## Task 4 – Mean Squared Error

Write a function to calculate the mean squared error.

The mean squared error works by:

- subtracting two values for the same x-value from each other
- squaring the differences so that all values are positive (this also punishes large differences more)
- summing all of this and dividing by the number of point pairs you are comparing

This should be implemented as a function that takes two lists of numbers and returns the MSE (a single number).
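To check your function later, it can help to have one small case worked out by hand. The four value pairs below are made up; the arithmetic is written out step by step rather than wrapped in a function.

```python
# Four made-up pairs: (2.0, 1.0), (3.0, 4.0), (5.0, 3.0), (4.0, 6.0)
squared_differences = [
    (2.0 - 1.0) ** 2,   # 1.0
    (3.0 - 4.0) ** 2,   # 1.0
    (5.0 - 3.0) ** 2,   # 4.0
    (4.0 - 6.0) ** 2,   # 4.0
]

total = 1.0 + 1.0 + 4.0 + 4.0    # sum of the squared differences
mean_squared_error = total / 4   # divided by the number of pairs

print(squared_differences)
print(mean_squared_error)        # 2.5
```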

### Suggested Subtasks

- write a `mse` function that takes two floats and returns their difference i.e. subtracts them
- extend `mse` to take two lists of floats and return a list of their differences
- have `mse` return the sum of all differences instead of a list of them
- square the diffs before summing them
- divide the sum by the number of point pairs

### Relevant Syntax

- `**` lets you raise a number to a power e.g. `5**2` will return 25
- `len()` counts how many items are in a list or characters are in a string

## Task 5 – Optimize the Slope

Find the `a` value that results in the lowest MSE with `b = 0`.

Generate y-values for 100 different slope values and determine which one has the lowest MSE when compared with the data from the csv.
Keep `b = 0` and try `a` values between 0 and 10 in 0.1 increments.
Plot the data that has the lowest MSE.

### Suggested Subtasks

Make the line function variable:

- extend the line generating function so that it takes a list and a float as input
- use the float passed to the function to set the slope i.e. the `a` variable

Calculate lots of MSE:

- for-loop that prints the numbers 0.1 through 0.5 in 0.1 steps
- print y-values using these numbers as slopes
- calculate and print the MSE for each set of y-values

Determine the best MSE:

- create a `best_mse` variable and set it to 1000
- in the for loop, update `best_mse` if the current MSE is better i.e. lower

Plot the best fit:

- generate 100 slope values and print the resulting MSE
- print only the best slope value
- draw the line of the best slope value in scatterplot
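Two details tend to trip people up here: `range()` only accepts whole numbers, and repeatedly adding 0.1 accumulates rounding errors. The sketch below shows one possible way around both, along with the "remember the best so far" pattern. `toy_error` is a hypothetical stand-in for an MSE calculation, not the real thing, and starting from `float('inf')` is just an alternative to guessing a fixed starting value such as 1000.

```python
# range() only takes whole numbers, so count in integers and divide.
# This yields 0.0, 0.1, ..., 10.0 (use range(100) to stop at 9.9).
slopes = []
for step in range(0, 101):
    slopes.append(step / 10)


def toy_error(a):
    """Hypothetical stand-in for an MSE calculation, smallest at a = 3.7."""
    return (a - 3.7) ** 2


best_error = float('inf')   # any real error is lower than infinity
best_slope = None
for a in slopes:
    error = toy_error(a)
    if error < best_error:
        best_error = error
        best_slope = a

print(best_slope)
print(best_error)
```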

## Task 6 – Optimize Intersection

Find the best combination of `a` and `b`.

In addition to `a`, now also try values for `b` between 0 and 10 in 0.1 increments.

### Suggested Subtasks

- nested for-loop that iterates through all slope-intersect combinations
- print the best slope-intersect combination
- plot the best slope-intersect combination

## Things to Read About, Learn and Try

- identify areas of your code that could be isolated into functions
- add a main
- `sum()` could be useful in your `mse` function
- `map` to generate y-values from x-values without using for-loops
- `csv.DictReader` to ingest the data
- list/dict comprehensions to replace for-loops
- test your functions with pytest
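- replace the nested for loops in the optimization by using `itertools.product()`, which yields every pairing that the nested loops would (`itertools.combinations()` only pairs up distinct items of a single list) (docs, tutorial)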
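As a starting point for the last item, here is a minimal sketch showing that `itertools.product()` from the standard library produces the same pairs as two nested for-loops. The short lists are made-up placeholders for the real slope and intersect candidates.

```python
import itertools

# Made-up, shortened candidate lists for illustration.
slopes = [0.0, 0.1, 0.2]
intersects = [0.0, 0.5]

# The nested-loop version
nested_pairs = []
for a in slopes:
    for b in intersects:
        nested_pairs.append((a, b))

# The same pairs, in the same order, from a single call
product_pairs = list(itertools.product(slopes, intersects))

print(product_pairs)
print(product_pairs == nested_pairs)   # True
```

A single `for a, b in itertools.product(slopes, intersects):` loop can then take the place of the two nested loops.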