HW #2

go to the Purdue RCAC Scholar page: https://www.rcac.purdue.edu/compute/scholar

click on the Jupyter Hub Launch button: https://scholar-fe00.rcac.purdue.edu:8000/

log in with your Purdue career account

start a new notebook by clicking "New" and selecting Python[default]

rename the notebook "EAPS 221 HW #2" or something similar

your new notebook should look something like this

#in this webpage, python code looks like this, copy and paste these lines into the python notebook

comments and instructions look like this

copy and paste the following block of code into the open cell in your notebook

press Shift+Enter to execute the code in the cell

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

this will set up the programming environment to use several Python packages

"numpy" is a fundamental package in Python for scientific computing http://www.numpy.org/

"pandas" is a fundamental package in Python for data analysis https://pandas.pydata.org/

"matplotlib" is a fundamental package in Python for plotting data https://matplotlib.org/

these packages need to be imported prior to using them in a Python program, for example "import numpy as np" will import the numpy package using the short alias "np", so it can be used later in the program by simply referring to "np" rather than "numpy"

"%matplotlib inline" is a Jupyter notebook command (not part of Python itself) that displays plots/graphics in the notebook directly below the cell that created them

df=pd.read_csv('/scratch/scholar/mebaldwi/EAPS221/WLAF.csv',delimiter=',',na_values=['M'])

this step will read in the data using pandas

the file is on the "scholar" computing system and has "comma separated values" (csv) so the delimiter is specified as a comma

there are missing values in the data, those are indicated by "M" and will be set to "NaN" (not a number) using na_values in this command

"df" is short for dataframe and is an object that pandas uses to represent the data in the program

the dataframe is similar to an Excel spreadsheet
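to see what na_values does, here is a minimal sketch that reads a tiny made-up table instead of the real file. the three rows and the column order are assumptions for illustration only, they are not taken from WLAF.csv.

```python
import io
import pandas as pd

# hypothetical three-row sample (values and column order are made up)
sample_csv = io.StringIO(
    "Year,Month,Day,MaxTempF,MinTempF\n"
    "1901,1,1,30,12\n"
    "1901,1,2,M,M\n"
    "1901,1,3,28,10\n"
)

# same options as the real command: comma delimiter, "M" marks a missing value
sample = pd.read_csv(sample_csv, delimiter=',', na_values=['M'])

print(sample)               # the "M" entries show up as NaN
print(sample.isna().sum())  # number of missing values in each column
```

the second row should show NaN in both temperature columns, and the last line counts how many values are missing per column.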

df

you can easily examine the contents of the dataframe by typing "df" and executing the cell

note that there are several columns of values, with headers

these data are daily maximum and minimum temperatures for West Lafayette, IN from 1901 through 2018, in degrees Fahrenheit (F)

"Year", "Month", and "Day" together give the date of each observation

"MinTempF" is the minimum (low) temperature for that date (deg F)

"MaxTempF" is the maximum (high) temperature for that date (deg F)

if you see "NaN" it means the value was missing for that date

df.MaxTempF.plot()

next let's make a simple plot of all of the maximum temperatures for the entire dataset

here we refer to the column of data that we want to analyze using "df.MaxTempF"

adding ".plot()" will create the plot, which should look similar to this:

the y-axis is temperature (deg F) and the x-axis is the row index of the data. the index simply increases by 1 for each day, so it is not very informative as a time axis. note the large gap in the plot where the values are missing.
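one possible way to get dates on the x-axis (this is not part of the assignment) is to build a date for each row from the "Year", "Month", and "Day" columns and use it as the index. the sketch below uses a small made-up table, and the names "toy", "dates", and "by_date" are hypothetical.

```python
import pandas as pd

# made-up rows for illustration only (not real observations)
toy = pd.DataFrame({
    'Year': [1901, 1901, 1901],
    'Month': [1, 1, 1],
    'Day': [1, 2, 3],
    'MaxTempF': [30.0, None, 28.0],
})

# build one date per row from the three date columns
dates = pd.to_datetime(dict(year=toy.Year, month=toy.Month, day=toy.Day))

# use the dates as the index instead of the default 0, 1, 2, ...
by_date = toy.set_index(dates)
print(by_date.MaxTempF)
```

in the notebook, calling ".plot()" on a column of a date-indexed dataframe like this should label the x-axis with dates rather than row numbers.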

df.sort_values(by='MaxTempF',ascending=True)
df.sort_values(by='MaxTempF',ascending=False)

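next let's sort the values so we can find the highest and lowest maximum temperatures in the record. run each of the two lines above in its own cell, because a notebook cell only displays the result of its last line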

these would be considered the "all time" record warmest and coldest daily maximum temperatures

"df.sort_values" does the sorting

by='MaxTempF' selects the column on which the sorting is performed

"ascending=True" sorts the values from low-to-high, so the lowest value is printed first

"ascending=False" sorts the values from high-to-low, so the highest value is printed first

use this to find the hottest and coldest daily maximum temperatures and dates that correspond with those temperatures

this will help you answer question #7 in the assignment
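here is a small sketch of the two sorts on made-up numbers, assuming the default behavior of "sort_values" in which missing values (NaN) are placed at the end in both directions. the table and the names "coldest_first" and "hottest_first" are hypothetical.

```python
import pandas as pd

# made-up values for illustration only
toy = pd.DataFrame({
    'Day': [1, 2, 3, 4],
    'MaxTempF': [30.0, 105.0, -5.0, None],
})

coldest_first = toy.sort_values(by='MaxTempF', ascending=True)
hottest_first = toy.sort_values(by='MaxTempF', ascending=False)

# .head(1) keeps only the first row of each sorted table
print(coldest_first.head(1))
print(hottest_first.head(1))
```

the first row of each sorted table holds the extreme value along with the other columns for that row, which is how the date of a record can be read off.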

jan=df[df.Month==1]

next let's focus on the month of January

this command will create a new dataframe called "jan" that contains all of the data in the original dataframe where the "Month" is equal to 1 (January)
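the selection works in two steps, sketched below on a made-up table: the comparison produces a True/False value for every row, and the square brackets keep only the rows marked True. the names "toy", "is_january", and "jan_toy" are hypothetical.

```python
import pandas as pd

# made-up rows for illustration only
toy = pd.DataFrame({
    'Month': [1, 1, 2, 12],
    'MaxTempF': [30.0, 25.0, 40.0, 35.0],
})

# step 1: True where the month is January, False otherwise
is_january = toy.Month == 1
print(is_january)

# step 2: keep only the rows where the mask is True
jan_toy = toy[is_january]
print(jan_toy)
```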

jan.describe()

the ".describe()" method will print several basic statistics for each numeric column in the dataframe

"count" is the number of values (not counting missing values) in the column

"mean" is the mean of the values in the column

"std" is the standard deviation of the values in the column

"min" is the minimum value in the column

"25%" is the 25th percentile value in the column (also called the "lower quartile" or Q1)

"50%" is the 50th percentile value (also called "median")

"75%" is the 75th percentile value (also called the "upper quartile" or Q3)

"max" is the maximum value

recall that percentile values are determined by sorting the values from smallest to largest. Q1 (the 25th percentile) is the value that 25% of the numbers in the column fall below (and 75% fall above). the median is the middle number of the sorted values, so it splits them in half. Q3 (the 75th percentile) is the value that 75% of the sorted numbers fall below (and 25% fall above).

these statistics are useful in providing a brief overview of the overall distribution (group) of values. they tell us what the "average" or "center" of the distribution is as well as the "variability" or "spread" of the values.

the inter-quartile range (IQR) is often used to describe the degree of variability in the group of numbers. IQR is the difference between the 75th and 25th percentiles (IQR=Q3-Q1)

use this to find information regarding temperatures during the month of January in West Lafayette

this will help you answer question #8 in the assignment
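the quartiles and the IQR can also be computed directly. this is a minimal sketch on six made-up values (one of them missing), not the January data. by default pandas interpolates linearly when a percentile falls between two sorted values.

```python
import pandas as pd

# made-up values for illustration only, including one missing value
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, None])

stats = values.describe()
print(stats)  # "count" skips the missing value

q1 = values.quantile(0.25)  # lower quartile
q3 = values.quantile(0.75)  # upper quartile
iqr = q3 - q1               # inter-quartile range
print(q1, q3, iqr)
```

for these five non-missing values Q1 is 20, the median is 30, Q3 is 40, and the IQR is 20.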

jan.boxplot(column='MinTempF')

a common method of displaying a summary of a group of numbers is called a "box and whisker" diagram

pandas uses "boxplot" to create these

this command selects the MinTempF column using "(column='MinTempF')"

your plot should look something like this...

the "box" indicates the lower and upper quartiles, and the line near the middle of the box indicates the median value. you can see the inter-quartile range (IQR) easily by looking at the height of the box. the "whiskers" extend from the box to the most extreme values that lie within 1.5 IQR of the box edges (no lower than Q1-1.5*IQR and no higher than Q3+1.5*IQR). the whiskers are meant to represent the typical range of values. if the data follow a "normal" (Gaussian) distribution, about 99.3% of the values fall within these limits, which sit near the 0.35th and 99.65th percentiles. values beyond the whiskers are called "outliers" and are plotted using circle symbols, these represent very rare small or large values.
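the whisker rule can be worked through by hand. the sketch below uses six made-up values and assumes the default 1.5 IQR rule. it also checks how much of a normal distribution lies beyond the upper limit, using "NormalDist" from Python's standard library. all variable names are hypothetical.

```python
from statistics import NormalDist
import pandas as pd

# made-up values for illustration only; 120 is deliberately extreme
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 120.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# limits beyond which a value is drawn as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# whiskers end at the most extreme values still inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low = inside.min()
whisker_high = inside.max()
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(whisker_low, whisker_high, list(outliers))

# same rule applied to a standard normal distribution
z_q3 = NormalDist().inv_cdf(0.75)     # Q3 in standard deviations
z_limit = z_q3 + 1.5 * (2 * z_q3)     # IQR of a normal is 2 * z_q3
tail = 1 - NormalDist().cdf(z_limit)  # fraction above the upper limit
print(round(tail, 4))
```

here the upper whisker stops at 50 even though the upper limit is 85, because 50 is the largest value inside the limit, and 120 is the only outlier.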

jan[jan.Day==14].boxplot(column=['MaxTempF','MinTempF'])

create a box and whisker diagram for the maximum and minimum daily temperatures for a specific day of the month. this allows us to see the typical distribution of temperatures on a specific day in January (14th in this example) as well as what the "record" coldest and warmest temperatures were for a particular day. The maximum temperature on January 14, 2019 was 26F and the minimum temperature was 5F. how did the high/low temperatures on January 14, 2019 compare to the distribution of temperatures in the historical record for all other January 14ths?

copy and paste this image (right click on the image, select "copy image", then paste into the worksheet document) to answer question #9 in the assignment
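one way to put a number on such a comparison is the fraction of past values that fall below the new value. this is only a sketch: the five "historical" values are made up, and only the 26F maximum comes from the text above. the names "hist", "new_max", and "frac_below" are hypothetical.

```python
import pandas as pd

# made-up historical maximum temperatures for one calendar day
hist = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0])

new_max = 26.0  # maximum temperature on January 14, 2019 (from the text)

# True/False for each past year, then the mean gives the fraction that are True
frac_below = (hist < new_max).mean()
print(frac_below)
```

a fraction near 0.5 would place the new value close to the median, while a fraction near 0 or 1 would place it in a tail of the distribution.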

df.boxplot(column='MaxTempF',by='Month')

this command will create a boxplot of maximum temperatures, where each month in the full dataset will have its own box and whisker plot. this will allow you to see how the temperatures vary from month to month in West Lafayette

copy and paste this image (right click on the image, select "copy image", then paste into the worksheet document) to answer question #10 in the assignment
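the numbers behind a month-by-month boxplot can also be listed with "groupby", which splits the rows by month before computing a statistic. the sketch below uses a made-up table with only two months, and the names "toy" and "monthly_median" are hypothetical.

```python
import pandas as pd

# made-up rows for illustration only
toy = pd.DataFrame({
    'Month': [1, 1, 1, 7, 7, 7],
    'MaxTempF': [30.0, 25.0, 35.0, 85.0, 90.0, 80.0],
})

# one median per month, matching the line inside each box
monthly_median = toy.groupby('Month')['MaxTempF'].median()
print(monthly_median)
```

replacing ".median()" with ".describe()" would list the quartiles for each month as well.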