Statistical Test
import pandas
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
turnstile_weather= pandas.read_csv("/Users/skootergirl01/Downloads/improved-dataset/turnstile_weather_v2.csv")
"""Quiz 1: Exploratory data analysis, plus a little extra exploration of my own.
I find that this data is not normally distributed. I ran a Shapiro-Wilk test and an
Anderson-Darling test, but the results don't make sense to me yet. I should look more
closely at what these tests actually do in order to understand my results."""
def entries_histogram(turnstile_weather):
    '''
    Before we perform any analysis, it might be useful to take a
    look at the data we're hoping to analyze. More specifically, let's
    examine the hourly entries in our NYC subway data and determine what
    distribution the data follows. This data is stored in a dataframe
    called turnstile_weather under the ['ENTRIESn_hourly'] column.
    Let's plot two histograms on the same axes to show hourly
    entries when raining vs. when not raining. Here's an example of how
    to plot histograms with pandas and matplotlib:
        turnstile_weather['column_to_graph'].hist()
    Your histogram may look similar to the bar graph in the instructor notes below.
    You can read a bit about using matplotlib and pandas to plot histograms here:
    http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
    You can see the information contained within the turnstile weather data here:
    https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
    '''
    plt.figure()
    # Split the hourly entries into rainy and non-rainy observations.
    rainy = turnstile_weather[turnstile_weather['rain'] == 1]['ENTRIESn_hourly']
    no_rainy = turnstile_weather[turnstile_weather['rain'] == 0]['ENTRIESn_hourly']
    # Draw both groups on the same axes: rainy in blue, non-rainy in red.
    plt.hist([rainy, no_rainy], bins=25, alpha=0.85, range=(0, 20000), color=('blue', 'red'))
    plt.show()
    return plt
entries_histogram(turnstile_weather)
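"""A minimal sketch of one thing to watch for when comparing the two histograms: if the rainy
and non-rainy groups have different numbers of rows, raw counts mostly show the size difference.
The samples below are synthetic stand-ins (rainy_demo and dry_demo are hypothetical names, not
the turnstile data), and the group sizes are an assumption made for illustration."""

```python
import numpy as np

# Synthetic stand-ins for the two groups (NOT the real turnstile data):
# right-skewed values, with far fewer "rainy" rows than "dry" rows.
rng = np.random.RandomState(0)
rainy_demo = rng.exponential(scale=1900.0, size=2000)
dry_demo = rng.exponential(scale=1800.0, size=8000)

# 25 equal-width bins over the same range used in the plot above.
bins = np.linspace(0, 20000, 26)
bin_width = bins[1] - bins[0]

# Raw counts mostly reflect how many rows each group has.
rainy_counts, _ = np.histogram(rainy_demo, bins=bins)
dry_counts, _ = np.histogram(dry_demo, bins=bins)

# density=True rescales each histogram so that it integrates to 1,
# which makes the two shapes directly comparable.
rainy_density, _ = np.histogram(rainy_demo, bins=bins, density=True)
dry_density, _ = np.histogram(dry_demo, bins=bins, density=True)

print(rainy_counts[:3], dry_counts[:3])
print(rainy_density[:3] * bin_width, dry_density[:3] * bin_width)
```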
#rainy=turnstile_weather[turnstile_weather['rain']==1]['ENTRIESn_hourly']
#print stats.shapiro(rainy)
#print stats.anderson(rainy)
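"""A small sketch of how the two normality tests can be read, run on a synthetic right-skewed
sample rather than on ENTRIESn_hourly (the lognormal sample is an assumption for illustration).
SciPy's documentation notes that for N > 5000 the Shapiro-Wilk p-value may not be accurate, so a
modest sample is used here."""

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample standing in for hourly entries (an assumption).
rng = np.random.RandomState(1)
skewed = rng.lognormal(mean=7.0, sigma=1.0, size=500)

# Shapiro-Wilk: W close to 1 is consistent with normality;
# a small p-value rejects the hypothesis that the data is normal.
w_stat, w_p = stats.shapiro(skewed)

# Anderson-Darling: there is no single p-value here. The statistic is compared
# with a critical value for each significance level; a statistic above the
# critical value rejects normality at that level.
ad = stats.anderson(skewed, dist='norm')

print("Shapiro-Wilk: W=%.3f, p=%.3g" % (w_stat, w_p))
print("Anderson-Darling: A2=%.3f" % ad.statistic)
for crit, level in zip(ad.critical_values, ad.significance_level):
    print("  reject at %s%%: %s" % (level, ad.statistic > crit))
```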
def mann_whitney_plus_means(turnstile_weather):
    '''
    This function will consume the turnstile_weather dataframe containing
    our final turnstile weather data.
    You will want to take the means and run the Mann-Whitney U-test on the
    ENTRIESn_hourly column in the turnstile_weather dataframe.
    This function should return:
    1) the mean of entries with rain
    2) the mean of entries without rain
    3) the Mann-Whitney U-statistic and p-value comparing the number of entries
       with rain and the number of entries without rain
    You should feel free to use scipy's Mann-Whitney implementation, and you
    might also find it useful to use numpy's mean function.
    Here are the functions' documentation:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
    You can look at the final turnstile weather data at the link below:
    https://s3.amazonaws.com/content.udacity-data.com/courses/ud359/turnstile_data_master_with_weather.csv
    '''
    # Select the column as a Series so that np.mean returns a plain number.
    x = turnstile_weather['ENTRIESn_hourly'][turnstile_weather.rain == 0]
    y = turnstile_weather['ENTRIESn_hourly'][turnstile_weather.rain == 1]
    without_rain_mean = np.mean(x)
    with_rain_mean = np.mean(y)
    # Only `stats` is imported from scipy, so call stats.mannwhitneyu directly.
    U, p = stats.mannwhitneyu(x, y)
    results = (with_rain_mean, without_rain_mean, U, p)
    return results
print(mann_whitney_plus_means(turnstile_weather))
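"""A hedged sketch of the difference between a one-sided and a two-sided Mann-Whitney p-value.
The default behaviour of stats.mannwhitneyu has differed between SciPy releases, so this example
passes the alternative keyword explicitly, which assumes a SciPy version that supports it. The
samples a and b are synthetic and hypothetical, not the rain and no-rain groups."""

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(2)
# Synthetic skewed samples (NOT the turnstile data); b is shifted upward.
a = rng.exponential(scale=1800.0, size=1000)
b = rng.exponential(scale=1800.0, size=1000) + 400.0

# Two-sided: do the two distributions differ in either direction?
u_two, p_two = stats.mannwhitneyu(a, b, alternative='two-sided')

# One-sided: does a tend to be smaller than b?
u_less, p_less = stats.mannwhitneyu(a, b, alternative='less')

print("two-sided p = %.3g" % p_two)
print("one-sided p = %.3g" % p_less)
```

When the observed shift is in the direction being tested, the two-sided p-value is about twice the one-sided one, so it matters which of the two is being compared against a cutoff such as 0.05.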
def day_of_week_visualization(turnstile_weather):
    '''
    I want to take a look at ridership on specific days of the week. I want to see
    if there is a clear distinction between weekends and weekdays. I am guessing that
    there will not be much difference between a Saturday and a Friday, but a lot of
    difference between Tuesdays and Fridays, etc.'''
    plt.figure()
    # day_week runs from 0 (Monday) to 6 (Sunday).
    sun = turnstile_weather[turnstile_weather['day_week'] == 6]['ENTRIESn_hourly']
    mon = turnstile_weather[turnstile_weather['day_week'] == 0]['ENTRIESn_hourly']
    tues = turnstile_weather[turnstile_weather['day_week'] == 1]['ENTRIESn_hourly']
    wed = turnstile_weather[turnstile_weather['day_week'] == 2]['ENTRIESn_hourly']
    thur = turnstile_weather[turnstile_weather['day_week'] == 3]['ENTRIESn_hourly']
    fri = turnstile_weather[turnstile_weather['day_week'] == 4]['ENTRIESn_hourly']
    sat = turnstile_weather[turnstile_weather['day_week'] == 5]['ENTRIESn_hourly']
    # wed and thur are selected but left out of the plot.
    plt.hist([sun, mon, tues, fri, sat], bins=10, alpha=0.85, range=(0, 10000),
             color=('yellow', 'red', 'orange', 'black', 'purple'))
    plt.show()
    return plt
day_of_week_visualization(turnstile_weather)
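"""Since the histogram is hard to read, a per-day summary table may be easier to compare. This
is a minimal sketch on a tiny hand-made frame (demo is hypothetical, and its numbers are made up);
it only assumes the same column names and the 0 = Monday to 6 = Sunday coding used above."""

```python
import pandas as pd

# Tiny hand-made frame, NOT the real data.
demo = pd.DataFrame({
    'day_week':        [0, 0, 4, 4, 5, 5, 6, 6],
    'ENTRIESn_hourly': [1800, 1900, 2300, 2200, 1400, 1300, 1000, 1100],
})

# Mean and median hourly entries for each day of the week.
summary = demo.groupby('day_week')['ENTRIESn_hourly'].agg(['mean', 'median'])

# Weekday vs. weekend: days 5 (Saturday) and 6 (Sunday) are the weekend.
demo['weekend'] = demo['day_week'] >= 5
by_weekend = demo.groupby('weekend')['ENTRIESn_hourly'].mean()

print(summary)
print(by_weekend)
```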
"""This plot is not good enough to say anything definitive, but let's check whether the data is normal.
It doesn't look normal, but what can we see from the Shapiro-Wilk test or the Anderson-Darling
test?"""
sun= turnstile_weather[turnstile_weather['day_week']==6]['ENTRIESn_hourly']
mon = turnstile_weather[turnstile_weather['day_week']==0]['ENTRIESn_hourly']
tues = turnstile_weather[turnstile_weather['day_week']==1]['ENTRIESn_hourly']
wed = turnstile_weather[turnstile_weather['day_week']==2]['ENTRIESn_hourly']
thur = turnstile_weather[turnstile_weather['day_week']==3]['ENTRIESn_hourly']
fri = turnstile_weather[turnstile_weather['day_week']==4]['ENTRIESn_hourly']
sat = turnstile_weather[turnstile_weather['day_week']==5]['ENTRIESn_hourly']
#print stats.shapiro(sun)
#This gave me a p-value of zero... lol
#print stats.anderson(sun) This gives us a test statistic of infinity...
#Printing a table where we can compare multiple days
def p_val(x,y):
    # Return only the p-value of the Mann-Whitney U test for the two samples.
    U, p = stats.mannwhitneyu(x, y)
    return p
week=[sun,mon,tues,wed,thur,fri,sat]
sweek=['sun','mon','tues','wed','thurs','fri','sat']
j=0
for day in week:
    # Print the day's name and mean, then its p-value against every day.
    print(sweek[j])
    print(np.mean(day))
    for i in range(len(sweek)):
        print(sweek[j]+" v.s. "+sweek[i]+":"+ str(p_val(day,week[i])))
    j+=1
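"""A sketch of how the same pairwise comparison could be collected into a dataframe instead of
printed. The groups dictionary holds synthetic samples under hypothetical labels (not the real
days), and alternative='two-sided' assumes a SciPy version that accepts that keyword."""

```python
import itertools
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.RandomState(3)
# Synthetic groups keyed by hypothetical labels; NOT the turnstile data.
names = ['sun', 'mon', 'tues']
groups = {
    'sun': rng.exponential(1000.0, size=400),
    'mon': rng.exponential(1800.0, size=400),
    'tues': rng.exponential(2200.0, size=400),
}

# Square table of p-values. The diagonal stays NaN because comparing a
# group with itself carries no information.
p_table = pd.DataFrame(np.nan, index=names, columns=names)
for first, second in itertools.combinations(names, 2):
    _, p = stats.mannwhitneyu(groups[first], groups[second],
                              alternative='two-sided')
    p_table.loc[first, second] = p
    p_table.loc[second, first] = p  # the two-sided p-value is symmetric

print(p_table)
```

Iterating over itertools.combinations runs each pair once, which halves the number of tests compared with the nested loop and skips the self-comparisons.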
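"""I find the results of the Mann-Whitney test for the days of the week to be interesting. It appears that Thursday has the highest average
number of entries of any day of the week, but its difference from Wednesday (p = 0.35) and Friday (p = 0.23) is not significant, while Sunday
has the lowest average. Each day compared with itself gives a p-value of about 0.5, which suggests these are one-sided p-values.
I have pasted the output below: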
sun
1066.43610578
sun v.s. sun:0.499999122641
sun v.s. mon:1.99201837983e-54
sun v.s. tues:1.66247747528e-131
sun v.s. wed:1.74860236449e-126
sun v.s. thurs:3.25455951892e-131
sun v.s. fri:2.02572602214e-143
sun v.s. sat:5.05820863686e-16
mon
1825.26490728
mon v.s. sun:1.99201837983e-54
mon v.s. mon:0.499999167537
mon v.s. tues:5.51614447275e-18
mon v.s. wed:2.14084485486e-21
mon v.s. thurs:3.29853056073e-23
mon v.s. fri:1.89164933113e-27
mon v.s. sat:1.06841922381e-11
tues
2164.83643334
tues v.s. sun:1.66247747528e-131
tues v.s. mon:5.51614447275e-18
tues v.s. tues:0.499999173384
tues v.s. wed:0.0533591712554
tues v.s. thurs:0.0199596481657
tues v.s. fri:0.0019104880063
tues v.s. sat:1.74949816452e-50
wed
2297.09795695
wed v.s. sun:1.74860236449e-126
wed v.s. mon:2.14084485486e-21
wed v.s. tues:0.0533591712554
wed v.s. wed:0.499998796274
wed v.s. thurs:0.348077564471
wed v.s. fri:0.123254094302
wed v.s. sat:5.92590298717e-53
thurs
2317.07237922
thurs v.s. sun:3.25455951892e-131
thurs v.s. mon:3.29853056073e-23
thurs v.s. tues:0.0199596481657
thurs v.s. wed:0.348077564471
thurs v.s. thurs:0.499998797261
thurs v.s. fri:0.225143481255
thurs v.s. sat:8.21241907566e-56
fri
2277.37229358
fri v.s. sun:2.02572602214e-143
fri v.s. mon:1.89164933113e-27
fri v.s. tues:0.0019104880063
fri v.s. wed:0.123254094302
fri v.s. thurs:0.225143481255
fri v.s. fri:0.499998785657
fri v.s. sat:1.65513865744e-62
sat
1383.90147874
sat v.s. sun:5.05820863686e-16
sat v.s. mon:1.06841922381e-11
sat v.s. tues:1.74949816452e-50
sat v.s. wed:5.92590298717e-53
sat v.s. thurs:8.21241907566e-56
sat v.s. fri:1.65513865744e-62
sat v.s. sat:0.499998772129"""
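"""With 21 distinct pairs of days, some small p-values can appear by chance. One common way to
account for that is a Bonferroni correction, which is my assumption here and not part of the quiz.
The sketch below takes the weekday p-values from the pasted output at face value and uses
alpha = 0.05 as an assumed cutoff."""

```python
# p-values copied from the pasted output above (weekday pairs only).
pasted = {
    ('tues', 'wed'): 0.0533591712554,
    ('tues', 'thurs'): 0.0199596481657,
    ('tues', 'fri'): 0.0019104880063,
    ('wed', 'thurs'): 0.348077564471,
    ('wed', 'fri'): 0.123254094302,
    ('thurs', 'fri'): 0.225143481255,
}

alpha = 0.05                  # assumed significance level
n_pairs = 7 * 6 // 2          # 21 distinct pairs among seven days
threshold = alpha / n_pairs   # Bonferroni-adjusted cutoff

for pair in sorted(pasted):
    p = pasted[pair]
    verdict = "significant" if p < threshold else "not significant"
    print("%s v.s. %s: p=%.4g -> %s" % (pair[0], pair[1], p, verdict))
```

Under these assumptions only Tuesday v.s. Friday stays below the adjusted cutoff; Tuesday v.s. Thursday is below 0.05 on its own but not after the correction.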