Edu Data Analysis

This analysis is based on the education data made available on data.gov and downloaded from Kaggle. I did this basic analysis using Apache Spark and IPython. While the whole thing could have been done in IPython alone, I included Spark to further my own learning of the technology.

Falling Admission Rate
While admission rates have been falling in general, five universities reduced their rate EVERY single year for the last decade. This analysis covers 2003 through 2013; I have not checked how the admission rates held up in 2014 and 2015.

The five are
(1) ‘Cornell University’,
(2) ‘Vanderbilt University’,
(3) ‘LeTourneau University’,
(4) ‘University of Louisiana at Lafayette’,
(5) ‘Massachusetts Institute of Technology’

[Figure: adm_rates — admission rate by year for the five universities]

Three of the five are not a big surprise, but I was surprised by the other two, University of Louisiana at Lafayette and LeTourneau University, where admission rates consistently fell year after year. Clearly they are doing something that is making them popular among students!
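The "every single year" condition is just a pairwise comparison across the eleven yearly columns. A minimal Python sketch of the same check, using the Vanderbilt series from the chart (note the comparison is `<=`, matching the Spark filter below, so a flat year still counts):

```python
# Admission rates, most recent year first (2013 -> 2003), transcribed from
# the plotted data for Vanderbilt University.
adm_rate_v = [0.1274, 0.1423, 0.1642, 0.1795, 0.2015, 0.2533,
              0.3282, 0.3528, 0.3528, 0.3831, 0.4626]

def fell_every_year(rates):
    # Listed newest-first, so each entry must be <= the one after it.
    return all(a <= b for a, b in zip(rates, rates[1:]))

print(fell_every_year(adm_rate_v))  # True
```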

Largest Percent Change in Admission Rate
From 2003 through 2013, the admission rate dropped the most at these universities:

University                               Percent Change   Adm Rate in 2003   Adm Rate in 2013
University of Chicago                    0.777            0.3962             0.0881
Mississippi Valley State University      0.770            0.9888             0.2272
Adventist University of Health Sciences  0.743            0.5213             0.1336
Vanderbilt University                    0.725            0.4626             0.1274
Robert Morris University Illinois        0.723            0.7591             0.2100
Pitzer College                           0.710            0.5010             0.1451
Missouri Valley College                  0.668            0.6677             0.2218
Claremont McKenna College                0.622            0.3102             0.1173
Colorado College                         0.618            0.5833             0.2228
Corban University                        0.614            0.8364             0.3225

There are some universities here that I had not heard of, so it was interesting to see that these schools reduced their admission rates by such a large percentage.
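The percent change in the table is simply (rate in 2003 − rate in 2013) / rate in 2003, expressed as a fraction. A small Python sketch of that calculation, using two rows transcribed from the table (not read from the data set):

```python
# Admission rates for 2003 and 2013, transcribed from the table above.
rates = {
    "University of Chicago": (0.3962, 0.0881),
    "Vanderbilt University": (0.4626, 0.1274),
}

# Percent change = (old - new) / old, as a fraction of the 2003 rate.
for name, (r2003, r2013) in rates.items():
    pct = (r2003 - r2013) / r2003
    print(f"{name}: {pct:.3f}")
```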

Mean ACT Change Year over Year
This shows how the mean ACT score (ACTCMMID) changed from 2003 through 2013. It is interesting to see that the score increased rapidly from 2006 onwards and seems to be stabilizing around 2011.

[Figure: act_mid_mean_yoy — mean ACT score by year]
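The aggregation behind this chart is a per-year mean of the ACTCMMID column. A sketch of that step using pandas, on illustrative rows rather than the real data set:

```python
import pandas as pd

# Illustrative rows; the real data has one ACTCMMID value per school per year.
df = pd.DataFrame({
    "year":     [2003, 2003, 2004, 2004],
    "ACTCMMID": [21.0, 23.0, 21.5, 23.5],
})

# Mean ACT midpoint per year -- the series plotted in the chart.
act_mean_yoy = df.groupby("year")["ACTCMMID"].mean()
print(act_mean_yoy)
```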

    SparkConf conf = new SparkConf().setAppName("test").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);
    DataFrame df2013 = sqlContext.read()
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/home/.../MERGED2013_PP.csv");


  // Load each year's file (2003 through 2013) the same way as df2013 above,
  // then select the interesting columns. For the years before 2013 the
  // columns are also renamed with the year as a suffix (e.g. ADM_RATE ->
  // ADM_RATE_2012) so they stay distinct after the join; the 2013 frame
  // keeps the original names, which is why the filter below mixes ADM_RATE
  // with the suffixed columns.
    DataFrame output2013 = df2013.filter("ADM_RATE <> 'NULL'")
        .sort("ADM_RATE")
        .select("OPEID", "HIGHDEG", "ADM_RATE", "SAT_AVG", "ACTCMMID", "TUITIONFEE_IN", "INEXPFTE");

    DataFrame output2012 = df2012.filter("ADM_RATE <> 'NULL'")
        .select("OPEID", "HIGHDEG", "ADM_RATE", "SAT_AVG", "ACTCMMID", "TUITIONFEE_IN", "INEXPFTE")
        .withColumnRenamed("ADM_RATE", "ADM_RATE_2012")
        .withColumnRenamed("HIGHDEG", "HIGHDEG_2012")
        .withColumnRenamed("SAT_AVG", "SAT_AVG_2012")
        .withColumnRenamed("ACTCMMID", "ACTCMMID_2012")
        .withColumnRenamed("TUITIONFEE_IN", "TUITIONFEE_IN_2012")
        .withColumnRenamed("INEXPFTE", "INEXPFTE_2012");
 
  // Join all the per-year data frames on the OPEID key
    DataFrame joinedDf = output2013.join(output2012, "OPEID")
        .join(output2011, "OPEID")
        .join(output2010, "OPEID")
        .join(output2009, "OPEID")
        .join(output2008, "OPEID")
        .join(output2007, "OPEID")
        .join(output2006, "OPEID")
        .join(output2005, "OPEID")
        .join(output2004, "OPEID")
        .join(output2003, "OPEID");


  // Keep schools whose admission rate fell (or held steady) every single
  // year, and was below 1.0 in every year
    DataFrame dfOutput = joinedDf.filter("ADM_RATE <= ADM_RATE_2012 and "
        + "ADM_RATE_2012 <= ADM_RATE_2011 and "
        + "ADM_RATE_2011 <= ADM_RATE_2010 and "
        + "ADM_RATE_2010 <= ADM_RATE_2009 and "
        + "ADM_RATE_2009 <= ADM_RATE_2008 and "
        + "ADM_RATE_2008 <= ADM_RATE_2007 and "
        + "ADM_RATE_2007 <= ADM_RATE_2006 and "
        + "ADM_RATE_2006 <= ADM_RATE_2005 and "
        + "ADM_RATE_2005 <= ADM_RATE_2004 and "
        + "ADM_RATE_2004 <= ADM_RATE_2003 and "
        + "ADM_RATE < 1 and "
        + "ADM_RATE_2012 < 1 and "
        + "ADM_RATE_2011 < 1 and "
        + "ADM_RATE_2010 < 1 and "
        + "ADM_RATE_2009 < 1 and "
        + "ADM_RATE_2008 < 1 and "
        + "ADM_RATE_2007 < 1 and "
        + "ADM_RATE_2006 < 1 and "
        + "ADM_RATE_2005 < 1 and "
        + "ADM_RATE_2004 < 1 and "
        + "ADM_RATE_2003 < 1 ");

    // Write out to file
    dfOutput.limit(1000).coalesce(1).write()
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/home/.../adm_rate_yoy.csv");
import graphlab
import matplotlib.pyplot as plt

# Load the CSV written out by the Spark job above
adm_rate_yoy = graphlab.SFrame('adm_rate_yoy.csv/part-00000.csv')

# Admission rates per school, most recent year first
year = [2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003]
adm_rate_v = [0.1274, 0.1423, 0.1642, 0.1795, 0.2015, 0.2533, 0.3282, 0.3528, 0.3528, 0.3831, 0.4626]
adm_rate_c = [0.1556, 0.1655, 0.1797, 0.1836, 0.191, 0.2066, 0.214, 0.2468, 0.2708, 0.2944, 0.3099]
adm_rate_l = [0.4066, 0.4335, 0.5594, 0.5874, 0.5904, 0.6619, 0.7015, 0.7345, 0.7641, 0.7937, 0.8012]
adm_rate_ul = [0.593, 0.6036, 0.6593, 0.6688, 0.6718, 0.6784, 0.7023, 0.731, 0.758, 0.8496, 0.8703]
adm_rate_m = [0.0815, 0.0895, 0.0973, 0.1008, 0.107, 0.1186, 0.1248, 0.1331, 0.1431, 0.1591, 0.1645]

plt.xlabel('Year')
plt.ylabel('Admission Rate')
plt.plot(year, adm_rate_v, label='Vanderbilt University')
plt.plot(year, adm_rate_c, label='Cornell University')
plt.plot(year, adm_rate_l, label='LeTourneau University')
plt.plot(year, adm_rate_ul, label='University of Louisiana at Lafayette')
plt.plot(year, adm_rate_m, label='Massachusetts Institute of Technology')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()