This analysis is based on the education data made available on data.gov and downloaded from Kaggle. I did this basic analysis using Apache Spark and IPython. While the whole thing could have been done in IPython alone, I included Spark to further my own learning of the technology.
Falling Admission Rate
While admission rates have been falling in general, 5 universities have consistently reduced their rate EVERY single year for the last decade. This analysis covers 2003 through 2013; I have not checked how the admission rates held up in 2014 and 2015.
The five are:
(1) Cornell University,
(2) Vanderbilt University,
(3) LeTourneau University,
(4) University of Louisiana at Lafayette,
(5) Massachusetts Institute of Technology
3 out of the 5 are actually not a big surprise, but I was surprised by two of the schools, University of Louisiana at Lafayette and LeTourneau University, where admission rates consistently fell year after year. Clearly they are doing something that is making them popular among students!
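The "consistently falling" condition amounts to checking that a school's rate series never increases year over year. A minimal sketch in Python, using Vanderbilt's rates from the chart data (note the check uses `<=` rather than `<`, matching the Spark filter below, since Vanderbilt's rate was flat between 2005 and 2006):

```python
# A school "consistently reduced" its rate if the series never increases
# year over year (<=, matching the Spark filter's ADM_RATE <= ... conditions).
def never_increasing(rates):
    """rates ordered oldest to newest."""
    return all(b <= a for a, b in zip(rates, rates[1:]))

# Vanderbilt's admission rates, 2003 through 2013
vanderbilt = [0.4626, 0.3831, 0.3528, 0.3528, 0.3282, 0.2533,
              0.2015, 0.1795, 0.1642, 0.1423, 0.1274]
print(never_increasing(vanderbilt))  # True
```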
Largest Percent Change in Admission Rate
From 2003 through 2013, the admission rate dropped the most at these universities:
University | Percent Change | Adm Rate in 2003 | Adm Rate in 2013 |
---|---|---|---|
University of Chicago | 0.777 | 0.3962 | 0.0881 |
Mississippi Valley State University | 0.770 | 0.9888 | 0.2272 |
Adventist University of Health Sciences | 0.743 | 0.5213 | 0.1336 |
Vanderbilt University | 0.725 | 0.4626 | 0.1274 |
Robert Morris University Illinois | 0.723 | 0.7591 | 0.21 |
Pitzer College | 0.710 | 0.501 | 0.1451 |
Missouri Valley College | 0.668 | 0.6677 | 0.2218 |
Claremont McKenna College | 0.622 | 0.3102 | 0.1173 |
Colorado College | 0.618 | 0.5833 | 0.2228 |
Corban University | 0.614 | 0.8364 | 0.3225 |
There are some universities here that I had not heard of, so it was interesting to see that these schools have reduced their admission rates by such a large margin.
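The percent-change column is just the relative drop between the two rate columns. A quick sketch, reproducing the University of Chicago row from the table (up to rounding):

```python
# Relative drop in admission rate between 2003 and 2013
def relative_drop(rate_2003, rate_2013):
    return (rate_2003 - rate_2013) / rate_2003

# University of Chicago row from the table above
print(round(relative_drop(0.3962, 0.0881), 3))
```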
Mean ACT Change Year over Year
This shows how the mean ACT score changed from 2003 through 2013. It is interesting to see that the mean increased rapidly from 2006 onwards and seems to be stabilizing around 2011.
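Since the Scorecard data comes as one file per year, the yearly mean is just the mean of the ACTCMMID column in each file. A hypothetical sketch in pandas (the function name and the toy numbers are illustrative, not the real values):

```python
import pandas as pd

# Sketch: mean ACT midpoint (ACTCMMID) per year, one DataFrame per year file
def mean_act_by_year(frames):
    # frames: dict mapping year -> DataFrame with an ACTCMMID column
    return {year: df["ACTCMMID"].mean() for year, df in frames.items()}

# Toy data with made-up scores, just to show the shape of the computation
frames = {
    2003: pd.DataFrame({"ACTCMMID": [20, 22, 24]}),
    2004: pd.DataFrame({"ACTCMMID": [21, 23, 25]}),
}
print(mean_act_by_year(frames))  # {2003: 22.0, 2004: 23.0}
```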
```java
SparkConf conf = new SparkConf().setAppName("test").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

// Load the 2013 file; the files for 2012 through 2003 are loaded the same way
DataFrame df2013 = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/home/.../MERGED2013_PP.csv");

// Select the interesting columns; the 2013 columns keep their original names
DataFrame output2013 = df2013.filter("ADM_RATE <> 'NULL'")
    .sort("ADM_RATE")
    .select("OPEID", "HIGHDEG", "ADM_RATE", "SAT_AVG",
            "ACTCMMID", "TUITIONFEE_IN", "INEXPFTE");

// For each earlier year, suffix the columns with the year, e.g. for 2012:
DataFrame output2012 = df2012.filter("ADM_RATE <> 'NULL'")
    .sort("ADM_RATE")
    .select("OPEID", "HIGHDEG", "ADM_RATE", "SAT_AVG",
            "ACTCMMID", "TUITIONFEE_IN", "INEXPFTE")
    .withColumnRenamed("ADM_RATE", "ADM_RATE_2012")
    .withColumnRenamed("HIGHDEG", "HIGHDEG_2012")
    .withColumnRenamed("SAT_AVG", "SAT_AVG_2012")
    .withColumnRenamed("ACTCMMID", "ACTCMMID_2012")
    .withColumnRenamed("TUITIONFEE_IN", "TUITIONFEE_IN_2012")
    .withColumnRenamed("INEXPFTE", "INEXPFTE_2012");

// Join all the individual data frames on the join key OPEID
DataFrame joinedDf = output2013.join(output2012, "OPEID")
    .join(output2011, "OPEID")
    .join(output2010, "OPEID")
    .join(output2009, "OPEID")
    .join(output2008, "OPEID")
    .join(output2007, "OPEID")
    .join(output2006, "OPEID")
    .join(output2005, "OPEID")
    .join(output2004, "OPEID")
    .join(output2003, "OPEID");

// Filter: keep schools whose admission rate never increased year over year,
// and exclude open-admission schools with a rate of 1
DataFrame dfOutput = joinedDf.filter(
    "ADM_RATE <= ADM_RATE_2012 and " +
    "ADM_RATE_2012 <= ADM_RATE_2011 and " +
    "ADM_RATE_2011 <= ADM_RATE_2010 and " +
    "ADM_RATE_2010 <= ADM_RATE_2009 and " +
    "ADM_RATE_2009 <= ADM_RATE_2008 and " +
    "ADM_RATE_2008 <= ADM_RATE_2007 and " +
    "ADM_RATE_2007 <= ADM_RATE_2006 and " +
    "ADM_RATE_2006 <= ADM_RATE_2005 and " +
    "ADM_RATE_2005 <= ADM_RATE_2004 and " +
    "ADM_RATE_2004 <= ADM_RATE_2003 and " +
    "ADM_RATE < 1 and ADM_RATE_2012 < 1 and ADM_RATE_2011 < 1 and " +
    "ADM_RATE_2010 < 1 and ADM_RATE_2009 < 1 and ADM_RATE_2008 < 1 and " +
    "ADM_RATE_2007 < 1 and ADM_RATE_2006 < 1 and ADM_RATE_2005 < 1 and " +
    "ADM_RATE_2004 < 1 and ADM_RATE_2003 < 1");

// Write out to file
dfOutput.limit(1000).coalesce(1).write()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("/home/.../adm_rate_yoy.csv");
```
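For readers who prefer Python, the same wide join and year-over-year filter can be sketched in pandas. This is a hypothetical equivalent with toy data, not the code used for the analysis:

```python
from functools import reduce
import pandas as pd

# Toy frames standing in for the per-year Scorecard extracts (made-up rates)
frames = [
    pd.DataFrame({"OPEID": [1, 2], "ADM_RATE_2013": [0.10, 0.50]}),
    pd.DataFrame({"OPEID": [1, 2], "ADM_RATE_2012": [0.15, 0.40]}),
    pd.DataFrame({"OPEID": [1, 2], "ADM_RATE_2011": [0.20, 0.45]}),
]

# Join everything on OPEID, as the Spark code does
joined = reduce(lambda a, b: a.merge(b, on="OPEID"), frames)

# Keep schools whose rate never increased, comparing newest to oldest column
cols = ["ADM_RATE_2013", "ADM_RATE_2012", "ADM_RATE_2011"]
mask = joined.apply(
    lambda r: all(r[cols[i]] <= r[cols[i + 1]] for i in range(len(cols) - 1)),
    axis=1,
)
print(joined[mask]["OPEID"].tolist())  # [1]
```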
```python
import graphlab
import matplotlib.pyplot as plt

adm_rate_yoy = graphlab.SFrame('adm_rate_yoy.csv/part-00000.csv')

year = [2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003]
adm_rate_v = [0.1274, 0.1423, 0.1642, 0.1795, 0.2015, 0.2533,
              0.3282, 0.3528, 0.3528, 0.3831, 0.4626]
adm_rate_c = [0.1556, 0.1655, 0.1797, 0.1836, 0.191, 0.2066,
              0.214, 0.2468, 0.2708, 0.2944, 0.3099]
adm_rate_l = [0.4066, 0.4335, 0.5594, 0.5874, 0.5904, 0.6619,
              0.7015, 0.7345, 0.7641, 0.7937, 0.8012]
adm_rate_ul = [0.593, 0.6036, 0.6593, 0.6688, 0.6718, 0.6784,
               0.7023, 0.731, 0.758, 0.8496, 0.8703]
adm_rate_m = [0.0815, 0.0895, 0.0973, 0.1008, 0.107, 0.1186,
              0.1248, 0.1331, 0.1431, 0.1591, 0.1645]

plt.xlabel('Year')
plt.ylabel('Admission Rate')
plt.plot(year, adm_rate_v, label='Vanderbilt University')
plt.plot(year, adm_rate_c, label='Cornell University')
plt.plot(year, adm_rate_l, label='LeTourneau University')
plt.plot(year, adm_rate_ul, label='University of Louisiana at Lafayette')
plt.plot(year, adm_rate_m, label='Massachusetts Institute of Technology')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
```