FOR PRIVATE USE ONLY

View on website: https://dairew.github.io/sternhonorsthesisprivate/

An honors thesis submitted in partial fulfillment of the requirements for the degree of Bachelor of Science from the Undergraduate College of the Leonard N. Stern School of Business of New York University.

Abstract

This analysis attempts to shed more light on extracurricular activities at the undergraduate level and how they might translate into future job prospects. I collected club activity data from NYU Stern’s Office of Student Engagement and built a web scraper to mine LinkedIn for graduate job placement data. While certain shortcomings prevented the analysis from being statistically conclusive, there were numerous other interesting findings regarding how graduates from each club placed and how students engaged with clubs on campus. Key visualization tools were left out of this version of the thesis due to data sensitivity concerns.

Acknowledgements

Background

Every August, wide-eyed freshmen set foot in Stern and are inundated with amazing opportunities to join the vibrant extracurricular scene. From learning career skills, to meeting new friends, to networking into desirable jobs, clubs have so much to offer Stern students. And indeed, Stern students take advantage of these offerings. Since the start of the 2014 Academic Year, 2,562 unique NYU students have recorded 51,134 check-ins to over 1,602 events, across 211 clubs that have held events at Stern.

Mapping all of this club activity to a career path is difficult. In an ideal world, the analysis would rely on data that includes a measure of each student’s intent and on randomly assigning students to clubs. Since this is not an ideal world, the approach taken here is to proxy that ideal as closely as possible: state the problems up front, identify the caveats, and say what can be learned from these data sets.

In 2016, 179 clubs held events at Stern, most of them Stern clubs. All of the top 10 clubs by check-ins are Stern clubs. However, this is aggregated across the whole year - so the same student could have checked in multiple times to various events. Therefore, a look into the total number of unique attendees for each club will reveal how big each club’s audience really is.

The top clubs by unique attendees differ slightly from the top clubs by check-ins. A side-by-side comparison shows that…

Some of these differences are understandable - DSP and AKPsi are fraternities and thus have a capped number of people who can attend an event (as some events are exclusive to fraternity members). These figures can be combined to find the Average Number of Check-Ins per Unique Attendee for each club, which gives a sense of the engagement level of each club’s members.
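The calculation can be sketched as follows. This is a minimal illustration, assuming the check-in log is available as (club, student ID) pairs with one pair per check-in; the record layout, the sample rows, and the function name `engagement_by_club` are hypothetical and do not reflect the Office of Student Engagement’s actual export format.

```python
from collections import defaultdict

def engagement_by_club(checkins):
    """Return {club: (total_checkins, unique_attendees, avg_checkins_per_attendee)}.

    `checkins` is an iterable of (club, student_id) pairs, one per check-in.
    """
    totals = defaultdict(int)      # club -> number of check-ins
    attendees = defaultdict(set)   # club -> set of distinct student IDs
    for club, student_id in checkins:
        totals[club] += 1
        attendees[club].add(student_id)
    return {
        club: (totals[club], len(attendees[club]), totals[club] / len(attendees[club]))
        for club in totals
    }

# Hypothetical sample data, for illustration only.
sample = [
    ("Club A", "s1"), ("Club A", "s1"), ("Club A", "s2"),
    ("Club B", "s1"), ("Club B", "s3"), ("Club B", "s4"),
]
stats = engagement_by_club(sample)
```

In this made-up sample both clubs record three check-ins, but Club A draws them from two students and Club B from three, so Club A has the higher average per attendee.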

One club averaged an impressive 8.4 events per unique attendee. When asked to comment on these figures, an executive board member had this to say: “Our members are very committed. At the beginning of every semester, we tell all of our new candidates, that this organization is what you make of it. The more work you put into it, the more we can help you. At the end of the day, you can do the bare minimum of meeting enough requirements to cross as a member, but we always encourage our candidates to do more and see them go above and beyond!”

Other interesting observations emerge from the seasonality in the data. Fall semesters have a higher number of total check-ins (Exhibit A) and unique attendees (Exhibit B). However, the average number of check-ins per attendee (Exhibit C) is higher in the spring. This is consistent with a qualitative observation of students exploring many clubs in the fall, and then narrowing their commitments in the spring. Thus, they attend fewer clubs, and more events of the same club.

Exhibit A: Total Check-ins

Exhibit B: Unique Attendees

Exhibit C: Average # of Check-ins/Attendee

On a club-to-club basis, clubs of the same nature seem to be correlated at the beginning of the year, when check-ins are high, yet they begin to outperform or underperform each other as the year goes on, most likely due to the quality of their events. Here are tools to add/analyze your own set of clubs:

Total Check-ins by Club, over Time

Total Unique Attendees by Club, over Time

There’s no doubt that undergraduates at Stern are very involved in their clubs. However, the average Stern student only has so much time to dedicate to extracurricular activities. And frankly speaking, most Stern students join Stern clubs for one main reason: to better their career prospects, which can be achieved through either 1) learning valuable skills, or 2) networking with older students. Therefore, Stern students need clarity on the widely held belief that joining clubs helps in job placement and, furthermore, on which clubs place better at which firms. It is very possible that the data reveal that certain firms hire only from certain clubs, which would point to internal loyalties between these clubs and firms. In the spirit of transparency, and for the sake of helping freshmen and sophomores better allocate their time based on job interest, the following analysis will look into club participation as a predictor of eventual post-graduation job placement.

Introduction to the LinkedIn Analysis

To keep the analysis concise and to avoid “noisy” data, the only career paths considered in this analysis are those pertaining to financial services, particularly investment banking. This is because investment banking is the most desired career path amongst Stern students. In addition, only pre-professional clubs will be analyzed, because it would be unfair to measure pre-professional clubs against clubs that are clearly social in nature, such as various fraternities or even the Stern Student Council. As long as a club dedicates a significant component of its membership activities to educating students about a specific professional field, it can be considered a pre-professional organization. Furthermore, my analysis will only focus on the top 11 investment banks from the 2017 Vault Banking 50 Guide (an industry-accepted standard for ranking investment banks by performance and prestige). Thus, the scope of the question at hand is condensed into: how does undergraduate pre-professional extracurricular involvement affect post-graduation job placement into the top 11 investment banks, according to Vault?

Note: I only included the Top 11 banks because LinkedIn seemed to have caught onto my scraper’s activities while I was in the middle of scraping data for the 11th firm. This does introduce some sampling bias, so be sure to qualify all the results you see with “for the top 11 investment banks”.

Hypothesis

Our null hypothesis in this experiment says that “graduates going to ‘better’ firms for investment banking or similar jobs do not differ in extracurricular participation to a statistically significant degree.” “Better”, in this case, refers to the 2017 Vault IB Rankings.

Data Collection Methodology

The data were collected and cross-validated in two ways: LinkedIn web scraping and primary research. LinkedIn web scraping involves building a web scraper tool that scrapes LinkedIn public user profiles for information. The key information to gather is: first job out of NYU Stern and club involvement (if any). This raw data set will be collected for all graduates of NYU Stern from 2013-2016. This date range was chosen since it is close to the years that I, myself, attended NYU Stern, and thus I will have more domain expertise in being able to fill in missing information or detect anomalies in my raw data set. This additional cross-validation step is very important, as LinkedIn profiles vary widely in terms of how people report where they work, what titles they have, and what school activities they were involved in. Primary research gathers data from the Office of Student Engagement, current club leadership, and my own manual verification via LinkedIn.

The challenges to data collection lie not in the collection itself, but in understanding the data and making sure it is accurate. For example, on LinkedIn, anyone can claim to be part of a certain organization even when they are not. In addition, people may not list the organizations they are part of. Therefore, sample size is critical for this analysis in reducing the influence of outliers and identifying the true trends.

Building the web scraper

Ironically, the most difficult and challenging part of this thesis was creating the scraper, yet it will receive almost none of the credit in terms of the end-deliverable. The technical challenge involved in subverting LinkedIn’s anti-bot detection security is tremendous. Thus, I had to be very careful to gather this information in a way that would avoid detection and not result in my IP address getting banned. In addition, to protect the anonymity of future students and to avoid condoning such behavior, my scraper’s specific algorithmic tricks will not be disclosed.

The first step in building the scraper was to collect a list of names of every person who worked at each one of these top 11 firms. This can be done through LinkedIn search with the help of the filters on the side. For example, I filtered for people who worked at “JP Morgan” and were from “New York University - Leonard N. Stern School of Business”. Then I copied and pasted this big list of names into Excel, saved it as a CSV, and used some Python code to strip out meaningless information until I was left with observations that contained: name, company, and current job title. This part of the process had to be manual for two reasons. First, the entries on the LinkedIn Search Results (LSR) page did not link to each person’s profile URL. Second, scraping the LSR page would be impossible because these search results are only available after creating an account with LinkedIn. LinkedIn does not allow scraping, period, so it certainly does not offer permissions that would let me enter the site as a scraping bot, despite my having an account with LinkedIn. Now that I had the list of observations, I had to get the actual profile URL for each one so I could visit the profile page and get the desired information.
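The cleaning step described above could look roughly like the sketch below. The layout of the pasted CSV is not documented here, so the column names (`name`, `company`, `title`, `extra`), the sample rows, and the function name `clean_observations` are all assumptions made for illustration, not the actual code used for this thesis.

```python
import csv
import io

KEEP = ("name", "company", "title")  # assumed column names

def clean_observations(csv_text):
    """Keep only name, company and title; drop rows missing any of them."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        # Missing cells come back as None, so fall back to an empty string.
        record = {key: (row.get(key) or "").strip() for key in KEEP}
        if all(record.values()):
            cleaned.append(record)
    return cleaned

# Hypothetical pasted search results, for illustration only.
raw = """name,company,title,extra
Jane Doe,JP Morgan,Investment Banking Analyst,2nd degree connection
,,,Connect
John Roe , JP Morgan,Analyst,
"""
observations = clean_observations(raw)
```

Rows that carry only leftover page text (such as the “Connect” row in the made-up sample) are dropped because they lack a name, company, or title, and stray whitespace around the remaining fields is trimmed.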

The second step was to run each of these observations through a search engine so that search results for the person would appear. For every observation in the previous list, my script conducted a search with the name, firm, and current job title in the search string, with the aim of narrowing the search results down to the right person. When the search results appeared, the script broke the page down into raw HTML and saved it. This is important because each search result is hyperlinked to that person’s LinkedIn profile URL. Thus, the true purpose of this step is to get the profile URLs. The search engine used was Bing, as Google is much stricter with automated searches on its platform. The output from this stage was, for each observation, a raw string of HTML from the first page of results.
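To illustrate why the saved HTML is useful, here is a minimal sketch of pulling profile links out of one saved results page using only the Python standard library. It assumes that profile links appear as plain `href` attributes containing `linkedin.com/in/`; a real results page may wrap or redirect its links, and the class name `ProfileLinkParser`, the function name `extract_profile_urls`, and the sample page are hypothetical rather than taken from the actual scraper.

```python
from html.parser import HTMLParser

class ProfileLinkParser(HTMLParser):
    """Collect href values that point at LinkedIn profile pages."""

    def __init__(self):
        super().__init__()
        self.profile_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: public profile URLs contain "linkedin.com/in/".
        if "linkedin.com/in/" in href and href not in self.profile_urls:
            self.profile_urls.append(href)

def extract_profile_urls(raw_html):
    """Return the distinct profile URLs found in a page, in page order."""
    parser = ProfileLinkParser()
    parser.feed(raw_html)
    return parser.profile_urls

# Hypothetical saved search-results HTML, for illustration only.
page = """
<ol>
  <li><a href="https://www.linkedin.com/in/example-person-123">Example Person</a></li>
  <li><a href="https://www.example.com/news">Unrelated result</a></li>
  <li><a href="https://www.linkedin.com/in/example-person-123">Duplicate link</a></li>
</ol>
"""
urls = extract_profile_urls(page)
```

Keeping the links in page order means the first entry corresponds to the top-ranked search result, which is usually the best candidate for the person being searched.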

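# Snippet of code:

from urllib.parse import quote_plus

# name and title are taken from the current observation (one row of the list above)
company = "Moelis"

# quote_plus turns spaces into "+" and escapes characters such as "&" in names or titles
query = " ".join([name, title, company, "nyu stern", "linkedin"])
url = "https://www.bing.com/search?q=" + quote_plus(query)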