Kate Reagor
School of Information, University of Texas at Austin
INF 385T - Data Wrangling
Dr. James Howison
May 13, 2020
Or, does New York City's rat problem keep visitors from the theatre?
This project attempts to discover whether attendance at Broadway performances in New York City is affected by the number of reported rat sightings in the vicinity of the theatres where they are held. Do theatres with rat problems bring in less revenue than those that do not? Were there fewer rats around showings of Cats or Of Mice and Men? Are more rats to be found congregating around Something Rotten! or Superior Donuts?
Please note that the subject of the project is intentionally whimsical in nature. Its purpose is to demonstrate proficiency in managing the data wrangling process: moving information from original data sources, through cleanup and transformation, into a database, and then pulling it back out again in new configurations for analysis. Any analysis resulting from this process is purely for demonstration and entertainment.
The data for this project came from three found sources: one containing Broadway show attendance and ticket sale revenue, one containing dates and locations of reported rat sightings in New York City, and one listing all Broadway theatres and their location coordinates. The data sources span the years 1990 to 2020, but the project limits its scope to the years in which the Broadway and rat sighting datasets overlap: 2010 to 2016.
Source: NYC OpenData
The first dataset lists all Broadway theatres operating in New York City, along with their addresses, zip codes, and geographical coordinates. It serves as the linchpin between the other two datasets, connecting the Broadway attendance data via theatre name to the rat sightings data via geographical coordinates.
Cleanup and challenges: The geographic coordinates appear together in a single cell, with extra text surrounding them. Before they could be used, it was necessary to remove the extra text and separate the latitude and longitude into distinct values.
Source: CORGIS Dataset Project, Broadway League
This dataset contains information on Broadway performances from 1990-2016. Performances are listed by the start of each performance week. Pertinent data include the names and types of shows performed, which theatre they were performed in, weekly ticket sales by amount and percent of theatre capacity, and weekly revenue by amount and percent of the estimated maximum possible revenue.
Cleanup and challenges: The dates in the file are not in DATETIME format and needed to be converted before import into the database. Additionally, the theatre names do not match up exactly with those in the theatre locations dataset, which required the creation of a new dataset modeling the relationship between each theatre and the various names by which it is known.
The show names in the file also contain a number of capitalization errors, as though the file had been run through a faulty title case converter. Articles and other short words, such as "and" and "the", are capitalized, along with letters after apostrophes, such as "Who'S" and "Don'T". This would have to be cleaned up before import.
Source: NYC OpenData, Department of Health and Mental Hygiene
The final dataset contains a listing of every reported rat sighting in New York City from 2010 to the present. The file contains massive amounts of information, but the pertinent data for this project are the date of the reported rat sighting, and the zip code and geo coordinates indicating the location of the sighting.
Cleanup and challenges: The dates in the file are not in DATETIME format and needed to be converted before import into the database.
Source: Manually created with data from Broadway Attendance and Theatre Location datasets
The project additionally required one manually-created source: a lookup table linking all the different iterations of theatre names to a single "canonical" version of that theatre's name.
As it turned out, there was no way to automate this process. Not all of the names from the Broadway Attendance file were similar enough to those from the Theatre Locations file to match them reliably using a query, and in some cases the theatre name had changed and it was necessary to conduct research to match it to its canonical name.
To create the file, I exported the Theatres table from the database (which was populated using the Theatre Locations dataset, and contained the canonical name and a unique id number for each theatre). I then added the theatre names from the Broadway Attendance dataset, manually matched each up with the unique theatre id for its canonically named equivalent, and saved the file as a CSV.
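The export step itself is simple enough to sketch. Something along the following lines would dump the canonical names and ids to a skeleton CSV for manual matching (a rough sketch only: the output file name, the extra canonical_name reference column, and the blank alt_name column are illustrative assumptions, not the exact file I produced):

# Sketch only: export canonical theatre names and ids to a skeleton CSV for manual matching
import csv
import pymysql

connection = pymysql.connect(
    host="mariadb",
    user="root",
    passwd="",
    db="broadway_rats",
    autocommit=True,
    cursorclass=pymysql.cursors.DictCursor
)
cursor = connection.cursor()

# Pull the canonical names and unique ids populated from the Theatre Locations dataset
cursor.execute("SELECT id AS theatre_id, canonical_name FROM theatres")

# Write a skeleton CSV; the alt_name column is left blank to be filled in by hand
with open('theatre_alt_names_skeleton.csv', 'w') as csvfile:
    myCsvWriter = csv.DictWriter(csvfile, fieldnames=["theatre_id", "canonical_name", "alt_name"])
    myCsvWriter.writeheader()
    for row in cursor:
        row["alt_name"] = ""   # hypothetical placeholder: matched manually against the Broadway names
        myCsvWriter.writerow(row)

In practice a single theatre can be known by several names, so the finished theatre_alt_names_import.csv contains one row per name variant, each matched by hand to the proper theatre id.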
The data workflow for this project, as illustrated in the diagram below, is composed of three distinct steps. First, the data sources are processed and cleaned using a series of Python scripts, and the cleaned data is inserted into the proper tables within a MySQL relational database. Next, the data is rearranged via a series of SQL queries and exported as new CSV files. Finally, these files are imported into Tableau and used to analyze and create visualizations of the data.
The Broadway Rats relational database is composed of five tables:
The Theatres table contains the locations and names of record for Broadway theatres in New York City.
The Theatre_Alt_Names table contains all iterations of the theatre names from the data sources.
The Shows table lists all shows that ran on Broadway from 2010 to 2016 and connects them to the theatre in which they were performed.
The Performances table lists all show performances and their attendance and revenue data by week.
The Rat_Sightings table lists all reported rat sightings in NYC from 2010 to 2016.
Below are examples of the database tables and examples of the data contained in each.
After the data sources were identified and the database designed and created, the next step was to write the Python scripts to clean, arrange, and insert that data into the database tables. The following section displays and describes all code used in the processing and insertion of the project data.
The first step to processing and importing the data was to import the necessary code libraries and establish the connection to the database.
# Import all necessary code libraries
import pymysql
import csv
import datetime
import re

# Open connection to the database
connection = pymysql.connect(
    host="mariadb",
    user="root",
    passwd="",
    db="broadway_rats",
    autocommit=True,
    cursorclass=pymysql.cursors.DictCursor
)
cursor = connection.cursor()
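The database and its five tables were designed and created ahead of time, so the DDL is not part of these scripts. Purely for reference, the following is a minimal sketch of CREATE TABLE statements consistent with the columns used by the import code below; the data types, sizes, and key constraints shown here are assumptions and may differ from the actual schema.

# Hypothetical schema sketch: column names follow the insert code below,
# but types, sizes, and constraints are assumptions, not the project's exact DDL
schema_sketch_sql = [
    """CREATE TABLE IF NOT EXISTS theatres (
           id INT AUTO_INCREMENT PRIMARY KEY,
           canonical_name VARCHAR(255),
           longitude DECIMAL(11,8),
           latitude DECIMAL(10,8),
           zip VARCHAR(10)
       )""",
    """CREATE TABLE IF NOT EXISTS theatre_alt_names (
           theatre_id INT,
           alt_name VARCHAR(255),
           FOREIGN KEY (theatre_id) REFERENCES theatres(id)
       )""",
    """CREATE TABLE IF NOT EXISTS shows (
           id INT AUTO_INCREMENT PRIMARY KEY,
           theatre_id INT,
           name VARCHAR(255),
           type VARCHAR(50),
           FOREIGN KEY (theatre_id) REFERENCES theatres(id)
       )""",
    """CREATE TABLE IF NOT EXISTS performances (
           id INT AUTO_INCREMENT PRIMARY KEY,
           show_id INT,
           week_start DATETIME,
           week_end DATETIME,
           percent_filled DECIMAL(6,2),
           tickets_sold INT,
           total_revenue DECIMAL(12,2),
           percent_max_revenue DECIMAL(6,2),
           FOREIGN KEY (show_id) REFERENCES shows(id)
       )""",
    """CREATE TABLE IF NOT EXISTS rat_sightings (
           id INT AUTO_INCREMENT PRIMARY KEY,
           date DATETIME,
           longitude DECIMAL(11,8),
           latitude DECIMAL(10,8),
           zip VARCHAR(10)
       )"""
]
# Execute each statement using the connection opened above
for statement in schema_sketch_sql:
    cursor.execute(statement)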
The first data source for import was a list of all reported rat sightings in NYC (Rat_Sightings.csv). This file was used to populate one table: rat_sightings.
The values were mapped as follows (source -> table):
Created Date -> date
Longitude -> longitude
Latitude -> latitude
Incident Zip -> zip
Cleanup in code: Filtered out incidents with dates after 2016, and rows with null values in latitude or zip code fields. Converted dates to datetime format.
# Clear rat_sightings on Run for testing purposes
truncate_rats_sql = "TRUNCATE TABLE rat_sightings"
cursor.execute(truncate_rats_sql)
print("Rat_sightings table cleared")

# ---- Creating Rat_Sightings Table ----
with open('Rat_Sightings.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    for row in myCSVReader:
        # Convert dates to datetime format, save as variable
        date = datetime.datetime.strptime(row['Created Date'], "%m/%d/%Y %I:%M:%S %p")
        # Create variable for date used as cutoff point for data import
        cutoff_date = datetime.datetime.strptime("2016-12-31", "%Y-%m-%d")
        # Create a param dictionary to link date variable
        param_dict = { "date": date, "longitude": row["Longitude"], "latitude": row["Latitude"], "zip": row["Incident Zip"]}
        # Only insert rows that are not empty and dated 2016 or earlier
        if ( (date <= cutoff_date) and (row['Latitude'] != '') \
             and (row['Incident Zip'] != '') ):
            # Insert all data that passes into rat_sightings table
            rats_sql = """INSERT INTO rat_sightings (date, longitude, latitude, zip)
                          VALUES (%(date)s, %(longitude)s, %(latitude)s, %(zip)s)"""
            cursor.execute(rats_sql, param_dict)

print("Rat sightings data inserted into table")
The second data source was a list of Broadway theatres in New York City, and their location information (DOITT_THEATER_01_13SEPT2010.csv). This file was used to populate the theatres table.
The values are mapped as follows (source -> table):
NAME -> canonical_name
the_geom -> longitude, latitude (split into two values)
ZIP -> zip
Cleanup in code: I corrected the coordinates string, which was formatted in the source as:
POINT (-73.99061840882582 40.75985115447559)
Using regular expressions, I removed the extra text and separated the coordinates into two distinct values for longitude and latitude.
# Clear theatres table on Run for testing purposes
truncate_theatres_sql = "TRUNCATE TABLE theatres"
cursor.execute(truncate_theatres_sql)
print("theatres table cleared")

# ---- Creating Theatres Table ----
with open('DOITT_THEATER_01_13SEPT2010.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    for row in myCSVReader:
        # Identify geocoordinates column in data, use regex to separate out the long and lat
        geoString = row['\ufeffthe_geom']
        geoCoord = re.search('\\((.+)\s(.+)\\)', geoString)
        # Save long and lat strings as values
        longitude = geoCoord.group(1)
        latitude = geoCoord.group(2)
        # Create query to insert data into theatres table
        theatre_sql = """INSERT INTO theatres (canonical_name, longitude, latitude, zip)
                         VALUES (%(canonical_name)s, %(longitude)s, %(latitude)s, %(zip)s)"""
        # Create a param dictionary
        param_dict = {"canonical_name": row["NAME"], "longitude": longitude, "latitude": latitude, "zip": row["ZIP"]}
        # Execute query
        cursor.execute(theatre_sql, param_dict)

print("Theatre location data inserted into theatres table")
The manually created data source theatre_alt_names_import.csv was used to populate the theatre_alt_names table. It required no additional cleanup and was imported unaltered.
# Clear theatre_alt_names table on Run for testing purposes
truncate_alt_sql = "TRUNCATE TABLE theatre_alt_names"
cursor.execute(truncate_alt_sql)
print("theatre_alt_names table cleared")

# ---- Creating Theatre_Alt_Names Table ----
with open('theatre_alt_names_import.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    for row in myCSVReader:
        alt_names_import_sql = """INSERT INTO theatre_alt_names (theatre_id, alt_name)
                                  VALUES (%(theatre_id)s, %(alt_name)s)"""
        cursor.execute(alt_names_import_sql, row)

print("Theatre alt names data inserted into theatre_alt_names table")
The final data source was a file containing detailed information on Broadway show performances, organized by performance week (broadway.csv). This file was used to populate fields in two tables: shows and performances.
This import was by far the most complicated, and had to be broken out into three distinct parts: one to fix capitalization and unicode errors in the show titles, one to condense the show titles down to only unique values for import into the shows table, and one to import the entire set of show performance data into the performances table.
The source file contained major errors in show title capitalization, with names formatted in the source as:
A Gentleman'S Guide To Love And Murder
Don'T Dress For Dinner
No Man'S Land/Waiting For Godot
The Curious Incident Of The Dog In The Night-Time
The Motherf**Ker With The Hat
Twelfth Night/Richard Iii
One show title also contained unicode replacement characters:
th�r�se raquin
To correct these errors, it was necessary to use regular expressions to update the show names to proper title case. The corrected names were then organized into a dictionary, with the source file name as the key and the corrected name as the value. This dictionary could then be used to match and insert the properly formatted title when it came time to import the source file data into the database.
# ---- Creating Dictionary of Show Names with proper Title Case ----
with open('broadway.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    # --- Create an empty dict to insert the names into later
    properCaseShows = {}
    for row in myCSVReader:
        # --- Now we need to fix capitalization and replacement character problems
        # first set everything lowercase to create a baseline
        lowercaseTitle = row["Show.Name"].lower()

        # --- define a function to apply capitalization rules ---
        def title_except(string, exceptions, upper):  # takes the string plus lists of exception words and words to uppercase
            # word_list: takes string and splits it into a list of words using space as delimiter
            word_list = re.split('\s', string)
            title = [word_list[0].capitalize()]  # create "title" list with first word from word_list, capitalized
            for word in word_list[1:]:  # for all subsequent words in word_list
                if word in exceptions:  # if word is in exceptions (articles) append as-is
                    title.append(word)
                elif word in upper:  # if word is in upper (roman numerals) append in full uppercase
                    title.append(word.upper())
                else:  # all other words append capitalized
                    title.append(word.capitalize())
            return " ".join(title)  # join all words in title list with space

        articles = ['a', 'an', 'of', 'the', 'is', 'in', 'for', 'from']  # define articles for capitalization exceptions
        romanNum = ['ii', 'iii', 'iv', 'vi', 'vii', 'viii', 'ix']  # define roman numerals to enforce uppercase
        # Define the result of the above as uppercaseTitle
        uppercaseTitle = title_except(lowercaseTitle, articles, romanNum)

        # --- The function above doesn't capitalize letters after hyphens or slashes in divided names
        # --- or articles that come after a colon. A different regex function is needed for those titles.
        # Regex to split string into 3 parts:
        # all characters before hyphen/slash/colon+space, the single letter after, and then all after that
        hyphenCheck = re.search('(.+)-(.)(.+)$', uppercaseTitle)
        slashCheck = re.search('(.+)/(.)(.+)$', uppercaseTitle)
        colonCheck = re.search('(.+):\s(.)(.+)$', uppercaseTitle)

        # --- Now we start writing titles to the dictionary
        # First check for the title with the unicode replacement characters, fix it,
        # then add it to the dict through the capitalization function
        if (lowercaseTitle == "th�r�se raquin"):
            fixedTitle = lowercaseTitle.replace("th�r�se", "therese")
            properCaseShows.update( {row["Show.Name"]: title_except(fixedTitle, articles, romanNum)} )
        # Next see if hyphenCheck found a hyphen in the name, and the character after the hyphen isn't a space
        # if so, insert the three parts of hyphenCheck, with the character after the hyphen capitalized
        elif ( (hyphenCheck != None) and (hyphenCheck.group(2) != ' ') ):
            hyphTitle1 = hyphenCheck.group(1)
            hyphTitle2 = hyphenCheck.group(2).capitalize()
            hyphTitle3 = hyphenCheck.group(3)
            properCaseShows.update( {row["Show.Name"]: "{}-{}{}".format(hyphTitle1, hyphTitle2, hyphTitle3)} )
        # Then perform the same check, but for forward slash
        elif ( (slashCheck != None) and (slashCheck.group(2) != ' ') ):
            slashTitle1 = slashCheck.group(1)
            slashTitle2 = slashCheck.group(2).capitalize()
            slashTitle3 = slashCheck.group(3)
            properCaseShows.update( {row["Show.Name"]: "{}/{}{}".format(slashTitle1, slashTitle2, slashTitle3)} )
        # finally check for colon+space
        elif (colonCheck != None):
            colonTitle1 = colonCheck.group(1)
            colonTitle2 = colonCheck.group(2).capitalize()
            colonTitle3 = colonCheck.group(3)
            properCaseShows.update( {row["Show.Name"]: "{}: {}{}".format(colonTitle1, colonTitle2, colonTitle3)} )
        # For all other rows just insert the title as run through the uppercaseTitle capitalization function
        else:
            properCaseShows.update( {row["Show.Name"]: uppercaseTitle} )

print("Created properCaseShows dictionary")
Creating the Shows table from the source data required a few steps before the data could be inserted: filtering out shows from before 2010, bringing in the theatre id values for the foreign key shows.theatre_id, and then narrowing the results down to a single instance of each unique show name.
The values for the Shows table are mapped as follows (source -> table):
Show.Name -> name (corrected via the properCaseShows dictionary)
Show.Type -> type
Show.Theatre -> theatre_id (looked up in theatre_alt_names)
# Clear shows table on Run for testing purposes
truncate_shows_sql = "TRUNCATE TABLE shows"
cursor.execute(truncate_shows_sql)
print("Show table cleared")

# ---- Creating Shows Table ----
with open('broadway.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    for row in myCSVReader:
        # Use an if statement to only check shows from 2010 or later
        if (row["Date.Year"] >= "2010"):
            # Use SELECT statement to find theatre_id in theatre_alt_names based on theatre name in file
            sql = "SELECT theatre_id FROM theatre_alt_names WHERE alt_name = %(Show.Theatre)s"
            cursor.execute(sql, row)
            result = cursor.fetchone()
            theatre_id = result['theatre_id']
            # Define showname as corrected show name (value) from properCaseShows dictionary
            # that matches up with the original source name (key)
            showname = properCaseShows.get(row["Show.Name"])
            # Create query to insert data into shows table
            shows_sql = "INSERT INTO shows (theatre_id, name, type) VALUES (%(theatre_id)s, %(showname)s, %(type)s)"
            # Create param dictionary
            param_dict = { "theatre_id": theatre_id, "showname": showname, "type": row["Show.Type"]}
            # Next we need to strip the data down to only unique values for Show.Name
            # Run a query for each iteration to check if that show name is already in the table
            sql = "SELECT name FROM shows WHERE name = %(showname)s"
            cursor.execute(sql, param_dict)
            # If the query returns no results (less than one), write the row
            if (cursor.rowcount < 1):
                cursor.execute(shows_sql, param_dict)

print("Unique show names inserted into shows table")
I inserted the data from broadway.csv into the Performances table using a separate code block from that of the Shows table. The Shows table had to be created prior to Performances in order to generate the shows.id values used as the foreign key performances.show_id, but the logic needed to pull only unique show names out of the data complicated combining the two imports, so I decided to keep them separate.
The values for the Performances table are mapped as follows (source -> table):
Show.Name -> show_id (looked up in shows)
Date.Full -> week_start (week_start + 6 days -> week_end)
Statistics.Capacity -> percent_filled
Statistics.Attendance -> tickets_sold
Statistics.Gross -> total_revenue
Statistics.Gross Potential -> percent_max_revenue
Cleanup in code: Converted dates to datetime format, filtered out rows with dates earlier than 2010.
# Clear performances table on Run for testing purposes
truncate_perf_sql = "TRUNCATE TABLE performances"
cursor.execute(truncate_perf_sql)
print("Performances table cleared")

# --- Creating Performances Table ---
with open('broadway.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    for row in myCSVReader:
        # Convert dates to datetime format, save as variable
        start_date = datetime.datetime.strptime(row['Date.Full'], "%m/%d/%Y")
        # Create variable for date used as cutoff point for data import
        cutoff_date = datetime.datetime.strptime("2010-01-01", "%Y-%m-%d")
        # Use an if statement to only check shows from after the cutoff date
        if (start_date >= cutoff_date):
            # Use SELECT statement to find shows.id and insert as foreign key show_id
            # Match up show names from source with values from properCaseShows dict
            showname = properCaseShows.get(row["Show.Name"])
            param_dict = {"showname": showname}
            sql = "SELECT id FROM shows WHERE name = %(showname)s"
            cursor.execute(sql, param_dict)
            result = cursor.fetchone()
            show_id = result['id']
            # Define value for end_date by adding 6 (days) to start_date
            end_date = start_date + datetime.timedelta(days=6)
            # Create query to insert data into performances table
            perf_sql = """INSERT INTO performances (show_id, week_start, week_end, percent_filled,
                                                    tickets_sold, total_revenue, percent_max_revenue)
                          VALUES (%(show_id)s, %(week_start)s, %(week_end)s, %(percent_filled)s,
                                  %(tickets_sold)s, %(total_revenue)s, %(percent_max_revenue)s)
                       """
            # Create a param dict
            param_dict = {"show_id": show_id,
                          "week_start": start_date,
                          "week_end": end_date,
                          "percent_filled": row["Statistics.Capacity"],
                          "tickets_sold": row["Statistics.Attendance"],
                          "total_revenue": row["Statistics.Gross"],
                          "percent_max_revenue": row["Statistics.Gross Potential"]
                          }
            # Execute query
            cursor.execute(perf_sql, param_dict)

print("Performance information inserted into performances table")
In the process of doing research for the alternate theatre name lookup table, I discovered that the names of a number of theatres had changed since the creation of the source file. Additionally, two theatres for which I had performance data were not listed in the theatre locations file.
Under ordinary circumstances these updates might be made in the database as they arose, but for the sake of the project I decided to model the changes as though an updated file had been made available and needed to be imported into the database. I created a new file (DOITT_THEATER_01_01MAY2020.csv) which contained all the same data as the original, but with a few names updated and the new theatres and their location information added as additional rows.
The following code checks the update file against the data in the database, and makes updates or additions as needed. A second code block then checks the changes made to the theatres table, and updates the theatre_alt_names table with all new and updated theatre name iterations.
# ---- Updating Theatres Table ----
with open('DOITT_THEATER_01_01MAY2020.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    for row in myCSVReader:
        # Identify geocoordinates column in data, use regex to separate out the long and lat
        geoString = row['\ufeffthe_geom']
        geoCoord = re.search('\\((.+)\s(.+)\\)', geoString)
        # Save long and lat strings as values
        longitude = geoCoord.group(1)
        latitude = geoCoord.group(2)
        # Create query to insert data into theatres table
        insert_theatre_sql = """INSERT INTO theatres (canonical_name, longitude, latitude, zip)
                                VALUES (%(canonical_name)s, %(longitude)s, %(latitude)s, %(zip)s)"""
        # Create query to update existing theatre if coordinates are identical
        update_theatre_sql = """UPDATE theatres SET theatres.canonical_name = %(canonical_name)s
                                WHERE theatres.latitude = %(latitude)s AND theatres.longitude = %(longitude)s"""
        # Create a param dictionary
        param_dict = {"canonical_name": row["NAME"], "longitude": longitude, "latitude": latitude, "zip": row["ZIP"]}
        # Queries to check for matches
        name_match_sql = "SELECT * FROM theatres WHERE canonical_name = %(canonical_name)s"
        coord_match_sql = """SELECT * FROM theatres WHERE theatres.latitude = %(latitude)s
                             AND theatres.longitude = %(longitude)s"""
        # Check for a name match
        cursor.execute(name_match_sql, param_dict)
        # If the query returns no matches
        if (cursor.rowcount < 1):
            # Check for coordinates match
            cursor.execute(coord_match_sql, param_dict)
            # If those coordinates exist, update that field with the new name
            if (cursor.rowcount == 1):
                cursor.execute(update_theatre_sql, param_dict)
            # If the coordinates do not exist, insert new theatre into table
            else:
                cursor.execute(insert_theatre_sql, param_dict)

print("Theatre location data updates inserted into theatres table")

# Find all new/updated names and insert the names and ids into the theatre_alt_names table
find_new_alt_names_sql = """INSERT INTO theatre_alt_names (theatre_alt_names.theatre_id, theatre_alt_names.alt_name)
                            SELECT new_names.id, new_names.canonical_name
                            FROM (SELECT theatres.id, theatres.canonical_name
                                  FROM theatres
                                  WHERE theatres.canonical_name NOT IN
                                      (SELECT theatre_alt_names.alt_name
                                       FROM theatre_alt_names)
                                 ) as new_names"""
cursor.execute(find_new_alt_names_sql)
print("Theatre name updates applied to theatres and theatre_alt_names tables")
With the database now populated, it was time to pull the data back out for analysis. To this end, I constructed three SQL queries that combined the data from the database into new configurations. The results of these queries were then exported as CSV files to be loaded into the analysis tool.
The first data export for analysis combined data from the shows, theatres, performances, and rat_sightings tables. The SQL query for this dataset was created by joining two unique result tables as subqueries.
The first subquery, alias show_dates, lists every Broadway show in the dataset, its theatre, zip code, and the start and end dates of its entire run.
The second subquery, alias rats, counts the total number of rat sightings in each theatre zip on each day.
These two result tables were then joined on the rat sighting and theatre zip codes, and grouped to show the total number of rat sightings in each show's area over the full period it was in production. Finally, an additional column was added to calculate the average number of rat sightings per performance week for each show. This was done by calculating the total number of weeks the show ran (the difference between the start and end dates, plus one day to make the range inclusive, divided by seven), and then dividing the total number of nearby rat sightings by that number:
( SUM(daily_rats) / (DATEDIFF(DATE_ADD(show_dates.end_date, INTERVAL 1 DAY), show_dates.start_date) /7) )
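For example (with purely hypothetical numbers): a show whose run spans 2015-06-01 through 2015-08-30 covers 91 days, or 13 performance weeks, so 39 nearby rat sightings over that span would average out to 3 rats per week.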
The resulting table was exported as rats_during_show_run.csv
# Start by executing query to pull out rat sightings by show run
with connection.cursor() as cursor:
    rats_by_show_sql = """
        SELECT
            show_dates.show_name,
            show_dates.theatre_name,
            show_dates.start_date,
            show_dates.end_date,
            SUM(daily_rats) as total_rats_during_showrun,
            /* Calculate average rats per week as (total rats divided by ((end date + 1) - start date / seven)) */
            ( SUM(daily_rats) / (DATEDIFF(DATE_ADD(show_dates.end_date, INTERVAL 1 DAY), show_dates.start_date) / 7) ) as avg_rats_per_week
        /* subquery1 - Show Dates */
        FROM (
            SELECT
                shows.id,
                theatres.canonical_name as theatre_name,
                shows.name as show_name,
                DATE(MIN(performances.week_start)) as start_date,
                DATE(MAX(performances.week_end)) as end_date,
                theatres.zip
            FROM theatres
            JOIN shows
                ON theatres.id = shows.theatre_id
            JOIN performances
                ON shows.id = performances.show_id
            GROUP BY shows.id
        ) as show_dates
        /* end subquery1 ShowDates */
        /* join in daily rat table with subquery2 */
        JOIN (
            SELECT
                DATE(rat_sightings.date) as day,
                rat_sightings.zip,
                COUNT(DISTINCT rat_sightings.id) as daily_rats
            FROM rat_sightings
            WHERE rat_sightings.zip in (10023, 10036, 10019)
            GROUP BY day, rat_sightings.zip
        ) as rats
        /* end subquery2 Rats */
        ON show_dates.zip = rats.zip
        WHERE rats.day >= show_dates.start_date
            AND rats.day <= show_dates.end_date
        GROUP BY show_dates.show_name
        """
    cursor.execute(rats_by_show_sql)
    results = cursor.fetchall()

# Now write the results to a csv file
with open('rats_during_show_run.csv', 'w') as csvfile:
    column_names = ["show_name", "theatre_name", "start_date", "end_date", "total_rats_during_showrun", "avg_rats_per_week"]
    myCsvWriter = csv.DictWriter(csvfile, fieldnames=column_names)
    myCsvWriter.writeheader()
    for row in results:
        myCsvWriter.writerow(row)

print("Done writing rats_during_show_run.csv")
The second data export for analysis also combined data from the shows, theatres, performances, and rat_sightings tables. The SQL query for this dataset grouped all performance and rat sighting data by week, expressed as YEARWEEK (formatted: YYYYWW).
The query joins the theatres, shows, and performances tables on their key ids. The rat_sightings table is then brought in via a subquery that groups the sightings by week and zip code. This subquery is joined using a LEFT JOIN, to ensure that weeks with no match (i.e., no rat sightings in that area that week) are still included in the results, with a null value for rat_count.
The resulting table was exported as show_data_by_week.csv
# Execute query to pull show data by week/year
with connection.cursor() as cursor:
    show_data_by_week_sql = """
        SELECT
            /* Select and group all data by year and week */
            YEARWEEK(performances.week_start) as year_week,
            shows.name as show_name,
            shows.type as show_type,
            theatres.canonical_name as theatre,
            theatres.zip,
            DATE(performances.week_start) as performance_week,
            weekly_rats.rat_count,
            performances.percent_filled as percent_of_capacity,
            performances.tickets_sold,
            performances.total_revenue,
            performances.percent_max_revenue
        FROM shows
        JOIN performances
            ON shows.id = performances.show_id
        JOIN theatres
            ON shows.theatre_id = theatres.id
        /* Subquery to group all rat sightings by year/week, and zip */
        /* Left join so null rows (weeks with no rats) aren't excluded */
        LEFT JOIN (
            SELECT
                YEARWEEK(rat_sightings.date) as year_week,
                rat_sightings.zip,
                COUNT(DISTINCT rat_sightings.id) as rat_count
            FROM rat_sightings
            WHERE rat_sightings.zip in (10023, 10036, 10019)
            GROUP BY year_week, rat_sightings.zip
        ) as weekly_rats
        /* End Subquery */
        ON YEARWEEK(performances.week_start) = weekly_rats.year_week
            AND theatres.zip = weekly_rats.zip
        GROUP BY YEARWEEK(performances.week_start), shows.name
        ORDER BY show_name, year_week
        """
    cursor.execute(show_data_by_week_sql)

    # Now write the results to a csv file
    with open('show_data_by_week.csv', 'w') as csvfile:
        column_names = ["year_week", "show_name", "show_type", "theatre", "zip", "performance_week",
                        "rat_count", "percent_of_capacity", "tickets_sold", "total_revenue", "percent_max_revenue"]
        myCsvWriter = csv.DictWriter(csvfile, fieldnames=column_names)
        myCsvWriter.writeheader()
        for row in cursor:
            myCsvWriter.writerow(row)

print("Done writing show_data_by_week.csv")
The final data export for analysis was a list of the latitude and longitude coordinates for all theatres and rat sightings, to be used in creating a map visualization.
This was accomplished by creating two queries: one pulling the names and coordinates of all theatres in the theatres table that are associated with a show:
SELECT
theatres.canonical_name as name,
'theatre' as loc_type,
theatres.latitude,
theatres.longitude
FROM theatres, shows
WHERE theatres.id = shows.theatre_id
GROUP BY theatres.canonical_name
And one pulling the coordinates for all rat sightings in zip codes associated with those same theatres:
SELECT
'rat' as name,
'rat_sighting' as loc_type,
rat_sightings.latitude,
rat_sightings.longitude
FROM rat_sightings
WHERE rat_sightings.zip in (10023, 10036, 10019)
Both results tables included SELECT fields populated with text indicating the name and type for each row of coordinates, and the column names were set to match exactly. These two result tables were then appended together into a single results table using "UNION ALL".
The resulting table was exported as all_locations.csv
# Execute query to pull all coordinate data for theatres and rat sightings in relevant areas
with connection.cursor() as cursor:
    all_locations_sql = """
        SELECT
            theatres.canonical_name as name,
            'theatre' as loc_type,
            theatres.latitude,
            theatres.longitude
        FROM theatres, shows
        WHERE theatres.id = shows.theatre_id
        GROUP BY theatres.canonical_name
        UNION ALL
        SELECT
            'rat' as name,
            'rat_sighting' as loc_type,
            rat_sightings.latitude,
            rat_sightings.longitude
        FROM rat_sightings
        WHERE rat_sightings.zip in (10023, 10036, 10019)
        """
    cursor.execute(all_locations_sql)

    # Now write the results to a csv file
    with open('all_locations.csv', 'w') as csvfile:
        column_names = ["name", "loc_type", "latitude", "longitude"]
        myCsvWriter = csv.DictWriter(csvfile, fieldnames=column_names)
        myCsvWriter.writeheader()
        for row in cursor:
            myCsvWriter.writerow(row)

print("Done writing all_locations.csv")
So were rats more afraid of Cats or a Woolf? Did they congregate around Big Fish, or Mice and Men? And to which odorous play were rats most attracted?
As it turns out, the play with the largest total rat sightings was none other than The Lion King, with nearly 450 rats reported nearby while it was showing. Perhaps the rats wished to greet the newborn future King of Beasts?
Rats were indeed curious to see the Curious Incident of the Dog in the Night-Time, and many more were attracted to Something Rotten! than deigned to nibble on Superior Donuts. But, as might be expected, rats wouldn't go anywhere near Cats.
Of course, these numbers don't adjust for the length of a show's production. When viewed by average weekly rat appearances, it appears that rats made their best laid plans to visit Of Mice and Men. And while they might not disdain a Fish in the Dark, rats were most interested to investigate A Raisin in the Sun.
If you don't like rats and were looking for places to avoid, you might be interested to know which are the rattiest theatres on Broadway.
Yet, though one might suppose theater-goers would avoid those venues with the most rats in residence, the data indicates that at least a few of the most rat-riddled theatres also brought in the highest total revenue during the time period analyzed.
Perhaps tourists view rats as just one more part of the New York Experience?
And finally, because a picture is worth a thousand words (and also possibly to put you off your dinner), here is a map of all rat sightings in New York's theater district from 2010 to 2016. The green dots indicate Broadway Theatres, and rat sightings are shown as red dots with an overlaid heat map to indicate density.
I experimented with several different analysis tools before settling on Tableau. I looked at a few options for creating plot maps with Python and OpenStreetMaps, and I'm familiar with creating charts in Excel, but Tableau offered the option to create all of my analysis displays in one interface and maintain a consistent look throughout.
Tableau itself is not the most intuitive interface, but it is in wide enough use that any difficulty I ran into was easily resolved by searching for similar problems and their solutions online. In this way, I was able to find how to create plot maps using latitude and longitude coordinates, and how to create dual-axis charts. The rest I was able to accomplish through exploration and experimentation.
The challenge I'm most proud of resolving was the correction of the title case errors in the Broadway show titles dataset. It's the part I struggled with the longest, though it was far from the most critical. I likely could have let it go, or at least have left it at being mostly correct, but the intersection of coder and librarian hit me just so and I was determined to make it right.
To summarize the problem: the show titles in the Broadway Shows dataset came in with some weird formatting. Specifically, it appeared that the creators had attempted to impose some form of Title Case on the names, but didn't make exceptions for articles, letters that come after apostrophes, and the like. So the play titles appeared as:
A Gentleman'S Guide To Love And Murder
Don'T Dress For Dinner
No Man'S Land/Waiting For Godot
Rain: A Tribute To The Beatles On Broadway
Spider-Man Turn Off The Dark
The Curious Incident Of The Dog In The Night-Time
The Motherf**Ker With The Hat
In my quest to correct this, I started out by doing some research and came across the article Titlecasing a string with exceptions on Stack Overflow, which described how to fix just this sort of formatting problem by running a function that capitalizes the first and all subsequent words in a string, but makes exceptions for articles. The code was more complex than I could have written myself, but with a little more research I was able to parse out exactly what the code did, add some articles that were missing from the original code, and then apply it to the show titles from the file. And it worked perfectly! ....except for a few things:
A Gentleman's Guide to Love and Murder
Don't Dress for Dinner
No Man's Land/waiting For Godot
Rain: a Tribute To the Beatles On Broadway
Spider-man Turn Off the Dark
The Curious Incident of the Dog in the Night-time
The Motherf**ker With The Hat
The second half of each hyphenated name was now lowercase, as was the second half of each title divided by a slash. The problem was that the code split the titles into words using space as the delimiter, then joined them back together with spaces. I could add hyphens or slashes as additional delimiters, but I didn't know how to add them back in at the end. And on top of that, articles that came after a colon were now lowercase, since the only case in which the function capitalizes an article is when it is the first word in the string.
So I returned to the drawing board and, after some more experimentation, I came up with a regular expression that separated out a text string into three parts: all characters up to the indicated character (hyphen, slash, colon+space), the single character after that character, and then all characters after that. In the case of hyphens, it looked like:
hyphenCheck = re.search('(.+)-(.)(.+)$', uppercaseTitle)
After creating this check, I set a condition to only run a show title through this expression if a hyphen was found in the name (hyphenCheck != None) and if the character after the hyphen was not a space (hyphenCheck.group(2) != ' '), so that only hyphenated names, and not titles with a hyphen surrounded by spaces, would be affected. Once this was done, I repeated the same process for forward slashes, and then again for colons (minus the space check).
In the process of fixing these errors, I noticed that show names containing roman numerals were also not coming out correctly: they displayed with only the first letter capitalized, as in the case of Twelfth Night/Richard Iii. To fix this, I added a second exception list to the title-casing code for roman numerals. Similar to the exception that keeps articles from being capitalized, this check looks for multi-character roman numerals and sets them entirely to uppercase.
So finally, after much research, experimentation, and troubleshooting, I was able to successfully print out a list of all properly formatted show names. It was finished!
.....except....
Now it became necessary to figure out how to insert all of this into my already-created code for importing only unique show names into the Shows table. Figuring out how to merge the two proved the most difficult part of the entire process. But finally I hit upon turning the list of corrected show names into a Python dict, with the original name as the key (to use as a lookup), and the corrected name as the value. I hit a couple of snags figuring out how to do this (it turned out I needed to define the empty dict outside of the for loop, not inside), but once I got it working it was simple enough to write a variable into my insert query that pulled the corrected show name from the dictionary based on the original show name from the source data.
And of course, because there's always one last thing, I tried restarting and running the kernel and ended up getting an error on the Performances table import. The query to find the show id from the shows table was based on a match with the show name from the file. Which was now different. So that required some tweaking as well.
Was the resulting beautifully formatted show name list worth the blood, sweat and tears that went into automating the cleanup process? Well in real life, for a static one-time import, it would likely have been more efficient to simply clean up the data manually. But if this were a database with continual updates, from a source that always contained these sorts of errors, having the cleanup process automated in this way could very well be invaluable.