About the Data

Original Data Set

This dataset contains information on more than 5,000 movies listed on IMDb. The data was scraped from the IMDb website by user chuansun76, and the dataset is available on Kaggle, which is where I found it. The original dataset contained 28 columns, but I chose to focus on only a few: gross, movie title, duration, budget, genre, Facebook likes for a movie, content rating, and, of course, the IMDb rating.

Data Processing

To remove unused columns from my dataset, I had to do some data processing. Most of it was done in Trifacta Wrangler, where I removed unwanted columns and dropped rows with empty gross or budget values; the rest was done in Excel. I normalized all gross and budget values to 2017 dollars, using inflation rates found here, so that movies from different years could be compared fairly. To do this, I looked up the inflation rate for each release year and applied that rate to the budget and gross values of the movies released in that year.
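The normalization step can be sketched roughly as follows. This is a minimal Python sketch rather than the actual Trifacta Wrangler and Excel workflow. The field names `title_year`, `budget`, and `gross` are assumed, the helper names are hypothetical, and the multipliers in `TO_2017_DOLLARS` are placeholder values for illustration, not real inflation figures.

```python
# Hypothetical multipliers that convert a dollar amount from a release year
# into 2017 dollars. These are placeholder values for illustration only,
# NOT real inflation figures.
TO_2017_DOLLARS = {2000: 1.5, 2010: 1.25, 2017: 1.0}


def adjust_to_2017(amount, year, factors=TO_2017_DOLLARS):
    """Scale a dollar amount from `year` into 2017 dollars."""
    return amount * factors[year]


def normalize_movie(movie, factors=TO_2017_DOLLARS):
    """Return a copy of a movie record with budget and gross adjusted.

    The field names (title_year, budget, gross) are assumptions about
    how the cleaned data is laid out.
    """
    adjusted = dict(movie)
    for field in ("budget", "gross"):
        adjusted[field] = adjust_to_2017(movie[field], movie["title_year"], factors)
    return adjusted


if __name__ == "__main__":
    example = {"movie_title": "Example", "title_year": 2000,
               "budget": 100, "gross": 200}
    print(normalize_movie(example))
```

Every movie from the same year is scaled by the same factor, so the ratio of gross to budget for a single movie is unchanged; only comparisons across years are affected.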

Motivation

I chose this dataset for a couple of reasons. First, I wanted to visualize something I found interesting. I enjoy watching movies, so being able to compare different movies and genres appealed to me. I didn't want to choose a dataset that would make this project feel like a project. Second, it was different from what we visualized in class. Many of the visualizations we made or saw in class were related to death or emergency calls, so I wanted to do something less depressing.

Visualizations

Parallel Coordinates

A Parallel Coordinates graph that maps various attributes of movies to show how they might affect IMDb scores. Interactivity includes filtering movies by Content Rating, brushing, tooltips that provide more information about a movie, and reorderable axes.

Zoomable Scatterplot

A Scatterplot that shows the average US gross for each genre, along with the average duration of movies in that genre. Because the points are dense, panning and zooming are enabled to give a better look at crowded areas. Tooltips provide more information about a specific genre, and filtering by Content Rating can be used to hide movies.

Sortable Bar Chart

A Bar Chart that shows the average gross for each movie content rating. It can be sorted by either of two metrics: the average gross per content rating or the average IMDb score per content rating. Tooltips show both values.
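The aggregation behind a chart like this can be sketched in a few lines. This is a minimal Python sketch, assuming each movie is a record with `content_rating`, `gross`, and `imdb_score` fields; the function names are hypothetical and this is not the code used for the visualization itself.

```python
from collections import defaultdict


def averages_by_rating(movies):
    """Return {content_rating: {"avg_gross": ..., "avg_score": ...}}."""
    grouped = defaultdict(list)
    for movie in movies:
        grouped[movie["content_rating"]].append(movie)
    return {
        rating: {
            "avg_gross": sum(m["gross"] for m in group) / len(group),
            "avg_score": sum(m["imdb_score"] for m in group) / len(group),
        }
        for rating, group in grouped.items()
    }


def sorted_bars(averages, metric):
    """Return (rating, stats) pairs ordered from highest to lowest `metric`."""
    return sorted(averages.items(), key=lambda item: item[1][metric], reverse=True)


if __name__ == "__main__":
    # Made-up records, only to show the shape of the input.
    movies = [
        {"content_rating": "G", "gross": 100, "imdb_score": 6.0},
        {"content_rating": "G", "gross": 300, "imdb_score": 8.0},
        {"content_rating": "R", "gross": 50, "imdb_score": 9.0},
    ]
    averages = averages_by_rating(movies)
    for rating, stats in sorted_bars(averages, "avg_gross"):
        print(rating, stats)
```

Keeping both averages in one record per rating means the same data can drive either sort order and the tooltip, so re-sorting only reorders the bars.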

Findings

Parallel Coordinates

By drawing lines for all of the movies in my dataset in a Parallel Coordinates visualization, I noticed some interesting trends for movies with certain content ratings. For instance, most movies had a larger gross than budget, but the size of the gross had little correlation with the IMDb score. Although the higher-grossing movies in this visualization have higher IMDb scores, on average the budget affects the score only minimally. What was really surprising was how movies of certain content ratings had IMDb scores that were all over the place. Before making this visualization, I thought that a high gross led to a high rating. However, after making it, it is clear that these attributes are not the only things that affect a movie's rating.

Zoomable Scatterplot

By visualizing the various genres found in my dataset and how much they gross on average, I found some pretty interesting things. The biggest cluster of movie genres has an average duration between 110 and 200 minutes and an average gross of up to $100 million. The tooltips show that most genres in this cluster have middling average IMDb scores. However, as average gross goes up, most genres' average IMDb scores increase as well. Another finding was that different content ratings had different average durations. Some were more spread out, while others were clustered. For instance, PG-13 movies were clustered around the 125-175 minute range, while R-rated movies had more of a spread, despite slight clustering in the 125-150 minute range.

Sortable Bar Chart

One thing you can instantly notice when looking at this Bar Chart is that movies with a G content rating have the highest average gross and the highest budget. It surprised me at first that R-rated movies grossed less on average than movies rated G, PG, and PG-13. However, it is only natural that those movies would gross more, since more people are able to see them. Another interesting thing is that if you sort by IMDb score, it is clear that the average gross has minimal effect on the scores. To me, this is the most interesting finding, since most of the people I know judge how good a movie is by how many people went to see it. Looking at the data, we can see that how much a movie with a certain rating makes has almost no correlation with its score.
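The "almost no correlation" reading above comes from looking at the charts. One possible way to put a number on it is the Pearson correlation coefficient between gross and IMDb score. The following is a minimal Python sketch of that check, offered as an assumption about how it could be done rather than as the method behind these findings; the function name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Returns a value between -1 and 1. Values near 0 indicate little linear
    relationship. Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up values, only to show how the function is called.
    gross = [10, 20, 30, 40]
    score = [7.0, 5.0, 5.0, 7.0]
    print(pearson(gross, score))
```

In practice, `xs` would be the inflation-adjusted gross values and `ys` the matching IMDb scores, either for the whole dataset or for the movies within one content rating.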