Above is a visualization of 4800 movies (each dot is a movie). The x axis represents the year in which the movie was made and the y axis is the movie’s IMDb score (from 1 to 10). A black dot means that the movie, or at least one scene in it, is in black and white. A yellow dot means that the movie was made entirely with color film. This visualization shows the evolution of movies from black and white to color. It also suggests that black and white films have a higher average rating, or at least that they rarely score below a 6. Color films, on the other hand, get ratings as low as 1.6. Why do you think that is? Could it be that older films were made less frequently, so they tended to be of higher quality and more widely liked? Could it be that black and white movies are generally just better? Or are there just not enough black and white films in this data set to balance out the scores?
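One rough way to poke at those questions is to compare the two groups by count, average score, and lowest score. The sketch below is only an illustration in Python, not how the visualization was built: the rows are made-up sample data, and the field names `color` and `imdb_score` are assumptions about how the file is laid out.

```python
from statistics import mean

# Made-up sample rows for illustration only; the field names "color" and
# "imdb_score" are assumptions, not confirmed column names.
movies = [
    {"title": "Sample A", "color": "Black and White", "imdb_score": 8.1},
    {"title": "Sample B", "color": " Black and White", "imdb_score": 7.5},
    {"title": "Sample C", "color": "Color", "imdb_score": 6.4},
    {"title": "Sample D", "color": "Color", "imdb_score": 1.6},
    {"title": "Sample E", "color": "Color", "imdb_score": 7.0},
]


def summarize_by_color(rows):
    """Return {color label: (count, mean score, lowest score)}."""
    groups = {}
    for row in rows:
        # Strip stray whitespace so " Black and White" and "Black and White"
        # land in the same group.
        label = row["color"].strip()
        groups.setdefault(label, []).append(row["imdb_score"])
    return {
        label: (len(scores), round(mean(scores), 2), min(scores))
        for label, scores in groups.items()
    }


summary = summarize_by_color(movies)
for label, (count, average, lowest) in summary.items():
    print(f"{label}: n={count}, mean={average}, min={lowest}")
```

Looking at the count next to the average matters here: if one group is much smaller than the other, a higher average may say more about which films made it into the data set than about the films themselves.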
Let’s explore the data some more and see if we have any more questions!
Straight to the next visualization!
This data set is a collection of 5043 movies from the Internet Movie Database (IMDb), with each movie (row) having 28 feature variables. Each movie has information about the year it was made, the director, actors, earnings, genre, and much more. The data set comes from chuansun76, a user of the data science competition website Kaggle, who scraped all 5043 movies from IMDb.com himself.
To make the data usable for each of the visualizations, several rows with null values had to be removed. This was done separately for each visualization, using Tableau and Excel, so that each one retains as many data points as possible. In addition, for the hierarchical visualization, rows referencing actors had to be inserted to generate parent-child relationships. This insertion was done in Excel.
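The cleaning itself was done in Tableau and Excel, but the idea of removing nulls per visualization rather than once for the whole file can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sample rows are invented, and the field names (`title_year`, `imdb_score`, `gross`) and the helper `rows_for_visualization` are hypothetical.

```python
# Invented sample rows; None marks a missing value. Field names are assumptions.
movies = [
    {"title_year": 1999, "imdb_score": 7.2, "gross": 1_000_000},
    {"title_year": None, "imdb_score": 6.5, "gross": 2_500_000},
    {"title_year": 2005, "imdb_score": 8.0, "gross": None},
]


def rows_for_visualization(rows, required_fields):
    """Keep only the rows where every field this visualization needs is present."""
    return [
        row
        for row in rows
        if all(row.get(field) not in (None, "") for field in required_fields)
    ]


# Each visualization filters on just the fields it actually plots.
score_by_year = rows_for_visualization(movies, ["title_year", "imdb_score"])
score_by_gross = rows_for_visualization(movies, ["gross", "imdb_score"])

# Dropping every row with any missing value would keep far fewer rows.
complete_only = rows_for_visualization(
    movies, ["title_year", "imdb_score", "gross"]
)

print(len(score_by_year), len(score_by_gross), len(complete_only))
```

In this toy example each visualization keeps two of the three rows, while a single global clean-up would keep only one. That is the trade-off behind cleaning per visualization.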
Unfortunately, this data set is not perfect. The way the data was scraped from IMDb introduced some inaccuracies. For starters, we can see above that some films that we know are not in black and white are still labeled as such (like The Aviator). This is because IMDb records every version in which a film was released, and also notes when only a single scene is in black and white. The scraper unfortunately could not differentiate between these cases. A film’s country can refer both to where it was released in theaters and to where it was made, so there are often multiple countries listed on the movie’s IMDb page. However, the scraper only collected the first listed country. The biggest issue is the gross earnings figures. IMDb does not have a live feed of every film’s gross earnings, so the figures it gives are from various times after each film’s release, which could skew some films’ earnings. More importantly, the earnings are sometimes posted in foreign currencies, or in both U.S. dollars and a foreign currency, and it is impossible to tell from the data set whether a given figure is in dollars, in some other currency, or out of date. Extreme outliers were removed, but there is no guarantee that the remaining figures are accurate. For the same reason, inflation could not be accounted for. There were also some ‘repeats’ where two films share the same name, and for some reason this usually resulted in several missing values for each of those two films.
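The ‘repeats’ are the easiest of these problems to check for directly. As a rough sketch (again in Python, with invented rows and the assumed field names `movie_title` and `title_year`), titles that appear on more than one row can be flagged for a manual look:

```python
from collections import Counter

# Invented sample rows; "movie_title" and "title_year" are assumed field names.
movies = [
    {"movie_title": "Example Remake", "title_year": 1960},
    {"movie_title": "Example Remake ", "title_year": 2005},
    {"movie_title": "Another Film", "title_year": 2010},
]


def repeated_titles(rows):
    """Return the normalized titles that appear on more than one row."""
    counts = Counter(row["movie_title"].strip().lower() for row in rows)
    return {title for title, n in counts.items() if n > 1}


print(repeated_titles(movies))
```

This only finds rows that share a name. It cannot tell a true duplicate from two different films with the same title, so the flagged rows still need to be inspected by hand.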
There’s a lot of data out there in the world! Why visualize IMDb movies? I ended up choosing this data set simply because I love movies and films. This past summer I tried to count how many films I’ve seen in my lifetime, and I stopped counting once I hit 1200 movies! That’s somewhat disturbing if you think about how much of my life that equates to. If every movie runs at least 90 minutes, that comes to 108,000 minutes, or 1,800 hours, or 75 days of my life. And that’s not including the countless times I have re-watched certain films. So, since I clearly love movies, as do most people, and have spent a chunk of my life watching them, I want to discover more about them. What or who makes a movie successful?