From historic blockbusters to inflation-adjusted earnings, Anton Goncharuk reveals the money-side of Hollywood.
Hey there! I'm Anton Goncharuk, Principal Analytics Engineer at HubSpot. I’ve been using dbt™ since early 2019, so when I discovered Paradime’s Movie Data Modeling Challenge, I couldn't resist the chance to participate!
If you aren't already familiar with Paradime’s Movie Data Modeling Challenge, check out this blog for a brief overview!
In this blog, I'll share insights about my challenge submission, including the journey of building my project, the movie insights I uncovered, and how I used Paradime to bring my project to life. Enjoy!
Every data professional knows that uncovering insights is just the tip of the iceberg. Before reaching that point, countless hours are spent brainstorming, building a project plan, overcoming data issues, and hitting dead ends. Here’s a quick summary of my project-building process:
Overall, this project was a blend of tackling technical challenges, learning new tools, and reinforcing the importance of clean, reliable data in analytics.
Below are some of the key insights I uncovered during the challenge. You can view my additional data insights and visualizations in my GitHub repo.
Gone with the Wind (1939) earned $402 million in box office revenue back then. Adjusted for inflation, that’s equivalent to $8.7 billion in 2024. This insight helps contextualize historical data within modern economic conditions, providing a clearer picture of a movie's true financial success.
Approach: I started by creating int_inflation_adjustments__yearly.sql to compute CPI ratios, allowing me to adjust financial figures for inflation. Next, I built int_tmdb_media.sql to consolidate and enrich TMDB movies. After that, I merged this enriched dataset with OMDB data in media.sql, prioritizing TMDB data for accuracy.
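To make the CPI-ratio idea concrete, here is a minimal Python sketch of an inflation adjustment. It is not the SQL from int_inflation_adjustments__yearly.sql. The function names and the CPI values are hypothetical placeholders chosen for easy arithmetic, so swap in a real CPI series before trusting any output.

```python
# Placeholder CPI values for illustration only. These are NOT real index data.
CPI_BY_YEAR = {1939: 10.0, 2000: 150.0, 2024: 300.0}


def cpi_ratio(from_year, to_year, cpi=CPI_BY_YEAR):
    """Return how many to_year dollars equal one from_year dollar."""
    # A missing year raises KeyError, which is safer than guessing a value.
    return cpi[to_year] / cpi[from_year]


def adjust_for_inflation(amount, from_year, to_year=2024, cpi=CPI_BY_YEAR):
    """Scale a dollar amount from from_year into to_year dollars."""
    return amount * cpi_ratio(from_year, to_year, cpi)


if __name__ == "__main__":
    # With the placeholder table, 100 dollars in 1939 scales by 300 / 10.
    print(adjust_for_inflation(100.0, 1939))
```

Computing the ratio once per year and joining it to each movie's release year keeps the adjustment logic in one place, which is presumably why it lives in its own intermediate model.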
Mel Blanc, known as "The Man of a Thousand Voices," is one of Hollywood's most prolific actors, with over a thousand screen credits. He created and performed nearly 400 distinct character voices, becoming renowned worldwide for his work in radio, television, cartoons, and movies.
Approach: I first processed IMDb principals data in stg_imdb__principals.sql to extract actor roles and characters. Then, I enriched this data with actor details using stg_imdb__names.sql, obtaining full names and notable titles. Finally, in crew.sql, I merged these datasets, ensuring unique actor-role combinations to accurately count appearances and determine the top actors.
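The deduplicate-then-count step can be sketched in a few lines of Python. This is an illustration of the idea rather than the contents of crew.sql. The tuple layout, the category values, and the function name are assumptions made for the example.

```python
from collections import Counter


def top_actors(principals, names, n=10):
    """Count unique (title, person, character) rows per actor.

    principals: iterable of (title_id, person_id, category, character)
    names: dict mapping person_id to a display name
    Both layouts are assumed for this sketch.
    """
    # A set drops exact duplicate actor-role rows before counting.
    unique_roles = {
        (title_id, person_id, character)
        for title_id, person_id, category, character in principals
        if category in ("actor", "actress")
    }
    counts = Counter(person_id for _, person_id, _ in unique_roles)
    # Fall back to the raw id when no name is available.
    return [(names.get(pid, pid), count) for pid, count in counts.most_common(n)]


if __name__ == "__main__":
    rows = [
        ("tt1", "nm1", "actor", "Role X"),
        ("tt1", "nm1", "actor", "Role X"),  # duplicate row, counted once
        ("tt1", "nm1", "actor", "Role Y"),
        ("tt2", "nm2", "actress", "Role Z"),
    ]
    print(top_actors(rows, {"nm1": "Actor A", "nm2": "Actor B"}))
```

Deduplicating before counting matters because a repeated row would otherwise inflate an actor's total.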
Steven Spielberg is a legendary figure in cinema, directing iconic films like "Jaws," "E.T. the Extra-Terrestrial," and "Jurassic Park." His films have grossed immensely over the years, making him one of the highest-grossing directors of all time.
Approach: I used the same approach as “Top 10 Most Appearing Actors of All Time”, then joined crew.sql with media.sql to identify the “Top 10 Highest-Grossing Directors of All Time.”
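The join between crew and media data can be sketched as follows. This is a hedged Python illustration, not the actual models. The field names (title_id, person_name, category) and the function name are hypothetical.

```python
from collections import defaultdict


def top_grossing_directors(crew, revenue_by_title, n=10):
    """Sum revenue per director and return the top n, highest first.

    crew: list of dicts with title_id, person_name, category (assumed layout)
    revenue_by_title: dict mapping title_id to box office revenue
    """
    totals = defaultdict(float)
    for row in crew:
        if row["category"] != "director":
            continue
        revenue = revenue_by_title.get(row["title_id"])
        if revenue is None:
            # Behaves like an inner join: skip titles with no revenue record.
            continue
        totals[row["person_name"]] += revenue
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    crew = [
        {"title_id": "tt1", "person_name": "Director A", "category": "director"},
        {"title_id": "tt2", "person_name": "Director A", "category": "director"},
        {"title_id": "tt3", "person_name": "Director B", "category": "director"},
    ]
    print(top_grossing_directors(crew, {"tt1": 100.0, "tt2": 50.0, "tt3": 120.0}))
```

Whether revenue is summed before or after inflation adjustment changes the ranking, so the choice is worth stating alongside any chart built from it.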
Although this doesn't directly relate to profit or return on investment (ROI), Warner Bros. Pictures appears to be a leader in the movie production industry based on gross box office revenue, the total number of movies produced, and the number of Oscars their movies have received.
Approach: I developed int_tmdb_media.sql to consolidate movie data from TMDB. Next, I used media.sql to adjust financial figures for inflation and produce a comprehensive, unified dataset of movie details.
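Once a unified movie dataset exists, comparing studios on the three measures mentioned above comes down to a group-by. The sketch below is an assumption about how such a rollup could look in Python. The record fields (studio, revenue, oscars) and the function name are hypothetical and are not taken from media.sql.

```python
def summarize_studios(movies):
    """Aggregate revenue, movie count, and Oscar wins per studio.

    movies: iterable of dicts with studio, revenue, oscars (assumed layout)
    """
    summary = {}
    for movie in movies:
        stats = summary.setdefault(
            movie["studio"], {"revenue": 0.0, "movies": 0, "oscars": 0}
        )
        stats["revenue"] += movie["revenue"]
        stats["movies"] += 1
        stats["oscars"] += movie["oscars"]
    return summary


if __name__ == "__main__":
    sample = [
        {"studio": "Studio A", "revenue": 200.0, "oscars": 1},
        {"studio": "Studio A", "revenue": 100.0, "oscars": 0},
        {"studio": "Studio B", "revenue": 250.0, "oscars": 2},
    ]
    for studio, stats in summarize_studios(sample).items():
        print(studio, stats)
```

Keeping the three measures side by side makes it clear when a studio leads on volume but not on revenue or awards.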
Paradime was instrumental in my project, offering several features that enhanced my workflow and overall project quality. Here are the three key features that stood out:
The Movie Data Modeling Challenge was super fun, balancing data infrastructure and BI presentation. Paradime's features, especially Lineage and CLI, were game-changers. I tackled inconsistent data and learned the importance of data governance.
Thanks to Paradime, Lightdash, and the community for this awesome experience. Excited for future challenges, I’ll definitely join again… and so should you!