According to the 2018 Korean Film Council (KOFIC) closing report, the Korean domestic film industry grew steadily before 2013. In addition, the annual number of moviegoers has remained at about the same level for five years since exceeding 210 million, which is the world's highest number of movie viewings per capita (
A movie's box office is the primary component of its sales and has a huge influence on additional revenue. Therefore, it is important to predict the box office of films at a time of change in the film industry. However, judging whether a movie will be a hit is complicated, because the factors that attract audiences keep changing even after release. For this reason, many studies have been conducted to predict the success of movies.
In this paper, we consider a stacking model for predicting movie audiences. Stacking is an ensemble method that consists of two steps, Level 0 and Level 1. The first step is called Level 0, and any regression or classification algorithm can serve as a Level 0 model. Fine-tuned Level 0 models are used to make predictions of the target variable. The second step, called Level 1, ensembles the Level 0 algorithms by taking the predictions of the Level 0 models as independent variables for predicting the target variable. This reduces the biases of the values predicted at Level 0. The model that best ensembles the Level 0 predictions is selected as the Level 1 model. We train 9 models for Level 0 and experiment with 5 models for Level 1, after collecting and preprocessing independent variables known to be important for predicting movie audiences. The final stacking model outperforms the individual models in predicting the cumulative number of audiences on the test dataset.
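The two-step procedure above can be sketched with scikit-learn's `StackingRegressor`, which fits the Level 0 estimators with cross-validation and trains the Level 1 meta learner on their out-of-fold predictions. This is a minimal sketch on synthetic data, not the authors' exact pipeline:

```python
# Two-step stacking sketch: Level 0 base models + Level 1 meta learner.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# a small subset of the paper's Level 0 themes
level0 = [
    ("ridge", Ridge()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("svr", SVR()),
]
# cv=5 mirrors the paper's 5-fold cross-validation criterion; the
# final_estimator plays the role of the Level 1 model
stack = StackingRegressor(estimators=level0, final_estimator=Ridge(), cv=5)
stack.fit(X, y)
preds = stack.predict(X)
```

Internally, `cv=5` ensures the meta learner sees only out-of-fold Level 0 predictions during training, which is the key guard against overfitting discussed later.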
This paper is organized into 5 sections. Section 2 describes the detailed process of collecting and preprocessing the explanatory variables, with some graphs. Section 3 explains the process of stacking and fitting, with the hyperparameter tuning of the regression models at Level 0, and its results. Section 4 explains the Level 1 models with hyperparameter tuning and variable selection, and provides the predictions of the test dataset by the final stacking model. Section 5 concludes and describes some future challenges.
The data considered in this paper are the daily box office data of the Korea Film Council and movie comment data from Daum Movie (www.movie.daum.net). The films considered were released between July 2017 and June 2019. The box office data include each film's general information, such as nationality, director, actors, genre, release date, cumulative audiences, and distributor. The movie comment data are collected for one week from the release date of each movie. The title of the movie, the dates of the comments, the number of comments per day, and the ratings of the comments are collected by crawling with the Python libraries Selenium and Beautiful Soup (
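As a rough illustration of the comment-collection step, the snippet below parses a comment list with Beautiful Soup. The HTML structure and class names are hypothetical stand-ins, not the actual Daum Movie markup; in practice Selenium would supply the rendered page source:

```python
# Hypothetical comment markup; real crawling would fetch this via Selenium.
from bs4 import BeautifulSoup

page_source = """
<ul>
  <li class="comment"><span class="score">9</span>
      <span class="date">2019.06.01</span><p>Great movie</p></li>
  <li class="comment"><span class="score">7</span>
      <span class="date">2019.06.02</span><p>Worth watching</p></li>
</ul>
"""
soup = BeautifulSoup(page_source, "html.parser")
comments = [
    {"score": int(li.select_one("span.score").text),
     "date": li.select_one("span.date").text}
    for li in soup.select("li.comment")
]
# aggregate into the weekly rating feature described in Section 2
weekly_mean_score = sum(c["score"] for c in comments) / len(comments)
```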
In addition, we restrict attention to commercial films with cumulative audiences of 150,000 or more. Movies with cumulative audiences below 150,000 account for only 3.2% of the total cumulative sales of movies released in the two years between September 2017 and June 2019, as shown in
The prediction of the number of movie audiences should be based on independent variables that play an important role in the success of films. The collected independent variables are nationality, grade, release date, release day of the week, average number of audiences by distributor, a cluster of distributors, average number of audiences by director, a cluster of directors, average number of audiences by genre, a cluster of genres, weekly rating by web commenters, weekly number of comments, and the cumulative number of audiences in the first week after release. Among them, the average number of audiences by distributor, the average number of audiences of the last 5 movies by director, the average number of audiences by genre, the number of audiences during the first week, and the final number of total audiences are log-transformed in order to stabilize their distributions. Histograms of the cumulative number of audiences before and after the log transformation are shown in
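A minimal numpy illustration of why the log transformation helps with heavily right-skewed audience counts (the values here are made up):

```python
import numpy as np

# hypothetical cumulative audience counts spanning two orders of magnitude
audiences = np.array([150_000, 300_000, 1_200_000, 14_000_000], dtype=float)
log_audiences = np.log(audiences)

# the raw counts differ by a factor of ~93; on the log scale the
# largest value is only ~1.4 times the smallest
raw_ratio = audiences.max() / audiences.min()
log_ratio = log_audiences.max() / log_audiences.min()
```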
When the genre, distributor, or director variable has two or more values for a film in the box office data of the Korea Film Council, the first value is set as the representative value. The distributor, director, and genre variables are then reduced to clusters, respectively, through
In the case of multiple values in the nationality variable, as in the case of distributor and director, the first value is set as the representative value based on the film registration information in the Korea Film Council. Korea, which has the highest frequency, is assigned to group 1, the United States to group 2, and the others to group 3.
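The two preprocessing rules above can be sketched as follows; the function names and the comma-separated input format are our own assumptions, not the paper's code:

```python
def representative(values: str) -> str:
    """Take the first listed value as the representative one (assumed
    comma-separated, as in 'United States, United Kingdom')."""
    return values.split(",")[0].strip()

def nationality_group(nationality: str) -> int:
    """Korea -> group 1, United States -> group 2, all others -> group 3."""
    return {"Korea": 1, "United States": 2}.get(nationality, 3)

group = nationality_group(representative("United States, United Kingdom"))
```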
It is known that there is a big difference in the success of films according to the day of the week and the month of release. Also, most films are released mainly on Tuesday, Wednesday, or Thursday in order to maximize first-weekend attendance (
The release-month variable reflects the fact that the average number of audiences varies by month due to the characteristics of the movie market. Assuming that weather affects movie sales, the release-month variable was grouped based on the average total audiences per month. In
In terms of the movie rating, 15+ rated movies are assigned to group 1, 12+ rated movies to group 2, all-ages movies to group 3, and 19+ rated movies to group 4, by their frequencies, as in
It is known that the success of a movie is highly correlated with its success during the first week as well as with word of mouth (
Stacking (also called stacked generalization; the terms will be used interchangeably) was first proposed in 1992 by David H. Wolpert in the journal
It is known that fine hyperparameter tuning of the Level 1 model yields higher performance than simply selecting the single best model. However, the best ensemble of the predictions from the Level 0 models is somewhat unclear because the stacking model structure is complicated. Therefore, it is important to tune the hyperparameters of each level carefully and accurately.
Additionally, one of the most important things to note about stacking is to prevent overfitting (
Assume that
where
where
While keeping in mind that the model index in Level 0 runs from 1 to
where
When training Level 1 models, we need to find an algorithm that can ensemble this
and the RMSE in the first fold is calculated by
where
We train several regression models for Level 0 and Level 1, and select as the final meta learner the Level 1 model that has the best subset of variables
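The fold-wise RMSE criterion can be sketched with numpy alone; the helper below is our own illustration with a toy ordinary-least-squares learner, not the paper's code:

```python
import numpy as np

def kfold_rmse(fit, X, y, k=5, seed=0):
    """Average RMSE over k folds; `fit(Xtr, ytr)` returns a predict function."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    rmses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        predict = fit(X[train], y[train])
        resid = y[test] - predict(X[test])
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))

def ols_fit(Xtr, ytr):
    # ordinary least squares with an intercept, solved via lstsq
    A = np.c_[np.ones(len(Xtr)), Xtr]
    beta, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return lambda Xte: np.c_[np.ones(len(Xte)), Xte] @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
cv_rmse = kfold_rmse(ols_fit, X, y, k=5)
```

Each candidate Level 1 model would be scored with such a criterion, and the one with the smallest cross-validated RMSE selected.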
Now, we briefly describe the nine models for Level 0 and the five models for Level 1, and show the process of hyperparameter tuning and variable selection. The nine Level 0 models come from the four most common themes in regression problems: regression models with penalty terms (Ridge, Lasso, and Elastic Net), distance-based algorithms (KNN regression and Support Vector Regression (SVR)), tree-based ensemble models (Extra Tree and Random Forest), and boosting methods (AdaBoost.R2 and Gradient Boosting). In Level 1, Ridge, Random Forest, AdaBoost.R2, Gradient Boosting, and an Artificial Neural Network (ANN) are used. The final stacking model is then obtained after all hyperparameter tuning and variable selection under the criterion of 5-fold cross-validation RMSE.
Unlike the traditional Ordinary Least Squares (OLS) estimator, the ridge regression estimator is biased. The vector of regression coefficients
where
Lasso regression is similar but somewhat different. Lasso is an abbreviation for least absolute shrinkage and selection operator, and its coefficients are obtained by adding the L1-norm penalty, unlike ridge regression, as follows:
There is also the Elastic Net, which combines the ridge and lasso regression methods.
where
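For illustration, the three penalized regressions can be fitted with scikit-learn on synthetic data; the penalty strengths echo the tuned values reported in the Level 0 results:

```python
# Ridge (L2), Lasso (L1), and Elastic Net (L1 + L2 mix) on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=0)

ridge = Ridge(alpha=0.105).fit(X, y)                     # L2 penalty
lasso = Lasso(alpha=0.02).fit(X, y)                      # L1 penalty
enet = ElasticNet(alpha=0.18, l1_ratio=0.728).fit(X, y)  # combined penalty
```

Unlike ridge, the lasso can shrink coefficients exactly to zero, which is why it also acts as a variable selector.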
KNN is a kind of supervised learning and a very intuitive, easy algorithm for regression and classification. KNN calculates the predicted value according to the similarity of
SVM, like KNN, is a type of supervised learning that can be used for regression and classification (
where
Applying the SVM with this feature to a regression problem, we consider non-negative slack variables
subject to
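A minimal SVR sketch on a toy nonlinear signal (our own synthetic example, not the paper's data): residuals inside the epsilon-insensitive tube incur no loss, and C controls the trade-off against the slack variables:

```python
# epsilon-insensitive Support Vector Regression on a noisy sine curve.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=150)

# errors smaller than epsilon=0.1 are ignored; larger C penalizes slack more
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)
mae = float(np.mean(np.abs(svr.predict(X) - y)))
```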
Both the Extra Tree and the Random Forest are ensemble models derived from tree methods. Random Forest was first proposed by
The Extra Tree is an extremely randomized tree method that ensembles unpruned regression or decision trees. It is similar to Random Forest in that it extracts
In this study, hyperparameter tuning is performed through randomized search with 5-fold cross-validation over combinations of n_estimators, the number of trees to ensemble, and max_depth, the maximum depth of each tree. The Python classes sklearn.ensemble.ExtraTreesRegressor and sklearn.ensemble.RandomForestRegressor were used for Extra Tree and Random Forest, respectively.
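This randomized search can be sketched with scikit-learn as follows; the data are synthetic and the candidate grids are shortened for brevity relative to the regions reported in the results:

```python
# Randomized search with 5-fold CV over n_estimators and max_depth.
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_distributions={"n_estimators": [10, 50, 100, 150],
                         "max_depth": [1, 2, 3, 4, 5]},
    n_iter=5,                                   # random draws from the grid
    cv=5,                                       # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",      # RMSE criterion
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```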
AdaBoost.R2 (
where
for the linear, square, and exponential losses, respectively,
Through
where
Gradient Boosting regression proceeds in a forward stage-wise fashion, similar to AdaBoost.R2, and its learning process reduces the expected value of a given loss function. When the loss function is at the
where
The hyperparameters to tune here are the number of iterations and the size and learning rate of the regression tree that serves as the base learner. Hyperparameter tuning matters for predicting unseen data because these three choices must be selected properly to avoid overfitting the training set. In this study, Python is used, and the weak learner is CART, the default base learner for both AdaBoost.R2 and Gradient Boosting. The Python classes sklearn.ensemble.AdaBoostRegressor and sklearn.ensemble.GradientBoostingRegressor are used for AdaBoost.R2 and Gradient Boosting, respectively.
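A minimal sketch of the two boosting regressors with CART base learners; the data are synthetic, the tuned values echo those reported in the Level 0 results, and the gradient boosting n_estimators is reduced from the tuned 537 for brevity:

```python
# AdaBoost.R2 and Gradient Boosting with CART (decision tree) base learners.
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

# `loss` selects the linear, square, or exponential loss of AdaBoost.R2
ada = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=3),
    n_estimators=19, learning_rate=0.33, loss="linear", random_state=0,
).fit(X, y)

gb = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.12, max_depth=5, random_state=0,
).fit(X, y)
```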
ANN is a model inspired by human neurons and consists of an input layer, hidden layers, and an output layer. The layers are connected by weights; each layer receives the output of the previous layer and passes its own output, transformed by an activation function, to the next layer. In the learning process, the weights connecting the nodes are updated iteratively using backpropagation, so as to reduce a predefined loss function and move the output toward the target value. Artificial Neural Networks are widely used in various fields because of their high explanatory power for nonlinear relationships between independent and dependent variables.
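The paper builds its ANN with Keras.Sequential; as a dependency-free illustration of the forward pass and backpropagation just described, here is a one-hidden-layer network with tanh activation trained on a toy nonlinear target (all details are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5])) ** 2      # nonlinear relationship

W1 = rng.normal(0, 0.5, size=(3, 20)); b1 = np.zeros(20)   # hidden layer
W2 = rng.normal(0, 0.5, size=(20, 1)); b2 = np.zeros(1)    # output layer
lr = 0.05

def mse():
    hidden = np.tanh(X @ W1 + b1)
    return float(np.mean(((hidden @ W2 + b2).ravel() - y) ** 2))

initial_loss = mse()
for _ in range(500):
    # forward pass
    H = np.tanh(X @ W1 + b1)
    pred = (H @ W2 + b2).ravel()
    # backward pass: gradients of the mean squared error
    g_pred = 2.0 * (pred - y)[:, None] / len(y)
    gW2, gb2 = H.T @ g_pred, g_pred.sum(axis=0)
    gH = (g_pred @ W2.T) * (1.0 - H ** 2)      # tanh'(z) = 1 - tanh(z)^2
    gW1, gb1 = X.T @ gH, gH.sum(axis=0)
    # gradient descent update
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
final_loss = mse()
```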
In this study, the number of hidden layers is fixed to 1, and hyperparameter tuning of the activation function, the number of nodes, and the optimizer is performed over all 2^{9} − 1 variable combinations to find the best subset of variables and its best combination of hyperparameters. To keep the ANN simple, we fixed a single hidden layer, leaving more complex architectures for further study. The library Keras.Sequential was used for building the ANN in this study. Likewise, the other 4 Level 1 models were trained with variable selection over all 2^{9} − 1 variable subsets, each with simultaneous hyperparameter tuning.
The purpose of this study is to predict, as of September 1, 2019, the cumulative audiences of films released between July 2017 and June 2019. When predicting the audience of each movie, the cumulative audiences were fixed as of August 1, 2019, because even after a movie finishes screening in big cities like Seoul, small theaters in the countryside may still screen it, and those audiences are also counted in the KOFIC data. The results of the Level 0 models are summarized in
Comparing all nine Level 0 models, Gradient Boosting performed best. SVR and Random Forest, which are known to give good results, also showed solid performance on this data. The worst-performing algorithm was KNN regression, with a train RMSE of 0.27092. Comparing the penalized regressions, Ridge and Lasso achieved train RMSEs of about 0.15465 and 0.16194, respectively, while the Elastic Net, which combines the two, performed worse than either. Comparing KNN regression and SVR, the two algorithms based on distances between data points, KNN, a kind of lazy learner, yields a larger train RMSE than SVR. The Extra Tree, an ensemble method that uses the complete learning sample and random split points with CART, performed better than Random Forest, which uses bootstrap resampling and optimal splits. In the variable importance estimates, the rating variable of each movie was the most important, accounting for 91% of the total importance. Among the boosting algorithms for regression, Gradient Boosting showed performance superior to AdaBoost.R2 on this data set.
In
As a result, the final selected stacking model and its test RMSE, computed on a test set never used in training Level 0 or Level 1, are shown in
In this paper, the collection and preprocessing of 13 independent variables are described for predicting the final cumulative audience of a film. To prevent overfitting, the model was constructed by estimating the generalization error via the RMSE of 5-fold cross-validation, starting from a train set separated for learning Level 0 and Level 1. For the stacking models trained in the two steps of Level 0 and Level 1, a wide range of hyperparameter tuning led to better performance than any single model. In conclusion, the predictions of the proposed stacking model are very close to the actual final cumulative audiences.
Measuring and predicting the cumulative audience is meaningful for understanding the Korean domestic movie market. This study is also meaningful in that it uses stacking, which has rarely been used for predicting movie audiences before. However, only 9 Level 0 models and 5 Level 1 models, with movies from a two-year window, were used in this paper. If more samples were collected and more models were trained at each level, the performance of stacking could improve. Building on these results, we expect further studies to improve the performance and bring further growth in understanding the movie industry.
Results of Level 0 models
Level 0 model | Hyperparameter region | Best hyperparameter | Level 0 train RMSE |
---|---|---|---|
Ridge regression | alpha : (0.001~1), 1,000 | alpha = 0.105 | 0.15465 |
Lasso regression | alpha : (0.001~1), 1,000 | alpha = 0.02 | 0.16194 |
Elastic net | alpha : (0.001~1), 1,000; l1_ratio : (0.001~1), 1,000 | alpha = 0.18; l1_ratio = 0.728 | 0.22114 |
KNN regression | | | 0.27092 |
SVR | kernel : (rbf, sigmoid, poly), 3; gamma : (1e-10~1e-1), 10 | kernel = sigmoid; gamma = 0.01 | 0.09567 |
Extra tree | n_estimators : (10~1,000), 991; max_depth : (1~5), 5 | n_estimators = 153; max_depth = 5 | 0.09955 |
Random forest | n_estimators : (10~1,000), 991; max_depth : (1~5), 5; min_samples_split : (1~5), 5 | n_estimators = 72; max_depth = 5; min_samples_split = 2 | 0.10078 |
AdaBoost.R2 | learning_rate : (0.01~1), 100; n_estimators : (10~1,000), 991 | learning_rate = 0.33; n_estimators = 19 | 0.14508 |
Gradient boosting | learning_rate : (0.01~1), 100; n_estimators : (10~1,000), 991; max_depth : (1~10), 10 | learning_rate = 0.12; n_estimators = 537; max_depth = 5 | 0.0003122 |
Results of Level 1 models
Level 1 model | Hyperparameter region | Best hyperparameter | Feature subset | RMSE |
---|---|---|---|---|
Ridge regression | alpha : (0.001~1), 1,000 | alpha = 0.352 | Gradient boosting | 0.009 |
Random forest | n_estimators : (10~1,000), 991; max_depth : (1~5), 5; min_samples_split : (1~5), 5 | n_estimators = 136; max_depth = 5; min_samples_split = 2 | Elastic net, KNN regression, Random forest, Ridge, SVR | 0.01125 |
AdaBoost.R2 | learning_rate : (0.01~1), 100; n_estimators : (10~1,000), 991 | learning_rate = 0.7; n_estimators = 96 | Lasso, Gradient boosting | 0.04018 |
Gradient boosting | learning_rate : (0.01~1), 100; n_estimators : (10~1,000), 991; max_depth : (1~10), 10 | learning_rate = 0.98; n_estimators = 30; max_depth = 5 | Elastic net, Lasso | 0.00022 |
ANN | n_nodes : (1~20), 20; optimizer : (sgd, rmsprop, adam), 3; activation : (tanh, sigmoid, relu), 3 | n_nodes = 20; optimizer = adam; activation = relu | Gradient boosting, Extra tree, SVR | 0.05278 |
Final stacking model
Level 0 model | Best hyperparameter | Level 1 model | Best hyperparameter | Feature selection | Test RMSE |
---|---|---|---|---|---|
Ridge regression | alpha = 0.105 | Gradient boosting | learning_rate = 0.98; n_estimators = 30; max_depth = 5 | Elastic net, Lasso | 0.00074 |
Lasso regression | alpha = 0.02 | | | | |
Elastic net | alpha = 0.018; l1_ratio = 0.728 | | | | |
KNN regression | | | | | |
SVR | kernel = sigmoid; C = 1100; gamma = 0.01 | | | | |
Extra tree | n_estimators = 153; max_depth = 5 | | | | |
Random forest | n_estimators = 72; max_depth = 5; min_samples_split = 2 | | | | |
AdaBoost.R2 | n_estimators = 19; learning_rate = 0.33 | | | | |
Gradient boosting | learning_rate = 0.12; n_estimators = 537; max_depth = 5 | | | | |
Model comparison
Model step | Model name | Test set RMSE |
---|---|---|
Level 0 | Ridge regression | 0.39864 |
Level 0 | Lasso regression | 0.39367 |
Level 0 | Elastic net | 0.36080 |
Level 0 | KNN regression | 0.34334 |
Level 0 | SVR | 0.51540 |
Level 0 | Extra tree | 0.31316 |
Level 0 | Random forest | 0.35079 |
Level 0 | AdaBoost.R2 | 0.38480 |
Level 0 | Gradient boosting | 0.18156 |
Final prediction of movie audiences in test set
Movie title | Predicted value | Target value | Movie title | Predicted value | Target value |
---|---|---|---|---|---|
Along with the Gods: The Two Worlds | 14410221.9 | 14410721 | Aladdin | 11159042.2 | 11157279 |
1987: When the Day Comes | 7201120.1 | 7201370 | The Battleship Island | 6591941.2 | 6592170 |
Captain Marvel | 5800868.7 | 5801070 | Jurassic World: Fallen Kingdom | 5661431.9 | 5661231 |
AQUAMAN | 5051016.9 | 5038143 | The Spy Gone North | 4951170.6 | 4951690 |
Kingsman: The Golden Circle | 4941728.2 | 4945486 | The Mummy | 3681479.5 | 3689290 |
Money | 3325048.8 | 3325311 | MEMOIR OF A MURDERER | 2223232.3 | 2223094 |
Innocent Witness | 2533246.1 | 2533334 | Detective K: Secret of the Living Dead | 2434349.2 | 2437146 |
Fantastic Beasts: The Crimes of Grindelwald | 2413978.2 | 2414062 | Toy Story 4 | 3328215.3 | 3328229 |
FENGSHUI | 2079253.8 | 2079326 | Hit-and-Run Squad | 1828211.5 | 1826256 |
Justice League | 1788108.7 | 1786386 | Ralph Breaks the Internet | 1760587.2 | 1758891 |
Door Lock | 1557897.5 | 1559945 | Little Forest | 1488607.7 | 1485408 |
Forgotten | 1387017.2 | 1387011 | Golden Slumber | 1381317.2 | 1382358 |
Happy Death Day | 1380505.5 | 1382650 | Ocean’s 8 | 1331864.0 | 1331858 |
The Vanished | 1315740.9 | 1315735 | The Mimic | 1306443.8 | 1306438 |
Birthday | 1184056.6 | 1183530 | What a Man Wants | 1194134.8 | 1194229 |
Hello Carbot the Movie: The Cretaceous Period | 864337.8 | 864757 | Dark Phoenix | 851338.6 | 850960 |
Men in Black: International | 814188.7 | 813794 | LOVE+SLING | 766110.0 | 766104 |
47 Meters Down | 575115.0 | 575115 | Insidious: The Last Key | 551426.8 | 551953 |
The Grinch | 546562.0 | 546553 | Tomb Raider | 534041.7 | 534211 |
Mary and the Witch’s Flower | 536270.5 | 535448 | Seven Years Night | 525303.4 | 526078 |
The Shape of Water | 493973.7 | 494097 | Cars 3 | 466728.8 | 466047 |
Detective Conan: Crimson Love Letter | 448508.1 | 448915 | THE SOUL-MATE | 446831.8 | 446717 |
Fall in Love at First Kiss | 369075.4 | 369230 | Theater Version Dinosaur Mecard: The Island of Tinysaurs | 381776.1 | 381973 |
Loving Vincent | 401736.5 | 401550 | Peter Rabbit | 379874.8 | 379748 |
MAN OF WILL | 357620.4 | 357874 | Godzilla: King of the Monsters | 353017.8 | 353012 |
Crayon Shin-chan: Burst Serving! Kung Fu Boys | 347852.5 | 347823 | Dumbo | 340152.3 | 340411 |
Paddington 2 | 339094.6 | 339014 | Truth or Dare | 310954.8 | 310962 |
Cinderella and Secret Prince | 284648.0 | 284763 | Paul, Apostle of Christ | 160308.0 | 160277 |
Wonder | 179840.9 | 179806 | Criminal Conspiracy | 260566.3 | 260512 |
12 Strong | 223765.5 | 223740 | Fifty Shades Freed | 217744.1 | 217785 |
The Whispering | 217534.2 | 217455 | The House with a Clock in its Walls | 214558.8 | 214452 |
The Hurricane Heist | 214966.3 | 214949 | American Assassin | 205430.2 | 205544 |
The Curse of La Llorona | 202683.6 | 202756 | Namiya | 181920.3 | 181885 |
Upgrade | 190429.0 | 190387 |