Tymoteusz Doligalski, Emilia Tomczyk, Nowcasting New Car Registrations with Google Search Data and Car Manufacturers’ Website Traffic, paper accepted at the 6th EMAC Regional Conference, Vienna 2015.
- Download the paper as pdf from SSRN: Nowcasting New Car Registrations with Google Search Data and Car Manufacturers’ Website Traffic
- See also: Predicting New Car Registrations: Nowcasting with Google Search and Macroeconomic Data
Abstract: The purpose of this paper is an attempt to nowcast (here: to predict in a short time horizon) new car registrations in Poland based on data of Google search queries and website traffic of car manufacturers. The study covers 47 monthly observations for six automotive makes. The strongest explanatory power is exhibited by the autoregressive component (number of registrations lagged one month), followed by the number of search queries. The website traffic of car manufacturers significantly influences the number of registrations in two out of six cases.
Keywords: nowcasting, prediction, car, automotive, Internet, search, Google, website traffic, Poland, CEE
Nowcasting is defined as “the prediction of the present, the very near future and the very recent past” (Bańbura, Giannone, Modugno, and Reichlin, 2013, p. 4). The reason for nowcasting is the delay in availability of data, which in a modern economy can amount to weeks or even months. Other circumstances that increase the practical value of nowcasting are macroeconomic turbulence, great uncertainty and unique shocks, as they cause past values to lose their predictive power (Schmidt & Vosen, 2009). As Bańbura et al. state, nowcasting is based on ‘exploitation of information which is published early and possibly at higher frequencies than the target variable of interest in order to obtain an ‘early estimate’ before the official figure becomes available’ (2013, p. 4).
According to Choi and Varian (2011), nowadays there are several sources of data on real time economic activities which may help in predicting the present (as opposed to predicting the future). Possibly the most often used is data from Google Trends presenting number and location of chosen search queries. The data may be used to identify the current changes in unemployment, private consumption or – beyond the economic activities – development of infectious diseases (Choi & Varian, 2011). The other sources of information are parcel shipment companies or credit card operators, as they possess precise real-time data on transactions in certain locations.
There exists another source of data which can be used in nowcasting: data on website traffic. Usually these data are fragmented. A website owner has thorough knowledge of his or her own traffic, but does not know the traffic of other websites. There exist, however, research entities that provide data on the traffic of various websites in a certain category. Megapanel PBI/Gemius is such a research program. It monitors the online behaviour of Polish internet users, thus providing monthly data on traffic on the most popular websites in Poland. This kind of data meets the above-mentioned requirements of nowcasting.
This paper attempts to nowcast new car registrations in Poland based on data on Google search queries and the website traffic of car manufacturers. The usefulness of web search data in predicting the behaviour of economic variables has already been noted in the literature (Askitas & Zimmermann, 2009; Choi & Varian, 2011; Li, Peng, Chen, & Bao, 2013). A few publications on nowcasting concern automobile markets (Choi & Varian, 2011; Sun, Li, Li, & Zhang, 2013; Carrière-Swallow & Labbé, 2013). Their common approach is the use of search data as the independent variable. Data on car manufacturers’ website traffic, to the best of our knowledge, have not yet served as a predictor of car sales or registrations.
Description of data
Our sample covers 47 monthly observations from January 2011 to November 2014. Monthly data on new registrations of passenger cars are provided by the Polish Association of Automotive Industry (PAAI) on the basis of the Central Register of Vehicles database administered by the Ministry of the Interior. A one-month lag is allowed by Polish law between the purchase of a private car and its registration. Current data on number of registrations becomes available on the PAAI webpage around the 5th day of the next month. First registrations of makes of passenger cars included in the empirical analysis (that is, Fiat, Opel, Peugeot, Renault, Skoda, and Toyota) are coded with variables starting with the letter R; for example, R_fiat stands for the number of Fiat cars first registered in a given month. Seasonal effects are expected: so-called summer inertia, that is, lower numbers of first registrations in the summer months (June, July and August), and higher end-of-year sales in the winter months (November, December and January).
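The seasonal indicators described above can be sketched as follows. This is a minimal illustration, assuming the summer and winter effects enter the models as 0/1 dummy variables over the 47-month sample; the function names are hypothetical and not taken from the paper.

```python
from datetime import date

SUMMER_MONTHS = {6, 7, 8}    # June, July, August
WINTER_MONTHS = {11, 12, 1}  # November, December, January


def month_range(start, end):
    """Yield the first day of each month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def seasonal_dummies(months):
    """Return (summer, winter) lists of 0/1 indicators, one per month."""
    summer = [1 if m.month in SUMMER_MONTHS else 0 for m in months]
    winter = [1 if m.month in WINTER_MONTHS else 0 for m in months]
    return summer, winter


# Sample period stated in the text: January 2011 to November 2014
months = list(month_range(date(2011, 1, 1), date(2014, 11, 1)))
summer, winter = seasonal_dummies(months)
```

With this sample period the list of months has 47 entries, which matches the number of observations reported in the text.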
Relative numbers of queries pertaining to car makes are coded by variables starting with the letter S and defined in the way proposed by Choi and Varian: “The query index is based on query share: the total query volume for the search term in question within a particular geographic region divided by the total number of queries in that region during the time period being examined. The maximum query share in the time period specified is normalized to be 100 and the query share at the initial date being examined is normalized to be zero.” (Choi & Varian, 2011, p. 3). Original search data is defined relative to Opel queries in the week of March 31 – April 6, 2013 (the maximum query share in the period analysed). For aggregated monthly series, data is rescaled relative to the maximum monthly query in the period analysed, that is, the number of Opel queries in October 2014, equal to 396.43. For example, S_fiat stands for the percentage of queries on Fiat cars in a given month relative to the maximum level of 396.43.
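The monthly rescaling step can be illustrated with a short sketch. It assumes the weekly indices have already been aggregated into monthly totals per make; apart from the peak of 396.43 quoted above, all numbers below are made up for illustration, and the function name is hypothetical.

```python
def rescale_to_common_max(monthly_queries):
    """Express every monthly value as a percentage of the largest
    monthly value found across all makes."""
    peak = max(max(series) for series in monthly_queries.values())
    return {
        make: [100.0 * value / peak for value in series]
        for make, series in monthly_queries.items()
    }


# Hypothetical monthly aggregates; only the peak of 396.43 comes from the text
raw = {
    "opel": [310.20, 396.43, 350.00],
    "fiat": [198.215, 250.00, 120.50],
}
S = rescale_to_common_max(raw)
```

Dividing all makes by one common peak, rather than by each make's own maximum, keeps the S variables comparable across makes.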
Traffic variables, coded with names starting with the letter T, reflect the number of unique visitors to car manufacturers’ websites in a given month. This type of data has not been previously used to explain first registrations or other sales numbers. The source of the traffic time series is Megapanel PBI/Gemius, which monitors online behaviour based on a panel of Polish internet users. The data on website traffic are estimates and cannot be interpreted in a straightforward way, but they allow comparison of the various websites visited by Polish consumers. The data provider records missing values when the number of visits is lower than a minimum defined level (in the case of car manufacturers, 40,000 hits); for the purpose of this paper, two approaches to missing data were undertaken:
- imputation of the “lower bound” value of 40,000 visits,
- calculation of an average of the two months preceding and the two months following a missing value; in the special case of Renault, the missing value for November 2014 is calculated as an average of the four preceding months.
Results of the empirical analysis (see next section) show that the treatment of missing values does not influence outcomes in a significant way.
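The two treatments of missing traffic values can be sketched as below. This is an illustrative reading of the description above, not the authors' code: missing months are represented as None, the traffic figures are invented, and the end-of-sample fallback generalises the Renault case (four preceding months) as an assumption.

```python
LOWER_BOUND = 40_000  # reporting threshold mentioned in the text


def impute_lower_bound(series, bound=LOWER_BOUND):
    """Approach 1: replace every missing value with the lower bound."""
    return [bound if value is None else value for value in series]


def impute_neighbour_mean(series, window=2):
    """Approach 2: replace a missing value with the mean of the observed
    values in the `window` months before and after it. When no later
    observation exists (end of sample), use the observed values in the
    2 * window preceding months instead. Assumes at least one observed
    neighbour is available."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        before = [v for v in series[max(0, i - window):i] if v is not None]
        after = [v for v in series[i + 1:i + 1 + window] if v is not None]
        if not after:
            start = max(0, i - 2 * window)
            before = [v for v in series[start:i] if v is not None]
        neighbours = before + after
        filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Hypothetical monthly unique-visitor counts with two missing months
traffic = [50000, 60000, None, 70000, 80000, 52000, 48000, 44000, 56000, None]
by_bound = impute_lower_bound(traffic)
by_mean = impute_neighbour_mean(traffic)
```

Estimating each model once with `by_bound` and once with `by_mean` is one way to check whether conclusions depend on the imputation choice.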
We limit our dataset to internet data (that is, search and traffic data) and lagged values of the dependent variable (car registrations), omitting car market data and macroeconomic variables. The rationale for this approach is that internet data are available almost in real time, and car registrations data become accessible speedily, on the 5th of the next month. On the other hand, macroeconomic data are published with at least two-month lags which limits their application for nowcasting.
Linear models with HAC standard errors (to account for serial correlation in the error term) have been estimated for six variables describing the number of first registrations: Fiat, Opel, Peugeot, Renault, Skoda, and Toyota. Results are summarised in Table 1. The strongest explanatory power is exhibited by the autoregressive component (that is, the number of registrations lagged one month); it is the only regressor that is statistically significant at the 0.05 significance level in all six models, and the size of the estimated coefficient varies from 0.314 for Toyota to 0.642 for Fiat.
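For orientation, a general form consistent with the variables described in this paper can be written as follows. This is a sketch under assumptions rather than the exact specification estimated: the maximum lag K, the coefficient symbols, and the inclusion of every term for every make m are not stated in the text.

```latex
R_{m,t} = \alpha_m + \rho_m R_{m,t-1}
        + \sum_{k=0}^{K} \beta_{m,k}\, S_{m,t-k}
        + \sum_{k=0}^{K} \gamma_{m,k}\, T_{m,t-k}
        + \delta_m^{\mathrm{sum}} D_t^{\mathrm{summer}}
        + \delta_m^{\mathrm{win}} D_t^{\mathrm{winter}}
        + \varepsilon_{m,t}
```

Here R, S and T denote registrations, the search index and website traffic for make m in month t, the D terms are the summer and winter dummies, and the reported autoregressive coefficients (0.314 for Toyota to 0.642 for Fiat) correspond to the coefficient on the lagged registrations term.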
As far as search data is concerned, in four of the models (for Fiat, Opel, Peugeot and Skoda) search variables lagged either one (in the case of Peugeot) or two months exhibit positive and statistically significant influence on the number of registrations. This result is consistent with the one-month delay between a car sale and its registration allowed by Polish law, taking into account that additional delay may be expected between internet search and actual signing of the contract.
Table 1. Summary of estimation results
All variables are statistically significant at 0.05 level.
Traffic variables do not exhibit systematic and statistically significant influence on the number of car registrations for any lag considered in the models. In two cases (that is, Peugeot and Renault) traffic variable lagged one month is statistically significant; however, these two models exhibit the lowest coefficients of determination and their descriptive value is therefore limited.
The models provide only limited support for the hypothesis of seasonal behaviour of car registrations. Summer dummy variable estimated coefficient exhibits its expected negative sign (for summer consumer inertia) and is statistically different from zero in two cases only, for Skoda and Toyota. Winter dummy variable is only statistically significant (but with negative coefficient which contradicts the expectations of high end-of-year sales) in the Peugeot model.
As far as the general statistical quality of the estimated models is concerned, they are correctly specified according to the RESET test. All but two (for Fiat and Peugeot) have normally distributed error terms, and since the sample size can be considered sufficient, lack of normality does not negatively influence estimation results. There is no multicollinearity in any of the models. Treatment of missing values in traffic data does not influence the general conclusion that traffic numbers do not have a statistically significant impact on the dependent variable.
There are two major factors influencing number of first registrations: autoregressive component and search data. The result is coherent with Choi and Varian’s conclusion, as “simple seasonal AR models that include relevant Google Trends variables tend to outperform models that exclude these predictors by 5% to 20%.” (Choi & Varian, 2011, p. 8).
What remains to be explained are the differences in lag lengths between internet search and car registration. The lag may be non-existent (for Renault, where search variables do not exhibit significant impact at all, and Toyota, where only the current value does), equal to one month (for Peugeot) or to two months (for the three remaining car makes). A certain delay between search and registration is expected but should be similar for all car makes; to explain the differences, factors such as length of order fulfilment and sales policies of car manufacturers should be taken into consideration.
Econometric analysis of car registrations data suggests that three subsets of car makes may be distinguished: Fiat and Opel (whose registration numbers seem to follow similar patterns based on search data lagged two months); Skoda and Toyota (which exhibit significantly lower registrations in the summer months); and Peugeot and Renault, where the influence of lagged traffic data may be observed. Otherwise, car manufacturer website traffic appears to have limited predictive value. Interestingly, for Peugeot and Renault the search factor is either non-existent (Renault) or negatively correlated with the number of registrations (Peugeot). For a Polish consumer these two brands are more difficult to spell than the others included in the research. As we included only properly spelled brand names, the search data might not have fully reflected the number of queries. The result may suggest that website traffic may be considered as a predictor when search data is unavailable or when clear attribution of search queries to the nowcasted activity is problematic.
This study has the following limitations. Its purpose was to nowcast new passenger car registrations made by both consumers and businesses, whereas Google search data is often used to predict private consumer activities. Due to limited availability of detailed data we attempted to nowcast all passenger car registrations. This might have resulted in lower prediction quality, but it is of greater practical use as it reflects the entire sales volume.
Macroeconomic variables (e.g. consumer satisfaction index, general business conditions index) and automobile market data were not included in the research. They are published with a delay, thus do not meet nowcasting requirements. However, lagged data of this kind could be included into more sophisticated models.
The research project would not have been possible without the support of Polskie Badania Internetu Sp. z o.o (PBI). The company provided us with the data on website traffic of car manufacturers in the period from January 2011 to November 2014. The research program Megapanel PBI/Gemius presents the behaviour of Polish Internet users based on a study Net Track Millward Brown SMG/KRC, conducted on a sample selected and weighted by PBC.
Askitas N., Zimmermann K.F. (2009). Google Econometrics and Unemployment Forecasting. Applied Economics Quarterly, 55 (2), 107-120.
Bańbura M., Giannone D., Modugno M., Reichlin L. (2013). Now-Casting and the Real-Time Data Flow, Working Papers, European Central Bank, no. 1564.
Carrière-Swallow Y., Labbé F. (2013). Nowcasting with Google Trends in an Emerging Market. Journal of Forecasting, 32 (4), 289–298.
Choi H., Varian H. (2011). Predicting the Present with Google Trends. Economic Record, 88: 2–9. doi: 10.1111/j.1475-4932.2012.00809.x
Li N., Peng G., Chen H., Bao J. (2013). A Prediction Study on E-commerce Orders Based on Site Search Data. 6th International Conference on Information Management, Innovation Management and Industrial Engineering, 2, 314-318.
Schmidt T., Vosen S. (2009). Forecasting Private Consumption: Survey-based Indicators vs. Google Trends. Ruhr Economic Papers, 155.
Sun B., Li B., Li G., Zhang K. (2013). Automobile Demand Forecasting: An Integrated Model of PLS Regression and ANFIS. Advances in Information Sciences & Service Sciences, 5 (8), 429-436.
 For legal reasons, the category of personal cars includes also cars with cargo compartment (Polish: samochody z kratką) in the period January-July 2014.
 Polish Association of Automotive Industry, http://www.pzpm.org.pl/en, [2015.04.02].