Two sets of reference data were used: the number of laboratory-verified influenza cases and the proportion of patients with influenza-like illness among those who had seen any of the sentinel general practitioners (GPs) in Sweden. The influenza season normally lasts from the end of November to mid-April, with a peak in February or March in most seasons. Reporting runs from October (week 40) to May (week 20).
In total, twenty types of queries were included in the statistical analysis. For examples of each type, see
This definition is based on the ECDC definition but adapted to the data source, in which “sudden onset” is difficult to identify. Moreover, the ECDC definition is intended for use by a physician, whereas a physician is normally not involved in formulating web queries.
We further counted influenza matches after removing queries on items unrelated to ordinary influenza, that is, counting only queries that did not contain the Swedish words vaccin (vaccine), fågel (bird), or maginfluensa (stomach flu). Nineteen per cent of the queries matching influenza were on stomach flu, which is why we also specifically included this query in the examined set. For the queries on symptoms, in addition to counting them when a symptom was the only submitted word, we counted the number of queries matching the ILI definition given above, allowing for other terms in the query. The two most frequently occurring ILI symptoms (fever and cough) were also investigated when occurring in any constellation. (The Swedish word for cough, hosta, loses its final a when it is the first element of a compound; the counting program accounts for this.) Additionally, we examined the term cold (förkylning). All selected query terms consist of a single word in Swedish. This is worth noting, as about 75 per cent of the queries contain a single term only (Swedish is rich in compounding; the Swedish equivalent of, for example, influenza vaccine is influensavaccin).
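The filtering rules described above can be illustrated with a minimal sketch. It is not the counting program used in the study: the function names are hypothetical, plain substring matching on lower-cased queries is an assumption, and matching the stem "host" to cover compounds of hosta is one possible way to handle the dropped a (it could also catch unrelated words beginning with the same letters).

```python
# Terms that mark a query as unrelated to ordinary influenza (from the text).
EXCLUDED_TERMS = ("vaccin", "fågel", "maginfluensa")


def is_cleaned_influenza_query(query: str) -> bool:
    """True if the query mentions influensa and none of the excluded terms."""
    q = query.lower()
    if "influensa" not in q:
        return False
    return not any(term in q for term in EXCLUDED_TERMS)


def mentions_cough(query: str) -> bool:
    """True if the query contains hosta or a compound built on the stem host-.

    Assumption: matching the stem 'host' is enough to cover compounds in
    which hosta has lost its final a (for example hostmedicin).
    """
    return "host" in query.lower()


if __name__ == "__main__":
    # Invented example queries, not taken from the query logs.
    for q in ["influensa", "influensavaccin", "Maginfluensa", "hostmedicin"]:
        print(q, is_cleaned_influenza_query(q), mentions_cough(q))
```

Note that a plain match on influensa also matches maginfluensa, which is why the stomach flu queries have to be excluded explicitly in the cleaned count.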
Since the usage of the search engine on Vårdguiden.se increases over time, data were standardised by dividing the counts during one season by the total number of queries to the web site during that particular season. The calculated numbers for the different types of Vårdguiden.se queries were highly correlated, which poses a problem for ordinary regression because of collinearity. Therefore, partial least squares regression (PLSR), an approach designed for this kind of data, was used to generate our estimating models. The method is used in many application areas for multivariate, highly correlated data [19]. PLSR works by “relating two data matrices, X and Y, to each other by a linear multivariate model” [20, p71]. This is done by using an algorithm closely related to Principal Component Analysis to transform a highly dependent set of input data into a set of independent components.
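To make the idea of replacing correlated predictors with components concrete, the following is a minimal sketch of the first component of a single-response PLS regression (often called PLS1). It is an illustration under stated assumptions, not the software or settings used in the study: the function name `fit_pls1_one_component` and the toy data are hypothetical, only one component is extracted, and predictors are centred but not scaled.

```python
from math import sqrt


def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model and return a prediction function.

    X is a list of rows (one row per week, one column per query type),
    y is a list of reference values. Hypothetical helper for illustration.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector: direction in predictor space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Scores: each week summarised by a single component value.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]

    # Regress the centred response on the scores.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return predict


if __name__ == "__main__":
    # Toy data: the second predictor is exactly twice the first (collinear).
    X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
    y = [4.0, 7.0, 10.0, 13.0]  # y = 3 * x1 + 1
    predict = fit_pls1_one_component(X, y)
    print(predict([5.0, 10.0]))
```

In the toy data the two predictors are perfectly collinear, which would make an ordinary least squares fit ill-defined, whereas the single component still recovers the linear relation.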
The entire PLSR procedure runs as follows (see also
The sentinel and the laboratory values from both seasons were used as input variables to two different models, one model for each source, where both used all twenty types of queries as predictor variables.
In our experiments, we used cross validation [23] to find the optimal number of components to include in the models. Briefly, cross validating a model is done by first splitting the data set into a number of equally sized partitions. One partition is then omitted, while the remaining partitions are used to estimate a model. This model is subsequently used to predict the omitted data. Thereafter the difference between the true and the predicted value is measured. This process is repeated multiple times with different partitions omitted. Finally, all obtained differences are squared and averaged to generate an estimate of the precision of the cross validated model. We thus get the mean predictive error. When only one observation is omitted, the process is referred to as leave-one-out cross validation.
We conducted a number of cross validations in which the number of partitions that the data were split into was varied from two up to 60. In addition, we tested the extreme case, in which only one week was omitted at a time. In the presented research, the omission was done sequentially; that is, we first omitted the first n weeks, then the second n weeks, and so on. The resulting 60 different mean predictive errors were used to select the optimal model.
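The sequential cross-validation scheme can be sketched as follows. This is a simplified illustration rather than the procedure actually run: the function names are hypothetical, a one-predictor least squares line stands in for the PLSR model, and partitions may differ in size by one observation when the number of weeks is not evenly divisible.

```python
from statistics import mean


def sequential_folds(n_obs, n_parts):
    """Split indices 0..n_obs-1 into n_parts contiguous blocks, in order."""
    bounds = [i * n_obs // n_parts for i in range(n_parts + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_parts)]


def fit_line(xs, ys):
    """Stand-in model (assumption): a least squares line with one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def mean_predictive_error(xs, ys, n_parts, fit=fit_line):
    """Mean squared difference between true and predicted held-out values."""
    squared_errors = []
    for held_out in sequential_folds(len(xs), n_parts):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    return mean(squared_errors)


if __name__ == "__main__":
    # Invented weekly values, not study data.
    xs = [0.1, 0.2, 0.4, 0.7, 1.1, 1.6, 1.2, 0.8, 0.5, 0.3, 0.2, 0.1]
    ys = [1.0, 2.1, 3.9, 7.2, 10.8, 16.5, 12.1, 7.9, 5.2, 2.8, 2.2, 0.9]
    for k in (2, 3, 4, len(xs)):  # len(xs) partitions gives leave-one-out
        print(k, mean_predictive_error(xs, ys, k))
```

Setting the number of partitions equal to the number of observations gives the leave-one-out case, in which a single week is omitted at a time.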