Web queries submitted to the web site Vårdguiden.se (www.vardguiden.se) were analysed. The web site is written in Swedish, and the submitted queries are therefore (mostly) in Swedish. Although the site is accessible to anybody, the primary users are residents of Stockholm County [18]. However, as no information on the users submitting the queries was available, the data were aggregated at the national level. In the described study, we used Vårdguiden logs from June 27, 2005 to June 24, 2007, thus covering two influenza seasons. No spelling corrections were applied (these are provided by the search engine), nor did we remove possible duplicate searches where a user had submitted a query more than once. All queries were case-folded (that is, turned into lower case). The data were aggregated by week, which is the aggregation level of the sentinel and laboratory reports. Logs were missing for a total of five weeks during the summer of 2006, a period normally not affected by influenza in the northern hemisphere.
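To make the preprocessing concrete, the following R sketch illustrates the steps described above. The log layout (a tab-separated file with date and query columns) and all object names are our assumptions for illustration, not the actual Vårdguiden log format.

```r
## Hypothetical log layout: one row per submitted query, with a
## "date" column and a "query" column (tab-separated file).
logs <- read.delim("vardguiden_log.tsv", stringsAsFactors = FALSE)

## Case-fold all queries; duplicates and misspellings are kept,
## as in the study.
logs$query <- tolower(logs$query)

## Aggregate by week, the level used by the sentinel and the
## laboratory reports ("%W" gives the week number within the year).
logs$week <- format(as.Date(logs$date), "%Y-%W")
weekly_totals <- table(logs$week)
```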
Two sets of reference data were used: the number of laboratory-verified influenza cases and the proportion of patients with influenza-like illness (ILI) among those seen by the sentinel general practitioners (GPs) in Sweden. The influenza season normally lasts from the end of November to mid-April, with a peak in February or March for most seasons. Reporting is done from October (week 40) to May (week 20).
In total, twenty types of queries were included in the statistical analysis. For examples of each type, see Table 1. More specifically, the analysis was performed on queries containing the word influensa (influenza in Swedish) in various variants, and on queries on symptoms of influenza-like illness. The seven investigated symptoms were: fever, headache, myalgia, cough, sore throat, coryza, and shortness of breath. These symptoms were motivated by the ILI definition used in the study.
This definition is based on the ECDC definition, but adapted to the data source, where “sudden onset” is difficult to identify. Moreover, the ECDC definition is intended to be applied by a doctor, whereas no such person is normally involved in formulating the web queries.
We further counted influenza matches cleaned from queries on items not related to ordinary influenza, counting only queries not containing the Swedish words vaccin (vaccine), fågel (bird), or maginfluensa (stomach flu). Nineteen per cent of the queries matching influenza were on stomach flu, which is why this query was also specifically included in the examined set. As for the symptom queries, in addition to counting these when the symptom was the only submitted word, we counted the number of queries matching the ILI definition given above, allowing for other terms in the query. The two most frequently occurring of the ILI symptoms (fever and cough) were also investigated when occurring in any constellation. (The Swedish word for cough, hosta, loses its final a when it is the first element of a compound; this is accounted for in the program counting the occurrences.) Additionally, we examined the term cold (förkylning). All selected query terms consist of one word in Swedish. This is worth noting, as about 75 per cent of the queries contain a single term only (Swedish is rich in compounding, and the Swedish equivalent of, for example, influenza vaccine is influensavaccin).
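The matching rules can be sketched as follows in R, continuing from the logs table in the earlier sketch. The regular expressions are our reconstruction of the rules described above, not the authors' actual counting program.

```r
## Reconstructed matching rules (our assumption of how the counting
## program may have worked).

## Influenza queries "cleaned" from items not related to ordinary
## influenza: match "influensa" but exclude queries on vaccination,
## avian influenza, and stomach flu.
is_flu      <- grepl("influensa", logs$query)
is_excluded <- grepl("vaccin|fågel|maginfluensa", logs$query)
flu_cleaned <- is_flu & !is_excluded

## Stomach flu was also counted as a query type of its own.
stomach_flu <- grepl("maginfluensa", logs$query)

## "hosta" (cough) drops its final "a" as the first element of a
## compound (e.g. "hostmedicin"), so the stem "host" is matched.
has_cough <- grepl("host", logs$query)
has_fever <- grepl("feber", logs$query)

## Fever and cough "in any constellation": counted regardless of
## what other terms the query contains.
fever_or_cough_any <- has_fever | has_cough
```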
Since the usage of the search engine on Vårdguiden.se increases over time, the data were standardised by dividing the counts during one season by the total number of queries submitted to the web site during that particular season. The calculated numbers for the different types of Vårdguiden.se queries were highly correlated, which poses a problem for ordinary regression due to collinearity. Therefore, partial least squares regression (PLSR), an approach designed for these kinds of data, was used to generate our estimating models. It is a method used in many application areas for multivariate, highly correlated data [19]. PLSR works by “relating two data matrices, X and Y, to each other by a linear multivariate model” [20, p. 71]. This is done using an algorithm closely related to principal component analysis, which transforms a highly dependent set of input data into a set of independent components.
The entire PLSR procedure runs as follows (see also Figure 1): Given a set of outcome variables (Y) and a set of input variables (X), it creates new variables (“components”) by adding together the input variables in X, with an individual weight for each variable. This is done as many times as there are input variables, with a different set of weights each time. The weights are chosen in such a way that each newly created component exhibits as much as possible of the variation in the input and output data that has not been captured by previous components. Thus, the first component describes most of the variation in the data, the second a little less, and so on. The weights are also chosen so that each component is independent of the others; that is, a single component cannot be described as a sum of the other components. Subsequently, the created components are used as input variables in a set of ordinary regressions predicting Y, with an increasing number of components included. Finally, the models are validated in order to establish the number of components to include. The model that exhibits the best validation performance is chosen as the final model. In our analysis, PLSR was applied using the wide kernel algorithm, as implemented in the pls package [21] in R 2.6.1 [22].
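As an illustration, a model of the kind described can be fitted with the pls package along the following lines. The data objects (sentinel_ili, query_counts, season_totals) are placeholders; only the call structure (wide kernel algorithm, twenty predictors) follows the text above.

```r
library(pls)  # provides plsr() and the wide kernel algorithm [21]

## Placeholder data: one row per week, one column per query type,
## each standardised by the total number of queries in the season.
d <- data.frame(ili = sentinel_ili,  # sentinel ILI proportions
                X   = I(query_counts / season_totals))

## PLSR with as many components as there are input variables (20);
## an analogous model is fitted against the laboratory counts.
fit <- plsr(ili ~ X, data = d, ncomp = 20,
            method = "widekernelpls", validation = "CV")
summary(fit)  # cross-validated error per number of components
```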
The sentinel and the laboratory values from both seasons were used as outcome variables in two different models, one model for each source, where both models used all twenty types of queries as predictor variables.
In our experiments, we used cross-validation [23] to find the optimal number of components to include in the models. Briefly, cross-validating a model is done by first splitting the data set into a number of equally sized partitions. One partition is then omitted, while the remaining partitions are used to estimate a model. This model is subsequently used to predict the omitted data, and the difference between the true and the predicted values is measured. This process is repeated with each partition omitted in turn. Finally, all obtained differences are squared and averaged, generating an estimate of the precision of the cross-validated model: the mean predictive error. When only one observation is omitted at a time, the process is referred to as leave-one-out cross-validation.
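Written out by hand, the procedure amounts to the following schematic R sketch, reusing the placeholder data frame d from the previous sketch; cv_msep and its arguments are our own names.

```r
## Schematic k-fold cross-validation with consecutive partitions,
## returning the mean predictive (squared) error.
cv_msep <- function(d, ncomp, k) {
  n    <- nrow(d)
  fold <- cut(seq_len(n), breaks = k, labels = FALSE)  # k blocks
  sq_err <- numeric(n)
  for (i in seq_len(k)) {
    out  <- which(fold == i)               # partition left out
    fit  <- plsr(ili ~ X, data = d[-out, ], ncomp = ncomp,
                 method = "widekernelpls")
    pred <- drop(predict(fit, newdata = d[out, ], ncomp = ncomp))
    sq_err[out] <- (d$ili[out] - pred)^2   # squared differences
  }
  mean(sq_err)                             # mean predictive error
}
```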
We conducted a number of different cross-validations, varying the number of partitions that the data were split into from two up to 60. In addition, we tested the extreme case where only one week was omitted at a time. In the presented research, the omission was done sequentially; that is, we first omitted the first n weeks, then the second n weeks, and so on. The resulting 60 different mean predictive errors were used to select the optimal model.
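With the pls package, this search over partition counts can be expressed directly through the built-in validation, using consecutive segments to match the sequential omission described above. Again, this is a sketch with the placeholder objects introduced earlier.

```r
## Mean predictive error for 2, 3, ..., 60 consecutive partitions.
msep_by_k <- sapply(2:60, function(k) {
  fit <- plsr(ili ~ X, data = d, ncomp = 20,
              method = "widekernelpls", validation = "CV",
              segments = k, segment.type = "consecutive")
  min(MSEP(fit, estimate = "CV")$val)  # best error over component counts
})

## The extreme case: leave-one-out cross-validation (the 60th error).
fit_loo <- plsr(ili ~ X, data = d, ncomp = 20,
                method = "widekernelpls", validation = "LOO")
```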