Analyzing Web Search Queries to Monitor Influenza Trends

Web queries submitted to the web site Vårdguiden.se (www.vardguiden.se) were analysed. The web site is written in Swedish, so the submitted queries are (mostly) in Swedish. Although the site is accessible to anybody, the primary users are residents of Stockholm County [18]. However, as no information on the users submitting the queries was available, the data were aggregated on a national level. In the described study, we used Vårdguiden logs from June 27, 2005 to June 24, 2007, thus covering two influenza seasons. No spelling corrections were considered (these are provided by the search engine), nor did we remove possible duplicate searches where a user had submitted a query more than once. All queries were case-folded (that is, turned into lower case). The data were aggregated by week, which is the aggregation level for the sentinel and the laboratory reports. Logs were missing for a total of five weeks during the summer of 2006, a period normally not affected by influenza in the northern hemisphere.
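The preprocessing described above (case-folding, keeping duplicates, weekly aggregation) can be sketched as follows. This is a minimal illustration, not the authors' program: the log format, the function names, and the use of ISO week numbering are assumptions.

```python
from collections import Counter
from datetime import date


def week_key(day):
    """Return (ISO year, ISO week) for a date; ISO numbering is an assumption."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def weekly_counts(log):
    """Count queries per (week, case-folded query text).

    Duplicate submissions are deliberately kept, and no spelling
    correction is applied, as in the text above.
    """
    counts = Counter()
    for day, query in log:
        counts[(week_key(day), query.lower())] += 1
    return counts


# Hypothetical log entries: (date, raw query text)
log = [
    (date(2006, 2, 6), "Influensa"),
    (date(2006, 2, 7), "influensa"),
    (date(2006, 2, 8), "Hosta"),
    (date(2006, 2, 13), "hosta"),
]
counts = weekly_counts(log)
```

Keying on the week rather than the day makes the counts directly comparable with weekly reference data.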
Two sets of reference data were used: the number of laboratory verified influenza cases and the proportion of patients with influenza-like illness having seen any of the sentinel general practitioners (GPs) in Sweden. The influenza season normally lasts from the end of November to mid April, with a peak sometime in February/March for most seasons. The reporting is done from October (week number 40) to May (week number 20).
In total, twenty types of queries were included in the statistical analysis. For examples of each type, see Table 1. More specifically, the analysis was performed on queries containing the word influensa (influenza in Swedish) in various variants and on queries on symptoms for influenza-like illness. The seven investigated symptoms were: fever, headache, myalgia, cough, sore throat, coryza, and shortness of breath. These symptoms were motivated by the following ILI-definition:
This definition is based on the ECDC definition, but adapted to the data source, where “sudden onset” is difficult to identify. Also, the ECDC definition is intended to be used by a doctor, who is normally not involved in formulating the web queries.
We further counted influenza matches cleaned of queries on items not related to ordinary influenza, counting only queries not containing the Swedish words vaccin (vaccine), fågel (bird), or maginfluensa (stomach flu). Nineteen per cent of the queries matching influenza were on stomach flu, which is why we also specifically included this query in the examined set. As for the queries on symptoms, in addition to counting these when they were the only submitted word, we counted the number of queries matching the ILI definition given above, allowing for other terms in the query. The two most frequently occurring ILI symptoms (fever and cough) were also investigated when occurring in any combination. (The Swedish word for cough (hosta) loses its final a when it is the first element in a compound. This is accounted for in the program counting the occurrences.) Additionally, we examined the term cold (förkylning). All selected query terms consist of one word in Swedish. This is worth noting as about 75 per cent of the queries contain a single term only (Swedish is rich in compounding, and the Swedish equivalent of, for example, influenza vaccine is influensavaccin).
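The matching rules in this paragraph can be sketched as simple substring tests. The function names are hypothetical, and treating a match as a plain substring test (including the stem "host" for cough) is an assumption; the original counting program is not described in more detail.

```python
# Words whose presence marks a query as not about ordinary influenza
# (taken from the paragraph above).
EXCLUDED_WORDS = ("vaccin", "fågel", "maginfluensa")


def matches_influenza(query):
    """True if the case-folded query contains 'influensa' anywhere."""
    return "influensa" in query.lower()


def matches_clean_influenza(query):
    """Influenza match that contains none of the excluded words."""
    q = query.lower()
    return "influensa" in q and not any(word in q for word in EXCLUDED_WORDS)


def matches_cough(query):
    """True for 'hosta' and for compounds where it appears as 'host'.

    Assumption: a substring test on the stem 'host' is precise enough.
    """
    return "host" in query.lower()


def is_single_term(query, term):
    """True if the term is the only submitted word."""
    return query.lower().split() == [term]
```

For example, `matches_influenza("maginfluensa")` is true while `matches_clean_influenza("maginfluensa")` is false, which is how the stomach flu queries end up in the raw influenza count but not in the cleaned one.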
Since the usage of the search engine on Vårdguiden.se increases over time, data were standardised by dividing the counts during one season by the total number of queries to the web site during that particular season. The calculated numbers for the different types of Vårdguiden.se queries were highly correlated, which poses a problem for regular regression due to collinearity. Therefore, partial least squares regression (PLSR), an approach designed for this kind of data, was used to generate our estimating models. The method is used in many application areas for multivariate, highly correlated data [19]. PLSR works by “relating two data matrices, X and Y, to each other by a linear multivariate model” [20, p. 71]. This is done by using an algorithm closely related to Principal Component Analysis to transform a highly dependent set of input data into a set of independent components.
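The standardisation step amounts to turning weekly counts into shares of the season's total query volume. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def standardise(weekly_counts, season_total):
    """Divide each weekly count by the season's total number of queries."""
    if season_total <= 0:
        raise ValueError("season_total must be positive")
    return [count / season_total for count in weekly_counts]


# Made-up numbers: the same raw counts in two seasons with different
# overall search volume give different standardised values.
season_1 = standardise([10, 30, 60], season_total=2000)
season_2 = standardise([10, 30, 60], season_total=4000)
```

Here the second season's values are half those of the first, so growth in overall site usage does not show up as a rise in the query series.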
The entire PLSR procedure runs as follows (see also Figure 1): Given a set of outcome variables (Y) and a set of input variables (X), it creates new variables (“components”) by adding together the input variables in X, with individual weights for each variable. This is done as many times as there are input variables, with a different set of weights each time. The weights are chosen in such a way that a newly created component captures as much as possible of the variation in input and output data that has not been included in previous components. Thus, the first component to be created describes most of the variation in the data, the second describes a little less, and so on. The weights are also chosen so that each component is independent of the others, that is, a single component cannot be described as the sum of the other components. Subsequently, the created components are used as input variables for a set of ordinary regressions predicting Y, with an increasing number of components included. Finally, the models are validated in order to establish the number of components to include. The model which exhibits the best validation performance is chosen as the final model. In our analysis, PLSR was applied using the wide kernel algorithm, as implemented in the PLS package [21] in R 2.6.1 [22].
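For a single outcome variable y, the component construction described above can be written out in the common textbook (NIPALS) form. This is a sketch for orientation only: it is not the wide kernel algorithm used in the analysis, which arrives at the same kind of model by a different computation, and the symbols (X_a, w_a, t_a, p_a, q_a, A) are notation introduced here rather than taken from the source. X and y are assumed to be column-centred.

```latex
\begin{aligned}
&X_1 = X \\
&\text{for } a = 1, \dots, A: \\
&\qquad w_a = \frac{X_a^{\top} y}{\lVert X_a^{\top} y \rVert},
  \qquad t_a = X_a w_a, \\
&\qquad p_a = \frac{X_a^{\top} t_a}{t_a^{\top} t_a},
  \qquad q_a = \frac{y^{\top} t_a}{t_a^{\top} t_a}, \\
&\qquad X_{a+1} = X_a - t_a p_a^{\top} \\
&\hat{y} = \sum_{a=1}^{A} q_a t_a
\end{aligned}
```

Each score vector t_a is a weighted sum of the input variables, and the deflation step X_{a+1} = X_a - t_a p_a^T is what makes the scores mutually orthogonal. The number of components A is the quantity chosen by validation.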
The sentinel and the laboratory values from both seasons were used as outcome variables in two different models, one model for each source, where both used all twenty types of queries as predictor variables.
In our experiments, we used cross validation [23] to find the optimal number of components to include in the models. Briefly, cross validating a model is done by first splitting the data set into a number of equally sized partitions. One partition is then omitted, while the remaining partitions are used to estimate a model. This model is subsequently used to predict the omitted data. Thereafter the difference between the true and the predicted value is measured. This process is repeated multiple times with different partitions omitted. Finally, all obtained differences are squared and averaged to generate an estimate of the precision of the cross validated model. We thus get the mean predictive error. When only one observation is omitted, the process is referred to as leave-one-out cross validation.
We conducted a number of different cross validations, in which the number of partitions that the data were split into was varied from two up to 60. In addition, we tested the extreme case, where only one week was omitted at a time. In the presented research, the omission was done sequentially, that is, we first omitted the first n weeks, then the second n weeks, and so on. The resulting 60 different mean predictive errors were used to select the optimal model.
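Sequential cross validation as described above can be sketched in a few lines. To keep the example self-contained, a one-predictor least-squares line stands in for the PLSR model; that substitution, the function names, and the data are all assumptions made for illustration.

```python
def sequential_folds(n, k):
    """Split indices 0..n-1 into k consecutive blocks of near-equal size."""
    bounds = [i * n // k for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for one predictor (stand-in model)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def cv_mean_squared_error(xs, ys, k):
    """Omit each consecutive block in turn, refit, and average squared errors."""
    squared_errors = []
    for held_out in sequential_folds(len(xs), k):
        omitted = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in omitted]
        train_y = [y for i, y in enumerate(ys) if i not in omitted]
        intercept, slope = fit_line(train_x, train_y)
        for i in held_out:
            squared_errors.append((ys[i] - (intercept + slope * xs[i])) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Made-up weekly data; k equal to the number of weeks is leave-one-out.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
errors_by_k = {k: cv_mean_squared_error(xs, ys, k) for k in (2, 3, 6)}
```

In the setting of the text, the same loop would be run once per candidate number of components, and the candidate with the lowest mean predictive error would be kept.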


Hulth A., Rydevik G., & Linde A. (2009). Web Queries as a Source for Syndromic Surveillance. PLoS ONE, 4(2), e4378.

Publication 2009


Variable analysis

independent variables

Queries containing the word 'influensa' (influenza in Swedish) in various variants (twenty types of queries were analysed in total, across all categories listed here)
Queries on symptoms for influenza-like illness (fever, headache, myalgia, cough, sore throat, coryza, and shortness of breath)
Queries containing the Swedish word 'maginfluensa' (stomach flu)

dependent variables

Number of laboratory verified influenza cases
Proportion of patients with influenza-like illness having seen any of the sentinel general practitioners (GPs) in Sweden

control variables

No information on the users submitting the queries was available, so the data were aggregated on a national level
No spelling corrections were considered, and possible duplicate searches were not removed
All queries were case-folded (turned into lower case)
Data were aggregated by week, which is the aggregation level for the sentinel and the laboratory reports
Logs were missing for a total of five weeks during the summer of 2006, a period normally not affected by influenza in the northern hemisphere
