What is this place?
This website introduces a method called PRD (proportional reporting difference) analysis for finding and quantifying safety signals in post-approval vaccine pharmacovigilance data. The generated hits are listed on the main page and are specific to certain types of immunisation products.
I decided this was necessary because neither the regulatory agencies nor the medical journals have published an analysis of this kind, one that gives a comprehensive overview of the side effect profile of substances.
On January 29th of 2021 the CDC released a document titled 'Vaccine Adverse Event Reporting System (VAERS) Standard Operating Procedures for COVID-19' (for official use only) which announced the CDC's intention:
Alas, the CDC never followed up on its promise, or at least it never released its results, so I created this website to do it for them. To be more precise, we perform a PRD analysis rather than a PRR (proportional reporting ratio) analysis. As the names suggest, the two are very similar. Let me give you an idea of the differences between the two concepts:
Both methods generate nearly the same signals, depending on the method used for calculating the respective confidence intervals, but they rank them differently. PRD supplies information that is more useful for doctors and patients, since it indicates how frequently a side effect occurs, while PRR is better suited to helping regulatory agencies and pharmaceutical companies detect novel safety signals.
What constitutes an adverse event?
The terms 'adverse event', 'descriptor', 'medical concept', 'symptom' and 'side effect' will often be used interchangeably and correspond to the titles of the charts displayed on the main page. With a few exceptions, they can be found in the VAERS public dataset files ending with 'SYMPTOMS.csv' and correspond to 'preferred terms' as defined by MedDRA (Medical Dictionary for Regulatory Activities):
Additionally, some fields from the VAERS files ending with 'DATA.csv' have also been added as terms and/or merged with the MedDRA terms from the 'SYMPTOMS.csv' files. These are:
I will also be using the term 'report' to refer to uniquely identified entries (VAERS_ID) processed by the CDC and published as part of the public dataset.
Step 1 - Downloading
The entire public dataset can be downloaded by clicking this CDC link. Before you download the dataset, make sure to read the VAERS disclaimer, which can be found on the CDC website and in the disclaimer section of perVAERS.
Step 2 - Preprocessing
A lot of VAERS entries specify no patient age, despite the patient age being mentioned in the SYMPTOM_TEXT fields of the files ending with 'DATA.csv'. I therefore run a regular expression search on all DATA files in order to fill in the missing age fields.
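A minimal sketch of what such a regular expression search could look like. The pattern and the function name `fill_missing_age` are my assumptions for illustration, not the exact expression used on this site; real narratives are much messier. `AGE_YRS` and `SYMPTOM_TEXT` are the column names from the DATA files.

```python
import re

# Hypothetical pattern: matches phrases such as "45-year-old",
# "45 years old", "45 yo" or "45 y/o".
AGE_PATTERN = re.compile(
    r"\b(\d{1,3})\s*-?\s*(?:years?|yrs?|y)\s*[-/ ]?\s*(?:old|o)\b",
    re.IGNORECASE,
)

def fill_missing_age(row):
    """Return a copy of a DATA.csv row with AGE_YRS filled from SYMPTOM_TEXT.

    Rows that already carry an age are returned unchanged.
    """
    filled = dict(row)
    if (row.get("AGE_YRS") or "").strip():
        return filled
    match = AGE_PATTERN.search(row.get("SYMPTOM_TEXT") or "")
    if match:
        age = int(match.group(1))
        # Only accept plausible ages (0 to 119).
        if 0 <= age <= 119:
            filled["AGE_YRS"] = str(age)
    return filled
```

Rows read with `csv.DictReader` can be passed to this function one by one.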
The fields BIRTH_DEFECT, DIED, DISABLE, HOSPITAL, L_THREAT and (NOT) RECOVD are OR'ed into the medical descriptors from the SYMPTOMS files, so a report carries the corresponding term if either source indicates it.
Gary Hawkins has done some excellent work building a list of known batch codes and matching them to mistyped instances. This helps me assign the correct batch codes to about 70000 US reports. You can read more about his work on his Substack.
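Roughly, the merge could work like the sketch below. I am assuming here that the flag fields hold 'Y' when set and that RECOVD holds 'N' for "not recovered"; the term labels (the bare field names and 'NOT RECOVD') and the function name `merge_flags` are placeholders for illustration only.

```python
# Flag columns from the DATA files that are treated as extra descriptors.
FLAG_FIELDS = ["BIRTH_DEFECT", "DIED", "DISABLE", "HOSPITAL", "L_THREAT"]

def merge_flags(symptom_terms, data_row):
    """Return the set of descriptors for one report.

    A term is present if it is listed in the SYMPTOMS file OR if the
    matching flag is set in the DATA file (set union acts as the OR).
    """
    terms = set(symptom_terms)
    for field in FLAG_FIELDS:
        if (data_row.get(field) or "").strip().upper() == "Y":
            terms.add(field)  # placeholder label, assumption
    if (data_row.get("RECOVD") or "").strip().upper() == "N":
        terms.add("NOT RECOVD")  # placeholder label, assumption
    return terms
```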
Step 3 - Subsampling
While the VAERS database lists reports from dozens of countries, each regional subset has been subject to preprocessing according to the respective originating country's rules and regulations.
This results in high heterogeneity in the distribution of side effects between countries. An excellent example of this issue is the difference between Austria and Germany in how often COVID-19 infections are mentioned in reports related to vaccination against the disease:
|Country|Total reports (n)|Reports mentioning 'COVID-19' (x)|Percentage (p)|
It seems like the German Paul-Ehrlich-Institut actively filters out reports of vaccination failures, while the Austrian BASG / AGES does no such thing.
Since it is our approach to analyse the distribution of medical concepts across all reports, we require our dataset to introduce as little bias as possible.
It is for this reason that we include only the US subset of the VAERS data, which offers relatively consistent reporting behaviour. The names of these files begin with the year of the reports they contain.
Additionally, all reports that still lack age or gender after preprocessing, as well as those that refer to vaccination dates before the introduction of VAERS in 1990, have been filtered out.
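The filter can be sketched as follows. I am assuming that `VAX_DATE` is formatted as MM/DD/YYYY, that `SEX` holds 'M' or 'F' when known, and that reports with a missing or unparseable vaccination date are kept; `keep_report` is a made-up name.

```python
from datetime import datetime

def keep_report(row):
    """Decide whether a preprocessed DATA.csv row stays in the dataset."""
    age = (row.get("AGE_YRS") or "").strip()
    sex = (row.get("SEX") or "").strip().upper()
    # Drop reports without a usable age or gender.
    if not age or sex not in ("M", "F"):
        return False
    vax_date = (row.get("VAX_DATE") or "").strip()
    if vax_date:
        try:
            year = datetime.strptime(vax_date, "%m/%d/%Y").year
        except ValueError:
            return True  # assumption: unparseable dates are not filtered
        # Drop vaccinations dated before VAERS was introduced in 1990.
        if year < 1990:
            return False
    return True
```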
Step 4 - Age and gender stratification and adjustment
Each report is assigned to a number of study and control groups, according to the types of vaccines received by the patient.
The data of each group is then stratified into 30 subgroups, defined by an age range [0+, 0-4, 5-11, 12-18, 19-24, 25-34, 35-44, 45-59, 60-79, 80+] and the patient gender [M, F, MF].
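To illustrate the stratification, here is a small sketch. I am reading '0+' as "all ages" and 'MF' as "both genders combined", so that every report lands in four of the 30 strata; that reading, the truncation of fractional ages and the name `strata_for` are assumptions on my part.

```python
# (label, lowest age, highest age or None for open-ended)
AGE_RANGES = [
    ("0+", 0, None), ("0-4", 0, 4), ("5-11", 5, 11), ("12-18", 12, 18),
    ("19-24", 19, 24), ("25-34", 25, 34), ("35-44", 35, 44),
    ("45-59", 45, 59), ("60-79", 60, 79), ("80+", 80, None),
]
GENDERS = ["M", "F", "MF"]

def strata_for(age, sex):
    """Return every (age range, gender) stratum a report belongs to."""
    years = int(age)  # assumption: fractional ages are truncated
    strata = []
    for label, low, high in AGE_RANGES:
        if years >= low and (high is None or years <= high):
            # Each report counts for its own gender and for the combined group.
            for gender in (sex, "MF"):
                strata.append((label, gender))
    return strata
```

For example, `strata_for(30, "F")` places a report in the '0+' and '25-34' age ranges, each for 'F' and for 'MF'.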
In the next step, a weight between 0.0 and 1.0 is assigned to each age-gender combination ([M|F]+[0...119]) in the control group in order for both study and control group to share the same internal age and gender structure.
The weights are determined by dividing the fraction of individuals belonging to a specific age-gender combination in the study cohort by the fraction of individuals with the same combination of age and gender in the control cohort.
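Here is a rough sketch of that weight calculation. The raw quotient of the two fractions can exceed 1, so I rescale by the largest quotient to keep all weights between 0.0 and 1.0; that rescaling step and the function name `control_weights` are my assumptions for illustration.

```python
from collections import Counter

def control_weights(study, control):
    """Compute a weight per (sex, age) combination for the control cohort.

    study, control: lists of (sex, age) tuples, one tuple per report.
    """
    if not study or not control:
        return {}
    study_counts = Counter(study)
    control_counts = Counter(control)
    raw = {}
    for key, n_control in control_counts.items():
        study_fraction = study_counts.get(key, 0) / len(study)
        control_fraction = n_control / len(control)
        raw[key] = study_fraction / control_fraction
    # Assumption: rescale so that the largest weight equals 1.0.
    top = max(raw.values())
    return {key: (value / top if top > 0 else 0.0) for key, value in raw.items()}
```

With a study cohort of three ('F', 30) reports and one ('M', 30) report, and a control cohort of two of each, the weighted control cohort ends up with the same 3:1 structure as the study cohort.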
The result consists of 30 study and 30 control cohorts for every product type we are interested in. Each cohort contains a number of distinct adverse event reports. Each report is made up of a list of medical descriptors, the patient's age and gender, a list of the vaccines that had been administered before the report was created and, in the case of the control cohort, an age- and gender-specific weight.
Step 5 - From control group to placebo control group
A total of ~15000 medical concepts are mentioned across all reports in our filtered dataset.
We define around 30 vaccine groups, each with 30 study and 30 control cohorts.
This makes for a total of roughly 2 x 13 million reporting frequencies (about 15000 concepts x 30 vaccine groups x 30 cohorts, once for study and once for control). We determine them by counting the occurrences of each descriptor in the respective group (while applying the weights attached to each report in the case of the control cohorts) and dividing the number of occurrences by the total number of reports in the cohort.
We can now calculate ratios and differences of a concept's proportional occurrence for each study/control pair. For now, we are only interested in the ratios, or more specifically in the lower bound of the 50% confidence interval of our estimate of the incidence proportion ratio. If this lower bound is larger than 1 for any given descriptor, we use its inverse square root to adjust this descriptor's vaccine-, age- and gender-group-specific weight. This process is repeated twice.
To get a point-estimate of the incidence proportion ratio rp for every medical concept, we use the following formula:
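The counting step can be sketched like this. For weighted cohorts I divide by the sum of the weights, which equals the plain report count when every weight is 1; that choice of denominator and the name `reporting_frequencies` are assumptions for illustration.

```python
from collections import defaultdict

def reporting_frequencies(reports, weights=None):
    """Return the (weighted) reporting frequency of every descriptor.

    reports: list of descriptor collections, one per report.
    weights: optional list of report weights (control cohorts only).
    """
    if weights is None:
        weights = [1.0] * len(reports)
    totals = defaultdict(float)
    for terms, weight in zip(reports, weights):
        # A descriptor counts at most once per report.
        for term in set(terms):
            totals[term] += weight
    denominator = sum(weights)
    if denominator == 0:
        return {}
    return {term: count / denominator for term, count in totals.items()}
```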
is our point-estimate of the descriptor's incidence proportion ratio and:
In order to find the lower bound of the 50% confidence interval in which we suspect the incidence proportion ratio to reside, we first calculate the confidence interval for the natural logarithm of our incidence proportion ratio's point-estimate:
where z=0.67449 for a confidence interval of 50%. The antilog of the lower confidence limit is the lower bound of the 50% confidence interval of our incidence proportion ratio.
In other words: We are 75% confident that the incidence proportion for this combination of...
for patients who received this type of vaccine product differs from its incidence proportion in patients of the same age and gender group who received other types of products by a factor of at least:
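As a sketch of this calculation: the code below assumes the commonly used log method for ratio confidence intervals, where the standard error of ln(ratio) is sqrt(1/a - 1/n1 + 1/c - 1/n2). That choice of standard error, the symbols a, n1, c, n2 and the function name are my assumptions for illustration; only z=0.67449 is taken from the text.

```python
import math

Z_50 = 0.67449  # z for a two-sided 50% confidence interval

def ratio_lower_bound(a, n1, c, n2, z=Z_50):
    """Point-estimate and lower confidence bound of an incidence proportion ratio.

    a:  study reports mentioning the descriptor, n1: all study reports
    c:  (weighted) control reports mentioning it, n2: all control reports
    """
    if a <= 0 or c <= 0:
        raise ValueError("both groups need at least one mention")
    ratio = (a / n1) / (c / n2)
    # Standard error of ln(ratio) under the log method (assumption).
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    # Lower limit on the log scale, then the antilog.
    lower = math.exp(math.log(ratio) - z * se)
    return ratio, lower
```

For 30 mentions in 1000 study reports against 10 mentions in 1000 control reports, the point-estimate is 3 and the lower bound comes out at roughly 2.35.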
We then calculate the weight w with the formula:
All reports belonging to control cohorts have weights applied to all the medical concepts listed inside them, according to the respective terms' weights for the received vaccine product types and the patient age and gender group.
This process is repeated twice to attenuate the effect by which multiple vaccines mutually lower each other's proportional reporting differences for the same descriptor if the respective proportional reporting ratios are elevated in the same patient category.
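The weight rule of this step (inverse square root of the lower bound whenever that bound is larger than 1) can be written down in a few lines. Treating every other case as a weight of 1.0, and the function names, are my assumptions for illustration.

```python
import math

def descriptor_weight(lower_bound):
    """Weight for one descriptor in one vaccine, age and gender group."""
    if lower_bound > 1:
        return 1 / math.sqrt(lower_bound)
    return 1.0  # assumption: no adjustment when the bound is 1 or lower

def weighted_terms(terms, lower_bounds):
    """Attach a weight to every descriptor of a control report.

    lower_bounds: dict mapping descriptor -> lower bound of the 50% CI
    for the vaccine type, age and gender group the report belongs to.
    """
    return {term: descriptor_weight(lower_bounds.get(term, 1.0)) for term in terms}
```

A descriptor whose lower bound is 4 would therefore count with a weight of 0.5 in the control cohort.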
Step 6a - Calculating point-estimates of incidence proportion differences
To obtain proportional differences, we do something very similar to what we did in step 5 when adjusting the control groups, only this time we are interested in the point-estimate of the difference in incidence proportions between study and control, not in the ratios.
In order to calculate the bounds of our confidence interval we use the following formula by J. Haldane:
We will set a significance level of 0.005 in step 7 which means:
All that remains is to fill in the values and calculate the result. This yields 3 values for every comparison:
If the lower bound is larger than zero, it will result in a hit for the respective medical concept in this age, gender and vaccine product group.
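To make the hit rule concrete, here is a sketch that swaps in the simple Wald interval for the difference of two proportions. This is NOT the Haldane formula used on this site; it is a plain stand-in chosen only to show how the lower bound decides whether a hit is generated. The function name and the symbols a, n1, c, n2 are made up for this example.

```python
import math
from statistics import NormalDist

def difference_hit(a, n1, c, n2, alpha=0.005):
    """Difference in incidence proportions with a Wald interval (stand-in).

    Returns (point-estimate, lower bound, upper bound, is_hit).
    """
    p1, p2 = a / n1, c / n2
    diff = p1 - p2
    # One-sided alpha of 0.005 corresponds to a two-sided 99% interval.
    z = NormalDist().inv_cdf(1 - alpha)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    lower, upper = diff - z * se, diff + z * se
    return diff, lower, upper, lower > 0
```

With 30 mentions in 1000 study reports against 10 in 1000 control reports the lower bound stays above zero and a hit results; with 12 against 10 it does not.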
Step 6b - Calculating point-estimates of incidence proportion ratios
If you have carefully read steps 5 and 6a, you know what has to be done. The only difference to our calculations in step 5 is that we use the point-estimate as our result and calculate a suitable confidence interval, similar to step 6a, instead of only utilizing the lower bound as we did in step 5. I will briefly explain how I pick a suitable confidence interval in the next step.
Step 7 - Finding an appropriate significance level for signal detection
If the lower bound of the 99% confidence interval for our proportional difference (CL0.005) is greater than zero, a hit is generated.
I picked the confidence interval by making sure that some relatively rare signals already established in the literature (e.g. Narcolepsy for Influenza A) remained intact, while raising the requirement for inclusion in order to exclude more false positives.
Increasing the confidence interval to 99.9% would put our lower confidence limit at 0.0005. This would eliminate about 12.6% of signals, of which I expect roughly 90% to be valid. In order not to miss out on too many signals, a one-sided cutoff at 99.5% confidence seems adequate instead, which puts our confidence interval at 99% and our significance level at 0.005.
This means: For each resulting hit we are at least 99.5% confident that it is based on an actual increase in incidence proportion (p < 0.005).
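The relation between the one-sided significance level and the two-sided confidence interval can be checked with the standard normal distribution. The helper names below are made up for this example.

```python
from statistics import NormalDist

def z_for_lower_tail(alpha):
    """z value that leaves a probability of alpha in each tail."""
    return NormalDist().inv_cdf(1 - alpha)

def two_sided_confidence(alpha):
    """Confidence of the two-sided interval whose tails each hold alpha."""
    return 1 - 2 * alpha

# alpha = 0.005  -> 99% interval,   z is about 2.576
# alpha = 0.0005 -> 99.9% interval, z is about 3.291
# alpha = 0.25   -> 50% interval,   z is about 0.674 (the value used in step 5)
```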
Using the search function
Upon loading the website, a default filter is active that hides all charts whose medical concept belongs to any of the following categories:
The charts of the concepts "No adverse event" and "Immunisation" are also not displayed.
To remove the default filter, enter a single space into the search box on the navigation bar.
Upon entering any other search term, the default filter will also become inactive and all medical concepts not containing the search term will be hidden.
Should the search term begin with a '*' or '.', then the names of the categories to which the medical concept belongs will also be searched. Hence, searching for '*cardiac' will provide different results from searching for 'cardiac'.
A leading '_' includes only results that begin with the search term.
A trailing '_' includes only results that end with the search term.
Use both trailing and leading '_' for an exact match (e.g. '_myocarditis_').
Searches are not case-sensitive.
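The matching rules above can be summarised in a short sketch. This is not the site's actual code: the function name `matches`, the order in which the prefixes are stripped, the application of the '_' anchors to category names and the example category are all assumptions for illustration. The default filter is not modelled.

```python
def matches(query, concept, categories=()):
    """Return True if a chart should stay visible for the given search term."""
    q = query.strip().lower()
    if not q:
        return True  # a blank search shows everything
    # A leading '*' or '.' also searches the category names.
    search_categories = q[0] in "*."
    if search_categories:
        q = q[1:]
    anchored_start = q.startswith("_")
    anchored_end = q.endswith("_")
    term = q.strip("_")
    if not term:
        return True

    def hit(text):
        text = text.lower()  # searches are not case-sensitive
        if anchored_start and anchored_end:
            return text == term
        if anchored_start:
            return text.startswith(term)
        if anchored_end:
            return text.endswith(term)
        return term in text

    candidates = [concept] + (list(categories) if search_categories else [])
    return any(hit(text) for text in candidates)
```

For example, `matches("*cardiac", "Myocarditis", ["Cardiac disorders"])` is true, while the same call with the plain term "cardiac" is false.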