perVAERS Help

What is this place?

This website introduces a method called PRD (proportional reporting difference) analysis for finding and quantifying safety signals in post-approval vaccine pharmacovigilance data. The generated hits are listed on the main page and are specific to certain types of immunisation products.

I decided this was necessary because neither the regulatory agencies nor the medical journals have published an analysis of this kind, which can give a comprehensive overview of the side effect profile of substances.

On January 29th of 2021 the CDC released a document titled 'Vaccine Adverse Event Reporting System (VAERS) Standard Operating Procedures for COVID-19' (for official use only) which announced the CDC's intention:

  • "CDC will perform Proportional Reporting Ratio (PRR) analysis [...], excluding laboratory results, to identify AEs that are disproportionately reported relative to other AEs. [...] To determine if results need further clinical review, consider if clinically important, unexpected findings, seriousness, specific syndrome or diagnosis rather than non-specific symptoms"(CDC, 2022-09-26)
Alas, the CDC never followed up on its promise, or at least it did not release its results. I therefore created this website to do it for them. To be more precise, we are doing a PRD analysis instead. As the names suggest, a PRR analysis is very similar to a PRD analysis. The differences between the two concepts are as follows:


PRR:

  • Proportional Reporting Ratio
  • Ranges from 0 to infinity
  • Represents the factor by which the proportion in question is increased or decreased compared to its proportion in a reference group of reports for other medications
  • The PRR contains no information about the size of the two proportions being compared. Proportions of 10/100 and 1/100 result in a ratio of 10, just as proportions of 10/1000 and 1/1000 do.

PRD:

  • Proportional Reporting Difference
  • Ranges from -1 to 1
  • If larger than 0, it represents the excess proportion compared to the proportion of occurrence in a reference group of reports for other medications
  • The PRD retains quantitative information which hints at how commonly a side effect occurs. Proportions of 10/100 and 1/100 result in a difference of 9%, while proportions of 10/1000 and 1/1000 result in a difference of 0.9%, which suggests that the concept in scenario 1 occurs more frequently than the concept in scenario 2 (and/or its occurrence is more likely to be reported).

Both methods generate nearly the same signals, depending on the method used for calculating the respective confidence intervals, but they rank them differently. The PRD supplies information that is more useful for doctors and patients, since it hints at how frequently a side effect occurs, while the PRR is more suitable for helping regulatory agencies and pharmaceutical companies detect novel safety signals.
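The two scenarios from the comparison above can be reproduced with a few lines of Python. This is only an illustration of the arithmetic; the function names `prr` and `prd` are chosen here and are not taken from the perVAERS code.

```python
def prr(x_s, n_s, x_c, n_c):
    """Proportional reporting ratio: study proportion divided by control proportion."""
    return (x_s / n_s) / (x_c / n_c)


def prd(x_s, n_s, x_c, n_c):
    """Proportional reporting difference: study proportion minus control proportion."""
    return x_s / n_s - x_c / n_c


# Scenario 1: 10/100 vs. 1/100; scenario 2: 10/1000 vs. 1/1000
for x_s, n_s, x_c, n_c in [(10, 100, 1, 100), (10, 1000, 1, 1000)]:
    print(f"PRR = {prr(x_s, n_s, x_c, n_c):.1f}, PRD = {prd(x_s, n_s, x_c, n_c):.2%}")
```

Both scenarios give the same ratio, while the difference shrinks by a factor of ten in the second one.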

    What constitutes an adverse event?


  • Adverse reactions, also known as side effects, are considered to be caused by a vaccine. Usually, vaccine side effects are identified during clinical trials. The intensity of these reactions may range from mild to moderate to severe. They often resolve on their own, and may or may not require medical intervention. (CDC, 2022-09-26)
    The terms 'adverse event', 'descriptor', 'medical concept', 'symptom' and 'side effect' will often be used interchangeably and correspond to the titles of the charts displayed on the main page. With a few exceptions, they can be found in the VAERS public dataset files ending with 'SYMPTOMS.csv' and correspond to 'preferred terms' as defined by MedDRA (Medical Dictionary for Regulatory Activities):

  • Each member of the next level, “Preferred Terms” (PTs), is a distinct descriptor (single medical concept) for a symptom, sign, disease diagnosis, therapeutic indication, investigation, surgical or medical procedure, and medical social or family history characteristic. (MedDRA, 2022-09-26)

    Additionally, some fields from the VAERS files ending with 'DATA.csv' have been added as terms and/or merged with the MedDRA terms from the 'SYMPTOMS.csv' files. These are:

  • Birth defect (BIRTH_DEFECT)
  • Death (DIED)
  • Disability (DISABLE)
  • Hospitalisation (HOSPITAL)
  • Life threatening event (L_THREAT)
  • No recovery (RECOVD, inverted)
    I will also be using the term 'report' to refer to uniquely identified entries (VAERS_ID) processed by the CDC and published as part of the public dataset.

    Step 1 - Downloading

    The entire public dataset can be downloaded by clicking this CDC link. Before you download the dataset, make sure to read the VAERS disclaimer, which can be found on the CDC website and in the disclaimer section of perVAERS.

    Step 2 - Preprocessing

    A lot of VAERS entries specify no patient age, despite the patient age being mentioned in the SYMPTOM_TEXT fields of the files ending with 'DATA.csv'. I therefore run a regular expression search on all DATA files in order to fill in the missing age fields.
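The regular expression actually used by perVAERS is not documented here. As a minimal sketch of the idea, the following assumes a hypothetical pattern that catches common phrasings such as "45-year-old" or "72 y/o"; the pattern, the function name `fill_missing_age` and the plausibility cut-off of 120 years are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: matches phrases such as "45-year-old", "45 years old" or "72 y/o".
AGE_PATTERN = re.compile(r"\b(\d{1,3})[\s-]*(?:years?[\s-]*old|y/?o)\b", re.IGNORECASE)


def fill_missing_age(age_yrs, symptom_text):
    """Return the reported age, or an age parsed from the free text if it is missing."""
    if age_yrs not in (None, ""):
        return float(age_yrs)
    match = AGE_PATTERN.search(symptom_text or "")
    # Assumption: ages of 120 and above are treated as parsing errors.
    if match and int(match.group(1)) < 120:
        return float(match.group(1))
    return None
```

A real implementation would need more patterns (months, weeks, written-out numbers) and some manual validation of the matches.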

    The fields BIRTH_DEFECT, DIED, DISABLE, HOSPITAL, L_THREAT, (NOT) RECOVD are OR'ed into the medical descriptors in the SYMPTOMS files.
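A minimal sketch of this merge, assuming that each DATA row is available as a dictionary and that set flags are encoded as "Y" (and an explicit non-recovery as "N"). The term labels follow the list of added terms above; the function name `merge_flags` is hypothetical.

```python
# Labels follow the list of added terms; the "Y"/"N" encoding is an assumption.
FLAG_TERMS = {
    "BIRTH_DEFECT": "Birth defect",
    "DIED": "Death",
    "DISABLE": "Disability",
    "HOSPITAL": "Hospitalisation",
    "L_THREAT": "Life threatening event",
}


def merge_flags(data_row, symptom_terms):
    """Return the report's descriptor set extended by its outcome flags."""
    terms = set(symptom_terms)
    for field, label in FLAG_TERMS.items():
        if data_row.get(field) == "Y":
            terms.add(label)
    # RECOVD is inverted: an explicit "N" becomes the term "No recovery".
    if data_row.get("RECOVD") == "N":
        terms.add("No recovery")
    return terms
```

Because the result is a set, a term is present if either the flag is set or the descriptor was already listed, which is the OR behaviour described above.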

    Gary Hawkins has done some excellent work building a list of known batch codes and matching them to mistyped instances. This helps me assign the correct batch codes to about 70,000 US reports. You can read more about his work on his substack.

    Step 3 - Subsampling

    While the VAERS database lists reports from dozens of countries, each regional subset has been subject to preprocessing according to the respective originating country's rules and regulations.

    This results in a high heterogeneity of the side effect distribution between countries. An excellent example of this issue is the difference between Austria and Germany in how often COVID-19 infections are mentioned in reports related to vaccinations against the disease:

    Country | Total reports (n) | Reports mentioning 'COVID-19' (x) | Percentage (p)
    AT      | 102,915           | 92,706                            | 90.1%
    DE      | 63,794            | 2,563                             | 4.0%

    It seems like the German Paul-Ehrlich-Institut actively filters out reports of vaccination failures, while the Austrian BASG / AGES does no such thing.

    Since it is our approach to analyse the distribution of medical concepts across all reports, we require our dataset to introduce as little bias as possible.

    It is for this reason that we include only the US subset of the VAERS data, which offers relatively consistent reporting behaviour. The names of these files begin with the year of the reports they contain.

    Additionally, all reports that lack age or gender after preprocessing, as well as those that refer to vaccination dates before the introduction of VAERS in 1990, have been filtered out.

    Step 4 - Age and gender stratification and adjustment

    Each report is assigned to a number of study and control groups, according to the types of vaccines received by the patient.

    The data of all groups is then stratified into 30 groups each, defined by an age range [0+, 0-4, 5-11, 12-18, 19-24, 25-34, 35-44, 45-59, 60-79, 80+] and the patient gender [M, F, MF].
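The 10 age ranges and 3 gender groups give 10 x 3 = 30 strata. The sketch below assumes that "0+" covers all ages and "MF" pools both genders, so that a single report falls into up to four strata at once; this reading of the overlapping ranges, and the function name `strata_for`, are assumptions.

```python
AGE_RANGES = [(0, 4), (5, 11), (12, 18), (19, 24), (25, 34), (35, 44), (45, 59), (60, 79)]


def strata_for(age, gender):
    """Return every (age range, gender) stratum a report with this age and gender falls into."""
    years = int(age)
    age_labels = ["0+"]  # assumed to cover all ages
    if years >= 80:
        age_labels.append("80+")
    else:
        age_labels += [f"{lo}-{hi}" for lo, hi in AGE_RANGES if lo <= years <= hi]
    gender_labels = [gender, "MF"]  # "MF" is assumed to pool both genders
    return [(a, g) for a in age_labels for g in gender_labels]
```

For example, a 30-year-old female patient would be counted in the strata 0+/F, 0+/MF, 25-34/F and 25-34/MF.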

    In the next step, a weight between 0.0 and 1.0 is assigned to each age-gender combination ([M|F]+[0...119]) in the control group in order for both study and control group to share the same internal age and gender structure.

    The weights are determined by dividing the fraction of individuals belonging to a specific age-gender combination in the study cohort by the fraction of individuals with the same combination of age and gender in the control cohort.
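As a sketch of this weighting, the following divides the study fraction by the control fraction for every age-gender combination. The division alone can produce values above 1.0, so the sketch rescales by the largest ratio to keep all weights between 0.0 and 1.0; that rescaling step and the function name `control_weights` are assumptions, not confirmed details of the perVAERS implementation.

```python
from collections import Counter


def control_weights(study, control):
    """Weight per (gender, age) key so the weighted control cohort mirrors the study cohort."""
    study_counts, control_counts = Counter(study), Counter(control)
    raw = {
        key: (study_counts[key] / len(study)) / (count / len(control))
        for key, count in control_counts.items()
    }
    # Assumption: rescale by the largest ratio so that all weights fall between 0.0 and 1.0.
    largest = max(raw.values())
    return {key: (value / largest if largest > 0 else 0.0) for key, value in raw.items()}


study = [("F", 30)] * 3 + [("M", 30)]
control = [("F", 30)] * 2 + [("M", 30)] * 2
print(control_weights(study, control))
```

In this toy example the study cohort is 75% female and the control cohort 50% female; after weighting, the two control reports for men count only a third as much as those for women, which restores the 3:1 structure of the study cohort.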

    The result consists of 30 study and 30 control cohorts for every product type we are interested in. Each cohort contains a number of distinct adverse event reports, each of which is made up of a list of medical descriptors, the patient's age and gender, a list of vaccines that had been administered before the report was created and, in the case of the control cohort, an age- and gender-specific weight.

    Step 5 - From control group to placebo control group

    A total of ~15,000 medical concepts are mentioned across all reports in our filtered dataset.

    The number of vaccine groups we define is around 30 and there are 30 study and control cohorts each in every group.

    This makes for a total of roughly 2 x 13 million reporting frequencies (15,000 concepts x 30 groups x 30 cohorts, once for study and once for control). We now determine them by counting the occurrences of each descriptor in the respective cohort (while applying the weights attached to each report in the case of the control cohorts) and dividing the number of occurrences by the total number of reports in the cohort.
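A minimal sketch of this counting step for a single cohort. Each report is represented as a pair of descriptors and a weight, with study reports carrying a weight of 1.0. Dividing by the summed weights rather than the raw report count is an assumption made here so that weighted proportions stay between 0 and 1; for unweighted cohorts both are identical.

```python
from collections import defaultdict


def reporting_proportions(reports):
    """Map each descriptor to its (weighted) share of reports in one cohort.

    `reports` is a list of (descriptors, weight) pairs; study reports use weight 1.0.
    """
    totals = defaultdict(float)
    weight_sum = 0.0
    for descriptors, weight in reports:
        weight_sum += weight
        for term in set(descriptors):  # count each descriptor once per report
            totals[term] += weight
    # Assumption: the denominator is the sum of weights, not the raw report count.
    return {term: total / weight_sum for term, total in totals.items()}
```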

    We can now calculate ratios and differences of a concept's proportional occurrence for each study/control pair. For now, we are only interested in the ratios, or more specifically in the lower bound of the 50% confidence interval of what we estimate the incidence proportion ratio to be. If this lower bound is larger than 1 for a given descriptor, we use its inverse square root to adjust this descriptor's weight, which is specific to the vaccine, age and gender group. This process is repeated twice.

    To get a point-estimate of the incidence proportion ratio r_p for every medical concept, we use the following formula:

    $$ r_p = \frac{p_s}{p_c} = \frac{x_s / n_s}{x_c / n_c} $$

    where r_p is our point-estimate of the descriptor's incidence proportion ratio and:

  • n_s is the total number of reports in the study cohort
  • x_s is the number of reports mentioning the descriptor in the study cohort
  • p_s is the proportion of reports mentioning the descriptor in the study cohort

    and:

  • n_c is the total number of reports in the control cohort
  • x_c is the number of reports mentioning the descriptor in the control cohort
  • p_c is the proportion of reports mentioning the descriptor in the control cohort

    In order to find the lower bound of the 50% confidence interval in which we suspect the incidence proportion ratio to reside, we first calculate the confidence interval for the natural logarithm of our incidence proportion ratio's point-estimate:

    [Formula: confidence interval of the natural logarithm of the incidence proportion ratio]

    where z = 0.67449 for a confidence interval of 50%. The antilog of the lower confidence limit is the lower bound of the 50% confidence interval of our incidence proportion ratio.

    In other words: we are 75% confident (a two-sided 50% interval leaves 25% below its lower limit) that the incidence proportion for this combination of...

  • age group
  • gender group
  • medical descriptor

    ...in patients who received this type of vaccine product differs from its incidence proportion in patients of the same age and gender group who received other types of products by a factor of at least the lower limit of the 50% confidence interval of the incidence proportion ratio.

    We then calculate the weight w from this lower limit, written here as r_lower (applied only if r_lower is larger than 1, as described above):

    $$ w = \frac{1}{\sqrt{r_{lower}}} $$

    All reports belonging to control cohorts have weights applied to all the medical concepts listed inside them, according to the respective terms' weights for the received vaccine product types and the patient age and gender group.

    This process is repeated twice to attenuate the effect by which multiple vaccines mutually lower each other's proportional reporting differences for the same descriptor if the respective proportional reporting ratios are increased in the same type of patient category.
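The weight calculation of this step can be sketched as follows. The standard error used here is the one from the common log method for a ratio of two proportions, sqrt(1/x_s - 1/n_s + 1/x_c - 1/n_c); that the site uses exactly this expression is an assumption, as are the function names. Both counts must be larger than zero for the logarithm to be defined.

```python
import math

Z_50 = 0.67449  # z-value for a 50% confidence interval, as stated in the text


def ratio_lower_bound(x_s, n_s, x_c, n_c, z=Z_50):
    """Lower confidence limit of the incidence proportion ratio (log method)."""
    r_p = (x_s / n_s) / (x_c / n_c)
    # Assumption: standard error of ln(r_p) from the common log method.
    se = math.sqrt(1 / x_s - 1 / n_s + 1 / x_c - 1 / n_c)
    return math.exp(math.log(r_p) - z * se)


def descriptor_weight(x_s, n_s, x_c, n_c):
    """Inverse square root of the lower bound if it exceeds 1, otherwise no adjustment."""
    lower = ratio_lower_bound(x_s, n_s, x_c, n_c)
    return 1 / math.sqrt(lower) if lower > 1 else 1.0
```

A descriptor that is reported equally often in both cohorts keeps a weight of 1.0, while a descriptor that is clearly over-reported for a product is down-weighted in the control cohorts of the other products.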

    Step 6a - Calculating point-estimates of incidence proportion differences

    In order to yield proportional differences, we do something very similar to what we did in step 5 when adjusting the control groups, only this time we are interested in the point-estimate of the difference in incidence proportions between study and control, not in the ratios.

    $$ d_p = p_s - p_c = \frac{x_s}{n_s} - \frac{x_c}{n_c} $$

    where d_p is our point-estimate of the descriptor's difference in incidence proportions and:

  • n_s is the total number of reports in the study cohort
  • x_s is the number of reports mentioning the descriptor in the study cohort
  • p_s is the proportion of reports mentioning the descriptor in the study cohort

    and:

  • n_c is the total number of reports in the control cohort
  • x_c is the number of reports mentioning the descriptor in the control cohort
  • p_c is the proportion of reports mentioning the descriptor in the control cohort

    In order to calculate the bounds of our confidence interval, we use the following formula by J. Haldane:

    [Formula: confidence interval for the difference in proportions]

    We will set a significance level of 0.005 in step 7, which corresponds to z = 2.5758 for a 99% confidence interval.

    What remains for us to do is to fill in the values and calculate the result. This yields 3 values for every comparison:

  • The upper bound of the incidence proportion difference's 99% confidence interval
  • The point-estimate of our incidence proportion difference
  • The lower bound of the incidence proportion difference's 99% confidence interval

    If the lower bound is larger than zero, it results in a hit for the respective medical concept in this age, gender and vaccine product group.
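To illustrate how the three values and the hit rule fit together, the sketch below uses the simple Wald interval for a difference of two proportions. This is a deliberately simpler stand-in and not Haldane's formula, so its bounds will differ somewhat from those shown on the site; the function names are hypothetical.

```python
import math

Z_99 = 2.5758  # z-value for a two-sided 99% confidence interval


def difference_interval(x_s, n_s, x_c, n_c, z=Z_99):
    """Return (lower, point estimate, upper) for the difference p_s - p_c (Wald interval)."""
    p_s, p_c = x_s / n_s, x_c / n_c
    d = p_s - p_c
    half_width = z * math.sqrt(p_s * (1 - p_s) / n_s + p_c * (1 - p_c) / n_c)
    return d - half_width, d, d + half_width


def is_hit(x_s, n_s, x_c, n_c):
    """A hit is generated when the lower bound of the interval is larger than zero."""
    lower, _, _ = difference_interval(x_s, n_s, x_c, n_c)
    return lower > 0
```

With 100/1000 reports in the study cohort against 10/1000 in the control cohort the lower bound stays well above zero, whereas 12/1000 against 10/1000 does not produce a hit.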

    Step 6b - Calculating point-estimates of incidence proportion ratios

    If you have carefully read steps 5 and 6a, you know what has to be done. The only difference from our calculations in step 5 is that we use the point-estimate as our result and calculate a suitable confidence interval, similar to step 6a, instead of only using the lower bound as we did in step 5. I will briefly explain how I pick a suitable confidence interval in the next step.

    Step 7 - Finding an appropriate significance level for signal detection

    If the lower bound of the 99% confidence interval for our proportional difference (CL_0.005) is greater than zero, a hit is generated.

    I picked the confidence level by making sure that some relatively rare signals already established in the literature (e.g. narcolepsy for Influenza A) remained intact, while raising the requirement for inclusion in order to exclude more false positives.

    Increasing the confidence interval to 99.9% would put our lower confidence limit at 0.0005. This would eliminate about 12.6% of signals, of which I expect roughly 90% to be valid. In order not to miss too many signals, a cutoff at 99.5% confidence seems adequate instead, which puts our confidence interval at 99% and our significance level at 0.005.

    So what this means is: for each resulting hit we are at least 99.5% confident that it is based on an actual increase in incidence proportion (p < 0.005).
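The relationship between a two-sided confidence interval, its one-sided tail and the corresponding z-value can be checked with Python's standard library. This is only a helper for the numbers quoted in this section; the function name `z_for_interval` is chosen here for illustration.

```python
from statistics import NormalDist


def z_for_interval(confidence):
    """z-value for a two-sided confidence interval, e.g. 0.99 for 99%."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


for confidence in (0.50, 0.99, 0.999):
    tail = (1 - confidence) / 2  # probability mass below the lower confidence limit
    print(f"{confidence:.1%} interval: z = {z_for_interval(confidence):.5f}, tail = {tail:.4f}")
```

A 99% interval leaves 0.005 below its lower limit, which is where the 99.5% confidence for each hit comes from; the 50% interval from step 5 likewise leaves 0.25 and yields the 75% quoted there.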