Data Analysis for Public Health Practitioners

Course dataset — South Africa STI/HIV district indicators

Every lesson uses simulated DHIS2 exports covering 52 districts across all 9 provinces, April 2023 – September 2025. You will work with messy raw exports, inconsistent province prefixes, data quality flags, and cross-period comparisons — exactly what you encounter in practice.

Syphilis screening rate1st ANC visit (target ≥95%)

BPG treatment coverage1st dose after positive test

ANC syphilis positivityRate per 100 ANC clients

HIV positivity 15–24 yrsExcl. ANC clients

HIV positivity 25–49 yrsExcl. ANC clients

MUS incidence rateMale Urethritis Syndrome

Teen delivery rateFacility births, 10–19 years

Male STI screening15–49 years, new episodes

Course map

P1The Whole Game — end-to-end before we go deep2 lessons P2Define & Collect — from programme question to data3 lessons P3Clean & Prepare — Power Query for messy health data4 lessons P4Explore — pivot tables, summaries, and EDA3 lessons P5Analyse — comparisons, trends, and rankings4 lessons P6Visualise & Communicate — Power BI storytelling4 lessons P7Act & Monitor — scorecards, alerts, donor reports3 lessons P8Scale with R — reproducible analysis at district level5 lessons

Part One

The Whole Game

Before learning any single tool, you run the complete pipeline once. Messy DHIS2 export → clean data → insight → action. This section exists so every later lesson has a destination.

Lesson 1.1

From export to answer in 30 minutes

A guided walkthrough of the full 8-stage analysis cycle using a single DHIS2 file. No theory — just execution. You will produce one output: a ranked list of districts with syphilis screening below 80%.

OrientationExcel

Programme scenario

Your provincial STI coordinator has asked for a list of districts not hitting the 95% syphilis screening target at ANC. She needs it by 3pm. You have a DHIS2 export for April 2024 — 52 districts, 12 columns. What do you do first?

01
Opening the file: what are you actually looking at?
Reading column headers. Identifying the unit of observation (one row = one district-month). Spotting numeric vs rate columns. Checking for blanks.
ExcelData literacy
02
Filter, sort, and flag — finding the underperformers
Using AutoFilter on Syphilis screening at 1st ANC visit rate. Sorting descending. Adding a conditional format column: Below target / On target.
Excel
03
Turning the list into a decision — the one-sentence finding
Writing a programme-facing finding: "In April 2024, 9 of 52 districts reported syphilis screening below 80%, concentrated in Gauteng and the Northern Cape."
Communication

Practice task 1.1

Open the April 2025 dataset. Produce the same ranked list. How many districts have improved since 2024? Write one finding sentence suitable for a district review meeting.

Lesson 1.2

The 8-stage analysis cycle — your mental map

Every analysis task you will ever do fits the same eight stages. This lesson maps each stage to the course and to the cheat sheet.

FrameworkConcept

01
Stage 1–2: Define → Collect
The most skipped steps. What question are we actually answering? Which DHIS2 indicator is the right one? Covered fully in Part 2.
02
Stage 3–4: Clean → Explore
Why 60% of analysis time is data preparation. Why EDA before analysis saves rework.
03
Stage 5–8: Analyse → Communicate → Act → Monitor
The outputs programme managers need: a finding, a chart, a dashboard, a decision. Covered in Parts 5–7.

Part Two

Define & Collect

Most analysis errors happen before a single cell is touched. Translate a programme question into a data request — and understand what DHIS2 actually delivers.

Lesson 2.1

From programme problem to measurable question

The difference between "our syphilis numbers are bad" and "which districts have ANC screening below 80% for two consecutive quarters?" Using the Problem–Objective–Success Criteria framework.

DefineConcept

Programme scenario

Donor review in six weeks. The programme manager says: "We need to show progress on syphilis." Before pulling any data — Problem → Positivity has not declined in EC; Objective → Identify EC districts with declining vs rising positivity 2023–2025; Success → Ranked district table with trend direction.

01
The Problem–Objective–Success framework
Using the cheat sheet Stage 1 template. Worked example: syphilis in the Eastern Cape. Why "we need data" is not a question.
02
Choosing the right indicator
When to use rate vs count. Why Syphilis screening at 1st ANC visit rate and BPG dose 1st treatment coverage measure different things. Denominator traps.

Practice task 2.1

A colleague says: "KZN teen pregnancy is a problem." Complete the Problem–Objective–Success table using Delivery in 10-19 years in facility rate. Identify three KZN districts where the rate exceeds 20% in 2024.

Lesson 2.2

Reading a DHIS2 export — what the columns actually mean

Anatomy of the dataset: periodname, organisationunitname, and each indicator. What "rate" means in DHIS2 — it's a calculation, not a raw count.

CollectExcel

01
Anatomy of the DHIS2 export
periodname and organisationunitname are your time and place variables. The two-letter province prefix in district names (ec, kz, gp…) — and why it matters for grouping.
ExcelData literacy
02
What values above 100% mean — and when to flag them
Our dataset includes screening rates >100% (e.g. Thabo Mofutsanyana 116.5%, Capricorn 294.7%). These are DHIS2 data quality artefacts. How to identify and note them without deleting them.
Data quality

Lesson 2.3

Combining multiple monthly exports

You have six separate files — one per time point. Power Query's folder import appends them automatically. First step toward a longitudinal dataset.

CollectPower Query

Programme scenario

You have six DHIS2 exports — April 2023 through September 2025. The donor wants a trend table showing national screening rates across all periods. You need one combined table, not six files.

01
Get Data → Folder: importing all exports at once
Setting up the folder import. Expanding the Content column. Why the header row appears in every file and how Power Query skips it automatically.
Power Query
02
Result: 1,612 rows — verifying the append was correct
Checking row count (52 districts × ~31 periods). Confirming periodname has the right unique values. Adding a refresh step comment for the next analyst.
Power QueryReproducibility

Power Query M — folder import

// Step 1: Connect to folder
Source = Folder.Files("C:\Data\STI_Exports"),

// Step 2: Keep only .xlsx files
Filtered = Table.SelectRows(Source, each
  [Extension] = ".xlsx"),

// Step 3: Expand and combine
Combined = Table.Combine(
  List.Transform(Filtered[Content],
    each Excel.Workbook(_){0}[Data])),

// Step 4: Promote headers
WithHeaders = Table.PromoteHeaders(Combined)

Part Three

Clean & Prepare

The province prefixes are lowercase. District names have trailing spaces. Rates above 100% exist. This part teaches Power Query techniques using exactly the problems in this dataset.

Lesson 3.1

Extracting the province from the district name

The organisationunitname column encodes the province as a two-letter lowercase prefix (ec, kz, gp, lp…). Creating a clean province column is the most important cleaning step for provincial analysis.

CleanPower Query

Programme scenario

You need a province-level summary for the quarterly review. But the data has no province column — only long district names like kz eThekwini Metropolitan Municipality. You need to extract KZ from those two leading characters.

01
Extract First Characters — Power Query's built-in tool
Add Column → Extract → First Characters → 2. Rename to prov_code. Transform → Uppercase. Result: EC, KZ, GP, LP, MP, FS, WC, NC, NW.
Power Query
02
Mapping codes to full province names with a lookup table
Creating a small EC→Eastern Cape lookup table. Merging it into the main query. Why this is better than Replace Values — it's reusable and auditable.
Power QueryData modelling

Power Query M — extract province

AddProvCode = Table.AddColumn(
  Source, "prov_code",
  each Text.Upper(Text.Start([organisationunitname], 2)),
  type text
),

CleanDistrict = Table.AddColumn(
  AddProvCode, "district",
  each Text.Trim(Text.Middle([organisationunitname], 3)),
  type text
)

Lesson 3.2

Fixing the period column — text dates to sortable dates

periodname is stored as text ("April 2023"). Excel sorts "April" before "August" alphabetically, not chronologically. Converting to a real date type enables correct trend charts.

CleanPower Query

01
Converting "April 2023" to a proper Date column
Appending " 01" to make it parseable. Using Date.FromText. Why sortability matters for every chart you will ever make from this data.
Power Query
02
Adding Year and Quarter columns for grouping
Extracting Year and Quarter from the date column. These become your row/column headers in every pivot table going forward.
Power Query

Lesson 3.3

Flagging data quality issues — values above 100% and blanks

The syphilis screening column contains rates above 100% in several districts (e.g. June 2025: Thabo Mofutsanyana 116.5%, Capricorn 294.7%). The right response is to flag, not delete.

Data qualityExcel

Programme scenario

During review you notice Capricorn District (Limpopo) reported BPG treatment coverage = 294.7% in June 2025. This cannot be real. Before including it in any trend chart, you need a flag column — deleting it would hide the problem from programme leadership.

01
Adding a data quality flag column in Power Query
Conditional column: IF screening rate >100 THEN "Flag" ELSE "OK". Keeping the original value — never overwriting raw data.
Power QueryData quality
02
Documenting data quality for programme managers
Writing a one-line data quality note in the analysis header: "3 district-months flagged with screening >100% and excluded from national averages." Why transparency matters for donor reporting.
Communication

Lesson 3.4

Automating the clean — refresh with one click

Every transformation you built in Power Query is recorded as a step. When next month's DHIS2 export arrives, you drop it in the folder and click Refresh. The entire pipeline re-runs in seconds.

AutomationPower Query

01
Reviewing Applied Steps — your audit trail
Reading the steps pane from top to bottom. Naming each step descriptively (not "Removed Columns1"). Why named steps matter when a colleague picks up your file in six months.
Power QueryReproducibility
02
Test: add new data and verify the pipeline
Drop a new file in the folder. Refresh All. Confirm row count increases. Confirm province column still populates. Confirm data quality flags run. This is your regression test.
Power Query

Practice task 3.4 — mini project

Build the complete cleaning pipeline for the 6-file dataset. Your cleaned output must have: (1) a province column, (2) a district column without prefix, (3) a period_date as a real Date type, (4) a dq_flag column for values >100%. Verify it refreshes correctly when you add a new file.

Part Four

Explore

Before you analyse, you look. Pivot tables, summary statistics, and distribution checks — done on your combined 1,600-row dataset. This is where surprises live.

Lesson 4.1

Summary statistics — national picture at a glance

Using AVERAGE, MEDIAN, MIN, MAX, and STDEV to characterise each indicator nationally. The national syphilis screening average improved from 54% (April 2023) to 94% (September 2025) — what does that distribution look like?

ExploreExcel

Key finding from the data

National average syphilis screening at 1st ANC: April 2023 = 54.2% → April 2024 = 86.7% → April 2025 = 92.7%. A 38-percentage-point improvement in two years. But the average hides KZN districts consistently above 95% while some Gauteng metros were at 5–18% in April 2023.

01
AVERAGEIFS and COUNTIFS — conditional summaries by province
Calculating the provincial average screening rate for each period using =AVERAGEIFS(rate_col, prov_col, "KZ", period_col, "April 2024"). Building a 9-province summary table.
Excel
02
What does the distribution look like? Min, max, spread
In April 2023, GP districts ranged from 5.8% to 89.4%. In September 2025, the range narrowed to 82.9%–98.6%. Distribution shape tells you whether the problem is systemic or concentrated.
ExcelEDA

Lesson 4.2

Pivot tables for health data — the workhorse of EDA

Pivot tables answer "what is the value of X, broken down by Y?" in seconds. Three essential configurations every analyst should know.

ExploreExcel

01
Configuration 1: Province × Period cross-tab
Rows: province, Columns: periodname (April 2023, April 2024, April 2025), Values: AVERAGE of syphilis screening rate. Immediately shows Gauteng's dramatic catch-up.
Excel
02
Configuration 2: Top/bottom district ranking
Rows: district, Values: AVERAGE screening rate, Sort descending, Show Top 10 / Bottom 10. Identifies chronic underperformers for targeted support visits.
Excel
03
Configuration 3: % below target by province
Rows: province, Values: COUNTIF districts below 80%, COUNTA districts. Calculated field: % below target. Conditional formatting to flag provinces needing support visits.
Excel

Practice task 4.2

Build all three pivot configurations. Answer: (1) Which province had the largest improvement in screening rate between April 2023 and September 2025? (2) Which three districts have the highest Antenatal syphilis test positive rate in the most recent period?

Lesson 4.3

Exploring relationships between indicators

Does high syphilis positivity correlate with low BPG treatment coverage? Scatter plots and CORREL before formal analysis.

ExploreExcel

01
Scatter plot: syphilis positivity vs BPG coverage
Visualising districts where positivity is high but treatment coverage is low — the worst-case quadrant. How to add quadrant lines at national average.
Excel
02
=CORREL() — quantifying relationships
Calculating the correlation between HIV positive 15-24 and HIV positive 25-49 rates across districts. Interpreting r in a programme context.
ExcelStatistics

Part Five

Analyse

Moving from "what does the data look like?" to "what is actually happening?" Trend analysis, comparative ranking, performance gap measurement, and outlier detection.

Lesson 5.1

Trend analysis — is this indicator improving?

Using SLOPE() and trend lines to answer the most common programme question: "Are we getting better?" Using 30 time points of national screening data from the combined dataset.

AnalyseExcel

Key finding from the data

National syphilis screening improved from 54% (Apr 2023) to 94% (Sep 2025). But national BPG treatment coverage only improved from 63% to 72% over the same period. Screening improved far faster than treatment completion — what does that mean for programme design?

01
Building a trend table: monthly averages across all 30 periods
Calculating national monthly averages for both screening rate and BPG coverage. Ordering by date (not alphabetically). The single most important table in a programme review.
Excel
02
=SLOPE() — is the trend positive, negative, or flat?
Calculating =SLOPE(y_values, x_values) where x = period number 1–30. Interpreting slope: +1.2 pp/month means reaching target in X months. Calculating that projection.
ExcelStatistics
03
Adding a target reference line to your trend chart
Adding a 95% target line as a separate data series. The visual distance between the line and the trend is your programme gap — the most actionable information in the chart.
Excel

Lesson 5.2

Ranking districts — who needs support visits?

RANK() and PERCENTRANK() to produce performance league tables with traffic-light formatting. The KZN district table from April 2024 as the worked example.

AnalyseExcel

Programme scenario

The KZN STI coordinator wants a district scorecard for the quarterly review. In April 2024, KZN districts ranged from iLembe (100.3% screening, 98.4% BPG) to King Cetshwayo (85.6% screening, 85.7% BPG). She wants a ranked table with traffic-light flags — green ≥95%, amber 80–94%, red <80%.

01
=RANK() for performance league tables
=RANK(C2,$C$2:$C$12,0) for descending rank. Why simple RANK beats complex composite scores for programme use.
Excel
02
Traffic light formatting with conditional format rules
Three-colour conditional format: ≥95 = green, 80–94.9 = amber, <80 = red. Using icon sets vs colour fills — when each is appropriate for a programme audience.
Excel

Lesson 5.3

Outlier detection — which districts are anomalies?

Using AVERAGE ± 2×STDEV to flag statistical outliers. In the BPG coverage data, Sedibeng (August 2025: 43.1% MUS incidence) and Chris Hani (53% in September 2025) are outliers worth investigating.

AnalyseExcel

01
Statistical outliers: AVERAGE + 2*STDEV threshold
=IF(D2 > AVERAGE+2*STDEV, "High outlier", IF(D2 < AVERAGE-2*STDEV, "Low outlier", "Normal")). Why this threshold is a starting point, not a final answer.
ExcelStatistics
02
Data quality outliers vs clinical outliers
Not every outlier is a data error. iLembe's MUS incidence of 37–40/100,000 across multiple periods is probably real. How to distinguish systematic programme difference from data problems.
Data qualityProgramme context

Lesson 5.4

Performance gap analysis — how far from target?

Measuring the gap between actual and target — not just above/below but by how much, for how long. The BPG treatment coverage gap (target 95%, national average 70%) across all periods.

AnalyseExcel

01
=(Actual − Target) / Target × 100 — performance gap formula
Building a gap table for BPG treatment coverage against the 95% target. Which districts have a gap >25 percentage points? Which have been below target for >3 consecutive periods?
Excel
02
Persistent underperformance: counting consecutive periods below target
Using COUNTIFS with a period range to count how many months a district has been below 80% BPG coverage. Translating this into a programme escalation list.
ExcelProgramme context

Practice task 5.4 — mini project

Produce a "BPG Alert List" for September 2025: all districts with (1) syphilis screening ≥90% AND (2) BPG treatment coverage <60%. These are districts finding positive cases but not completing treatment — the most actionable gap in the data.

Part Six

Visualise & Communicate

Charts that tell the right story to a programme audience. Power BI dashboards for provincial coordinators and district managers. Every visual is built directly from the STI/HIV dataset.

Lesson 6.1

Choosing the right chart type for health data

The cheat sheet chart selection guide applied to our specific indicators. When a line chart beats a bar chart. Why a sorted horizontal bar is the right choice for district rankings.

VisualiseExcel · Power BI

01
Line chart: national syphilis screening trend (30 months)
Building the trend line with a 95% target reference. Writing the chart title as a finding: "National screening reached 94% — but BPG coverage lags at 72%."
Excel
02
Sorted bar chart: district ranking for a single period
52 districts sorted by syphilis screening rate for September 2025. Adding a vertical target line. Labelling only the top 5 and bottom 5.
Excel
03
Scatter plot: screening coverage vs BPG treatment — the quadrant chart
X: syphilis screening rate. Y: BPG treatment coverage. Quadrant lines at 95% and 90%. Labels: High screening/High BPG (ideal) vs High screening/Low BPG (priority).
Excel

Lesson 6.2

Power BI fundamentals — connecting to the clean dataset

Importing the cleaned Excel model into Power BI Desktop. Building the data model with a period dimension table for correct time-intelligence.

Power BIDashboard

01
Import, Model, Report: the three Power BI views
Data view — checking your 1,600 rows loaded correctly. Model view — confirming the period date column is recognised as Date. Report view — where the dashboard lives.
Power BI
02
DAX measures: average screening rate and BPG coverage
Avg Screening = AVERAGE([Syphilis screening at 1st ANC visit rate]). % Districts Above Target = DIVIDE(COUNTROWS(FILTER(data,[rate]>=95)), COUNTROWS(data)).
Power BIDAX

Lesson 6.3

Building the STI surveillance dashboard

A four-visual dashboard page: KPI cards, trend line, district bar chart, and province map. Slicers for period and province. The deliverable for a quarterly review meeting.

Power BIDashboard

Programme scenario

The national STI programme manager needs an interactive dashboard for a 30-minute district review meeting. She needs: (1) national KPI numbers at the top, (2) a trend line filterable by province, (3) a ranked district list for the selected period, (4) a map showing provincial averages. Everything must update when she selects a different province or period.

01
Page 1: National overview — KPI cards + trend
Four KPI cards: Avg Screening Rate, % Districts ≥95%, Avg BPG Coverage, National Syphilis Positivity Rate. Dual-axis line chart across 30 months.
Power BI
02
Page 2: District drilldown — sorted bar + map
Horizontal bar chart sorted by screening rate. Province slicer updates bar chart and map simultaneously. Conditional colour: red <80%, amber 80–94%, green ≥95%.
Power BI

Lesson 6.4

Storytelling — turning visuals into decisions

A chart without a narrative is just a picture. Framing findings for a programme director, a district manager, and a donor — three different audiences, same data.

CommunicationConcept

01
The finding sentence: Situation → Data → Implication
Formula: "[What the data shows] in [place/time], which means [programme implication]." Example: "Syphilis screening in EC improved from 68% to 92%, but BPG treatment coverage remains at 71%, suggesting a treatment pipeline gap requiring a stock or training intervention."
Communication
02
Dashboard design principles for health sector audiences
Sort bar charts by value; use target lines; limit charts to 8 per page; never use 3D charts; use consistent colour coding. Applying all to the STI dashboard.
Design

Practice task 6.4 — mini project

Publish the STI surveillance dashboard to Power BI Service. Write three finding sentences — one for a district manager, one for a provincial coordinator, one for a donor report. Each must use a different indicator and follow the Situation → Data → Implication format.

Part Seven

Act & Monitor

Analysis that doesn't inform a decision is wasted. District scorecards, donor progress reports, and automated alerts — the operational end of the analysis pipeline.

Lesson 7.1

District scorecard — one-page performance summary

A single-page Excel scorecard for a district manager: current values, trend direction (↑ ↓ →), traffic lights, and one action-oriented finding. Designed to print on one A4 page.

DeliverableExcel

Programme scenario

The Chris Hani District Manager needs a one-page scorecard for his quarterly self-assessment. Chris Hani has consistently high syphilis positivity (3.6–5.4% across periods) and variable BPG coverage (62–74%). The scorecard must show where he stands, whether he's improving, and what to do next.

01
XLOOKUP: pulling the latest district values automatically
The scorecard updates automatically when you change the district dropdown. Why XLOOKUP beats VLOOKUP for this use case.
Excel
02
Trend arrows with conditional formatting
Using ▲ ▼ → symbols based on whether the current value is higher, lower, or within 1pp of the previous period. How to fit 8 indicators on one readable A4 page.
Excel

Lesson 7.2

Donor progress report — linking Excel data to a Word narrative

How to produce a quarterly donor report where numbers update automatically when new data arrives. Excel → Word linked tables → PDF. The workflow that saves four hours per reporting cycle.

DeliverableExcel · Word

01
Linking Excel tables to Word — the paste-link technique
Paste Special → Paste Link in Word. When the Excel source updates, the Word table updates with one click. Breaking the copy-paste-reformat cycle that causes donor report errors.
Excel
02
Writing the standard indicator table for donor reporting
Structure: Indicator | Target | Baseline (Apr 2023) | Latest (Sep 2025) | % Change | Status. The narrative sentence accompanying each row.
Communication

Lesson 7.3

Automated alerts — monitoring without manual checking

Power BI data alerts send an email when a KPI crosses a threshold. Setting up both Power BI and Excel alerts so you are notified when a district drops below target — not after you happen to look.

MonitorPower BI

01
Power BI alert: notify when national BPG coverage drops below 65%
Setting up a data alert on a KPI card. Alert conditions: BPG coverage <65% (warning), <55% (critical). Connecting to email or Teams notification.
Power BI
02
Excel monitoring: the refreshable exception report
A Power Query-driven exception table: districts with >3 consecutive periods below 80% screening rate. This table re-runs on every Refresh All — your standing monitoring output.
ExcelMonitoring

Part Eight · Stretch Track

Scale with R

Everything you did in Excel, reproduced in R with the tidyverse. Not because R is better for every task — but because it handles 1,600 rows the same way it handles 160,000, and every step is documented in code.

Who this part is for: Analysts who have completed Parts 1–7 and want to reproduce the same analysis in R — for reproducibility, scale, or integration with DHIS2 APIs. No prior R knowledge required.

Lesson 8.1

Reading DHIS2 exports into R with readxl

Installing the tidyverse. Reading all six Excel files with map_dfr(). The R equivalent of Power Query's folder import — in six lines of code.

RImport

R — load all DHIS2 exports

library(tidyverse)
library(readxl)

# List all xlsx files in the data folder
files <- list.files(
  "data/", pattern = "\\.xlsx$", full.names = TRUE
)

# Read and combine — one function, six files
raw <- map_dfr(files, ~read_excel(.x, skip = 1))

glimpse(raw)  # Rows: 1,612  Cols: 13

Lesson 8.2

Cleaning with dplyr and stringr

Extracting province codes, parsing period dates, renaming columns, and flagging data quality issues — all in a single tidy pipeline using |>.

RClean

R — cleaning pipeline

clean <- raw |>
  rename(
    period          = periodname,
    district        = organisationunitname,
    syphilis_screen = `Syphilis screening at 1st ANC visit rate`,
    bpg_coverage    = `BPG dose 1st treatment coverage`,
    syph_positivity = `Antenatal syphilis test positive rate`
  ) |>
  mutate(
    prov_code   = str_to_upper(str_sub(district, 1, 2)),
    dist_clean  = str_trim(str_sub(district, 4)),
    period_date = as.Date(paste0("01 ", period), format = "%d %B %Y"),
    year        = year(period_date),
    dq_flag     = if_else(
      syphilis_screen > 100 | bpg_coverage > 150,
      "Flag", "OK"
    )
  )

Lesson 8.3

Exploratory analysis with dplyr

group_by() + summarise() replaces the entire pivot table interface. Producing the provincial cross-tab, district rankings, and trend table in code.

RExplore

R — national trend + district rankings

# National trend by period
national_trend <- clean |>
  filter(dq_flag == "OK") |>
  group_by(period_date, period) |>
  summarise(
    avg_screening = mean(syphilis_screen, na.rm = TRUE),
    avg_bpg       = mean(bpg_coverage,    na.rm = TRUE),
    n_districts   = n(),
    .groups = "drop"
  ) |>
  arrange(period_date)

# District rankings for September 2025
sep_ranking <- clean |>
  filter(period == "September 2025", dq_flag == "OK") |>
  select(dist_clean, prov_code, syphilis_screen, bpg_coverage) |>
  arrange(desc(syphilis_screen))

Lesson 8.4

Visualising with ggplot2

Reproducing the three key charts from Part 6 in ggplot2. Publication-quality output in 15 lines of code.

Rggplot2

R — ggplot2 trend chart

library(ggplot2)

ggplot(national_trend, aes(x = period_date, y = avg_screening)) +
  geom_line(colour = "#1d4ed8", linewidth = 1) +
  geom_line(aes(y = avg_bpg),
    colour = "#d97706", linewidth = 1, linetype = "dashed") +
  geom_hline(yintercept = 95,
    colour = "#dc2626", linetype = "dotted") +
  annotate("text", x = min(period_date),
    y = 96.5, label = "95% target",
    colour = "#dc2626", size = 3, hjust = 0) +
  labs(
    title = "Screening improved sharply; BPG coverage lags",
    x = NULL, y = "Rate (%)"
  ) +
  theme_minimal()

Lesson 8.5

Reproducible reports with R Markdown

Combining code and narrative in one document. When new exports arrive, the report re-renders with updated figures. The gold standard for donor reporting.

RReproducibility

01
Structure of an R Markdown quarterly report
YAML header, setup chunk, inline code for numbers (`r round(latest_screening, 1)`%), figure chunks, and knit-to-Word output for donor submission.
R Markdown
02
Parameterised reports — one template, nine provinces
Adding a params: province parameter. Running rmarkdown::render() in a loop across all 9 provinces. Nine separate district reports, automatically generated.
R MarkdownAutomation

Practice task 8.5 — final project

Build a parameterised R Markdown report for the KwaZulu-Natal provincial coordinator. The report must: (1) read the full 6-file dataset, (2) filter to KZN districts only, (3) include the trend chart, district ranking table, and the BPG alert list, (4) render to Word with one click, (5) update automatically when a new data file is added to the folder.

Course outcomes

By the end of this course, you will be able to:

Import and combine DHIS2 exports automatically

Clean district names, dates, and data quality flags

Build province and district pivot summaries

Produce ranked district performance tables

Detect trends, outliers, and performance gaps

Build interactive Power BI surveillance dashboards

Write finding sentences for programme audiences

Link Excel data to Word donor report templates

Set up automated monitoring alerts in Power BI

Reproduce the full pipeline in R for scale and rigour