Course dataset — South Africa STI/HIV district indicators
Every lesson uses simulated DHIS2 exports covering 52 districts across all 9 provinces, April 2023 – September 2025. You will work with messy raw exports, inconsistent province prefixes, data quality flags, and cross-period comparisons — exactly what you encounter in practice.
The Whole Game
Before learning any single tool, you run the complete pipeline once. Messy DHIS2 export → clean data → insight → action. This section exists so every later lesson has a destination.
Your provincial STI coordinator has asked for a list of districts not hitting the 95% syphilis screening target at ANC. She needs it by 3pm. You have a DHIS2 export for April 2024 — 52 districts, 12 columns. What do you do first?
- 01Opening the file: what are you actually looking at?Reading column headers. Identifying the unit of observation (one row = one district-month). Spotting numeric vs rate columns. Checking for blanks.
- 02Filter, sort, and flag — finding the underperformersUsing AutoFilter on
Syphilis screening at 1st ANC visit rate. Sorting descending. Adding a conditional format column: Below target / On target. - 03Turning the list into a decision — the one-sentence findingWriting a programme-facing finding: "In April 2024, 9 of 52 districts reported syphilis screening below 80%, concentrated in Gauteng and the Northern Cape."
Open the April 2025 dataset. Produce the same ranked list. How many districts have improved since 2024? Write one finding sentence suitable for a district review meeting.
- 01Stage 1–2: Define → CollectThe most skipped steps. What question are we actually answering? Which DHIS2 indicator is the right one? Covered fully in Part 2.
- 02Stage 3–4: Clean → ExploreWhy 60% of analysis time is data preparation. Why EDA before analysis saves rework.
- 03Stage 5–8: Analyse → Communicate → Act → MonitorThe outputs programme managers need: a finding, a chart, a dashboard, a decision. Covered in Parts 5–7.
Define & Collect
Most analysis errors happen before a single cell is touched. Translate a programme question into a data request — and understand what DHIS2 actually delivers.
Donor review in six weeks. The programme manager says: "We need to show progress on syphilis." Before pulling any data — Problem → Positivity has not declined in EC; Objective → Identify EC districts with declining vs rising positivity 2023–2025; Success → Ranked district table with trend direction.
- 01The Problem–Objective–Success frameworkUsing the cheat sheet Stage 1 template. Worked example: syphilis in the Eastern Cape. Why "we need data" is not a question.
- 02Choosing the right indicatorWhen to use rate vs count. Why
Syphilis screening at 1st ANC visit rateandBPG dose 1st treatment coveragemeasure different things. Denominator traps.
A colleague says: "KZN teen pregnancy is a problem." Complete the Problem–Objective–Success table using Delivery in 10-19 years in facility rate. Identify three KZN districts where the rate exceeds 20% in 2024.
- 01Anatomy of the DHIS2 export
periodnameandorganisationunitnameare your time and place variables. The two-letter province prefix in district names (ec, kz, gp…) — and why it matters for grouping. - 02What values above 100% mean — and when to flag themOur dataset includes screening rates >100% (e.g. Thabo Mofutsanyana 116.5%, Capricorn 294.7%). These are DHIS2 data quality artefacts. How to identify and note them without deleting them.
You have six DHIS2 exports — April 2023 through September 2025. The donor wants a trend table showing national screening rates across all periods. You need one combined table, not six files.
- 01Get Data → Folder: importing all exports at onceSetting up the folder import. Expanding the Content column. Why the header row appears in every file and how Power Query skips it automatically.
- 02Result: 1,612 rows — verifying the append was correctChecking row count (52 districts × ~31 periods). Confirming
periodnamehas the right unique values. Adding a refresh step comment for the next analyst.
// Step 1: Connect to folder
Source = Folder.Files("C:\Data\STI_Exports"),
// Step 2: Keep only .xlsx files
Filtered = Table.SelectRows(Source, each
[Extension] = ".xlsx"),
// Step 3: Expand and combine
Combined = Table.Combine(
List.Transform(Filtered[Content],
each Excel.Workbook(_){0}[Data])),
// Step 4: Promote headers
WithHeaders = Table.PromoteHeaders(Combined)
Clean & Prepare
The province prefixes are lowercase. District names have trailing spaces. Rates above 100% exist. This part teaches Power Query techniques using exactly the problems in this dataset.
You need a province-level summary for the quarterly review. But the data has no province column — only long district names like kz eThekwini Metropolitan Municipality. You need to extract KZ from those two leading characters.
- 01Extract First Characters — Power Query's built-in toolAdd Column → Extract → First Characters → 2. Rename to
prov_code. Transform → Uppercase. Result: EC, KZ, GP, LP, MP, FS, WC, NC, NW. - 02Mapping codes to full province names with a lookup tableCreating a small EC→Eastern Cape lookup table. Merging it into the main query. Why this is better than Replace Values — it's reusable and auditable.
AddProvCode = Table.AddColumn(
Source, "prov_code",
each Text.Upper(Text.Start([organisationunitname], 2)),
type text
),
CleanDistrict = Table.AddColumn(
AddProvCode, "district",
each Text.Trim(Text.Middle([organisationunitname], 3)),
type text
)
- 01Converting "April 2023" to a proper Date columnAppending " 01" to make it parseable. Using
Date.FromText. Why sortability matters for every chart you will ever make from this data. - 02Adding Year and Quarter columns for groupingExtracting Year and Quarter from the date column. These become your row/column headers in every pivot table going forward.
During review you notice Capricorn District (Limpopo) reported BPG treatment coverage = 294.7% in June 2025. This cannot be real. Before including it in any trend chart, you need a flag column — deleting it would hide the problem from programme leadership.
- 01Adding a data quality flag column in Power QueryConditional column: IF screening rate >100 THEN "Flag" ELSE "OK". Keeping the original value — never overwriting raw data.
- 02Documenting data quality for programme managersWriting a one-line data quality note in the analysis header: "3 district-months flagged with screening >100% and excluded from national averages." Why transparency matters for donor reporting.
- 01Reviewing Applied Steps — your audit trailReading the steps pane from top to bottom. Naming each step descriptively (not "Removed Columns1"). Why named steps matter when a colleague picks up your file in six months.
- 02Test: add new data and verify the pipelineDrop a new file in the folder. Refresh All. Confirm row count increases. Confirm province column still populates. Confirm data quality flags run. This is your regression test.
Build the complete cleaning pipeline for the 6-file dataset. Your cleaned output must have: (1) a province column, (2) a district column without prefix, (3) a period_date as a real Date type, (4) a dq_flag column for values >100%. Verify it refreshes correctly when you add a new file.
Explore
Before you analyse, you look. Pivot tables, summary statistics, and distribution checks — done on your combined 1,600-row dataset. This is where surprises live.
National average syphilis screening at 1st ANC: April 2023 = 54.2% → April 2024 = 86.7% → April 2025 = 92.7%. A 38-percentage-point improvement in two years. But the average hides KZN districts consistently above 95% while some Gauteng metros were at 5–18% in April 2023.
- 01AVERAGEIFS and COUNTIFS — conditional summaries by provinceCalculating the provincial average screening rate for each period using
=AVERAGEIFS(rate_col, prov_col, "KZ", period_col, "April 2024"). Building a 9-province summary table. - 02What does the distribution look like? Min, max, spreadIn April 2023, GP districts ranged from 5.8% to 89.4%. In September 2025, the range narrowed to 82.9%–98.6%. Distribution shape tells you whether the problem is systemic or concentrated.
- 01Configuration 1: Province × Period cross-tabRows: province, Columns: periodname (April 2023, April 2024, April 2025), Values: AVERAGE of syphilis screening rate. Immediately shows Gauteng's dramatic catch-up.
- 02Configuration 2: Top/bottom district rankingRows: district, Values: AVERAGE screening rate, Sort descending, Show Top 10 / Bottom 10. Identifies chronic underperformers for targeted support visits.
- 03Configuration 3: % below target by provinceRows: province, Values: COUNTIF districts below 80%, COUNTA districts. Calculated field:
% below target. Conditional formatting to flag provinces needing support visits.
Build all three pivot configurations. Answer: (1) Which province had the largest improvement in screening rate between April 2023 and September 2025? (2) Which three districts have the highest Antenatal syphilis test positive rate in the most recent period?
- 01Scatter plot: syphilis positivity vs BPG coverageVisualising districts where positivity is high but treatment coverage is low — the worst-case quadrant. How to add quadrant lines at national average.
- 02=CORREL() — quantifying relationshipsCalculating the correlation between
HIV positive 15-24andHIV positive 25-49rates across districts. Interpreting r in a programme context.
Analyse
Moving from "what does the data look like?" to "what is actually happening?" Trend analysis, comparative ranking, performance gap measurement, and outlier detection.
National syphilis screening improved from 54% (Apr 2023) to 94% (Sep 2025). But national BPG treatment coverage only improved from 63% to 72% over the same period. Screening improved far faster than treatment completion — what does that mean for programme design?
- 01Building a trend table: monthly averages across all 30 periodsCalculating national monthly averages for both screening rate and BPG coverage. Ordering by date (not alphabetically). The single most important table in a programme review.
- 02=SLOPE() — is the trend positive, negative, or flat?Calculating
=SLOPE(y_values, x_values)where x = period number 1–30. Interpreting slope: +1.2 pp/month means reaching target in X months. Calculating that projection. - 03Adding a target reference line to your trend chartAdding a 95% target line as a separate data series. The visual distance between the line and the trend is your programme gap — the most actionable information in the chart.
The KZN STI coordinator wants a district scorecard for the quarterly review. In April 2024, KZN districts ranged from iLembe (100.3% screening, 98.4% BPG) to King Cetshwayo (85.6% screening, 85.7% BPG). She wants a ranked table with traffic-light flags — green ≥95%, amber 80–94%, red <80%.
- 01=RANK() for performance league tables
=RANK(C2,$C$2:$C$12,0)for descending rank. Why simple RANK beats complex composite scores for programme use. - 02Traffic light formatting with conditional format rulesThree-colour conditional format: ≥95 = green, 80–94.9 = amber, <80 = red. Using icon sets vs colour fills — when each is appropriate for a programme audience.
- 01Statistical outliers: AVERAGE + 2*STDEV threshold
=IF(D2 > AVERAGE+2*STDEV, "High outlier", IF(D2 < AVERAGE-2*STDEV, "Low outlier", "Normal")). Why this threshold is a starting point, not a final answer. - 02Data quality outliers vs clinical outliersNot every outlier is a data error. iLembe's MUS incidence of 37–40/100,000 across multiple periods is probably real. How to distinguish systematic programme difference from data problems.
- 01=(Actual − Target) / Target × 100 — performance gap formulaBuilding a gap table for BPG treatment coverage against the 95% target. Which districts have a gap >25 percentage points? Which have been below target for >3 consecutive periods?
- 02Persistent underperformance: counting consecutive periods below targetUsing COUNTIFS with a period range to count how many months a district has been below 80% BPG coverage. Translating this into a programme escalation list.
Produce a "BPG Alert List" for September 2025: all districts with (1) syphilis screening ≥90% AND (2) BPG treatment coverage <60%. These are districts finding positive cases but not completing treatment — the most actionable gap in the data.
Visualise & Communicate
Charts that tell the right story to a programme audience. Power BI dashboards for provincial coordinators and district managers. Every visual is built directly from the STI/HIV dataset.
- 01Line chart: national syphilis screening trend (30 months)Building the trend line with a 95% target reference. Writing the chart title as a finding: "National screening reached 94% — but BPG coverage lags at 72%."
- 02Sorted bar chart: district ranking for a single period52 districts sorted by syphilis screening rate for September 2025. Adding a vertical target line. Labelling only the top 5 and bottom 5.
- 03Scatter plot: screening coverage vs BPG treatment — the quadrant chartX: syphilis screening rate. Y: BPG treatment coverage. Quadrant lines at 95% and 90%. Labels: High screening/High BPG (ideal) vs High screening/Low BPG (priority).
- 01Import, Model, Report: the three Power BI viewsData view — checking your 1,600 rows loaded correctly. Model view — confirming the period date column is recognised as Date. Report view — where the dashboard lives.
- 02DAX measures: average screening rate and BPG coverage
Avg Screening = AVERAGE([Syphilis screening at 1st ANC visit rate]).% Districts Above Target = DIVIDE(COUNTROWS(FILTER(data,[rate]>=95)), COUNTROWS(data)).
The national STI programme manager needs an interactive dashboard for a 30-minute district review meeting. She needs: (1) national KPI numbers at the top, (2) a trend line filterable by province, (3) a ranked district list for the selected period, (4) a map showing provincial averages. Everything must update when she selects a different province or period.
- 01Page 1: National overview — KPI cards + trendFour KPI cards: Avg Screening Rate, % Districts ≥95%, Avg BPG Coverage, National Syphilis Positivity Rate. Dual-axis line chart across 30 months.
- 02Page 2: District drilldown — sorted bar + mapHorizontal bar chart sorted by screening rate. Province slicer updates bar chart and map simultaneously. Conditional colour: red <80%, amber 80–94%, green ≥95%.
- 01The finding sentence: Situation → Data → ImplicationFormula: "[What the data shows] in [place/time], which means [programme implication]." Example: "Syphilis screening in EC improved from 68% to 92%, but BPG treatment coverage remains at 71%, suggesting a treatment pipeline gap requiring a stock or training intervention."
- 02Dashboard design principles for health sector audiencesSort bar charts by value; use target lines; limit charts to 8 per page; never use 3D charts; use consistent colour coding. Applying all to the STI dashboard.
Publish the STI surveillance dashboard to Power BI Service. Write three finding sentences — one for a district manager, one for a provincial coordinator, one for a donor report. Each must use a different indicator and follow the Situation → Data → Implication format.
Act & Monitor
Analysis that doesn't inform a decision is wasted. District scorecards, donor progress reports, and automated alerts — the operational end of the analysis pipeline.
The Chris Hani District Manager needs a one-page scorecard for his quarterly self-assessment. Chris Hani has consistently high syphilis positivity (3.6–5.4% across periods) and variable BPG coverage (62–74%). The scorecard must show where he stands, whether he's improving, and what to do next.
- 01XLOOKUP: pulling the latest district values automaticallyThe scorecard updates automatically when you change the district dropdown. Why XLOOKUP beats VLOOKUP for this use case.
- 02Trend arrows with conditional formattingUsing ▲ ▼ → symbols based on whether the current value is higher, lower, or within 1pp of the previous period. How to fit 8 indicators on one readable A4 page.
- 01Linking Excel tables to Word — the paste-link techniquePaste Special → Paste Link in Word. When the Excel source updates, the Word table updates with one click. Breaking the copy-paste-reformat cycle that causes donor report errors.
- 02Writing the standard indicator table for donor reportingStructure: Indicator | Target | Baseline (Apr 2023) | Latest (Sep 2025) | % Change | Status. The narrative sentence accompanying each row.
- 01Power BI alert: notify when national BPG coverage drops below 65%Setting up a data alert on a KPI card. Alert conditions: BPG coverage <65% (warning), <55% (critical). Connecting to email or Teams notification.
- 02Excel monitoring: the refreshable exception reportA Power Query-driven exception table: districts with >3 consecutive periods below 80% screening rate. This table re-runs on every Refresh All — your standing monitoring output.
Scale with R
Everything you did in Excel, reproduced in R with the tidyverse. Not because R is better for every task — but because it handles 1,600 rows the same way it handles 160,000, and every step is documented in code.
library(tidyverse)
library(readxl)
# List all xlsx files in the data folder
files <- list.files(
"data/", pattern = "\\.xlsx$", full.names = TRUE
)
# Read and combine — one function, six files
raw <- map_dfr(files, ~read_excel(.x, skip = 1))
glimpse(raw) # Rows: 1,612 Cols: 13
clean <- raw |>
rename(
period = periodname,
district = organisationunitname,
syphilis_screen = `Syphilis screening at 1st ANC visit rate`,
bpg_coverage = `BPG dose 1st treatment coverage`,
syph_positivity = `Antenatal syphilis test positive rate`
) |>
mutate(
prov_code = str_to_upper(str_sub(district, 1, 2)),
dist_clean = str_trim(str_sub(district, 4)),
period_date = as.Date(paste0("01 ", period), format = "%d %B %Y"),
year = year(period_date),
dq_flag = if_else(
syphilis_screen > 100 | bpg_coverage > 150,
"Flag", "OK"
)
)
# National trend by period
national_trend <- clean |>
filter(dq_flag == "OK") |>
group_by(period_date, period) |>
summarise(
avg_screening = mean(syphilis_screen, na.rm = TRUE),
avg_bpg = mean(bpg_coverage, na.rm = TRUE),
n_districts = n(),
.groups = "drop"
) |>
arrange(period_date)
# District rankings for September 2025
sep_ranking <- clean |>
filter(period == "September 2025", dq_flag == "OK") |>
select(dist_clean, prov_code, syphilis_screen, bpg_coverage) |>
arrange(desc(syphilis_screen))
library(ggplot2)
ggplot(national_trend, aes(x = period_date, y = avg_screening)) +
geom_line(colour = "#1d4ed8", linewidth = 1) +
geom_line(aes(y = avg_bpg),
colour = "#d97706", linewidth = 1, linetype = "dashed") +
geom_hline(yintercept = 95,
colour = "#dc2626", linetype = "dotted") +
annotate("text", x = min(period_date),
y = 96.5, label = "95% target",
colour = "#dc2626", size = 3, hjust = 0) +
labs(
title = "Screening improved sharply; BPG coverage lags",
x = NULL, y = "Rate (%)"
) +
theme_minimal()
- 01Structure of an R Markdown quarterly reportYAML header, setup chunk, inline code for numbers (
`r round(latest_screening, 1)`%), figure chunks, and knit-to-Word output for donor submission. - 02Parameterised reports — one template, nine provincesAdding a
params: provinceparameter. Runningrmarkdown::render()in a loop across all 9 provinces. Nine separate district reports, automatically generated.
Build a parameterised R Markdown report for the KwaZulu-Natal provincial coordinator. The report must: (1) read the full 6-file dataset, (2) filter to KZN districts only, (3) include the trend chart, district ranking table, and the BPG alert list, (4) render to Word with one click, (5) update automatically when a new data file is added to the folder.
By the end of this course, you will be able to: