Thoughtful statistical consulting for clear, defensible decisions
Exploring the New ggsql Package with Two Years of Swim Data
2026-06-16
Introduction
Today was a day of serendipity.
Firstly, I have been working on a personal project to capture my health data from the various wearables and devices, including my FORM smart goggles, in a single database. I finally finished building the data capture pipeline for the smart goggles this morning and was eager to explore the data.
Secondly, the guest on Posit’s Data Science Lab today was Thomas Pederson, who was talking about his new open source package ggsql, which extends standard SQL syntax to allow for declarative data visualization. By integrating the “Grammar of Graphics” directly into SQL, ggsql allows one to explore and visualize large datasets without ever leaving the database environment.
During the lab, I also learned that ggsql works with quarto. Since this website is built on quarto, I couldn’t resist the opportunity to use ggsql to explore my swim data.
If I hadn’t stumbled across the Data Science Lab today, I would have approached the analysis of my swim data quite differently.
Write a SQL query to extract the data.
Pull it into an R environment.
Coerce it into a tibble.
Pass it to a visualization library like ggplot2 to finally see what I’m working with.
This is a pipeline I’ve accepted for years, but it creates friction. What if I could collapse those steps and bring the visualization directly to the data?
Open EDA: Two Years of Swimming Data
Since I purchased my FORM goggles to motivate me to swim at a time I could hardly walk, I have been committed to swimming at least once per week. The dataset has grown pretty large; I have recorded 5,400 lap-by-lap telemetry records, which should be a nice test of ggsql.
Question 1: Which Days Do I Swim?
As a professor, my schedule can be a bit variable. I was curious which days I tended to swim more. To investigate this, I can plot the total number of swimming sessions I recorded on each day of the week using DRAW bar.
SELECTCAST(isodow(CAST(session_date ASDATE)) ASVARCHAR) ||'-'|| dayname(CAST(session_date ASDATE)) ASday, COUNT(*) AScountFROM form_sessionsGROUPBY isodow(CAST(session_date ASDATE)), dayname(CAST(session_date ASDATE))VISUALIZE dayAS x, countAS y DRAW bar
Question 2: How Has My Volume and Output Evolved Over Time?
I can plot my total distance swum over time using DRAW line.
SELECTCAST(session_date ASDATE) AS session_date, SUM(total_distance_m) AS total_distance_mFROM form_sessionsGROUPBYCAST(session_date ASDATE)VISUALIZE session_date AS x, total_distance_m AS yDRAW line
Or, I can visualize the total calories burned over time as an area chart using DRAW area.
SELECTCAST(session_date ASDATE) AS session_date, SUM(calories) AS caloriesFROM form_sessionsGROUPBYCAST(session_date ASDATE)VISUALIZE session_date AS x, calories AS yDRAW area
Question 3: Are My Workouts Consistent Across the Week?
To see the distribution of my workout distances across the different days of the week, I can use DRAW boxplot.
SELECTCAST(isodow(CAST(session_date ASDATE)) ASVARCHAR) ||'-'|| dayname(CAST(session_date ASDATE)) ASday, total_distance_m FROM form_sessions WHERE total_distance_m ISNOTNULLVISUALIZE dayAS x, total_distance_m AS yDRAW boxplot
Question 4: What is my Typical Heart Rate Profile?
To see the overall distribution of my heart rate across all recorded laps, I can use DRAW histogram.
SELECT avg_hr FROM form_laps WHERE avg_hr ISNOTNULLVISUALIZE avg_hr AS xDRAW histogram
Or, for a smoothed kernel density estimate, I can use DRAW density.
SELECT avg_hr FROM form_laps WHERE avg_hr ISNOTNULLVISUALIZE avg_hr AS xDRAW density
Question 5: How Does Head Pitch Affect Efficiency?
FORM Smart goggles measure lots of cool data, like your head pitch angle and peak head roll angle. I wanted to know how my head angle correlated with my actual distance per stroke (DPS), and whether that optimal pitch changes when I’m sprinting versus when I’m gliding.
To find out, I need to join my lengths table with my laps table (which contains my average heart rate).
SELECT l.head_pitch, (s.pool_length_m *1.0/ l.total_strokes) AS dps,CASEWHEN lap.avg_hr <130THEN'Zone 2'WHEN lap.avg_hr <=150THEN'Zone 3'ELSE'Zone 4/5'ENDAS hr_zoneFROM form_lengths lJOIN form_sessions s ON l.session_id = s.idLEFTJOIN form_laps lap ON l.lap_id = lap.idWHERE l.swim_stroke ='freestyle'AND l.total_strokes >0AND l.head_pitch ISNOTNULLAND lap.avg_hr ISNOTNULLVISUALIZE head_pitch AS x, dps AS y, hr_zone AS fillDRAW point
This chart suggests that I can possibly fit a mathematical model to this data which would allow me to calculate my ideal head pitch to maximize DPS at different heart rate zones.
Conclusion
The power of ggsql lies in its immediacy. When I am doing Exploratory Data Analysis, I am often asking dozens of rapid-fire questions. By removing the mental context-switch between “database language” and “plotting language,” ggsql allows me to stay in a state of flow, asking questions of my data and seeing the answers instantly.