Dylan Poulsen Consulting

Thoughtful statistical consulting for clear, defensible decisions

Exploring the New ggsql Package with Two Years of Swim Data

2026-06-16

Introduction

Today was a day of serendipity.

Firstly, I have been working on a personal project to capture my health data from the various wearables and devices, including my FORM smart goggles, in a single database. I finally finished building the data capture pipeline for the smart goggles this morning and was eager to explore the data.

Secondly, the guest on Posit’s Data Science Lab today was Thomas Pederson, who was talking about his new open source package ggsql, which extends standard SQL syntax to allow for declarative data visualization. By integrating the “Grammar of Graphics” directly into SQL, ggsql allows one to explore and visualize large datasets without ever leaving the database environment.

During the lab, I also learned that ggsql works with quarto. Since this website is built on quarto, I couldn’t resist the opportunity to use ggsql to explore my swim data.

If I hadn’t stumbled across the Data Science Lab today, I would have approached the analysis of my swim data quite differently.

  1. Write a SQL query to extract the data.
  2. Pull it into an R environment.
  3. Coerce it into a tibble.
  4. Pass it to a visualization library like ggplot2 to finally see what I’m working with.

This is a pipeline I’ve accepted for years, but it creates friction. What if I could collapse those steps and bring the visualization directly to the data?

Open EDA: Two Years of Swimming Data

Since I purchased my FORM goggles to motivate me to swim at a time I could hardly walk, I have been committed to swimming at least once per week. The dataset has grown pretty large; I have recorded 5,400 lap-by-lap telemetry records, which should be a nice test of ggsql.

Question 1: Which Days Do I Swim?

As a professor, my schedule can be a bit variable. I was curious which days I tended to swim more. To investigate this, I can plot the total number of swimming sessions I recorded on each day of the week using DRAW bar.

SELECT CAST(isodow(CAST(session_date AS DATE)) AS VARCHAR) || '-' || dayname(CAST(session_date AS DATE)) AS day, COUNT(*) AS count
FROM form_sessions
GROUP BY isodow(CAST(session_date AS DATE)), dayname(CAST(session_date AS DATE))
VISUALIZE day AS x, count AS y 
DRAW bar

Question 2: How Has My Volume and Output Evolved Over Time?

I can plot my total distance swum over time using DRAW line.

SELECT CAST(session_date AS DATE) AS session_date, SUM(total_distance_m) AS total_distance_m
FROM form_sessions
GROUP BY CAST(session_date AS DATE)
VISUALIZE session_date AS x, total_distance_m AS y
DRAW line

Or, I can visualize the total calories burned over time as an area chart using DRAW area.

SELECT CAST(session_date AS DATE) AS session_date, SUM(calories) AS calories
FROM form_sessions
GROUP BY CAST(session_date AS DATE)
VISUALIZE session_date AS x, calories AS y
DRAW area

Question 3: Are My Workouts Consistent Across the Week?

To see the distribution of my workout distances across the different days of the week, I can use DRAW boxplot.

SELECT CAST(isodow(CAST(session_date AS DATE)) AS VARCHAR) || '-' || dayname(CAST(session_date AS DATE)) AS day, total_distance_m 
FROM form_sessions 
WHERE total_distance_m IS NOT NULL
VISUALIZE day AS x, total_distance_m AS y
DRAW boxplot

Question 4: What is my Typical Heart Rate Profile?

To see the overall distribution of my heart rate across all recorded laps, I can use DRAW histogram.

SELECT avg_hr FROM form_laps WHERE avg_hr IS NOT NULL
VISUALIZE avg_hr AS x
DRAW histogram

Or, for a smoothed kernel density estimate, I can use DRAW density.

SELECT avg_hr FROM form_laps WHERE avg_hr IS NOT NULL
VISUALIZE avg_hr AS x
DRAW density

Question 5: How Does Head Pitch Affect Efficiency?

FORM Smart goggles measure lots of cool data, like your head pitch angle and peak head roll angle. I wanted to know how my head angle correlated with my actual distance per stroke (DPS), and whether that optimal pitch changes when I’m sprinting versus when I’m gliding.

To find out, I need to join my lengths table with my laps table (which contains my average heart rate).

SELECT 
    l.head_pitch, 
    (s.pool_length_m * 1.0 / l.total_strokes) AS dps,
    CASE 
        WHEN lap.avg_hr < 130 THEN 'Zone 2'
        WHEN lap.avg_hr <= 150 THEN 'Zone 3'
        ELSE 'Zone 4/5'
    END AS hr_zone
FROM form_lengths l
JOIN form_sessions s ON l.session_id = s.id
LEFT JOIN form_laps lap ON l.lap_id = lap.id
WHERE l.swim_stroke = 'freestyle' 
  AND l.total_strokes > 0 
  AND l.head_pitch IS NOT NULL
  AND lap.avg_hr IS NOT NULL
VISUALIZE head_pitch AS x, dps AS y, hr_zone AS fill
DRAW point

This chart suggests that I can possibly fit a mathematical model to this data which would allow me to calculate my ideal head pitch to maximize DPS at different heart rate zones.

Conclusion

The power of ggsql lies in its immediacy. When I am doing Exploratory Data Analysis, I am often asking dozens of rapid-fire questions. By removing the mental context-switch between “database language” and “plotting language,” ggsql allows me to stay in a state of flow, asking questions of my data and seeing the answers instantly.