How to use DuckDB with large datasets

Below is a single unified DuckDB instruction set combining both versions into one clean, consistent workflow for large radio astronomy event datasets.

Assumes columns:

  • date (YYYY-MM-DD)
  • time (HH:MM:SS)
  • ID
  • Event

Unified DuckDB Workflow for Large Radio Astronomy Event Data

1. Create a reusable base view (timestamp normalisation)

This is the foundation for everything else:

CREATE VIEW base AS
SELECT
    *,
    strptime(date || ' ' || time, '%Y-%m-%d %H:%M:%S') AS dt
FROM 'large.csv';

2. Filter to target dataset + create hourly event counts

This reduces data early and prepares it for analysis:

CREATE VIEW id_c_hourly AS
SELECT
    date_trunc('hour', dt) AS hour,
    Event,
    COUNT(*) AS n
FROM base
WHERE ID = 'C'
GROUP BY 1, 2;

This produces a tidy structure:

  • hour
  • event type
  • count per hour

3. Pivot events into columns (analysis-ready format)

If event categories are known:

CREATE VIEW id_c_pivot AS
SELECT *
FROM id_c_hourly
PIVOT (
    SUM(n) FOR Event IN ('A', 'B', 'C')
)
ORDER BY hour;

4. Export final pivot table

COPY id_c_pivot
TO 'ID_C_hourly_pivot.csv'
(HEADER);

5. Full one-shot pipeline (no views, direct output)

Single execution version:

COPY (
    SELECT *
    FROM (
        SELECT
            date_trunc('hour', dt) AS hour,
            Event
        FROM (
            SELECT *,
                   strptime(date || ' ' || time, '%Y-%m-%d %H:%M:%S') AS dt
            FROM 'large.csv'
        )
        WHERE ID = 'C'
    )
    PIVOT (
        COUNT(*) FOR Event IN ('A', 'B', 'C')
    )
    ORDER BY hour
)
TO 'ID_C_hourly_pivot.csv'
(HEADER);

6. Recommended scientific format (better than pivot for analysis)

This is often more useful for radio astronomy workflows, plotting, and signal analysis:

SELECT
    date_trunc('hour', dt) AS hour,
    Event,
    COUNT(*) AS intensity
FROM base
WHERE ID = 'C'
GROUP BY 1, 2
ORDER BY 1;

7. Optional: discover event classes automatically

Before pivoting:

SELECT DISTINCT Event
FROM 'large.csv'
WHERE ID = 'C';

Then use results in:

PIVOT (... FOR Event IN (...))

8. Performance rules for large datasets (important)

  • Always filter early:WHERE ID = 'C'
  • Avoid repeated timestamp parsing (use base view)
  • Prefer Parquet over CSV when possible:SELECT * FROM 'large.parquet';
  • Select only required columns in large-scale runs
  • Use hourly aggregation before pivoting for efficiency

9. Recommended final workflow (best practice summary)

  1. Create base view (timestamp conversion once)
  2. Filter by ID early
  3. Aggregate into hourly bins
  4. Optionally pivot for spreadsheet-style output
  5. Export to CSV or Parquet

How would I use DuckDB to divide a very large spreadsheets into smaller ones maximum 50000 lines per spreadsheet, copying amy header across to each file?

You can do this very cleanly in DuckDB because it can stream large tables and generate row numbers — which is exactly what you need for chunking.

Below is a simple, reliable method that splits a huge spreadsheet into multiple files of 50,000 rows each, automatically repeating the header.

DuckDB: Split Large Spreadsheet into 50,000-row Files

1. Core idea

We:

1. Load file

2. Add row numbers

3. Assign a “chunk ID” every 50,000 rows

4. Export each chunk separately (header included automatically)

2. Step-by-step DuckDB solution

Step A — Create a numbered view

CREATE VIEW base AS

SELECT

    *,

    row_number() OVER () AS rn

FROM read_csv_auto(‘large.csv’);

Step B — Assign chunk groups (50,000 rows each)

CREATE VIEW chunked AS

SELECT

    *,

    CAST((rn – 1) / 50000 AS INTEGER) AS chunk_id

FROM base;

Step C — Export each chunk (automatic splitting)

DuckDB doesn’t loop internally, so you export per chunk like this:

Example: chunk 0

COPY (

    SELECT *

    FROM chunked

    WHERE chunk_id = 0

    ORDER BY rn

)

TO ‘output_chunk_0.csv’

(HEADER);

chunk 1

COPY (

    SELECT *

    FROM chunked

    WHERE chunk_id = 1

    ORDER BY rn

)

TO ‘output_chunk_1.csv’

(HEADER);

3. One-shot automated version (Python + DuckDB — recommended)

This avoids manually writing chunk numbers:

import duckdb

con = duckdb.connect()

# Create chunked dataset

con.execute(“””

CREATE VIEW chunked AS

SELECT *,

       row_number() OVER () AS rn,

       CAST((row_number() OVER () – 1) / 50000 AS INTEGER) AS chunk_id

FROM read_csv_auto(‘large.csv’);

“””)

# Find how many chunks exist

chunks = con.execute(“SELECT MAX(chunk_id) FROM chunked”).fetchone()[0]

# Export all chunks

for i in range(chunks + 1):

    con.execute(f”””

        COPY (

            SELECT *

            FROM chunked

            WHERE chunk_id = {i}

            ORDER BY rn

        )

        TO ‘output_chunk_{i}.csv’

        (HEADER);

    “””)

4. Why this works well for very large astronomy datasets

No full in-memory load required

Streaming execution (DuckDB handles disk efficiently)

Stable ordering via row_number()

Header automatically included per file

Works for CSV, Parquet, or mixed formats

5. Optional improvements (useful for your radio data)

Keep only relevant columns (faster splits)

SELECT date, time, ID, Event

Or split already-filtered data

WHERE ID = ‘C’

Parquet version (much faster than CSV)

COPY chunked TO ‘chunk_0.parquet’ (FORMAT PARQUET);

6. Best-practice version for your use case

For radio astronomy logs:

1. Filter early (ID)

2. Convert timestamp once

3. Chunk last

4. Export Parquet if possible

If you want next step, I can turn this into:

a single Windows batch script

or an automatic “drop CSV → outputs chunks” tool

or integrate it with your SDR/1420 MHz pipeline so chunks become time-sliced observation blocks

Can I copy headers across in DuckDB?

Yes — but with an important nuance:

DuckDB doesn’t need you to manually define headers, because it already treats the first row as column names when using read_csv_auto(). The trick is making sure that behaviour is preserved when you split files.

So the real question becomes:

> “Can DuckDB automatically preserve the detected schema (header) when exporting chunks?”

Answer: Yes — automatically, as long as you export using COPY … (HEADER) from a properly inferred table/view.

✔ Best practice: fully automatic header handling in DuckDB

1. Let DuckDB infer the header once

CREATE VIEW base AS

SELECT *

FROM read_csv_auto(‘large.csv’);

At this point:

Column names are already detected

You do NOT need to reference the header again

2. Add chunking (safe for huge files)

CREATE VIEW chunked AS

SELECT *,

       row_number() OVER () AS rn,

       CAST((row_number() OVER () – 1) / 50000 AS INTEGER) AS chunk_id

FROM base;

3. Export chunks — header is automatically reused

This is the key point:

COPY (

    SELECT *

    FROM chunked

    WHERE chunk_id = 0

    ORDER BY rn

)

TO ‘chunk_0.csv’

(HEADER);

What happens here:

DuckDB automatically writes column names from the schema

No need to “re-copy” or “re-detect” header

Every chunk gets identical header row

✔ Even simpler: no header handling at all

If you do:

COPY base TO ‘file.csv’ (HEADER);

DuckDB:

Writes header automatically

Uses schema metadata (not first row of data)

So header handling is entirely schema-driven, not file-driven.

⚠ When header problems do happen

Only in these cases:

1. Mixed-format CSVs

Different files with inconsistent columns

2. Manual file concatenation (outside DuckDB)

Using cat or similar tools breaks consistency

3. Disabling header detection accidentally

If you use:

read_csv(‘file.csv’, header=false)

then you must define columns manually

✔ If you want full automation (no chunk loop in SQL)

Use Python wrapper (recommended for real workflows):

import duckdb

con = duckdb.connect()

con.execute(“””

CREATE VIEW chunked AS

SELECT *,

       row_number() OVER () AS rn,

       CAST((row_number() OVER () – 1) / 50000 AS INTEGER) AS chunk_id

FROM read_csv_auto(‘large.csv’);

“””)

n = con.execute(“SELECT MAX(chunk_id) FROM chunked”).fetchone()[0]

for i in range(n + 1):

    con.execute(f”””

        COPY (

            SELECT *

            FROM chunked

            WHERE chunk_id = {i}

            ORDER BY rn

        )

        TO ‘chunk_{i}.csv’

        (HEADER);

    “””)

✔ Key takeaway

You do NOT need to manually copy or manage headers.

DuckDB:

Infers header → schema

Stores it internally

Automatically writes it on every COPY … (HEADER) export

By Admin

Leave a Reply

Your email address will not be published. Required fields are marked *


This site uses Akismet to reduce spam. Learn how your comment data is processed.