Below is a single unified DuckDB instruction set combining both versions into one clean, consistent workflow for large radio astronomy event datasets.
Assumes columns:
date(YYYY-MM-DD)time(HH:MM:SS)IDEvent
Unified DuckDB Workflow for Large Radio Astronomy Event Data
1. Create a reusable base view (timestamp normalisation)
This is the foundation for everything else:
CREATE VIEW base AS
SELECT
*,
strptime(date || ' ' || time, '%Y-%m-%d %H:%M:%S') AS dt
FROM 'large.csv';
2. Filter to target dataset + create hourly event counts
This reduces data early and prepares it for analysis:
CREATE VIEW id_c_hourly AS
SELECT
date_trunc('hour', dt) AS hour,
Event,
COUNT(*) AS n
FROM base
WHERE ID = 'C'
GROUP BY 1, 2;
This produces a tidy structure:
- hour
- event type
- count per hour
3. Pivot events into columns (analysis-ready format)
If event categories are known:
CREATE VIEW id_c_pivot AS
SELECT *
FROM id_c_hourly
PIVOT (
SUM(n) FOR Event IN ('A', 'B', 'C')
)
ORDER BY hour;
4. Export final pivot table
COPY id_c_pivot
TO 'ID_C_hourly_pivot.csv'
(HEADER);
5. Full one-shot pipeline (no views, direct output)
Single execution version:
COPY (
SELECT *
FROM (
SELECT
date_trunc('hour', dt) AS hour,
Event
FROM (
SELECT *,
strptime(date || ' ' || time, '%Y-%m-%d %H:%M:%S') AS dt
FROM 'large.csv'
)
WHERE ID = 'C'
)
PIVOT (
COUNT(*) FOR Event IN ('A', 'B', 'C')
)
ORDER BY hour
)
TO 'ID_C_hourly_pivot.csv'
(HEADER);
6. Recommended scientific format (better than pivot for analysis)
This is often more useful for radio astronomy workflows, plotting, and signal analysis:
SELECT
date_trunc('hour', dt) AS hour,
Event,
COUNT(*) AS intensity
FROM base
WHERE ID = 'C'
GROUP BY 1, 2
ORDER BY 1;
7. Optional: discover event classes automatically
Before pivoting:
SELECT DISTINCT Event
FROM 'large.csv'
WHERE ID = 'C';
Then use results in:
PIVOT (... FOR Event IN (...))
8. Performance rules for large datasets (important)
- Always filter early:
WHERE ID = 'C' - Avoid repeated timestamp parsing (use
baseview) - Prefer Parquet over CSV when possible:
SELECT * FROM 'large.parquet'; - Select only required columns in large-scale runs
- Use hourly aggregation before pivoting for efficiency
9. Recommended final workflow (best practice summary)
- Create
baseview (timestamp conversion once) - Filter by
IDearly - Aggregate into hourly bins
- Optionally pivot for spreadsheet-style output
- Export to CSV or Parquet
How would I use DuckDB to divide a very large spreadsheets into smaller ones maximum 50000 lines per spreadsheet, copying amy header across to each file?
You can do this very cleanly in DuckDB because it can stream large tables and generate row numbers — which is exactly what you need for chunking.
Below is a simple, reliable method that splits a huge spreadsheet into multiple files of 50,000 rows each, automatically repeating the header.
—
DuckDB: Split Large Spreadsheet into 50,000-row Files
1. Core idea
We:
1. Load file
2. Add row numbers
3. Assign a “chunk ID” every 50,000 rows
4. Export each chunk separately (header included automatically)
—
2. Step-by-step DuckDB solution
Step A — Create a numbered view
CREATE VIEW base AS
SELECT
*,
row_number() OVER () AS rn
FROM read_csv_auto(‘large.csv’);
—
Step B — Assign chunk groups (50,000 rows each)
CREATE VIEW chunked AS
SELECT
*,
CAST((rn – 1) / 50000 AS INTEGER) AS chunk_id
FROM base;
—
Step C — Export each chunk (automatic splitting)
DuckDB doesn’t loop internally, so you export per chunk like this:
Example: chunk 0
COPY (
SELECT *
FROM chunked
WHERE chunk_id = 0
ORDER BY rn
)
TO ‘output_chunk_0.csv’
(HEADER);
—
chunk 1
COPY (
SELECT *
FROM chunked
WHERE chunk_id = 1
ORDER BY rn
)
TO ‘output_chunk_1.csv’
(HEADER);
—
3. One-shot automated version (Python + DuckDB — recommended)
This avoids manually writing chunk numbers:
import duckdb
con = duckdb.connect()
# Create chunked dataset
con.execute(“””
CREATE VIEW chunked AS
SELECT *,
row_number() OVER () AS rn,
CAST((row_number() OVER () – 1) / 50000 AS INTEGER) AS chunk_id
FROM read_csv_auto(‘large.csv’);
“””)
# Find how many chunks exist
chunks = con.execute(“SELECT MAX(chunk_id) FROM chunked”).fetchone()[0]
# Export all chunks
for i in range(chunks + 1):
con.execute(f”””
COPY (
SELECT *
FROM chunked
WHERE chunk_id = {i}
ORDER BY rn
)
TO ‘output_chunk_{i}.csv’
(HEADER);
“””)
—
4. Why this works well for very large astronomy datasets
No full in-memory load required
Streaming execution (DuckDB handles disk efficiently)
Stable ordering via row_number()
Header automatically included per file
Works for CSV, Parquet, or mixed formats
—
5. Optional improvements (useful for your radio data)
Keep only relevant columns (faster splits)
SELECT date, time, ID, Event
Or split already-filtered data
WHERE ID = ‘C’
Parquet version (much faster than CSV)
COPY chunked TO ‘chunk_0.parquet’ (FORMAT PARQUET);
—
6. Best-practice version for your use case
For radio astronomy logs:
1. Filter early (ID)
2. Convert timestamp once
3. Chunk last
4. Export Parquet if possible
—
If you want next step, I can turn this into:
a single Windows batch script
or an automatic “drop CSV → outputs chunks” tool
or integrate it with your SDR/1420 MHz pipeline so chunks become time-sliced observation blocks
Can I copy headers across in DuckDB?
Yes — but with an important nuance:
DuckDB doesn’t need you to manually define headers, because it already treats the first row as column names when using read_csv_auto(). The trick is making sure that behaviour is preserved when you split files.
So the real question becomes:
> “Can DuckDB automatically preserve the detected schema (header) when exporting chunks?”
Answer: Yes — automatically, as long as you export using COPY … (HEADER) from a properly inferred table/view.
—
✔ Best practice: fully automatic header handling in DuckDB
1. Let DuckDB infer the header once
CREATE VIEW base AS
SELECT *
FROM read_csv_auto(‘large.csv’);
At this point:
Column names are already detected
You do NOT need to reference the header again
—
2. Add chunking (safe for huge files)
CREATE VIEW chunked AS
SELECT *,
row_number() OVER () AS rn,
CAST((row_number() OVER () – 1) / 50000 AS INTEGER) AS chunk_id
FROM base;
—
3. Export chunks — header is automatically reused
This is the key point:
COPY (
SELECT *
FROM chunked
WHERE chunk_id = 0
ORDER BY rn
)
TO ‘chunk_0.csv’
(HEADER);
What happens here:
DuckDB automatically writes column names from the schema
No need to “re-copy” or “re-detect” header
Every chunk gets identical header row
—
✔ Even simpler: no header handling at all
If you do:
COPY base TO ‘file.csv’ (HEADER);
DuckDB:
Writes header automatically
Uses schema metadata (not first row of data)
So header handling is entirely schema-driven, not file-driven.
—
⚠ When header problems do happen
Only in these cases:
1. Mixed-format CSVs
Different files with inconsistent columns
2. Manual file concatenation (outside DuckDB)
Using cat or similar tools breaks consistency
3. Disabling header detection accidentally
If you use:
read_csv(‘file.csv’, header=false)
then you must define columns manually
—
✔ If you want full automation (no chunk loop in SQL)
Use Python wrapper (recommended for real workflows):
import duckdb
con = duckdb.connect()
con.execute(“””
CREATE VIEW chunked AS
SELECT *,
row_number() OVER () AS rn,
CAST((row_number() OVER () – 1) / 50000 AS INTEGER) AS chunk_id
FROM read_csv_auto(‘large.csv’);
“””)
n = con.execute(“SELECT MAX(chunk_id) FROM chunked”).fetchone()[0]
for i in range(n + 1):
con.execute(f”””
COPY (
SELECT *
FROM chunked
WHERE chunk_id = {i}
ORDER BY rn
)
TO ‘chunk_{i}.csv’
(HEADER);
“””)
—
✔ Key takeaway
You do NOT need to manually copy or manage headers.
DuckDB:
Infers header → schema
Stores it internally
Automatically writes it on every COPY … (HEADER) export