How to cut a file in half with DuckDB, and also how to split by date?

One of the nicest features of DuckDB is that you don’t need to import a CSV before working with it.

For example, if your file is called large.csv:

View the first 10 rows

SELECT * FROM 'large.csv' LIMIT 10;

Keep only the first 100,000 rows and save them

COPY (
    SELECT *
    FROM 'large.csv'
    LIMIT 100000
) TO 'small.csv' (HEADER, DELIMITER ',');

Extract a range of rows (e.g. rows 100,001 to 200,000)

COPY (
    SELECT *
    FROM 'large.csv'
    LIMIT 100000 OFFSET 100000
) TO 'chunk2.csv' (HEADER, DELIMITER ',');

Filter by a column

COPY (
    SELECT *
    FROM 'large.csv'
    WHERE frequency > 1420
) TO 'filtered.csv' (HEADER);

If by “cut a CSV in half” you literally mean splitting it into two equal-sized CSV files, I can show you the simplest way to do that as well.

To split a CSV into two equal parts using DuckDB, you can do it in a clean SQL way by numbering rows first.

Step 1: Open DuckDB

In Command Prompt:

duckdb

Step 2: Split CSV in half

This works by assigning a row number, counting total rows, then splitting at the midpoint:

WITH numbered AS (
    SELECT
        *,
        row_number() OVER () AS rn,
        count(*) OVER () AS total_rows
    FROM 'large.csv'
)

-- First half
COPY (
    SELECT *
    FROM numbered
    WHERE rn <= total_rows / 2
) TO 'half1.csv' (HEADER, DELIMITER ',');

-- Second half
COPY (
    SELECT *
    FROM numbered
    WHERE rn > total_rows / 2
) TO 'half2.csv' (HEADER, DELIMITER ',');

Important notes

  • This is “equal rows”, not equal file size.
  • Row order is whatever order the CSV is read in (if order matters, we can stabilise it).
  • If your file is huge, DuckDB streams it efficiently so you don’t need to load it into memory.

If you want something more robust

If your CSV has a time column (like timestamps from SDR or astronomy logs), it’s often better to split by time instead of row count. I can show that version too if useful.

To split a CSV into one file per day using DuckDB, the key idea is:

> extract the date from your timestamp column → group/filter by it → export each group

1. Open DuckDB

duckdb

2. Check your date column

Assume your CSV has a column like:

timestamp (e.g. 2026-06-28 12:34:56)

We convert it to a date like this:

SELECT DISTINCT CAST(timestamp AS DATE) AS d

FROM ‘large.csv’

ORDER BY d;

3. Export ONE day (test first)

COPY (

    SELECT *

    FROM ‘large.csv’

    WHERE CAST(timestamp AS DATE) = ‘2026-06-28’

) TO ‘2026-06-28.csv’ (HEADER, DELIMITER ‘,’);

4. Automate: one file per day (Windows PowerShell way)

Exit DuckDB, then run this in PowerShell:

$days = duckdb -c “SELECT DISTINCT CAST(timestamp AS DATE) FROM ‘large.csv’;”

foreach ($day in $days) {

    duckdb -c “COPY (

        SELECT *

        FROM ‘large.csv’

        WHERE CAST(timestamp AS DATE) = ‘$day’

    ) TO ‘$day.csv’ (HEADER, DELIMITER ‘,’);”

}

5. If your column is already a DATE (faster)

If you already have a date column:

WHERE date = ‘2026-06-28’

Important notes

This works even for very large CSVs because DuckDB streams the data.

Output files are created per unique date.

If your timestamp format is weird (e.g. ISO strings), we may need strptime() instead of CAST().

By Admin

Leave a Reply

Your email address will not be published. Required fields are marked *


This site uses Akismet to reduce spam. Learn how your comment data is processed.