# How to Split a CSV into Smaller CSVs


🗓 May 17, 2020 | 👱 By: Hugh


In this post I'll show you how to split up a big CSV into multiple smaller CSVs.

There are two methods I see being most commonly used:

  1. Splitting into N even parts. e.g. splitting into 6 files.

    # Store the heading row.
    csvheader=`head -n 1 input_file.csv`
    # Replace "6" with number of files you want.
    split -d -n "l/6" input_file.csv output_file_
    # Add heading row to files other than the first one.
    find . -name "output_file_*" | grep -v 00 | xargs sed -i "1s/^/$csvheader\n/"
    # Add a .csv to each of the files.
    find . -name "output_file_*" | xargs -I {} mv {} {}.csv
    

    (Note: thats an l not a 1)

  2. Splitting into files with N lines each. e.g. each file should have 200,000 lines.

    # Store the heading row.
    csvheader=`head -n 1 input_file.csv`
    # Replace 200000 with number of lines you want per file.
    split -d -l "200000" input_file.csv output_file_
    # Add heading row to files other than the first one.
    find . -name "output_file_*" | grep -v 00 | xargs sed -i "1s/^/$csvheader\n/"
    # Add a .csv to each of the files.
    find . -name "output_file_*" | xargs -I {} mv {} {}.csv
    

    (Note: since we split into 200,000 line files, then add the header on, most files will have 200,001 lines, the exceptions being the first and probably last. If you really need 200,000 max, either use 199,999 and deal with the first being one short, or delete the header, split to 199,999 lines, then add the header to all files.)

# The explanation

Most will have copy pasted and moved on, but I'm going to add some extra info if you want to read about what each line is doing.

# Get the column headings

csvheader=`head -n 1 input_file.csv`

This line saves the first line into the variable csvheader. The assumption, of course, is that the header is on the first line. This will be added to the top of the output files.

# Option 1: Split into N parts

split -d -n "l/6" input_file.csv output_file_

The -n option means split into a certain number of parts. Using l/N means to keep each line intact. if you used -n 6 it would split into 6 evenly sized files, even if this means splitting a line in half to make them even.

-d means use a decimal suffix, i.e. put 00, 01, 02... at the end of output file names.

The output_file_ at the end means the start of the output files will be output_file_. Combined with the suffix above, you end up with files like output_file_00.

# Option 2: Splitting into N lines

The only difference here is that the -l option means split into files with N lines, and create as many output files as is required to do so.

# Adding the headers back in

find . -name "output_file_*" | grep -v 00 | xargs sed -i "1s/^/$csvheader\n/"

This line finds each of the output files (make sure you update the search term so that it only finds the output files), skips the first one that already has the headers, and inserts the headers in to the rest.

# Add .csv suffix

find . -name "output_file_*" | xargs -I {} mv {} {}.csv

This line takes all the input files, and uses xargs + mv to rename files to add the .csv extension on the end.

# The end!

And that is how you split a CSV into smaller parts!