Linux data manipulation
Whenever I am creating or manipulating files, my first port of call is Linux (or Git Bash). Windows certainly has tools that can achieve the same results, but I find Linux handles larger files more efficiently and lets you perform much more complex tasks in one go without resorting to code.
SEQ COMMAND
This command provides a method of creating number sequences. The first number is the starting number, the second is the increment value and the third the maximum value.
$ seq 1000000 10 9000000
1000000
1000010
1000020
…..
9000000
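A few other forms of seq are worth knowing. The sketch below assumes GNU seq (as shipped with most Linux distributions and Git Bash); the variable names are only for illustration.

```shell
# Two-argument form: the increment defaults to 1.
seq 8 10

# -w pads with leading zeros so every number has the same width.
padded=$(seq -w 8 10)
echo "$padded"

# -s sets the separator placed between numbers (a newline by default).
joined=$(seq -s, 1 5)
echo "$joined"
```

The padded form is handy for building file names that sort correctly, and the comma-separated form gives you a quick single-row CSV.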
REV COMMAND
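This command reverses each line of its input, character by character. Given a file name as an argument it reverses the contents of that file, so to reverse a string you pipe it in.
$ echo filename01.csv | rev
vsc.10emanelif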
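Now imagine you wanted to change filename-01.csv to filename-02.csv. You could run the following command, which strips the extension, reverses the name so that the final digit comes first and is easy to target, swaps it, reverses the name back and re-appends the extension.
$ echo filename-01.csv | cut -d. -f1 | rev | sed 's/^1/2/' | rev | sed 's/$/.csv/'
filename-02.csv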
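The same reverse, edit, reverse-back pattern is useful whenever you care about the end of a string rather than the start. As a rough sketch, here it is pulling the extension off a name that contains several dots; the file name is made up for the example.

```shell
# Hypothetical file name, used only for illustration.
name="archive.2024.backup.csv"

# Reverse the name, keep the first dot-separated field
# (which is the last field of the original), then reverse it back.
ext=$(echo "$name" | rev | cut -d. -f1 | rev)
echo "$ext"
```

cut can only count fields from the left, so reversing first is a simple way to ask for "the last field" without knowing how many fields there are.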
FOR COMMAND
This command lets you loop over a block of commands. In this example we use the seq command to create a sequence from 1 to 10 and echo each value out.
$ for i in $(seq 1 10)
do
echo "$i"
done
1
2
3
…
10
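Loops become more useful once the body does some real work. The sketch below, which assumes seq supports -w for zero padding and uses a throwaway directory from mktemp, creates ten numbered CSV files that each contain only a header row. The file names and header are illustrative.

```shell
# Work in a temporary directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# seq -w 1 10 yields 01, 02, ... 10, so the file names sort correctly.
for i in $(seq -w 1 10)
do
    echo "column1,column2" > "$workdir/filename$i.csv"
done

ls "$workdir"
```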
SED COMMAND
This command can replace or insert text within a string or file. In this example we add a header to an existing file called filename01.csv.
FILENAME01.CSV (PRE UPDATE)
1000000,9000000
$ sed -i '1i column1,column2' filename01.csv
FILENAME01.CSV (POST UPDATE)
column1,column2
1000000,9000000
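The more common use of sed is substitution. As a small sketch, the commands below swap the comma delimiter for a semicolon in a temporary copy of the data; the temporary file stands in for filename01.csv and the choice of semicolon is arbitrary. Without -i, sed prints the result and leaves the file untouched.

```shell
# Build a small sample file so the example is self-contained.
tmp=$(mktemp)
printf 'column1,column2\n1000000,9000000\n' > "$tmp"

# s/old/new/g replaces every occurrence on every line.
sed 's/,/;/g' "$tmp"

# An address range limits the change: 2,$ means line 2 to the end,
# so the header row is left alone.
result=$(sed '2,$ s/,/;/g' "$tmp")
echo "$result"
```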
SPLIT COMMAND
This command breaks a large file into smaller files. For example, if you have a data file with a million rows, you could split it into 100 files of 10,000 rows each. The output files are named with the prefix x followed by a two-letter suffix.
$ split -l 10000 onemillion.csv
$ ls x*
xaa
xab
xac
…
xdv
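The default x prefix is not very descriptive, so split accepts an optional second argument to use as the prefix instead. Here is a scaled-down sketch on a 25-line file; the part_ prefix and the temporary directory are just for the example.

```shell
# Create a 25-line sample file in a temporary directory.
workdir=$(mktemp -d)
seq 1 25 > "$workdir/rows.txt"

# 10 lines per output file, named part_aa, part_ab, part_ac.
split -l 10 "$workdir/rows.txt" "$workdir/part_"

# The last file holds the 5 leftover lines.
wc -l "$workdir"/part_*
```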