Practical Shell Commands to Manipulate Genome Data
Fast and Neat!
- Filter file 2 by file 1 column 1.
awk 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3
Explanation:
FNR == NR
: FNR is the record (line) number within the current file, while NR is the record number across all input files, so this test is true only while reading the first file; for the second file, NR equals the number of lines of file1 + FNR.
a[$1]
: Create an array element indexed by the first field of file1.
next: skip to the next record so no more processing is done on file1.
!($1 in a)
: Check whether the first field ($1) is NOT present in the array, i.e. not in file1, and if so print the whole line (to file3).
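For instance, with two invented toy files, only the file2 lines whose first field is absent from file1 survive:
printf 'geneA\ngeneB\n' > file1                 # IDs to exclude (toy data)
printf 'geneA 10\ngeneC 7\ngeneB 3\n' > file2   # lines to filter (toy data)
awk 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3
cat file3                                       # -> geneC 7
The companion command just below keeps the matching lines instead.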
- Conversely, keep the lines of file2 whose first column does appear in file1:
awk 'FNR==NR{a[$1];next};($1 in a)' file1 file2 > file3
- Print the numeric part of certain colon-delimited columns (e.g. SAM optional fields such as "NM:i:3"):
awk '{split($13, editDist, ":"); split($16, mismatch, ":"); split($17, gap, ":"); split($18, gapLen, ":"); print editDist[3]"\t"mismatch[3]+gapLen[3]"\t"mismatch[3]"\t"gap[3]"\t"gapLen[3]}' n3yesn2no.sam >n3yesn2no_summary.sam
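As a quick sanity check of the split() calls, a minimal sketch on a single invented SAM-style tag:
echo 'NM:i:3' | awk '{split($1, editDist, ":"); print editDist[3]}'   # -> 3
split() breaks "NM:i:3" into editDist[1]="NM", editDist[2]="i", editDist[3]="3", so the third element holds the numeric value.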
- Group by the first and second column and sum the third column.
awk '{a[$1" "$2]+=$3}END{for (i in a) print i,a[i]}' aaa.txt | sort
- Insert a column after a certain column, e.g. insert a "0" after the second column:
awk '$2 = $2 FS "0"' file >outfile
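A one-line sketch of the effect (input invented):
echo 'a b c' | awk '$2 = $2 FS "0"'   # -> a b 0 c
The assignment to $2 both appends the new field and evaluates as a true pattern, so every line is printed.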
- Sort by chromosome then position, e.g. "chr1 1005; chr1 105" -> "chr1 105; chr1 1005":
sort -k1,1V -k2,2n infile >outfile
sort -g
: general numeric sort, which also handles exponential (scientific) notation
sort -r
: reverse (decreasing) order
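The version sort matters for chromosome names, e.g. on invented input:
printf 'chr10\t5\nchr2\t9\nchr2\t10\n' | sort -k1,1V -k2,2n
# -> chr2 9, then chr2 10, then chr10 5 (a plain lexicographic sort would put chr10 before chr2)
Note that -V is a GNU sort extension.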
- Keep only the counts when writing wc -l output to a file, i.e. remove the path of each counted file:
sed 's/\([0-9]*\).*/\1/' input >outfile
Input:
284 /p/keles/ENCODE-TE/volume13/SRR881996_Large/defaultDir/fithicDir/chr1/chr1.sign.contact
281 /p/keles/ENCODE-TE/volume13/SRR881996_Large/defaultDir/fithicDir/chr2/chr2.sign.contact
Output:
284
281
- Split a large file into parts.
split [OPTION] [INPUT [PREFIX]]
-l split into parts with an equal number of lines
-b split into parts with an equal number of bytes
-d use a numeric suffix, like prefix00 prefix01
-a 1 use a single-digit numeric suffix (prefix0 prefix1 ...)
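For example (file name invented), a hypothetical big.txt can be cut into 1000-line chunks with single-digit numeric suffixes:
split -l 1000 -d -a 1 big.txt part   # -> part0, part1, part2, ...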
- Find the sizes of multiple folders (with a grand total):
du -sch *
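The -c flag appends a grand total line; a handy variation is to pipe through sort -h (human-numeric sort) to order folders by size:
du -sch * | sort -h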
- awk filter by a shell variable's value. The shell variable cannot be used directly inside the awk script; instead, assign it as an awk variable after the script and compare it with the column of interest:
thre=760
awk '$1 < thre {print $0}' thre=$thre input
- awk: modify only the first column (here, keep the first two dot-separated pieces of the read ID) and print it together with all the remaining columns:
awk -v OFS="\t" '{split($1, id, "."); $1=""; {print id[1]"."id[2], $0}}' SRR881997_2_01_noheader.sam >/mnt/gluster/yzheng74/HiC/HiCPro/data/SRR881997_2_01_noheader
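A minimal sketch of the same idea on an invented line; note that concatenating $0 directly (no comma, unlike the command above) avoids printing a doubled separator:
echo 'SRR881997.1.extra 0 chr1' | awk -v OFS="\t" '{split($1, id, "."); $1=""; print id[1]"."id[2] $0}'
# -> SRR881997.1<TAB>0<TAB>chr1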
- Keep only the first row when duplicates appear in the key column (the commands below key on column 1; a multi-column key, e.g. columns 2-5, is sketched after this item).
awk '{if(! a[$1]){print; a[$1]++}}'
A more condensed way:
awk '!a[$1]++' file
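For a multi-column key such as columns 2-5, index the array on those fields instead (the comma joins them with awk's built-in SUBSEP):
awk '!seen[$2,$3,$4,$5]++' file
printf 'x 1 2 3 4\ny 1 2 3 4\n' | awk '!seen[$2,$3,$4,$5]++'   # -> only the x line survives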
- Sum the values of one column (the printf form below avoids scientific-notation output for very large sums):
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
awk '{s+=$1} END {print s}' mydatafile
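A quick check with generated input:
seq 1 100 | awk '{s+=$1} END {print s}'   # -> 5050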
- Read a comma-separated line from a file and save it as a bash array.
IFS=',' read -r -a test <infile
- When sort or Picard runs, use a tmp folder created locally instead of /tmp or $TMPDIR, for space-quota safety:
sort -T
java -Djava.io.tmpdir=tmp
java -jar picard.jar <Tool> TMP_DIR=tmp
- Always use
sort -T /localFolder/sorttmp
when sorting a large file on a group server where you have no control over the timely cleaning of temporary files under /tmp/. First run mkdir -p /localFolder/sorttmp
and save all the temporary sorting files there, to avoid running out of space under /tmp/
and having your long sorting jobs terminated.
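The typical pattern, with illustrative paths and file names:
mkdir -p /localFolder/sorttmp
sort -T /localFolder/sorttmp -k1,1 -k2,2n big.bed > big.sorted.bed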
- Search for a string containing a TAB in a file:
grep -P 'A\tB' file
- Make a folder world readable (in fact, fully accessible to everyone) recursively:
chmod 777 -R folder
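If world read access is all that is intended, a less permissive alternative is:
chmod -R a+rX folder   # capital X grants execute on directories (for traversal) and on files already executable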
- Insert one column at any position, e.g. insert "0,255,0" as the new 18th column:
awk -v OFS="\t" '$17 = $17 OFS "0,255,0"' inFile >outFile
- Print the lines whose first column contains a certain pattern (partial match):
awk '$1 ~ /snow/ { print $0}' dummy_file
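For example, on invented input:
printf 'snowball 1\nrain 2\n' | awk '$1 ~ /snow/ { print $0}'   # -> snowball 1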
- Print the columns that start with "MD":
awk '{for (i=1;i<=NF;i++){if ($i ~/^MD:/) {print $i}}}' infile
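For example, on an invented SAM-like line:
printf 'r1\t0\tMD:Z:30A5\n' | awk '{for (i=1;i<=NF;i++){if ($i ~/^MD:/) {print $i}}}'   # -> MD:Z:30A5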
- Merge two files, interleaving every two lines from one file with two lines from the other:
awk '{key=$0; getline; getline x<"testsq"; getline y<"testsq";print key "\n" $0 "\n" x "\n" y;}' testseq
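A toy illustration with invented file contents (the file names testseq and testsq come from the command above):
printf 'A1\nA2\nB1\nB2\n' > testseq
printf 'a1\na2\nb1\nb2\n' > testsq
awk '{key=$0; getline; getline x<"testsq"; getline y<"testsq";print key "\n" $0 "\n" x "\n" y;}' testseq
# -> A1, A2, a1, a2, B1, B2, b1, b2 (one per line)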
License
Copyright 2017-present Ye Zheng.
Released under the MIT license.