Managing data
Joachim Jacob 8 and 15 November 2013
Bioinformatics data
Historically, bioinformatics has always used text files to store data.
PDB file excerpt
Genbank record
HMM profile
NGS data
The NGS machines spit out a lot of data, stored in plain text files. These files are multiple gigabytes in size.
Tips for managing NGS data
1. When you move the data, do it in its smallest form. Compress the data.
2. When you unpack the data, leave it where it is. Symbolic links point to the data in different folders.
3. Provide enough storage for your data. Choose your file system type wisely.
Compression: tools in Linux
And some more exist...
http://www.linuxlinks.com/article/20110220011101131/CompressionTools.html
Tips
Widely used compression tools:
- GNU zip (gzip)
- Block Sorting compression (bzip2)
Typically, compression tools work on one file. How to compress directories and their contents?
Tar without compression
Tar (Tape Archive) is a tool for bundling a set of files or directories into a single archive. The resulting file is called a tarball.
Syntax to create a tarball:
$ tar -cf archive.tar file1 file2
Syntax to extract:
$ tar -xvf /path/to/archive.tar
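The create and extract syntax above can be tried out safely in a throwaway directory. This is a minimal sketch: the file contents and the `unpacked` folder are made up for illustration, and `-t` (list contents) and `-C` (extract into another folder) are standard tar options that the slide does not mention.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two small example files (hypothetical content).
printf 'ACGT\n' > file1
printf 'TTGA\n' > file2

# c = create, f = name of the archive file.
tar -cf archive.tar file1 file2

# t = list the contents of the archive without extracting it.
listing=$(tar -tf archive.tar)
echo "$listing"

# x = extract; -C extracts into the given folder instead of the current one.
mkdir unpacked
tar -xf archive.tar -C unpacked
restored=$(cat unpacked/file1)
echo "$restored"
```

Listing the archive first is a cheap way to check what a tarball will write before you unpack it.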
Compression: a typical case
Archiving and compression mostly occur together. The most used formats are tar.gz or tar.bz. These files are the result of two processes:
- Archiving (tar)
- Compressing (gzip or bzip2)
Compression: on your desktop
Compression: on the command line
Tar is the tool for creating .tar archives, but it can compress in one go, with the z or j option.
Creating a compressed tar archive:
$ tar cvfz mytararchive.tar.gz docs/
$ tar cvfj mytararchive.tar.bz docs/
c: create
z or j: compression technique
Decompressing a compressed tar archive:
$ tar xvfz mytararchive.tar.gz
$ tar xvfj mytararchive.tar.bz
x: extract
v: verbose
f: files
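The round trip of creating and unpacking a compressed tar archive might look as follows. This is a sketch under assumptions: the `docs/` content and the `restore` folder are invented for the example, only the gzip variant (z) is shown, and the v option is left out to keep the output short.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A small folder to archive (hypothetical content).
mkdir docs
printf 'sample notes\n' > docs/notes.txt

# c = create, z = compress with gzip, f = name of the archive.
tar czf mytararchive.tar.gz docs/

# The result really is a gzip-compressed tar file: decompressing it by hand
# and handing the stream to tar gives the same listing.
n_hits=$(gunzip -c mytararchive.tar.gz | tar -tf - | grep -c 'notes.txt')

# x = extract; -C unpacks into a separate folder.
mkdir restore
tar xzf mytararchive.tar.gz -C restore
unpacked=$(cat restore/docs/notes.txt)
echo "$unpacked"
```

Replacing z with j in both commands would use bzip2 instead, provided bzip2 is installed.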
De-/compression
To compress one or more files:
$ gzip [options] file
$ bzip2 [options] file
To decompress one or more files:
$ gunzip [options] file(s)
$ bunzip2 [options] file(s)
Tips
Many compression tools on the command line allow you to read compressed files directly (instead of first unpacking, then reading).
$ zcat file(s)
$ bzcat file(s)
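As an illustration of reading a compressed file without unpacking it first, the sketch below counts the records in a gzipped FASTA file. The file name and sequences are made up, and it assumes a Linux system where `zcat` handles `.gz` files.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A tiny FASTA file with two records (hypothetical content).
printf '>seq1\nACGT\n>seq2\nTTGA\n' > reads.fasta

# gzip replaces reads.fasta with reads.fasta.gz.
gzip reads.fasta

# zcat streams the decompressed content; the file on disk stays compressed.
n_records=$(zcat reads.fasta.gz | grep -c '>')
echo "$n_records"
```

The compressed file is the only copy on disk throughout, which is the point when the data is multiple gigabytes in size.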
Compression is always a balance between time and compression ratio. Gzip is faster, bzip2 compresses harder. If compression is important to you: benchmark!
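A benchmark can start as small as comparing output sizes on a sample of your own data. The sketch below uses a generated, highly repetitive text file, which is an assumption made for the example; real sequencing data will compress differently.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a repetitive text file, loosely resembling sequence data (made-up content).
i=0
while [ "$i" -lt 20000 ]; do
    echo "read_$i ACGTACGTTTGACCAGTACCGTA"
    i=$((i + 1))
done > sample.txt

orig_size=$(wc -c < sample.txt)

# -c writes the compressed stream to standard output, so sample.txt stays untouched.
gz_size=$(gzip -c sample.txt | wc -c)
echo "original: $((orig_size)) bytes"
echo "gzip:     $((gz_size)) bytes"

# bzip2 is not installed everywhere, so only run it when it is available.
if command -v bzip2 > /dev/null 2>&1; then
    bz_size=$(bzip2 -c sample.txt | wc -c)
    echo "bzip2:    $((bz_size)) bytes"
fi
```

To compare speed as well, each compression command can be prefixed with the shell's `time` keyword in bash.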
Exercise
a little compression exercise.
Symlinks
Pay attention. Something very convenient! A symbolic link (or symlink) is a file which points to the location of the linked-to file. You can do anything with the symlink that you can do on the original file. If you move the original file from its location, the symlink becomes 'dead'.
~/
  Downloads/
  Projects/
    Annotation/
    Rice/
      Sequences/
        alignment.sam
    Butterfly/
Symlinks
To create a symlink, move to the folder where the symlink must be created, and execute ln.
~/Projects $ cd Butterfly
~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam
Symlinks
The symlink is created. You can check with ls. To delete a symlink, use unlink.
~/Projects $ cd Butterfly
~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam
~/Butterfly $ ls -lh Link_to_alignment.sam
lrwxrwxrwx 1 joachim joachim 44 Oct 22 14:47 Link_to_alignment.sam -> ../Sequences/alignment.sam
~/
  Downloads/
  Projects/
    Annotation/
    Rice/
      Sequences/
        alignment.sam
    Butterfly/
      Link_to_alignment.sam
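The whole life cycle of a symlink (create, use, break, delete) can be replayed in a scratch copy of the folder layout above. This is a sketch: the directory tree is recreated under a temporary directory, and the content of alignment.sam is a placeholder.

```shell
# Rebuild the example folder layout in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p Projects/Rice/Sequences Projects/Butterfly
printf 'placeholder alignment\n' > Projects/Rice/Sequences/alignment.sam

# Move to the folder where the symlink must be created, then create it.
cd Projects/Butterfly
ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam

# The long listing shows the link and the location it points to.
ls -l Link_to_alignment.sam

# Reading through the symlink gives the content of the original file.
via_link=$(cat Link_to_alignment.sam)

# Moving the original file breaks the link: -e follows the link and finds nothing.
mv ../Rice/Sequences/alignment.sam ../Rice/alignment.sam
if [ -e Link_to_alignment.sam ]; then
    link_state=alive
else
    link_state=dead
fi
echo "$link_state"

# unlink removes the symlink itself; the moved original is untouched.
unlink Link_to_alignment.sam
```

Because the link stores a relative path, it is resolved from the folder the link lives in, not from the folder you happen to be in.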
Exercise
a little symlink exercise
Disks and storage
If you dive into bioinformatics, you will have to manage disks and storage. Two types of disks:
- solid state disks: low capacity, high speed, random writes
- spinning hard disks: high capacity, 'normal' speed, sequential writes
http://en.wikipedia.org/wiki/Solid-state_drive
http://en.wikipedia.org/wiki/Hard_disk
A disk is a device
Via the terminal, show the disks using
$ sudo fdisk -l
[sudo] password for joachim:
Disk /dev/sda: 13.4 GB, 13408141312 bytes
...
Disk /dev/sdb: 3997 MB, 3997171712 bytes
...
A disk is divided into partitions
A disk can be divided in parts, called partitions. An internal disk which runs an operating system is usually divided in partitions, one for each function. An external disk is usually not divided in partitions.
Check out the disk utility tool
The system disk
- Name of the disk
- Name of the currently highlighted partition
- Place in the directory structure where the partition can be accessed
An example of a USB disk
- Place in the directory structure where the partition can be accessed. The USB disk is 'mounted' automatically on the directory tree under /media.
- This is the type of file system on the partition. The partition is said to be formatted in FAT32 (in this case).
File system formats
By default, many USB flash disks are formatted in FAT32. Other types are NTFS, ext4, XFS.
FAT32: max 4GB files
NTFS: maximum portability (also for use under Windows)
ext4: default file system in Linux
http://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systems
An example of a USB disk
First unmount the device. Next, choose to format the device.
Format disks with disk utility
Choose the type of file system you want to be on that device.
You don't want to know all the commands that work behind the gnome-disk-utility for you. But if you do:
- mount
- umount
- fdisk
- mkfs
You can read the man pages and search for guides on the internet if you want to get to know these (out of scope for this course).
Checking storage space
By default: 'Disk Usage Analyzer'.
Bonus: K4DirStat. Not installed by default.
K4DirStat is a KDE package
Rehearsal: what is KDE?
Bonus: what happens when you install this package on our system?
Space left on disks with df
To check the storage that is used on the different disks.
~/ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        12G  5.3G  5.7G  49% /
udev            490M  4.0K  490M   1% /dev
tmpfs           200M  920K  199M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            498M   76K  498M   1% /run/shm
/dev/sdb1       3.8G   20M  3.7G   1% /media/test
~/ $ df -h .
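Giving df a path, as in `df -h .`, limits the report to the file system that holds that path. When a script needs a single number from that report, something along these lines could work. It is a sketch: the `-P` option (POSIX output format, one line per file system) and the awk column picking are choices made for this example, not part of the slides.

```shell
# Human-readable report for the file system that holds the current directory.
df -h .

# -P forces the portable layout: a header line, then one line per file system.
# Column 5 of the second line is the percentage of space in use.
use_pct=$(df -P . | awk 'NR == 2 { print $5 }')
echo "In use on this file system: $use_pct"
```

Checking the value before unpacking a large archive is a simple way to avoid filling a disk halfway through.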
The size of directories
To check the size of files or directories.
~/ $ du -sh *
520K  bin
281M  Compression_exercise
4.0K  Desktop
4.0K  Documents
5.0M  Downloads
4.0K  Music
4.0K  Pictures
4.0K  Public
373M  Rice Example
4.0K  Templates
4.0K  test
17M   test.img
114M  ugene-1.12.2
4.0K  Videos
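With many entries, it helps to sort the du output so the biggest directory stands out. The sketch below builds two made-up folders, `small` and `big`, and uses `-k` (sizes in kilobytes) instead of `-h` so that the numbers sort correctly.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two folders of very different size (hypothetical content).
mkdir small big
printf 'x' > small/a.txt
head -c 1048576 /dev/zero > big/b.bin    # 1 MiB of zero bytes

# s = one summary line per argument, h = human-readable sizes.
du -sh small big

# k = sizes in kilobytes, which sort numerically; the last line is the largest.
largest=$(du -sk small big | sort -n | tail -n 1 | cut -f 2)
echo "Largest entry: $largest"
```

The reported sizes are disk usage rather than file length, so small files usually show up as at least one file system block.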
Wildcards on the command line
Wildcards are used to describe the names of files/dirs.
[ ]  On that position, the character may be one of the characters between [ ], e.g. saniti[sz]ation matches: sanitisation and sanitization
?    On that position, any character is allowed, e.g. saniti?ation matches: sanitisation, sanitiration, ...
*    On that position, any length of string is allowed, e.g. s* matches: san, sdd, sanitisation, sam.alignment, ...
Many tools that require an argument to point to files or directories accept these wildcards.
~/ $ du -sh Do*
4.0K  Documents
20G   Downloads
~/ $ ls *.fastq
ERR148552_1.fastq  testout.fastq  ERR148552_1_prinseq_good_zzwD.fastq  ERR148552_2.fastq  test.fastq
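The three wildcards can be tested against a handful of empty files. This sketch reuses the names from the wildcard examples above; the files are created only for the demonstration, and `echo` is used because it simply prints whatever names the shell expands the pattern to.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Empty files with the names used in the wildcard examples.
touch sanitisation sanitization sanitiration san sdd test.fastq testout.fastq

# [sz]: exactly one character, which must be s or z.
bracket=$(echo saniti[sz]ation)
echo "$bracket"     # prints: sanitisation sanitization

# ?: exactly one character, whatever it is.
question=$(echo saniti?ation)
echo "$question"    # prints: sanitiration sanitisation sanitization

# *: any string, of any length.
star=$(echo *.fastq)
echo "$star"        # prints: test.fastq testout.fastq
```

The shell expands the pattern before the tool starts, which is why any tool that takes file names as arguments works with wildcards.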
Keywords
Compression, Archive, Symbolic link, mounting, File system format, partition, Recursively, df, du, unlink
Write in your own words what the terms mean
Break