使用Shell進行資料處理
Data processing with shell script
近幾個月應因工作的需要,大量的使用cmd來做一些簡單的資料處理。
因為這類型資料處理的工作大多是繁瑣卻不太會重複的內容或邏輯,很難有一套完整的商業邏輯,再加上回應的速度需要快一點,因此使用任何一種語言包裝這樣的邏輯成為一個工具反而有點不合成本。
shell所提供的一些文字處理相關commands便成為這樣情境下,算得上最稱手的一把錘子了。(為求方便,我在資料處理的過程中經常會使用.tsv檔案來處理,最後再搭配工具將.tsv轉換成一般使用者熟悉的excel file)
搭配MySQL:
首先使用SQL語法撈出資料,導出結果,再開始對欄位進行處理 ,
ex:
echo "SELECT * FROM users ..." | mysql --host=[host] --user=[usr] --password=[password] --default-character-set=utf8 [DB_NAME]
cat [sql file] | mysql --host=[host] --user=[usr] --password=[password] --default-character-set=utf8 [DB_NAME] | tee sql_result.tsv | head -n 10
搭配 PostgresSQL:
echo '[YOUR_SQL]'| PGPASSWORD=[YOUR_PASSWORD] psql -U [user] -h [host]-p [port]| head -n 10
下面列出,我在資料處理中經常會使用到的一些指令,跟一些搭配使用的情境:
Commands:
- > 與 >>
// 將畫面上顯示的結果導入tmp.txt
$ ls -l ... > tmp.txt// 將畫面上顯示的結果append在tmp.txt後
$ ls -l ... >> tmp.txt
- cat : 貓?!
$ cat [file]
$ cat -n [file] // 顯示行號
$ cat -s [file] // 輸出抑制多行空白// 使用 >> , > 搭配寫入, 以ctrl+d 跳離
$ cat > [file] // 覆蓋重寫or新開檔案
$ cat >> [file] // append
- | (pipe): 把資料當水流,pipe就是管線的概念。
$ cmd1 | cmd2 | cmd3 | ... | cmd...// ex.將很長的結果使用less分頁顯示
$ ls -al / | less
- find : 找指定目錄包含子目錄的檔案(隱藏資料夾也會找)。(https://www.computerhope.com/unix/ufind.htm)
$ find [搜尋的路徑目標] [執行的比對條件] [對結果進行的行為] [選項]$ find ~ // 找家目錄下所有的檔案及資料夾[執行的比對條件] :// 使用 -type 搜尋指定類型,f:檔案, l: synbol links, d: directory
// 找家目錄下所有的檔案
$ find ~ -type f// 加上name的過濾規則, 其他可用規則參考連結
$ find ~ -type f -name '*.png' // 用or
$ find ~ \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \)[對結果進行的行為] :// 對執行結果進行操作: -print, -delete, -quit, -ls
$ find ~ -type f -name '*.png' -ls// 也可以使用-exec, -ok... 來執行指令 => 用xargs比較好
$ find ~/Desktop -type f -name '*.png' -print | xargs ls -l[選項]: -depth, -maxdepth levels, -mindepth levels, -noleaf
- head, tail : 輸出檔案首10行(head) or 末10行(tail), 皆搭配 -n 可以輸出任意行數
$ ls -al /bin/usr | head // 顯示首10行
$ ls -al /bin/usr | head -n 5 // 顯示首5行
$ ls -al /bin/usr | tail // 顯示末10行
$ ls -al /bin/usr | tail -n 5 // 顯示末5行
- wc (word count) : 顯示檔案內的行列, 字元數
$wc ~/Desktop/event.js
300 757 8813 /Users/wangwilliam/Desktop/event.js
(行數) (字數) (位元數) (檔案位置) $ wc -l [file] // 只顯示行數 // 顯示某資料夾下的檔案數量
$ ls -1 [path] | wc -l
- tee (T): 把pipe in 的東西可以寫出的同時也往下繼續pipe out,像是T型水管。
// 讀取來自 ls 的stdin, 並輸出至 stdout 及 tmp.txt 檔案
$ ls -al /usr/bin | tee ~/Desktop/tmp.txt | less
- grep (global regular expression print) : 對文字進行搜索後輸出。
$ grep [options]options:
// -i : 忽略大小寫
// -v : 找出不匹配的
// -c : 匹配成功數目
// -n : 加上行號
// ...
// -E : 使用regex表達的pattern
$ ls -al | grep -E '^[a-zA-Zl]{3}'
$ ls -al ~ | sort -nr -k 5 // 第5行為檔案大小$ sort -k 3.7 // col:3, char: 7// -n 按數值大小排序.
// -r 倒轉排序, 即從大至小排序.
// -t 指定欄位分隔的字符, 預設是 Tab.
// -k 要排序的欄位.
- uniq : 因為uniq只能消除鄰近的行列,所以大部分都會先sort過再進行uniq
$ cat test.tsv
William Wang
Wang
Will
William
Wang
Wang
William
William Wang$ cat test.tsv | sort
Wang
Wang
Wang
Will
William
William
William Wang
William Wang$ cat test.tsv | sort | uniq
Wang
Will
William
William Wang
- cut :
(iris.data.tsv : https://github.com/WJWang/Logistic/blob/master/iris.data.tsv)
$ cat iris.data.tsv| head
SepalLength SepalWidth PetalLength PetalWidth TrainingClass
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa$ cut -f 1,3 iris.data.tsv | head // 只取1,3欄
SepalLength PetalLength
5.1 1.4
4.9 1.4
4.7 1.3
4.6 1.5
5.0 1.4
5.4 1.7
4.6 1.4
5.0 1.5
4.4 1.4$ cut -d ':' [filepath]-f 1,3 // 改使用:作為切分點,找1,3欄
// ex. 用:切出/etc/passwd的首欄
$ cut -d ':' -f 1 /etc/passwd // 用file
$ cat /etc/passwd | cut -d ':' -f 1 // 用cat再pipe in$ ls -l | cut -c 1-10
total 3824
drwxr-xr-x
drwxr-xr-x
drwxr-xr-x
drwxr-xr-x
-r--------
- paste
// 將iris.data.tsv 的1,5欄對調
$ cut -f 1 iris.data.tsv > iris.data.first.tsv
$ cut -f 5 iris.data.tsv > iris.data.fifth.tsv
$ cut -f 2,3,4 iris.data.tsv > iris.data.else.tsv$ paste iris.data.fifth.tsv iris.data.else.tsv iris.data.first.tsv
TrainingClass SepalWidth PetalLength PetalWidth SepalLength
Iris-setosa 3.5 1.4 0.2 5.1
Iris-setosa 3.0 1.4 0.2 4.9
Iris-setosa 3.2 1.3 0.2 4.7
Iris-setosa 3.1 1.5 0.2 4.6
Iris-setosa 3.6 1.4 0.2 5.0
Iris-setosa 3.9 1.7 0.4 5.4
Iris-setosa 3.4 1.4 0.3 4.6
Iris-setosa 3.4 1.5 0.2 5.0
Iris-setosa 2.9 1.4 0.2 4.4
Iris-setosa 3.1 1.5 0.1 4.9
Iris-setosa 3.7 1.5 0.2 5.4
Iris-setosa 3.4 1.6 0.2 4.8
Iris-setosa 3.0 1.4 0.1 4.8
...
...
- comm: 比較A, B兩個文檔,以三行方式輸出,有-1, -2, -3三個選項
$ cat A.txt
1
2
3
4
5$ cat B.txt
1
2
3
4
6
7$ comm A.txt B.txt
1
2
3
4
5
6
7
(只有A有的) (只有B有的) (都有的)// comm -1 A.txt B.txt (只顯示2,3行)
// comm -2 A.txt B.txt (只顯示1,3行)
// comm -3 A.txt B.txt (只顯示1,2行)
- diff, patch :檔案間的diff, 產生patch.
// diff
- Compare files:
diff file1 file2- Compare files, ignoring white spaces:
diff -w file1 file2- Compare files, showing differences side by side:
diff -y file1 file2- Compare directories recursively:
diff -r directory1 directory2- Compare directories, only showing the names of files that differ:
diff -rq directory1 directory2// patch - Apply a patch:
patch < patch_file.diff- Apply a patch to current directory:
patch -p1 < patch_file.diff- Apply the reverse of a patch:
patch -R < patch_file.diff
- split : 可以把大的檔案切分。
- Split a file, each split having 10 lines (except the last split):
split -l 10 filename- Split a file into 5 files. File is split such that each split has same size (except the last split):
split -n 5 filename- Split a file with at most 512 bytes of lines in each split:
split -C 512 filename
- tr: Translate characters
- Replace all occurrences of a character in a file, and print the result:
tr find_characters replace_characters < filename- Map each character of the first set to the corresponding character of the second set:
tr 'abcd' 'jkmn' < filename- Delete all occurrences of the specified set of characters from the input:
tr -d 'input_characters'- Compress a series of identical characters to a single character:
tr -s 'input_characters'- Translate the contents of the file to upper-case and print result:
tr "[:lower:]" "[:upper:]" < filename- Strip out non-printable characters from the file and print result:
tr -cd "[:print:]" < filename
- sed: 用正則進行替換
- Replace the first occurrence of a string in a file, and print the result:
$ sed 's/find/replace/' filename- Replace all occurrences of an extended regular expression in a file:
$ sed -E 's/regex/replace/g' filename- Replace all occurrences of a string in a file, overwriting the file (i.e. in-place):
$ sed -i '' 's/find/replace/g' filename- Replace only on lines matching the line pattern:
$ sed '/line_pattern/s/find/replace/' filename- Apply multiple find-replace expressions to a file:
$ sed -e 's/find/replace/' -e 's/find/replace/' filename- Replace separator / by any other character not used in the find or replace patterns, e.g., #:
$ sed 's#find#replace#' filename
- aspell
- awk
- Print the fifth column (a.k.a. field) in a space-separated file:
$ awk '{print $5}' filename- Print the second column of the lines containing "something" in a space-separated file:
$ awk '/something/ {print $2}' filename- Print the last column of each line in a file, using a comma (instead of space) as a field separator:
$ awk -F ',' '{print $NF}' filename- Sum the values in the first column of a file and print the total:
$ awk '{s+=$1} END {print s}' filename- Sum the values in the first column and pretty-print the values and then the total:$ awk '{s+=$1; print $1} END {print "--------"; print s}' filename
- xargs
- Main usage pattern:
$ arguments_source | xargs command- Delete all files with a .backup extension:
$ find . -name '*.backup' | xargs rm -v- Convert newlines in the input into NUL (\0) characters, and split on those only (useful if the input to xargs contains spaces):
$ arguments_source | tr '\n' '\0' | xargs -0 command- Execute the command once for each input line, replacing any occurrences of the placeholder (here marked as _) with the input line:
$ arguments_source | xargs -I _ command _ optional_extra_arguments$ arguments_source | xargs -P 10 ..... // -P 表示 maxprocs- Example:
$ mkdir ~/tmp_work$ split -l 6000 ~/big_tsv_file.tsv ~/tmp_work/// ls ~/tmp_work// => aa ab ac ad ae af ag ah ai aj ak$ ls ~/tmp_work | xargs -P 11 -I %X bash -c "cat %X | php ~/processor.php" > tmp.txt
- rsync:
// Transfer file from local to remote host:
$ rsync path/to/file remote_host_name:remote_host_location// Transfer file from remote host to local:
$ rsync remote_host_name:remote_file_location local_file_location// Transfer file in archive (to preserve attributes) and compressed (zipped) mode:
$ rsync -az path/to/file remote_host_name:remote_host_location// Transfer a directory and all its children from a remote to local:
$ rsync -r remote_host_name:remote_folder_location local_folder_location// Transfer only updated files from remote host:
$ rsync -ru remote_host_name:remote_folder_location local_folder_location// Transfer file over SSH and show progress:
$ rsync -e ssh — progress remote_host_name:remote_file local_file
- ssh: 連線用,連遠端server.
// -p 不帶 default 22 port
$ ssh [user]@[server] -p [ssh port]// 使用金鑰登入
$ ssh [user]@[server] -i [key_file]// 直接在遠端執行指令
$ ssh [user]@[server] [要執行的cmd]
$ ssh william@remote-server 'ls -al /'// 執行遠端的sql將結果顯示在本機上
$ ssh william@remote-server 'echo "SELECT * FROM users;" | mysql mysql --host=[host] --user=[usr] --password=[password] --default-character-set=utf8 [DB_NAME]'
- scp (secure copy): 把檔案or資料夾,在遠端及本機間丟來丟去。
// 把檔案從本機丟去遠端
$ scp -i [key_file] [local file path] [remote_user]@[remote_server]:[remote_target_path]// 把檔案從遠端拿回本機
$ scp -i [key_file] [remote_user]@[remote_server]:[remote_target_path] [local file path]// -r 表示recursive(包含資料夾), -P 表示ssh port
$ scp -r -P 22000 [local file path] [remote_user]@[remote_server]:[remote_target_path]
// Copy a local file to a remote host:
$ scp path/to/local_file remote_host:path/to/remote_file// Copy a file from a remote host to a local folder:
$ scp remote_host:path/to/remote_file path/to/local_dir// Recursively copy the contents of a directory from a remote host to a local directory:
$ scp -r remote_host:path/to/remote_dir path/to/local_dir// Copy a file between two remote hosts transferring through the local host:
$ scp -3 host1:path/to/remote_file host2:path/to/remote_dir// Use a specific username when connecting to the remote host:
$ scp path/to/local_file remote_username@remote_host:path/to/remote_dir// Use a specific ssh private key for authentication with the remote host:
$ scp -i ~/.ssh/private_key local_file remote_host:/path/remote_file
- stat : 取得檔案狀態與詳細資訊
// example on MacOS
$ stat -x fetch.php
File: "fetch.php"
Size: 4846 FileType: Regular File
Mode: (0644/-rw-r--r--) Uid: ( 501/wangwilliam) Gid: ( 20/ staff)
Device: 1,4 Inode: 8595590042 Links: 1
Access: Sun Sep 9 11:43:30 2018
Modify: Tue Jan 16 11:21:23 2018
Change: Tue Jan 16 11:21:23 2018
https://blog.path8.net/archives/6572.html
https://unix.stackexchange.com/questions/175325/output-of-stat-on-osx
當然,有些比較固定的處理邏輯,把這樣的連續操作的指令給記錄下來就可以下次使用,這時候
- history 搭配 ![history中所顯示的行數] 會是一個很方便的方法
- ctrl + r [keyword] , 找下一個: ctrl+r, 上一個: ctrl+shift+r
當然也可以寫出一個串再一起寫成一份script file 更快速達到資料處理的目的,這部分就等有時間再來整理了。