使用Shell進行資料處理

Data processing with shell script

Wang William (WJWang)

23 min readDec 6, 2017

近幾個月應因工作的需要，大量的使用cmd來做一些簡單的資料處理。

因為這類型資料處理的工作大多是繁瑣卻不太會重複的內容或邏輯，很難有一套完整的商業邏輯，再加上回應的速度需要快一點，因此使用任何一種語言包裝這樣的邏輯成為一個工具反而有點不合成本。

shell所提供的一些文字處理相關commands便成為這樣情境下，算得上最稱手的一把錘子了。(為求方便，我在資料處理的過程中經常會使用.tsv檔案來處理，最後再搭配工具將.tsv轉換成一般使用者熟悉的excel file)

搭配MySQL：

首先使用SQL語法撈出資料，導出結果，再開始對欄位進行處理，

ex:

echo "SELECT * FROM users ..." | mysql --host=[host] --user=[usr] --password=[password] --default-character-set=utf8 [DB_NAME]

cat [sql file] | mysql --host=[host] --user=[usr] --password=[password] --default-character-set=utf8 [DB_NAME] | tee sql_result.tsv | head -n 10

搭配 PostgresSQL:

echo '[YOUR_SQL]'| PGPASSWORD=[YOUR_PASSWORD] psql -U [user] -h [host]-p [port]| head -n 10

下面列出，我在資料處理中經常會使用到的一些指令，跟一些搭配使用的情境：

Commands:

> 與 >>

// 將畫面上顯示的結果導入tmp.txt
$ ls -l ... > tmp.txt// 將畫面上顯示的結果append在tmp.txt後
$ ls -l ... >> tmp.txt

cat : 貓?!

$ cat [file]
$ cat -n [file] // 顯示行號
$ cat -s [file] // 輸出抑制多行空白// 使用 >> , > 搭配寫入, 以ctrl+d 跳離
$ cat > [file] // 覆蓋重寫or新開檔案
$ cat >> [file] // append

| (pipe): 把資料當水流，pipe就是管線的概念。

$ cmd1 | cmd2 | cmd3 | ... | cmd...// ex.將很長的結果使用less分頁顯示
$ ls -al / | less

find : 找指定目錄包含子目錄的檔案(隱藏資料夾也會找)。(https://www.computerhope.com/unix/ufind.htm)

$ find [搜尋的路徑目標] [執行的比對條件] [對結果進行的行為] [選項]$ find ~ // 找家目錄下所有的檔案及資料夾[執行的比對條件] :// 使用 -type 搜尋指定類型，f:檔案, l: synbol links, d: directory 
// 找家目錄下所有的檔案
$ find ~ -type f// 加上name的過濾規則, 其他可用規則參考連結
$ find ~ -type f -name '*.png' // 用or
$ find ~ \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \)[對結果進行的行為] :// 對執行結果進行操作: -print, -delete, -quit, -ls
$ find ~ -type f -name '*.png' -ls// 也可以使用-exec, -ok... 來執行指令 => 用xargs比較好
$ find ~/Desktop -type f -name '*.png' -print | xargs ls -l[選項]: -depth, -maxdepth levels, -mindepth levels, -noleaf

head, tail : 輸出檔案首10行(head) or 末10行(tail), 皆搭配 -n 可以輸出任意行數

$ ls -al /bin/usr | head // 顯示首10行
$ ls -al /bin/usr | head -n 5 // 顯示首5行
$ ls -al /bin/usr | tail // 顯示末10行
$ ls -al /bin/usr | tail -n 5 // 顯示末5行

wc (word count) : 顯示檔案內的行列, 字元數

$wc ~/Desktop/event.js 
300     757    8813    /Users/wangwilliam/Desktop/event.js
(行數)  (字數)  (位元數)  (檔案位置) $ wc -l [file] // 只顯示行數 // 顯示某資料夾下的檔案數量
$ ls -1 [path] | wc -l

tee (T): 把pipe in 的東西可以寫出的同時也往下繼續pipe out，像是T型水管。

// 讀取來自 ls 的stdin, 並輸出至 stdout 及 tmp.txt 檔案
$ ls -al /usr/bin | tee ~/Desktop/tmp.txt | less

grep (global regular expression print) : 對文字進行搜索後輸出。

$ grep [options]options:
// -i : 忽略大小寫
// -v : 找出不匹配的
// -c : 匹配成功數目
// -n : 加上行號
// ...
// -E : 使用regex表達的pattern
$ ls -al | grep -E '^[a-zA-Zl]{3}'

sort : 排序 (https://www.phpini.com/linux/sort-command-sort-data)

$ ls -al ~ | sort -nr -k 5 // 第5行為檔案大小$ sort -k 3.7 //  col:3, char: 7// -n 按數值大小排序.
// -r 倒轉排序, 即從大至小排序.
// -t 指定欄位分隔的字符, 預設是 Tab.
// -k 要排序的欄位.

uniq : 因為uniq只能消除鄰近的行列，所以大部分都會先sort過再進行uniq

$ cat test.tsv
William Wang
Wang
Will
William
Wang
Wang
William
William Wang$ cat test.tsv  | sort
Wang
Wang
Wang
Will
William
William
William Wang
William Wang$ cat test.tsv  | sort | uniq
Wang
Will
William
William Wang

cut :

(iris.data.tsv : https://github.com/WJWang/Logistic/blob/master/iris.data.tsv)

$ cat iris.data.tsv| head
SepalLength SepalWidth PetalLength PetalWidth TrainingClass
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa$ cut -f 1,3 iris.data.tsv | head // 只取1,3欄
SepalLength PetalLength
5.1 1.4
4.9 1.4
4.7 1.3
4.6 1.5
5.0 1.4
5.4 1.7
4.6 1.4
5.0 1.5
4.4 1.4$ cut -d ':' [filepath]-f 1,3 // 改使用:作為切分點,找1,3欄
// ex. 用:切出/etc/passwd的首欄
$ cut -d ':' -f 1 /etc/passwd // 用file
$ cat /etc/passwd | cut -d ':' -f 1 // 用cat再pipe in$ ls -l | cut -c 1-10 
total 3824
drwxr-xr-x
drwxr-xr-x
drwxr-xr-x
drwxr-xr-x
-r--------

paste

// 將iris.data.tsv 的1,5欄對調
$ cut -f 1 iris.data.tsv > iris.data.first.tsv
$ cut -f 5 iris.data.tsv > iris.data.fifth.tsv
$ cut -f 2,3,4 iris.data.tsv > iris.data.else.tsv$ paste iris.data.fifth.tsv iris.data.else.tsv iris.data.first.tsv
TrainingClass SepalWidth PetalLength PetalWidth SepalLength
Iris-setosa 3.5 1.4 0.2 5.1
Iris-setosa 3.0 1.4 0.2 4.9
Iris-setosa 3.2 1.3 0.2 4.7
Iris-setosa 3.1 1.5 0.2 4.6
Iris-setosa 3.6 1.4 0.2 5.0
Iris-setosa 3.9 1.7 0.4 5.4
Iris-setosa 3.4 1.4 0.3 4.6
Iris-setosa 3.4 1.5 0.2 5.0
Iris-setosa 2.9 1.4 0.2 4.4
Iris-setosa 3.1 1.5 0.1 4.9
Iris-setosa 3.7 1.5 0.2 5.4
Iris-setosa 3.4 1.6 0.2 4.8
Iris-setosa 3.0 1.4 0.1 4.8
...
...

comm: 比較A, B兩個文檔，以三行方式輸出，有-1, -2, -3三個選項

$ cat A.txt
1
2
3
4
5$ cat B.txt
1
2
3
4
6
7$ comm A.txt B.txt
                         1
                         2
                         3
                         4
    5
               6
               7
(只有A有的) (只有B有的) (都有的)// comm -1 A.txt B.txt (只顯示2,3行)
// comm -2 A.txt B.txt (只顯示1,3行)
// comm -3 A.txt B.txt (只顯示1,2行)

Linux 使用 comm 找出兩個檔案不同的內容 - Tsung's Blog

想要比對兩個檔案哪些資料是不同的(非 diff，相同資料就不需要再出現)，Linux Bash shell 有方便的工具可以快速達成。 Linux 使用 comm 找出兩個檔案不同的內容要比對出兩個檔案不同的資料，可以使用 awk …

blog.longwin.com.tw

diff, patch ：檔案間的diff, 產生patch.

// diff
- Compare files:
    diff file1 file2- Compare files, ignoring white spaces:
    diff -w file1 file2- Compare files, showing differences side by side:
    diff -y file1 file2- Compare directories recursively:
    diff -r directory1 directory2- Compare directories, only showing the names of files that differ:
    diff -rq directory1 directory2// patch  - Apply a patch:
    patch < patch_file.diff- Apply a patch to current directory:
    patch -p1 < patch_file.diff- Apply the reverse of a patch:
    patch -R < patch_file.diff

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

Linux中 patch 的用法

tmp/t1 的檔案內容 /tmp/t2 的檔案內容 --- /tmp/t1 2007-06-28 22:08:33.000000000 +0800 (舊的檔案)+++ /tmp/t2 2007-06-28 22:09:16…

blog.xuite.net

split : 可以把大的檔案切分。

- Split a file, each split having 10 lines (except the last split):
    split -l 10 filename- Split a file into 5 files. File is split such that each split has same size (except the last split):
    split -n 5 filename- Split a file with at most 512 bytes of lines in each split:
    split -C 512 filename

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

Linux 使用 split 指令將大檔案切割成小檔案，方便網路傳輸 - G. T. Wang

這裡介紹如何在 Linux 中使用 split 指令將大檔案切割成檔案，方便網路傳輸或各種儲存媒體使用。在 Linux 中若要進行檔案備份或是網路傳輸，有時候會使用將多個檔案壓縮起來，變成一個壓縮檔後才進行後續的動作，而有時候一個…

blog.gtwang.org

tr: Translate characters

- Replace all occurrences of a character in a file, and print the result:
    tr find_characters replace_characters < filename- Map each character of the first set to the corresponding character of the second set:
    tr 'abcd' 'jkmn' < filename- Delete all occurrences of the specified set of characters from the input:
    tr -d 'input_characters'- Compress a series of identical characters to a single character:
    tr -s 'input_characters'- Translate the contents of the file to upper-case and print result:
    tr "[:lower:]" "[:upper:]" < filename- Strip out non-printable characters from the file and print result:
    tr -cd "[:print:]" < filename

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

sed: 用正則進行替換

- Replace the first occurrence of a string in a file, and print the result:
$ sed 's/find/replace/' filename- Replace all occurrences of an extended regular expression in a file:
$ sed -E 's/regex/replace/g' filename- Replace all occurrences of a string in a file, overwriting the file (i.e. in-place):
$ sed -i '' 's/find/replace/g' filename- Replace only on lines matching the line pattern:
$ sed '/line_pattern/s/find/replace/' filename- Apply multiple find-replace expressions to a file:
$ sed -e 's/find/replace/' -e 's/find/replace/' filename- Replace separator / by any other character not used in the find or replace patterns, e.g., #:
$ sed 's#find#replace#' filename

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

aspell
awk

- Print the fifth column (a.k.a. field) in a space-separated file:
$ awk '{print $5}' filename- Print the second column of the lines containing "something" in a space-separated file:
$ awk '/something/ {print $2}' filename- Print the last column of each line in a file, using a comma (instead of space) as a field separator:
$ awk -F ',' '{print $NF}' filename- Sum the values in the first column of a file and print the total:
$ awk '{s+=$1} END {print s}' filename- Sum the values in the first column and pretty-print the values and then the total:$ awk '{s+=$1; print $1} END {print "--------"; print s}' filename

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

xargs

- Main usage pattern:
$ arguments_source | xargs command- Delete all files with a .backup extension:
$ find . -name '*.backup' | xargs rm -v- Convert newlines in the input into NUL (\0) characters, and split on those only (useful if the input to xargs contains spaces):
$ arguments_source | tr '\n' '\0' | xargs -0 command- Execute the command once for each input line, replacing any occurrences of the placeholder (here marked as _) with the input line:
$ arguments_source | xargs -I _ command _ optional_extra_arguments$ arguments_source | xargs -P 10 ..... // -P 表示 maxprocs- Example:
$ mkdir ~/tmp_work$ split -l 6000 ~/big_tsv_file.tsv ~/tmp_work/// ls ~/tmp_work// => aa      ab      ac      ad      ae      af      ag      ah      ai      aj      ak$ ls ~/tmp_work | xargs -P 11 -I %X bash -c "cat %X | php ~/processor.php" > tmp.txt

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

xargs

利用 -n 參數決定一行輸出幾個 Item echo a b c d e f| xargs -n 3 # grep 出目前目錄底下所有 PHP 檔案有使用到 SongAll Model 的檔案 find . -name "*.php" |…

blog.johnsonlu.org

rsync:

// Transfer file from local to remote host:
$ rsync path/to/file remote_host_name:remote_host_location// Transfer file from remote host to local:
$ rsync remote_host_name:remote_file_location local_file_location// Transfer file in archive (to preserve attributes) and compressed (zipped) mode:
$ rsync -az path/to/file remote_host_name:remote_host_location// Transfer a directory and all its children from a remote to local:
$ rsync -r remote_host_name:remote_folder_location local_folder_location// Transfer only updated files from remote host:
$ rsync -ru remote_host_name:remote_folder_location local_folder_location// Transfer file over SSH and show progress:
$ rsync -e ssh — progress remote_host_name:remote_file local_file

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

ssh: 連線用，連遠端server.

// -p 不帶 default 22 port
$ ssh [user]@[server] -p [ssh port]// 使用金鑰登入
$ ssh [user]@[server] -i [key_file]// 直接在遠端執行指令
$ ssh [user]@[server] [要執行的cmd]
$ ssh william@remote-server 'ls -al /'// 執行遠端的sql將結果顯示在本機上
$ ssh william@remote-server 'echo "SELECT * FROM users;" | mysql mysql --host=[host] --user=[usr] --password=[password] --default-character-set=utf8 [DB_NAME]'

scp (secure copy): 把檔案or資料夾，在遠端及本機間丟來丟去。

// 把檔案從本機丟去遠端
$ scp -i [key_file] [local file path] [remote_user]@[remote_server]:[remote_target_path]// 把檔案從遠端拿回本機
$ scp -i [key_file] [remote_user]@[remote_server]:[remote_target_path] [local file path]// -r 表示recursive(包含資料夾), -P 表示ssh port
$ scp -r -P 22000 [local file path] [remote_user]@[remote_server]:[remote_target_path]
// Copy a local file to a remote host:
$ scp path/to/local_file remote_host:path/to/remote_file// Copy a file from a remote host to a local folder:
$ scp remote_host:path/to/remote_file path/to/local_dir// Recursively copy the contents of a directory from a remote host to a local directory:
$ scp -r remote_host:path/to/remote_dir path/to/local_dir// Copy a file between two remote hosts transferring through the local host:
$ scp -3 host1:path/to/remote_file host2:path/to/remote_dir// Use a specific username when connecting to the remote host:
$ scp path/to/local_file remote_username@remote_host:path/to/remote_dir// Use a specific ssh private key for authentication with the remote host:
$ scp -i ~/.ssh/private_key local_file remote_host:/path/remote_file

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

github.com

22 SSH Examples, Practical Tips & Tunnels | HackerTarget.com

Practical SSH examples to take your remote system admin game to the next level. Commands and tips to not only use SSH…

hackertarget.com

stat : 取得檔案狀態與詳細資訊

// example on MacOS
$ stat -x fetch.php
  File: "fetch.php"
  Size: 4846         FileType: Regular File
  Mode: (0644/-rw-r--r--)         Uid: (  501/wangwilliam)  Gid: (   20/   staff)
Device: 1,4   Inode: 8595590042    Links: 1
Access: Sun Sep  9 11:43:30 2018
Modify: Tue Jan 16 11:21:23 2018
Change: Tue Jan 16 11:21:23 2018

https://blog.path8.net/archives/6572.html

https://unix.stackexchange.com/questions/175325/output-of-stat-on-osx

當然，有些比較固定的處理邏輯，把這樣的連續操作的指令給記錄下來就可以下次使用，這時候

history 搭配 ![history中所顯示的行數] 會是一個很方便的方法
ctrl + r [keyword] , 找下一個: ctrl+r, 上一個: ctrl+shift+r

當然也可以寫出一個串再一起寫成一份script file 更快速達到資料處理的目的，這部分就等有時間再來整理了。

使用Shell進行資料處理

Data processing with shell script

搭配MySQL：

搭配 PostgresSQL:

Commands:

Linux 使用 comm 找出兩個檔案不同的內容 - Tsung's Blog

想要比對兩個檔案哪些資料是不同的(非 diff，相同資料就不需要再出現)，Linux Bash shell 有方便的工具可以快速達成。 Linux 使用 comm 找出兩個檔案不同的內容要比對出兩個檔案不同的資料，可以使用 awk …

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

Linux中 patch 的用法

tmp/t1 的檔案內容 /tmp/t2 的檔案內容 --- /tmp/t1 2007-06-28 22:08:33.000000000 +0800 (舊的檔案)+++ /tmp/t2 2007-06-28 22:09:16…

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

Linux 使用 split 指令將大檔案切割成小檔案，方便網路傳輸 - G. T. Wang

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

xargs

利用 -n 參數決定一行輸出幾個 Item echo a b c d e f| xargs -n 3 # grep 出目前目錄底下所有 PHP 檔案有使用到 SongAll Model 的檔案 find . -name "*.php" |…

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

22 SSH Examples, Practical Tips & Tunnels | HackerTarget.com

Practical SSH examples to take your remote system admin game to the next level. Commands and tips to not only use SSH…

Written by Wang William (WJWang)

No responses yet

使用Shell進行資料處理

Data processing with shell script

搭配MySQL：

搭配 PostgresSQL:

Commands:

Linux 使用 comm 找出兩個檔案不同的內容 - Tsung's Blog

想要比對兩個檔案哪些資料是不同的(非 diff，相同資料就不需要再出現)，Linux Bash shell 有方便的工具可以快速達成。 Linux 使用 comm 找出兩個檔案不同的內容 要比對出兩個檔案不同的資料，可以使用 awk …

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

Linux中 patch 的用法

tmp/t1 的檔案內容 /tmp/t2 的檔案內容 --- /tmp/t1 2007-06-28 22:08:33.000000000 +0800 (舊的檔案)+++ /tmp/t2 2007-06-28 22:09:16…

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

Linux 使用 split 指令將大檔案切割成小檔案，方便網路傳輸 - G. T. Wang

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

xargs

利用 -n 參數決定一行輸出幾個 Item echo a b c d e f| xargs -n 3 # grep 出目前目錄底下所有 PHP 檔案有使用到 SongAll Model 的檔案 find . -name "*.php" |…

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

tldr-pages/tldr

tldr - :books: Simplified and community-driven man pages

22 SSH Examples, Practical Tips & Tunnels | HackerTarget.com

Practical SSH examples to take your remote system admin game to the next level. Commands and tips to not only use SSH…

Written by Wang William (WJWang)

No responses yet

想要比對兩個檔案哪些資料是不同的(非 diff，相同資料就不需要再出現)，Linux Bash shell 有方便的工具可以快速達成。 Linux 使用 comm 找出兩個檔案不同的內容要比對出兩個檔案不同的資料，可以使用 awk …