Clickhouse tutorial
ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries.
参照官网教程:https://clickhouse.yandex/tutorial.html
单节点,测试数据为 1987到2015 美国民用航班数据,1.66亿rows
- single server
- use USA civil flights data since 1987 till 2015
- contains 166 millions rows, 63 Gb of uncompressed data
我的测试基于 centos6.9
其他系统安装见:official Docker images of ClickHouse
安装包下载
https://packagecloud.io/altinity/clickhouse
- clickhouse-server-common
- clickhouse-server
- clickhouse-debuginfo
- clickhouse-compressor
根据操作系统版本 在el / 6 和 el / 7挑选下载,依次上述排列顺序安装;
安装server包时,缺少依赖
1 | yum install -y openssl |
clickhouse-server启动
1 | sudo service clickhouse-server start |
clickhouse-client 命令
Interactive mode:
1 | clickhouse-client |
Enable multiline queries:
1 | clickhouse-client -m |
Run queries in batch-mode:
1 | clickhouse-client --query='SELECT 1' |
Insert data from file of a specified format:
1 | clickhouse-client --query='INSERT INTO table VALUES' < data.txt |
sample dataset
https://clickhouse.yandex/tutorial.html
创建表
1 | $ clickhouse-client --multiline |
下载数据(3G+,1.6亿条数据):
https://yadi.sk/d/pOZxpa42sDdgm
load data:
xz -v -c -d < ontime.csv.xz | clickhouse-client --query="INSERT INTO ontime FORMAT CSV"
如果只下载部分测试数据
参考 https://clickhouse.yandex/docs/en/getting_started/example_datasets/ontime.html
1 | for s in `seq 1987 2017` |
(from https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh )
1 | for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --query="INSERT INTO ontime FORMAT CSVWithNames"; done |
测试QUERY
- count 数目:
1 | select count(*) from ontime; |
- the most popular destinations in 2015;
1 | SELECT |
- cities of departure with most frequent delays for 1 hour or longer;
1 | SELECT OriginCityName, count() AS c, avg(DepDelay > 60) AS delays |