Skip to content

Instantly share code, notes, and snippets.

aasdasd

When you a clickhouse cluster and you run queries on all the replicas it's not easy to get all the queries ran. I use system.query_log all the time to check timings, errors and so on.

So what I do is create a global query_log:

:) create view query_log_all on cluster my_cluster as select * from remote('10.0.0.1,10.0.0.2', 'system.query_log')

So I can inspect queries in all the replicas with a single query:

Clickhouse has a pretty good endpoint /replicas_status which gives information about the, guess what, replication status. When you are working on a cluster in which you use replication to increase the amount of QPS you usually have a load balancer before, something like this:

                                 +--------------+
                                 |              |
                       +-------->+  clickhouse  |
                       |         |              |
                       |         +--------------+
                       |

Sometimes you have to move data from one table to a different one. You usually use

insert into target select * from source

This works but have several problems:

  1. materialized columns are not properly copied
  2. it's slow

Looks like filters are pushed down when filtering an "UNION ALL". Also an example on how to use EXPLAIN in clickhouse and a different view of seeing what is going on with the traces, this lines show how much data clickhouse is reading:

Selected 1 parts by date, 1 parts by key, 2 marks by primary key, 2 marks to read from 1 ranges
Reading approx. 16384 rows with 1 streams

The example

@javisantana
javisantana / til_dynamic_joinget.md
Last active November 6, 2020 06:18
how to generate a map data structure dynamically in clickhouse

Clickhouse has a powerful feature, JOIN engines, that allows to prepare a table to be joined with better performance that a regular table (MergeTree, Log...). It also allows to use joinGet to get table values using a key.

Somtimes you don't have a JOIN table but you'd like to use something with the joinGet performance. Unfortunately you can't use joinGet with something created on the fly (well, you could create a temporally join table but you need several SQL queries).

So there is a way to do that, using transform:

with (
  select (groupArray(key), groupArray(value)) from my_table
) as key_value
@javisantana
javisantana / summing_merge_tree.sql
Created January 17, 2020 14:03
SummingMergetree multiple keys in nested maps
MacBook-Pro-de-javi.local :) create table multiple_keys (tmp Int32, testMap Nested (a Int32, bKey Int32, value Int32)) Engine=SummingMergeTree() order by (tmp);
CREATE TABLE multiple_keys
(
`tmp` Int32,
`testMap` Nested(
a Int32,
bKey Int32,
value Int32)
)
@javisantana
javisantana / compile clickhouse on mojave.md
Last active March 11, 2020 05:45
compile clickhouse on mojave

clickhouse on osx

These are the steps I followed to compile clickhouse on OSX mojave (10.4.3). These might not be the best way to make it compile on OSX, my knowledge about C++ and build tooling is really limited but they do the work.

clickhouse sha-1: 2ad4df1d6a

compile command (apply the patch below before running cmake)

@javisantana
javisantana / res.sql
Last active June 20, 2019 07:24
filter ourside the join vs inside the join in clickhouse
MacBook-Pro-de-javi.local :) select cityHash64(groupArray(cityHash64(*))) from A a asof inner join (select * from B where ts<toDateTime('1970-01-01 02:00:00')) b on a.id=b.id and a.ts=b.ts where a.ts<toDateTime('1970-01-01 02:00:00');
SELECT cityHash64(groupArray(cityHash64(*)))
FROM A AS a
ASOF INNER JOIN
(
SELECT *
FROM B
WHERE ts < toDateTime('1970-01-01 02:00:00')
) AS b ON (a.id = b.id) AND (a.ts = b.ts)