Neat features of ClickHouse more people need to know about

DATE POSTED: May 22, 2024

As the popularity of ClickHouse has increased over the years, more and more people know about its core features, like its incredible performance, its high compression ratio, or its ability to read and write nearly all data formats there are.

But ClickHouse also has a lot of hidden gems that can help in your day-to-day work, which many people don't know about. Even though most of them are well documented in the official documentation, you will not find them if you don't know what to look for.

In this blog post, I’ll highlight some of my favorite small features that more people should know about.

Select modifiers

In a column-based DBMS like ClickHouse, queries like SELECT * FROM table should generally be avoided. At least that's true for regular queries issued by your application. The typical day-to-day work of a DBA or database developer, however, often includes these kinds of queries, as typing out all X columns of a table manually would take a huge amount of time.

But what if you want to run a lot of operations on some columns? Imagine you have the following table:

CREATE TABLE customer
(
    customerId UInt64,
    custom_num_1 UInt64,
    custom_num_2 UInt64,
    ...
    custom_num_50 UInt64,
    custom_string_1 String,
    custom_string_2 String,
    ...
    custom_string_50 String
);

When optimizing your table, you might be interested in the maximum length of the String columns in your table, as well as the maximum value of your number columns. If you wanted to handwrite a query to collect this information, you would have to write out the correct function call for each of the 100 columns, as sketched below.
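
Handwritten, that query would look something like this (abbreviated the same way as the table definition above):

-- tedious: one function call per column, 100 times
SELECT
    max(custom_num_1),
    max(custom_num_2),
    ...
    max(custom_num_50),
    max(length(custom_string_1)),
    ...
    max(length(custom_string_50))
FROM customer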

In some databases (like MySQL, but also ClickHouse) you can utilize the INFORMATION_SCHEMA.COLUMNS table to build your query. This can be convenient for many DBAs, as they might already be used to it, but ClickHouse provides an even faster way to achieve your goal: select modifiers.
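
For comparison, the INFORMATION_SCHEMA route might look like this minimal sketch, which generates the query text that you then run in a second step (it assumes the table lives in the default database and that data_type reports ClickHouse type names):

-- build the 100-column query instead of typing it
SELECT concat(
    'SELECT ',
    arrayStringConcat(groupArray(
        if(data_type = 'String',
           'max(length(' || column_name || '))',
           'max(' || column_name || ')')
    ), ', '),
    ' FROM customer') AS generated_query
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'default' AND table_name = 'customer';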

Utilizing a combination of modifiers, our task comes down to a simple query:

SELECT
    COLUMNS('.*num.*') APPLY max,
    COLUMNS('.*string.*') APPLY length APPLY max
FROM customer

We use the COLUMNS modifier to apply a regular expression to the column names, selecting only columns whose names contain num or string. To all matching columns we then apply the function max; for the string columns we first apply length and afterwards max. This gives us the wanted results and takes far less time than building a query via the information schema or writing out 100 columns manually.

Manual test data in various formats

Often when I help customers or users of ClickHouse in the open Slack channel, there are cases where you need to provide examples in a test table. Sometimes writing the table definition for such a test table and filling it with test data is more code than what you actually want to show in the first place.

As you can see in the previous section, the table I used to explain select modifiers is longer than the actual code I wanted to present (and it was abbreviated already).

But ClickHouse can also work directly on data as if it were in a table, using the format table function.

With this function you can specify which format you want to use (like JSONEachRow) and then directly enter the data, instead of inserting it into a table first:

SELECT *
FROM FORMAT(JSONEachRow, '{"customerId":1,"type":1,"custom_num1":4711}\n{"customerId":2,"type":2,"custom_ips":["127.0.0.1","127.0.0.2"]}')

+-customerId-+-type-+-custom_num1-+-custom_ips----------------+
|          1 |    1 |        4711 | []                        |
|          2 |    2 |        NULL | ['127.0.0.1','127.0.0.2'] |
+------------+------+-------------+---------------------------+

You can see it generates a result set with two rows, and even complex types like Arrays work. And you can choose from a large variety of data formats.
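
The same trick works with many other input formats; for example with CSVWithNames (a quick sketch):

-- column names come from the CSV header line
SELECT *
FROM FORMAT(CSVWithNames, 'customerId,type\n1,1\n2,2')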

Generate series with numbers table function

Some DBMSs can generate number sequences via table functions. ClickHouse is no exception, providing the numbers() table function for this purpose. But did you know that you can also easily generate date or time sequences with this table function?

ClickHouse supports simple arithmetic on the Date and DateTime data types, which lets you easily generate sequences of dates or timestamps:

SELECT
    number,
    now() - number AS previousTimes,
    toDate(now()) + number AS nextDays
FROM numbers(10)

+-number-+-------previousTimes-+---nextDays-+
|      0 | 2024-04-25 13:04:30 | 2024-04-25 |
|      1 | 2024-04-25 13:04:29 | 2024-04-26 |
|      2 | 2024-04-25 13:04:28 | 2024-04-27 |
|      3 | 2024-04-25 13:04:27 | 2024-04-28 |
|      4 | 2024-04-25 13:04:26 | 2024-04-29 |
|      5 | 2024-04-25 13:04:25 | 2024-04-30 |
|      6 | 2024-04-25 13:04:24 | 2024-05-01 |
|      7 | 2024-04-25 13:04:23 | 2024-05-02 |
|      8 | 2024-04-25 13:04:22 | 2024-05-03 |
|      9 | 2024-04-25 13:04:21 | 2024-05-04 |
+--------+---------------------+------------+

By applying multiplication you can also introduce step sizes, as sketched below.
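
For example, multiplying by a step width gives you a 15-minute grid (the 900-second step is just an illustration):

-- DateTime arithmetic works in seconds: 900 seconds = 15 minutes
SELECT now() + number * 900 AS every15Minutes
FROM numbers(4)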

As the data type of the resulting columns is still Date or DateTime, you can be sure that only valid dates are generated.
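
Day arithmetic also rolls over month and year boundaries correctly; a quick check:

SELECT toDate('2024-01-30') + number AS d
FROM numbers(3)
-- returns 2024-01-30, 2024-01-31 and 2024-02-01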

Data formatting in custom columns

Sometimes you need to partially format your data on output, for example when inserting data into a streaming service like Kafka. Typically you need some columns as direct columns, but you also want to combine the data of other columns into a single payload column in a given format (typically JSON).

Of course, you can do a lot of this in other DBMSs as well, by using string concatenation to build your JSON manually, by using specific JSON functions like toJSONString, or by manually creating JSON objects, etc.

But ClickHouse also has you covered here with the function formatRowNoNewline(). This function allows you to format an arbitrary number of columns into any of the output formats ClickHouse has to offer.

And of course you can also use Select Modifiers to specify which columns to format:

SELECT
    customerId,
    formatRowNoNewline('JSONEachRow', COLUMNS('.*num.*')) AS payload
FROM customer
LIMIT 10

+-customerId-+-payload------------------------------------------------------------------------------+
|         20 | {"custom_num_1":"4503644724578621668","custom_num_2":"156","custom_num_50":"32624"}  |
|        111 | {"custom_num_1":"9395348731023764754","custom_num_2":"4","custom_num_50":"8919"}     |
|        187 | {"custom_num_1":"4410745110154105282","custom_num_2":"67","custom_num_50":"19015"}   |
|        231 | {"custom_num_1":"8206799308850873985","custom_num_2":"151","custom_num_50":"43260"}  |
|        262 | {"custom_num_1":"14904510309594397590","custom_num_2":"83","custom_num_50":"2802"}   |
|        375 | {"custom_num_1":"14468162691130275987","custom_num_2":"13","custom_num_50":"6910"}   |
|        388 | {"custom_num_1":"15191470301382236130","custom_num_2":"110","custom_num_50":"39490"} |
|        434 | {"custom_num_1":"11648353314185268442","custom_num_2":"211","custom_num_50":"52725"} |
|        439 | {"custom_num_1":"8430391284904487000","custom_num_2":"31","custom_num_50":"43376"}   |
|        468 | {"custom_num_1":"11265143749859112447","custom_num_2":"41","custom_num_50":"58748"}  |
+------------+--------------------------------------------------------------------------------------+

Querying the whole cluster

Sometimes querying a single node is not enough. Imagine you are looking for queries that run longer than a specific threshold. You can find that information in ClickHouse's system.query_log table, but you would have to check each host separately.
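
Checked host by host, that might look like this minimal sketch, repeated on every node:

-- per-node check against the local query log
SELECT
    user,
    query_duration_ms
FROM system.query_log
WHERE query_duration_ms > 50
LIMIT 3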

But again, ClickHouse has you covered. The table function clusterAllReplicas allows you to execute a query on all nodes of a cluster and returns the combined result as if it came from a local table:

SELECT
    user,
    substring(query, 1, 15) AS query_part,
    query_duration_ms
FROM clusterAllReplicas('mytestcluster', system, query_log)
WHERE query_duration_ms > 50
LIMIT 3

+-user----+-query_part------+-query_duration_ms-+
| default | INSERT INTO col |                52 |
| default | INSERT INTO col |                55 |
| default | INSERT INTO col |                51 |
+---------+-----------------+-------------------+

Bonus: Working with AggregateStates

Working with materialized views and the possibilities of AggregateFunctions could fill multiple blog posts on their own (for example this one about the performance impact of materialized views). Therefore I only want to briefly mention some functionality that not everyone knows about but that can come in handy.

Let's assume we have the following two tables, which simply count unique customers per hour and per day:

CREATE TABLE customers_hourly
(
    eventDate Date,
    eventHour UInt8,
    uniqueCustomers AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (eventDate, eventHour);

CREATE TABLE customers_daily
(
    eventDate Date,
    uniqueCustomers AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (eventDate);

Initialize aggregation

Filling those tables is quite easy with materialized views. But what if you want to insert a row manually? For example, for testing purposes you might want to insert the test customerId 4711 at 3 different hours of the same day.

As uniqueCustomers is an AggregateFunction column, you cannot insert the customerId into it directly, so something like this doesn't work:

INSERT INTO customers_hourly
SELECT eventDate, eventHour, customerId
FROM FORMAT(JSONEachRow, $${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)

Of course it’s possible to generate an aggregation state by using window functions:

INSERT INTO customers_hourly
SELECT
    eventDate,
    eventHour,
    uniqState(toUInt64(assumeNotNull(customerId))) OVER ()
FROM FORMAT(JSONEachRow, $${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)

But if the function takes multiple arguments, that can become hard to read, so you can also use the initializeAggregation function to insert directly into the table:

INSERT INTO customers_hourly
SELECT
    eventDate,
    eventHour,
    initializeAggregation('uniqState', toUInt64(assumeNotNull(customerId)))
FROM FORMAT(JSONEachRow, $${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)

Finalize aggregation

Now that we know how to write directly into an aggregate function column, how can we read from it? Of course, we can use uniqMerge to get the final result, but as that is an aggregate function, we would need a GROUP BY etc. to get it. And if we want to see the individual intermediate results we added to the table, we would again need a window function to prevent the rows from collapsing. Or we simply use finalizeAggregation to make it easier:

SELECT
    eventDate,
    eventHour,
    finalizeAggregation(uniqueCustomers)
FROM customers_hourly

   ┌──eventDate─┬─eventHour─┬─finalizeAggregation(uniqueCustomers)─┐
1. │ 2024-04-26 │         1 │                                    1 │
2. │ 2024-04-26 │         2 │                                    1 │
3. │ 2024-04-26 │         3 │                                    1 │
   └────────────┴───────────┴──────────────────────────────────────┘
   ┌──eventDate─┬─eventHour─┬─finalizeAggregation(uniqueCustomers)─┐
4. │ 2024-04-26 │         1 │                                    1 │
5. │ 2024-04-26 │         2 │                                    1 │
6. │ 2024-04-26 │         3 │                                    1 │
   └────────────┴───────────┴──────────────────────────────────────┘

(The rows appear twice because the example rows were inserted twice, once by the window-function variant and once by the initializeAggregation variant, and the parts have not been merged yet.)

MergeState

How do we get from the hourly aggregation to the daily aggregation? In cases where you can simply sum up the results from the hourly table, that's quite easy to achieve. Unfortunately, with the uniq function you cannot just sum up the intermediate results, as the same user could have been active during multiple hours, so the daily result is not the sum of all hourly results. I've seen customers resorting to recalculating the daily table from the base table instead of simply continuing the aggregation with uniqMergeState:

INSERT INTO customers_daily
SELECT eventDate, uniqMergeState(uniqueCustomers)
FROM customers_hourly
GROUP BY eventDate

The logic is as easy as it sounds: it merges the intermediate results, but instead of returning the merged result, it returns the new internal state, which can then be stored in the daily aggregation table.
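
To read the final daily numbers out of customers_daily, the same GROUP BY pattern with uniqMerge does the job (a quick sketch):

-- merge the stored states into the final unique count per day
SELECT
    eventDate,
    uniqMerge(uniqueCustomers) AS uniqueCustomers
FROM customers_daily
GROUP BY eventDate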

Those were 5 (+3 bonus) small features that I think are good to know. If you have more neat features or topics you want to read about, feel free to contact me via Slack or LinkedIn.

Open source lives through its community!

Featured image credit: Kevin Ku/Unsplash
