ClickHouse Dictionaries: Storing Dictionaries in Memory

2019-01-01 2019-11-06

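ClickHouse supports several ways of storing dictionaries in memory.

The flat, hashed, and complex_key_hashed layouts are generally recommended because they provide the best processing speed. The cache layout is not recommended: it can perform poorly, and its optimal parameters are hard to choose.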

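There are several ways to improve dictionary performance:

Call the dictionary function after GROUP BY.
Mark extracted attributes as injective. An attribute is called injective if different attribute values correspond to different keys. When GROUP BY uses a function that fetches an attribute value by key, that function is then automatically taken out of GROUP BY.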

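ClickHouse throws an exception in the following cases:

The dictionary being accessed could not be loaded.
An error occurred while querying a cache dictionary.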

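You can view the list of loaded external dictionaries and their status in the system.dictionaries table.

The layout is configured in the dictionary > layout > layout_type node of the dictionary configuration file: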

<yandex>
    <dictionary>
        ...
        <layout>
            <layout_type>
                <!-- layout settings -->
            </layout_type>
        </layout>
        ...
    </dictionary>
</yandex>

ClickHouse currently supports the following layouts:

flat
hashed
cache
range_hashed
complex_key_hashed
complex_key_cache
ip_trie

flat

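The dictionary is stored entirely in memory as flat arrays. How much memory does it use? The amount is proportional to the size of the largest key, because the arrays are indexed by key.

The dictionary key must be of type UInt64, and key values are limited to 500,000. If a larger key is found while the dictionary is being created, ClickHouse throws an exception and does not create the dictionary.

All source types are supported. On update, the data (from a file or a table) is read in its entirety.

Of all the layouts, flat provides the best performance.

Configuration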

<layout>
  <flat />
</layout>
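
To make the memory claim above concrete, here is a minimal Python sketch of a flat layout. It is a toy model, not ClickHouse's implementation: the class name `FlatDictionary`, the single string attribute, and the constant `MAX_FLAT_KEY` are assumptions made for illustration. It shows why the array length follows the largest key rather than the number of rows.

```python
from typing import List, Optional, Tuple

MAX_FLAT_KEY = 500_000  # key limit quoted in the text above


class FlatDictionary:
    """Toy model of a flat layout: one array, indexed directly by key."""

    def __init__(self, default: str = "") -> None:
        self.default = default
        self.values: List[Optional[str]] = []

    def load(self, rows: List[Tuple[int, str]]) -> None:
        """Replace the whole data set, as a full reload on update would."""
        if any(key < 0 or key > MAX_FLAT_KEY for key, _ in rows):
            raise ValueError("key is outside the range allowed for a flat layout")
        # The array must reach the largest key, however few rows there are.
        size = max((key for key, _ in rows), default=-1) + 1
        values: List[Optional[str]] = [None] * size
        for key, value in rows:
            values[key] = value
        self.values = values

    def get(self, key: int) -> str:
        """Return the stored value, or the default for a missing key."""
        if 0 <= key < len(self.values) and self.values[key] is not None:
            return self.values[key]
        return self.default
```

Loading only two rows with keys 1 and 7 still allocates eight slots, which is the trade-off behind the key limit: lookups are a single array index, but sparse large keys waste memory.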

hashed

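The dictionary is stored entirely in memory as a hash table. It can contain any number of elements with any identifiers. In practice, the number of keys can reach tens of millions.

All source types are supported. On update, the data (from a file or a table) is read in its entirety.

Configuration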

<layout>
  <hashed />
</layout>

complex_key_hashed

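As described in the "Dictionary Key and Fields" section, ClickHouse dictionaries support numeric keys and composite keys. With a composite key, the layout must be complex_key_hashed or complex_key_cache.

In all other respects it is the same as hashed.

Configuration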

<layout>
  <complex_key_hashed />
</layout>

range_hashed

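The dictionary is stored entirely in memory as a hash table with an ordered array of ranges and their corresponding values.

This layout works the same way as hashed and, in addition to the ordinary key, allows date/time ranges to be used as part of the key.

Example: the table contains discounts for each advertiser in the following format: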

+---------------+---------------------+-------------------+--------+
| advertiser id | discount start date | discount end date | amount |
+===============+=====================+===================+========+
| 123           | 2015-01-01          | 2015-01-15        | 0.15   |
+---------------+---------------------+-------------------+--------+
| 123           | 2015-01-16          | 2015-01-31        | 0.25   |
+---------------+---------------------+-------------------+--------+
| 456           | 2015-01-01          | 2015-01-15        | 0.05   |
+---------------+---------------------+-------------------+--------+

To use date ranges, define range_min and range_max in the structure:

<structure>
    <id>
        <name>Id</name>
    </id>
    <range_min>
        <name>first</name>
    </range_min>
    <range_max>
        <name>last</name>
    </range_max>
 ....

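To work with these dictionaries, pass an additional date argument to the dictGetT function:

dictGetT('dict_name', 'attr_name', id, date)

This function returns the value for the specified id and the date range that contains the passed date (id = x and date between first and last).

-- Example: the discount for advertiser 123 on 2015-01-18. The id must be a UInt64 and the date a Date value.
dictGetT('dict_name', 'attr_name', toUInt64(123), toDate('2015-01-18'))  -- returns 0.25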

Details of the algorithm:

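If the id is not found, or no range containing the passed date is found for the id, the dictionary's default value is returned.
If there are overlapping ranges, any of them may be used.
If a range delimiter is NULL or an invalid date (such as 1900-01-01 or 2039-01-01), the range is left open. The range can be open on both sides.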
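
The rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than ClickHouse's code: the function `range_get`, the `Range` tuple, and the use of `None` for an open side are all hypothetical, and a linear scan stands in for the ordered range array. The data is the discount table from the example.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

# (range_min, range_max, value); None marks an open side of the range
Range = Tuple[Optional[date], Optional[date], float]


def range_get(table: Dict[int, List[Range]], key: int, day: date,
              default: float = 0.0) -> float:
    """Return the value whose range contains `day`, else the default."""
    for first, last, value in table.get(key, []):
        if (first is None or first <= day) and (last is None or day <= last):
            return value  # with overlapping ranges, any match may be returned
    return default


# The advertiser discount table from the example above.
discounts: Dict[int, List[Range]] = {
    123: [(date(2015, 1, 1), date(2015, 1, 15), 0.15),
          (date(2015, 1, 16), date(2015, 1, 31), 0.25)],
    456: [(date(2015, 1, 1), date(2015, 1, 15), 0.05)],
}
```

With this table, `range_get(discounts, 123, date(2015, 1, 18))` selects the second range of advertiser 123, while an unknown id or a date outside every range falls back to the default.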

Configuration

<yandex>
        <dictionary>

                ...

                <layout>
                        <range_hashed />
                </layout>

                <structure>
                        <id>
                                <name>Abcdef</name>
                        </id>
                        <range_min>
                                <name>StartDate</name>
                        </range_min>
                        <range_max>
                                <name>EndDate</name>
                        </range_max>
                        <attribute>
                                <name>XXXType</name>
                                <type>String</type>
                                <null_value />
                        </attribute>
                </structure>

        </dictionary>
</yandex>

cache

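The dictionary is stored in a cache with a fixed number of cells. These cells contain frequently used elements.

When a dictionary lookup is performed, the cache is searched first. For each block of data, all keys that are not found in the cache or have expired are requested from the source with a query of the form SELECT attrs... FROM db.table WHERE id IN (k1, k2, ...). The received data is then written to the cache.

For cache dictionaries, you can set the expiration lifetime of data in the cache. If more time than lifetime has passed since the data was loaded into a cell, the cell's value is not used, and the data is requested again the next time it is needed. This is the least efficient of all the ways to store dictionaries. The speed of a cache dictionary depends strongly on correct settings and on the usage scenario. A cache dictionary performs well only when the hit rate is high enough (99% or higher is recommended). You can view the average hit rate in the system.dictionaries table.

To improve cache performance, use a subquery with LIMIT, and call the dictionary function in the outer query.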
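
The lookup flow described above (check the cache, fetch misses and expired keys in one batch, write the results back) can be modelled roughly as follows. This is a simplified sketch, not ClickHouse's implementation: `CacheDictionary`, `fetch`, and `clock` are hypothetical names, the fixed number of cells and eviction are not modelled, and `fetch` merely stands in for the SELECT ... WHERE id IN (...) request.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class CacheDictionary:
    """Toy model of a cache layout with a per-cell lifetime."""

    def __init__(self, fetch: Callable[[List[int]], Dict[int, str]],
                 lifetime: float, clock: Callable[[], float]) -> None:
        self.fetch = fetch        # stands in for the batched source query
        self.lifetime = lifetime  # seconds a cell stays valid after loading
        self.clock = clock        # injected so the behaviour is testable
        self.cells: Dict[int, Tuple[str, float]] = {}  # key -> (value, loaded_at)
        self.hits = 0
        self.lookups = 0

    def get_many(self, keys: Iterable[int], default: str = "") -> List[str]:
        """Look up one block of keys, fetching misses in a single batch."""
        keys = list(keys)
        now = self.clock()
        missing: List[int] = []
        for key in keys:
            cell = self.cells.get(key)
            if cell is not None and now - cell[1] <= self.lifetime:
                self.hits += 1
            elif key not in missing:
                missing.append(key)  # absent or expired
        self.lookups += len(keys)
        if missing:
            fetched = self.fetch(missing)  # one request per block
            for key in missing:
                self.cells.pop(key, None)  # never serve an expired value
                if key in fetched:
                    self.cells[key] = (fetched[key], now)
        return [self.cells[key][0] if key in self.cells else default
                for key in keys]

    def hit_rate(self) -> float:
        """Fraction of lookups answered from the cache so far."""
        return self.hits / self.lookups if self.lookups else 0.0
```

Every miss costs a round trip to the source, which is why the hit rate matters so much: with a low hit rate almost every block triggers a source query.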

Supported sources: MySQL, ClickHouse, executable, HTTP

Configuration

<layout>
    <cache>
        <!-- The size of the cache, in number of cells. Rounded up to a power of two. -->
        <size_in_cells>1000000000</size_in_cells>
    </cache>
</layout>

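Set a large enough cache size. You need to experiment to find the right number of cells:

Set some value.
Run queries until the cache is completely full.
Assess memory consumption using the system.dictionaries table.
Increase or decrease the number of cells until the required memory consumption is reached.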
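
When choosing a value, keep in mind the comment in the configuration: the cell count is rounded up to a power of two. A small Python sketch of that rounding follows; the helper name `round_up_to_power_of_two` is hypothetical, and it only illustrates the arithmetic, not how ClickHouse computes it.

```python
def round_up_to_power_of_two(n: int) -> int:
    """Return the smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("cell count must be positive")
    # (n - 1).bit_length() is the exponent of the next power of two.
    return 1 << (n - 1).bit_length()


# The value from the configuration example above.
effective_cells = round_up_to_power_of_two(1_000_000_000)
```

For the configured 1000000000 cells, the effective count under this rounding is 2^30 = 1073741824, so the measured memory consumption can be noticeably higher than the configured number suggests.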

Warning:

Do not use ClickHouse as a source, because it is slow at processing queries with random reads.

complex_key_cache

This type of storage is for use with composite keys. Similar to cache.

ip_trie

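This layout is for mapping network prefixes (IP addresses) to metadata such as the ASN.

Example: the table contains network prefixes and their corresponding AS number and country code: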

+-----------------+-------+--------+
| prefix          | asn   | cca2   |
+=================+=======+========+
| 202.79.32.0/20  | 17501 | NP     |
+-----------------+-------+--------+
| 2620:0:870::/48 | 3856  | US     |
+-----------------+-------+--------+
| 2a02:6b8:1::/48 | 13238 | RU     |
+-----------------+-------+--------+
| 2001:db8::/32   | 65536 | ZZ     |
+-----------------+-------+--------+

When using this type of layout, the structure must have a composite key.

Example:

<structure>
  <key>
      <attribute>
          <name>prefix</name>
          <type>String</type>
      </attribute>
  </key>
  <attribute>
          <name>asn</name>
          <type>UInt32</type>
          <null_value />
  </attribute>
  <attribute>
          <name>cca2</name>
          <type>String</type>
          <null_value>??</null_value>
  </attribute>
  ...

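The key must have only one attribute, of type String, that contains an allowed IP prefix. Other types are not supported yet.

For queries, you must use the same functions as for dictionaries with composite keys (dictGetT with a tuple):

dictGetT('dict_name', 'attr_name', tuple(ip))

The function takes either a UInt32 for IPv4 or a FixedString(16) for IPv6:

dictGetString('prefix', 'asn', tuple(IPv6StringToNum('2001:db8::1')))

Other types are not supported yet. The function returns the attribute for the prefix that corresponds to this IP address. If there are overlapping prefixes, the most specific one is returned.

The data is stored entirely in RAM in the form of a trie.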
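
The "most specific prefix wins" behaviour is what a binary trie gives naturally. The following Python sketch illustrates the idea with the standard `ipaddress` module. It is a toy model under stated assumptions, not ClickHouse's data structure: `PrefixTrie` and its methods are hypothetical names, and only a single integer attribute (the ASN) is stored.

```python
import ipaddress
from typing import Dict, Optional


class PrefixTrie:
    """Toy binary trie for longest-prefix matching of IPv4/IPv6 addresses."""

    def __init__(self) -> None:
        # One trie per IP version; each node is a dict with optional
        # children under the keys 0 and 1 and an optional "value" entry.
        self.roots: Dict[int, dict] = {4: {}, 6: {}}

    def insert(self, prefix: str, value: int) -> None:
        """Store `value` at the node reached by the prefix bits."""
        net = ipaddress.ip_network(prefix)
        node = self.roots[net.version]
        bits = int(net.network_address)
        width = net.max_prefixlen  # 32 for IPv4, 128 for IPv6
        for i in range(net.prefixlen):
            bit = (bits >> (width - 1 - i)) & 1
            node = node.setdefault(bit, {})
        node["value"] = value

    def lookup(self, address: str) -> Optional[int]:
        """Walk the address bits and keep the deepest value seen."""
        addr = ipaddress.ip_address(address)
        node = self.roots[addr.version]
        bits = int(addr)
        width = addr.max_prefixlen
        best = node.get("value")
        for i in range(width):
            node = node.get((bits >> (width - 1 - i)) & 1)
            if node is None:
                break
            best = node.get("value", best)  # a deeper match is more specific
        return best
```

Because the walk keeps overwriting `best` as it descends, an address covered by both a /32 and a /48 ends up with the /48 value, and an address with no matching prefix returns `None` (where a dictionary would return the attribute's null_value).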

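Title: ClickHouse Dictionaries: Storing Dictionaries in Memory
Author: LoganShen
Link: https://blog.95id.com/clickhouse-dictionaries-storing-dictionaries-in-memory.html
Published: 2019-01-01
License: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source when reposting.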