Ceph Community News (2022-07-01 to 2022-07-30)

rosinL · 2022-08-13 · Ceph Dynamic · Pacific · openEuler · sig: sig-SDS


Quincy Ultra-large-scale Deployment Test

The Quincy large-scale test, conducted on 200 servers with a 100 Gbit/s network, aimed to evaluate the cluster management and monitoring modules. Two significant issues were identified and resolved during the evaluation: mgr and CLI responses freezing because of performance counter reporting from the Ceph daemons, and PG peering being disturbed by the pg_autoscaler during scale-out. cephadm cluster management was optimized to reduce errors and agent log traffic and to make interactive cluster management more responsive. Prometheus could not collect data from all nodes at the same time, so faster SSDs are needed when it monitors a large cluster.
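For readers who want to look at the pg_autoscaler behavior mentioned above on their own clusters, a minimal sketch using standard Ceph commands (illustrative only; <pool-name> is a placeholder, and disabling autoscaling is optional):

ceph osd pool autoscale-status                         # review the autoscaler's current per-pool targets
ceph osd pool set <pool-name> pg_autoscale_mode off    # optionally pause autoscaling for a pool while investigating peering impact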

Ceph Security Vulnerability Fixes

The following two security vulnerabilities have been fixed, and versions v16.2.10 and v17.2.2 have been released; upgrading to these versions is recommended (a minimal upgrade sketch follows the list below).

  • Ceph clusters upgraded from Nautilus (or earlier) to a later major version were vulnerable to attacks from malicious users: such users could access any part of the CephFS file system hierarchy instead of being correctly restricted to their own subvolumes.

  • An s3website request that does not refer to a bucket leads to a NULL pointer dereference, causing an RGW segfault.
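A minimal sketch of checking versions and upgrading, assuming a cephadm-managed cluster; the target version is whichever fixed release matches your line (Pacific 16.2.10 or Quincy 17.2.2):

ceph versions                                    # show which versions the running daemons report
ceph orch upgrade start --ceph-version 17.2.2    # start a rolling upgrade to the fixed release (use 16.2.10 on Pacific)
ceph orch upgrade status                         # follow the upgrade progress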

In-depth Optimization of Ceph RocksDB

The default RocksDB tuning in Ceph was compared with several alternative configurations, examining how these settings affect write amplification and performance on NVMe drives and showing how complex the interaction between the different options is. With the right combination of options, higher performance can be achieved without significantly increasing write amplification, and in some tests enabling compression reduced write amplification while having only a moderate impact on performance. In these tests, the highest-performing configuration was usually the following, with compression disabled:

bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=128,min_write_buffer_number_to_merge=16,compaction_style=kCompactionStyleLevel,write_buffer_size=8388608,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0

With LZ4 compression (for RGW workloads this can reduce write amplification, although bucket-listing performance is affected):

bluestore_rocksdb_options = compression=kLZ4Compression,max_write_buffer_number=128,min_write_buffer_number_to_merge=16,compaction_style=kCompactionStyleLevel,write_buffer_size=8388608,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0

These non-default configurations can improve performance but have not been tested extensively. Over the next few months they will be run through QA, and the default values may be changed in the next Ceph release.
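For reference, a minimal sketch of how such an option string could be applied cluster-wide with the ceph CLI (illustrative only, not a recommendation from the article); bluestore_rocksdb_options is only read when an OSD starts, so the OSDs must be restarted afterwards:

# store the non-default options for all OSDs (paste the full string from above in place of the trailing ...)
ceph config set osd bluestore_rocksdb_options "compression=kNoCompression,max_write_buffer_number=128,..."
# verify the stored value before restarting the OSDs
ceph config get osd bluestore_rocksdb_options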

Recently Merged PRs

Recent PRs mainly focused on bug fixes. The main changes are as follows:

  • MemDB:
    • Removed the MemDB code to simplify the buffer library; MemDB is no longer supported. #36282
    • Removed the statfs update on each txc, which reduces database load and improves performance. #46036
  • compressor/zlib:
    • On arm64, zlib now uses isa-l, improving performance by 4 to 7 times. #44762
  • libcephsqlite:
    • Fixed a libcephsqlite crash when the regular expression '-' is compiled with gcc 8.5.0-14. #47270
  • crimson:
    • Fixed a ClientRequest leak that occurred when the OSD was stopped. #47064
    • Fixed a sigsegv caused by reuse of moved-from objects. #47237
    • Fixed undefined behavior caused by a forced pointer cast in Transaction::is_retired. #47081
  • rgw:
    • Optimized lock contention: the single 5-second wait is split into multiple 10 ms retries, reducing lock waiting time. #46248
    • Added an s3website check so that RGW no longer crashes when no bucket name is specified. #46933
  • mgr large-cluster management optimization:
    • Avoided dumping all PG statistics for progress updates. #44208
    • Restricted pg_num modifications to stay within the range allowed by pgp_num. #44155
    • Dispatched notifications to a mgr module only when that module needs to be notified. #44162

Recent Ceph Developer News

Each module of the Ceph community holds regular meetings to discuss and align on development progress. Meeting videos are uploaded to YouTube. The major meetings are as follows:

Conference Name                     | Description                                  | Frequency
Crimson SeaStore OSD Weekly Meeting | Crimson & SeaStore development               | Weekly
Ceph Orchestration Meeting          | Ceph management module (mgr)                 | Weekly
Ceph DocUBetter Meeting             | Documentation optimization                   | Biweekly
Ceph Performance Meeting            | Ceph performance optimization                | Biweekly
Ceph Developer Monthly              | Ceph developer meeting                       | Monthly
Ceph Testing Meeting                | Version verification and release             | Monthly
Ceph Science User Group Meeting     | Ceph in scientific computing                 | Irregular
Ceph Leadership Team Meeting        | Ceph leadership team                         | Weekly
Ceph Tech Talks                     | Technical topics in the Ceph community       | Monthly

Performance Meeting

  • The performance of v17.2.1 is lower than that of v17.2.0.
    • The difference between the two versions is that BlueStore's zero-block detection is enabled by default in v17.2.0 but disabled by default in v17.2.1 (see the sketch after this list).
  • In-depth optimization of Ceph RocksDB
    • Discussed RGW compression: with compression enabled, write and delete performance improves and write traffic is reduced, while read performance does not degrade with LZ4. When the rgw daemon is restricted to two cores, bucket-listing performance degrades, but read, write, and delete performance is unchanged.
    • Discussed tombstone cleanup and compaction policies for data deleted from RocksDB: within a specified iteration window, a compaction is triggered once the tombstone count exceeds a threshold.
    • Discussed BlueStore statfs statistics optimization for the upcoming new WAL.
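A minimal sketch of inspecting and toggling this behavior, assuming the setting involved is the bluestore_zero_block_detection option (illustrative only, not guidance from the meeting):

ceph config get osd bluestore_zero_block_detection       # show the effective value (disabled by default on v17.2.1)
ceph config set osd bluestore_zero_block_detection true  # re-enable zero-block detection to match v17.2.0 behavior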

Crimson SeaStore OSD Weekly Meeting

  • Resolved issues related to ZNS support.
  • GC transactions are not triggered by user I/O, so extent reads issued by GC transactions do not exhibit temporal locality; such extents should not be added to the LRU cache.
