AWS RDS MySQL의 Handler 메트릭 모니터링 [PMM]

데이터베이스

AWS RDS MySQL의 Handler 메트릭 모니터링 [PMM]

leezzangmin 2024. 11. 12. 17:33

인트로

MySQL을 운영하면서 주의깊게 보는 메트릭이 여러가지 있습니다. CPU 사용율이나 Memory 사용량 등 하드웨어 수치도 중요하지만, 그 중 서버 내부 메트릭 중 Handler 메트릭이 있습니다. 이 수치들을 모니터링하면 실제 피크타임에 문제가 발생하고 나서 알아차리는 것 보다 빠르게 미리 문제를 식별하고 조치할 수 있습니다.

이 글에서는 간단하게 해당 메트릭들을 PMM을 통해 모니터링하고, 조치하고, 자동화하는 방법을 작성해보겠습니다.

실험 환경

AWS RDS MySQL 8.0.39버전, db.t4g.micro 클래스의 DB 인스턴스
1. 슬로우 쿼리, slow_extra 파라미터 활성화
로컬에서 docker를 통해 Percona Monitorng Management를 실행 (PMM)
AWS IAM 사용자의 access key 및 secret으로 RDS 를 discover 및 등록
test1 데이터베이스 complex_table에는 데이터 200만건 적재

모니터링

모니터링하려는 handler 메트릭은 MySQL에 아래 명령어를 입력하면 쉽게 조회할 수 있습니다.

mysql> show global status like 'handler%';
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1540   |
| Handler_delete             | 8      |
| Handler_discover           | 0      |
| Handler_external_lock      | 20549  |
| Handler_mrr_init           | 0      |
| Handler_prepare            | 10     |
| Handler_read_first         | 860    |
| Handler_read_key           | 6390   |
| Handler_read_last          | 0      |
| Handler_read_next          | 8659   |
| Handler_read_prev          | 0      |
| Handler_read_rnd           | 0      |
| Handler_read_rnd_next      | 394332 |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_update             | 1005   |
| Handler_write              | 244899 |
+----------------------------+--------+
18 rows in set (0.04 sec)

이 변수들을 조회해서 특정 Variable_name의 Value가 특정 값을 초과하면 알람을 전송하는 파이썬 스크립트를 작성할 수도 있습니다.

-- pseudocode
while True:
	status=map(name,value).of(mysql.select("show global status like 'handler%'")
	for name,value in map:
    	if name=="target_name" and value >= target_value:
        	slack.send_message("이상 발생")

자체 구축 스크립트도 괜찮지만 PMM으로 모니터링이 가능합니다.

MySQL InnoDB Details 탭에서 확인할 수 있습니다.

가상 시나리오

나열된 많은 메트릭 중 이 글에서 살펴볼 수치는 `Handler_read_rnd_next` 입니다.

The number of requests to read the next row in the data file. This value is high if you are doing a lot of table scans. Generally this suggests that your tables are not properly indexed or that your queries are not written to take advantage of the indexes you have.

간략히 말하면, 테이블 풀 스캔으로 수행되는 SQL에 의해 이 수치가 상승됩니다. https://dev.mysql.com/doc/refman/8.4/en/server-status-variables.html

이 수치가 높거나 점진적으로/급격하게 상승하고 있다면 인덱스를 적절히 활용하지 못하는 SQL이 유입되고 있다고 판단할 수 있습니다. DBA 입장에서는 해당 쿼리를 신속하게 식별해서 인덱스 필요 유무를 판단해서 인덱스를 생성하거나, 여의치 않다면 쿼리 최적화해서 개발팀에 변경을 요청하는 등의 조치를 취할수도 있습니다.

사전에 생성되어 200만건의 데이터가 삽입되어 있는 테이블이 있습니다.

CREATE TABLE test1.complex_table (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(255) NOT NULL,
    age INT,
    address TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    status ENUM('active', 'inactive', 'pending') DEFAULT 'active',
    json_data JSON,
    KEY idx_name_email (name, email),
    UNIQUE KEY udx_email (email)
);

Q. 아래 쿼리는 Handler_read_rnd_next 수치를 상승시킬까요?

SELECT * FROM test1.complex_table where name='john';

A. name 컬럼에 인덱스가 생성되어있기 때문에 john에 해당하는 데이터가 많지 않다면 상승하지 않습니다. 인덱스를 사용 쿼리 실행 전후의 메트릭을 세션 단위에서 비교해보면 정확히 알 수 있습니다.

-- status 초기화
mysql> flush status;
Query OK, 0 rows affected (0.02 sec)

-- 쿼리 실행
SELECT ...

-- 변수 조회
mysql> show session status like '%handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 2     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 4     |
| Handler_mrr_init           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_last          | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
18 rows in set (0.02 sec)

실행 후 문제가 없는 SQL이라고 판단이 가능합니다.

Q2. 다음 SQL은 어떨까요?

SELECT * FROM test1.complex_table where age = 25;

A2. age 컬럼에 적절한 인덱스가 없으므로 테이블 풀스캔으로 수행되는 Read_rnd_next 수치가 상승하는 쿼리입니다.

mysql> show session status like '%handler%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 2       |
| Handler_mrr_init           | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 1       |
| Handler_read_key           | 1       |
| Handler_read_last          | 0       |
| Handler_read_next          | 0       |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_next      | 2000001 |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+
18 rows in set (0.15 sec)

그런데, 이런 식으로 매 쿼리가 수행될 때 마다 status를 조회하는 것은 불가능합니다.

PMM에서도 metric의 상승/하락 경향은 파악할 수 있지만, 각 SQL의 수치는 알 수 없습니다.

바로 이 때 필요한 것이 slow_extra 파라미터 입니다.

https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_log_slow_extra

해당 파라미터를 ON으로 설정하면 slow 쿼리에 handler 메트릭이 포함되어 저장됩니다.

Dynamic type 변수이기 때문에 운영 중 적용해도 문제가 없습니다.

설정한 후에는 아래처럼 추가적인 메트릭이 포함되어 로그에 기록됩니다.

알람

이제 메트릭이 어떤 상황에서 증가하는지 파악했으니 PMM에서 슬랙 알람을 설정하면 됩니다. (이 부분은 다른 레퍼런스를 참조해주세요)

하지만 앞서 언급했듯이, PMM 대시보드에서는 상승/하락에 대한 경향성만 인지할 수 있고 어떤 쿼리가 악성 쿼리인지 판별하기 어려울 수 있습니다.

그렇다면 slow query log를 하나씩 찾아보며 원인을 판별해야합니다.

Read_rnd_next 수치가 상승하는 원인은 여러 경우가 있습니다.

신규 개발된 SQL이 배포되었는데, 인덱스를 사용할 수 없는 SQL
기존에 인덱스를 잘 사용하던 쿼리지만, 데이터가 분포가 점진적으로 변경되어 더 이상 좋은 plan으로 수행되지 않는 SQL
CDC 등의 솔루션에서 풀 싱크를 수행
개발팀에서 개발 데이터 검증을 위한 1회성 쿼리를 풀스캔으로 수행
관리형 스크립트에서 메타 테이블에 대해 조회 혹은 백업 수행 (PMM 유저도 포함됨)
배치 등 대용량 조회

그럼 슬랙으로 알람이 왔을 때 로그를 살펴보면서 일일히 어떤 쿼리가 악성 쿼리인지 찾아내야 합니다.

그런데 방대한 슬로우쿼리 로그를 눈으로 하나씩 보고 있기에는 휴먼 에러의 위험도 있고, 리소스가 너무 많이 투입됩니다.

바로 이럴 때 slow query log를 읽어서 handler 수치가 높은 쿼리를 뽑아내는 프로그램이 필요합니다.

-- psuedo code

for slow_query_log in slow_query_logs:
	if slow_query_log.read_rnd_next >= 1000:
    	print(slow_query_log)

뭐 이런거 있으면 편하겠죠....? 호호호

간단한 이 스크립트를 개발해두면 악성 SQL을 조기 탐지할 수 있으니 여러분도 해보세용

그럼 용두사미 글 끝. 안녕~~~

데이터 삽입 스크립트

챗지피티가 많이 발전했다. 이 정도는 스크립트는 기존에도 몇 번 왔다갔다 하면 조금만 수정해도 쓸 수 있을 정도로 잘 했지만...

import mysql.connector
from mysql.connector import Error
import random
import string
import json
from datetime import datetime
from multiprocessing import Pool

def generate_random_string(length):
    return ''.join(random.choices(string.ascii_letters + string.digits, k=length))

def generate_random_email():
    return generate_random_string(10) + '@' + generate_random_string(5) + '.com'

def generate_random_data(_):
    name = generate_random_string(10)
    email = generate_random_email()
    age = random.randint(18, 70)
    address = generate_random_string(50)
    status = random.choice(['active', 'inactive', 'pending'])
    json_data = json.dumps({
        'key1': generate_random_string(10),
        'key2': random.randint(1, 1000)
    })
    return (name, email, age, address, status, json_data)

def insert_data(records):
    try:
        connection = mysql.connector.connect(
            host='test-57.cnyugyuqszqs.ap-northeast-2.rds.amazonaws.com',
            database='test1',
            user='admin',
            password='Cromysql5128*'
        )

        cursor = connection.cursor()
        sql_insert_query = """
            INSERT INTO complex_table (name, email, age, address, status, json_data)
            VALUES (%s, %s, %s, %s, %s, %s)
        """

        cursor.executemany(sql_insert_query, records)
        connection.commit()
        print(f"{cursor.rowcount} records inserted successfully.")

    except Error as e:
        print(f"Error: {e}")
    finally:
        if connection.is_connected():
            cursor.close()
            connection.close()

def main():
    total_records = 1000000
    batch_size = 10000  # 한 번에 삽입할 레코드 수
    pool = Pool()

    for _ in range(0, total_records, batch_size):
        records = pool.map(generate_random_data, range(batch_size))
        insert_data(records)

    pool.close()
    pool.join()

if __name__ == "__main__":
    main()

https://dev.mysql.com/doc/refman/8.4/en/server-status-variables.html

MySQL :: MySQL 8.4 Reference Manual :: 7.1.10 Server Status Variables

7.1.10 Server Status Variables The MySQL server maintains many status variables that provide information about its operation. You can view these variables and their values by using the SHOW [GLOBAL | SESSION] STATUS statement (see Section 15.7.7.37, “S

dev.mysql.com