Optimizing MySQL Storage: Choosing the Right Data Types to Speed Up I/O and Save RAM

MySQL tutorial - IT technology blog
MySQL tutorial - IT technology blog

Slow Database? The Culprit Might Be Your Data Types

When I first started working with MySQL, all I cared about was whether queries ran and whether indexes were correct. Choosing between INT and BIGINT, or VARCHAR(255) versus VARCHAR(50) — I just went with “bigger is safer” and moved on.

It wasn’t until my production MySQL 8.0 database hit around 50GB and started showing abnormally high I/O that I sat down to audit the schema — and discovered a string of problems. The user_id column was using BIGINT even though the data had never exceeded 2 million records. The status column stored the strings "active"/"inactive" instead of TINYINT. The email column was declared as VARCHAR(1000) “just to be safe.”

After refactoring the schema, queries became noticeably faster and the buffer pool utilization improved dramatically. This article documents the practical tips from that process — not textbook theory.

Why Do Data Types Affect I/O and RAM?

MySQL reads data in units of pages (InnoDB defaults to 16KB). How many rows fit in a page depends on the size of each row. The smaller the row, the more rows fit per page.

The practical consequences of “bloated” rows from oversized data types:

  • Higher I/O: More disk reads are needed to scan the same amount of data because fewer rows fit per page
  • Wasted RAM: The InnoDB buffer pool (memory cache) holds fewer rows for the same amount of RAM
  • Larger indexes: B-tree index traversal slows down with larger key sizes — and indexes store data in the column’s declared type

A concrete example: for a table with 10 million rows, saving 8 bytes per row by choosing INT instead of BIGINT adds up to 80MB of data savings. Factor in the indexes that store that primary key, and the real number is much higher.

In Practice: Choosing the Right Data Type for Each Case

1. Integers — Don’t Default to BIGINT for Everything

This is the most common mistake I see. Each integer type has a different range and storage size:

-- Storage size and range:
TINYINT    -- 1 byte  | max 127 (signed) / 255 (unsigned)
SMALLINT   -- 2 bytes | max 32,767 / 65,535
MEDIUMINT  -- 3 bytes | max 8,388,607 / 16,777,215
INT        -- 4 bytes | max 2,147,483,647 / 4,294,967,295
BIGINT     -- 8 bytes | max ~9.2 × 10^18

Practical applications:

  • age, star_rating, quantity for small values → use TINYINT UNSIGNED (0–255), no need for INT
  • year, postal_codeSMALLINT UNSIGNED or the YEAR type
  • user_id, product_id → if data stays under 4 billion, INT UNSIGNED is enough, saving 4 bytes/row over BIGINT
  • Only use BIGINT when truly needed: financial transaction IDs, Snowflake IDs, distributed systems
-- Bad: BIGINT for user_id in a small system, wastes 4 bytes/row
CREATE TABLE orders (
  id         BIGINT AUTO_INCREMENT PRIMARY KEY,
  user_id    BIGINT NOT NULL,
  status     TINYINT NOT NULL DEFAULT 1
);

-- Better: saves 8 bytes/row (id + user_id)
CREATE TABLE orders (
  id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id    INT UNSIGNED NOT NULL,
  status     TINYINT UNSIGNED NOT NULL DEFAULT 1
);

2. Strings — Size VARCHAR(n) to Reality, Not “Just to Be Safe”

VARCHAR(n) stores only the actual length plus 1–2 bytes of overhead. However, the value of n affects temporary memory when MySQL needs to sort or group: it allocates n × charset_bytes per row in a temporary table, even if the actual data is much shorter.

-- Declare close to actual max length:
email       VARCHAR(254),    -- RFC 5321: max email length is 254 characters
username    VARCHAR(50),     -- usernames rarely exceed 50 characters
phone       VARCHAR(20),     -- longest international phone number is ~15 characters
slug        VARCHAR(200),    -- realistic URL slug length
country     CHAR(2),         -- ISO country code: VN, JP, US — use CHAR for fixed-length

-- Avoid oversized declarations:
-- email VARCHAR(1000)   -- allocates 3000 bytes/row when sorting (UTF-8 × 3)
-- name  VARCHAR(255)    -- if the actual max is 100, don't declare 255

CHAR vs VARCHAR: Use CHAR(n) for fixed-length data like country codes, postal codes, or string-format UUIDs. Use VARCHAR for variable-length data.

TEXT types: Not stored inline in the row — InnoDB stores them on a separate page, with the row holding only a pointer. This means reading a TEXT column requires an additional I/O operation. Use only when you genuinely need to store long content:

-- Correct: TEXT for genuinely long content
article_content  MEDIUMTEXT,   -- blog article body
product_desc     TEXT,
raw_log          LONGTEXT

-- Wrong: TEXT for short data
note             TEXT          -- if notes are only a few dozen characters
-- Better:
note             VARCHAR(500)

3. ENUM for Fixed Value Sets

Columns like status, role, and type with a handful of fixed values — this is where ENUM shines:

-- VARCHAR stores the actual string; each value takes exactly len(value) bytes
status VARCHAR(20)   -- 'active' = 6 bytes, 'inactive' = 8 bytes

-- ENUM stores a 1-2 byte index; actual values live in table metadata
status ENUM('active', 'inactive', 'banned', 'pending')  -- always 1 byte
role   ENUM('admin', 'editor', 'viewer', 'guest')

One downside to keep in mind: adding or removing ENUM values requires an ALTER TABLE (expensive on large tables). If the value set changes frequently, using TINYINT UNSIGNED with constants in application code is more flexible.

4. Date and Time Types — TIMESTAMP Saves 4 Bytes/Row Compared to DATETIME

-- DATETIME : 8 bytes, range 1000–9999, not timezone-aware
-- TIMESTAMP: 4 bytes, range 1970–2038, auto-converts to server timezone
-- DATE     : 3 bytes, no time component

-- In practice:
created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
birth_date  DATE,                     -- 3 bytes, no time needed
event_start DATETIME(3)               -- when millisecond precision is needed, or range exceeds 2038

For systems running in a single timezone that don’t need to store timestamps beyond 2038, each TIMESTAMP column instead of DATETIME saves 4 bytes. A table with 10 million rows and 2 timestamp columns saves 80MB right away.

5. NULL — Declare NOT NULL Whenever Possible

Nullable columns cost an extra bit in the NULL bitmap. The impact isn’t as large as in MyISAM, but NULL complicates queries and can cause the optimizer to make poor decisions when estimating cardinality.

-- Default to NOT NULL + DEFAULT for columns that always have a value
view_count  INT UNSIGNED NOT NULL DEFAULT 0,
is_active   TINYINT(1) NOT NULL DEFAULT 1,
created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

-- Only allow NULL when you need to distinguish "not set" from "empty value"
deleted_at  TIMESTAMP NULL DEFAULT NULL,   -- soft delete pattern
verified_at TIMESTAMP NULL DEFAULT NULL    -- not yet verified = NULL

6. Auditing Your Current Schema with INFORMATION_SCHEMA

Before optimizing, I always run this query to spot “bloated” columns:

-- List columns with data types worth reviewing
SELECT 
  TABLE_NAME,
  COLUMN_NAME,
  COLUMN_TYPE,
  CHARACTER_MAXIMUM_LENGTH,
  IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
  AND (
    DATA_TYPE IN ('text', 'mediumtext', 'longtext')
    OR (DATA_TYPE = 'varchar' AND CHARACTER_MAXIMUM_LENGTH > 500)
    OR (DATA_TYPE = 'bigint')
    OR (DATA_TYPE = 'datetime')
  )
ORDER BY TABLE_NAME, ORDINAL_POSITION;
-- Check the actual size of each table
SELECT 
  TABLE_NAME,
  TABLE_ROWS,
  ROUND(DATA_LENGTH / 1024 / 1024, 2)              AS data_mb,
  ROUND(INDEX_LENGTH / 1024 / 1024, 2)             AS index_mb,
  ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 2) AS total_mb
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database'
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC
LIMIT 20;

Conclusion: Principles I Apply From Day One

Schema design isn’t a “set and forget” thing. As data grows, poor early decisions turn into technical debt that’s expensive to pay back. Refactoring a 100-million-row table costs far more than getting it right from the start.

Here are the principles I now apply whenever I create a table:

  1. Use the smallest integer that fits: Estimate the real-world max value and choose the appropriate type. Add UNSIGNED if negative values aren’t needed.
  2. Size VARCHAR(n) to reality: Don’t declare 255 or 1000 “just to be safe” — it affects memory during sorting.
  3. ENUM for stable fixed value sets: status, role, type — don’t use VARCHAR.
  4. TIMESTAMP over DATETIME when data won’t exceed 2038 — saves 4 bytes/row.
  5. NOT NULL + DEFAULT as the default: Only allow NULL when there’s a clear reason.
  6. Audit regularly with INFORMATION_SCHEMA: Catch issues early before data grows.

These optimizations might seem minor individually, but on a 50GB database like mine, they make a real difference — both in query speed and buffer pool RAM utilization.

Share: