Database Storage II (CMU Database Systems 2019 Fall)
CommentLog=structured file organization
Instead of storing tuples in pages , the DBMS only stores log records.
The system appends log records to the file of how the database was modified:
- inserts store the entire tuple
- Deletes mark the tuple as deleted
- Updates contain the delta of just the attributes that were modified
why?
- easy to roll back
- fast
Obvious downside: reads…
To read a record, the DBMS scans the log backwards and “recreates” the tuple to find what it needs.
Build indexed to allow it to jump to locations in the log
Periodically compact the log.
Today’s agenda
- Data representation
- system catalogs
- storage models
Tuple storage
The DBMS’s catalogs contain the schema information about tables that the system uses to figure out the tuple’s layout.
INTEGER/BIGINT/SMALLINT/TINYINT
- C/C++ Representation
FLOAT/REAL vs. NUMERIC/DECIMAL
- IEEE-754 STANDARD / Fixed-point Decimals
VARCHAR/VARBINARY/TEXT/BLOB
- Header with length, followed by data bytes.
TIME/DATE/TIMESTAMP
- 32/64-bit integer of (micro)seconds since Unix epoch
Fixed precision numbers
Numeric data types with arbitrary precision and scale. Used when round errors are unacceptable.
- Example: NUMERIC, DECIMAL
Typically stored in a exact, variable-length binary representation with additional meta-data
- like a varchar but not stored as a string
Large Values
Most DBMSs don’t allow a tuple to exceed the size of a single page.
To store values that are larger than a page, the DBMS uses separate overflow storage pages.
- Postgres: TOAST( > 2 KB)
- MySQL: Overflow (> 1/2 size of page)
- SQL Server: Overflow( > size of page)
To BLOB or Not To BLOB: Large Object Storage in a Database of a Filesystem?
256 KiB seems to be a division
System Catalogs
A DBMS stores meta-data about database in its internal catalogs
-Tables, columns, indexed, views
Users, permissions
Internal statistics
Almost every DBMS stores their a database’s catalog in itself.
- Wrap object abstraction around tuples
- Specialized code for “bootstrapping” catalog tables.
You can query the DBMS’s internal INFORMATION_SCHEMA catalog to get info about the database.
- ANSI standard stet of read-only views that provide info about all of the tables, views, columns, and procedures in a database.
DBMSs also have non-standard shortcuts to retrieve this information.
1 | SELSCT * FROM INFORMATION_SCHEMA.TABLES WHERE table_catalog = '<db name>'; |
1 | \d; |
1 | SHOW TABLES; |
1 | .schema <table name>; |
observation
The relational model does not specify that we have to store all of a tuple’s attributes together in a single page.
This may not actually be the best layout for some workloads…
OLTP
Online Transaction Processing:
- Simple queries that read / update a small amount of data that is related to a single entity in the database.
This is usually the kind of application that people build first.
OLAP
Online Analytical Processing
- Complex queries that read large portions of the database spanning multiple entities
- You execute these workloads on the data you have collected from your OLTP applications(s).
N-ARY STORAGE MODEL
Advantages
- Fast inserts, updates, and deletes.
- Good for queries that need the entire tuple.
disadvantages
- Not good for scanning large portions of the table and / or a subset of the atreibutes./
DECOMPOSITION STORAGE MODEL(DSM)
The DBMS stores the values of a single attribute for all tuples contiguously in a page
- Also known as a “column store”.
Tuple Identification
Choice #1: fixed-length Offsets
- Each value is the same length for an attribute
Choice #2: Embedded Tuple Ids
- Each value is stored with its tuple id in a column
Advantages
- Reduces the amount wasted I/O because the DBMS only reads the data that it needs
- Better query processing and data compression (more on this later)
Disadvantages
- slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching
CONCLUSION
The storage manager is not entirely independent from the rest of the DBMS.
It is important to choose the right storage model for the target workload:
- OLTP = Row Store
- OLAP = Column Store
![](20221030-DatabaseStorageII.assets/屏幕截图 2022-10-30 185728.png)
Next:
- How the DBMS manages its memory and move data back-and-forth from disk
MIX OF TWO? engineering overhead too much
hybrid storage systems