How to Optimize a Database for High Performance, Part 2
In this PostgreSQL Performance and Tuning blog series, we cover PostgreSQL parameters and data design: which parameters to tune and what to consider while designing your database schema.
Hello, and welcome to a free monthly edition of my newsletter. I’m Ajay, and each week I tackle reader questions about PostgreSQL, implementing PG with best practices, and anything else that’s stressing you out in the PostgreSQL world. Send me your questions and in return, I’ll humbly offer actionable real-talk advice. 🤜🤜
If you find this post valuable, check out some of my other posts:
To receive this newsletter in your inbox weekly, consider subscribing 👇
In the first blog post of the database performance and optimization series, I talked about six different areas we can focus on for optimal PostgreSQL performance.
In this blog, we will cover the important database configuration parameters and database design.
Database configuration
PostgreSQL configuration file contains a variety of settings that can impact the database server's performance. It's important to understand the impact of each setting and to configure them appropriately to ensure that the server is using resources effectively. Regular monitoring and tweaking of the configuration settings can help to maintain optimal performance over time.
Resource Usage
shared_buffers
The purpose of shared_buffers is to control how much memory PostgreSQL reserves for its shared page cache. It acts as a cache for recently accessed data, keeping frequently accessed pages in memory to reduce disk I/O and improve performance. The size of shared_buffers affects the amount of memory used by PostgreSQL and, consequently, may affect overall system performance. If the value is too low, the database may do more disk I/O and perform more slowly. On the other hand, setting the value too high can waste memory that would be better left to the operating system cache.
The shared_buffers parameter determines how much memory the database server allocates for shared memory buffers. This parameter is important for optimizing the performance of the database by reducing the need to read data from disk and providing a quicker response time for data requests. It is recommended to set the shared_buffers value to 25% of your system’s memory.
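As a rough sketch, on a hypothetical dedicated server with 16 GB of RAM, 25% works out to about 4 GB; the value can be applied with ALTER SYSTEM, and shared_buffers only takes effect after a server restart:

```sql
-- Illustrative value: assumes a dedicated server with ~16 GB RAM (25% is roughly 4 GB).
ALTER SYSTEM SET shared_buffers = '4GB';
-- shared_buffers is not reloadable; restart the server for it to take effect,
-- then confirm with: SHOW shared_buffers;
```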
work_mem
work_mem is a configuration parameter in PostgreSQL that determines the maximum amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. It is often raised to improve query performance and reduce pressure on disk I/O by allowing more operations to be done in memory instead of spilling to disk.
The recommended starting point for work_mem is ((Total RAM - shared_buffers)/(16 x CPU cores)).
If the work_mem parameter is set too low, you may experience hashing operations being done in batches, which can be observed in the query plan.
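As an illustration, for a hypothetical server with 64 GB of RAM, 16 GB of shared_buffers, and 16 CPU cores, the formula above gives (64 - 16) GB / (16 x 16) ≈ 192 MB. A sketch of setting it, and of spotting spilled hash operations in a plan (the orders table is hypothetical):

```sql
-- Example numbers only: (64 GB - 16 GB) / (16 x 16 cores) is roughly 192 MB.
ALTER SYSTEM SET work_mem = '192MB';
SELECT pg_reload_conf();

-- work_mem can also be raised for a single heavy reporting session:
SET work_mem = '512MB';

-- In EXPLAIN (ANALYZE) output on newer PostgreSQL versions, "Batches: 1" means the
-- hash fit in memory; a higher batch count means it spilled because work_mem was too low.
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, count(*)
FROM orders            -- hypothetical table
GROUP BY customer_id;
```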
checkpoint_timeout, checkpoint_completion_target
Checkpoint_timeout and checkpoint_completion_target are two PostgreSQL configuration parameters that affect the frequency and duration of database checkpoints.
checkpoint_timeout dictates the maximum time between WAL checkpoints. Setting it too low leads to more frequent checkpoints and decreased performance, because flushing dirty data to the data files is expensive in disk I/O.
checkpoint_completion_target is the fraction of the time between checkpoints over which each checkpoint's writes are spread. When both parameters are set to sensible values, checkpoints are triggered by the timeout (which is more predictable), the write load is smoothed out, and enough WAL space is reserved to handle spikes in WAL usage.
maintenance_work_mem
The maintenance_work_mem setting determines the maximum amount of memory used by maintenance operations. This includes tasks such as VACUUM, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY, and restoring database dumps.
The default value is 64MB, but it can be increased to improve performance for maintenance operations. The setting is also used by autovacuum workers: the number of concurrent workers is determined by autovacuum_max_workers, and each worker can use up to this amount of memory (unless autovacuum_work_mem is set). This must be taken into account when setting maintenance_work_mem, because the total memory used by autovacuum workers can be several times the configured value. It should typically be set to 10-20% of the system's RAM.
By allowing a larger amount of memory to be used for maintenance operations, those operations require fewer disk I/O operations, as more data can be held in memory instead of on disk. This leads to better overall performance and reduces response time for maintenance operations. Furthermore, increasing the maintenance_work_mem setting allows more dead tuples to be cleaned up in a single autovacuum pass, which can result in a better-performing database.
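A hedged sketch of adjusting it, both globally and for a single session before a heavy index build (the table and index names are hypothetical):

```sql
-- Illustrative value for a server with plenty of RAM.
ALTER SYSTEM SET maintenance_work_mem = '2GB';
SELECT pg_reload_conf();

-- Or raise it just for one session before building a large index:
SET maintenance_work_mem = '4GB';
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);  -- hypothetical table/index
```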
autovacuum_work_mem
autovacuum_work_mem sets the amount of memory each autovacuum worker may use for its maintenance work. It defaults to -1, in which case the value of maintenance_work_mem is used instead. It is important to note that every autovacuum worker can use this much memory simultaneously, so take the worker count into account when setting the autovacuum_work_mem parameter. Increasing it allows for more efficient vacuuming and maintenance operations.
Autovacuum
log_autovacuum_min_duration
log_autovacuum_min_duration logs autovacuum runs that take at least the specified amount of time. This parameter is helpful if you want to track the activity of autovacuum tasks: setting it to 0 logs every autovacuum action, while setting it to -1 disables this logging. With logging enabled, you can also see when an autovacuum action was skipped, for instance because of a conflicting lock or a concurrently dropped relation.
autovacuum_max_workers
Autovacuum is a background process responsible for removing dead tuples (deleted rows) and updating the statistics used by the PostgreSQL query planner to optimize queries. The default number of autovacuum workers is 3; this can be increased (a change to autovacuum_max_workers requires a restart) to allow more frequent, parallel vacuuming across tables. The autovacuum feature automates the execution of the VACUUM and ANALYZE commands, which can be run in standard or full variants. Monitoring autovacuum activity and increasing the number of workers, as well as the memory available to each worker, can help with tuning autovacuum for optimal performance.
autovacuum_vacuum_cost_limit
The vacuum cost limit for autovacuum is the maximum amount of work a vacuum worker can do before pausing to rest. It is configured with the autovacuum_vacuum_cost_limit parameter, which defaults to -1, meaning the regular vacuum_cost_limit value (200 by default) is used. This parameter controls the amount of resources that the autovacuum process can use and helps to prevent excessive load on the database server due to autovacuuming. Increasing this value allows autovacuum to do more work before sleeping.
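A hedged example of how the autovacuum-related parameters from this section might be adjusted together; the values are illustrative rather than universal recommendations, and autovacuum_max_workers only changes after a restart:

```sql
-- Illustrative values; tune against your own table count and I/O capacity.
ALTER SYSTEM SET autovacuum_max_workers = 5;           -- requires a server restart
ALTER SYSTEM SET autovacuum_work_mem = '256MB';        -- memory available to each autovacuum worker
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;  -- let workers do more work before sleeping
ALTER SYSTEM SET log_autovacuum_min_duration = '1s';   -- log autovacuum runs that take longer than 1s
SELECT pg_reload_conf();
```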
Write-Ahead Log
wal_compression
Write-ahead compression is a technique used in PostgreSQL to reduce the amount of disk space required to store the Write-Ahead Log (WAL). When enabled, the server compresses a full-page image written to the WAL when full_page_writes is on or during a base backup. This helps to reduce the WAL file size, thus reducing disk space usage and increasing database performance.
The effect of WAL compression is a decrease in the amount of disk space needed to store the WAL and in the volume of WAL that must be written, which can improve database performance and reduce recovery time when the database crashes, at the cost of some extra CPU spent on compression. Related to this, the wal_buffers parameter (covered next) controls the shared memory used for WAL data that has not yet been written to disk, and increasing it to at least a few MB can further improve write performance on a server with many concurrent transactions.
wal_buffers
The wal_buffers parameter determines how much memory the database server uses for write-ahead logging (WAL) buffers. These buffers hold transaction data that is later written to persistent storage. By using the WAL mechanism, PostgreSQL can recover data after a failure and reduce disk I/O by avoiding frequent writes to disk. The larger the wal_buffers value, the more memory is dedicated to caching data and writing transaction data to the WAL file. This can improve write performance on a server with many concurrent transactions. However, if the wal_buffers value is set too low, it can negatively impact write performance by decreasing the amount of memory available for back-end processes to write WAL data in memory.
wal_buffers is an important parameter for the PostgreSQL database. It is the amount of shared memory used for the write-ahead log (WAL) data that has not yet been written to disk. The default value is about 3% of the shared_buffers, but it can be increased to improve write performance on a server with many concurrent transactions.
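A small sketch of enabling WAL compression and raising wal_buffers (illustrative values; wal_buffers only takes effect after a restart):

```sql
ALTER SYSTEM SET wal_compression = 'on';  -- compress full-page images written to the WAL
ALTER SYSTEM SET wal_buffers = '16MB';    -- default is -1 (auto-sized, ~3% of shared_buffers); restart required
SELECT pg_reload_conf();                  -- applies wal_compression; wal_buffers waits for the restart
```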
checkpoint_timeout
The checkpoint_timeout setting determines the maximum amount of time between automatic WAL checkpoints. A higher timeout reduces overall WAL volume but makes crash recovery take longer, so the recommended value is usually a minimum of 15 minutes; ultimately, the RPO of your business requirements will dictate what it should be.
checkpoint_completion_target
The checkpoint_completion_target parameter sets the fraction of the time between checkpoints over which checkpoint writes should be spread. This helps to ensure smooth checkpointing: instead of a burst of I/O, the writes are paced across the interval.
The checkpoint completion target is used to set the amount of time in which PostgreSQL aims to complete a checkpoint. This is done to prevent an I/O spike and instead spread the writes over this fraction of the checkpoint_timeout value. The recommended value is 0.9 which is the default in PostgreSQL 14.
max_wal_size
The maximum size of a write-ahead log (WAL) is determined by the “max_wal_size” parameter. This parameter specifies the maximum size the WAL can grow between control points. Increasing this parameter can increase the amount of time needed to recover faults, so it's important to set this value carefully. The recommended value is half to two-thirds of the available disk space where the WAL is located. It is also important to remember that the WAL file can also grow larger than the max_wal_size parameter in special circumstances. To help prevent running out of disk space, checkpoint_timeout should be set to trigger checkpoints at regular intervals. In addition to the max_wal_size parameter, other WAL settings such as wal_compression, min_wal_size, wal_buffers, and checkpoint_completion_target should also be taken into consideration. Furthermore, it is recommended to save the WAL files on a disk other than PGDATA to both balance the writing and for security in case of hardware failure.
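Putting the checkpoint and WAL sizing settings together, a hedged sketch (the numbers are illustrative and depend on available disk space and recovery objectives):

```sql
-- Illustrative checkpoint/WAL sizing; adjust to your disk space and recovery targets.
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- the default since PostgreSQL 14
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET min_wal_size = '1GB';
SELECT pg_reload_conf();
```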
archive_mode
The archive mode is a setting in PostgreSQL that controls how the write-ahead log (WAL) is handled. When enabled, a copy of the WAL segment is written to an archive location, usually a separate disk, for long-term storage. This allows for the recovery of the database up to the point of a crash or other failure. When the archive mode is disabled, the WAL segment is recycled or removed from the system once it is no longer needed. The archive mode is important in ensuring that data is not lost in the event of a system failure.
archive_command
The archive command is used to back up or archive write-ahead log (WAL) files. It allows the database to create a copy of the WAL file in an alternative location, so that the WAL log is always backed up and can be recovered in the case of a system crash. This is a critical component of PostgreSQL's write-ahead logging system, and must be configured correctly for the system to be reliable. The archive command can also be used to perform periodic archiving of WAL files, which keeps disk space requirements for the WAL log to a minimum.
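A minimal sketch of enabling archiving to a locally mounted directory; the path /mnt/wal_archive is hypothetical, archive_mode needs a restart, and in production a dedicated backup tool (pgBackRest, WAL-G, etc.) is usually a better archive_command than a bare cp:

```sql
ALTER SYSTEM SET archive_mode = 'on';  -- requires a server restart
-- %p expands to the path of the WAL segment, %f to its file name.
-- /mnt/wal_archive is a hypothetical archive location on a separate disk.
ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';
```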
Other Important parameters:
superuser_reserved_connections
superuser_reserved_connections determines the number of connection slots reserved for superusers, so that an administrator can still connect once the max_connections limit has been reached.
fsync
If fsync is enabled, PostgreSQL will try to make sure that the updates are physically written to the disk. This ensures that the database cluster can be recovered to a consistent state after an operating system or hardware crash.
While disabling fsync generally improves performance, it can cause data loss in the event of a power failure or a system crash. Therefore, it is only advisable to deactivate fsync if you can easily recreate your entire database from external data.
max_connections
max_connections affects how the PostgreSQL server and client connections behave. You can use this to configure the maximum number of parallel connections that a PostgreSQL server can support. The formula used to calculate this is:
max_connections = max(4 * number of CPU cores, 100)
This makes sure that at any given point in time, the CPU is not overloaded with too many active connections. But we also need to make sure we have enough hardware resources to support this number of parallel connections.
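For example, on a hypothetical 32-core server the formula gives max(4 × 32, 100) = 128. Both of the following parameters require a restart:

```sql
-- Hypothetical 32-core server: max(4 * 32, 100) = 128.
ALTER SYSTEM SET max_connections = 128;               -- requires a restart
ALTER SYSTEM SET superuser_reserved_connections = 5;  -- slots kept free for superusers; requires a restart
```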
default_statistics_target
The default statistics target is a PostgreSQL parameter that controls the amount of statistical data that PostgreSQL stores. This data is used to determine the best query plans and efficient query execution. The default value is 100, which is considered a low target. Increasing this value can result in more accurate query plans and better query performance, as PostgreSQL will have more statistical data to work with. However, increasing this value too high can negatively affect performance and cause unnecessary overhead.
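It can be raised globally, or, as a lighter-weight alternative, only for specific columns whose estimates are poor; the table and column below are hypothetical:

```sql
-- Global change (illustrative value); affects statistics gathered by future ANALYZE runs.
ALTER SYSTEM SET default_statistics_target = 200;
SELECT pg_reload_conf();

-- Or raise it only for one column with skewed data (hypothetical table/column),
-- then re-collect statistics.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
```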
synchronous_commit
synchronous_commit is an important setting in PostgreSQL that determines whether a transaction commit waits for the WAL records to be written to disk before the command returns a “success” indication to the client. Setting it to “on” ensures higher reliability, as the server waits for the WAL to be flushed to disk before reporting success to the client. However, this can become a performance bottleneck, since each commit takes longer. The setting can therefore be adjusted depending on whether the application prioritizes performance or durability. Additionally, PostgreSQL has a range of different lock types to allow for reasonable concurrency while maintaining consistency constraints, which helps to reduce lock contention and provides transaction isolation for each database session.
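Because synchronous_commit can be changed per session or per transaction, one common pattern is to relax it only for low-value writes while keeping the server-wide setting on; the audit_log table below is hypothetical:

```sql
-- Keep the cluster-wide default durable (on is already the default).
ALTER SYSTEM SET synchronous_commit = 'on';

-- Relax durability for a single non-critical transaction: a crash might lose
-- this commit, but it cannot corrupt the database.
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO audit_log (event, created_at) VALUES ('page_view', now());  -- hypothetical table
COMMIT;
```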
temp_buffers
This parameter sets the maximum number of temporary buffers used by each database session. The session local buffers are used only for access to temporary tables. The setting of this parameter can be changed within individual sessions but only before the first use of temporary tables within the session.
PostgreSQL database utilizes this memory area for holding the temporary tables of each session, these will be cleared when the connection is closed.
The default value of temp_buffers is 8MB.
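A short sketch; note that the setting must be issued before the session touches its first temporary table:

```sql
-- Must run before the first use of a temporary table in this session.
SET temp_buffers = '64MB';

-- Hypothetical example of a session-local temporary table.
CREATE TEMP TABLE tmp_recent_orders AS
SELECT * FROM orders WHERE created_at > now() - interval '7 days';
```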
Database Design:
In addition to improving data access performance, a good design achieves other benefits, such as maintaining data consistency, accuracy, and reliability and reducing storage space by eliminating redundancies. Another benefit of good design is that the database is easier to use and maintain. Anyone who has to manage it will only need to look at the entity-relationship diagram (ERD) to understand its structure.
ERDs are the fundamental tool of database design. They can be created and visualized at three levels of design: conceptual, logical, and physical.
The conceptual design shows a very summarized diagram, with only the elements necessary to agree on criteria with the project stakeholders, who do not need to understand the technical details of the database. The logical design shows the entities and their relationships in detail but in a database-agnostic way.
Define the type of database to design
Two fundamental types of databases are usually distinguished: relational and dimensional.
Relational databases are used for traditional applications that execute transactions on the data – that is, they get information from the database, process it, and store the results.
On the other hand, Dimensional databases are used for the creation of data warehouses: large repositories of information for data analysis and data mining to obtain insights.
A relational database design
A dimensional database design
The first step in any database design task is to choose one of the two main database types to work with: relational or dimensional. It is vital to have this clear before you start designing. Otherwise, you can easily fall into design mistakes that will eventually lead to many problems and will be difficult (or impossible) to correct.
Adopting a Naming Convention
The names used in the database design are essential because, once an object is created in a database, changing its name can be fatal. Changing just one letter of the name can break dependencies, relationships, and even entire systems.
That’s why it is critical to work with a healthy naming convention: a set of rules that saves you the trouble of trying 50 different possibilities to find the name of an object you can’t remember.
There is no universal standard for what a naming convention must look like to do its job. The important thing is to establish a naming convention before naming any object in the database, and then stick to it. A naming convention establishes guidelines such as whether to use underscores to separate words or to join them directly, whether to use all capital letters or to capitalize each word (CamelCase style), whether to use plural or singular nouns to name objects, and so on.
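As one illustration (by no means the only valid convention): lowercase snake_case identifiers, singular table names, <table>_id primary keys, and idx_<table>_<column> index names. The tables are hypothetical:

```sql
-- One possible convention: snake_case, singular nouns, <table>_id keys,
-- idx_<table>_<column> index names. All objects here are hypothetical.
CREATE TABLE customer (
    customer_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    full_name   text NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE customer_order (
    customer_order_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id       bigint NOT NULL REFERENCES customer (customer_id),
    ordered_at        timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX idx_customer_order_customer_id ON customer_order (customer_id);
```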
Start with the conceptual design, then the logical design, and finally the physical design.
That’s the natural order of things. As a designer, you may be tempted to start by creating objects directly on the DBMS to skip steps. But this will prevent you from having a tool to discuss with stakeholders to ensure that the design meets the business requirements.
After the conceptual design, you must move on to the logical design to have adequate documentation to help the programmers understand the database structure. It is vital to keep the logical design updated to be independent of the database engine to be used. This way, if you eventually migrate the database to a different engine, the logical design will still be useful.
Finally, the physical design can be created by the programmers themselves or by a DBA, taking the logical design and adding all the implementation details needed to implement it on a particular DBMS.
Create and maintain a data dictionary
Even if an ERD is clear and descriptive, you should add a data dictionary to make it even clearer. The data dictionary maintains coherence and consistency in the database design, particularly when the number of objects in it grows significantly.
The main purpose of the data dictionary is to maintain a single repository of reference information about the entities of a data model and its attributes. The data dictionary should contain the names of all entities, the names of all attributes, their formats and data types, and a brief description of each.
The data dictionary provides a clear and concise guide to all the elements that make up the database. This avoids creating multiple objects that represent the same thing, which makes it difficult to know which object to resort to when you need to query or update information.
Maintain consistent criteria for primary keys
The decision to use natural keys or surrogate keys must be consistent within a data model. If entities in a data model have unique identifiers that can be efficiently managed as primary keys of their respective tables, there is no need to create surrogate keys.
But it is common for entities to be identified by multiple attributes of different types – dates, numbers, and/or long strings of characters – which may be inefficient for forming primary keys. In these cases, it’s better to create surrogate keys of integer numeric type, which provide maximum efficiency in index management. And the surrogate key is the only option if an entity lacks attributes that uniquely identify it.
A table with a natural primary key (left) versus a table with a surrogate key (right)
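In SQL terms, the two approaches might look like this (hypothetical tables): a country's ISO code works naturally as a primary key, while an employee is better served by an integer surrogate key:

```sql
-- Natural key: the two-letter ISO code already identifies a country uniquely.
CREATE TABLE country (
    country_code char(2) PRIMARY KEY,
    country_name text NOT NULL
);

-- Surrogate key: a compact identity column stands in for attributes
-- (name, email, hire date) that would make an unwieldy composite key.
CREATE TABLE employee (
    employee_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email       text NOT NULL UNIQUE,
    full_name   text NOT NULL,
    hire_date   date
);
```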
Use the correct data types for each attribute.
Certain data give us the option of choosing which data type to use to represent them. Dates, for example. We can choose to store them in date type fields, date/time type fields, varchar type fields, or even numeric type fields. Another case is numeric data that is not used for mathematical operations but to identify an entity, such as a driver’s license number or a zipcode.
In the case of dates, it is convenient to use the engine’s data type, which makes it easier to manipulate data. If you need to store only the date of an event without specifying the time, the data type to choose will be simply Date; if you need to store the date and time when a certain event occurred, the data type should be DateTime.
Using other types, such as varchar or numeric, to store dates may be convenient only in very particular cases. For example, if it is not known in advance in which format a date will be expressed, it can be convenient to store it as varchar. And if search performance, sorting, or indexing on date fields is critical, converting them to a numeric type such as float beforehand can make a difference.
Numeric data not involved in mathematical operations should be represented as varchar, applying format validation when the data is written to avoid inconsistencies or duplicates. Otherwise, you expose yourself to the risk that some data exceeds the limitations of the numeric fields and forces you to refactor a design when it is already in production.
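A short illustration of these choices in one hypothetical table: native date/timestamp types for dates, and varchar with a format check for identifier-like numbers such as zip codes:

```sql
CREATE TABLE shipment (                               -- hypothetical table
    shipment_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    ship_date    date NOT NULL,                       -- date only, no time component
    delivered_at timestamptz,                         -- date and time of the event
    zip_code     varchar(10) NOT NULL
        CHECK (zip_code ~ '^[0-9]{5}(-[0-9]{4})?$'),  -- format validated on write
    license_no   varchar(20)                          -- identifier, never used in arithmetic
);
```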
Indexing (What Types and Why It’s Important)
Indexing is a database optimization technique that allows for faster retrieval of data from a table. It works by creating a separate data structure that stores a mapping of the values in a specific column to the corresponding rows in the table. When a query is executed, the database engine can use the index to quickly locate the rows that match the search criteria, rather than scanning the entire table.
There are several types of indexes in PostgreSQL, including the following (a short example of creating each type follows the list):
B-tree index: A balanced tree index that is used for most general-purpose queries.
Hash index: A hash-based index that is useful only for simple equality comparisons.
GiST (Generalized Search Tree) index: A versatile index that can be used for a variety of data types, including geometric and text data.
GIN (Generalized Inverted Index) index: An index designed for values that contain multiple component items, such as arrays, jsonb documents, and full-text search vectors.
SP-GiST (Space-Partitioned Generalized Search Tree) index: An index for data with a natural, non-balanced partitioning structure, such as IP addresses, phone routing prefixes, or point data.
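As a sketch, each index type is created with a USING clause; the tables and columns below are hypothetical, and each type only pays off on columns whose data and queries match it:

```sql
-- Hypothetical tables/columns; one statement per index type.
CREATE INDEX idx_orders_created_at ON orders    USING btree (created_at);   -- default, general purpose
CREATE INDEX idx_sessions_token    ON sessions  USING hash (token);         -- simple equality lookups
CREATE INDEX idx_shops_location    ON shops     USING gist (location);      -- geometric / nearest-neighbour data
CREATE INDEX idx_documents_tags    ON documents USING gin (tags);           -- arrays, jsonb, full-text values
CREATE INDEX idx_visits_client_ip  ON visits    USING spgist (client_ip);   -- naturally partitioned data such as inet
```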
Some guidelines for choosing the right index
Choose the right data type: Choose the right data type for your columns as this can affect the performance of indexes.
Index columns with high selectivity: Columns with high uniqueness and low redundancy should be prioritized for indexing as they will result in more efficient queries.
Avoid index overloading: Avoid indexing too many columns in a single index as it can lead to index overloading and slow down the database.
Use the correct index type: Choose the correct index type (e.g. B-tree, Hash, etc.) based on the type of queries that will be performed on the data.
Keep indexes small: Large indexes can slow down updates and inserts into the database, so try to keep them as small as possible.
Monitor index usage: Regularly monitor the usage of indexes and make changes as necessary to improve performance.
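For that last point, a query against the standard pg_stat_user_indexes view is a common way to spot indexes that are never scanned and only add write overhead:

```sql
-- Indexes that have not been used since the statistics were last reset.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```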
Some great resources to read 👍
7 Database Design Mistakes to Avoid (With Solutions)
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
https://www.postgresql.org/docs/14/runtime-config-resource.html
Tuning PostgreSQL Database Parameters to Optimize Performance
https://www.percona.com/blog/tuning-postgresql-database-parameters-to-optimize-performance/