Counting processed rows (read/written) in PostgreSQL

Scripts to determine PostgreSQL database size and rows processed

Differences between the PostgreSQL and MySQL storage formats and why this matters for billing estimations

tl;dr: Rows-processed numbers and database size may differ between PostgreSQL and MySQL due to different index and row storage formats

PostgreSQL and MySQL are both relational databases with strong transactional capabilities. However, the way their storage engines store rows and their corresponding indexes, and how those indexes are used during queries, differs significantly. Check out this article from Uber Engineering for the technical details behind those differences.

Due to those differences in index and row storage formats, any numbers about rows read/written and database size from PostgreSQL will differ from the numbers you can expect once migrated to MySQL. If you are using similar indexes for your queries, the numbers should be fairly close, but depending on your exact queries and read/write pattern, the MySQL numbers might be lower or higher.
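
To sanity-check the comparison after a migration, MySQL keeps similar cumulative counters for InnoDB tables in its global status variables. A minimal sketch, assuming the InnoDB storage engine and a mysql client session:

mysql> SHOW GLOBAL STATUS LIKE 'Innodb_rows_%';

This returns the cumulative Innodb_rows_read, Innodb_rows_inserted, Innodb_rows_updated and Innodb_rows_deleted counters since the last server startup, which you can compare against the PostgreSQL numbers determined below.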

Determine PostgreSQL database size

Determining the size of a PostgreSQL database is pretty straightforward: run the following SQL statement for your DB:

SELECT pg_size_pretty( pg_database_size('databasename') );

For instance, if your database was called secrets, you would determine the size like this:

$ psql secrets
psql (14.1)
Type "help" for help.

secrets=# SELECT pg_size_pretty( pg_database_size('secrets') );
 pg_size_pretty 
----------------
 8889 kB
(1 row)

secrets=# 
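
If you also want to see which tables account for most of that size (including their indexes and TOAST data), a per-table breakdown like the following sketch should work:

secrets=# SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;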

Determine PostgreSQL rows processed

Running statistics queries multiple times to determine frequency

In contrast to MySQL, PostgreSQL does not seem to record dedicated statistics about rows processed since the last server startup. Instead, statistics are accumulated from the very beginning and are only cleared when a superuser runs a statistics reset.

To see the number of rows read/written since the last statistics reset for your database, connect to it and run the following statement:

$ psql mydatabase
Type "help" for help.

mydatabase=# select cast(extract(epoch from now()) as INTEGER) as epoch_seconds, sum(seq_tup_read + idx_tup_fetch) as rows_read, sum(n_tup_ins + n_tup_upd + n_tup_del) as rows_written from pg_stat_user_tables where relname not like 'psc_data_%';
 epoch_seconds | rows_read | rows_written
---------------+-----------+--------------
    1642075374 |         2 |            2
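
If you want to understand which counters go into those sums, a per-table variant of the query (a sketch, with the column meanings taken from the pg_stat_user_tables view) looks like this:

mydatabase=# SELECT relname,
       seq_tup_read,   -- live rows fetched by sequential scans
       idx_tup_fetch,  -- live rows fetched by index scans
       n_tup_ins,      -- rows inserted
       n_tup_upd,      -- rows updated
       n_tup_del       -- rows deleted
from pg_stat_user_tables where relname not like 'psc_data_%';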

Now you can find out when your statistics were last reset:

mydatabase=# select stats_reset from pg_stat_database where datname=current_database();
          stats_reset
-------------------------------
 2022-01-12 17:15:37.568914+01
(1 row)

If your workload since the last statistics reset has been relatively constant, you can use the timestamp of that last reset to calculate daily averages:

mydatabase=# WITH timediff as (
     SELECT cast(extract(epoch from (now() - stats_reset)) as INTEGER) as since_last_stats_reset from pg_stat_database where datname=current_database()
)
SELECT timediff.since_last_stats_reset as seconds_since_last_stats_reset,
       sum(seq_tup_read + idx_tup_fetch) as rows_read,
       sum(n_tup_ins + n_tup_upd + n_tup_del) as rows_written,
       sum(seq_tup_read + idx_tup_fetch) / timediff.since_last_stats_reset * 60 * 60 * 24 as average_rows_read_daily,
       sum(n_tup_ins + n_tup_upd + n_tup_del) / timediff.since_last_stats_reset * 60 * 60 * 24 as average_rows_written_daily
from pg_stat_user_tables, timediff
where relname not like 'psc_data_%'
group by timediff.since_last_stats_reset;

 seconds_since_last_stats_reset | rows_read | rows_written | average_rows_read_daily  | average_rows_written_daily
--------------------------------+-----------+--------------+--------------------------+----------------------------
                            579 |         3 |            1 | 447.66839378238341990400 |   149.22279792746113968000
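
To see which tables drive those averages, the same calculation can also be done per table; a sketch based on the query above:

mydatabase=# WITH timediff as (
     SELECT cast(extract(epoch from (now() - stats_reset)) as INTEGER) as since_last_stats_reset from pg_stat_database where datname=current_database()
)
SELECT relname,
       (seq_tup_read + idx_tup_fetch)::numeric / timediff.since_last_stats_reset * 60 * 60 * 24 as average_rows_read_daily,
       (n_tup_ins + n_tup_upd + n_tup_del)::numeric / timediff.since_last_stats_reset * 60 * 60 * 24 as average_rows_written_daily
from pg_stat_user_tables, timediff
where relname not like 'psc_data_%'
order by average_rows_read_daily desc;

The ::numeric cast avoids integer division here, since the per-table counters are plain bigint columns rather than sums.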

If your last reset was a very long time ago, or your historical workload differs too much from the workload you expect to migrate to MySQL, you have two options:

Option I: Calculate the statistics manually by running the same query multiple times a week

We have already shown you the query to see all rows affected since the last statistics reset:

mydatabase=# select cast(extract(epoch from now()) as INTEGER) as epoch_seconds, sum(seq_tup_read + idx_tup_fetch) as rows_read, sum(n_tup_ins + n_tup_upd + n_tup_del) as rows_written from pg_stat_user_tables where relname not like 'psc_data_%';
 epoch_seconds | rows_read | rows_written
---------------+-----------+--------------
    1642083509 |        12 |            9
(1 row)

The rows_read and rows_written numbers should increase over time, as should the epoch_seconds value, which represents the current Unix time.

If you run the query again later (after your typical DB workload pattern has played out), you can use the differences between the column values to calculate the average rows read and written:

Imagine your average workload pattern has occurred again and you run a similar query at a later point in time:

mydatabase=# select cast(extract(epoch from now()) as INTEGER) as epoch_seconds_2, sum(seq_tup_read + idx_tup_fetch) as rows_read_2, sum(n_tup_ins + n_tup_upd + n_tup_del) as rows_written_2 from pg_stat_user_tables where relname not like 'psc_data_%';
 epoch_seconds_2 | rows_read_2 | rows_written_2
-----------------+-------------+----------------
      1642083892 |          24 |              0
(1 row)

Then you can calculate average_rows_read_daily = (rows_read_2 - rows_read) / (epoch_seconds_2 - epoch_seconds) * 60 * 60 * 24

and

average_rows_written_daily = (rows_written_2 - rows_written) / (epoch_seconds_2 - epoch_seconds) * 60 * 60 * 24
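
Plugging in the rows_read values from the two sample runs above (12 rows at epoch second 1642083509 and 24 rows at epoch second 1642083892), you can let psql do the arithmetic; a sketch:

mydatabase=# SELECT (24 - 12)::numeric / (1642083892 - 1642083509) * 60 * 60 * 24 AS average_rows_read_daily;

This works out to roughly 2707 rows read per day; the ::numeric cast is needed so the division is not truncated to an integer.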

The longer the time between the two query runs, the less weight short-lived outliers will carry in your averages.

Our recommendation for the time between runs is 24 hours: run the query at the start of your day and then again the next morning.
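
To make those two runs easier to capture and compare, you could append each snapshot to a file; a minimal sketch, assuming the psql command line client and a (hypothetical) output file row_stats.csv:

$ psql mydatabase -A -t -F ',' -c "select cast(extract(epoch from now()) as INTEGER), sum(seq_tup_read + idx_tup_fetch), sum(n_tup_ins + n_tup_upd + n_tup_del) from pg_stat_user_tables where relname not like 'psc_data_%'" >> row_stats.csv

The -A and -t flags strip the table decoration and headers, and -F ',' separates the three values with commas, so each run appends one epoch_seconds,rows_read,rows_written line.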

Option II: Reset the statistics

Check with your DBA whether it is OK to reset the PostgreSQL statistics. If so, have a superuser run the following SQL query in your database:

mydatabase=# SELECT pg_stat_reset();

pg_stat_reset
---------------

(1 row)

CAUTION: Only reset the statistics if you know that no other team is relying on them.
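
If a full reset is off limits but resetting the counters of a single table would be acceptable, PostgreSQL also offers pg_stat_reset_single_table_counters(); a sketch, assuming a (hypothetical) table called mytable:

mydatabase=# SELECT pg_stat_reset_single_table_counters('mytable'::regclass);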

Running the row stats query again will show very small numbers:

mydatabase=# WITH timediff as (
     SELECT cast(extract(epoch from (now() - stats_reset)) as INTEGER) as since_last_stats_reset from pg_stat_database where datname=current_database()
)
SELECT timediff.since_last_stats_reset as seconds_since_last_stats_reset,
       sum(seq_tup_read + idx_tup_fetch) as rows_read,
       sum(n_tup_ins + n_tup_upd + n_tup_del) as rows_written,
       sum(seq_tup_read + idx_tup_fetch) / timediff.since_last_stats_reset * 60 * 60 * 24 as average_rows_read_daily,
       sum(n_tup_ins + n_tup_upd + n_tup_del) / timediff.since_last_stats_reset * 60 * 60 * 24 as average_rows_written_daily
from pg_stat_user_tables, timediff
where relname not like 'psc_data_%'
group by timediff.since_last_stats_reset;
 seconds_since_last_stats_reset | rows_read | rows_written | average_rows_read_daily | average_rows_written_daily
--------------------------------+-----------+--------------+-------------------------+----------------------------
                              1 |         0 |            0 |  0.00000000000000000000 |     0.00000000000000000000

If you wait long enough for your typical workload pattern to recur in your database (after a business day or business week), run the same query again to get more accurate numbers. Our recommendation is to wait at least 24 hours after the statistics reset to get a reliable average.
