Last active
May 30, 2022 05:08
Symlog bins - splits a data range into log-like bins but with 0 and negative values taken into account
import numpy as np


def symlog_bins(arr, n_bins, zero_eps=0.1, padding=0):
    """
    Splits a data range into log-like bins but with 0 and negative values taken into account.
    Can be used together with matplotlib 'symlog' axis scale (i.e. ax.set_xscale('symlog'))
    Feel free to contribute: https://gist.github.com/artoby/0bcf790cfebed5805fbbb6a9853fe5d5
    """
    a = min(arr) / (1 + padding)
    b = max(arr) * (1 + padding)

    if a > b:
        a, b = b, a

    # Log10 extent of the negative part, from -a down to zero_eps
    neg_range_log = None
    if a < -zero_eps:
        neg_range_log = [np.log10(-a), np.log10(zero_eps)]

    # Add a value to zero bin edges in case a lies within [-zero_eps; zero_eps) - so an additional bin will be added before positive range
    zero_bin_edges = []
    if -zero_eps <= a < zero_eps:
        zero_bin_edges = [a]

    # Log10 extent of the positive part, from max(a, zero_eps) up to b
    pos_range_log = None
    if b > zero_eps:
        pos_range_log = [np.log10(max(a, zero_eps)), np.log10(b)]

    # Split the remaining edges between the two ranges in proportion to their log sizes
    nonzero_n_bin_edges = n_bins + 1 - len(zero_bin_edges)
    neg_range_log_size = (neg_range_log[0] - neg_range_log[1]) if neg_range_log is not None else 0
    pos_range_log_size = (pos_range_log[1] - pos_range_log[0]) if pos_range_log is not None else 0
    range_log_size = neg_range_log_size + pos_range_log_size
    pos_n_bin_edges_raw = int(round(nonzero_n_bin_edges * (pos_range_log_size/range_log_size))) if range_log_size > 0 else 0

    # Ensure each range has at least 2 edges if it's not empty
    neg_n_bin_edges = max(2, nonzero_n_bin_edges - pos_n_bin_edges_raw) if neg_range_log_size > 0 else 0
    pos_n_bin_edges = max(2, nonzero_n_bin_edges - neg_n_bin_edges) if pos_range_log_size > 0 else 0

    neg_bin_edges = []
    if neg_n_bin_edges > 0:
        neg_bin_edges = list(-np.logspace(neg_range_log[0], neg_range_log[1], neg_n_bin_edges))

    pos_bin_edges = []
    if pos_n_bin_edges > 0:
        pos_bin_edges = list(np.logspace(pos_range_log[0], pos_range_log[1], pos_n_bin_edges))

    result = neg_bin_edges + zero_bin_edges + pos_bin_edges
    return result
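To illustrate how the returned edges are meant to be consumed, here is a minimal standard-library sketch that counts values per bin. The `edges` list is hard-coded for illustration; it matches what `symlog_bins` should yield for data spanning -100 to 100 with `n_bins=7` and the default `zero_eps=0.1`, up to floating-point rounding. The helper `count_per_bin` is hypothetical and not part of the gist; it assumes the half-open-bins convention (last bin closed) that `np.histogram` uses.

```python
from bisect import bisect_right

# Illustrative edges: negative log-spaced edges, one bin spanning zero
# ([-0.1, 0.1)), then positive log-spaced edges.
edges = [-100.0, -10.0, -1.0, -0.1, 0.1, 1.0, 10.0, 100.0]


def count_per_bin(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # value lies outside the binned range
        # bisect_right gives the number of edges <= v; clamp so v == edges[-1]
        # falls into the last bin instead of past it.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


values = [-50.0, -0.5, 0.0, 0.05, 2.0, 100.0]
print(count_per_bin(values, edges))  # → [1, 0, 1, 2, 0, 1, 1]
```

In practice the same edges would more likely be passed straight to a histogram routine, for example as the `bins` argument of `np.histogram` or `plt.hist`.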
Hi, thanks for the code! I wrote a similar function that I think fixes a few possible issues. In particular, if pos_n_bin_edges_raw in your code is zero, it will fail: neg_n_bin_edges would equal nonzero_n_bin_edges and pos_n_bin_edges would be 2, so the total number of cuts (bin edges) will exceed the n_bins + 1 that was requested. Also, I extend the minimum and maximum values by 10% to avoid float-precision problems (I use this together with pd.cut, and sometimes edge values are not included in their bin because of float precision).
Another difference in my code is that I don't start from zero_eps but from the value closest to zero, so the positive and negative bins do not start from the same value. I am pasting my function here in case it helps anyone.
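The edge-count issue described above can be reproduced without NumPy. The sketch below copies only the allocation arithmetic from `symlog_bins` (padding left at 0); the helper name `allocate_edges` is made up for this illustration and is neither the gist author's nor the commenter's code.

```python
import math


def allocate_edges(a, b, n_bins, zero_eps=0.1):
    """Return (neg, zero, pos) edge counts using the same arithmetic as symlog_bins."""
    neg_size = (math.log10(-a) - math.log10(zero_eps)) if a < -zero_eps else 0
    pos_size = (math.log10(b) - math.log10(max(a, zero_eps))) if b > zero_eps else 0
    n_zero_edges = 1 if -zero_eps <= a < zero_eps else 0

    nonzero_n = n_bins + 1 - n_zero_edges
    total_size = neg_size + pos_size
    pos_raw = int(round(nonzero_n * pos_size / total_size)) if total_size > 0 else 0

    # Same "at least 2 edges per non-empty range" rule as the original code
    neg_n = max(2, nonzero_n - pos_raw) if neg_size > 0 else 0
    pos_n = max(2, nonzero_n - neg_n) if pos_size > 0 else 0
    return neg_n, n_zero_edges, pos_n


# Wide negative range, very narrow positive range: pos_raw rounds to 0,
# so the negative side takes all 6 edges and the positive side still gets 2.
neg_n, zero_n, pos_n = allocate_edges(a=-1e6, b=0.11, n_bins=5)
print(neg_n + zero_n + pos_n, "edges returned;", 5 + 1, "requested")
# → 8 edges returned; 6 requested
```

With a balanced range such as `a=-100, b=100` the same arithmetic gives 3 + 0 + 3 = 6 edges, so the overshoot only appears when one side's share rounds down to zero.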