@artoby
Last active May 30, 2022 05:08
Symlog bins - splits a data range into log-like bins but with 0 and negative values taken into account
import numpy as np

def symlog_bins(arr, n_bins, zero_eps=0.1, padding=0):
    """
    Splits a data range into log-like bins but with 0 and negative values taken into account.
    Can be used together with matplotlib 'symlog' axis scale (i.e. ax.set_xscale('symlog'))
    Feel free to contribute: https://gist.github.com/artoby/0bcf790cfebed5805fbbb6a9853fe5d5
    """
    a = min(arr) / (1 + padding)
    b = max(arr) * (1 + padding)
    if a > b:
        a, b = b, a

    neg_range_log = None
    if a < -zero_eps:
        neg_range_log = [np.log10(-a), np.log10(zero_eps)]

    # Add a value to zero bin edges in case a lies within [-zero_eps; zero_eps) - so an additional bin will be added before positive range
    zero_bin_edges = []
    if -zero_eps <= a < zero_eps:
        zero_bin_edges = [a]

    pos_range_log = None
    if b > zero_eps:
        pos_range_log = [np.log10(max(a, zero_eps)), np.log10(b)]

    nonzero_n_bin_edges = n_bins + 1 - len(zero_bin_edges)

    neg_range_log_size = (neg_range_log[0] - neg_range_log[1]) if neg_range_log is not None else 0
    pos_range_log_size = (pos_range_log[1] - pos_range_log[0]) if pos_range_log is not None else 0

    range_log_size = neg_range_log_size + pos_range_log_size
    pos_n_bin_edges_raw = int(round(nonzero_n_bin_edges * (pos_range_log_size/range_log_size))) if range_log_size > 0 else 0
    # Ensure each range has at least 2 edges if it's not empty
    neg_n_bin_edges = max(2, nonzero_n_bin_edges - pos_n_bin_edges_raw) if neg_range_log_size > 0 else 0
    pos_n_bin_edges = max(2, nonzero_n_bin_edges - neg_n_bin_edges) if pos_range_log_size > 0 else 0

    neg_bin_edges = []
    if neg_n_bin_edges > 0:
        neg_bin_edges = list(-np.logspace(neg_range_log[0], neg_range_log[1], neg_n_bin_edges))

    pos_bin_edges = []
    if pos_n_bin_edges > 0:
        pos_bin_edges = list(np.logspace(pos_range_log[0], pos_range_log[1], pos_n_bin_edges))

    result = neg_bin_edges + zero_bin_edges + pos_bin_edges
    return result
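As a rough illustration of how edges of this shape are consumed, the sketch below bins a small sample with np.histogram. The edges and the data are made-up values written out by hand (they are not the output of symlog_bins), so the snippet runs on its own.

```python
import numpy as np

# Hand-built symmetric-log edges (illustrative values only):
# three negative decades, one bin around zero, three positive decades.
neg_edges = -np.logspace(2, -1, 4)   # -100, -10, -1, -0.1
pos_edges = np.logspace(-1, 2, 4)    # 0.1, 1, 10, 100
edges = np.concatenate([neg_edges, pos_edges])

# Made-up sample with negative, zero and positive values
data = np.array([-50.0, -5.0, -0.5, 0.0, 0.05, 0.5, 5.0, 50.0])

# One count per bin; 0.0 and 0.05 both land in the bin around zero
counts, _ = np.histogram(data, bins=edges)
print(counts)  # [1 1 1 2 1 1 1]
```

For plotting, the same edges could be passed as `bins` to a matplotlib histogram on an axis set to 'symlog' scale.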
@SergioG-M

Hi, thanks for the code! I wrote a similar function that I think fixes some possible issues. In particular, if pos_n_bin_edges_raw in your code is zero while the positive range is not empty, it will return too many edges: neg_n_bin_edges becomes nonzero_n_bin_edges and pos_n_bin_edges becomes 2, so the total number of edges is larger than requested. Also, I extend the minimum and maximum values by 10% to avoid float issues (I use this together with pd.cut, and sometimes edge values are not included in their bin because of float precision).
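The edge-count issue can be reproduced in isolation. The sketch below copies only the allocation arithmetic from symlog_bins into a hypothetical helper, allocate_edges (a name made up for this example), and feeds it a wide negative range together with a very narrow positive one.

```python
def allocate_edges(nonzero_n_bin_edges, neg_size, pos_size):
    """Hypothetical helper mirroring the edge-count arithmetic of symlog_bins."""
    total = neg_size + pos_size
    pos_raw = int(round(nonzero_n_bin_edges * (pos_size / total))) if total > 0 else 0
    # Each non-empty range gets at least 2 edges
    neg_n = max(2, nonzero_n_bin_edges - pos_raw) if neg_size > 0 else 0
    pos_n = max(2, nonzero_n_bin_edges - neg_n) if pos_size > 0 else 0
    return neg_n, pos_n

# 10 edges requested; negative range spans 5 decades, positive range 0.01 decade.
# The positive share rounds to 0, so the negative side takes all 10 edges
# and the positive side still receives its minimum of 2.
neg_n, pos_n = allocate_edges(10, neg_size=5.0, pos_size=0.01)
print(neg_n, pos_n, neg_n + pos_n)  # 10 2 12

# A balanced split stays within the requested total
print(allocate_edges(10, neg_size=2.0, pos_size=2.0))  # (5, 5)
```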
Another difference in my code is that I don't start from zero_eps but from the closest value to zero on each side, so the positive and negative bins do not have to start from the same absolute value. I'm pasting my function here in case it helps anyone.
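To make the "closest value to zero" idea concrete, here is a small sketch on made-up data. It also shows that np.geomspace, which takes the endpoints directly, gives the same edges as np.logspace called with the base-10 exponents of those endpoints.

```python
import numpy as np

# Made-up sample; the values closest to zero are -0.02 and 0.5
data = np.array([-200.0, -3.0, -0.02, 0.5, 40.0, 1000.0])

# Closest values to zero on each side, used instead of a fixed zero_eps
min_pos = data[data > 0].min()
max_neg = data[data < 0].max()

# Positive edges run from min_pos up to the maximum
pos_edges = np.geomspace(min_pos, data.max(), 4)
pos_edges_via_logspace = np.logspace(np.log10(min_pos), np.log10(data.max()), 4)

# Negative edges are built on absolute values and then negated
neg_edges = -np.geomspace(-data.min(), -max_neg, 4)

print(neg_edges)
print(pos_edges)
```

Here the negative edges end at -0.02 while the positive edges start at 0.5, so the two sides do not share a common threshold.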

def symlogspace(col, n_cuts, dtype='float64'):
    """
    Splits a data range into log-like bins but with 0 and negative values
    taken into account. Log cuts start from the closest value to zero.
    
    Parameters
    ----------
    col: df column or array
    n_cuts: int
            Number of cuts to perform
    dtype: dtype of the outputs
    """
    min_val = col.min()
    max_val = col.max()

    # compute negative and positive range
    # (check for emptiness first so that both pandas Series and NumPy arrays work)
    pos_vals = col[col > 0]
    neg_vals = col[col < 0]
    min_pos = pos_vals.min() if len(pos_vals) > 0 else 0
    max_neg = neg_vals.max() if len(neg_vals) > 0 else 0
    neg_range = [-min_val, -max_neg] if min_val < max_neg else None
    pos_range = [max(min_val, min_pos), max_val] if max_val > min_pos else None

    # If min value is 0 create a bin for it
    zero_cut = [min_val] if max_neg <= min_val < min_pos else []

    n_cuts = n_cuts - len(zero_cut)

    neg_range_size = (neg_range[0] - neg_range[1]) if neg_range is not None else 0
    pos_range_size = (pos_range[1] - pos_range[0]) if pos_range is not None else 0
    range_size = neg_range_size + pos_range_size

    n_pos_cuts = max(2, int(round(n_cuts * (pos_range_size / range_size)))) if range_size > 0 and pos_range_size > 0 else 0
    # Ensure each range has at least 2 edges if it's not empty
    n_neg_cuts = max(2, n_cuts - n_pos_cuts) if neg_range_size > 0 else 0   
    # In case n_pos_cuts + n_neg_cuts > n_cuts this is needed
    n_pos_cuts = max(2, n_cuts - n_neg_cuts) if pos_range_size > 0 else 0

    neg_cuts = []
    if n_neg_cuts > 0:
        neg_cuts = list(-np.geomspace(neg_range[0], neg_range[1], n_neg_cuts, dtype=dtype))

    pos_cuts = []
    if n_pos_cuts > 0:
        pos_cuts = list(np.geomspace(pos_range[0], pos_range[1], n_pos_cuts, dtype=dtype))

    result = neg_cuts + zero_cut + pos_cuts
    # Widen the outer edges by 10% so that the min and max values fall inside the bins
    result[0] = min(result[0] * 1.1, result[0] * 0.9)
    result[-1] = max(result[-1] * 1.1, result[-1] * 0.9)
    return result
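The effect of widening the outer edges can be checked with np.digitize, which returns 0 or len(edges) for values that fall outside the edges. The sketch below uses a hypothetical helper, widen_outer_edges, and made-up edges and data; it is an illustration of the idea rather than part of the function above.

```python
import numpy as np

def widen_outer_edges(edges, frac=0.1):
    """Hypothetical helper: push the first edge down and the last edge up by `frac`."""
    out = list(edges)
    # min/max pick the right direction regardless of the sign of the edge
    out[0] = min(out[0] * (1 + frac), out[0] * (1 - frac))
    out[-1] = max(out[-1] * (1 + frac), out[-1] * (1 - frac))
    return out

# Made-up data whose min and max sit exactly on the original outer edges
data = np.array([-100.0, -1.0, 0.5, 10.0, 100.0])
edges = widen_outer_edges([-100.0, -1.0, 1.0, 100.0])

# Every value now gets a valid bin index between 1 and len(edges) - 1
idx = np.digitize(data, edges)
print(idx)  # [1 2 2 3 3]
```

An outer edge that is exactly 0 is left unchanged by this multiplication. With pd.cut, passing include_lowest=True is another way to keep the minimum value in the first bin.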
