slang.util

Slang utils

class slang.util.ArithmeDict[source]

A dict, with arithmetic. A unary operator is just applied to all values. When a dict operates with a number, the operation is applied to each value of the dict. When a dict operates with another dict, the keys are aligned and the operation applied to the aligned values.

The class is meant to be used in situations where pandas.Series would be used to operate with (sparse) vectors such as word counts, etc.

Performance:

In a nutshell: if you already use pandas in your app, use pandas.Series instead. But if you want lightweight dependencies (pandas isn’t light), or you operate on small dicts, use ArithmeDict.

Note that both construction and operation are faster with ArithmeDict for small dicts.

```
import pandas as pd

t = ArithmeDict(a=1, b=2)
tt = ArithmeDict(b=3, c=4)
%timeit t + tt
# 1.41 µs ± 41.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

### versus ###

t = pd.Series(dict(a=1, b=2))
tt = pd.Series(dict(b=3, c=4))
%timeit t + tt  # and not even what we want (see later)
# 405 µs ± 7.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Series.add(t, tt, fill_value=0).to_dict()
# 410 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

### but ###

t = ArithmeDict({i: i for i in range(10000)})
tt = ArithmeDict({i: i for i in range(5000, 15000)})
%timeit t + tt
# 3.22 ms ± 98.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

### not so far from ###

t = pd.Series({i: i for i in range(10000)})
tt = pd.Series({i: i for i in range(5000, 15000)})
%timeit pd.Series.add(t, tt, fill_value=0).to_dict()
# 3.71 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# and actually much slower than:
%timeit pd.Series.add(t, tt, fill_value=0)
# 575 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

On the other hand, the memory-usage comparison is inconclusive, because it’s not clear how to measure it fairly:

```
import pickle, sys, pandas

t = ArithmeDict({i: i for i in range(10000)})
sys.getsizeof(t), len(pickle.dumps(t))
# (295032, 59539)

t = pandas.Series({i: i for i in range(10000)})
sys.getsizeof(t), len(pickle.dumps(t))
# (160032, 240666)
```

Notes for enhancement:

When a dict operates with/on another dict, keys need to be aligned, and there are different merge and reduce options that may or may not make sense depending on the value type and the context. For example, should we really keep all keys, using an operand default to fill in missing values, or just drop the unaligned fields altogether? Also, if we choose to keep all keys, what should the operand default be? Sometimes it may depend on the other operand (example: matmul), or need to be created fresh (example: __concat__, since we don’t want a mutable list as a default), etc. The sketch below illustrates the “keep all keys” option.
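
For illustration, here is a minimal sketch (not the actual implementation) of the “keep all keys” strategy, assuming missing keys default to 0, consistent with the addition examples below:

```
from operator import add

def aligned_op(d1, d2, op=add, fill=0):
    # union of keys; a key missing from either operand falls back to `fill`
    return {k: op(d1.get(k, fill), d2.get(k, fill)) for k in {**d1, **d2}}

assert aligned_op({'a': 1, 'b': 2}, {'b': 3, 'c': 4}) == {'a': 1, 'b': 5, 'c': 4}
```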

>>> d1 = ArithmeDict(a=1, b=2)
>>> d2 = ArithmeDict(b=3, c=4)
>>>
>>> # These are still dicts
>>> isinstance(d1, dict)
True
>>> # and display as such
>>> d1
{'a': 1, 'b': 2}
>>> d2
{'b': 3, 'c': 4}
>>>
>>> # Unary operators (just applied to all values)
>>> assert -d1 == {'a': -1, 'b': -2}
>>> assert abs(-d1) == d1  # ... and in case that doesn't look impressive enough..
>>> assert abs(ArithmeDict(a=-1, b=2, c=-3)) == {'a': 1, 'b': 2, 'c': 3}
>>>
>>> # An operation with a number is transferred to the values of the dict (applied to each).
>>> assert d1 + 10 == {'a': 11, 'b': 12}
>>> assert d1 - 10 == {'a': -9, 'b': -8}
>>> assert d1 * 10 == {'a': 10, 'b': 20}
>>> assert d1 / 10 == {'a': 0.1, 'b': 0.2}
>>> assert d1 // 2 == {'a': 0, 'b': 1}
>>> assert d1 ** 2 == {'a': 1, 'b': 4}
>>> assert d2 % 2 == {'b': 1, 'c': 0}
>>> assert d2 % 3 == {'b': 0, 'c': 1}
>>> assert d2 >> 1 == {'b': 1, 'c': 2}  # shift all bits by one bit to the right
>>> assert d2 << 1 == {'b': 6, 'c': 8}  # shift all bits by one bit to the left
>>>
>>> # An operation with another dict will align the keys and apply the operation to the aligned values.
>>> assert d1 + d2 == {'a': 1, 'b': 5, 'c': 4}
>>> assert d1 - d2 == {'a': 1, 'b': -1, 'c': -4}
>>> assert d1 * d2 == {'a': 1, 'b': 6, 'c': 4}
>>> assert d1 / d2 == {'a': 1, 'b': 0.6666666666666666, 'c': 0.25}
>>> assert d2 // d1 == {'b': 1, 'c': 4, 'a': 1}
>>> assert d1 ** d2 == {'a': 1, 'b': 8, 'c': 1}
>>> assert ArithmeDict(a=10, b=10) % dict(a=3, b=4) == {'a': 1, 'b': 2}
>>> assert d1 << d2 == {'a': 1, 'b': 16, 'c': 0}  # shifting bits
>>> assert d1 + {'b': 3, 'c': 4} == {'a': 1, 'b': 5, 'c': 4}  # works when the right side is a normal dict
>>> assert d1 + ArithmeDict() == d1
>>> assert ArithmeDict() - d1 == -d1
op_func(b, /)

Same as a @ b.

slang.util.balanced_sample_maker(key_to_tag, max_n_keys_per_tag=7, random=False)[source]

Make a function that takes a balanced sample of the data (useful when you just want to test quickly), keeping at most max_n_keys_per_tag keys per tag.

>>> mk_sample = balanced_sample_maker(key_to_tag=lambda k: k.split('/')[0],
...                                   max_n_keys_per_tag=2,
...                                   random=False)
>>> mk_sample(['good/1', 'bad/1', 'good/2', 'good/3', 'good/4', 'bad/2', 'good/5', 'bad/3'])
['good/1', 'good/2', 'bad/1', 'bad/2']
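
A minimal sketch of what such a sample maker could look like (the actual implementation may differ, e.g. in how the random case draws its sample):

```
import random as _random
from collections import defaultdict

def balanced_sample_maker_sketch(key_to_tag, max_n_keys_per_tag=7, random=False):
    def mk_sample(keys):
        by_tag = defaultdict(list)
        for k in keys:
            by_tag[key_to_tag(k)].append(k)  # group keys by their tag
        sample = []
        for tag_keys in by_tag.values():
            n = min(max_n_keys_per_tag, len(tag_keys))
            if random:
                sample.extend(_random.sample(tag_keys, n))
            else:
                sample.extend(tag_keys[:n])  # deterministic: first n keys per tag
        return sample
    return mk_sample
```
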
slang.util.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Note

New code should use the choice method of a default_rng() instance instead; see NumPy’s random Quick Start guide.
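
For example, with the recommended Generator API:

```
import numpy as np

rng = np.random.default_rng()    # the recommended entry point for new code
rng.choice(5, 3)                 # sample 3 values from np.arange(5)
rng.choice(5, 3, replace=False)  # sample 3 distinct values
```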

Parameters
  • a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

  • replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.

  • p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns

samples – The generated random samples

Return type

single item or ndarray

Raises

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

Generator.choice

which should be used in new code

Notes

Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.
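
For example, Generator.choice can sample whole rows directly:

```
import numpy as np

rng = np.random.default_rng()
arr = np.arange(12).reshape(4, 3)
rng.choice(arr, size=2, axis=0, replace=False)  # two distinct rows of arr
```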

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')
class slang.util.lazyprop(func)[source]

A descriptor implementation of lazyprop (cached property) from David Beazley’s “Python Cookbook” book.

>>> class Test:
...     def __init__(self, a):
...         self.a = a
...     @lazyprop
...     def len(self):
...         print('generating "len"')
...         return len(self.a)
>>> t = Test([0, 1, 2, 3, 4])
>>> t.__dict__
{'a': [0, 1, 2, 3, 4]}
>>> t.len
generating "len"
5
>>> t.__dict__
{'a': [0, 1, 2, 3, 4], 'len': 5}
>>> t.len
5
>>> # But careful when using lazyprop: no one should change the value of a without deleting the property first
>>> t.a = [0, 1, 2]  # if we change a...
>>> t.len  # ... we still get the old cached value of len
5
>>> del t.len  # if we delete the len prop
>>> t.len  # ... then len is recomputed
generating "len"
3
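
For reference, here is the classic Cookbook-style recipe (a sketch; the actual slang implementation may differ in details):

```
class lazyprop:
    # Non-data descriptor: compute the value once, then cache it in the
    # instance's __dict__, where it shadows the descriptor on later lookups.
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class itself, not an instance
        value = self.func(instance)
        setattr(instance, self.func.__name__, value)  # cache; a del forces recompute
        return value
```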

slang.util.mk_callable(call_func)[source]

A factory of class decorators that add a __call__ method; call_func can be a function (used as __call__ directly) or a method name (a string). Specialized for sklearn models.

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>>
>>> CallablePCA = mk_callable('single_transform')(PCA)
>>> pca = CallablePCA(n_components=3).fit(np.random.rand(100, 5))
>>> x = np.random.rand(5)
>>> all(pca(x) == pca.transform([x])[0])
True
>>>
>>> from sklearn.neighbors import NearestNeighbors
>>>
>>> def nearest_neighbors_indices(self, x):
...     _, indices = self.kneighbors([x])
...     return indices[0]
...
>>>
>>> @mk_callable(nearest_neighbors_indices)
... class CallableKnn(NearestNeighbors):
...     '''NearestNeighbors with callable instances that give you the indices of the neighbors
...     without the kerfuffle.'''
>>>
>>> knn = CallableKnn().fit(np.arange(1000).reshape(200, 5))
>>> x = np.array([10, 20, 30, 40, 50])  # say we have a single point we want to get neighbors for
>>>
>>> # This is the standard way to do it
>>> _, indices = knn.kneighbors([x])
>>> neighbors = indices[0]
>>> neighbors
array([6, 5, 7, 4, 8])
>>> # but now, you can just do this instead:
>>> knn(x)
array([6, 5, 7, 4, 8])
>>>
>>> assert all(knn(x) == neighbors)
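
A rough sketch of how such a decorator factory might work (not the actual slang implementation; the 'single_*' string convention is an assumption inferred from the PCA example above):

```
def mk_callable_sketch(call_func):
    def decorator(cls):
        if callable(call_func):
            __call__ = call_func  # use the given function as __call__ directly
        elif isinstance(call_func, str) and call_func.startswith('single_'):
            method_name = call_func[len('single_'):]  # e.g. 'transform'
            def __call__(self, x):
                # apply the named batch method to a single sample
                return getattr(self, method_name)([x])[0]
        else:
            def __call__(self, *args, **kwargs):
                return getattr(self, call_func)(*args, **kwargs)  # delegate to named method
        return type(cls.__name__, (cls,), {'__call__': __call__})
    return decorator
```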
slang.util.row_euclidean_distance(A, B)[source]

Euclidean distance between the aligned rows of A and B. Returns an array of length len(A) (== len(B)).

>>> import numpy as np
>>> A = np.arange(5 * 16).reshape((5, 16))
>>> B = 1 + A
>>> assert all(row_euclidean_distance(A, A) == np.zeros(5))
>>> assert all(row_euclidean_distance(A, B) == np.array([4., 4., 4., 4., 4.]))

Note: not to be confused with the matrix of distances between all pairs of rows. row_euclidean_distance is equivalent to the diagonal of that matrix (see below).

```
from sklearn.metrics.pairwise import euclidean_distances
A = np.random.rand(5, 7)
B = np.random.rand(5, 7)
assert all(np.diag(euclidean_distances(A, B)) == row_euclidean_distance(A, B))
```
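
One straightforward vectorized way to compute it (a sketch assuming A and B are equal-shaped 2-D arrays; the actual implementation may differ):

```
import numpy as np

def row_euclidean_distance_sketch(A, B):
    # rowwise L2 distance: for each i, sqrt(sum_j (A[i, j] - B[i, j]) ** 2)
    return np.sqrt(((np.asarray(A) - np.asarray(B)) ** 2).sum(axis=1))
```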

slang.util.running_mean_gen(it, chk_size=2, chk_step=1)[source]

Running mean (moving average) over an iterator. Note: when the input it is list-like, the ut.stats.smooth.sliders version of running_mean is about 4 times more efficient on big (but RAM-sized, since everything happens in memory) inputs.

Parameters
  • it – an iterable of numbers

  • chk_size – width of the window to take means over

  • chk_step – step by which the window advances

>>> list(running_mean_gen([1, 3, 5, 7, 9], 2))
[2.0, 4.0, 6.0, 8.0]
>>> list(running_mean_gen([1, 3, 5, 7, 9], 2, chk_step=2))
[2.0, 6.0]
>>> list(running_mean_gen([1, 3, 5, 7, 9], 2, chk_step=3))
[2.0, 8.0]
>>> list(running_mean_gen([1, 3, 5, 7, 9], 3))
[3.0, 5.0, 7.0]
>>> list(running_mean_gen([1, -1, 1, -1], 2))
[0.0, 0.0, 0.0]
>>> list(running_mean_gen([-1, -2, -3, -4], 3))
[-2.0, -3.0]
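
For reference, here is a naive sketch of such a running-mean generator (it materializes the input for simplicity; the actual implementation may stream the windows instead):

```
def running_mean_sketch(it, chk_size=2, chk_step=1):
    seq = list(it)  # naive: materialize; a streaming version would keep a window buffer
    for i in range(0, len(seq) - chk_size + 1, chk_step):
        yield sum(seq[i:i + chk_size]) / chk_size

assert list(running_mean_sketch([1, 3, 5, 7, 9], 2)) == [2.0, 4.0, 6.0, 8.0]
```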