Examples

Simple, illustrative examples showing how to quickly start using the bcselector package.

Data Generation

First of all, let's generate some artificial data to use for feature selection. Bcselector provides two classes that generate data with costs:

  • MatrixGenerator - generates data as an np.ndarray and costs as a list.

  • DataFrameGenerator - generates data as a pd.DataFrame and costs as a dict.

Both generators use the same algorithm, which rests on one main assumption: the mutual information between a feature and the target variable is directly proportional to the feature's cost. The higher the cost, the lower the noise. The procedure works as follows (a minimal NumPy sketch follows the list):

  1. Simulate \(p\) independent random variables \(X_1,\ldots,X_p\), where \(X_i\sim N(0,1)\). We obtain \(p\) variables \(X_i = \{x_1^{(i)},\ldots,x_n^{(i)}\}\), where \(n\) is the sample size and \(c_i\) is the cost of the \(i\)-th variable. We assume that all costs are equal, i.e. \(c_1 = c_2 = \ldots = c_p = 1\).

  2. For each observation \(i\), calculate the term \(\sigma_i = \frac{e^{\sum_{j=1}^p x_{i}^{(j)}}}{1+e^{\sum_{j=1}^p x_{i}^{(j)}}}\).

  3. Generate the target variable \(Y = \{y_1, \ldots, y_n\}\), where \(y_i\) is drawn from a Bernoulli distribution with success probability \(\sigma_i\).

  4. Generate \(p\) noise random variables \(e_1,\ldots,e_p\), where \(e_i\sim N(0,s)\) and \(s\) is a fixed noise standard deviation.

  5. Create \(p\) new perturbed variables, \(X_i' := X_i + e_i\). Each variable \(X_i'\) is assigned the cost \(c_i' = \frac{1}{s+1}\), so the noisier a feature, the cheaper it is.

  6. Steps \(4\)-\(5\) are repeated for every value in the list of noise standard deviations \(noise\_sigmas = [s_1, \ldots, s_k]\).

  7. In the end we obtain \(k \cdot p\) features.
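A minimal NumPy sketch of this procedure (an illustration of the steps above, not bcselector's actual implementation; discretization is omitted):

import numpy as np

def generate_noised_data(n_rows=1000, n_basic_cols=10,
                         noise_sigmas=(0.9, 0.8, 0.3, 0.1), seed=42):
    rng = np.random.default_rng(seed)

    # Step 1: p independent N(0, 1) variables, all with cost 1.
    X_base = rng.standard_normal((n_rows, n_basic_cols))

    # Step 2: logistic transform of each observation's row sum.
    row_sums = X_base.sum(axis=1)
    sigma = np.exp(row_sums) / (1.0 + np.exp(row_sums))

    # Step 3: Bernoulli target with success probability sigma_i.
    y = rng.binomial(1, sigma)

    # Steps 4-6: for each noise level s, perturb every feature
    # and assign it the cost 1 / (s + 1).
    blocks, costs = [], []
    for s in noise_sigmas:
        blocks.append(X_base + rng.normal(0.0, s, size=X_base.shape))
        costs.extend([1.0 / (s + 1.0)] * n_basic_cols)

    # Step 7: k * p features in total.
    return np.hstack(blocks), y, costs

X, y, costs = generate_noised_data()
print(X.shape, len(costs))  # (1000, 40) 40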

MatrixGenerator

from bcselector.data_generation import MatrixGenerator

# Fix the seed for reproducibility.
SEED = 42

# Data generation arguments:
# - data size (n_rows, n_cols),
# - noise standard deviations for the noised features.
n_rows = 1000
n_cols = 10
noise_sigmas = [0.9,0.8,0.3,0.1]

mg = MatrixGenerator()
X, y, costs = mg.generate(
    n_rows=n_rows,
    n_basic_cols=n_cols,
    noise_sigmas=noise_sigmas,
    seed=SEED,
    discretize_method='uniform',
    discretize_bins=10)
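The call returns the feature matrix, the binary target, and one cost per column. A quick sanity check (the expected width follows from step 7 above, assuming the generator returns only the noised features):

# X: np.ndarray, y: np.ndarray, costs: list with one entry per column.
print(type(X), X.shape)          # expected: (n_rows, len(noise_sigmas) * n_cols)
print(type(costs), len(costs))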

DataFrameGenerator

from bcselector.data_generation import DataFrameGenerator

# Fix the seed for reproducibility.
SEED = 42

# Data generation arguments:
# - data size (n_rows, n_cols),
# - noise standard deviations for the noised features.
n_rows = 1000
n_cols = 10
noise_sigmas = [0.9,0.8,0.3,0.1]

dfg = DataFrameGenerator()
X, y, costs = dfg.generate(
    n_rows=n_rows,
    n_basic_cols=n_cols,
    noise_sigmas=noise_sigmas,
    seed=SEED,
    discretize_method='uniform',
    discretize_bins=10)
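Here X is a pd.DataFrame and costs a dict, so individual costs can be looked up directly (assuming the dict is keyed by column name):

# X: pd.DataFrame, costs: dict (assumed keyed by column name).
first_col = X.columns[0]
print(first_col, costs[first_col])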

Feature Selection

At the moment, only two cost-sensitive feature selection methods are implemented:

  • FractionVariableSelector - a feature's relevance to the target variable is divided by its cost (a fraction; see the sketch below).

  • DiffVariableSelector - a feature's cost is subtracted from its relevance to the target variable (a difference).
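The exact formulas live inside bcselector, but the intuition can be sketched as follows. Both criteria combine a feature's relevance J (computed by the chosen j_criterion_func, e.g. 'cife' or 'jmi') with its cost; r and lamb mirror the fit arguments used below, while the precise form of each score is an assumption here:

# Hypothetical sketches of the two scores for one candidate feature
# with relevance j_value and cost c (not bcselector's actual code).
def fraction_score(j_value, cost, r=1.0):
    # Relevance divided by (scaled) cost: cheap, relevant features win.
    return j_value / (cost ** r)

def diff_score(j_value, cost, lamb=1.0):
    # Relevance minus scaled cost: lamb trades relevance against cost.
    return j_value - lamb * cost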

FractionVariableSelector

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

from bcselector.variable_selection import FractionVariableSelector
from bcselector.data_generation import MatrixGenerator

# Fix the seed for reproducibility.
SEED = 42

# Data generation arguments:
# - data size (n_rows, n_cols),
# - noise standard deviations for the noised features.
n_rows = 1000
n_cols = 10
noise_sigmas = [0.9,0.8,0.3,0.1]

# Generate data
mg = MatrixGenerator()
X, y, costs = mg.generate(
    n_rows=n_rows,
    n_basic_cols=n_cols,
    noise_sigmas=noise_sigmas,
    seed=SEED,
    discretize_method='uniform',
    discretize_bins=10)

# Arguments for feature selection:
# - r: cost scaling parameter,
# - beta: kwarg passed to the 'cife' j_criterion_func,
# - model: estimator fitted on the selected data.
r = 1
beta = 0.5
model = LogisticRegression()

# Feature selection
fvs = FractionVariableSelector()
fvs.fit(
     data=X,
     target_variable=y,
     costs=costs,
     r=r,
     j_criterion_func='cife',
     beta=beta)
fvs.score(
     model=model,
     scoring_function=roc_auc_score)
fvs.plot_scores(
     compare_no_cost_method=True,
     model=model,
     annotate=True)
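With compare_no_cost_method=True, plot_scores should also draw the ranking produced by the plain, cost-agnostic variant of the criterion, making it easy to see what the cost sensitivity gains (and gives up) in terms of AUC.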

DiffVariableSelector

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

from bcselector.variable_selection import DiffVariableSelector
from bcselector.data_generation import MatrixGenerator


# Fix the seed for reproducibility.
SEED = 42

# Data generation arguments:
# - data size (n_rows, n_cols),
# - noise standard deviations for the noised features.
n_rows = 1000
n_cols = 10
noise_sigmas = [0.9,0.8,0.3,0.1]

# Generate data
mg = MatrixGenerator()
X, y, costs = mg.generate(
    n_rows=n_rows,
    n_basic_cols=n_cols,
    noise_sigmas=noise_sigmas,
    seed=SEED,
    discretize_method='uniform',
    discretize_bins=10)

# Arguments for feature selection:
# - lamb: cost scaling parameter,
# - model: estimator fitted on the selected data.
lamb = 1
model = LogisticRegression()

# Feature selection
dvs = DiffVariableSelector()
dvs.fit(
     data=X,
     target_variable=y,
     costs=costs,
     lamb=lamb,
     j_criterion_func='jmi')
dvs.score(
     model=model,
     scoring_function=roc_auc_score)
dvs.plot_scores(
     compare_no_cost_method=True,
     model=model,
     annotate=True)
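Under the difference scheme sketched earlier, lamb = 0 would ignore costs entirely and reduce the selection to the plain 'jmi' ranking, so larger values of lamb push the selector toward cheaper features.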