Molecular dynamics application example

Prerequisites

Import dependencies

[1]:
# Primary imports
import importlib     # Only needed for module editing
import json
import pandas as pd  # Optional dependency
from pathlib import Path
import pprint
import sys
import time
import warnings
warnings.simplefilter("ignore")  # Suppress or enable warnings

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import datasets  # For sklearn test data set creation
from sklearn.preprocessing import StandardScaler

# CNN clustering module
import cnnclustering.cnn as cnn    # CNN clustering
import cnnclustering.cmsm as cmsm  # Core-set MSM estimation
import pydpc

This notebook was created using Python 3.8.

[2]:
# Version information
print(sys.version)
3.8.5 | packaged by conda-forge | (default, Aug 21 2020, 18:21:27)
[GCC 7.5.0]

Notebook configuration

We use matplotlib to create plots. A "matplotlibrc" file can be used to customise the appearance of the plots.

[3]:
# Matplotlib configuration
mpl.rc_file(
    "../matplotlibrc",
    use_default_template=False
)
[4]:
# Axis property defaults for the plots
ax_props = {
    "xlabel": None,
    "ylabel": None,
    "xlim": (-2.5, 2.5),
    "ylim": (-2.5, 2.5),
    "xticks": (),
    "yticks": (),
    "aspect": "equal"
}

# Line plot property defaults
line_props = {
    "linewidth": 0,
    "marker": '.',
}

Package configuration

[5]:
# Configuration file found?
cnn.settings.cfgfile  # None if no configuration file was found
[6]:
# Display default settings
cnn.settings.defaults
[6]:
{'default_cnn_cutoff': '1',
 'default_cnn_offset': '0',
 'default_radius_cutoff': '1',
 'default_member_cutoff': '2',
 'default_fit_policy': 'conservative',
 'default_predict_policy': 'conservative',
 'float_precision': 'sp',
 'int_precision': 'sp'}

Helper functions

[7]:
def draw_evaluate(clusterobject, axis_labels=False, plot="dots"):
    """Plot the three 2D projections (dimension pairs 1/2, 3/4, 5/6) side by side"""
    fig, Ax = plt.subplots(
        1, 3,
        figsize=(mpl.rcParams['figure.figsize'][0],
                 mpl.rcParams['figure.figsize'][1]*0.5)
        )

    for dim in range(3):
        # Pair of data dimensions shown in this panel
        dim_ = (dim * 2, dim * 2 + 1)

        ax_props_ = ax_props.copy()  # Work on a copy of the global defaults
        if axis_labels:
            ax_props_.update({"xlabel": dim_[0] + 1, "ylabel": dim_[1] + 1})

        _ = clusterobject.evaluate(
            ax=Ax[dim], plot=plot,
            ax_props=ax_props_,
            dim=dim_
            )

MD showcase - Langerin

Let’s read in some “real-world” data for this example. We will work with a 6D projection of a classical MD trajectory of the C-type lectin receptor langerin, generated by the dimension reduction procedure TICA.

[8]:
langerin = cnn.CNN(points=np.load("md_example/md_showcase_langerin.npy", allow_pickle=True))

After creating a CNN instance, we can print out basic information about the data. The projection comes in 116 parts, corresponding to individual, independent simulations. The number of data points is written out for the first four parts. In total, we have about 2.6 million data points in this set, representing 26 microseconds of simulation time at a sampling timestep of 10 picoseconds.

[9]:
print(langerin)
================================================================================
CNN cluster object
--------------------------------------------------------------------------------
Alias :                         root
Hierachy level :                0

Data point shape :              Parts      - 116, [5571  4148  20851  99928 ...]
                                Points     - 2641593
                                Dimensions - 6

Distance matrix calculated :    None
Neighbourhoods calculated :     None
Density graph calculated :      None

Clustered :                     False
Children :                      False
================================================================================
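
As a quick sanity check, the total simulation time follows directly from the number of points and the 10 ps sampling timestep (a minimal sketch of the arithmetic):

n_points = 2641593                    # Total number of data points
dt_ps = 10                            # Sampling timestep in picoseconds
print(f"{n_points * dt_ps / 1e6:.1f} microseconds")  # ~26.4 microseconds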

Even with six data dimensions, we can still visualise the data quite well by plotting pairs of dimensions against each other.

[11]:
draw_evaluate(langerin, axis_labels=True, plot="contourf")
../_images/tutorial_md_parameter_selection_21_0.png

Clustering this fairly large number of data points directly is possible in principle, but it will be slow. Pre-calculating pairwise distances would occupy terabytes of disk space, which we cannot afford, so we would have to resort to brute-force on-the-fly distance calculation. To allow quick and handy data exploration and cluster result screening, we want to work on a reduced data set instead.

[12]:
langerin_reduced = langerin.cut(points=(None, None, 100))

Now distance pre-calculation is feasible and clustering will be much faster. When reducing a data set, it is most important that the new set remains representative of the original one. Usually, a regular stride on the data points is appropriate, as sketched below.
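
For illustration, the same stride-100 reduction could be done by hand on plain NumPy arrays (a sketch; parts is a hypothetical list holding one array per trajectory part):

# Keep every 100th point of each part; mirrors cut(points=(None, None, 100))
reduced_parts = [part[::100] for part in parts]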

[13]:
draw_evaluate(langerin_reduced, axis_labels=True, plot="contourf")
../_images/tutorial_md_parameter_selection_25_0.png
[14]:
print(langerin_reduced)
================================================================================
CNN cluster object
--------------------------------------------------------------------------------
Alias :                         root
Hierachy level :                0

Data point shape :              Parts      - 116, [56  42  209  1000 ...]
                                Points     - 26528
                                Dimensions - 6

Distance matrix calculated :    None
Neighbourhoods calculated :     None
Density graph calculated :      None

Clustered :                     False
Children :                      False
================================================================================

A quick look at the distribution of distances in the set gives us a first feeling for a suitable value of the neighbour search radius r.

[15]:
langerin_reduced.calc_dist(mmap=True, chunksize=5000)  # Pre-calculate point distances, temporarily stored on disk
langerin_reduced.dist_hist()
Mapping: 100%|██████████| 6.00/6.00 [00:09<00:00, 1.52s/Chunks]
[15]:
(<Figure size 750x450 with 1 Axes>,
 <AxesSubplot:xlabel='d / au'>,
 [<matplotlib.lines.Line2D at 0x7f147d2eddf0>],
 None)
../_images/tutorial_md_parameter_selection_28_2.png
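
A comparable histogram can be produced independently, e.g. with SciPy on a random subsample, to cross-check the package output (a sketch; the subsample size is chosen arbitrarily and data.points is assumed to be array-like):

from scipy.spatial.distance import pdist

points = np.asarray(langerin_reduced.data.points)
rng = np.random.default_rng(42)
subsample = points[rng.choice(len(points), size=2000, replace=False)]

plt.hist(pdist(subsample), bins=100)  # Condensed vector of pairwise distances
plt.xlabel("d / au")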

Clustering root data

We can expect the data to split into clusters for values of r of roughly 2 or lower. Let’s attempt a first clustering step with a relatively low density criterion, i.e. a large radius cutoff r and a low number of common neighbours c (the first and second argument to fit):

[16]:
langerin_reduced.fit(2, 5, policy="progressive")
Execution time for call of fit: 0 hours, 0 minutes, 23.0190 seconds
--------------------------------------------------------------------------------
#points   R         C         min       max       #clusters %largest  %noise
26528     2.000     5         2         None      3         0.983     0.000
--------------------------------------------------------------------------------
[17]:
draw_evaluate(langerin_reduced)
../_images/tutorial_md_parameter_selection_32_0.png

Clustering hierarchy level 1

We see that no data point was excluded as a sparse outlier and that the data was split into three clusters. Especially the first cluster can obviously be split further. Let’s re-cluster it, applying a higher density criterion. For this, we first need to freeze this clustering result and isolate the obtained clusters into individual child cluster objects.

[18]:
langerin_reduced.isolate()
current = c1 = langerin_reduced.children[1]

By default, the isolation assigns an alias to each child cluster that reveals its origin. In this case, “root - 1” translates to the first child cluster of the root data.

[20]:
current
[20]:
CNN clustering object (root - 1)
[21]:
# Recluster first cluster from last step
current.data.points.cKDTree()              # Build a k-d tree for the neighbour search
r = 1
current.calc_neighbours_from_cKDTree(r=r)
current.fit(r, 5)
Execution time for call of fit: 0 hours, 0 minutes, 1.4944 seconds
--------------------------------------------------------------------------------
#points   R         C         min       max       #clusters %largest  %noise
26079     1.000     5         2         None      3         0.727     0.000
--------------------------------------------------------------------------------
[22]:
draw_evaluate(current)
../_images/tutorial_md_parameter_selection_39_0.png

Clustering hierarchy level 2

The re-clustered data points are split into another three clusters. This time, we see an opportunity to re-cluster the first two obtained clusters, increasing the density criterion slightly further.

[23]:
current
[23]:
CNN clustering object (root - 1)
[25]:
current.isolate()
current = c1_1 = current.children[1]
[26]:
current
[26]:
CNN clustering object (root - 1 - 1)

At this stage, we choose a member_cutoff of 10 for the fit, ensuring that we do not yield small, meaningless clusters.

[27]:
# Recluster first cluster from last step
current.data.points.cKDTree()  # Build a k-d tree for the neighbour search
r = 0.45
current.calc_neighbours_from_cKDTree(r=r)
current.fit(r, 15, member_cutoff=10)
Execution time for call of fit: 0 hours, 0 minutes, 0.4998 seconds
--------------------------------------------------------------------------------
#points   R         C         min       max       #clusters %largest  %noise
18947     0.450     15        10        None      3         0.946     0.027
--------------------------------------------------------------------------------
[28]:
draw_evaluate(current)
../_images/tutorial_md_parameter_selection_47_0.png
[29]:
current = c1_2 = c1.children[2]
[30]:
current
[30]:
CNN clustering object (root - 1 - 2)
[31]:
# Recluster second cluster from last step
current.data.points.cKDTree()  # Build a k-d tree for the neighbour search
r = 0.5
current.calc_neighbours_from_cKDTree(r=r)
current.fit(r, 10)
Execution time for call of fit: 0 hours, 0 minutes, 0.0964 seconds
--------------------------------------------------------------------------------
#points   R         C         min       max       #clusters %largest  %noise
7077      0.500     10        2         None      3         0.613     0.010
--------------------------------------------------------------------------------
[32]:
draw_evaluate(current)
../_images/tutorial_md_parameter_selection_51_0.png

Clustering hierarchy level 3

And on it goes …

[33]:
current = c1_1
[34]:
current.isolate()
current = c1_1_1 = current.children[1]
[35]:
current
[35]:
CNN clustering object (root - 1 - 1 - 1)
[36]:
# Recluster first cluster from last step
current.data.points.cKDTree()  # Build a k-d tree for the neighbour search
r = 0.4
current.calc_neighbours_from_cKDTree(r=r)
current.fit(r, 15)
Execution time for call of fit: 0 hours, 0 minutes, 0.3754 seconds
--------------------------------------------------------------------------------
#points   R         C         min       max       #clusters %largest  %noise
17927     0.400     15        2         None      2         0.867     0.015
--------------------------------------------------------------------------------
[37]:
draw_evaluate(current)
../_images/tutorial_md_parameter_selection_58_0.png

Clustering hierarchy level 4

[38]:
current.isolate()
current = c1_1_1_1 = current.children[1]
[40]:
current
[40]:
CNN clustering object (root - 1 - 1 - 1 - 1)
[39]:
# Recluster first cluster from last step
current.data.points.cKDTree()  # Build a k-d tree for the neighbour search
r = 0.28
current.calc_neighbours_from_cKDTree(r=r)
current.fit(r, 15, member_cutoff=10)
Execution time for call of fit: 0 hours, 0 minutes, 0.1906 seconds
--------------------------------------------------------------------------------
#points   R         C         min       max       #clusters %largest  %noise
15548     0.280     15        10        None      2         0.676     0.142
--------------------------------------------------------------------------------
[41]:
draw_evaluate(current)
../_images/tutorial_md_parameter_selection_63_0.png

Clustering hierarchy level 5

[42]:
current.isolate()
current = c1_1_1_1_1 = current.children[1]
[43]:
current
[43]:
CNN clustering object (root - 1 - 1 - 1 - 1 - 1)
[44]:
# Recluster first cluster from last step
current.data.points.cKDTree()  # Build a k-d tree for the neighbour search
r = 0.22
current.calc_neighbours_from_cKDTree(r=r)
current.fit(r, 15, member_cutoff=10)
Execution time for call of fit: 0 hours, 0 minutes, 0.1533 seconds
--------------------------------------------------------------------------------
#points   R         C         min       max       #clusters %largest  %noise
10506     0.220     15        10        None      2         0.650     0.309
--------------------------------------------------------------------------------
[45]:
draw_evaluate(current)
../_images/tutorial_md_parameter_selection_68_0.png

Merge hierarchy levels

We will leave it at that for the moment. We can visualise our cluster hierarchy to get an overview.

[47]:
_ = langerin_reduced.pie()
../_images/tutorial_md_parameter_selection_71_0.png

Finally, we can put everything together and incorporate the child clusters into the root data set.

[48]:
langerin_reduced.reel(deep=None)

After this call, the cluster labels may be neither contiguous nor sorted by cluster size, which we can fix easily.

[55]:
langerin_reduced.labels.sort_by_size()
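
What sort_by_size does can be mimicked with plain NumPy: relabel the clusters in decreasing order of member count, keeping 0 reserved for noise (an illustrative sketch, not the package implementation):

def sort_labels_by_size(labels):
    labels = np.asarray(labels)
    new_labels = np.zeros_like(labels)  # Noise (label 0) stays 0
    unique, counts = np.unique(labels[labels > 0], return_counts=True)
    # Assign new labels 1, 2, ... in order of decreasing cluster size
    for new, old in enumerate(unique[np.argsort(counts)[::-1]], start=1):
        new_labels[labels == old] = new
    return new_labels
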
[56]:
draw_evaluate(langerin_reduced)
../_images/tutorial_md_parameter_selection_76_0.png
[59]:
print("Label    Size")
print("=============")
print(*sorted({k: len(v) for k, v in langerin_reduced.labels.clusterdict.items()}.items()), sep="\n")
Label    Size
=============
(0, 6326)
(1, 6831)
(2, 4335)
(3, 2834)
(4, 2619)
(5, 2104)
(6, 451)
(7, 425)
(8, 231)
(9, 217)
(10, 55)
(11, 52)
(12, 48)

For later re-use, we save the clustering result in the form of the cluster label assignments.

[60]:
np.save("md_example/cluster_labels.npy", np.asarray(langerin_reduced.labels))

MSM estimation

Assuming our data was sampled in a time-correlated manner, as is the case for MD simulation data, we can use this clustering result as the basis for the estimation of a core-set Markov state model (csMSM).
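
To sketch the underlying idea: in the simplest (non-core-set) case, an MSM is estimated by counting transitions between discrete states at a given lag time and row-normalising the count matrix. A minimal NumPy version for a single discrete trajectory (illustrative only; the core-set estimator in cmsm uses milestoning between cluster cores and differs in detail):

def transition_matrix(dtraj, n_states, lag):
    # Count transitions i -> j separated by `lag` steps
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1
    # Row-normalise to transition probabilities (states without counts
    # would need special handling on real data)
    return counts / counts.sum(axis=1, keepdims=True)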

[61]:
M = cmsm.CMSM(langerin_reduced.get_dtraj(), unit="ns", step=1)
[62]:
# Estimate csMSM for different lag times (given in steps)
lags = [1, 2, 4, 8, 15, 30]
for i in lags:
    M.cmsm(lag=i, minlenfactor=5)
    M.get_its()

*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 1 ns
---------------------------------------------------------

Using 116 trajectories with 26380 steps over 12 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 2 ns
---------------------------------------------------------

Using 116 trajectories with 26380 steps over 12 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 4 ns
---------------------------------------------------------

Using 116 trajectories with 26380 steps over 12 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 8 ns
---------------------------------------------------------

Using 116 trajectories with 26380 steps over 12 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 15 ns
---------------------------------------------------------

Trajectories [0, 1]
are shorter then step threshold (lag*minlenfactor = 75)
and will not be used to compute the MSM.

Using 114 trajectories with 26284 steps over 12 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 30 ns
---------------------------------------------------------

Trajectories [0, 1, 4, 63]
are shorter then step threshold (lag*minlenfactor = 150)
and will not be used to compute the MSM.

Using 112 trajectories with 25999 steps over 12 coresets

All sets are connected
---------------------------------------------------------
*********************************************************

[63]:
# Plot the time scales
fig, ax, *_ = M.plot_its()
fig.tight_layout(pad=0.1)
../_images/tutorial_md_parameter_selection_84_0.png
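
The implied timescales shown here follow the usual definition t_i(τ) = -τ / ln|λ_i(τ)|, where the λ_i are the eigenvalues of the transition matrix at lag time τ. A minimal sketch, assuming eigvals holds the eigenvalues sorted in decreasing order:

def implied_timescales(eigvals, lag):
    # Skip the stationary eigenvalue (λ = 1); the remaining eigenvalues
    # yield the relaxation timescales of the slower processes
    return -lag / np.log(np.abs(eigvals[1:]))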

Clustering alternatives

For comparison, we cluster the same reduced data set with density-peak clustering as implemented in the pydpc package.

[10]:
pydpc_clustering = pydpc.Cluster(langerin_reduced.data.points)
../_images/tutorial_md_parameter_selection_86_0.png
[11]:
pydpc_clustering.autoplot = False  # Disable automatic plotting on subsequent calls
[16]:
pydpc_clustering.assign(0, 1.8)  # Density and delta thresholds for selecting cluster centres
[19]:
pydpc_clustering.clusters
[19]:
array([  690,  5274,  5470, 11430, 14034, 16202, 17564], dtype=int32)
[20]:
langerin_reduced.labels = (pydpc_clustering.membership + 1)  # Shift by one so that label 0 can mark noise
[21]:
draw_evaluate(langerin_reduced)
../_images/tutorial_md_parameter_selection_91_0.png
[22]:
langerin_reduced.labels[pydpc_clustering.halo_idx] = 0  # Treat halo points as noise
[23]:
draw_evaluate(langerin_reduced)
../_images/tutorial_md_parameter_selection_93_0.png
[24]:
M = cmsm.CMSM(langerin_reduced.get_dtraj(), unit="ns", step=1)
[25]:
# Estimate csMSM for different lag times (given in steps)
lags = [1, 2, 4, 8, 15, 30]
for i in lags:
    M.cmsm(lag=i, minlenfactor=5)
    M.get_its()

*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 1 ns
---------------------------------------------------------

Using 116 trajectories with 25900 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 2 ns
---------------------------------------------------------

Using 116 trajectories with 25900 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 4 ns
---------------------------------------------------------

Using 116 trajectories with 25900 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 8 ns
---------------------------------------------------------

Using 116 trajectories with 25900 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 15 ns
---------------------------------------------------------

Trajectories [0, 1, 73]
are shorter then step threshold (lag * minlenfactor = 75)
and will not be used to compute the MSM.

Using 113 trajectories with 25732 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 30 ns
---------------------------------------------------------

Trajectories [0, 1, 4, 63, 73]
are shorter then step threshold (lag * minlenfactor = 150)
and will not be used to compute the MSM.

Using 111 trajectories with 25447 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************

[26]:
# Plot the time scales
fig, ax, *_ = M.plot_its()
fig.tight_layout(pad=0.1)
../_images/tutorial_md_parameter_selection_96_0.png
[36]:
figsize = mpl.rcParams["figure.figsize"]
mpl.rcParams["figure.figsize"] = figsize[0], figsize[1] * 0.2  # Temporarily reduce the figure height
M.plot_eigenvectors()
mpl.rcParams["figure.figsize"] = figsize  # Restore the default figure size
../_images/tutorial_md_parameter_selection_97_0.png