"- Accurately estimating the infinite timescale diffusion coefficient from a single simulation\n",
"- Quantifying the variance in the diffusion coefficient that would be observed on repetition of the simulation\n",
"\n",
"### Thought experiment\n",
"\n",
"Consider a particle travelling on a one-dimensional random walk with a step size of 1 Å.\n",
"If, after 10 steps, the particle has been displaced by 5 Å, then after 11 steps it could only be displaced by either 4 Å or 6 Å, and after 12 steps only by 3, 5, or 7 Å. \n",
"\n",
"This fact results in a substantial [correlation](https://en.wikipedia.org/wiki/Correlation) between the distributions of mean-squared displacement at different timesteps. \n",
"To quantify this correlation, we have derived an approach to estimate the full covariance matrix (a description of the correlation between the timesteps). \n",
"The result of this derivation is that the covariance between two timesteps, $\\mathrm{cov}\\Big(\\big\\langle \\mathbf{r}^2(\\Delta t_n) \\big\\rangle, \\big\\langle \\mathbf{r}^2(\\Delta t_{n+m}) \\big\\rangle\\Big)$, is the product of the variance at the first timestep, $\\Delta t_n$, and the ratio of the maximum number of independent trajectories at each timestep,\n",
"\n",
"$$\n",
"\\mathrm{cov}\\Big(\\big\\langle \\mathbf{r}^2(\\Delta t_n) \\big\\rangle, \\big\\langle \\mathbf{r}^2(\\Delta t_{n+m}) \\big\\rangle\\Big) = \\sigma^2\\big(\\langle \\mathbf{r}^2(\\Delta t_n) \\rangle\\big) \\frac{N_i(\\Delta t_{n})}{N_i(\\Delta t_{n+m})}.\n",
"$$\n",
"\n",
"This approach is extremely computationally efficient, as no additional sampling is required to determine this estimate of the full covariance matrix. \n",
"However, noise in the sampled variances may lead to a poorly defined covariance matrix, so the variances are smoothed to follow the function defined in Equation 14 of the derivation. \n",
"This leads to the covariance matrix shown for our LLZO simulation in the figure below. "
]
},
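{
"cell_type": "markdown",
"id": "rw-corr-sketch-0001",
"metadata": {},
"source": [
"As a minimal numerical sketch of the thought experiment above (an illustration only, not part of the `kinisi` analysis), we can simulate many independent one-dimensional random walks and measure the correlation between the squared displacements at two nearby times:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(42)\n",
"\n",
"# 4096 independent 1-D random walks, each of 128 unit steps\n",
"steps = rng.choice([-1.0, 1.0], size=(4096, 128))\n",
"walks = np.cumsum(steps, axis=1)\n",
"\n",
"# Squared displacement of every walk after 20 and 25 steps\n",
"sq_20 = walks[:, 19] ** 2\n",
"sq_25 = walks[:, 24] ** 2\n",
"\n",
"corr = np.corrcoef(sq_20, sq_25)[0, 1]\n",
"print(f'correlation between r^2(20) and r^2(25): {corr:.2f}')\n",
"```\n",
"\n",
"Because each walk at 25 steps shares its first 20 steps with the same walk at 20 steps, the two squared displacements are strongly correlated."
]
},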
{
"cell_type": "code",
"execution_count": null,
"id": "3cc83c3f-c6c7-442c-b190-4daa0d01c995",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"data = np.load('_static/cov.npz')\n",
"cov = data['cov']\n",
"\n",
"plt.subplots(figsize=(6, 4.9))\n",
"plt.contourf(*np.meshgrid(dt, dt), cov, levels=20)\n",
"plt.xlabel(r'$\\Delta t_n$/ps')\n",
"plt.ylabel(r'$\\Delta t_{n+m}$/ps')\n",
"plt.axis('equal')\n",
"plt.colorbar(label=r'$\\mathrm{cov}' + \n",
" r'(\\langle \\mathbf{r}^2(\\Delta t_n) \\rangle, ' + \n",
" r'\\langle \\mathbf{r}^2(\\Delta t_{n+m}) \\rangle)$')\n",
"plt.show()"
]
},
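{
"cell_type": "markdown",
"id": "cov-model-sketch-0001",
"metadata": {},
"source": [
"The ratio rule above can also be sketched directly in code. The sketch below (an illustration with hypothetical inputs, not the exact `kinisi` implementation) builds a model covariance matrix from per-timestep variances and the maximum number of independent trajectories, $N_i(\\Delta t_n)$:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical inputs: smoothed variances of the MSD at each timestep,\n",
"# and the maximum number of independent trajectories, which decreases\n",
"# as the timestep grows\n",
"variances = np.linspace(0.1, 2.0, 50)\n",
"n_independent = np.linspace(1000.0, 20.0, 50)\n",
"\n",
"# cov(n, n+m) = var(n) * N_i(dt_n) / N_i(dt_{n+m}), where n is the\n",
"# earlier of the two timesteps; built symmetrically over all pairs\n",
"rows, cols = np.meshgrid(np.arange(50), np.arange(50), indexing='ij')\n",
"first = np.minimum(rows, cols)\n",
"second = np.maximum(rows, cols)\n",
"model_cov = variances[first] * n_independent[first] / n_independent[second]\n",
"```\n",
"\n",
"The diagonal of `model_cov` recovers the variances themselves, since the ratio of trajectory counts is 1 when the two timesteps coincide."
]
},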
{
"cell_type": "markdown",
"id": "af1bf34d-d199-4dc3-959b-67dc58905c2b",
"metadata": {},
"source": [
"## Modelling a multivariate normal distribution\n",
"\n",
"The determination of the variance in the mean-squared displacement and estimation of the full covariance matrix allows the mean-squared displacement to be described as a covariant [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Definitions), and therefore we can define it with a `scipy.stats.multivariate_normal` object. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9758f623-06e9-4750-ba82-02d5fdab059e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"gp = multivariate_normal(mean=msd, cov=cov, allow_singular=True)"
]
},
{
"cell_type": "markdown",
"id": "e55e558b-1922-4b73-b212-edec14909c21",
"metadata": {},
"source": [
"This object, in theory, allows us to simulate trajectories that could be observed if our simulation were repeated. \n",
"In the plot below, we compare such a simulation from the multivariate normal distribution produced from the full covariance matrix with one produced when only the diagonal terms are defined (i.e. only the variances for each mean-squared displacement). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "232e8811-2820-4b0e-b403-4857b1c66584",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"gp_wls = multivariate_normal(\n",
" mean=msd, cov=np.diag(cov.diagonal()), allow_singular=True)\n",
"\n",
"plt.plot(dt, gp.rvs(1).T, label='GLS', c='#0173B2')\n",
"plt.plot(dt, gp_wls.rvs(1).T, label='WLS', c='#029E73')\n",
"plt.legend()\n",
"plt.ylabel('MSD/Å$^2$')\n",
"plt.xlabel(r'$\\Delta t$/ps')\n",
"plt.xlim(0, None)\n",
"plt.ylim(0, None)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "38682818-57c4-4955-96c2-fda52769c6f4",
"metadata": {},
"source": [
"The erratic changes in the mean-squared displacement observed in the plot with only the variances defined are unphysical when we consider the correlation thought experiment above. \n",
"\n",
"## Likelihood sampling a multivariate normal distribution\n",
"\n",
"As mentioned above, this process aims to determine the diffusion coefficient and ordinate offset, and their model variance, by fitting the Einstein relation. \n",
"In `kinisi`, we use Markov chain Monte Carlo (MCMC) posterior sampling to perform this, using the [emcee package](https://emcee.readthedocs.io).\n",
"To do so, we define a `log_posterior` function that imposes a Bayesian prior probability that the diffusion coefficient must be positive. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "de35f746-2b61-4dd9-966a-65676978238d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from scipy.linalg import pinvh\n",
"\n",
"_, logdet = np.linalg.slogdet(cov)\n",
"logdet += np.log(2 * np.pi) * msd.size\n",
"inv = pinvh(cov)\n",
"\n",
"def log_posterior(theta: np.ndarray) -> float:\n",
" \"\"\"\n",
" Get the log likelihood for multivariate normal distribution.\n",
" :param theta: Value of the gradient and intercept of the straight line.\n",
" :return: Log-likelihood value.\n",
" \"\"\"\n",
" if theta[0] < 0:\n",
" return -np.inf\n",
" model = dt * theta[0] + theta[1]\n",
" diff = (model - msd)\n",
" logl = -0.5 * (logdet + np.matmul(diff.T, np.matmul(inv, diff)))\n",
" return logl"
]
},
{
"cell_type": "markdown",
"id": "32618b8b-693b-40a8-a7bf-d766f3f1a014",
"metadata": {},
"source": [
"Then we can use a minimisation routine to determine maximum *a posteriori* values for the gradient and intercept. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23c97983-1068-4d72-9e1d-cd7d36084e06",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def nll(*args) -> float:\n",
" \"\"\"\n",
" General purpose negative log-likelihood.\n",
"\n",
" :return: Negative log-likelihood\n",
" \"\"\"\n",
" return -log_posterior(*args)\n",
"\n",
"max_post = minimize(nll, [1, 0]).x"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2fc56d5-cfb9-4ff6-a3c5-7b37bf1a7770",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(f'MAP: m = {max_post[0]:.3f}, c = {max_post[1]:.3f}')"
]
},
{
"cell_type": "markdown",
"id": "9ac080b7-ffc6-4488-9289-cee13fe32c16",
"metadata": {},
"source": [
"After determining the maximum *a posteriori* values, we can use `emcee` to sample with 32 walkers for 1000 samples (plus a 500-sample burn-in, which we discard when producing the `flatchain`). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a904bd03-662e-4db3-bd79-e6cf34f96221",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"pos = max_post + max_post * 1e-3 * np.random.randn(32, max_post.size)\n",
"\n",
"sampler = EnsembleSampler(*pos.shape, log_posterior)\n",
"sampler.run_mcmc(pos, 1000 + 500, progress=False)\n",
"flatchain = sampler.get_chain(flat=True, discard=500)"
]
},
{
"cell_type": "markdown",
"id": "8b7c9424-5f6f-4e05-afe2-b232dd476126",
"metadata": {},
"source": [
"The diffusion coefficient (in units of cm