A basic tutorial on symbolic regression

In this tutorial we make use of a classic Evolutionary technique to evolve a model for our input data.

[1]:
# Some necessary imports.
import dcgpy
import pygmo as pg
# Sympy is nice to have for basic symbolic manipulation.
from sympy import init_printing
from sympy.parsing.sympy_parser import *
init_printing()
# Fundamental for plotting.
from matplotlib import pyplot as plt
%matplotlib inline

1 - The data

[2]:
# We load our data from some available ones shipped with dcgpy.
# In this particular case we use the problem chwirut2 from
# (https://www.itl.nist.gov/div898/strd/nls/data/chwirut2.shtml)
X, Y = dcgpy.generate_chwirut2()

[3]:
# And we plot them as to visualize the problem.
_ = plt.plot(X, Y, '.')
_ = plt.title('54 measurments')
_ = plt.xlabel('metal distance')
_ = plt.ylabel('ultrasonic response')



../_images/notebooks_symbolic_regression_1_4_0.png

2 - The symbolic regression problem

[4]:
# We define our kernel set, that is the mathematical operators we will
# want our final model to possibly contain. What to choose in here is left
# to the competence and knowledge of the user. A list of kernels shipped with dcgpy
# can be found on the online docs. The user can also define its own kernels (see the corresponding tutorial).
ss = dcgpy.kernel_set_double(["sum", "diff", "mul", "pdiv"])
[5]:
# We instantiate the symbolic regression optimization problem (note: many important options are here not
# specified and thus set to their default values)
udp = dcgpy.symbolic_regression(points = X, labels = Y, kernels=ss())
print(udp)
        Data dimension (points): 1
        Data dimension (labels): 1
        Data size: 54
        Kernels: [sum, diff, mul, pdiv]
        Loss: MSE

3 - The search algorithm

[6]:
# We instantiate here the evolutionary strategy we want to use to search for models.
uda  = dcgpy.es4cgp(gen = 10000, max_mut = 4)

5 - Inspecting the solution

[9]:
# Lets have a look to the symbolic representation of our model (using sympy)
parse_expr(udp.prettier(pop.champion_x))
[9]:
$\displaystyle \left[ \frac{128}{3 x_{0}}\right]$
[10]:
# And lets see what our model actually predicts on the inputs
Y_pred = udp.predict(X, pop.champion_x)
[11]:
# Lets comapre to the data
_ = plt.plot(X, Y_pred, 'r.')
_ = plt.plot(X, Y, '.', alpha=0.2)
_ = plt.title('54 measurments')
_ = plt.xlabel('metal distance')
_ = plt.ylabel('ultrasonic response')

../_images/notebooks_symbolic_regression_1_16_0.png

6 - Recovering the log

[12]:
# Here we get the log of the latest call to the evolve
log = algo.extract(dcgpy.es4cgp).get_log()
gen = [it[0] for it in log]
loss = [it[2] for it in log]
[13]:
# And here we plot, for example, the generations against the best loss
_ = plt.semilogy(gen, loss)
_ = plt.title('last call to evolve')
_ = plt.xlabel('generation')
_ = plt.ylabel('loss')
../_images/notebooks_symbolic_regression_1_19_0.png