|
| 1 | +Tutorial: Loan Data |
| 2 | +=================== |
| 3 | + |
| 4 | +There are many potential applications of synthetic data in banking and finance where the nature of the data, being both personally and commercially sensitive, may rule out sharing real, identifiable data. |
| 5 | + |
| 6 | +Here, we show how to use SqlSynthGen to generate a simple (uniformly random) synthetic version of the freely-available `PKDD'99 <https://relational.fit.cvut.cz/dataset/Financial>`_ dataset. |
| 7 | +This dataset contains 606 successful and 76 unsuccessful loan applications. |
| 8 | + |
| 9 | +The PKDD'99 dataset is stored in a MariaDB database, so we need a local MariaDB database to hold the synthetic data. |
| 10 | +MariaDB installation instructions can be found `here <https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.2.0#entry-header>`_. |
| 11 | +We assume that you have a local server running on port 3306, with a user called ``myuser``, the password ``mypassword``, and a database called ``financial``. If not, you can create them as follows: |
| 12 | + |
| 13 | +.. code-block:: console |
| 14 | +
|
| 15 | + $ mysql |
| 16 | + MariaDB > create user 'myuser'@'localhost' identified by 'mypassword'; |
| 17 | + MariaDB > create database financial; |
| 18 | + MariaDB > grant all privileges on financial.* to 'myuser'@'localhost'; |
| 19 | + MariaDB > \q |
| 20 | +
|
| 21 | +After :ref:`installing SqlSynthGen <enduser>`, we create a ``.env`` file that sets two environment variables: the source database (the one linked at the bottom of the PKDD'99 page) and the destination database (the local one): |
| 22 | + |
| 23 | +**.env** |
| 24 | + |
| 25 | +.. code-block:: console |
| 26 | +
|
| 27 | + SRC_DSN="mariadb+pymysql://guest:[email protected]:3306/Financial_ijs" |
| 28 | + DST_DSN="mariadb+pymysql://myuser:mypassword@localhost:3306/financial" |
| 29 | +
|
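The two DSNs appear to follow the SQLAlchemy URL convention of ``dialect+driver://user:password@host:port/database``. As a minimal sketch (not part of SqlSynthGen), the standard-library snippet below splits the destination DSN into its parts, which can help catch a typo in ``.env`` before running any commands. The helper name ``describe_dsn`` is hypothetical.

```python
from urllib.parse import urlsplit

# Destination DSN copied from the .env file above.
DST_DSN = "mariadb+pymysql://myuser:mypassword@localhost:3306/financial"


def describe_dsn(dsn: str) -> dict:
    """Split a dialect+driver://user:password@host:port/database URL.

    Hypothetical helper for illustration only; SqlSynthGen does not
    provide it.
    """
    parts = urlsplit(dsn)
    # The scheme holds "dialect+driver", e.g. "mariadb+pymysql".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_dsn(DST_DSN)
for key, value in info.items():
    print(f"{key}: {value}")
```

The port and database name printed here should match the local server and the ``financial`` database created earlier.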
| 30 | +We run SqlSynthGen's ``make-tables`` command to create a file called ``orm.py`` that contains the schema of the source database. |
| 31 | + |
| 32 | +.. code-block:: console |
| 33 | +
|
| 34 | + $ sqlsynthgen make-tables |
| 35 | +
|
| 36 | +Inspecting the ``orm.py`` file, we see that the ``tkeys`` table has a column called ``goodClient``, which is a ``TINYINT``. |
| 37 | +SqlSynthGen doesn't know what to do with ``TINYINT`` columns, so we need to create a config file to tell it how to handle them. This isn't necessary for normal ``Integer`` columns. |
| 38 | + |
| 39 | +**config.yaml** |
| 40 | + |
| 41 | +.. literalinclude:: ../../../tests/examples/loans/config.yaml |
| 42 | + :language: yaml |
| 43 | + |
| 44 | +We run SqlSynthGen's ``make-generators`` command to create ``ssg.py``, which contains a generator class for each table in the source database: |
| 45 | + |
| 46 | +.. code-block:: console |
| 47 | +
|
| 48 | + $ sqlsynthgen make-generators --config config.yaml |
| 49 | +
|
| 50 | +We then run SqlSynthGen's ``create-tables`` command to create the tables in the destination database: |
| 51 | + |
| 52 | +.. code-block:: console |
| 53 | +
|
| 54 | + $ sqlsynthgen create-tables |
| 55 | +
|
| 56 | +Alternatively, you could use another tool, such as ``mysqldump``, to create the tables in the destination database. |
| 57 | + |
| 58 | +Finally, we run SqlSynthGen's ``create-data`` command to populate the tables with synthetic data: |
| 59 | + |
| 60 | +.. code-block:: console |
| 61 | +
|
| 62 | + $ sqlsynthgen create-data --num-passes 100 |
| 63 | +
|
| 64 | +This will make 100 rows in each of the nine tables. |
| 65 | +The data will be entirely random, so you may wish to fine-tune it using the source-statistics, custom generators or "story generators" explained in the longer :ref:`introduction <introduction>`. |
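To see why fine-tuning matters, the sketch below compares uniformly random flags with flags weighted by the loan counts quoted at the top of this tutorial. It is an illustration only and does not use SqlSynthGen: it assumes, hypothetically, that a flag such as ``goodClient`` takes the value 1 for a successful loan and 0 otherwise, and the function names are made up for this example.

```python
import random

# Loan counts quoted at the top of this tutorial.
SUCCESSFUL = 606
UNSUCCESSFUL = 76
real_share = SUCCESSFUL / (SUCCESSFUL + UNSUCCESSFUL)


def uniform_flags(n: int, rng: random.Random) -> list:
    """Draw n flags with 0 and 1 equally likely (uniformly random)."""
    return [rng.choice([0, 1]) for _ in range(n)]


def weighted_flags(n: int, rng: random.Random, p_good: float) -> list:
    """Draw n flags where 1 occurs with probability p_good."""
    return [1 if rng.random() < p_good else 0 for _ in range(n)]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
uniform = uniform_flags(10_000, rng)
weighted = weighted_flags(10_000, rng, real_share)

uniform_share = sum(uniform) / len(uniform)
weighted_share = sum(weighted) / len(weighted)

print(f"share of 1s in the real data:  {real_share:.3f}")
print(f"share of 1s, uniform sampling: {uniform_share:.3f}")
print(f"share of 1s, weighted sampling: {weighted_share:.3f}")
```

Uniform sampling gives roughly equal numbers of each value, whereas the real data is heavily skewed towards successful loans. Closing that kind of gap is what the source-statistics and custom generators are for.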