Commit 1c273b9

Add PK99 tutorial
1 parent 2dfdacf commit 1c273b9

4 files changed: +77 -0 lines changed

docs/source/tutorials/airbnb.rst

Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,8 @@
 An Introduction to SqlSynthGen
 ==============================

+.. _introduction:
+
 `SqlSynthGen <https://github.com/alan-turing-institute/sqlsynthgen/>`_, or SSG for short, is a software package that we have written for synthetic data generation, focussed on relational data.
 When pointed to an existing relational database, SSG creates another database with the same database schema, and populates it with synthetic data.
 By default the synthetic data is crudely low fidelity, but the user is given various ways to configure the behavior of SSG to increase fidelity, while maintaining transparency and control over how the original data is used to inform the synthetic data, to control privacy risks.
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
Tutorial: Loan Data
===================

There are many potential applications of synthetic data in banking and finance, where data that is both personally and commercially sensitive often cannot be shared in real, identifiable form.

Here, we show how to use SqlSynthGen to generate a simple (uniformly random) synthetic version of the freely available `PKDD'99 <https://relational.fit.cvut.cz/dataset/Financial>`_ dataset.
This dataset contains 606 successful and 76 unsuccessful loan applications.

The PKDD'99 dataset is stored in a MariaDB database, so we need a local MariaDB database to hold the synthetic data.
MariaDB installation instructions can be found `here <https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.2.0#entry-header>`_.
We assume that you have a local server running on port 3306, with a user called ``myuser``, a password ``mypassword`` and a database called ``financial``, which can be created as follows:

.. code-block:: console

    $ mysql
    MariaDB > create user 'myuser'@'localhost' identified by 'mypassword';
    MariaDB > create database financial;
    MariaDB > grant all privileges on financial.* to 'myuser'@'localhost';
    MariaDB > \q

After :ref:`installing SqlSynthGen <enduser>`, we create a ``.env`` file that sets two environment variables: ``SRC_DSN`` points to the source database linked at the bottom of the PKDD'99 page, and ``DST_DSN`` points to the local destination database:

**.env**

.. code-block:: console

    SRC_DSN="mariadb+pymysql://guest:[email protected]:3306/Financial_ijs"
    DST_DSN="mariadb+pymysql://myuser:mypassword@localhost:3306/financial"

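The two DSN strings in ``.env`` appear to follow the SQLAlchemy URL layout, ``dialect+driver://user:password@host:port/database``. As a minimal sketch for checking a DSN before running any SqlSynthGen command, the standard library can split the destination DSN into its parts. The helper ``describe_dsn`` is hypothetical and is not part of SqlSynthGen.

```python
from urllib.parse import urlsplit

# DST_DSN value copied from the .env file in this tutorial.
DST_DSN = "mariadb+pymysql://myuser:mypassword@localhost:3306/financial"


def describe_dsn(dsn: str) -> dict:
    """Split a DSN into its parts (hypothetical helper, not part of SqlSynthGen)."""
    parts = urlsplit(dsn)
    # The scheme is assumed to have the form "dialect+driver".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Print each component on its own line for a quick visual check.
for key, value in describe_dsn(DST_DSN).items():
    print(f"{key}: {value}")
```

If the printed user, port and database do not match the ones created in the ``mysql`` session, the later commands will most likely fail to connect.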
We run SqlSynthGen's ``make-tables`` command to create a file called ``orm.py`` that contains the schema of the source database.

.. code-block:: console

    $ sqlsynthgen make-tables

Inspecting the ``orm.py`` file, we see that the ``tkeys`` table has a column called ``goodClient``, which is a ``TINYINT``.
SqlSynthGen doesn't know what to do with ``TINYINT`` columns, so we need to create a config file to tell it how to handle them. This isn't necessary for normal ``Integer`` columns.

**config.yaml**

.. literalinclude:: ../../../tests/examples/loans/config.yaml
    :language: yaml

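Read this way, the config asks for every ``goodClient`` value to be an integer drawn between the two ``args``, 0 and 127, which is the non-negative half of a signed ``TINYINT`` (-128 to 127). The sketch below mimics that behaviour with the Python standard library only. It assumes that both bounds are inclusive and that the draw is uniform, and ``fake_good_client`` is a hypothetical stand-in rather than the generator SqlSynthGen actually calls.

```python
import random

# Bounds copied from the ``args`` list in config.yaml.
LOW, HIGH = 0, 127


def fake_good_client(rng: random.Random) -> int:
    """Stand-in for the configured row generator: a uniform integer in [LOW, HIGH]."""
    return rng.randint(LOW, HIGH)


rng = random.Random(0)  # seeded only so that the sketch is repeatable
values = [fake_good_client(rng) for _ in range(100)]

# Every value fits in a signed TINYINT column.
print(min(values), max(values))
```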
We run SqlSynthGen's ``make-generators`` command to create ``ssg.py``, which contains a generator class for each table in the source database:

.. code-block:: console

    $ sqlsynthgen make-generators --config config.yaml

We then run SqlSynthGen's ``create-tables`` command to create the tables in the destination database:

.. code-block:: console

    $ sqlsynthgen create-tables

Alternatively, you could use another tool, such as ``mysqldump``, to create the tables in the destination database.

Finally, we run SqlSynthGen's ``create-data`` command to populate the tables with synthetic data:

.. code-block:: console

    $ sqlsynthgen create-data --num-passes 100

This creates 100 rows in each of the nine tables.
The data will be entirely random, so you may wish to fine-tune it using the source statistics, custom generators or "story generators" explained in the longer :ref:`introduction <introduction>`.

tests/examples/loans/config.yaml

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
---
tables:
  tkeys:
    row_generators:
      - name: generic.numeric.integer_number
        columns_assigned: goodClient
        args:
          - 0
          - 127

tests/test_rst.py

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ def test_dir(self) -> None:
             'No directive entry for "automodule"',
             'No directive entry for "literalinclude"',
             'Hyperlink target "enduser" is not referenced',
+            'Hyperlink target "introduction" is not referenced',
         ]
         filtered_errors = []
         for file_errors in all_errors:
