docs/source/configuration.rst
1 addition & 1 deletion
@@ -1,7 +1,7 @@
 Configuration Reference
 =======================
 
-SqlSynthGen is configured using a YAML file, which is passed to several commands with the ``--config`` option.
+SqlSynthGen is configured using a YAML file, which is passed to several commands with the ``--config-file`` option.
 Throughout the docs, we will refer to this file as ``config.yaml`` but it can be called anything (the exception being that there will be a naming conflict if you have a vocabulary table called ``config``).
 
 Below, we see the schema for the configuration file.
docs/source/faq.rst
2 additions & 2 deletions
@@ -5,14 +5,14 @@ Can SqlSynthGen work with two different schemas?
 ************************************************
 
 SqlSynthGen can only work with a single source schema and a single destination schema at a time.
-However, you can choose for the destination schema to have a different name to the source schema by setting the DST_SCHEMA environment variable.
+However, you can choose for the destination schema to have a different name to the source schema by setting the ``DST_SCHEMA`` environment variable.
 
 Which DBMSs does SqlSynthGen support?
 *************************************
 
 * SqlSynthGen most fully supports **PostgreSQL**, which it uses for its end-to-end functional tests.
 * SqlSynthGen also supports **MariaDB**, as long as you don't set ``use-asyncio: true`` in your config.
-* SqlSynthGen *might*, work with **SQLite** but this is largely untested.
+* SqlSynthGen *might* work with **SQLite** but this is largely untested.
 * SqlSynthGen may also work with SQL Server.
 To connect to SQL Server, you will need to install `pyodbc <https://pypi.org/project/pyodbc/>`_ and an `ODBC driver <https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16>`_, after which you should be able to use a DSN setting similar to ``SRC_DSN="mssql+pyodbc://username:password@hostname/dbname?driver=ODBC Driver 18 for SQL Server"``.
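
The SQL Server DSN above contains spaces in the driver name, and real passwords often contain characters such as ``@`` or ``/``. The following is a minimal sketch of assembling such a DSN with those parts percent-encoded, which is the form SQLAlchemy-style URLs generally expect. The helper name ``build_mssql_dsn`` and the example password are hypothetical, and whether SqlSynthGen needs the encoded form is an assumption rather than something stated in the FAQ.

```python
from urllib.parse import quote_plus


def build_mssql_dsn(username, password, hostname, dbname,
                    driver="ODBC Driver 18 for SQL Server"):
    """Assemble an mssql+pyodbc DSN (hypothetical helper, not part of SqlSynthGen).

    The user name, password and driver name are percent-encoded so that
    spaces and reserved characters do not break URL parsing.
    """
    return (
        f"mssql+pyodbc://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}/{dbname}?driver={quote_plus(driver)}"
    )


# Example with a made-up password containing reserved characters.
dsn = build_mssql_dsn("username", "p@ss/word", "hostname", "dbname")
print(dsn)
```

The resulting string could then be exported as ``SRC_DSN`` in the same way as the literal example in the FAQ entry.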
docs/source/health_data.rst
8 additions & 8 deletions
@@ -13,10 +13,10 @@ The full configuration we wrote for the CCHIC data set is available `here <https
 
 Before getting into the config itself, we need to discuss a few peculiarities of the OMOP CDM that need to be taken into account:
 
-1. Some versions of OMOP contain a circular foreign key, for instance between the `vocabulary`, `concept`, and `domain` tables.
-2. There are several standardized vocabulary tables (`concept`, `concept_relationship`, etc).
+1. Some versions of OMOP contain a circular foreign key, for instance between the ``vocabulary``, ``concept``, and ``domain`` tables.
+2. There are several standardized vocabulary tables (``concept``, ``concept_relationship``, etc).
 These should be marked as such in the sqlsynthgen config file.
-The tables will be exported to ``.yaml`` files during the ``make-tables`` step.
+The tables will be exported to ``.yaml`` files during the ``make-generators`` step.
 However, some of these vocabulary tables may be too large to practically be writable to ``.yaml`` files, and will need to be dealt with manually.
 You should also check the license agreement of each standardized vocabulary before sharing any of the ``.yaml`` files.
 
@@ -195,7 +195,7 @@ Here is our config for the person table:
 columns_assigned: care_site_id
 
 ``num_rows_per_pass`` is set to 0, because all rows are generated by the story generator.
-Let's use the gender columns as an emxample.
+Let's use the gender columns as an example.
 Here is the relevant function from ``row_generators.py``.
 
 .. code-block:: python
@@ -355,8 +355,8 @@ You can find examples of this in the `full configuration <https://github.com/ala
 After creating a person, ``patient_story`` creates possibly an entry in the ``death`` table, and then one for ``visit_occurrence``.
 The configurations and generators for these aren't very interesting; their main point is to make the chronology and time scales make sense, so that people born a long time ago are more likely to have died, and the order of birth, visit start, visit end, and possible death is correct.
 
-After that the story generates a set of rows for tables like `observation`, `measurement`, `condition_occurrence`, etc., the ones that involve procedures and events that took place during the hospital stay.
-The procedure is very similar for each one of these, we'll discuss `measurement` as an example.
+After that the story generates a set of rows for tables like ``observation``, ``measurement``, ``condition_occurrence``, etc., the ones that involve procedures and events that took place during the hospital stay.
+The procedure is very similar for each one of these; we'll discuss ``measurement`` as an example.
 
 The first stop is the ``avg_measurements_per_hour`` src-stats query, which looks like this
@@ -394,11 +394,11 @@ The first stop is the ``avg_measurements_per_hour`` src-stats query, which looks
 upper: 100
 
 Note how the ``query`` part, which is executed on the database server, tries to do as much of the work as possible:
-It extracts the number of `measurement` entries, divided by the length of the hospital stay, for each person.
+It extracts the number of ``measurement`` entries, divided by the length of the hospital stay, for each person.
 The ``dp-query`` then only computes the average.
 This is both to circumvent the limitations of SNSQL, which can't for instance do subqueries or differences between columns, and also to minimise the data transferred to and work done on the local machine running SSG.
 
-Based on that information, we generate a set of times, roughly at the right frequency, at which a `measurement` entry should generated for our synthetic patient.
+Based on that information, we generate a set of times, roughly at the right frequency, at which a ``measurement`` entry should be generated for our synthetic patient.
 The relevant `src-stats queries <https://github.com/alan-turing-institute/sqlsynthgen/blob/main/examples/cchic_omop/>`_ for this are
 
 * ``count_measurements``, which counts the relative frequencies of various types of measurements, like blood pressure, pulse taking, different lab results, etc.
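
The idea of generating measurement times "roughly at the right frequency" can be sketched as follows. This is only an illustration under stated assumptions: it draws the gaps between events from an exponential distribution (a Poisson process), and the helper ``sample_event_times``, its parameters, and the example rate of 0.5 per hour are hypothetical. It is not the code used in the CCHIC example's ``row_generators.py``.

```python
import random
from datetime import datetime, timedelta


def sample_event_times(start, end, avg_per_hour, rng):
    """Return increasing event times in [start, end).

    Hypothetical helper: gaps between events are exponentially distributed
    with mean 1 / avg_per_hour hours, so the long-run frequency is roughly
    avg_per_hour events per hour.
    """
    times = []
    if avg_per_hour <= 0:
        return times
    current = start
    while True:
        gap_hours = rng.expovariate(avg_per_hour)
        current = current + timedelta(hours=gap_hours)
        if current >= end:
            return times
        times.append(current)


# Assumed inputs: a 48-hour stay and an average of 0.5 measurements per hour,
# standing in for the value a query like avg_measurements_per_hour would give.
rng = random.Random(0)
visit_start = datetime(2020, 1, 1, 8, 0)
visit_end = visit_start + timedelta(hours=48)
measurement_times = sample_event_times(visit_start, visit_end, 0.5, rng)
```

Each sampled time would then become one ``measurement`` row for the synthetic patient, with the type of measurement chosen separately.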
docs/source/introduction.rst
5 additions & 5 deletions
@@ -106,7 +106,7 @@ Now when we run ``create-data`` we get valid, if not very sensible, values in ea
 - 485
 - 534
 
-SSG’s default generators have minimal fidelity: All data is generated based purely on the datatype of the its column, e.g. random strings in string columns.
+SSG’s default generators have minimal fidelity: All data is generated based purely on the datatype of the column, e.g. random strings in string columns.
 Foreign key relations are respected by picking random rows from the table referenced.
 Even this synthetic data, nearly the crudest imaginable, can be useful for instance for testing software pipelines.
 Note that this data has no privacy implications, since it is only based on the schema.
@@ -121,7 +121,7 @@ This should of course only be done for tables that hold no privacy-sensitive dat
 For instance, in the AirBnB dataset, the ``users`` table has a foreign key reference to a table of world countries: ``users.country_destination`` references the ``countries.country_destination`` primary key column.
 Since the ``countries`` table doesn’t contain personal data, we can make it a vocabulary table.
 
-Besides manual edition, on SSG we can also customise the generation of ``ssg.py`` via a YAML file,
+Besides manually editing it, we can also customise the generation of ``ssg.py`` via a YAML file,
 typically named ``config.yaml``.
 We identify ``countries`` as a vocabulary table in our ``config.yaml`` file:
 
@@ -164,7 +164,7 @@ We need to truncate any tables in our destination database before importing the
-Since ``make-generators`` rewrote ``ssg.py``, we must now re-edit it to add the primary key ``VARCHAR`` workaroundsfor the ``users`` and ``age_gender_bkts`` tables, as we did in section above.
+Since ``make-generators`` rewrote ``ssg.py``, we must now re-edit it to add the primary key ``VARCHAR`` workarounds for the ``users`` and ``age_gender_bkts`` tables, as we did in the section above.
 Once this is done, we can generate random data for the other three tables with::
 
 $ sqlsynthgen create-data
@@ -293,7 +293,7 @@ Then, we tell SSG to import our custom ``airbnb_generators.py`` and assign the r
 Note how we pass the ``generic`` object as a keyword argument to ``user_dates_provider``.
-Row generators can have positional arguments specified as a list under the ``args`` list and keyword arguments as a dictionary under the ``kwargs`` entry.
+Row generators can have positional arguments specified as a list under the ``args`` entry and keyword arguments as a dictionary under the ``kwargs`` entry.
 
 Limitations to this approach to increasing fidelity are that rows can not be correlated with other rows in the same table, nor with any rows in other tables, except for trivially fulfilling foreign key constraints as in the default configuration.
 We will see how to address this later when we talk about :ref:`story generators <story-generators>`.
@@ -537,7 +537,7 @@ For instance, it may first yield a row specifying a person in the ``users`` tabl
 Three features make story generators more practical than simply manually writing code that creates the synthetic data bit-by-bit:
 
 1. When a story generator yields a row, it can choose to only specify values for some of the columns. The values for the other columns will be filled by custom row generators (as explained in a previous section) or, if none are specified, by SSG's default generators. Above, we have chosen to specify the value for ``first_device_type`` but the date columns will still be handled by our ``user_dates_provider`` and the age column will still be populated by the ``user_age_provider``.
-2. Any default values that are set when the rows yielded by the story generator are written into the database are available to the story generator when it resumes. In our example, the user's ``id`` is available so that we can respect the foreign key relationship between ``users`` and ``sessions``, even though we did not explicitly set the user's ``id`` when creating the user.
+2. Any default values that are set when the rows yielded by the story generator are written into the database are available to the story generator when it resumes. In our example, the user's ``id`` is available so that we can respect the foreign key relationship between ``users`` and ``sessions``, even though we did not explicitly set the user's ``id`` when creating the user on line 8.
 
 To use and get the most from story generators, we will need to make some changes to our configuration:
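
Independently of those configuration changes, the resume behaviour described in point 2 can be illustrated with a small, self-contained sketch. It assumes that a story yields ``(table name, column values)`` pairs and receives the written row back when it resumes. The ``run_story`` driver is a hypothetical stand-in that simulates a database assigning a default ``id``; it is not SSG's own machinery, and the device types and the ``"signup"`` action are made-up values.

```python
import random


def sessions_story():
    """Yield one user row, then one session row that references it."""
    device_types = ["Mac Desktop", "Windows Desktop", "iPhone"]
    # Only first_device_type is specified; other columns are left to
    # row generators or defaults.
    user = yield ("users", {"first_device_type": random.choice(device_types)})
    # On resume, `user` holds the row as written, including its default id.
    yield ("sessions", {"user_id": user["id"], "action": "signup"})


def run_story(story):
    """Hypothetical driver: 'writes' each yielded row and sends it back."""
    written = []
    next_id = 1
    try:
        table, row = next(story)
        while True:
            # Simulate a database default for the id column.
            stored = {"id": next_id, **row}
            next_id += 1
            written.append((table, stored))
            table, row = story.send(stored)
    except StopIteration:
        return written


rows = run_story(sessions_story())
```

Here the session row picks up the user's ``id`` even though the story never set it, which is the mechanism that lets a story respect the foreign key between ``users`` and ``sessions``.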