docs/source/health_data.rst
2 additions & 2 deletions
@@ -20,7 +20,7 @@ Before getting into the config itself, we need to discuss a few peculiarities of
 However, some of these vocabulary tables may be too large to practically be writable to ``.yaml`` files, and will need to be dealt with manually.
 You should also check the license agreement of each standardized vocabulary before sharing any of the ``.yaml`` files.
 
-Dealing with circular foreign keys
+Dealing with Circular Foreign Keys
 ++++++++++++++++++++++++++++++++++
 
 SSG is currently unable to handle schemas with circular foreign keys properly.
@@ -51,7 +51,7 @@ One can then proceed with ``sqlsynthgen create-vocab`` to copy over the vocabula
 If the problematic foreign key constraints would be between non-vocabulary tables, one would need to keep them disabled for the whole duration of creating synthetic data, while putting in a manual mechanism that guarantees that the synthetic data created does respect the constraint, and then reenable the constraint at the end.
 Fortunately with OMOP this is not necessary.
 
-Vocabulary tables
+Vocabulary Tables
 +++++++++++++++++++++
 
 The OMOP schema has many vocabulary tables. Here's an excerpt from the CCHIC OMOP config file we've written:
docs/source/introduction.rst
5 additions & 5 deletions
@@ -17,7 +17,7 @@ After migration, the database has the following structure:
 :width:400
 :alt:The AirBnb database diagram.
 
-Default behavior
+Default Behavior
 ----------------
 
 SSG contains tools for replicating the schema of a source database.
@@ -99,7 +99,7 @@ Foreign key relations are respected by picking random rows from the table refere
 Even this synthetic data, nearly the crudest imaginable, can be useful for instance for testing software pipelines.
 Note that this data has no privacy implications, since it is only based on the schema.
 
-Vocabulary tables
+Vocabulary Tables
 -----------------
 
 The simplest configuration option available to increase fidelity is to mark some of the tables in the schema to be “vocabulary” tables.
@@ -166,7 +166,7 @@ To recap, “vocabularies” are tables that don’t need synthesising.
 By itself this adds only limited utility, since the interesting parts of the data are typically in the non-vocabulary tables, but it saves great amounts of work by fixing some tables with no privacy concerns to have perfect fidelity from the get-go.
 Note that one has to be careful in making sure that the tables marked as vocabulary tables truly do not hold privacy sensitive data, otherwise catastrophic privacy leaks are possible, where the original data is exposed raw and in full.
 
-Specifying row-based custom generators
+Specifying Row-based Custom Generators
 --------------------------------------
 
 As we’ve seen above, ``ssg.py`` is overwritten whenever you re-run ``make-generators``.
@@ -300,7 +300,7 @@ Still there are no privacy implications, but data can be generated that e.g. pas
 
 .. _source_statistics:
 
-Using aggregate statistics from the source data
+Using Aggregate Statistics from the Source Data
 -----------------------------------------------
 
 Beyond copying vocabulary tables, SSG allows for the original data to affect the synthetic data generation process only through a particular mechanism we call source statistics.
@@ -439,7 +439,7 @@ One final aspect of source statistics bears mentioning:
 At the top level of ``config.yaml`` one can also set ``use-asyncio: true``.
 With this, if there are multiple source stats queries to be run, they will be run in parallel, which may speed up ``make-stats`` significantly if some of the queries are slow.
 
-"Stories" within the data
+"Stories" Within the Data
 -------------------------
 
 The final configuration option available to users of SSG is what we call "story generators".