Commit e7e6e2c

Substantial update to the airbnb intro tutorial and related files
1 parent 431ed30 commit e7e6e2c

9 files changed, +215 -159 lines changed

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 .. _page-index:

-sqlsynthgen's Documentation
+sqlsynthgen
 ---------------------------

 **sqlsynthgen** is a package for making copies of relational databases and populating them with random data.

docs/source/installation.rst

Lines changed: 0 additions & 3 deletions
@@ -3,9 +3,6 @@
 Installation
 ============

-End User
---------
-
 To use SqlSynthGen, first install it:

 .. code-block:: console

docs/source/introduction.rst

Lines changed: 131 additions & 85 deletions
Large diffs are not rendered by default.

docs/source/quickstart.rst

Lines changed: 10 additions & 10 deletions
@@ -25,9 +25,9 @@ After :ref:`Installation <page-installation>`, we can run ``sqlsynthgen`` to see
     remove-vocab Truncate all vocabulary tables in the dst schema.
     validate-config Validate the format of a config file.

-For the simplest case, we will need `make-tables`, `make-generators`, `create-tables` and `create-data` but, first,
+For the simplest case, we will need ``make-tables``, ``make-generators``, ``create-tables`` and ``create-data`` but, first,
 we need to set environment variables to tell sqlsynthgen how to access our source database (where the real data resides now) and destination database (where the synthetic data will go).
-We can do that in the terminal with the `export` keyword, as shown below, or in a file called `.env`.
+We can do that in the terminal with the ``export`` keyword, as shown below, or in a file called ``.env``.
 The source and destination may be on the same database server, as long as the database or schema names differ.
 If the source and destination schemas are the default schema for the user on that database, you should not set those variables.
 If you are using a DBMS that does not support schemas (e.g. MariaDB), you must not set those variables.
@@ -40,39 +40,39 @@ If you are using a DBMS that does not support schemas (e.g. MariaDB), you must n
     $ export DST_DSN="postgresql://someuser:[email protected]/dst_db"
     $ export DST_SCHEMA='myschema'

-Next, we make a SQLAlchemy file that defines the structure of your database using the `make-tables` command:
+Next, we make a SQLAlchemy file that defines the structure of your database using the ``make-tables`` command:

 .. code-block:: console

     $ sqlsynthgen make-tables

-This will have created a file called `orm.py` in the current directory, with a SQLAlchemy class for each of your tables.
+This will have created a file called ``orm.py`` in the current directory, with a SQLAlchemy class for each of your tables.

 The next step is to make a sqlsynthgen file that defines one data generator per table in the source database:

 .. code-block:: console

     $ sqlsynthgen make-generators

-This will have created a file called `ssg.py` in the current directory.
+This will have created a file called ``ssg.py`` in the current directory.

-We can use the `create-table` command to read the `orm.py` file, create our destination schema (if it doesn't already exist) and to create empty copies of all the tables that in the source database.
+We can use the ``create-tables`` command to read the ``orm.py`` file, create our destination schema (if it doesn't already exist), and to create empty copies of all the tables that are in the source database.

 .. code-block:: console

     $ sqlsynthgen create-tables

-Now that we have created the schema that will hold synthetic data, we can use the `create-data` command to read `orm.py` & `ssg.py` and generate data:
+Now that we have created the schema that will hold synthetic data, we can use the ``create-data`` command to read ``orm.py`` & ``ssg.py`` and generate data:

 .. code-block:: console

     $ sqlsynthgen create-data

-By default, `create-data` will have inserted one row per table and will have used the column data types to decide how to randomly generate data.
-To create more data each time we call `create-data`, we can provide an integer argument:
+By default, ``create-data`` will have inserted one row per table and will have used the column data types to decide how to randomly generate data.
+To create more data each time we call ``create-data``, we can provide the ``num-passes`` argument:

 .. code-block:: console

-    $ sqlsynthgen create-data 10
+    $ sqlsynthgen create-data --num-passes=10

 We will have inserted 11 rows per table, with the last two commands.
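
The quickstart steps in this hunk run in a fixed order, so they can be scripted. The following is a minimal sketch, assuming the environment variables described above are already set and that ``sqlsynthgen`` is on the ``PATH``. The names ``QUICKSTART_COMMANDS`` and ``run_quickstart`` are hypothetical helpers, not part of sqlsynthgen; only the command names and the ``--num-passes`` flag come from the text.

```python
import subprocess

# Commands in the order the quickstart runs them.
QUICKSTART_COMMANDS = [
    ["sqlsynthgen", "make-tables"],       # writes orm.py
    ["sqlsynthgen", "make-generators"],   # writes ssg.py
    ["sqlsynthgen", "create-tables"],     # creates empty destination tables
    ["sqlsynthgen", "create-data"],       # one pass by default
    ["sqlsynthgen", "create-data", "--num-passes=10"],  # ten more passes
]


def run_quickstart(runner=subprocess.run):
    """Run each command in turn; check=True stops at the first failure."""
    for command in QUICKSTART_COMMANDS:
        runner(command, check=True)


# Call run_quickstart() once the source and destination databases are reachable.
```

Passing the runner as a parameter keeps the sketch testable without a database: a stub can record the commands instead of executing them.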
Lines changed: 14 additions & 13 deletions
@@ -1,22 +1,27 @@
-import random
 import datetime
+import random
 from typing import Optional, Generator, Tuple

+
 def user_dates_provider(generic):
     date_account_created: datetime.date = generic.datetime.date(start=2010, end=2015)

     booking_date: Optional[datetime.date] = None
     if generic.choice([True, False]):
         booking_date = generic.datetime.date(
             start=date_account_created.year + 1, end=2016
-        )
+        )

     return date_account_created, booking_date

+
 def user_age_provider(query_results):
-    mu: float = query_results[0][0]
-    sigma: float = query_results[0][1]
-    return random.gauss(mu, sigma)
+    # The [0] picks up the first row of the query results. This is needed because all
+    # query results are always tables, and could in principle have many rows.
+    mean: float = query_results[0]["mean"]
+    std_dev: float = query_results[0]["std_dev"]
+    return random.gauss(mean, std_dev)
+

 def sessions_story():
     """Generate users and their sessions."""
@@ -25,9 +30,7 @@ def sessions_story():
     # a new user will be sent back to us with our randomly chosen device type
     user: dict = yield (
         "users",  # table name
-        {
-            "first_device_type": random.choice(device_types)
-        }  # see 1. below
+        {"first_device_type": random.choice(device_types)},
     )

     # create between 10 and 19 sessions per user
@@ -39,15 +42,13 @@ def sessions_story():
             yield (
                 "sessions",
                 {
-                    "user_id": user["id"],  # see 2. below
+                    "user_id": user["id"],
                     "device_type": user["first_device_type"],
-                }
+                },
             )
         else:
             # ...but sometimes it is from any device type
             yield (
                 "sessions",
-                {
-                    "user_id": user["id"],
-                    "device_type": random.choice(device_types)},
+                {"user_id": user["id"], "device_type": random.choice(device_types)},
             )
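
The story above relies on Python's generator ``send`` protocol: each ``yield`` hands a ``(table, values)`` pair to the caller, and the caller sends the stored row back. The following is a minimal sketch of that exchange. ``run_story`` and ``fake_insert`` are hypothetical stand-ins written for illustration and are not sqlsynthgen's own driver; the device type values are also made up, and the story is cut down to two sessions.

```python
import itertools
import random


def sessions_story():
    """Cut-down version of the story in the diff: one user, two sessions."""
    device_types = ["Mac Desktop", "iPhone"]  # hypothetical values
    # The driver sends back the stored "users" row, including its id.
    user = yield ("users", {"first_device_type": random.choice(device_types)})
    for _ in range(2):
        yield (
            "sessions",
            {"user_id": user["id"], "device_type": user["first_device_type"]},
        )


_ids = itertools.count(1)


def fake_insert(table, values):
    """Stand-in for a database insert: assign an id and return the full row."""
    return {"id": next(_ids), **values}


def run_story(story, insert_row):
    """Insert every row the story yields, sending each stored row back in."""
    inserted = []
    try:
        table, values = next(story)
        while True:
            row = insert_row(table, values)
            inserted.append((table, row))
            table, values = story.send(row)
    except StopIteration:
        pass
    return inserted


rows = run_story(sessions_story(), fake_insert)
for table, row in rows:
    print(table, row)
```

Because the stored row is sent back into the generator, the story can reuse values it did not choose itself, such as the ``id`` used for ``user_id``.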
Lines changed: 19 additions & 24 deletions
@@ -1,23 +1,9 @@
-
 row_generators_module: airbnb_generators

 tables:
   countries:
     vocabulary_table: true

-  sessions:
-    num_rows_per_pass: 0
-    row_generators:
-      - name: generic.numeric.integer_number
-        kwargs:
-          start: 0
-          end: 3600
-        columns_assigned: secs_elapsed
-      - name: generic.choice
-        kwargs:
-          items: ["show", "index", "personalize"]
-        columns_assigned: action
-
   age_gender_bkts:
     num_rows_per_pass: 1
     row_generators:
@@ -32,21 +18,29 @@ tables:
   users:
     num_rows_per_pass: 0
     row_generators:
-      - name: generic.person.password
-        kwargs: null
-        columns_assigned: id
-      - name: generic.person.identifier
+      - name: airbnb_generators.user_age_provider
         kwargs:
-          mask: '"@@##@@@@"'  # Using this provider, we generate alpha-numeric IDs.
+          query_results: SRC_STATS["age_stats"]
+        columns_assigned: age
+      - name: generic.person.password
         columns_assigned: id
       - name: airbnb_generators.user_dates_provider
         kwargs:
           generic: generic
         columns_assigned: ["date_account_created", "date_first_booking"]
-      - name: airbnb_generators.user_age_provider
+
+  sessions:
+    num_rows_per_pass: 0
+    row_generators:
+      - name: generic.numeric.integer_number
         kwargs:
-          query_results: SRC_STATS["age_stats"]
-        columns_assigned: age
+          start: 0
+          end: 3600
+        columns_assigned: secs_elapsed
+      - name: generic.choice
+        kwargs:
+          items: ["show", "index", "personalize"]
+        columns_assigned: action

 src-stats:
   - name: age_stats
@@ -55,9 +49,9 @@ src-stats:
       FROM users
       WHERE age <= 100
     dp-query: >
-      SELECT AVG(age), STDDEV(age)
+      SELECT AVG(age) AS mean, STDDEV(age) AS std_dev
       FROM query_result
-    epsilon: 0.1
+    epsilon: 0.5
     delta: 0.000001
     snsql-metadata:
       max_ids: 1
@@ -70,6 +64,7 @@ src-stats:
         upper: 100

 story_generators_module: airbnb_generators
+
 story_generators:
   - name: airbnb_generators.sessions_story
     num_stories_per_pass: 30
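
The ``AS mean`` and ``AS std_dev`` aliases added to the ``dp-query`` are what let ``user_age_provider`` read its inputs by name rather than by position. The following is a minimal sketch, assuming the query result arrives as a list of dict-like rows keyed by those aliases, which is the shape the provider's indexing implies. The ``age_stats`` values are made up for illustration and stand in for ``SRC_STATS["age_stats"]``.

```python
import random


def user_age_provider(query_results):
    """Draw an age from a normal distribution described by the first result row."""
    mean = query_results[0]["mean"]
    std_dev = query_results[0]["std_dev"]
    return random.gauss(mean, std_dev)


# Hypothetical stand-in for SRC_STATS["age_stats"]: one row keyed by the SQL aliases.
age_stats = [{"mean": 36.5, "std_dev": 11.7}]

random.seed(0)  # fixed seed so repeated runs draw the same value
age = user_age_provider(age_stats)
print(age)
```

Keying by alias means the provider keeps working if the column order in the query changes, which positional indexing would not tolerate.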

tests/examples/airbnb/csv_to_database.py

Lines changed: 5 additions & 3 deletions
@@ -111,12 +111,14 @@ def upload_csv_to_database(
     dataframe = dataframe.replace({np.nan: None})

     num_rows = len(dataframe)
-    for _, data_as_series in tqdm(dataframe.iterrows(), total=num_rows):
+    commit_frequency = 100000
+    for idx, row in tqdm(enumerate(dataframe.iterrows()), total=num_rows):
+        data_as_series = row[1]
         model_instance = mapped_class(**data_as_series)
         if filter_function(model_instance, session):
             session.add(model_instance)
-        else:
-            print(f"Skipping: {model_instance=}")
+        if idx % commit_frequency == 0:
+            session.commit()
     session.commit()
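
The change above commits periodically instead of once at the end. The following is a minimal sketch of the same pattern. ``CountingSession`` is a hypothetical stand-in for a SQLAlchemy session that only records calls, and ``upload_rows`` is a simplified version of the loop with the filter and progress bar left out.

```python
class CountingSession:
    """Stand-in for a database session that only records what was done."""

    def __init__(self):
        self.added = []
        self.commits = 0

    def add(self, obj):
        self.added.append(obj)

    def commit(self):
        self.commits += 1


def upload_rows(rows, session, commit_frequency=100000):
    """Add every row, committing whenever idx is a multiple of commit_frequency."""
    for idx, row in enumerate(rows):
        session.add(row)
        if idx % commit_frequency == 0:
            session.commit()
    # Final commit picks up rows added since the last periodic commit.
    session.commit()


session = CountingSession()
upload_rows(range(250), session, commit_frequency=100)
print(session.commits)
```

With 250 rows and ``commit_frequency=100``, commits happen at ``idx`` 0, 100 and 200, plus the final one, so four in total. Note that the check ``idx % commit_frequency == 0`` is also true for the very first row.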

tests/examples/airbnb/final_ssg_example.py

Lines changed: 15 additions & 9 deletions
@@ -1,7 +1,7 @@
 """This file was auto-generated by sqlsynthgen but can be edited manually."""
 from mimesis import Generic
 from mimesis.locales import Locale
-from sqlsynthgen.base import FileUploader
+from sqlsynthgen.base import FileUploader, TableGenerator
 from sqlsynthgen.unique_generator import UniqueGenerator

 generic = Generic(locale=Locale.EN_GB)
@@ -12,6 +12,12 @@
 from sqlsynthgen.providers import ColumnValueProvider

 generic.add_provider(ColumnValueProvider)
+from sqlsynthgen.providers import NullProvider
+
+generic.add_provider(NullProvider)
+from sqlsynthgen.providers import SQLGroupByProvider
+
+generic.add_provider(SQLGroupByProvider)
 from sqlsynthgen.providers import TimedeltaProvider

 generic.add_provider(TimedeltaProvider)
@@ -34,7 +40,7 @@
 countries_vocab = FileUploader(orm.Countries.__table__)


-class age_gender_bktsGenerator:
+class age_gender_bktsGenerator(TableGenerator):
     num_rows_per_pass = 1

     def __init__(self):
@@ -52,23 +58,22 @@ def __call__(self, dst_db_conn):
         return result


-class usersGenerator:
+class usersGenerator(TableGenerator):
     num_rows_per_pass = 0

     def __init__(self):
         pass

     def __call__(self, dst_db_conn):
         result = {}
+        result["age"] = airbnb_generators.user_age_provider(
+            query_results=SRC_STATS["age_stats"]
+        )
         result["id"] = generic.person.password()
-        result["id"] = generic.person.identifier(mask="@@##@@@@")
         (
             result["date_account_created"],
             result["date_first_booking"],
         ) = airbnb_generators.user_dates_provider(generic=generic)
-        result["age"] = airbnb_generators.user_age_provider(
-            query_results=SRC_STATS["age_stats"]
-        )
         result["timestamp_first_active"] = generic.datetime.datetime()
         result["gender"] = generic.text.color()
         result["signup_method"] = generic.text.color()
@@ -86,7 +91,7 @@ def __call__(self, dst_db_conn):
         return result


-class sessionsGenerator:
+class sessionsGenerator(TableGenerator):
     num_rows_per_pass = 0

     def __init__(self):
@@ -123,7 +128,8 @@ def run_airbnb_generators_sessions_story(dst_db_conn):
 story_generator_list = [
     {
-        "name": run_airbnb_generators_sessions_story,
+        "function": run_airbnb_generators_sessions_story,
         "num_stories_per_pass": 30,
+        "name": "airbnb_generators.sessions_story",
     },
 ]
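
After this change each entry in ``story_generator_list`` carries the callable under ``"function"`` and a readable label under ``"name"``. The following is a minimal sketch of how a caller might consume such a list. ``run_stories`` is a hypothetical helper written for illustration, not sqlsynthgen's own code, and the story function is replaced by a stub that only records its calls.

```python
def run_stories(story_generator_list, dst_db_conn):
    """Call each story function num_stories_per_pass times; return counts by name."""
    counts = {}
    for entry in story_generator_list:
        for _ in range(entry["num_stories_per_pass"]):
            entry["function"](dst_db_conn)
        counts[entry["name"]] = entry["num_stories_per_pass"]
    return counts


calls = []


def run_airbnb_generators_sessions_story(dst_db_conn):
    """Stub for the generated function of the same name."""
    calls.append(dst_db_conn)


story_generator_list = [
    {
        "function": run_airbnb_generators_sessions_story,
        "num_stories_per_pass": 30,
        "name": "airbnb_generators.sessions_story",
    },
]

counts = run_stories(story_generator_list, dst_db_conn=None)
print(counts)
```

Keeping the label separate from the callable lets a report or log refer to the story by its configured name rather than by a function object.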
