Fresh Segments

In [43]:

Copied!





from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

import os
import sys
from typing import Dict, Optional, Tuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns

sys.path.append("../../../")
from src.athena import Athena
from src.utils import create_session
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

import os
import sys
from typing import Dict, Optional, Tuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns

sys.path.append("../../../")
from src.athena import Athena
from src.utils import create_session

Global¶

In [87]:

Copied!





boto3_session = create_session(
    profile_name="dev",
    role_arn=os.getenv("ATHENA_IAM_ROLE_ARN"),
)

wait = True
ctas_approach = False
storage_format = "PARQUET"
write_compression = "SNAPPY"

database = "fresh_segments"
tables = ["interest_map", "interest_metrics"]

athena = Athena(boto3_session=boto3_session, s3_output=os.getenv("ATHENA_S3_OUTPUT"))
athena
boto3_session = create_session(
    profile_name="dev",
    role_arn=os.getenv("ATHENA_IAM_ROLE_ARN"),
)

wait = True
ctas_approach = False
storage_format = "PARQUET"
write_compression = "SNAPPY"

database = "fresh_segments"
tables = ["interest_map", "interest_metrics"]

athena = Athena(boto3_session=boto3_session, s3_output=os.getenv("ATHENA_S3_OUTPUT"))
athena

Out[87]:

Athena(boto3_session=Session(region_name='us-east-1'), s3_output=s3://sql-case-studies/query_results)

Problem Statement¶

Fresh Segments is a digital marketing agency specializing in analyzing trends in online ad click behavior for businesses with targeted customer bases. The agency receives customer lists from clients and aggregates interest metrics to create a comprehensive dataset for further analysis.

The main output includes detailed composition and rankings of various interests, highlighting the proportion of customers who engaged with online assets related to each interest on a monthly basis.

The goal of this case study is to analyze these aggregated metrics for a sample client to provide high-level insights into their customer base and associated interests.

Entity Relationship Diagram¶

No description has been provided for this image

Tables¶

In [3]:

Copied!





for table in tables:
    athena.query(
        database=database,
        query=f""" 
                SELECT
                    *
                FROM
                    {database}.{table} TABLESAMPLE BERNOULLI(30);
              """,
        ctas_approach=ctas_approach,
    )
for table in tables:
    athena.query(
        database=database,
        query=f""" 
                SELECT
                    *
                FROM
                    {database}.{table} TABLESAMPLE BERNOULLI(30);
              """,
        ctas_approach=ctas_approach,
    )

Out[3]:

	id	interest_name	interest_summary	created_at	last_modified
0	1	Fitness Enthusiasts	Consumers using fitness tracking apps and webs...	2016-05-26 14:57:59	2018-05-23 11:30:12
1	5	Brides & Wedding Planners	People researching wedding ideas and vendors.	2016-05-26 14:57:59	2018-05-23 11:30:12
2	6	Vacation Planners	Consumers reading reviews of vacation destinat...	2016-05-26 14:57:59	2018-05-23 11:30:13
3	8	Business News Readers	Readers of online business news content.	2016-05-26 14:57:59	2018-05-23 11:30:12
4	12	Thrift Store Shoppers	Consumers shopping online for clothing at thri...	2016-05-26 14:57:59	2018-03-16 13:14:00
...	...	...	...	...	...
366	47850	Asheville Trip Planners	People researching attractions and accommodati...	2019-03-15 22:00:02	2019-03-21 15:33:09
367	48154	Elite Cycling Gear Shoppers	Consumers researching and shopping for elite c...	2019-03-21 22:00:00	2019-03-22 15:31:46
368	48465	HGTV Enthusiasts	People interested in HGTV shows and home remod...	2019-03-27 22:00:01	2019-04-02 16:17:29
369	50860	Food Delivery Service Users	Users of online food delivery services.	2019-04-23 18:00:02	2019-04-24 18:30:04
370	51119	Skin Disorder Researchers	People reading news and advice on preventing a...	2019-04-26 18:00:00	2019-04-29 14:20:04

371 rows × 5 columns

Out[3]:

	record_month	record_year	month_year	interest_id	composition	index_value	ranking	percentile_ranking
0	7	2018	07-2018	32486	11.89	6.19	1	99.86
1	7	2018	07-2018	18923	10.85	5.29	3	99.59
2	7	2018	07-2018	6110	11.57	4.79	11	98.49
3	7	2018	07-2018	4	13.97	4.53	14	98.08
4	7	2018	07-2018	17	7.89	4.15	19	97.39
...	...	...	...	...	...	...	...	...
4307	<NA>	<NA>	<NA>	<NA>	1.54	0.77	1181	1.09
4308	<NA>	<NA>	<NA>	<NA>	1.94	0.77	1181	1.09
4309	<NA>	<NA>	<NA>	<NA>	1.62	0.74	1187	0.59
4310	<NA>	<NA>	<NA>	<NA>	1.62	0.68	1191	0.25
4311	<NA>	<NA>	<NA>	<NA>	1.64	0.62	1194	0.00

4312 rows × 8 columns

Duplicate Check¶

Group by all columns and count the occurrences, while excluding rows with NULL values to simplify the grouping process.

Interest Metrics¶

In [4]:

Copied!





query = """ 
WITH grouped_counts AS (
  SELECT
    record_month,
    record_year,
    month_year,
    interest_id,
    composition,
    index_value,
    ranking,
    percentile_ranking,
    COUNT(*) AS freq
  FROM
    fresh_segments.interest_metrics
  WHERE
    record_month IS NOT NULL
    AND record_year IS NOT NULL
    AND month_year IS NOT NULL
    AND interest_id IS NOT NULL
  GROUP BY
    record_month,
    record_year,
    month_year,
    interest_id,
    composition,
    index_value,
    ranking,
    percentile_ranking
)
SELECT
  *
FROM
  grouped_counts
WHERE
  freq > 1
ORDER BY
  freq DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
WITH grouped_counts AS (
  SELECT
    record_month,
    record_year,
    month_year,
    interest_id,
    composition,
    index_value,
    ranking,
    percentile_ranking,
    COUNT(*) AS freq
  FROM
    fresh_segments.interest_metrics
  WHERE
    record_month IS NOT NULL
    AND record_year IS NOT NULL
    AND month_year IS NOT NULL
    AND interest_id IS NOT NULL
  GROUP BY
    record_month,
    record_year,
    month_year,
    interest_id,
    composition,
    index_value,
    ranking,
    percentile_ranking
)
SELECT
  *
FROM
  grouped_counts
WHERE
  freq > 1
ORDER BY
  freq DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[4]:

	record_month	record_year	month_year	interest_id	composition	index_value	ranking	percentile_ranking	freq

No duplicate are identified in the table.

NULL Check¶

Number of NULL values in each column.

In [5]:

Copied!





query = """ 
SELECT
    SUM(CASE WHEN record_month IS NULL THEN 1 ELSE 0 END) AS record_month_null,
    SUM(CASE WHEN record_year IS NULL THEN 1 ELSE 0 END) AS record_year_null,
    SUM(CASE WHEN month_year IS NULL THEN 1 ELSE 0 END) AS month_year_null,
    SUM(CASE WHEN interest_id IS NULL THEN 1 ELSE 0 END) AS interest_id_null,
    SUM(CASE WHEN composition IS NULL THEN 1 ELSE 0 END) AS composition_null,
    SUM(CASE WHEN index_value IS NULL THEN 1 ELSE 0 END) AS index_value_null,
    SUM(CASE WHEN ranking IS NULL THEN 1 ELSE 0 END) AS ranking_null,
    SUM(CASE WHEN percentile_ranking IS NULL THEN 1 ELSE 0 END) AS percentile_ranking_null
FROM
    fresh_segments.interest_metrics;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    SUM(CASE WHEN record_month IS NULL THEN 1 ELSE 0 END) AS record_month_null,
    SUM(CASE WHEN record_year IS NULL THEN 1 ELSE 0 END) AS record_year_null,
    SUM(CASE WHEN month_year IS NULL THEN 1 ELSE 0 END) AS month_year_null,
    SUM(CASE WHEN interest_id IS NULL THEN 1 ELSE 0 END) AS interest_id_null,
    SUM(CASE WHEN composition IS NULL THEN 1 ELSE 0 END) AS composition_null,
    SUM(CASE WHEN index_value IS NULL THEN 1 ELSE 0 END) AS index_value_null,
    SUM(CASE WHEN ranking IS NULL THEN 1 ELSE 0 END) AS ranking_null,
    SUM(CASE WHEN percentile_ranking IS NULL THEN 1 ELSE 0 END) AS percentile_ranking_null
FROM
    fresh_segments.interest_metrics;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[5]:

	record_month_null	record_year_null	month_year_null	interest_id_null	composition_null	index_value_null	ranking_null	percentile_ranking_null
0	1194	1194	1194	1193	0	0	0	0

EDA & Data Preprocessing¶

Q1 & Q3¶

There are two preprocessing steps that can be helpful:

Update the fresh_segments.interest_metrics table by modifying the month_year column to be a DATE data type with the start of the month as the day.
Handling the null values in the fresh_segments.interest_metrics table.

First, check if there are any records with non-missing interest ID but missing month_year value:

In [6]:

Copied!





query = """ 
SELECT 
    interest_id
FROM
    fresh_segments.interest_metrics
WHERE
    1 = 1
    AND month_year IS NULL
    AND interest_id IS NOT NULL;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT 
    interest_id
FROM
    fresh_segments.interest_metrics
WHERE
    1 = 1
    AND month_year IS NULL
    AND interest_id IS NOT NULL;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[6]:

	interest_id
0	21246

Missing Both ID & Date¶

For all records that have both missing month_year and interest_id values in the fresh_segments.interest_metrics table, we unfortunately cannot make any good business interpretations on their interest metrics since we do not know what interest they are associated with.

Missing Date¶

For the record where the interest_id is known (specifically interest_id = 21246) but the month_year value is missing, we want to determine if there are other records in the dataset with the same interest_id that have a month_year value that is not missing. This allows us to potentially infer or cross-reference information about the missing month_year value.

In [7]:

Copied!





query = """ 
SELECT
    *
FROM
    fresh_segments.interest_metrics
WHERE
    interest_id = 21246;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    *
FROM
    fresh_segments.interest_metrics
WHERE
    interest_id = 21246;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[7]:

	record_month	record_year	month_year	interest_id	composition	index_value	ranking	percentile_ranking
0	7	2018	07-2018	21246	2.26	0.65	722	0.96
1	8	2018	08-2018	21246	2.13	0.59	765	0.26
2	9	2018	09-2018	21246	2.06	0.61	774	0.77
3	10	2018	10-2018	21246	1.74	0.58	855	0.23
4	11	2018	11-2018	21246	2.25	0.78	908	2.16
5	12	2018	12-2018	21246	1.97	0.70	983	1.21
6	1	2019	01-2019	21246	2.05	0.76	954	1.95
7	2	2019	02-2019	21246	1.84	0.68	1109	1.07
8	3	2019	03-2019	21246	1.75	0.67	1123	1.14
9	4	2019	04-2019	21246	1.58	0.63	1092	0.64
10	<NA>	<NA>	<NA>	21246	1.61	0.68	1191	0.25

In this case, the options are:

Impute the missing month_year value with the month_year value of the lateset record with the same interest_id value
Impute the missing month_year value with the created_at value from the fresh_segments.interest_map table
Drop the record

The first two options run the risk of introducing bias into the data, but dropping the record may result in a loss of valuable information. Given that there is only one record with a missing month_year value, we can argue that the risk of losing valuable information is low relative to the risk of introducing bias into the data by imputing date information.

In [8]:

Copied!





# New ctas table with modified column and dropping all records with missing date information
query = """ 
SELECT
    record_month,
    record_year,
    CASE WHEN month_year IS NULL THEN NULL ELSE DATE_PARSE(CONCAT_WS('-', '01', month_year), '%d-%m-%Y') END AS month_year,
    interest_id,
    composition,
    index_value,
    ranking,
    percentile_ranking
FROM
    fresh_segments.interest_metrics
WHERE
    month_year IS NOT NULL;
"""

athena.create_ctas_table(
    database=database,
    query=query,
    ctas_table="interest_metrics_modified",
    s3_output="s3://sql-case-studies/fresh_segments/interest_metrics/",
    storage_format=storage_format,
    write_compression=write_compression,
    wait=wait,
)

# Drop old table
athena.drop_table(database=database, table="interest_metrics")
# New ctas table with modified column and dropping all records with missing date information
query = """ 
SELECT
    record_month,
    record_year,
    CASE WHEN month_year IS NULL THEN NULL ELSE DATE_PARSE(CONCAT_WS('-', '01', month_year), '%d-%m-%Y') END AS month_year,
    interest_id,
    composition,
    index_value,
    ranking,
    percentile_ranking
FROM
    fresh_segments.interest_metrics
WHERE
    month_year IS NOT NULL;
"""

athena.create_ctas_table(
    database=database,
    query=query,
    ctas_table="interest_metrics_modified",
    s3_output="s3://sql-case-studies/fresh_segments/interest_metrics/",
    storage_format=storage_format,
    write_compression=write_compression,
    wait=wait,
)

# Drop old table
athena.drop_table(database=database, table="interest_metrics")

Out[8]:

{'ctas_database': 'fresh_segments',
 'ctas_table': 'interest_metrics_modified',
 'ctas_query_metadata': _QueryMetadata(execution_id='349c49e1-5136-4e2d-baad-7e7764a0c489', dtype={'rows': 'Int64'}, parse_timestamps=[], parse_dates=[], parse_geometry=[], converters={}, binaries=[], output_location='s3://sql-case-studies/fresh_segments/interest_metrics/tables/349c49e1-5136-4e2d-baad-7e7764a0c489', manifest_location='s3://sql-case-studies/fresh_segments/interest_metrics/tables/349c49e1-5136-4e2d-baad-7e7764a0c489-manifest.csv', raw_payload={'QueryExecutionId': '349c49e1-5136-4e2d-baad-7e7764a0c489', 'Query': 'CREATE TABLE "fresh_segments"."interest_metrics_modified"\nWITH(\n    external_location = \'s3://sql-case-studies/fresh_segments/interest_metrics/interest_metrics_modified\',\n    write_compression = \'SNAPPY\',\n    format = \'PARQUET\')\nAS  \nSELECT\n    record_month,\n    record_year,\n    CASE WHEN month_year IS NULL THEN NULL ELSE DATE_PARSE(CONCAT_WS(\'-\', \'01\', month_year), \'%d-%m-%Y\') END AS month_year,\n    interest_id,\n    composition,\n    index_value,\n    ranking,\n    percentile_ranking\nFROM\n    fresh_segments.interest_metrics\nWHERE\n    month_year IS NOT NULL', 'StatementType': 'DDL', 'ResultConfiguration': {'OutputLocation': 's3://sql-case-studies/fresh_segments/interest_metrics/tables/349c49e1-5136-4e2d-baad-7e7764a0c489'}, 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}}, 'QueryExecutionContext': {'Database': 'fresh_segments'}, 'Status': {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2025, 1, 20, 15, 8, 51, 96000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2025, 1, 20, 15, 8, 52, 488000, tzinfo=tzlocal())}, 'Statistics': {'EngineExecutionTimeInMillis': 1199, 'DataScannedInBytes': 95524, 'DataManifestLocation': 's3://sql-case-studies/fresh_segments/interest_metrics/tables/349c49e1-5136-4e2d-baad-7e7764a0c489-manifest.csv', 'TotalExecutionTimeInMillis': 1392, 'QueryQueueTimeInMillis': 70, 'ServicePreProcessingTimeInMillis': 96, 'QueryPlanningTimeInMillis': 146, 'ServiceProcessingTimeInMillis': 27, 'ResultReuseInformation': {'ReusedPreviousResult': False}}, 'WorkGroup': 'primary', 'EngineVersion': {'SelectedEngineVersion': 'AUTO', 'EffectiveEngineVersion': 'Athena engine version 3'}, 'SubstatementType': 'CREATE_TABLE_AS_SELECT'})}

Query executed successfully

In [9]:

Copied!





# New ctas table with original name
query = """ 
SELECT
    *
FROM
    fresh_segments.interest_metrics_modified;
"""

athena.create_ctas_table(
    database=database,
    query=query,
    ctas_table="interest_metrics",
    s3_output="s3://sql-case-studies/fresh_segments/interest_metrics/",
    storage_format=storage_format,
    write_compression=write_compression,
    wait=wait,
)

# Drop temporary ctas table
athena.drop_table(database=database, table="interest_metrics_modified")
# New ctas table with original name
query = """ 
SELECT
    *
FROM
    fresh_segments.interest_metrics_modified;
"""

athena.create_ctas_table(
    database=database,
    query=query,
    ctas_table="interest_metrics",
    s3_output="s3://sql-case-studies/fresh_segments/interest_metrics/",
    storage_format=storage_format,
    write_compression=write_compression,
    wait=wait,
)

# Drop temporary ctas table
athena.drop_table(database=database, table="interest_metrics_modified")

Out[9]:

{'ctas_database': 'fresh_segments',
 'ctas_table': 'interest_metrics',
 'ctas_query_metadata': _QueryMetadata(execution_id='cc381175-9e88-4338-8acc-44814a4465b5', dtype={'rows': 'Int64'}, parse_timestamps=[], parse_dates=[], parse_geometry=[], converters={}, binaries=[], output_location='s3://sql-case-studies/fresh_segments/interest_metrics/tables/cc381175-9e88-4338-8acc-44814a4465b5', manifest_location='s3://sql-case-studies/fresh_segments/interest_metrics/tables/cc381175-9e88-4338-8acc-44814a4465b5-manifest.csv', raw_payload={'QueryExecutionId': 'cc381175-9e88-4338-8acc-44814a4465b5', 'Query': 'CREATE TABLE "fresh_segments"."interest_metrics"\nWITH(\n    external_location = \'s3://sql-case-studies/fresh_segments/interest_metrics/interest_metrics\',\n    write_compression = \'SNAPPY\',\n    format = \'PARQUET\')\nAS  \nSELECT\n    *\nFROM\n    fresh_segments.interest_metrics_modified', 'StatementType': 'DDL', 'ResultConfiguration': {'OutputLocation': 's3://sql-case-studies/fresh_segments/interest_metrics/tables/cc381175-9e88-4338-8acc-44814a4465b5'}, 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}}, 'QueryExecutionContext': {'Database': 'fresh_segments'}, 'Status': {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2025, 1, 20, 15, 8, 54, 384000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2025, 1, 20, 15, 8, 55, 549000, tzinfo=tzlocal())}, 'Statistics': {'EngineExecutionTimeInMillis': 988, 'DataScannedInBytes': 85403, 'DataManifestLocation': 's3://sql-case-studies/fresh_segments/interest_metrics/tables/cc381175-9e88-4338-8acc-44814a4465b5-manifest.csv', 'TotalExecutionTimeInMillis': 1165, 'QueryQueueTimeInMillis': 57, 'ServicePreProcessingTimeInMillis': 105, 'QueryPlanningTimeInMillis': 131, 'ServiceProcessingTimeInMillis': 15, 'ResultReuseInformation': {'ReusedPreviousResult': False}}, 'WorkGroup': 'primary', 'EngineVersion': {'SelectedEngineVersion': 'AUTO', 'EffectiveEngineVersion': 'Athena engine version 3'}, 'SubstatementType': 'CREATE_TABLE_AS_SELECT'})}

Query executed successfully

In [10]:

Copied!





query = """ 
SELECT
    * 
FROM
    fresh_segments.interest_metrics TABLESAMPLE BERNOULLI(30);
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    * 
FROM
    fresh_segments.interest_metrics TABLESAMPLE BERNOULLI(30);
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[10]:

	record_month	record_year	month_year	interest_id	composition	index_value	ranking	percentile_ranking
0	7	2018	2018-07-01	6344	10.32	5.10	4	99.45
1	7	2018	2018-07-01	69	10.82	5.03	6	99.18
2	7	2018	2018-07-01	17	7.89	4.15	19	97.39
3	7	2018	2018-07-01	6286	14.10	3.82	25	96.57
4	7	2018	2018-07-01	6184	13.35	3.75	29	96.02
...	...	...	...	...	...	...	...	...
3925	8	2019	2019-08-01	33957	1.74	0.93	1130	1.65
3926	8	2019	2019-08-01	136	1.74	0.93	1130	1.65
3927	8	2019	2019-08-01	16198	1.82	0.90	1139	0.87
3928	8	2019	2019-08-01	15884	1.68	0.86	1144	0.44
3929	8	2019	2019-08-01	6225	2.15	0.86	1144	0.44

3930 rows × 8 columns

In [11]:

Copied!





query = """ 
SELECT
    COUNT(*) AS record_count
FROM
    fresh_segments.interest_metrics;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    COUNT(*) AS record_count
FROM
    fresh_segments.interest_metrics;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[11]:

	record_count
0	13079

Q2¶

What is count of records in the fresh_segments.interest_metrics table for each month_year value sorted in chronological order (earliest to latest) with the null values appearing first?

In [12]:

Copied!





query = """ 
SELECT
    month_year,
    COUNT(DISTINCT interest_id) AS unique_interest_count
FROM
    fresh_segments.interest_metrics
GROUP BY
    month_year
ORDER BY
    month_year ASC NULLS FIRST;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    month_year,
    COUNT(DISTINCT interest_id) AS unique_interest_count
FROM
    fresh_segments.interest_metrics
GROUP BY
    month_year
ORDER BY
    month_year ASC NULLS FIRST;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[12]:

	month_year	unique_interest_count
0	2018-07-01	729
1	2018-08-01	767
2	2018-09-01	780
3	2018-10-01	857
4	2018-11-01	928
5	2018-12-01	995
6	2019-01-01	973
7	2019-02-01	1121
8	2019-03-01	1136
9	2019-04-01	1099
10	2019-05-01	857
11	2019-06-01	824
12	2019-07-01	864
13	2019-08-01	1149

Q4¶

How many interest_id values exist in the fresh_segments.interest_metrics table but not in the fresh_segments.interest_map table? What about the other way around?

Let A be the set of interest_id values in the fresh_segments.interest_metrics table and B be the set of interest_id values in the fresh_segments.interest_map table.

A \ B & B \ A¶

We can use a full outer join, which returns all rows from both tables regardless of whether there is a match or not:

In [13]:

Copied!





query = """ 
SELECT
    COUNT(DISTINCT met.interest_id) AS metrics_id_unique_count,
    COUNT(DISTINCT map.id) AS map_id_unique_count,
    SUM(CASE WHEN met.interest_id IS NULL AND map.id IS NOT NULL THEN 1 ELSE 0 END) AS in_map_not_in_metrics_count,
    SUM(CASE WHEN met.interest_id IS NOT NULL AND map.id IS NULL THEN 1 ELSE 0 END) AS in_metrics_not_in_map_count
FROM
    fresh_segments.interest_metrics AS met 
        FULL JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    COUNT(DISTINCT met.interest_id) AS metrics_id_unique_count,
    COUNT(DISTINCT map.id) AS map_id_unique_count,
    SUM(CASE WHEN met.interest_id IS NULL AND map.id IS NOT NULL THEN 1 ELSE 0 END) AS in_map_not_in_metrics_count,
    SUM(CASE WHEN met.interest_id IS NOT NULL AND map.id IS NULL THEN 1 ELSE 0 END) AS in_metrics_not_in_map_count
FROM
    fresh_segments.interest_metrics AS met 
        FULL JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[13]:

	metrics_id_unique_count	map_id_unique_count	in_map_not_in_metrics_count	in_metrics_not_in_map_count
0	1202	1209	7	0

All IDs in interest_metrics are also in interest_map: This means that there are no interest IDs in interest_metrics that are missing from interest_map. In set notation:

interest_metrics ∖ interest_map = ∅ (empty set)

Some IDs in interest_map are not in interest_metrics: There are 7 interest IDs that are found in interest_map but do not appear in interest_metrics. In set notation:

interest_map ∖ interest_metrics = {7 IDs}`

Q5¶

Summarise the id values in the fresh_segments.interest_map by its total record count in this table.

In [16]:

Copied!





query = """ 
SELECT
    COUNT(id) AS raw_count
FROM
    fresh_segments.interest_map;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    COUNT(id) AS raw_count
FROM
    fresh_segments.interest_map;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[16]:

	raw_count
0	1209

Q6¶

What sort of table join would be most appropriate to combine the fresh_segments.interest_metrics and fresh_segments.interest_map tables?

A left join of the interest_map table onto interest_metrics is most appropriate because the asymmetric set difference shows that all interest_id values in interest_metrics are present in interest_map.

This ensures all rows from interest_metrics are retained, while relevant data from interest_map is included without unnecessary rows from interest_map that have no match in interest_metrics.

In [17]:

Copied!





query = """ 
SELECT
    met.interest_id,
    met.record_month,
    met.record_year,
    met.month_year,
    met.composition,
    met.index_value,
    met.ranking,
    met.percentile_ranking,
    map.interest_name,
    map.interest_summary,
    map.created_at,
    map.last_modified
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
    met.interest_id = 21246;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    met.interest_id,
    met.record_month,
    met.record_year,
    met.month_year,
    met.composition,
    met.index_value,
    met.ranking,
    met.percentile_ranking,
    map.interest_name,
    map.interest_summary,
    map.created_at,
    map.last_modified
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
    met.interest_id = 21246;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[17]:

	interest_id	record_month	record_year	month_year	composition	index_value	ranking	percentile_ranking	interest_name	interest_summary	created_at	last_modified
0	21246	7	2018	2018-07-01	2.26	0.65	722	0.96	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
1	21246	8	2018	2018-08-01	2.13	0.59	765	0.26	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
2	21246	9	2018	2018-09-01	2.06	0.61	774	0.77	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
3	21246	10	2018	2018-10-01	1.74	0.58	855	0.23	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
4	21246	11	2018	2018-11-01	2.25	0.78	908	2.16	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
5	21246	12	2018	2018-12-01	1.97	0.70	983	1.21	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
6	21246	1	2019	2019-01-01	2.05	0.76	954	1.95	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
7	21246	2	2019	2019-02-01	1.84	0.68	1109	1.07	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
8	21246	3	2019	2019-03-01	1.75	0.67	1123	1.14	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04
9	21246	4	2019	2019-04-01	1.58	0.63	1092	0.64	Readers of El Salvadoran Content	People reading news from El Salvadoran media s...	2018-06-11 17:50:04	2018-06-11 17:50:04

Q7¶

Are there any records in the joined table where the month_year value is before the created_at value from the fresh_segments.interest_map table? Are these values valid and why?

In [67]:

Copied!





query = """ 
SELECT
    met.interest_id,
    met.record_month,
    met.record_year,
    met.month_year,
    met.composition,
    met.index_value,
    met.ranking,
    met.percentile_ranking,
    map.interest_name,
    map.interest_summary,
    map.created_at,
    map.last_modified
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
    met.month_year < map.created_at;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    met.interest_id,
    met.record_month,
    met.record_year,
    met.month_year,
    met.composition,
    met.index_value,
    met.ranking,
    met.percentile_ranking,
    map.interest_name,
    map.interest_summary,
    map.created_at,
    map.last_modified
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
    met.month_year < map.created_at;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[67]:

	interest_id	record_month	record_year	month_year	composition	index_value	ranking	percentile_ranking	interest_name	interest_summary	created_at	last_modified
0	32704	7	2018	2018-07-01	8.04	2.27	225	69.14	Major Airline Customers	People visiting sites for major airline brands...	2018-07-06 14:35:04	2018-07-06 14:35:04
1	33191	7	2018	2018-07-01	3.99	2.11	283	61.18	Online Shoppers	People who spend money online	2018-07-17 10:40:03	2018-07-17 10:46:58
2	32703	7	2018	2018-07-01	5.53	1.80	375	48.56	School Supply Shoppers	Consumers shopping for classroom supplies for ...	2018-07-06 14:35:04	2018-07-06 14:35:04
3	32701	7	2018	2018-07-01	4.23	1.41	483	33.74	Womens Equality Advocates	People visiting sites advocating for womens eq...	2018-07-06 14:35:03	2018-07-06 14:35:03
4	32705	7	2018	2018-07-01	4.38	1.34	505	30.73	Certified Events Professionals	Professionals reading industry news and resear...	2018-07-06 14:35:04	2018-07-06 14:35:04
...	...	...	...	...	...	...	...	...	...	...	...	...
183	49972	4	2019	2019-04-01	2.11	1.15	722	34.30	Horseback Riding Enthusiasts	People reading horseback riding news and resou...	2019-04-15 18:00:00	2019-04-24 17:40:04
184	49502	4	2019	2019-04-01	1.87	1.12	768	30.12	Veterinarians	Veterinarians	2019-04-08 18:00:05	2019-07-09 13:57:13
185	49974	4	2019	2019-04-01	2.00	1.10	799	27.30	Agricultural and Food Issues Researchers	People researching organizations for food and ...	2019-04-15 18:00:00	2019-04-24 17:40:04
186	49973	4	2019	2019-04-01	1.66	1.03	910	17.20	Farm Finance Researchers	People researching financial institutions spec...	2019-04-15 18:00:00	2019-04-24 17:40:04
187	51678	5	2019	2019-05-01	1.71	1.53	475	44.57	Plumbers	Professionals reading industry news and resear...	2019-05-06 22:00:00	2019-05-07 18:50:04

188 rows × 12 columns

Some month_year values are before the created_at values because the former is less granular than the latter. The month_year value is the month and year of the interest metrics, while the created_at value is the timestamp of when the interest was created. We also preprocessed the month_year to be the first day of the month, so it is expected that some month_year values will be before the created_at values.

We can verify that there are no records where the difference between the month_year and created_at values is greater than 1 month, which would suggest that there are data quality issues.

In [18]:

Copied!





query = """ 
SELECT
    met.interest_id,
    met.record_month,
    met.record_year,
    met.month_year,
    met.composition,
    met.index_value,
    met.ranking,
    met.percentile_ranking,
    map.interest_name,
    map.interest_summary,
    map.created_at,
    map.last_modified
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
    met.month_year < map.created_at
    AND DATE_DIFF('DAY', met.month_year, map.created_at) > 30;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    met.interest_id,
    met.record_month,
    met.record_year,
    met.month_year,
    met.composition,
    met.index_value,
    met.ranking,
    met.percentile_ranking,
    map.interest_name,
    map.interest_summary,
    map.created_at,
    map.last_modified
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
    met.month_year < map.created_at
    AND DATE_DIFF('DAY', met.month_year, map.created_at) > 30;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[18]:

	interest_id	record_month	record_year	month_year	composition	index_value	ranking	percentile_ranking	interest_name	interest_summary	created_at	last_modified

Another way to verfiy is to use DATE_TRUNC function to truncate the created_at values to the month level (i.e., first day of each created_at month) and compare the month_year values with the truncated created_at values.

In [19]:

Copied!





query = """ 
SELECT
    met.interest_id,
    met.record_month,
    met.record_year,
    met.month_year,
    met.composition,
    met.index_value,
    met.ranking,
    met.percentile_ranking,
    map.interest_name,
    map.interest_summary,
    map.created_at,
    map.last_modified
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
    met.month_year < DATE_TRUNC('MONTH', map.created_at);
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    met.interest_id,
    met.record_month,
    met.record_year,
    met.month_year,
    met.composition,
    met.index_value,
    met.ranking,
    met.percentile_ranking,
    map.interest_name,
    map.interest_summary,
    map.created_at,
    map.last_modified
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
    met.month_year < DATE_TRUNC('MONTH', map.created_at);
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[19]:

	interest_id	record_month	record_year	month_year	composition	index_value	ranking	percentile_ranking	interest_name	interest_summary	created_at	last_modified

Interest Analysis¶

Q1¶

Which interests have been present in all month_year dates in our dataset?

In [33]:

Copied!





query = """ 
SELECT
    map.interest_name,
    COUNT(DISTINCT met.month_year) AS total_monthS
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
    interest_name
HAVING 
    COUNT(DISTINCT met.month_year) = (
        SELECT COUNT(DISTINCT month_year)
        FROM fresh_segments.interest_metrics
    )
ORDER BY
    interest_name ASC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    map.interest_name,
    COUNT(DISTINCT met.month_year) AS total_monthS
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
    interest_name
HAVING 
    COUNT(DISTINCT met.month_year) = (
        SELECT COUNT(DISTINCT month_year)
        FROM fresh_segments.interest_metrics
    )
ORDER BY
    interest_name ASC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[33]:

	interest_name	total_monthS
0	Accounting & CPA Continuing Education Researchers	14
1	Affordable Hotel Bookers	14
2	Aftermarket Accessories Shoppers	14
3	Alabama Trip Planners	14
4	Alaskan Cruise Planners	14
...	...	...
475	World Cup Enthusiasts	14
476	Yachting Enthusiasts	14
477	Yale University Fans	14
478	Yogis	14
479	Zoo Visitors	14

480 rows × 2 columns

In [38]:

Copied!





query = """ 
WITH count_month_year AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months
    FROM
        fresh_segments.interest_metrics
    GROUP BY
        interest_id
)
SELECT
    total_months,
    COUNT(DISTINCT interest_id) AS interest_counts
FROM
    count_month_year
GROUP BY
    total_months
ORDER BY
    total_months DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
WITH count_month_year AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months
    FROM
        fresh_segments.interest_metrics
    GROUP BY
        interest_id
)
SELECT
    total_months,
    COUNT(DISTINCT interest_id) AS interest_counts
FROM
    count_month_year
GROUP BY
    total_months
ORDER BY
    total_months DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[38]:

	total_months	interest_counts
0	14	480
1	13	82
2	12	65
3	11	94
4	10	86
5	9	95
6	8	67
7	7	90
8	6	33
9	5	38
10	4	32
11	3	15
12	2	12
13	1	13

Q2¶

Using this same total_months measure, calculate the cumulative percentage of all records starting at 14 months - which total_months value passes the $90\%$ cumulative percentage value?

In [51]:

Copied!





query = """ 
WITH count_month_year AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months
    FROM
        fresh_segments.interest_metrics
    GROUP BY
        interest_id
),

count_interest_by_total_months AS (
    SELECT
        total_months,
        COUNT(DISTINCT interest_id) AS interest_counts
    FROM
        count_month_year
    GROUP BY
        total_months
)

SELECT
    total_months,
    interest_counts,
    ROUND(
        100.0 * SUM(interest_counts) OVER (
            ORDER BY total_months DESC 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) / SUM(interest_counts) OVER(),
        2
    ) AS cumulative_percentage
FROM
    count_interest_by_total_months
ORDER BY
    total_months DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
WITH count_month_year AS (
    SELECT
        interest_id,
        COUNT(DISTINCT month_year) AS total_months
    FROM
        fresh_segments.interest_metrics
    GROUP BY
        interest_id
),

count_interest_by_total_months AS (
    SELECT
        total_months,
        COUNT(DISTINCT interest_id) AS interest_counts
    FROM
        count_month_year
    GROUP BY
        total_months
)

SELECT
    total_months,
    interest_counts,
    ROUND(
        100.0 * SUM(interest_counts) OVER (
            ORDER BY total_months DESC 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) / SUM(interest_counts) OVER(),
        2
    ) AS cumulative_percentage
FROM
    count_interest_by_total_months
ORDER BY
    total_months DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[51]:

	total_months	interest_counts	cumulative_percentage
0	14	480	39.93
1	13	82	46.76
2	12	65	52.16
3	11	94	59.98
4	10	86	67.14
5	9	95	75.04
6	8	67	80.62
7	7	90	88.10
8	6	33	90.85
9	5	38	94.01
10	4	32	96.67
11	3	15	97.92
12	2	12	98.92
13	1	13	100.00

There are 90 distinct interests with data available for at least 6 of the total month_year values in the dataset. This corresponds to the point where the cumulative percentage surpasses $90\%$.

This indicates that the majority (i.e., $90\%$) of the interests in the dataset have data coverage across fewer than 6 months, emphasizing a significant drop-off in the number of interests with longer-term data availability since the max total_months value is 14.

Q3¶

If we were to remove all interest_id values which are lower than the total_months value we found in the previous question - how many total data points would we be removing?

In [66]:

Copied!





query = """ 
WITH interest_ids_to_remove AS (
    SELECT
        interest_id
    FROM
        fresh_segments.interest_metrics
    GROUP BY
        interest_id
    HAVING 
        COUNT(DISTINCT month_year) < 6
)

SELECT
    COUNT(met.interest_id) AS remove_row_count
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN interest_ids_to_remove AS remove ON met.interest_id = remove.interest_id
WHERE
    1 = 1
    AND remove.interest_id IS NOT NULL;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
WITH interest_ids_to_remove AS (
    SELECT
        interest_id
    FROM
        fresh_segments.interest_metrics
    GROUP BY
        interest_id
    HAVING 
        COUNT(DISTINCT month_year) < 6
)

SELECT
    COUNT(met.interest_id) AS remove_row_count
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN interest_ids_to_remove AS remove ON met.interest_id = remove.interest_id
WHERE
    1 = 1
    AND remove.interest_id IS NOT NULL;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[66]:

	remove_row_count
0	400

Q4¶

Does it make sense to remove these data points from a business perspective? Consider an example where an interest has data for all 14 months compared to one with significantly fewer months, and evaluate what having fewer months of data implies from a segmentation perspective. Additionally, if we include all interests regardless of the number of months they appear, how many unique interests are represented in each month?

In [71]:

Copied!





query = """ 
SELECT
    month_year,
    COUNT(DISTINCT interest_id) AS num_unique_interest
FROM
    fresh_segments.interest_metrics
GROUP BY
    month_year
ORDER BY
    num_unique_interest DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    month_year,
    COUNT(DISTINCT interest_id) AS num_unique_interest
FROM
    fresh_segments.interest_metrics
GROUP BY
    month_year
ORDER BY
    num_unique_interest DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[71]:

	month_year	num_unique_interest
0	2019-08-01	1149
1	2019-03-01	1136
2	2019-02-01	1121
3	2019-04-01	1099
4	2018-12-01	995
5	2019-01-01	973
6	2018-11-01	928
7	2019-07-01	864
8	2018-10-01	857
9	2019-05-01	857
10	2019-06-01	824
11	2018-09-01	780
12	2018-08-01	767
13	2018-07-01	729

Metrics computed for interests with limited availability during the sampling period may lead to less reliable or, even worse, biased results. Furthermore, conducting statistical hypothesis testing on smaller sample sizes may reduce the statistical power, making it more challenging to detect true effects.

Segment Analysis¶

Q1¶

Using the complete dataset - which are the top 10 and bottom 10 interests that have the largest composition values in any month_year? Only use the maximum composition value for each interest but keep the corresponding month_year.

In [32]:

Copied!





query = """ 
WITH compositions AS (
    SELECT
        map.interest_name,
        met.month_year,
        met.composition,
        DENSE_RANK() OVER (
            PARTITION BY map.interest_name
            ORDER BY met.composition DESC
        ) AS comp_ranks
    FROM
        fresh_segments.interest_metrics AS met
    LEFT JOIN 
        fresh_segments.interest_map AS map 
    ON 
        met.interest_id = map.id
),

top_10 AS (
    SELECT
        interest_name,
        month_year,
        composition,
        'top_10' AS rank
    FROM
        compositions
    WHERE
        comp_ranks = 1
    ORDER BY
        composition DESC
    LIMIT 
        10
),

bottom_10 AS (
    SELECT
        interest_name,
        month_year,
        composition,
        'bottom_10' AS rank
    FROM
        compositions
    WHERE
        comp_ranks = 1
    ORDER BY 
        composition ASC
    LIMIT 
        10
),

row_bind AS (
    SELECT * FROM top_10
    UNION
    SELECT * FROM bottom_10
)
SELECT
    *
FROM
    row_bind
ORDER BY 
    rank DESC,
    composition DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
WITH compositions AS (
    SELECT
        map.interest_name,
        met.month_year,
        met.composition,
        DENSE_RANK() OVER (
            PARTITION BY map.interest_name
            ORDER BY met.composition DESC
        ) AS comp_ranks
    FROM
        fresh_segments.interest_metrics AS met
    LEFT JOIN 
        fresh_segments.interest_map AS map 
    ON 
        met.interest_id = map.id
),

top_10 AS (
    SELECT
        interest_name,
        month_year,
        composition,
        'top_10' AS rank
    FROM
        compositions
    WHERE
        comp_ranks = 1
    ORDER BY
        composition DESC
    LIMIT 
        10
),

bottom_10 AS (
    SELECT
        interest_name,
        month_year,
        composition,
        'bottom_10' AS rank
    FROM
        compositions
    WHERE
        comp_ranks = 1
    ORDER BY 
        composition ASC
    LIMIT 
        10
),

row_bind AS (
    SELECT * FROM top_10
    UNION
    SELECT * FROM bottom_10
)
SELECT
    *
FROM
    row_bind
ORDER BY 
    rank DESC,
    composition DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[32]:

	interest_name	month_year	composition	rank
0	Work Comes First Travelers	2018-12-01	21.20	top_10
1	Gym Equipment Owners	2018-07-01	18.82	top_10
2	Furniture Shoppers	2018-07-01	17.44	top_10
3	Luxury Retail Shoppers	2018-07-01	17.19	top_10
4	Luxury Boutique Hotel Researchers	2018-10-01	15.15	top_10
5	Luxury Bedding Shoppers	2018-12-01	15.05	top_10
6	Shoe Shoppers	2018-07-01	14.91	top_10
7	Cosmetics and Beauty Shoppers	2018-07-01	14.23	top_10
8	Luxury Hotel Guests	2018-07-01	14.10	top_10
9	Luxury Retail Researchers	2018-07-01	13.97	top_10
10	Readers of Jamaican Content	2018-07-01	1.86	bottom_10
11	Automotive News Readers	2019-02-01	1.84	bottom_10
12	Comedy Fans	2018-07-01	1.83	bottom_10
13	World of Warcraft Enthusiasts	2019-08-01	1.82	bottom_10
14	Miami Heat Fans	2018-08-01	1.81	bottom_10
15	Online Role Playing Game Enthusiasts	2018-07-01	1.73	bottom_10
16	Hearthstone Video Game Fans	2019-08-01	1.66	bottom_10
17	Scifi Movie and TV Enthusiasts	2018-09-01	1.61	bottom_10
18	Action Movie and TV Enthusiasts	2018-09-01	1.59	bottom_10
19	The Sims Video Game Fans	2019-03-01	1.57	bottom_10

Q2¶

Which 5 interests had the lowest average ranking value?

In [52]:

Copied!





query = """ 
SELECT
    map.interest_name,
    AVG(met.ranking) AS avg_ranking,
    COUNT(*) AS n
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
    map.interest_name
ORDER BY
    avg_ranking DESC
LIMIT
    5;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    map.interest_name,
    AVG(met.ranking) AS avg_ranking,
    COUNT(*) AS n
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
    map.interest_name
ORDER BY
    avg_ranking DESC
LIMIT
    5;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[52]:

	interest_name	avg_ranking	n
0	Hearthstone Video Game Fans	1141.0	1
1	The Sims Video Game Fans	1135.0	1
2	Grand Theft Auto Video Game Fans	1110.0	2
3	Hair Color Shoppers	1110.0	4
4	Bigfoot Folklore Enthusiasts	1078.0	2

Q3¶

Which 5 interests had the largest standard deviation in their percentile_ranking value?

In [54]:

Copied!





query = """ 
SELECT
    met.interest_id,
    map.interest_name,
    STDDEV(met.percentile_ranking) AS pct_ranking_std,
    MIN(met.percentile_ranking) AS pct_ranking_min,
    APPROX_PERCENTILE(met.percentile_ranking, 0.25) AS pct_ranking_25,
    APPROX_PERCENTILE(met.percentile_ranking, 0.50) AS pct_ranking_50,
    APPROX_PERCENTILE(met.percentile_ranking, 0.75) AS pct_ranking_75,
    MAX(met.percentile_ranking) AS pct_ranking_max,
    COUNT(*) AS n
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
    met.interest_id,
    map.interest_name
ORDER BY
    pct_ranking_std DESC
LIMIT
    5;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    met.interest_id,
    map.interest_name,
    STDDEV(met.percentile_ranking) AS pct_ranking_std,
    MIN(met.percentile_ranking) AS pct_ranking_min,
    APPROX_PERCENTILE(met.percentile_ranking, 0.25) AS pct_ranking_25,
    APPROX_PERCENTILE(met.percentile_ranking, 0.50) AS pct_ranking_50,
    APPROX_PERCENTILE(met.percentile_ranking, 0.75) AS pct_ranking_75,
    MAX(met.percentile_ranking) AS pct_ranking_max,
    COUNT(*) AS n
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
    met.interest_id,
    map.interest_name
ORDER BY
    pct_ranking_std DESC
LIMIT
    5;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[54]:

	interest_id	interest_name	pct_ranking_std	pct_ranking_min	pct_ranking_25	pct_ranking_50	pct_ranking_75	pct_ranking_max	n
0	6260	Blockbuster Movie Fans	41.273823	2.26	2.26	60.63	60.63	60.63	2
1	131	Android Fans	30.720768	4.84	4.96	5.62	10.82	75.03	5
2	150	TV Junkies	30.363975	10.01	37.29	37.79	49.82	93.28	5
3	23	Techies	30.175047	7.92	9.46	23.85	30.90	86.69	6
4	20764	Entertainment Industry Decision Makers	28.974920	11.23	11.53	18.67	22.12	86.15	6

Q4¶

For the 5 interests found in the previous question - what was minimum and maximum percentile_ranking values for each interest and its corresponding year_month value? Can you describe what is happening for these 5 interests?

In [72]:

Copied!





query = """ 
WITH ranked_data AS (
    SELECT
        interest_id,
        month_year,
        composition,
        ranking,
        percentile_ranking,
        DENSE_RANK() OVER (PARTITION BY interest_id ORDER BY percentile_ranking DESC) AS percentile_ranking_max,
        DENSE_RANK() OVER (PARTITION BY interest_id ORDER BY percentile_ranking ASC) AS percentile_ranking_min
    FROM
        fresh_segments.interest_metrics
    WHERE
        1 = 1
        AND interest_id IN (
            SELECT
                interest_id
            FROM
                fresh_segments.interest_metrics
            GROUP BY
                interest_id
            ORDER BY
                STDDEV(percentile_ranking) DESC
            LIMIT
                5
        )
)
SELECT
    interest_id,
    month_year,
    composition,
    ranking,
    percentile_ranking
FROM
    ranked_data
WHERE
    1 = 1
    AND (
        percentile_ranking_max = 1
        OR percentile_ranking_min = 1
    );
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
WITH ranked_data AS (
    SELECT
        interest_id,
        month_year,
        composition,
        ranking,
        percentile_ranking,
        DENSE_RANK() OVER (PARTITION BY interest_id ORDER BY percentile_ranking DESC) AS percentile_ranking_max,
        DENSE_RANK() OVER (PARTITION BY interest_id ORDER BY percentile_ranking ASC) AS percentile_ranking_min
    FROM
        fresh_segments.interest_metrics
    WHERE
        1 = 1
        AND interest_id IN (
            SELECT
                interest_id
            FROM
                fresh_segments.interest_metrics
            GROUP BY
                interest_id
            ORDER BY
                STDDEV(percentile_ranking) DESC
            LIMIT
                5
        )
)
SELECT
    interest_id,
    month_year,
    composition,
    ranking,
    percentile_ranking
FROM
    ranked_data
WHERE
    1 = 1
    AND (
        percentile_ranking_max = 1
        OR percentile_ranking_min = 1
    );
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[72]:

	interest_id	month_year	composition	ranking	percentile_ranking
0	6260	2018-07-01	5.27	287	60.63
1	6260	2019-08-01	1.83	1123	2.26
2	131	2018-07-01	5.09	182	75.03
3	131	2019-03-01	1.72	1081	4.84
4	150	2018-07-01	5.30	49	93.28
5	150	2019-08-01	1.94	1034	10.01
6	23	2018-07-01	5.41	97	86.69
7	23	2019-08-01	1.90	1058	7.92
8	20764	2018-07-01	5.85	101	86.15
9	20764	2019-08-01	1.91	1020	11.23

One clear pattern emerges: the maximum percentile ranking values for these five interests all occur in July 2018, while the minimum percentile values, with one exception, are observed in August 2019.

Q5¶

How would you describe our customers in this segment based off their composition and ranking values? What sort of products or services should we show to these customers and what should we avoid?

Top & Bottom Interests By Composition Over Time¶

In [18]:

Copied!





query = """ 
WITH compositions AS (
    SELECT
        map.interest_name,
        composition,
        DENSE_RANK() OVER (
            PARTITION BY map.interest_name
            ORDER BY met.composition DESC
        ) AS comp_ranks
    FROM
        fresh_segments.interest_metrics AS met
    LEFT JOIN 
        fresh_segments.interest_map AS map 
    ON 
        met.interest_id = map.id
),

top_10 AS (
    SELECT
        interest_name,
        'top_10' AS rank
    FROM
        compositions
    WHERE
        comp_ranks = 1
    ORDER BY
        composition DESC
    LIMIT 
        10
),

bottom_10 AS (
    SELECT
        interest_name,
        'bottom_10' AS rank
    FROM
        compositions
    WHERE
        comp_ranks = 1
    ORDER BY 
        composition ASC
    LIMIT 
        10
),

row_bind AS (
    SELECT * FROM top_10
    UNION
    SELECT * FROM bottom_10
)
SELECT
    map.interest_name,
    met.month_year,
    met.composition,
    met.ranking,
    rank
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
        LEFT JOIN row_bind AS rb ON map.interest_name = rb.interest_name
WHERE
    1 = 1
    AND rb.interest_name IS NOT NULL
ORDER BY
    rank DESC,
    interest_name ASC;
"""

top_bottom_10 = athena.query(
    database=database, query=query, ctas_approach=ctas_approach
)
top_bottom_10
query = """ 
WITH compositions AS (
    SELECT
        map.interest_name,
        composition,
        DENSE_RANK() OVER (
            PARTITION BY map.interest_name
            ORDER BY met.composition DESC
        ) AS comp_ranks
    FROM
        fresh_segments.interest_metrics AS met
    LEFT JOIN 
        fresh_segments.interest_map AS map 
    ON 
        met.interest_id = map.id
),

top_10 AS (
    SELECT
        interest_name,
        'top_10' AS rank
    FROM
        compositions
    WHERE
        comp_ranks = 1
    ORDER BY
        composition DESC
    LIMIT 
        10
),

bottom_10 AS (
    SELECT
        interest_name,
        'bottom_10' AS rank
    FROM
        compositions
    WHERE
        comp_ranks = 1
    ORDER BY 
        composition ASC
    LIMIT 
        10
),

row_bind AS (
    SELECT * FROM top_10
    UNION
    SELECT * FROM bottom_10
)
SELECT
    map.interest_name,
    met.month_year,
    met.composition,
    met.ranking,
    rank
FROM
    fresh_segments.interest_metrics AS met 
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
        LEFT JOIN row_bind AS rb ON map.interest_name = rb.interest_name
WHERE
    1 = 1
    AND rb.interest_name IS NOT NULL
ORDER BY
    rank DESC,
    interest_name ASC;
"""

top_bottom_10 = athena.query(
    database=database, query=query, ctas_approach=ctas_approach
)
top_bottom_10

Out[18]:

	interest_name	month_year	composition	ranking	rank
0	Cosmetics and Beauty Shoppers	2019-04-01	5.10	695	top_10
1	Cosmetics and Beauty Shoppers	2019-02-01	6.50	584	top_10
2	Cosmetics and Beauty Shoppers	2019-03-01	5.93	619	top_10
3	Cosmetics and Beauty Shoppers	2018-08-01	7.98	389	top_10
4	Cosmetics and Beauty Shoppers	2018-12-01	6.73	487	top_10
...	...	...	...	...	...
147	World of Warcraft Enthusiasts	2018-11-01	1.63	902	bottom_10
148	World of Warcraft Enthusiasts	2019-08-01	1.82	1139	bottom_10
149	World of Warcraft Enthusiasts	2019-03-01	1.52	1130	bottom_10
150	World of Warcraft Enthusiasts	2019-04-01	1.58	1076	bottom_10
151	World of Warcraft Enthusiasts	2018-07-01	1.74	712	bottom_10

152 rows × 5 columns

In [19]:

Copied!





fig, axes = plt.subplots(1, 2, figsize=(16, 8))

top_10_data = top_bottom_10.loc[top_bottom_10["rank"] == "top_10"].sort_values(
    by="month_year"
)
bottom_10_data = top_bottom_10.loc[top_bottom_10["rank"] == "bottom_10"].sort_values(
    by="month_year"
)

# Top 10 Interests
for name, group in top_10_data.groupby("interest_name"):
    axes[0].plot(group["month_year"], group["composition"], label=name, marker="o")
axes[0].set_title("Top 10 Interests Over Time")
axes[0].set_xlabel("Month-Year")
axes[0].set_ylabel("Composition (%)")
axes[0].legend(title="Interest Name", fontsize=8)
axes[0].grid(True)

# Bottom 10 Interests
for name, group in bottom_10_data.groupby("interest_name"):
    axes[1].plot(group["month_year"], group["composition"], label=name, marker="o")
axes[1].set_title("Bottom 10 Interests Over Time")
axes[1].set_xlabel("Month-Year")
axes[1].legend(title="Interest Name", fontsize=8)
axes[1].grid(True)

plt.tight_layout()
plt.show();
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

top_10_data = top_bottom_10.loc[top_bottom_10["rank"] == "top_10"].sort_values(
    by="month_year"
)
bottom_10_data = top_bottom_10.loc[top_bottom_10["rank"] == "bottom_10"].sort_values(
    by="month_year"
)

# Top 10 Interests
for name, group in top_10_data.groupby("interest_name"):
    axes[0].plot(group["month_year"], group["composition"], label=name, marker="o")
axes[0].set_title("Top 10 Interests Over Time")
axes[0].set_xlabel("Month-Year")
axes[0].set_ylabel("Composition (%)")
axes[0].legend(title="Interest Name", fontsize=8)
axes[0].grid(True)

# Bottom 10 Interests
for name, group in bottom_10_data.groupby("interest_name"):
    axes[1].plot(group["month_year"], group["composition"], label=name, marker="o")
axes[1].set_title("Bottom 10 Interests Over Time")
axes[1].set_xlabel("Month-Year")
axes[1].legend(title="Interest Name", fontsize=8)
axes[1].grid(True)

plt.tight_layout()
plt.show();

Risk Analysis for Top 10 Interests¶

To assess the risk associated with focusing on the top 10 interests, we can conduct a risk analysis by evaluating the variability in their composition values over time. One effective way to define risk is to analyze the distribution of composition values for each interest across the observed time period.

By quantifying the fluctuations in composition, we can construct a risk score for each of the top 10 interests. These scores reflect the degree of uncertainty or variability in customer engagement with each interest. A higher risk score indicates greater variability, signaling potential uncertainty in prioritizing that interest.

This approach allows us to interpret the risk scores as indicators of the stability of each interest’s engagement levels over time, providing actionable insights for prioritization and resource allocation.

In [25]:

Copied!





def define_priors(data: pd.Series) -> Dict[str, float]:
    """
    Define priors based on the provided data.

    - A Normal prior is used for the mean, centered around the observed mean of the data
    - A flexible Normal prior allows for uncertainty, with its standard deviation set to twice the observed standard deviation of the data
    - An Inverse Gamma prior is used for the variance. This is a common choice for modeling uncertainty in scale parameters like variance
      or standard deviation. The shape (alpha) is set to 2, and the scale (beta) is informed by the data's variance (`data.var()`).

    Parameters
    ----------
    data : pd.Series
        The data series to calculate priors from.

    Returns
    -------
    Dict[str, float]
        A dictionary containing the calculated priors:
        - mu_prior_mean: Prior mean for mu.
        - mu_prior_std: Prior standard deviation for mu.
        - sigma_prior_alpha: Alpha parameter for the InverseGamma prior on sigma.
        - sigma_prior_beta: Beta parameter for the InverseGamma prior on sigma.
    """
    mu_prior_mean = data.mean()
    mu_prior_std = data.std() * 2
    sigma_prior_alpha = 2
    sigma_prior_beta = data.var()
    return {
        "mu_prior_mean": mu_prior_mean,
        "mu_prior_std": mu_prior_std,
        "sigma_prior_alpha": sigma_prior_alpha,
        "sigma_prior_beta": sigma_prior_beta,
    }


def run_bayesian_model(
    data: pd.Series,
    priors: Dict[str, float],
    samples: Optional[int] = 2000,
    tune: Optional[int] = 1000,
    hdi_prob: Optional[float] = 0.95,
) -> Tuple[np.ndarray, np.ndarray, Dict[str, Tuple[float, float]]]:
    """
    Run Bayesian inference on the data and return posterior samples.

    Parameters
    ----------
    data : pd.Series
        The observed data series.
    priors : dict
        A dictionary containing the priors for the model.
    samples : Optional[int]
        The number of posterior samples to draw, by default 2000.
    tune : Optional[int]
        The number of tuning steps for the sampler, by default 1000.
    hdi_prob: Optional[float]
        Prob for which the highest density interval will be computed, by default 0.95.


    Returns
    -------
    Tuple[np.ndarray, np.ndarray, Dict[str, Tuple[float, float]]]
        - posterior_mu: Posterior samples for the mean (mu).
        - posterior_sigma: Posterior samples for the standard deviation (sigma).
        - hdi_mu: HDI for the posterior mean.
    """
    with pm.Model() as model:
        # Priors
        mu = pm.Normal("mu", mu=priors["mu_prior_mean"], sigma=priors["mu_prior_std"])
        sigma = pm.InverseGamma(
            "sigma", alpha=priors["sigma_prior_alpha"], beta=priors["sigma_prior_beta"]
        )

        # Likelihood
        composition_obs = pm.Normal(
            "composition_obs", mu=mu, sigma=sigma, observed=data
        )

        # Posterior Sampling
        trace = pm.sample(
            samples, tune=tune, return_inferencedata=True, progressbar=False
        )

        # Calculate HDI for mu
        hdi_mu = pm.hdi(trace.posterior["mu"], hdi_prob=hdi_prob)

    posterior_mu = trace.posterior["mu"].values.flatten()
    posterior_sigma = trace.posterior["sigma"].values.flatten()
    return posterior_mu, posterior_sigma, hdi_mu


def compute_risk_score(posterior_mu: np.ndarray) -> float:
    """
    Calculate the risk score as the standard deviation of the posterior mean.

    Parameters
    ----------
    posterior_mu : np.ndarray
        The posterior samples for the mean (mu).

    Returns
    -------
    float
        The calculated risk score.
    """
    return np.std(posterior_mu)
def define_priors(data: pd.Series) -> Dict[str, float]:
    """
    Define priors based on the provided data.

    - A Normal prior is used for the mean, centered around the observed mean of the data
    - A flexible Normal prior allows for uncertainty, with its standard deviation set to twice the observed standard deviation of the data
    - An Inverse Gamma prior is used for the variance. This is a common choice for modeling uncertainty in scale parameters like variance
      or standard deviation. The shape (alpha) is set to 2, and the scale (beta) is informed by the data's variance (`data.var()`).

    Parameters
    ----------
    data : pd.Series
        The data series to calculate priors from.

    Returns
    -------
    Dict[str, float]
        A dictionary containing the calculated priors:
        - mu_prior_mean: Prior mean for mu.
        - mu_prior_std: Prior standard deviation for mu.
        - sigma_prior_alpha: Alpha parameter for the InverseGamma prior on sigma.
        - sigma_prior_beta: Beta parameter for the InverseGamma prior on sigma.
    """
    mu_prior_mean = data.mean()
    mu_prior_std = data.std() * 2
    sigma_prior_alpha = 2
    sigma_prior_beta = data.var()
    return {
        "mu_prior_mean": mu_prior_mean,
        "mu_prior_std": mu_prior_std,
        "sigma_prior_alpha": sigma_prior_alpha,
        "sigma_prior_beta": sigma_prior_beta,
    }


def run_bayesian_model(
    data: pd.Series,
    priors: Dict[str, float],
    samples: Optional[int] = 2000,
    tune: Optional[int] = 1000,
    hdi_prob: Optional[float] = 0.95,
) -> Tuple[np.ndarray, np.ndarray, Dict[str, Tuple[float, float]]]:
    """
    Run Bayesian inference on the data and return posterior samples.

    Parameters
    ----------
    data : pd.Series
        The observed data series.
    priors : dict
        A dictionary containing the priors for the model.
    samples : Optional[int]
        The number of posterior samples to draw, by default 2000.
    tune : Optional[int]
        The number of tuning steps for the sampler, by default 1000.
    hdi_prob: Optional[float]
        Prob for which the highest density interval will be computed, by default 0.95.


    Returns
    -------
    Tuple[np.ndarray, np.ndarray, Dict[str, Tuple[float, float]]]
        - posterior_mu: Posterior samples for the mean (mu).
        - posterior_sigma: Posterior samples for the standard deviation (sigma).
        - hdi_mu: HDI for the posterior mean.
    """
    with pm.Model() as model:
        # Priors
        mu = pm.Normal("mu", mu=priors["mu_prior_mean"], sigma=priors["mu_prior_std"])
        sigma = pm.InverseGamma(
            "sigma", alpha=priors["sigma_prior_alpha"], beta=priors["sigma_prior_beta"]
        )

        # Likelihood
        composition_obs = pm.Normal(
            "composition_obs", mu=mu, sigma=sigma, observed=data
        )

        # Posterior Sampling
        trace = pm.sample(
            samples, tune=tune, return_inferencedata=True, progressbar=False
        )

        # Calculate HDI for mu
        hdi_mu = pm.hdi(trace.posterior["mu"], hdi_prob=hdi_prob)

    posterior_mu = trace.posterior["mu"].values.flatten()
    posterior_sigma = trace.posterior["sigma"].values.flatten()
    return posterior_mu, posterior_sigma, hdi_mu


def compute_risk_score(posterior_mu: np.ndarray) -> float:
    """
    Calculate the risk score as the standard deviation of the posterior mean.

    Parameters
    ----------
    posterior_mu : np.ndarray
        The posterior samples for the mean (mu).

    Returns
    -------
    float
        The calculated risk score.
    """
    return np.std(posterior_mu)

In [27]:

Copied!





risk_scores = []

for interest_name in top_10_data["interest_name"].unique():
    # Filter data for the current interest
    interest_data = top_10_data[top_10_data["interest_name"] == interest_name][
        "composition"
    ]

    # Define priors
    priors = define_priors(interest_data)

    # Run Bayesian model and get HDI
    posterior_mu, _, hdi_mu = run_bayesian_model(interest_data, priors, hdi_prob=0.97)

    # Compute risk score as the standard deviations of the posterior means
    risk_score = compute_risk_score(posterior_mu)

    risk_scores.append(
        {
            "interest_name": interest_name,
            "risk_score": risk_score,
            "hdi_lower_mean_composition": hdi_mu["mu"].values[0],
            "hdi_upper_mean_composition": hdi_mu["mu"].values[1],
        }
    )

risk_scores_data = pd.DataFrame(risk_scores).sort_values("risk_score", ascending=False)
risk_scores_data
risk_scores = []

for interest_name in top_10_data["interest_name"].unique():
    # Filter data for the current interest
    interest_data = top_10_data[top_10_data["interest_name"] == interest_name][
        "composition"
    ]

    # Define priors
    priors = define_priors(interest_data)

    # Run Bayesian model and get HDI
    posterior_mu, _, hdi_mu = run_bayesian_model(interest_data, priors, hdi_prob=0.97)

    # Compute risk score as the standard deviations of the posterior means
    risk_score = compute_risk_score(posterior_mu)

    risk_scores.append(
        {
            "interest_name": interest_name,
            "risk_score": risk_score,
            "hdi_lower_mean_composition": hdi_mu["mu"].values[0],
            "hdi_upper_mean_composition": hdi_mu["mu"].values[1],
        }
    )

risk_scores_data = pd.DataFrame(risk_scores).sort_values("risk_score", ascending=False)
risk_scores_data

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.

Out[27]:

	interest_name	risk_score	hdi_lower_mean_composition	hdi_upper_mean_composition
7	Work Comes First Travelers	1.645555	13.992587	21.492166
9	Luxury Bedding Shoppers	1.208149	8.508197	13.899054
2	Luxury Boutique Hotel Researchers	1.100884	8.039092	13.009123
5	Luxury Hotel Guests	1.001667	7.134219	11.614973
3	Gym Equipment Owners	0.944078	8.352871	12.541372
1	Luxury Retail Shoppers	0.915064	7.708057	11.805010
6	Furniture Shoppers	0.847117	7.631271	11.361080
4	Shoe Shoppers	0.811842	4.634176	8.229593
8	Cosmetics and Beauty Shoppers	0.790816	4.579328	8.025522
0	Luxury Retail Researchers	0.741084	4.474628	7.776604

Index Analysis¶

The index_value is a measure that can be used to reverse calculate the average composition for Fresh Segments’ clients.

Average composition can be calculated by dividing the composition column by the index_value column rounded to 2 decimal places.

$$ \text{Average Composition} = \frac{\text{Composition}}{\text{Index Value}} $$

Q1¶

What is the top 10 interests by the average composition for each month?

In [74]:

Copied!





query = """ 
WITH
    ranked_avg_comp AS (
        SELECT
            met.month_year,
            map.interest_name,
            ROUND((met.composition / met.index_value), 2) AS avg_composition,
            DENSE_RANK() OVER (
                PARTITION BY
                    met.month_year
                ORDER BY
                    (met.composition / met.index_value) DESC
            ) AS ranks
        FROM
            fresh_segments.interest_metrics AS met
                LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
    )
SELECT
    month_year,
    interest_name,
    avg_composition,
    ranks
FROM
    ranked_avg_comp
WHERE
    1 = 1
    AND ranks <= 10
ORDER BY
    month_year ASC,
    avg_composition DESC;
"""

avg_composition_bv_month_year = athena.query(
    database=database, query=query, ctas_approach=ctas_approach
)
avg_composition_bv_month_year
query = """ 
WITH
    ranked_avg_comp AS (
        SELECT
            met.month_year,
            map.interest_name,
            ROUND((met.composition / met.index_value), 2) AS avg_composition,
            DENSE_RANK() OVER (
                PARTITION BY
                    met.month_year
                ORDER BY
                    (met.composition / met.index_value) DESC
            ) AS ranks
        FROM
            fresh_segments.interest_metrics AS met
                LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
    )
SELECT
    month_year,
    interest_name,
    avg_composition,
    ranks
FROM
    ranked_avg_comp
WHERE
    1 = 1
    AND ranks <= 10
ORDER BY
    month_year ASC,
    avg_composition DESC;
"""

avg_composition_bv_month_year = athena.query(
    database=database, query=query, ctas_approach=ctas_approach
)
avg_composition_bv_month_year

Out[74]:

	month_year	interest_name	avg_composition	ranks
0	2018-07-01	Las Vegas Trip Planners	7.36	1
1	2018-07-01	Gym Equipment Owners	6.94	2
2	2018-07-01	Cosmetics and Beauty Shoppers	6.78	3
3	2018-07-01	Luxury Retail Shoppers	6.61	4
4	2018-07-01	Furniture Shoppers	6.51	5
...	...	...	...	...
135	2019-08-01	Luxury Retail Shoppers	2.59	6
136	2019-08-01	Furniture Shoppers	2.59	7
137	2019-08-01	Marijuana Legalization Advocates	2.56	8
138	2019-08-01	Medicare Researchers	2.55	9
139	2019-08-01	Recently Retired Individuals	2.53	10

140 rows × 4 columns

In [75]:

Copied!





avg_comp_wide = avg_composition_bv_month_year.pivot(
    index="month_year", columns="interest_name", values="avg_composition"
)

avg_comp_wide.index = avg_comp_wide.index.astype(pd.StringDtype())
avg_comp_wide
avg_comp_wide = avg_composition_bv_month_year.pivot(
    index="month_year", columns="interest_name", values="avg_composition"
)

avg_comp_wide.index = avg_comp_wide.index.astype(pd.StringDtype())
avg_comp_wide

Out[75]:

interest_name	Alabama Trip Planners	Asian Food Enthusiasts	Chelsea Fans	Christmas Celebration Researchers	Cosmetics and Beauty Shoppers	Cruise Travel Intenders	Family Adventures Travelers	Furniture Shoppers	Gamers	Gym Equipment Owners	...	Nursing and Physicians Assistant Journal Researchers	PlayStation Enthusiasts	Readers of Catholic News	Readers of Honduran Content	Recently Retired Individuals	Restaurant Supply Shoppers	Solar Energy Researchers	Teen Girl Clothing Shoppers	Video Gamers	Work Comes First Travelers
month_year
2018-07-01	NaN	6.10	NaN	NaN	6.78	NaN	4.85	6.51	NaN	6.94	...	NaN	NaN	NaN	NaN	5.72	NaN	NaN	NaN	NaN	4.80
2018-08-01	4.83	5.68	NaN	NaN	6.28	NaN	NaN	6.30	NaN	6.62	...	NaN	NaN	NaN	NaN	5.58	NaN	NaN	NaN	NaN	5.70
2018-09-01	7.27	NaN	NaN	6.47	NaN	NaN	NaN	NaN	NaN	NaN	...	6.70	NaN	NaN	7.60	NaN	6.25	6.24	6.53	NaN	8.26
2018-10-01	7.10	NaN	NaN	6.72	NaN	NaN	NaN	NaN	NaN	NaN	...	7.02	NaN	NaN	7.02	NaN	NaN	6.50	6.78	NaN	9.14
2018-11-01	6.69	NaN	NaN	6.08	NaN	NaN	NaN	NaN	NaN	NaN	...	6.65	NaN	NaN	7.09	NaN	5.59	7.05	5.95	NaN	8.28
2018-12-01	6.68	NaN	5.86	6.09	NaN	NaN	NaN	NaN	NaN	NaN	...	6.96	NaN	NaN	6.58	NaN	NaN	6.55	6.38	NaN	8.31
2019-01-01	6.44	NaN	NaN	5.65	NaN	NaN	NaN	NaN	NaN	NaN	...	6.46	NaN	5.48	6.67	NaN	NaN	7.05	5.96	NaN	7.66
2019-02-01	6.65	NaN	NaN	5.98	NaN	NaN	NaN	NaN	NaN	NaN	...	6.84	6.23	NaN	6.24	NaN	NaN	6.58	6.29	NaN	7.66
2019-03-01	6.54	NaN	NaN	5.61	NaN	NaN	NaN	NaN	NaN	NaN	...	6.52	6.06	5.65	6.21	NaN	NaN	6.40	6.01	NaN	NaN
2019-04-01	6.21	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	6.01	5.52	5.30	6.02	NaN	5.07	6.28	5.39	NaN	NaN
2019-05-01	3.34	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.29	NaN	...	3.15	3.55	4.08	4.41	NaN	NaN	3.92	NaN	3.19	NaN
2019-06-01	NaN	2.52	NaN	NaN	2.55	2.2	NaN	2.39	NaN	2.55	...	NaN	NaN	NaN	NaN	2.27	NaN	NaN	NaN	NaN	NaN
2019-07-01	NaN	2.78	NaN	NaN	2.78	NaN	NaN	2.79	NaN	2.79	...	NaN	NaN	NaN	NaN	2.72	NaN	NaN	NaN	NaN	NaN
2019-08-01	NaN	2.68	NaN	NaN	2.73	NaN	NaN	2.59	NaN	2.72	...	NaN	NaN	NaN	NaN	2.53	NaN	2.66	NaN	NaN	NaN

14 rows × 30 columns

In [76]:

Copied!





plt.figure(figsize=(12, 8))
sns.heatmap(avg_comp_wide.T, cmap="coolwarm", annot=True, fmt=".1f", linewidths=0.5)
plt.title("Heatmap of Average Composition (Top 10 Interests)")
plt.xlabel("Month-Year")
plt.ylabel("Interest Name")
plt.tight_layout()
plt.show();
plt.figure(figsize=(12, 8))
sns.heatmap(avg_comp_wide.T, cmap="coolwarm", annot=True, fmt=".1f", linewidths=0.5)
plt.title("Heatmap of Average Composition (Top 10 Interests)")
plt.xlabel("Month-Year")
plt.ylabel("Interest Name")
plt.tight_layout()
plt.show();

Q2¶

For all of these top 10 interests - which interest appears the most often?

In [73]:

Copied!





query = """ 
WITH
    ranked_avg_comp AS (
        SELECT
            met.month_year,
            map.interest_name,
            ROUND((met.composition / met.index_value), 2) AS avg_composition,
            DENSE_RANK() OVER (
                PARTITION BY
                    met.month_year
                ORDER BY
                    (met.composition / met.index_value) DESC
            ) AS ranks
        FROM
            fresh_segments.interest_metrics AS met
                LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
    )
SELECT
    interest_name,
    COUNT(*) AS count
FROM
    ranked_avg_comp
WHERE
    1 = 1
    AND ranks <= 10
GROUP BY
    interest_name
ORDER BY
    count DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
WITH
    ranked_avg_comp AS (
        SELECT
            met.month_year,
            map.interest_name,
            ROUND((met.composition / met.index_value), 2) AS avg_composition,
            DENSE_RANK() OVER (
                PARTITION BY
                    met.month_year
                ORDER BY
                    (met.composition / met.index_value) DESC
            ) AS ranks
        FROM
            fresh_segments.interest_metrics AS met
                LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
    )
SELECT
    interest_name,
    COUNT(*) AS count
FROM
    ranked_avg_comp
WHERE
    1 = 1
    AND ranks <= 10
GROUP BY
    interest_name
ORDER BY
    count DESC;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[73]:

	interest_name	count
0	Luxury Bedding Shoppers	10
1	Alabama Trip Planners	10
2	Solar Energy Researchers	10
3	Readers of Honduran Content	9
4	New Years Eve Party Ticket Purchasers	9
5	Nursing and Physicians Assistant Journal Resea...	9
6	Work Comes First Travelers	8
7	Teen Girl Clothing Shoppers	8
8	Christmas Celebration Researchers	7
9	Asian Food Enthusiasts	5
10	Recently Retired Individuals	5
11	Luxury Retail Shoppers	5
12	Gym Equipment Owners	5
13	Cosmetics and Beauty Shoppers	5
14	Las Vegas Trip Planners	5
15	Furniture Shoppers	5
16	PlayStation Enthusiasts	4
17	Readers of Catholic News	4
18	Restaurant Supply Shoppers	3
19	Medicare Researchers	3
20	Medicare Provider Researchers	2
21	Gamers	1
22	Medicare Price Shoppers	1
23	Luxury Boutique Hotel Researchers	1
24	Marijuana Legalization Advocates	1
25	Chelsea Fans	1
26	Family Adventures Travelers	1
27	Video Gamers	1
28	Cruise Travel Intenders	1
29	HDTV Researchers	1

Q3¶

What is the average of the average composition for the top 10 interests for each month?

In [71]:

Copied!





query = """ 
WITH
    ranked_avg_comp AS (
        SELECT
            met.month_year,
            map.interest_name,
            ROUND((met.composition / met.index_value), 2) AS avg_composition,
            DENSE_RANK() OVER (
                PARTITION BY
                    met.month_year
                ORDER BY
                    (met.composition / met.index_value) DESC
            ) AS ranks
        FROM
            fresh_segments.interest_metrics AS met
                LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
    )
SELECT
    month_year,
    AVG(avg_composition) AS avg_of_avg_composition
FROM
    ranked_avg_comp
WHERE
    1 = 1
    AND ranks <= 10
GROUP BY
    month_year
ORDER BY
    month_year ASC;
"""

avg_of_avg_comp = athena.query(
    database=database, query=query, ctas_approach=ctas_approach
)
avg_of_avg_comp
query = """ 
WITH
    ranked_avg_comp AS (
        SELECT
            met.month_year,
            map.interest_name,
            ROUND((met.composition / met.index_value), 2) AS avg_composition,
            DENSE_RANK() OVER (
                PARTITION BY
                    met.month_year
                ORDER BY
                    (met.composition / met.index_value) DESC
            ) AS ranks
        FROM
            fresh_segments.interest_metrics AS met
                LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
    )
SELECT
    month_year,
    AVG(avg_composition) AS avg_of_avg_composition
FROM
    ranked_avg_comp
WHERE
    1 = 1
    AND ranks <= 10
GROUP BY
    month_year
ORDER BY
    month_year ASC;
"""

avg_of_avg_comp = athena.query(
    database=database, query=query, ctas_approach=ctas_approach
)
avg_of_avg_comp

Out[71]:

	month_year	avg_of_avg_composition
0	2018-07-01	6.038
1	2018-08-01	5.945
2	2018-09-01	6.895
3	2018-10-01	7.066
4	2018-11-01	6.623
5	2018-12-01	6.652
6	2019-01-01	6.399
7	2019-02-01	6.579
8	2019-03-01	6.168
9	2019-04-01	5.750
10	2019-05-01	3.537
11	2019-06-01	2.427
12	2019-07-01	2.765
13	2019-08-01	2.631

In [72]:

Copied!





fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(
    avg_of_avg_comp["month_year"], avg_of_avg_comp["avg_of_avg_composition"], marker="o"
)
ax.set_title("Average of Average Composition Over Time")
ax.set_xlabel("Month-Year")
plt.tight_layout()
plt.show();
fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(
    avg_of_avg_comp["month_year"], avg_of_avg_comp["avg_of_avg_composition"], marker="o"
)
ax.set_title("Average of Average Composition Over Time")
ax.set_xlabel("Month-Year")
plt.tight_layout()
plt.show();

Q4¶

What is the 3 month rolling average of the max average composition value from September 2018 to August 2019?

Base CTE¶

In [156]:

Copied!





query = """ 
SELECT
    met.month_year,
    map.interest_name,
    ROUND((met.composition / met.index_value), 2) AS avg_composition,
    DENSE_RANK() OVER (
        PARTITION BY
            met.month_year
        ORDER BY
            CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
    ) AS ranks
FROM
    fresh_segments.interest_metrics AS met
    LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
SELECT
    met.month_year,
    map.interest_name,
    ROUND((met.composition / met.index_value), 2) AS avg_composition,
    DENSE_RANK() OVER (
        PARTITION BY
            met.month_year
        ORDER BY
            CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
    ) AS ranks
FROM
    fresh_segments.interest_metrics AS met
    LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id;
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[156]:

	month_year	interest_name	avg_composition	ranks
0	2018-12-01	Work Comes First Travelers	8.31	1
1	2018-12-01	Nursing and Physicians Assistant Journal Resea...	6.96	2
2	2018-12-01	Alabama Trip Planners	6.68	3
3	2018-12-01	Luxury Bedding Shoppers	6.63	4
4	2018-12-01	Readers of Honduran Content	6.58	5
...	...	...	...	...
13074	2019-08-01	Small Business Employees	1.14	1072
13075	2019-08-01	Toronto Blue Jays Fans	1.11	1073
13076	2019-08-01	Price Conscious Home Shoppers	1.10	1074
13077	2019-08-01	Pet Store Goers	1.06	1075
13078	2019-08-01	Health & Fitness	0.96	1076

13079 rows × 4 columns

Window Computations - Previous Months Data Formatting¶

CASE 
    WHEN LAG(avg_composition, 1) OVER w IS NULL 
    THEN NULL 
    ELSE CONCAT_WS(': ', LAG(interest_name, 1) OVER w, CAST(LAG(avg_composition, 1) OVER w AS VARCHAR))
END AS one_month_ago,
CASE 
    WHEN LAG(avg_composition, 2) OVER w IS NULL 
    THEN NULL 
    ELSE CONCAT_WS(': ', LAG(interest_name, 2) OVER w, CAST(LAG(avg_composition, 2) OVER w AS VARCHAR))
END AS two_months_ago

This code creates two columns that show historical data in a formatted string:

one_month_ago: Shows data from the previous month
two_months_ago: Shows data from two months ago

For each row, the logic works as follows:

Check if there's missing data (NULL) from the previous period
If there is missing data, output NULL
If data exists, combine the interest name and its composition value into a formatted string like "Interest Name: 7.36"

The LAG function is what looks back in time - LAG(1) looks back one month, LAG(2) looks back two months. This gives us a rolling historical view of which interests were most popular in previous months.

For example, if we're looking at August's data:

one_month_ago shows July's top interest
two_months_ago shows June's top interest

In [160]:

Copied!





query = """ 
WITH ranked_avg_comp_by_month AS (
    SELECT
        met.month_year,
        map.interest_name,
        ROUND((met.composition / met.index_value), 2) AS avg_composition,
        DENSE_RANK() OVER (
            PARTITION BY
                met.month_year
            ORDER BY
                CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
        ) AS ranks
    FROM
        fresh_segments.interest_metrics AS met
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
)
SELECT
    month_year,
    interest_name,
    avg_composition AS max_avg_composition,
    ROUND(AVG(avg_composition) OVER w_3_months, 2) AS three_month_moving_avg,
    CASE 
        WHEN LAG(avg_composition, 1) OVER w IS NULL 
        THEN NULL 
        ELSE CONCAT_WS(': ', LAG(interest_name, 1) OVER w, CAST(LAG(avg_composition, 1) OVER w AS VARCHAR))
    END AS one_month_ago,
    CASE 
        WHEN LAG(avg_composition, 2) OVER w IS NULL 
        THEN NULL 
        ELSE CONCAT_WS(': ', LAG(interest_name, 2) OVER w, CAST(LAG(avg_composition, 2) OVER w AS VARCHAR))
    END AS two_months_ago
FROM
    ranked_avg_comp_by_month
WHERE
    1 = 1
    AND ranks = 1
WINDOW
    w AS (ORDER BY month_year ASC),
    w_3_months AS (ORDER BY month_year ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)
query = """ 
WITH ranked_avg_comp_by_month AS (
    SELECT
        met.month_year,
        map.interest_name,
        ROUND((met.composition / met.index_value), 2) AS avg_composition,
        DENSE_RANK() OVER (
            PARTITION BY
                met.month_year
            ORDER BY
                CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
        ) AS ranks
    FROM
        fresh_segments.interest_metrics AS met
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
)
SELECT
    month_year,
    interest_name,
    avg_composition AS max_avg_composition,
    ROUND(AVG(avg_composition) OVER w_3_months, 2) AS three_month_moving_avg,
    CASE 
        WHEN LAG(avg_composition, 1) OVER w IS NULL 
        THEN NULL 
        ELSE CONCAT_WS(': ', LAG(interest_name, 1) OVER w, CAST(LAG(avg_composition, 1) OVER w AS VARCHAR))
    END AS one_month_ago,
    CASE 
        WHEN LAG(avg_composition, 2) OVER w IS NULL 
        THEN NULL 
        ELSE CONCAT_WS(': ', LAG(interest_name, 2) OVER w, CAST(LAG(avg_composition, 2) OVER w AS VARCHAR))
    END AS two_months_ago
FROM
    ranked_avg_comp_by_month
WHERE
    1 = 1
    AND ranks = 1
WINDOW
    w AS (ORDER BY month_year ASC),
    w_3_months AS (ORDER BY month_year ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);
"""

athena.query(database=database, query=query, ctas_approach=ctas_approach)

Out[160]:

	month_year	interest_name	max_avg_composition	three_month_moving_avg	one_month_ago	two_months_ago
0	2018-07-01	Las Vegas Trip Planners	7.36	7.36	<NA>	<NA>
1	2018-08-01	Las Vegas Trip Planners	7.21	7.29	Las Vegas Trip Planners: 7.36	<NA>
2	2018-09-01	Work Comes First Travelers	8.26	7.61	Las Vegas Trip Planners: 7.21	Las Vegas Trip Planners: 7.36
3	2018-10-01	Work Comes First Travelers	9.14	8.20	Work Comes First Travelers: 8.26	Las Vegas Trip Planners: 7.21
4	2018-11-01	Work Comes First Travelers	8.28	8.56	Work Comes First Travelers: 9.14	Work Comes First Travelers: 8.26
5	2018-12-01	Work Comes First Travelers	8.31	8.58	Work Comes First Travelers: 8.28	Work Comes First Travelers: 9.14
6	2019-01-01	Work Comes First Travelers	7.66	8.08	Work Comes First Travelers: 8.31	Work Comes First Travelers: 8.28
7	2019-02-01	Work Comes First Travelers	7.66	7.88	Work Comes First Travelers: 7.66	Work Comes First Travelers: 8.31
8	2019-03-01	Alabama Trip Planners	6.54	7.29	Work Comes First Travelers: 7.66	Work Comes First Travelers: 7.66
9	2019-04-01	Solar Energy Researchers	6.28	6.83	Alabama Trip Planners: 6.54	Work Comes First Travelers: 7.66
10	2019-05-01	Readers of Honduran Content	4.41	5.74	Solar Energy Researchers: 6.28	Alabama Trip Planners: 6.54
11	2019-06-01	Las Vegas Trip Planners	2.77	4.49	Readers of Honduran Content: 4.41	Solar Energy Researchers: 6.28
12	2019-07-01	Las Vegas Trip Planners	2.82	3.33	Las Vegas Trip Planners: 2.77	Readers of Honduran Content: 4.41
13	2019-08-01	Cosmetics and Beauty Shoppers	2.73	2.77	Las Vegas Trip Planners: 2.82	Las Vegas Trip Planners: 2.77

Final Results¶

In [162]:

Copied!





query = """ 
WITH ranked_avg_comp_by_month AS (
    SELECT
        met.month_year,
        map.interest_name,
        ROUND((met.composition / met.index_value), 2) AS avg_composition,
        DENSE_RANK() OVER (
            PARTITION BY
                met.month_year
            ORDER BY
                CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
        ) AS ranks
    FROM
        fresh_segments.interest_metrics AS met
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
),

window_computations AS (
    SELECT
        month_year,
        interest_name,
        avg_composition AS max_avg_composition,
        ROUND(AVG(avg_composition) OVER w_3_months, 2) AS three_month_moving_avg,
        CASE 
            WHEN LAG(avg_composition, 1) OVER w IS NULL 
            THEN NULL 
            ELSE CONCAT_WS(': ', LAG(interest_name, 1) OVER w, CAST(LAG(avg_composition, 1) OVER w AS VARCHAR))
        END AS one_month_ago,
        CASE 
            WHEN LAG(avg_composition, 2) OVER w IS NULL 
            THEN NULL 
            ELSE CONCAT_WS(': ', LAG(interest_name, 2) OVER w, CAST(LAG(avg_composition, 2) OVER w AS VARCHAR))
        END AS two_months_ago
    FROM
        ranked_avg_comp_by_month
    WHERE
        1 = 1
        AND ranks = 1
    WINDOW
        w AS (ORDER BY month_year ASC),
        w_3_months AS (ORDER BY month_year ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    )
SELECT
    *
FROM
    window_computations
WHERE
    1 = 1
    AND (one_month_ago IS NOT NULL AND two_months_ago IS NOT NULL)
ORDER BY
    month_year ASC;
"""

max_avg_comp_by_month_year = athena.query(
    database=database, query=query, ctas_approach=ctas_approach
)
max_avg_comp_by_month_year
query = """ 
WITH ranked_avg_comp_by_month AS (
    SELECT
        met.month_year,
        map.interest_name,
        ROUND((met.composition / met.index_value), 2) AS avg_composition,
        DENSE_RANK() OVER (
            PARTITION BY
                met.month_year
            ORDER BY
                CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
        ) AS ranks
    FROM
        fresh_segments.interest_metrics AS met
        LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
),

window_computations AS (
    SELECT
        month_year,
        interest_name,
        avg_composition AS max_avg_composition,
        ROUND(AVG(avg_composition) OVER w_3_months, 2) AS three_month_moving_avg,
        CASE 
            WHEN LAG(avg_composition, 1) OVER w IS NULL 
            THEN NULL 
            ELSE CONCAT_WS(': ', LAG(interest_name, 1) OVER w, CAST(LAG(avg_composition, 1) OVER w AS VARCHAR))
        END AS one_month_ago,
        CASE 
            WHEN LAG(avg_composition, 2) OVER w IS NULL 
            THEN NULL 
            ELSE CONCAT_WS(': ', LAG(interest_name, 2) OVER w, CAST(LAG(avg_composition, 2) OVER w AS VARCHAR))
        END AS two_months_ago
    FROM
        ranked_avg_comp_by_month
    WHERE
        1 = 1
        AND ranks = 1
    WINDOW
        w AS (ORDER BY month_year ASC),
        w_3_months AS (ORDER BY month_year ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    )
SELECT
    *
FROM
    window_computations
WHERE
    1 = 1
    AND (one_month_ago IS NOT NULL AND two_months_ago IS NOT NULL)
ORDER BY
    month_year ASC;
"""

max_avg_comp_by_month_year = athena.query(
    database=database, query=query, ctas_approach=ctas_approach
)
max_avg_comp_by_month_year

Out[162]:

	month_year	interest_name	max_avg_composition	three_month_moving_avg	one_month_ago	two_months_ago
0	2018-09-01	Work Comes First Travelers	8.26	7.61	Las Vegas Trip Planners: 7.21	Las Vegas Trip Planners: 7.36
1	2018-10-01	Work Comes First Travelers	9.14	8.20	Work Comes First Travelers: 8.26	Las Vegas Trip Planners: 7.21
2	2018-11-01	Work Comes First Travelers	8.28	8.56	Work Comes First Travelers: 9.14	Work Comes First Travelers: 8.26
3	2018-12-01	Work Comes First Travelers	8.31	8.58	Work Comes First Travelers: 8.28	Work Comes First Travelers: 9.14
4	2019-01-01	Work Comes First Travelers	7.66	8.08	Work Comes First Travelers: 8.31	Work Comes First Travelers: 8.28
5	2019-02-01	Work Comes First Travelers	7.66	7.88	Work Comes First Travelers: 7.66	Work Comes First Travelers: 8.31
6	2019-03-01	Alabama Trip Planners	6.54	7.29	Work Comes First Travelers: 7.66	Work Comes First Travelers: 7.66
7	2019-04-01	Solar Energy Researchers	6.28	6.83	Alabama Trip Planners: 6.54	Work Comes First Travelers: 7.66
8	2019-05-01	Readers of Honduran Content	4.41	5.74	Solar Energy Researchers: 6.28	Alabama Trip Planners: 6.54
9	2019-06-01	Las Vegas Trip Planners	2.77	4.49	Readers of Honduran Content: 4.41	Solar Energy Researchers: 6.28
10	2019-07-01	Las Vegas Trip Planners	2.82	3.33	Las Vegas Trip Planners: 2.77	Readers of Honduran Content: 4.41
11	2019-08-01	Cosmetics and Beauty Shoppers	2.73	2.77	Las Vegas Trip Planners: 2.82	Las Vegas Trip Planners: 2.77

Q5¶

Provide a possible reason why the max average composition might change from month to month? Could it signal something is not quite right with the overall business model for Fresh Segments?

Seasonality:
- The composition metric might reflect customer behaviors that vary with seasons. For instance, interest in "Vacation Rental Accommodation Researchers" could spike during holiday seasons or summer months.
Marketing Campaigns:
- Targeted advertising or promotional campaigns could drive temporary interest in specific segments, increasing their composition metrics.
Shifts in Business Focus:
- Fresh Segments might adjust their offerings or emphasis on particular interests, impacting customer interactions with specific segments.
Changes in Client Lists:
- Fluctuations in the client base (e.g., onboarding new clients or losing existing ones) could affect the overall average composition.
Measurement Issues:
- A sudden, unexplained change might indicate a data quality or calculation issue, especially if it doesn't align with known business activities.

Potential Red Flags for Business Model¶

If the changes in the maximum average composition cannot be explained by predictable factors like seasonality or campaigns, it might indicate:

Market Misalignment: The target audience for Fresh Segments' offerings may not consistently align with their clients' needs.
Client Retention Issues: Variability might reflect challenges in maintaining a steady client base.
Segment Relevance: Some interest segments may not be as relevant or valuable to the clients as anticipated.

	total_months	interest_counts
0	14	480
1	13	82
2	12	65
3	11	94
4	10	86
5	9	95
6	8	67
7	7	90
8	6	33
9	5	38
10	4	32
11	3	15
12	2	12
13	1	13

	total_months	interest_counts
0	14	480
1	13	82
2	12	65
3	11	94
4	10	86
5	9	95
6	8	67
7	7	90
8	6	33
9	5	38
10	4	32
11	3	15
12	2	12
13	1	13

	total_months	interest_counts
0	14	480
1	13	82
2	12	65
3	11	94
4	10	86
5	9	95
6	8	67
7	7	90
8	6	33
9	5	38
10	4	32
11	3	15
12	2	12
13	1	13