Fresh Segments
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import os
import sys
from typing import Dict, Optional, Tuple
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns
sys.path.append("../../../")
from src.athena import Athena
from src.utils import create_session
Global¶
boto3_session = create_session(
profile_name="dev",
role_arn=os.getenv("ATHENA_IAM_ROLE_ARN"),
)
wait = True
ctas_approach = False
storage_format = "PARQUET"
write_compression = "SNAPPY"
database = "fresh_segments"
tables = ["interest_map", "interest_metrics"]
athena = Athena(boto3_session=boto3_session, s3_output=os.getenv("ATHENA_S3_OUTPUT"))
athena
Athena(boto3_session=Session(region_name='us-east-1'), s3_output=s3://sql-case-studies/query_results)
Problem Statement¶
Fresh Segments is a digital marketing agency specializing in analyzing trends in online ad click behavior for businesses with targeted customer bases. The agency receives customer lists from clients and aggregates interest metrics to create a comprehensive dataset for further analysis.
The main output includes detailed composition and rankings of various interests, highlighting the proportion of customers who engaged with online assets related to each interest on a monthly basis.
The goal of this case study is to analyze these aggregated metrics for a sample client to provide high-level insights into their customer base and associated interests.
Entity Relationship Diagram¶
Tables¶
for table in tables:
athena.query(
database=database,
query=f"""
SELECT
*
FROM
{database}.{table} TABLESAMPLE BERNOULLI(30);
""",
ctas_approach=ctas_approach,
)
| id | interest_name | interest_summary | created_at | last_modified | |
|---|---|---|---|---|---|
| 0 | 1 | Fitness Enthusiasts | Consumers using fitness tracking apps and webs... | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 |
| 1 | 5 | Brides & Wedding Planners | People researching wedding ideas and vendors. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 |
| 2 | 6 | Vacation Planners | Consumers reading reviews of vacation destinat... | 2016-05-26 14:57:59 | 2018-05-23 11:30:13 |
| 3 | 8 | Business News Readers | Readers of online business news content. | 2016-05-26 14:57:59 | 2018-05-23 11:30:12 |
| 4 | 12 | Thrift Store Shoppers | Consumers shopping online for clothing at thri... | 2016-05-26 14:57:59 | 2018-03-16 13:14:00 |
| ... | ... | ... | ... | ... | ... |
| 366 | 47850 | Asheville Trip Planners | People researching attractions and accommodati... | 2019-03-15 22:00:02 | 2019-03-21 15:33:09 |
| 367 | 48154 | Elite Cycling Gear Shoppers | Consumers researching and shopping for elite c... | 2019-03-21 22:00:00 | 2019-03-22 15:31:46 |
| 368 | 48465 | HGTV Enthusiasts | People interested in HGTV shows and home remod... | 2019-03-27 22:00:01 | 2019-04-02 16:17:29 |
| 369 | 50860 | Food Delivery Service Users | Users of online food delivery services. | 2019-04-23 18:00:02 | 2019-04-24 18:30:04 |
| 370 | 51119 | Skin Disorder Researchers | People reading news and advice on preventing a... | 2019-04-26 18:00:00 | 2019-04-29 14:20:04 |
371 rows × 5 columns
| record_month | record_year | month_year | interest_id | composition | index_value | ranking | percentile_ranking | |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 2018 | 07-2018 | 32486 | 11.89 | 6.19 | 1 | 99.86 |
| 1 | 7 | 2018 | 07-2018 | 18923 | 10.85 | 5.29 | 3 | 99.59 |
| 2 | 7 | 2018 | 07-2018 | 6110 | 11.57 | 4.79 | 11 | 98.49 |
| 3 | 7 | 2018 | 07-2018 | 4 | 13.97 | 4.53 | 14 | 98.08 |
| 4 | 7 | 2018 | 07-2018 | 17 | 7.89 | 4.15 | 19 | 97.39 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4307 | <NA> | <NA> | <NA> | <NA> | 1.54 | 0.77 | 1181 | 1.09 |
| 4308 | <NA> | <NA> | <NA> | <NA> | 1.94 | 0.77 | 1181 | 1.09 |
| 4309 | <NA> | <NA> | <NA> | <NA> | 1.62 | 0.74 | 1187 | 0.59 |
| 4310 | <NA> | <NA> | <NA> | <NA> | 1.62 | 0.68 | 1191 | 0.25 |
| 4311 | <NA> | <NA> | <NA> | <NA> | 1.64 | 0.62 | 1194 | 0.00 |
4312 rows × 8 columns
query = """
WITH grouped_counts AS (
SELECT
record_month,
record_year,
month_year,
interest_id,
composition,
index_value,
ranking,
percentile_ranking,
COUNT(*) AS freq
FROM
fresh_segments.interest_metrics
WHERE
record_month IS NOT NULL
AND record_year IS NOT NULL
AND month_year IS NOT NULL
AND interest_id IS NOT NULL
GROUP BY
record_month,
record_year,
month_year,
interest_id,
composition,
index_value,
ranking,
percentile_ranking
)
SELECT
*
FROM
grouped_counts
WHERE
freq > 1
ORDER BY
freq DESC;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| record_month | record_year | month_year | interest_id | composition | index_value | ranking | percentile_ranking | freq |
|---|---|---|---|---|---|---|---|---|
query = """
SELECT
SUM(CASE WHEN record_month IS NULL THEN 1 ELSE 0 END) AS record_month_null,
SUM(CASE WHEN record_year IS NULL THEN 1 ELSE 0 END) AS record_year_null,
SUM(CASE WHEN month_year IS NULL THEN 1 ELSE 0 END) AS month_year_null,
SUM(CASE WHEN interest_id IS NULL THEN 1 ELSE 0 END) AS interest_id_null,
SUM(CASE WHEN composition IS NULL THEN 1 ELSE 0 END) AS composition_null,
SUM(CASE WHEN index_value IS NULL THEN 1 ELSE 0 END) AS index_value_null,
SUM(CASE WHEN ranking IS NULL THEN 1 ELSE 0 END) AS ranking_null,
SUM(CASE WHEN percentile_ranking IS NULL THEN 1 ELSE 0 END) AS percentile_ranking_null
FROM
fresh_segments.interest_metrics;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| record_month_null | record_year_null | month_year_null | interest_id_null | composition_null | index_value_null | ranking_null | percentile_ranking_null | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1194 | 1194 | 1194 | 1193 | 0 | 0 | 0 | 0 |
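The same per-column null tally can be reproduced in pandas on the query result. A minimal sketch, assuming a hypothetical miniature `df` of the interest_metrics table (not the real data):

```python
import pandas as pd

# Hypothetical miniature of interest_metrics: two date-less rows,
# one of which still carries an interest_id (mirroring the null pattern above)
df = pd.DataFrame(
    {
        "month_year": [None, "07-2018", None],
        "interest_id": [None, 21246, 21246],
        "composition": [1.54, 2.26, 1.61],
    }
)

# Equivalent of SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) per column
null_counts = df.isna().sum()
print(null_counts)
```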
EDA & Data Preprocessing¶
Q1 & Q3¶
There are two preprocessing steps that can be helpful:
1. Update the fresh_segments.interest_metrics table by modifying the month_year column to be a DATE data type, with the start of the month as the day.
2. Handle the null values in the fresh_segments.interest_metrics table.
First, check if there are any records with non-missing interest ID but missing month_year value:
query = """
SELECT
interest_id
FROM
fresh_segments.interest_metrics
WHERE
1 = 1
AND month_year IS NULL
AND interest_id IS NOT NULL;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_id | |
|---|---|
| 0 | 21246 |
Missing Both ID & Date¶
For all records that have both missing month_year and interest_id values in the fresh_segments.interest_metrics table, we unfortunately cannot make any good business interpretations on their interest metrics since we do not know what interest they are associated with.
Missing Date¶
For the record where the interest_id is known (specifically interest_id = 21246) but the month_year value is missing, we want to determine if there are other records in the dataset with the same interest_id that have a month_year value that is not missing. This allows us to potentially infer or cross-reference information about the missing month_year value.
query = """
SELECT
*
FROM
fresh_segments.interest_metrics
WHERE
interest_id = 21246;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| record_month | record_year | month_year | interest_id | composition | index_value | ranking | percentile_ranking | |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 2018 | 07-2018 | 21246 | 2.26 | 0.65 | 722 | 0.96 |
| 1 | 8 | 2018 | 08-2018 | 21246 | 2.13 | 0.59 | 765 | 0.26 |
| 2 | 9 | 2018 | 09-2018 | 21246 | 2.06 | 0.61 | 774 | 0.77 |
| 3 | 10 | 2018 | 10-2018 | 21246 | 1.74 | 0.58 | 855 | 0.23 |
| 4 | 11 | 2018 | 11-2018 | 21246 | 2.25 | 0.78 | 908 | 2.16 |
| 5 | 12 | 2018 | 12-2018 | 21246 | 1.97 | 0.70 | 983 | 1.21 |
| 6 | 1 | 2019 | 01-2019 | 21246 | 2.05 | 0.76 | 954 | 1.95 |
| 7 | 2 | 2019 | 02-2019 | 21246 | 1.84 | 0.68 | 1109 | 1.07 |
| 8 | 3 | 2019 | 03-2019 | 21246 | 1.75 | 0.67 | 1123 | 1.14 |
| 9 | 4 | 2019 | 04-2019 | 21246 | 1.58 | 0.63 | 1092 | 0.64 |
| 10 | <NA> | <NA> | <NA> | 21246 | 1.61 | 0.68 | 1191 | 0.25 |
In this case, the options are:
1. Impute the missing month_year value with the month_year value of the latest record with the same interest_id.
2. Impute the missing month_year value with the created_at value from the fresh_segments.interest_map table.
3. Drop the record.
The first two options run the risk of introducing bias into the data, but dropping the record may result in a loss of valuable information. Given that there is only one record with a missing month_year value, we can argue that the risk of losing valuable information is low relative to the risk of introducing bias into the data by imputing date information.
# New ctas table with modified column and dropping all records with missing date information
query = """
SELECT
record_month,
record_year,
CASE WHEN month_year IS NULL THEN NULL ELSE DATE_PARSE(CONCAT_WS('-', '01', month_year), '%d-%m-%Y') END AS month_year,
interest_id,
composition,
index_value,
ranking,
percentile_ranking
FROM
fresh_segments.interest_metrics
WHERE
month_year IS NOT NULL;
"""
athena.create_ctas_table(
database=database,
query=query,
ctas_table="interest_metrics_modified",
s3_output="s3://sql-case-studies/fresh_segments/interest_metrics/",
storage_format=storage_format,
write_compression=write_compression,
wait=wait,
)
# Drop old table
athena.drop_table(database=database, table="interest_metrics")
{'ctas_database': 'fresh_segments',
'ctas_table': 'interest_metrics_modified',
'ctas_query_metadata': _QueryMetadata(execution_id='349c49e1-5136-4e2d-baad-7e7764a0c489', dtype={'rows': 'Int64'}, parse_timestamps=[], parse_dates=[], parse_geometry=[], converters={}, binaries=[], output_location='s3://sql-case-studies/fresh_segments/interest_metrics/tables/349c49e1-5136-4e2d-baad-7e7764a0c489', manifest_location='s3://sql-case-studies/fresh_segments/interest_metrics/tables/349c49e1-5136-4e2d-baad-7e7764a0c489-manifest.csv', raw_payload={'QueryExecutionId': '349c49e1-5136-4e2d-baad-7e7764a0c489', 'Query': 'CREATE TABLE "fresh_segments"."interest_metrics_modified"\nWITH(\n external_location = \'s3://sql-case-studies/fresh_segments/interest_metrics/interest_metrics_modified\',\n write_compression = \'SNAPPY\',\n format = \'PARQUET\')\nAS \nSELECT\n record_month,\n record_year,\n CASE WHEN month_year IS NULL THEN NULL ELSE DATE_PARSE(CONCAT_WS(\'-\', \'01\', month_year), \'%d-%m-%Y\') END AS month_year,\n interest_id,\n composition,\n index_value,\n ranking,\n percentile_ranking\nFROM\n fresh_segments.interest_metrics\nWHERE\n month_year IS NOT NULL', 'StatementType': 'DDL', 'ResultConfiguration': {'OutputLocation': 's3://sql-case-studies/fresh_segments/interest_metrics/tables/349c49e1-5136-4e2d-baad-7e7764a0c489'}, 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}}, 'QueryExecutionContext': {'Database': 'fresh_segments'}, 'Status': {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2025, 1, 20, 15, 8, 51, 96000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2025, 1, 20, 15, 8, 52, 488000, tzinfo=tzlocal())}, 'Statistics': {'EngineExecutionTimeInMillis': 1199, 'DataScannedInBytes': 95524, 'DataManifestLocation': 's3://sql-case-studies/fresh_segments/interest_metrics/tables/349c49e1-5136-4e2d-baad-7e7764a0c489-manifest.csv', 'TotalExecutionTimeInMillis': 1392, 'QueryQueueTimeInMillis': 70, 'ServicePreProcessingTimeInMillis': 96, 'QueryPlanningTimeInMillis': 146, 'ServiceProcessingTimeInMillis': 
27, 'ResultReuseInformation': {'ReusedPreviousResult': False}}, 'WorkGroup': 'primary', 'EngineVersion': {'SelectedEngineVersion': 'AUTO', 'EffectiveEngineVersion': 'Athena engine version 3'}, 'SubstatementType': 'CREATE_TABLE_AS_SELECT'})}
Query executed successfully
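The DATE_PARSE(CONCAT_WS('-', '01', month_year), '%d-%m-%Y') expression in the CTAS query can be sanity-checked locally. A minimal Python sketch of the same transformation (the helper name is hypothetical):

```python
from datetime import date, datetime

def month_year_to_date(month_year: str) -> date:
    """Convert an 'MM-YYYY' string to the first day of that month,
    mirroring DATE_PARSE(CONCAT_WS('-', '01', month_year), '%d-%m-%Y')."""
    return datetime.strptime(f"01-{month_year}", "%d-%m-%Y").date()

print(month_year_to_date("07-2018"))  # 2018-07-01
```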
# New ctas table with original name
query = """
SELECT
*
FROM
fresh_segments.interest_metrics_modified;
"""
athena.create_ctas_table(
database=database,
query=query,
ctas_table="interest_metrics",
s3_output="s3://sql-case-studies/fresh_segments/interest_metrics/",
storage_format=storage_format,
write_compression=write_compression,
wait=wait,
)
# Drop temporary ctas table
athena.drop_table(database=database, table="interest_metrics_modified")
{'ctas_database': 'fresh_segments',
'ctas_table': 'interest_metrics',
'ctas_query_metadata': _QueryMetadata(execution_id='cc381175-9e88-4338-8acc-44814a4465b5', dtype={'rows': 'Int64'}, parse_timestamps=[], parse_dates=[], parse_geometry=[], converters={}, binaries=[], output_location='s3://sql-case-studies/fresh_segments/interest_metrics/tables/cc381175-9e88-4338-8acc-44814a4465b5', manifest_location='s3://sql-case-studies/fresh_segments/interest_metrics/tables/cc381175-9e88-4338-8acc-44814a4465b5-manifest.csv', raw_payload={'QueryExecutionId': 'cc381175-9e88-4338-8acc-44814a4465b5', 'Query': 'CREATE TABLE "fresh_segments"."interest_metrics"\nWITH(\n external_location = \'s3://sql-case-studies/fresh_segments/interest_metrics/interest_metrics\',\n write_compression = \'SNAPPY\',\n format = \'PARQUET\')\nAS \nSELECT\n *\nFROM\n fresh_segments.interest_metrics_modified', 'StatementType': 'DDL', 'ResultConfiguration': {'OutputLocation': 's3://sql-case-studies/fresh_segments/interest_metrics/tables/cc381175-9e88-4338-8acc-44814a4465b5'}, 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}}, 'QueryExecutionContext': {'Database': 'fresh_segments'}, 'Status': {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2025, 1, 20, 15, 8, 54, 384000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2025, 1, 20, 15, 8, 55, 549000, tzinfo=tzlocal())}, 'Statistics': {'EngineExecutionTimeInMillis': 988, 'DataScannedInBytes': 85403, 'DataManifestLocation': 's3://sql-case-studies/fresh_segments/interest_metrics/tables/cc381175-9e88-4338-8acc-44814a4465b5-manifest.csv', 'TotalExecutionTimeInMillis': 1165, 'QueryQueueTimeInMillis': 57, 'ServicePreProcessingTimeInMillis': 105, 'QueryPlanningTimeInMillis': 131, 'ServiceProcessingTimeInMillis': 15, 'ResultReuseInformation': {'ReusedPreviousResult': False}}, 'WorkGroup': 'primary', 'EngineVersion': {'SelectedEngineVersion': 'AUTO', 'EffectiveEngineVersion': 'Athena engine version 3'}, 'SubstatementType': 'CREATE_TABLE_AS_SELECT'})}
Query executed successfully
query = """
SELECT
*
FROM
fresh_segments.interest_metrics TABLESAMPLE BERNOULLI(30);
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| record_month | record_year | month_year | interest_id | composition | index_value | ranking | percentile_ranking | |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 2018 | 2018-07-01 | 6344 | 10.32 | 5.10 | 4 | 99.45 |
| 1 | 7 | 2018 | 2018-07-01 | 69 | 10.82 | 5.03 | 6 | 99.18 |
| 2 | 7 | 2018 | 2018-07-01 | 17 | 7.89 | 4.15 | 19 | 97.39 |
| 3 | 7 | 2018 | 2018-07-01 | 6286 | 14.10 | 3.82 | 25 | 96.57 |
| 4 | 7 | 2018 | 2018-07-01 | 6184 | 13.35 | 3.75 | 29 | 96.02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3925 | 8 | 2019 | 2019-08-01 | 33957 | 1.74 | 0.93 | 1130 | 1.65 |
| 3926 | 8 | 2019 | 2019-08-01 | 136 | 1.74 | 0.93 | 1130 | 1.65 |
| 3927 | 8 | 2019 | 2019-08-01 | 16198 | 1.82 | 0.90 | 1139 | 0.87 |
| 3928 | 8 | 2019 | 2019-08-01 | 15884 | 1.68 | 0.86 | 1144 | 0.44 |
| 3929 | 8 | 2019 | 2019-08-01 | 6225 | 2.15 | 0.86 | 1144 | 0.44 |
3930 rows × 8 columns
query = """
SELECT
COUNT(*) AS record_count
FROM
fresh_segments.interest_metrics;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| record_count | |
|---|---|
| 0 | 13079 |
Q2¶
What is the count of records in the fresh_segments.interest_metrics table for each month_year value, sorted in chronological order (earliest to latest) with the null values appearing first?
query = """
SELECT
month_year,
COUNT(DISTINCT interest_id) AS unique_interest_count
FROM
fresh_segments.interest_metrics
GROUP BY
month_year
ORDER BY
month_year ASC NULLS FIRST;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| month_year | unique_interest_count | |
|---|---|---|
| 0 | 2018-07-01 | 729 |
| 1 | 2018-08-01 | 767 |
| 2 | 2018-09-01 | 780 |
| 3 | 2018-10-01 | 857 |
| 4 | 2018-11-01 | 928 |
| 5 | 2018-12-01 | 995 |
| 6 | 2019-01-01 | 973 |
| 7 | 2019-02-01 | 1121 |
| 8 | 2019-03-01 | 1136 |
| 9 | 2019-04-01 | 1099 |
| 10 | 2019-05-01 | 857 |
| 11 | 2019-06-01 | 824 |
| 12 | 2019-07-01 | 864 |
| 13 | 2019-08-01 | 1149 |
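Athena's ORDER BY ... NULLS FIRST has a direct pandas analogue via the na_position argument of sort_values. A small sketch on hypothetical data (a null date row is included here only to demonstrate the ordering; the cleaned table above contains none):

```python
import pandas as pd

counts = pd.DataFrame(
    {
        "month_year": pd.to_datetime(["2018-08-01", None, "2018-07-01"]),
        "unique_interest_count": [767, 1, 729],
    }
)

# Equivalent of ORDER BY month_year ASC NULLS FIRST
ordered = counts.sort_values("month_year", ascending=True, na_position="first")
print(ordered["unique_interest_count"].tolist())
```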
Q4¶
How many interest_id values exist in the fresh_segments.interest_metrics table but not in the fresh_segments.interest_map table? What about the other way around?
Let A be the set of interest_id values in the fresh_segments.interest_metrics table and B be the set of interest_id values in the fresh_segments.interest_map table.
A \ B & B \ A¶
We can use a full outer join, which returns all rows from both tables regardless of whether there is a match or not:
query = """
SELECT
COUNT(DISTINCT met.interest_id) AS metrics_id_unique_count,
COUNT(DISTINCT map.id) AS map_id_unique_count,
SUM(CASE WHEN met.interest_id IS NULL AND map.id IS NOT NULL THEN 1 ELSE 0 END) AS in_map_not_in_metrics_count,
SUM(CASE WHEN met.interest_id IS NOT NULL AND map.id IS NULL THEN 1 ELSE 0 END) AS in_metrics_not_in_map_count
FROM
fresh_segments.interest_metrics AS met
FULL JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| metrics_id_unique_count | map_id_unique_count | in_map_not_in_metrics_count | in_metrics_not_in_map_count | |
|---|---|---|---|---|
| 0 | 1202 | 1209 | 7 | 0 |
All IDs in interest_metrics are also in interest_map: there are no interest IDs in interest_metrics that are missing from interest_map. In set notation:
- A ∖ B = ∅ (the empty set)
Some IDs in interest_map are not in interest_metrics: there are 7 interest IDs found in interest_map that do not appear in interest_metrics. In set notation:
- B ∖ A = {7 IDs}
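The same asymmetric set differences can be cross-checked in pandas with an outer merge and the indicator flag. A sketch on toy ID frames (not the real tables):

```python
import pandas as pd

# Toy stand-ins: every metrics ID exists in the map, plus two map-only IDs
metrics_ids = pd.DataFrame({"interest_id": [1, 5, 6]})
map_ids = pd.DataFrame({"id": [1, 5, 6, 8, 12]})

merged = metrics_ids.merge(
    map_ids, left_on="interest_id", right_on="id", how="outer", indicator=True
)

# right_only ~ in_map_not_in_metrics, left_only ~ in_metrics_not_in_map
in_map_not_in_metrics = int((merged["_merge"] == "right_only").sum())
in_metrics_not_in_map = int((merged["_merge"] == "left_only").sum())
print(in_map_not_in_metrics, in_metrics_not_in_map)
```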
Q5¶
Summarise the id values in the fresh_segments.interest_map table by their total record count.
query = """
SELECT
COUNT(id) AS raw_count
FROM
fresh_segments.interest_map;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| raw_count | |
|---|---|
| 0 | 1209 |
Q6¶
What sort of table join would be most appropriate to combine the fresh_segments.interest_metrics and fresh_segments.interest_map tables?
A left join of the interest_map table onto interest_metrics is most appropriate because the asymmetric set difference shows that all interest_id values in interest_metrics are present in interest_map.
This ensures all rows from interest_metrics are retained, while relevant data from interest_map is included without unnecessary rows from interest_map that have no match in interest_metrics.
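The chosen join maps directly onto pandas. A minimal sketch, assuming hypothetical miniature frames met and map_df:

```python
import pandas as pd

met = pd.DataFrame({"interest_id": [21246, 21246], "composition": [2.26, 2.13]})
map_df = pd.DataFrame(
    {"id": [21246], "interest_name": ["Readers of El Salvadoran Content"]}
)

# Left join keeps every metrics row; unmatched map-only rows are dropped
joined = met.merge(map_df, left_on="interest_id", right_on="id", how="left")
print(len(joined))
```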
query = """
SELECT
met.interest_id,
met.record_month,
met.record_year,
met.month_year,
met.composition,
met.index_value,
met.ranking,
met.percentile_ranking,
map.interest_name,
map.interest_summary,
map.created_at,
map.last_modified
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
met.interest_id = 21246;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_id | record_month | record_year | month_year | composition | index_value | ranking | percentile_ranking | interest_name | interest_summary | created_at | last_modified | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21246 | 7 | 2018 | 2018-07-01 | 2.26 | 0.65 | 722 | 0.96 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 1 | 21246 | 8 | 2018 | 2018-08-01 | 2.13 | 0.59 | 765 | 0.26 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 2 | 21246 | 9 | 2018 | 2018-09-01 | 2.06 | 0.61 | 774 | 0.77 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 3 | 21246 | 10 | 2018 | 2018-10-01 | 1.74 | 0.58 | 855 | 0.23 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 4 | 21246 | 11 | 2018 | 2018-11-01 | 2.25 | 0.78 | 908 | 2.16 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 5 | 21246 | 12 | 2018 | 2018-12-01 | 1.97 | 0.70 | 983 | 1.21 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 6 | 21246 | 1 | 2019 | 2019-01-01 | 2.05 | 0.76 | 954 | 1.95 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 7 | 21246 | 2 | 2019 | 2019-02-01 | 1.84 | 0.68 | 1109 | 1.07 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 8 | 21246 | 3 | 2019 | 2019-03-01 | 1.75 | 0.67 | 1123 | 1.14 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
| 9 | 21246 | 4 | 2019 | 2019-04-01 | 1.58 | 0.63 | 1092 | 0.64 | Readers of El Salvadoran Content | People reading news from El Salvadoran media s... | 2018-06-11 17:50:04 | 2018-06-11 17:50:04 |
Q7¶
Are there any records in the joined table where the month_year value is before the created_at value from the fresh_segments.interest_map table? Are these values valid and why?
query = """
SELECT
met.interest_id,
met.record_month,
met.record_year,
met.month_year,
met.composition,
met.index_value,
met.ranking,
met.percentile_ranking,
map.interest_name,
map.interest_summary,
map.created_at,
map.last_modified
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
met.month_year < map.created_at;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_id | record_month | record_year | month_year | composition | index_value | ranking | percentile_ranking | interest_name | interest_summary | created_at | last_modified | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32704 | 7 | 2018 | 2018-07-01 | 8.04 | 2.27 | 225 | 69.14 | Major Airline Customers | People visiting sites for major airline brands... | 2018-07-06 14:35:04 | 2018-07-06 14:35:04 |
| 1 | 33191 | 7 | 2018 | 2018-07-01 | 3.99 | 2.11 | 283 | 61.18 | Online Shoppers | People who spend money online | 2018-07-17 10:40:03 | 2018-07-17 10:46:58 |
| 2 | 32703 | 7 | 2018 | 2018-07-01 | 5.53 | 1.80 | 375 | 48.56 | School Supply Shoppers | Consumers shopping for classroom supplies for ... | 2018-07-06 14:35:04 | 2018-07-06 14:35:04 |
| 3 | 32701 | 7 | 2018 | 2018-07-01 | 4.23 | 1.41 | 483 | 33.74 | Womens Equality Advocates | People visiting sites advocating for womens eq... | 2018-07-06 14:35:03 | 2018-07-06 14:35:03 |
| 4 | 32705 | 7 | 2018 | 2018-07-01 | 4.38 | 1.34 | 505 | 30.73 | Certified Events Professionals | Professionals reading industry news and resear... | 2018-07-06 14:35:04 | 2018-07-06 14:35:04 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183 | 49972 | 4 | 2019 | 2019-04-01 | 2.11 | 1.15 | 722 | 34.30 | Horseback Riding Enthusiasts | People reading horseback riding news and resou... | 2019-04-15 18:00:00 | 2019-04-24 17:40:04 |
| 184 | 49502 | 4 | 2019 | 2019-04-01 | 1.87 | 1.12 | 768 | 30.12 | Veterinarians | Veterinarians | 2019-04-08 18:00:05 | 2019-07-09 13:57:13 |
| 185 | 49974 | 4 | 2019 | 2019-04-01 | 2.00 | 1.10 | 799 | 27.30 | Agricultural and Food Issues Researchers | People researching organizations for food and ... | 2019-04-15 18:00:00 | 2019-04-24 17:40:04 |
| 186 | 49973 | 4 | 2019 | 2019-04-01 | 1.66 | 1.03 | 910 | 17.20 | Farm Finance Researchers | People researching financial institutions spec... | 2019-04-15 18:00:00 | 2019-04-24 17:40:04 |
| 187 | 51678 | 5 | 2019 | 2019-05-01 | 1.71 | 1.53 | 475 | 44.57 | Plumbers | Professionals reading industry news and resear... | 2019-05-06 22:00:00 | 2019-05-07 18:50:04 |
188 rows × 12 columns
Some month_year values are before the created_at values because the former is less granular than the latter. The month_year value is the month and year of the interest metrics, while the created_at value is the timestamp of when the interest was created. We also preprocessed the month_year to be the first day of the month, so it is expected that some month_year values will be before the created_at values.
We can verify that there are no records where the difference between the month_year and created_at values exceeds 30 days (roughly one month), which would otherwise suggest data quality issues.
query = """
SELECT
met.interest_id,
met.record_month,
met.record_year,
met.month_year,
met.composition,
met.index_value,
met.ranking,
met.percentile_ranking,
map.interest_name,
map.interest_summary,
map.created_at,
map.last_modified
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
met.month_year < map.created_at
AND DATE_DIFF('DAY', met.month_year, map.created_at) > 30;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_id | record_month | record_year | month_year | composition | index_value | ranking | percentile_ranking | interest_name | interest_summary | created_at | last_modified |
|---|---|---|---|---|---|---|---|---|---|---|---|
Another way to verify is to use the DATE_TRUNC function to truncate the created_at values to the month level (i.e., the first day of each created_at month) and compare the month_year values against the truncated created_at values.
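For reference, the same month-level truncation is easy to mirror in Python with a pandas Timestamp; a small sketch using one created_at value from the results above:

```python
import pandas as pd

created_at = pd.Timestamp("2018-07-06 14:35:04")

# Equivalent of DATE_TRUNC('MONTH', created_at) in Athena
truncated = created_at.to_period("M").to_timestamp()
print(truncated)  # 2018-07-01 00:00:00

# A month_year of 2018-07-01 is NOT before the truncated created_at,
# so this record passes the month-level data-quality check
month_year = pd.Timestamp("2018-07-01")
print(month_year < truncated)
```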
query = """
SELECT
met.interest_id,
met.record_month,
met.record_year,
met.month_year,
met.composition,
met.index_value,
met.ranking,
met.percentile_ranking,
map.interest_name,
map.interest_summary,
map.created_at,
map.last_modified
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
WHERE
met.month_year < DATE_TRUNC('MONTH', map.created_at);
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_id | record_month | record_year | month_year | composition | index_value | ranking | percentile_ranking | interest_name | interest_summary | created_at | last_modified |
|---|---|---|---|---|---|---|---|---|---|---|---|
Interest Analysis¶
Q1¶
Which interests have been present in all month_year dates in our dataset?
query = """
SELECT
map.interest_name,
COUNT(DISTINCT met.month_year) AS total_months
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
interest_name
HAVING
COUNT(DISTINCT met.month_year) = (
SELECT COUNT(DISTINCT month_year)
FROM fresh_segments.interest_metrics
)
ORDER BY
interest_name ASC;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_name | total_months | |
|---|---|---|
| 0 | Accounting & CPA Continuing Education Researchers | 14 |
| 1 | Affordable Hotel Bookers | 14 |
| 2 | Aftermarket Accessories Shoppers | 14 |
| 3 | Alabama Trip Planners | 14 |
| 4 | Alaskan Cruise Planners | 14 |
| ... | ... | ... |
| 475 | World Cup Enthusiasts | 14 |
| 476 | Yachting Enthusiasts | 14 |
| 477 | Yale University Fans | 14 |
| 478 | Yogis | 14 |
| 479 | Zoo Visitors | 14 |
480 rows × 2 columns
query = """
WITH count_month_year AS (
SELECT
interest_id,
COUNT(DISTINCT month_year) AS total_months
FROM
fresh_segments.interest_metrics
GROUP BY
interest_id
)
SELECT
total_months,
COUNT(DISTINCT interest_id) AS interest_counts
FROM
count_month_year
GROUP BY
total_months
ORDER BY
total_months DESC;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| total_months | interest_counts | |
|---|---|---|
| 0 | 14 | 480 |
| 1 | 13 | 82 |
| 2 | 12 | 65 |
| 3 | 11 | 94 |
| 4 | 10 | 86 |
| 5 | 9 | 95 |
| 6 | 8 | 67 |
| 7 | 7 | 90 |
| 8 | 6 | 33 |
| 9 | 5 | 38 |
| 10 | 4 | 32 |
| 11 | 3 | 15 |
| 12 | 2 | 12 |
| 13 | 1 | 13 |
Q2¶
Using this same total_months measure, calculate the cumulative percentage of all records starting at 14 months - which total_months value passes the $90\%$ cumulative percentage value?
query = """
WITH count_month_year AS (
SELECT
interest_id,
COUNT(DISTINCT month_year) AS total_months
FROM
fresh_segments.interest_metrics
GROUP BY
interest_id
),
count_interest_by_total_months AS (
SELECT
total_months,
COUNT(DISTINCT interest_id) AS interest_counts
FROM
count_month_year
GROUP BY
total_months
)
SELECT
total_months,
interest_counts,
ROUND(
100.0 * SUM(interest_counts) OVER (
ORDER BY total_months DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) / SUM(interest_counts) OVER(),
2
) AS cumulative_percentage
FROM
count_interest_by_total_months
ORDER BY
total_months DESC;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| total_months | interest_counts | cumulative_percentage | |
|---|---|---|---|
| 0 | 14 | 480 | 39.93 |
| 1 | 13 | 82 | 46.76 |
| 2 | 12 | 65 | 52.16 |
| 3 | 11 | 94 | 59.98 |
| 4 | 10 | 86 | 67.14 |
| 5 | 9 | 95 | 75.04 |
| 6 | 8 | 67 | 80.62 |
| 7 | 7 | 90 | 88.10 |
| 8 | 6 | 33 | 90.85 |
| 9 | 5 | 38 | 94.01 |
| 10 | 4 | 32 | 96.67 |
| 11 | 3 | 15 | 97.92 |
| 12 | 2 | 12 | 98.92 |
| 13 | 1 | 13 | 100.00 |
The cumulative percentage first surpasses $90\%$ at total_months = 6 (90.85%), where 33 distinct interests have data for exactly 6 of the month_year values in the dataset.
In other words, roughly $90\%$ of the interests have data coverage for at least 6 months, while only about $40\%$ (480 of 1,202) span all 14 available months, highlighting a steady drop-off in longer-term data availability.
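The window-function arithmetic can be double-checked with a plain cumulative sum over the interest counts returned by the previous query:

```python
# Interest counts per total_months, from the previous query (14 down to 1)
total_months = list(range(14, 0, -1))
interest_counts = [480, 82, 65, 94, 86, 95, 67, 90, 33, 38, 32, 15, 12, 13]

total = sum(interest_counts)  # 1202 distinct interests in all
cumulative = 0
for months, count in zip(total_months, interest_counts):
    cumulative += count
    pct = round(100.0 * cumulative / total, 2)
    if pct >= 90:
        print(f"total_months = {months} crosses 90% at {pct}%")
        break
```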
Q3¶
If we were to remove all interest_id values whose total_months value is lower than the one we found in the previous question - how many total data points would we be removing?
query = """
WITH interest_ids_to_remove AS (
SELECT
interest_id
FROM
fresh_segments.interest_metrics
GROUP BY
interest_id
HAVING
COUNT(DISTINCT month_year) < 6
)
SELECT
COUNT(met.interest_id) AS remove_row_count
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN interest_ids_to_remove AS remove ON met.interest_id = remove.interest_id
WHERE
1 = 1
AND remove.interest_id IS NOT NULL;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| remove_row_count | |
|---|---|
| 0 | 400 |
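The removal count can be cross-checked in pandas by filtering on per-interest distinct month counts, mirroring the anti-join above. A sketch on toy data (not the real table):

```python
import pandas as pd

# Toy metrics: interest 1 appears in 6 months, interest 2 in only 2
df = pd.DataFrame(
    {
        "interest_id": [1] * 6 + [2] * 2,
        "month_year": pd.date_range("2018-07-01", periods=6, freq="MS").tolist()
        + pd.date_range("2018-07-01", periods=2, freq="MS").tolist(),
    }
)

# Interests with fewer than 6 distinct months (the HAVING clause)
months_per_interest = df.groupby("interest_id")["month_year"].nunique()
to_remove = months_per_interest[months_per_interest < 6].index

# Rows that would be dropped (the equivalent of the anti-join count)
remove_row_count = int(df["interest_id"].isin(to_remove).sum())
print(remove_row_count)
```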
Q4¶
Does it make sense to remove these data points from a business perspective? Consider an example where an interest has data for all 14 months compared to one with significantly fewer months, and evaluate what having fewer months of data implies from a segmentation perspective. Additionally, if we include all interests regardless of the number of months they appear, how many unique interests are represented in each month?
query = """
SELECT
month_year,
COUNT(DISTINCT interest_id) AS num_unique_interest
FROM
fresh_segments.interest_metrics
GROUP BY
month_year
ORDER BY
num_unique_interest DESC;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| month_year | num_unique_interest | |
|---|---|---|
| 0 | 2019-08-01 | 1149 |
| 1 | 2019-03-01 | 1136 |
| 2 | 2019-02-01 | 1121 |
| 3 | 2019-04-01 | 1099 |
| 4 | 2018-12-01 | 995 |
| 5 | 2019-01-01 | 973 |
| 6 | 2018-11-01 | 928 |
| 7 | 2019-07-01 | 864 |
| 8 | 2018-10-01 | 857 |
| 9 | 2019-05-01 | 857 |
| 10 | 2019-06-01 | 824 |
| 11 | 2018-09-01 | 780 |
| 12 | 2018-08-01 | 767 |
| 13 | 2018-07-01 | 729 |
Metrics computed for interests with limited availability during the sampling period may lead to less reliable or, even worse, biased results. Furthermore, conducting statistical hypothesis testing on smaller sample sizes may reduce the statistical power, making it more challenging to detect true effects.
Segment Analysis¶
Q1¶
What are the top 10 and bottom 10 interests by their maximum composition value, and in which month_year did each interest record that maximum?
query = """
WITH compositions AS (
SELECT
map.interest_name,
met.month_year,
met.composition,
DENSE_RANK() OVER (
PARTITION BY map.interest_name
ORDER BY met.composition DESC
) AS comp_ranks
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN
fresh_segments.interest_map AS map
ON
met.interest_id = map.id
),
top_10 AS (
SELECT
interest_name,
month_year,
composition,
'top_10' AS rank
FROM
compositions
WHERE
comp_ranks = 1
ORDER BY
composition DESC
LIMIT
10
),
bottom_10 AS (
SELECT
interest_name,
month_year,
composition,
'bottom_10' AS rank
FROM
compositions
WHERE
comp_ranks = 1
ORDER BY
composition ASC
LIMIT
10
),
row_bind AS (
SELECT * FROM top_10
UNION ALL
SELECT * FROM bottom_10
)
SELECT
*
FROM
row_bind
ORDER BY
rank DESC,
composition DESC;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_name | month_year | composition | rank | |
|---|---|---|---|---|
| 0 | Work Comes First Travelers | 2018-12-01 | 21.20 | top_10 |
| 1 | Gym Equipment Owners | 2018-07-01 | 18.82 | top_10 |
| 2 | Furniture Shoppers | 2018-07-01 | 17.44 | top_10 |
| 3 | Luxury Retail Shoppers | 2018-07-01 | 17.19 | top_10 |
| 4 | Luxury Boutique Hotel Researchers | 2018-10-01 | 15.15 | top_10 |
| 5 | Luxury Bedding Shoppers | 2018-12-01 | 15.05 | top_10 |
| 6 | Shoe Shoppers | 2018-07-01 | 14.91 | top_10 |
| 7 | Cosmetics and Beauty Shoppers | 2018-07-01 | 14.23 | top_10 |
| 8 | Luxury Hotel Guests | 2018-07-01 | 14.10 | top_10 |
| 9 | Luxury Retail Researchers | 2018-07-01 | 13.97 | top_10 |
| 10 | Readers of Jamaican Content | 2018-07-01 | 1.86 | bottom_10 |
| 11 | Automotive News Readers | 2019-02-01 | 1.84 | bottom_10 |
| 12 | Comedy Fans | 2018-07-01 | 1.83 | bottom_10 |
| 13 | World of Warcraft Enthusiasts | 2019-08-01 | 1.82 | bottom_10 |
| 14 | Miami Heat Fans | 2018-08-01 | 1.81 | bottom_10 |
| 15 | Online Role Playing Game Enthusiasts | 2018-07-01 | 1.73 | bottom_10 |
| 16 | Hearthstone Video Game Fans | 2019-08-01 | 1.66 | bottom_10 |
| 17 | Scifi Movie and TV Enthusiasts | 2018-09-01 | 1.61 | bottom_10 |
| 18 | Action Movie and TV Enthusiasts | 2018-09-01 | 1.59 | bottom_10 |
| 19 | The Sims Video Game Fans | 2019-03-01 | 1.57 | bottom_10 |
Q2¶
Which 5 interests had the lowest average ranking value?
query = """
SELECT
map.interest_name,
AVG(met.ranking) AS avg_ranking,
COUNT(*) AS n
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
map.interest_name
ORDER BY
avg_ranking DESC
LIMIT
5;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_name | avg_ranking | n | |
|---|---|---|---|
| 0 | Hearthstone Video Game Fans | 1141.0 | 1 |
| 1 | The Sims Video Game Fans | 1135.0 | 1 |
| 2 | Grand Theft Auto Video Game Fans | 1110.0 | 2 |
| 3 | Hair Color Shoppers | 1110.0 | 4 |
| 4 | Bigfoot Folklore Enthusiasts | 1078.0 | 2 |
Q3¶
Which 5 interests had the largest standard deviation in their percentile_ranking value?
query = """
SELECT
met.interest_id,
map.interest_name,
STDDEV(met.percentile_ranking) AS pct_ranking_std,
MIN(met.percentile_ranking) AS pct_ranking_min,
APPROX_PERCENTILE(met.percentile_ranking, 0.25) AS pct_ranking_25,
APPROX_PERCENTILE(met.percentile_ranking, 0.50) AS pct_ranking_50,
APPROX_PERCENTILE(met.percentile_ranking, 0.75) AS pct_ranking_75,
MAX(met.percentile_ranking) AS pct_ranking_max,
COUNT(*) AS n
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
GROUP BY
met.interest_id,
map.interest_name
ORDER BY
pct_ranking_std DESC
LIMIT
5;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_id | interest_name | pct_ranking_std | pct_ranking_min | pct_ranking_25 | pct_ranking_50 | pct_ranking_75 | pct_ranking_max | n | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6260 | Blockbuster Movie Fans | 41.273823 | 2.26 | 2.26 | 60.63 | 60.63 | 60.63 | 2 |
| 1 | 131 | Android Fans | 30.720768 | 4.84 | 4.96 | 5.62 | 10.82 | 75.03 | 5 |
| 2 | 150 | TV Junkies | 30.363975 | 10.01 | 37.29 | 37.79 | 49.82 | 93.28 | 5 |
| 3 | 23 | Techies | 30.175047 | 7.92 | 9.46 | 23.85 | 30.90 | 86.69 | 6 |
| 4 | 20764 | Entertainment Industry Decision Makers | 28.974920 | 11.23 | 11.53 | 18.67 | 22.12 | 86.15 | 6 |
Q4¶
For the 5 interests found in the previous question - what were the minimum and maximum percentile_ranking values for each interest, and their corresponding month_year values? Can you describe what is happening for these 5 interests?
query = """
WITH ranked_data AS (
SELECT
interest_id,
month_year,
composition,
ranking,
percentile_ranking,
DENSE_RANK() OVER (PARTITION BY interest_id ORDER BY percentile_ranking DESC) AS percentile_ranking_max,
DENSE_RANK() OVER (PARTITION BY interest_id ORDER BY percentile_ranking ASC) AS percentile_ranking_min
FROM
fresh_segments.interest_metrics
WHERE
1 = 1
AND interest_id IN (
SELECT
interest_id
FROM
fresh_segments.interest_metrics
GROUP BY
interest_id
ORDER BY
STDDEV(percentile_ranking) DESC
LIMIT
5
)
)
SELECT
interest_id,
month_year,
composition,
ranking,
percentile_ranking
FROM
ranked_data
WHERE
1 = 1
AND (
percentile_ranking_max = 1
OR percentile_ranking_min = 1
);
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_id | month_year | composition | ranking | percentile_ranking | |
|---|---|---|---|---|---|
| 0 | 6260 | 2018-07-01 | 5.27 | 287 | 60.63 |
| 1 | 6260 | 2019-08-01 | 1.83 | 1123 | 2.26 |
| 2 | 131 | 2018-07-01 | 5.09 | 182 | 75.03 |
| 3 | 131 | 2019-03-01 | 1.72 | 1081 | 4.84 |
| 4 | 150 | 2018-07-01 | 5.30 | 49 | 93.28 |
| 5 | 150 | 2019-08-01 | 1.94 | 1034 | 10.01 |
| 6 | 23 | 2018-07-01 | 5.41 | 97 | 86.69 |
| 7 | 23 | 2019-08-01 | 1.90 | 1058 | 7.92 |
| 8 | 20764 | 2018-07-01 | 5.85 | 101 | 86.15 |
| 9 | 20764 | 2019-08-01 | 1.91 | 1020 | 11.23 |
One clear pattern emerges: the maximum percentile_ranking values for all five interests occur in July 2018, while the minimum values, with one exception (interest 131, whose minimum falls in March 2019), occur in August 2019.
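The same min/max extraction can be sketched in pandas with `groupby` plus `idxmin`/`idxmax` instead of the two `DENSE_RANK()` windows. The frame below is a toy stand-in for `interest_metrics`; its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for fresh_segments.interest_metrics (illustrative values only).
metrics = pd.DataFrame(
    {
        "interest_id": [6260, 6260, 131, 131, 131],
        "month_year": ["2018-07-01", "2019-08-01", "2018-07-01", "2019-03-01", "2018-10-01"],
        "percentile_ranking": [60.63, 2.26, 75.03, 4.84, 10.82],
    }
)

# For each interest, locate the rows holding its minimum and maximum
# percentile_ranking -- the pandas analogue of the ranking CTE above.
idx = metrics.groupby("interest_id")["percentile_ranking"].agg(["idxmin", "idxmax"])
extremes = metrics.loc[idx.to_numpy().ravel()].sort_values(["interest_id", "month_year"])
print(extremes)
```

Each interest contributes exactly two rows: the month of its lowest and the month of its highest percentile ranking.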
query = """
WITH compositions AS (
SELECT
map.interest_name,
composition,
DENSE_RANK() OVER (
PARTITION BY map.interest_name
ORDER BY met.composition DESC
) AS comp_ranks
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN
fresh_segments.interest_map AS map
ON
met.interest_id = map.id
),
top_10 AS (
SELECT
interest_name,
'top_10' AS rank
FROM
compositions
WHERE
comp_ranks = 1
ORDER BY
composition DESC
LIMIT
10
),
bottom_10 AS (
SELECT
interest_name,
'bottom_10' AS rank
FROM
compositions
WHERE
comp_ranks = 1
ORDER BY
composition ASC
LIMIT
10
),
row_bind AS (
SELECT * FROM top_10
UNION ALL
SELECT * FROM bottom_10
)
SELECT
map.interest_name,
met.month_year,
met.composition,
met.ranking,
rank
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
LEFT JOIN row_bind AS rb ON map.interest_name = rb.interest_name
WHERE
1 = 1
AND rb.interest_name IS NOT NULL
ORDER BY
rank DESC,
interest_name ASC;
"""
top_bottom_10 = athena.query(
database=database, query=query, ctas_approach=ctas_approach
)
top_bottom_10
| interest_name | month_year | composition | ranking | rank | |
|---|---|---|---|---|---|
| 0 | Cosmetics and Beauty Shoppers | 2019-04-01 | 5.10 | 695 | top_10 |
| 1 | Cosmetics and Beauty Shoppers | 2019-02-01 | 6.50 | 584 | top_10 |
| 2 | Cosmetics and Beauty Shoppers | 2019-03-01 | 5.93 | 619 | top_10 |
| 3 | Cosmetics and Beauty Shoppers | 2018-08-01 | 7.98 | 389 | top_10 |
| 4 | Cosmetics and Beauty Shoppers | 2018-12-01 | 6.73 | 487 | top_10 |
| ... | ... | ... | ... | ... | ... |
| 147 | World of Warcraft Enthusiasts | 2018-11-01 | 1.63 | 902 | bottom_10 |
| 148 | World of Warcraft Enthusiasts | 2019-08-01 | 1.82 | 1139 | bottom_10 |
| 149 | World of Warcraft Enthusiasts | 2019-03-01 | 1.52 | 1130 | bottom_10 |
| 150 | World of Warcraft Enthusiasts | 2019-04-01 | 1.58 | 1076 | bottom_10 |
| 151 | World of Warcraft Enthusiasts | 2018-07-01 | 1.74 | 712 | bottom_10 |
152 rows × 5 columns
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
top_10_data = top_bottom_10.loc[top_bottom_10["rank"] == "top_10"].sort_values(
by="month_year"
)
bottom_10_data = top_bottom_10.loc[top_bottom_10["rank"] == "bottom_10"].sort_values(
by="month_year"
)
# Top 10 Interests
for name, group in top_10_data.groupby("interest_name"):
axes[0].plot(group["month_year"], group["composition"], label=name, marker="o")
axes[0].set_title("Top 10 Interests Over Time")
axes[0].set_xlabel("Month-Year")
axes[0].set_ylabel("Composition (%)")
axes[0].legend(title="Interest Name", fontsize=8)
axes[0].grid(True)
# Bottom 10 Interests
for name, group in bottom_10_data.groupby("interest_name"):
axes[1].plot(group["month_year"], group["composition"], label=name, marker="o")
axes[1].set_title("Bottom 10 Interests Over Time")
axes[1].set_xlabel("Month-Year")
axes[1].legend(title="Interest Name", fontsize=8)
axes[1].grid(True)
plt.tight_layout()
plt.show();
Risk Analysis for Top 10 Interests¶
To assess the risk of focusing on the top 10 interests, we can evaluate the variability in their composition values over time. One effective way to define risk is to analyze the distribution of composition values for each interest across the observed period.
By quantifying the fluctuations in composition, we can construct a risk score for each of the top 10 interests. These scores reflect the degree of uncertainty or variability in customer engagement with each interest. A higher risk score indicates greater variability, signaling potential uncertainty in prioritizing that interest.
This approach allows us to interpret the risk scores as indicators of the stability of each interest’s engagement levels over time, providing actionable insights for prioritization and resource allocation.
def define_priors(data: pd.Series) -> Dict[str, float]:
"""
Define priors based on the provided data.
- A Normal prior is used for the mean, centered around the observed mean of the data
- A flexible Normal prior allows for uncertainty, with its standard deviation set to twice the observed standard deviation of the data
- An Inverse Gamma prior is used for the variance. This is a common choice for modeling uncertainty in scale parameters like variance
or standard deviation. The shape (alpha) is set to 2, and the scale (beta) is informed by the data's variance (`data.var()`).
Parameters
----------
data : pd.Series
The data series to calculate priors from.
Returns
-------
Dict[str, float]
A dictionary containing the calculated priors:
- mu_prior_mean: Prior mean for mu.
- mu_prior_std: Prior standard deviation for mu.
- sigma_prior_alpha: Alpha parameter for the InverseGamma prior on sigma.
- sigma_prior_beta: Beta parameter for the InverseGamma prior on sigma.
"""
mu_prior_mean = data.mean()
mu_prior_std = data.std() * 2
sigma_prior_alpha = 2
sigma_prior_beta = data.var()
return {
"mu_prior_mean": mu_prior_mean,
"mu_prior_std": mu_prior_std,
"sigma_prior_alpha": sigma_prior_alpha,
"sigma_prior_beta": sigma_prior_beta,
}
def run_bayesian_model(
data: pd.Series,
priors: Dict[str, float],
    samples: int = 2000,
    tune: int = 1000,
    hdi_prob: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, Dict[str, Tuple[float, float]]]:
"""
Run Bayesian inference on the data and return posterior samples.
Parameters
----------
data : pd.Series
The observed data series.
priors : dict
A dictionary containing the priors for the model.
    samples : int
        The number of posterior samples to draw, by default 2000.
    tune : int
        The number of tuning steps for the sampler, by default 1000.
    hdi_prob : float
        Probability mass for which the highest density interval is computed, by default 0.95.
Returns
-------
Tuple[np.ndarray, np.ndarray, Dict[str, Tuple[float, float]]]
- posterior_mu: Posterior samples for the mean (mu).
- posterior_sigma: Posterior samples for the standard deviation (sigma).
- hdi_mu: HDI for the posterior mean.
"""
with pm.Model() as model:
# Priors
mu = pm.Normal("mu", mu=priors["mu_prior_mean"], sigma=priors["mu_prior_std"])
sigma = pm.InverseGamma(
"sigma", alpha=priors["sigma_prior_alpha"], beta=priors["sigma_prior_beta"]
)
# Likelihood
composition_obs = pm.Normal(
"composition_obs", mu=mu, sigma=sigma, observed=data
)
# Posterior Sampling
trace = pm.sample(
samples, tune=tune, return_inferencedata=True, progressbar=False
)
# Calculate HDI for mu
hdi_mu = pm.hdi(trace.posterior["mu"], hdi_prob=hdi_prob)
posterior_mu = trace.posterior["mu"].values.flatten()
posterior_sigma = trace.posterior["sigma"].values.flatten()
return posterior_mu, posterior_sigma, hdi_mu
def compute_risk_score(posterior_mu: np.ndarray) -> float:
"""
Calculate the risk score as the standard deviation of the posterior mean.
Parameters
----------
posterior_mu : np.ndarray
The posterior samples for the mean (mu).
Returns
-------
float
The calculated risk score.
"""
return np.std(posterior_mu)
risk_scores = []
for interest_name in top_10_data["interest_name"].unique():
# Filter data for the current interest
interest_data = top_10_data[top_10_data["interest_name"] == interest_name][
"composition"
]
# Define priors
priors = define_priors(interest_data)
# Run Bayesian model and get HDI
posterior_mu, _, hdi_mu = run_bayesian_model(interest_data, priors, hdi_prob=0.97)
# Compute risk score as the standard deviations of the posterior means
risk_score = compute_risk_score(posterior_mu)
risk_scores.append(
{
"interest_name": interest_name,
"risk_score": risk_score,
"hdi_lower_mean_composition": hdi_mu["mu"].values[0],
"hdi_upper_mean_composition": hdi_mu["mu"].values[1],
}
)
risk_scores_data = pd.DataFrame(risk_scores).sort_values("risk_score", ascending=False)
risk_scores_data
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [mu, sigma] Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
| interest_name | risk_score | hdi_lower_mean_composition | hdi_upper_mean_composition | |
|---|---|---|---|---|
| 7 | Work Comes First Travelers | 1.645555 | 13.992587 | 21.492166 |
| 9 | Luxury Bedding Shoppers | 1.208149 | 8.508197 | 13.899054 |
| 2 | Luxury Boutique Hotel Researchers | 1.100884 | 8.039092 | 13.009123 |
| 5 | Luxury Hotel Guests | 1.001667 | 7.134219 | 11.614973 |
| 3 | Gym Equipment Owners | 0.944078 | 8.352871 | 12.541372 |
| 1 | Luxury Retail Shoppers | 0.915064 | 7.708057 | 11.805010 |
| 6 | Furniture Shoppers | 0.847117 | 7.631271 | 11.361080 |
| 4 | Shoe Shoppers | 0.811842 | 4.634176 | 8.229593 |
| 8 | Cosmetics and Beauty Shoppers | 0.790816 | 4.579328 | 8.025522 |
| 0 | Luxury Retail Researchers | 0.741084 | 4.474628 | 7.776604 |
Index Analysis¶
The index_value is a measure that can be used to reverse-calculate the average composition for Fresh Segments’ clients.
Average composition can be calculated by dividing the composition column by the index_value column, rounded to 2 decimal places.
$$ \text{Average Composition} = \frac{\text{Composition}}{\text{Index Value}} $$
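As a sanity check, the same division can be reproduced in pandas. The rows below are toy stand-ins for `composition` and `index_value`; the values are illustrative only.

```python
import pandas as pd

# Toy rows mirroring composition and index_value in interest_metrics.
metrics = pd.DataFrame({"composition": [5.27, 6.00], "index_value": [1.52, 2.00]})

# Average composition = composition / index_value, rounded to 2 decimal places,
# matching ROUND((composition / index_value), 2) in the queries below.
metrics["avg_composition"] = (metrics["composition"] / metrics["index_value"]).round(2)
print(metrics)
```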
Q1¶
What are the top 10 interests by average composition for each month?
query = """
WITH
ranked_avg_comp AS (
SELECT
met.month_year,
map.interest_name,
ROUND((met.composition / met.index_value), 2) AS avg_composition,
DENSE_RANK() OVER (
PARTITION BY
met.month_year
ORDER BY
(met.composition / met.index_value) DESC
) AS ranks
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
)
SELECT
month_year,
interest_name,
avg_composition,
ranks
FROM
ranked_avg_comp
WHERE
1 = 1
AND ranks <= 10
ORDER BY
month_year ASC,
avg_composition DESC;
"""
avg_composition_bv_month_year = athena.query(
database=database, query=query, ctas_approach=ctas_approach
)
avg_composition_bv_month_year
| month_year | interest_name | avg_composition | ranks | |
|---|---|---|---|---|
| 0 | 2018-07-01 | Las Vegas Trip Planners | 7.36 | 1 |
| 1 | 2018-07-01 | Gym Equipment Owners | 6.94 | 2 |
| 2 | 2018-07-01 | Cosmetics and Beauty Shoppers | 6.78 | 3 |
| 3 | 2018-07-01 | Luxury Retail Shoppers | 6.61 | 4 |
| 4 | 2018-07-01 | Furniture Shoppers | 6.51 | 5 |
| ... | ... | ... | ... | ... |
| 135 | 2019-08-01 | Luxury Retail Shoppers | 2.59 | 6 |
| 136 | 2019-08-01 | Furniture Shoppers | 2.59 | 7 |
| 137 | 2019-08-01 | Marijuana Legalization Advocates | 2.56 | 8 |
| 138 | 2019-08-01 | Medicare Researchers | 2.55 | 9 |
| 139 | 2019-08-01 | Recently Retired Individuals | 2.53 | 10 |
140 rows × 4 columns
avg_comp_wide = avg_composition_bv_month_year.pivot(
index="month_year", columns="interest_name", values="avg_composition"
)
avg_comp_wide.index = avg_comp_wide.index.astype(pd.StringDtype())
avg_comp_wide
| interest_name | Alabama Trip Planners | Asian Food Enthusiasts | Chelsea Fans | Christmas Celebration Researchers | Cosmetics and Beauty Shoppers | Cruise Travel Intenders | Family Adventures Travelers | Furniture Shoppers | Gamers | Gym Equipment Owners | ... | Nursing and Physicians Assistant Journal Researchers | PlayStation Enthusiasts | Readers of Catholic News | Readers of Honduran Content | Recently Retired Individuals | Restaurant Supply Shoppers | Solar Energy Researchers | Teen Girl Clothing Shoppers | Video Gamers | Work Comes First Travelers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| month_year | |||||||||||||||||||||
| 2018-07-01 | NaN | 6.10 | NaN | NaN | 6.78 | NaN | 4.85 | 6.51 | NaN | 6.94 | ... | NaN | NaN | NaN | NaN | 5.72 | NaN | NaN | NaN | NaN | 4.80 |
| 2018-08-01 | 4.83 | 5.68 | NaN | NaN | 6.28 | NaN | NaN | 6.30 | NaN | 6.62 | ... | NaN | NaN | NaN | NaN | 5.58 | NaN | NaN | NaN | NaN | 5.70 |
| 2018-09-01 | 7.27 | NaN | NaN | 6.47 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.70 | NaN | NaN | 7.60 | NaN | 6.25 | 6.24 | 6.53 | NaN | 8.26 |
| 2018-10-01 | 7.10 | NaN | NaN | 6.72 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 7.02 | NaN | NaN | 7.02 | NaN | NaN | 6.50 | 6.78 | NaN | 9.14 |
| 2018-11-01 | 6.69 | NaN | NaN | 6.08 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.65 | NaN | NaN | 7.09 | NaN | 5.59 | 7.05 | 5.95 | NaN | 8.28 |
| 2018-12-01 | 6.68 | NaN | 5.86 | 6.09 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.96 | NaN | NaN | 6.58 | NaN | NaN | 6.55 | 6.38 | NaN | 8.31 |
| 2019-01-01 | 6.44 | NaN | NaN | 5.65 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.46 | NaN | 5.48 | 6.67 | NaN | NaN | 7.05 | 5.96 | NaN | 7.66 |
| 2019-02-01 | 6.65 | NaN | NaN | 5.98 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.84 | 6.23 | NaN | 6.24 | NaN | NaN | 6.58 | 6.29 | NaN | 7.66 |
| 2019-03-01 | 6.54 | NaN | NaN | 5.61 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.52 | 6.06 | 5.65 | 6.21 | NaN | NaN | 6.40 | 6.01 | NaN | NaN |
| 2019-04-01 | 6.21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.01 | 5.52 | 5.30 | 6.02 | NaN | 5.07 | 6.28 | 5.39 | NaN | NaN |
| 2019-05-01 | 3.34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.29 | NaN | ... | 3.15 | 3.55 | 4.08 | 4.41 | NaN | NaN | 3.92 | NaN | 3.19 | NaN |
| 2019-06-01 | NaN | 2.52 | NaN | NaN | 2.55 | 2.2 | NaN | 2.39 | NaN | 2.55 | ... | NaN | NaN | NaN | NaN | 2.27 | NaN | NaN | NaN | NaN | NaN |
| 2019-07-01 | NaN | 2.78 | NaN | NaN | 2.78 | NaN | NaN | 2.79 | NaN | 2.79 | ... | NaN | NaN | NaN | NaN | 2.72 | NaN | NaN | NaN | NaN | NaN |
| 2019-08-01 | NaN | 2.68 | NaN | NaN | 2.73 | NaN | NaN | 2.59 | NaN | 2.72 | ... | NaN | NaN | NaN | NaN | 2.53 | NaN | 2.66 | NaN | NaN | NaN |
14 rows × 30 columns
plt.figure(figsize=(12, 8))
sns.heatmap(avg_comp_wide.T, cmap="coolwarm", annot=True, fmt=".1f", linewidths=0.5)
plt.title("Heatmap of Average Composition (Top 10 Interests)")
plt.xlabel("Month-Year")
plt.ylabel("Interest Name")
plt.tight_layout()
plt.show();
Q2¶
For all of these top 10 interests - which interest appears the most often?
query = """
WITH
ranked_avg_comp AS (
SELECT
met.month_year,
map.interest_name,
ROUND((met.composition / met.index_value), 2) AS avg_composition,
DENSE_RANK() OVER (
PARTITION BY
met.month_year
ORDER BY
(met.composition / met.index_value) DESC
) AS ranks
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
)
SELECT
interest_name,
COUNT(*) AS count
FROM
ranked_avg_comp
WHERE
1 = 1
AND ranks <= 10
GROUP BY
interest_name
ORDER BY
count DESC;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| interest_name | count | |
|---|---|---|
| 0 | Luxury Bedding Shoppers | 10 |
| 1 | Alabama Trip Planners | 10 |
| 2 | Solar Energy Researchers | 10 |
| 3 | Readers of Honduran Content | 9 |
| 4 | New Years Eve Party Ticket Purchasers | 9 |
| 5 | Nursing and Physicians Assistant Journal Resea... | 9 |
| 6 | Work Comes First Travelers | 8 |
| 7 | Teen Girl Clothing Shoppers | 8 |
| 8 | Christmas Celebration Researchers | 7 |
| 9 | Asian Food Enthusiasts | 5 |
| 10 | Recently Retired Individuals | 5 |
| 11 | Luxury Retail Shoppers | 5 |
| 12 | Gym Equipment Owners | 5 |
| 13 | Cosmetics and Beauty Shoppers | 5 |
| 14 | Las Vegas Trip Planners | 5 |
| 15 | Furniture Shoppers | 5 |
| 16 | PlayStation Enthusiasts | 4 |
| 17 | Readers of Catholic News | 4 |
| 18 | Restaurant Supply Shoppers | 3 |
| 19 | Medicare Researchers | 3 |
| 20 | Medicare Provider Researchers | 2 |
| 21 | Gamers | 1 |
| 22 | Medicare Price Shoppers | 1 |
| 23 | Luxury Boutique Hotel Researchers | 1 |
| 24 | Marijuana Legalization Advocates | 1 |
| 25 | Chelsea Fans | 1 |
| 26 | Family Adventures Travelers | 1 |
| 27 | Video Gamers | 1 |
| 28 | Cruise Travel Intenders | 1 |
| 29 | HDTV Researchers | 1 |
Q3¶
What is the average of the average composition for the top 10 interests for each month?
query = """
WITH
ranked_avg_comp AS (
SELECT
met.month_year,
map.interest_name,
ROUND((met.composition / met.index_value), 2) AS avg_composition,
DENSE_RANK() OVER (
PARTITION BY
met.month_year
ORDER BY
(met.composition / met.index_value) DESC
) AS ranks
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
)
SELECT
month_year,
AVG(avg_composition) AS avg_of_avg_composition
FROM
ranked_avg_comp
WHERE
1 = 1
AND ranks <= 10
GROUP BY
month_year
ORDER BY
month_year ASC;
"""
avg_of_avg_comp = athena.query(
database=database, query=query, ctas_approach=ctas_approach
)
avg_of_avg_comp
| month_year | avg_of_avg_composition | |
|---|---|---|
| 0 | 2018-07-01 | 6.038 |
| 1 | 2018-08-01 | 5.945 |
| 2 | 2018-09-01 | 6.895 |
| 3 | 2018-10-01 | 7.066 |
| 4 | 2018-11-01 | 6.623 |
| 5 | 2018-12-01 | 6.652 |
| 6 | 2019-01-01 | 6.399 |
| 7 | 2019-02-01 | 6.579 |
| 8 | 2019-03-01 | 6.168 |
| 9 | 2019-04-01 | 5.750 |
| 10 | 2019-05-01 | 3.537 |
| 11 | 2019-06-01 | 2.427 |
| 12 | 2019-07-01 | 2.765 |
| 13 | 2019-08-01 | 2.631 |
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(
avg_of_avg_comp["month_year"], avg_of_avg_comp["avg_of_avg_composition"], marker="o"
)
ax.set_title("Average of Average Composition Over Time")
ax.set_xlabel("Month-Year")
plt.tight_layout()
plt.show();
query = """
SELECT
met.month_year,
map.interest_name,
ROUND((met.composition / met.index_value), 2) AS avg_composition,
DENSE_RANK() OVER (
PARTITION BY
met.month_year
ORDER BY
CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
) AS ranks
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id;
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| month_year | interest_name | avg_composition | ranks | |
|---|---|---|---|---|
| 0 | 2018-12-01 | Work Comes First Travelers | 8.31 | 1 |
| 1 | 2018-12-01 | Nursing and Physicians Assistant Journal Resea... | 6.96 | 2 |
| 2 | 2018-12-01 | Alabama Trip Planners | 6.68 | 3 |
| 3 | 2018-12-01 | Luxury Bedding Shoppers | 6.63 | 4 |
| 4 | 2018-12-01 | Readers of Honduran Content | 6.58 | 5 |
| ... | ... | ... | ... | ... |
| 13074 | 2019-08-01 | Small Business Employees | 1.14 | 1072 |
| 13075 | 2019-08-01 | Toronto Blue Jays Fans | 1.11 | 1073 |
| 13076 | 2019-08-01 | Price Conscious Home Shoppers | 1.10 | 1074 |
| 13077 | 2019-08-01 | Pet Store Goers | 1.06 | 1075 |
| 13078 | 2019-08-01 | Health & Fitness | 0.96 | 1076 |
13079 rows × 4 columns
Window Computations - Previous Months' Data Formatting¶
CASE
WHEN LAG(avg_composition, 1) OVER w IS NULL
THEN NULL
ELSE CONCAT_WS(': ', LAG(interest_name, 1) OVER w, CAST(LAG(avg_composition, 1) OVER w AS VARCHAR))
END AS one_month_ago,
CASE
WHEN LAG(avg_composition, 2) OVER w IS NULL
THEN NULL
ELSE CONCAT_WS(': ', LAG(interest_name, 2) OVER w, CAST(LAG(avg_composition, 2) OVER w AS VARCHAR))
END AS two_months_ago
This code creates two columns that show historical data in a formatted string:
- one_month_ago: shows data from the previous month
- two_months_ago: shows data from two months ago
For each row, the logic works as follows:
- Check whether the previous period's value is missing (NULL)
- If it is missing, output NULL
- If data exists, combine the interest name and its composition value into a formatted string like "Interest Name: 7.36"
The LAG function is what looks back in time: LAG(expr, 1) looks back one row (here, one month) and LAG(expr, 2) looks back two rows. This gives a rolling historical view of which interests were most popular in previous months.
For example, if we're looking at August's data:
- one_month_ago shows July's top interest
- two_months_ago shows June's top interest
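A pandas equivalent of this LAG logic uses `shift` on a frame ordered by month. The rows below are toy per-month winners with illustrative values, not query output.

```python
import pandas as pd

# Toy per-month winners, ordered by month_year (illustrative values).
top = pd.DataFrame(
    {
        "month_year": ["2018-07-01", "2018-08-01", "2018-09-01"],
        "interest_name": [
            "Las Vegas Trip Planners",
            "Las Vegas Trip Planners",
            "Work Comes First Travelers",
        ],
        "max_avg_composition": [7.36, 7.21, 8.26],
    }
).sort_values("month_year")

# shift(1) is the pandas analogue of LAG(..., 1) OVER (ORDER BY month_year);
# str.cat propagates NaN for the first month, like the CASE ... IS NULL branch.
prev_name = top["interest_name"].shift(1)
prev_comp = top["max_avg_composition"].shift(1)
top["one_month_ago"] = prev_name.str.cat(prev_comp.astype(str), sep=": ")
print(top[["month_year", "one_month_ago"]])
```

The first month has no predecessor, so its `one_month_ago` value stays missing, mirroring the NULL branch of the SQL CASE expression.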
query = """
WITH ranked_avg_comp_by_month AS (
SELECT
met.month_year,
map.interest_name,
ROUND((met.composition / met.index_value), 2) AS avg_composition,
DENSE_RANK() OVER (
PARTITION BY
met.month_year
ORDER BY
CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
) AS ranks
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
)
SELECT
month_year,
interest_name,
avg_composition AS max_avg_composition,
ROUND(AVG(avg_composition) OVER w_3_months, 2) AS three_month_moving_avg,
CASE
WHEN LAG(avg_composition, 1) OVER w IS NULL
THEN NULL
ELSE CONCAT_WS(': ', LAG(interest_name, 1) OVER w, CAST(LAG(avg_composition, 1) OVER w AS VARCHAR))
END AS one_month_ago,
CASE
WHEN LAG(avg_composition, 2) OVER w IS NULL
THEN NULL
ELSE CONCAT_WS(': ', LAG(interest_name, 2) OVER w, CAST(LAG(avg_composition, 2) OVER w AS VARCHAR))
END AS two_months_ago
FROM
ranked_avg_comp_by_month
WHERE
1 = 1
AND ranks = 1
WINDOW
w AS (ORDER BY month_year ASC),
w_3_months AS (ORDER BY month_year ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);
"""
athena.query(database=database, query=query, ctas_approach=ctas_approach)
| month_year | interest_name | max_avg_composition | three_month_moving_avg | one_month_ago | two_months_ago | |
|---|---|---|---|---|---|---|
| 0 | 2018-07-01 | Las Vegas Trip Planners | 7.36 | 7.36 | <NA> | <NA> |
| 1 | 2018-08-01 | Las Vegas Trip Planners | 7.21 | 7.29 | Las Vegas Trip Planners: 7.36 | <NA> |
| 2 | 2018-09-01 | Work Comes First Travelers | 8.26 | 7.61 | Las Vegas Trip Planners: 7.21 | Las Vegas Trip Planners: 7.36 |
| 3 | 2018-10-01 | Work Comes First Travelers | 9.14 | 8.20 | Work Comes First Travelers: 8.26 | Las Vegas Trip Planners: 7.21 |
| 4 | 2018-11-01 | Work Comes First Travelers | 8.28 | 8.56 | Work Comes First Travelers: 9.14 | Work Comes First Travelers: 8.26 |
| 5 | 2018-12-01 | Work Comes First Travelers | 8.31 | 8.58 | Work Comes First Travelers: 8.28 | Work Comes First Travelers: 9.14 |
| 6 | 2019-01-01 | Work Comes First Travelers | 7.66 | 8.08 | Work Comes First Travelers: 8.31 | Work Comes First Travelers: 8.28 |
| 7 | 2019-02-01 | Work Comes First Travelers | 7.66 | 7.88 | Work Comes First Travelers: 7.66 | Work Comes First Travelers: 8.31 |
| 8 | 2019-03-01 | Alabama Trip Planners | 6.54 | 7.29 | Work Comes First Travelers: 7.66 | Work Comes First Travelers: 7.66 |
| 9 | 2019-04-01 | Solar Energy Researchers | 6.28 | 6.83 | Alabama Trip Planners: 6.54 | Work Comes First Travelers: 7.66 |
| 10 | 2019-05-01 | Readers of Honduran Content | 4.41 | 5.74 | Solar Energy Researchers: 6.28 | Alabama Trip Planners: 6.54 |
| 11 | 2019-06-01 | Las Vegas Trip Planners | 2.77 | 4.49 | Readers of Honduran Content: 4.41 | Solar Energy Researchers: 6.28 |
| 12 | 2019-07-01 | Las Vegas Trip Planners | 2.82 | 3.33 | Las Vegas Trip Planners: 2.77 | Readers of Honduran Content: 4.41 |
| 13 | 2019-08-01 | Cosmetics and Beauty Shoppers | 2.73 | 2.77 | Las Vegas Trip Planners: 2.82 | Las Vegas Trip Planners: 2.77 |
Final Results¶
query = """
WITH ranked_avg_comp_by_month AS (
SELECT
met.month_year,
map.interest_name,
ROUND((met.composition / met.index_value), 2) AS avg_composition,
DENSE_RANK() OVER (
PARTITION BY
met.month_year
ORDER BY
CAST(met.composition AS DOUBLE) / CAST(met.index_value AS DOUBLE) DESC
) AS ranks
FROM
fresh_segments.interest_metrics AS met
LEFT JOIN fresh_segments.interest_map AS map ON met.interest_id = map.id
),
window_computations AS (
SELECT
month_year,
interest_name,
avg_composition AS max_avg_composition,
ROUND(AVG(avg_composition) OVER w_3_months, 2) AS three_month_moving_avg,
CASE
WHEN LAG(avg_composition, 1) OVER w IS NULL
THEN NULL
ELSE CONCAT_WS(': ', LAG(interest_name, 1) OVER w, CAST(LAG(avg_composition, 1) OVER w AS VARCHAR))
END AS one_month_ago,
CASE
WHEN LAG(avg_composition, 2) OVER w IS NULL
THEN NULL
ELSE CONCAT_WS(': ', LAG(interest_name, 2) OVER w, CAST(LAG(avg_composition, 2) OVER w AS VARCHAR))
END AS two_months_ago
FROM
ranked_avg_comp_by_month
WHERE
1 = 1
AND ranks = 1
WINDOW
w AS (ORDER BY month_year ASC),
w_3_months AS (ORDER BY month_year ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
)
SELECT
*
FROM
window_computations
WHERE
1 = 1
AND (one_month_ago IS NOT NULL AND two_months_ago IS NOT NULL)
ORDER BY
month_year ASC;
"""
max_avg_comp_by_month_year = athena.query(
database=database, query=query, ctas_approach=ctas_approach
)
max_avg_comp_by_month_year
| month_year | interest_name | max_avg_composition | three_month_moving_avg | one_month_ago | two_months_ago | |
|---|---|---|---|---|---|---|
| 0 | 2018-09-01 | Work Comes First Travelers | 8.26 | 7.61 | Las Vegas Trip Planners: 7.21 | Las Vegas Trip Planners: 7.36 |
| 1 | 2018-10-01 | Work Comes First Travelers | 9.14 | 8.20 | Work Comes First Travelers: 8.26 | Las Vegas Trip Planners: 7.21 |
| 2 | 2018-11-01 | Work Comes First Travelers | 8.28 | 8.56 | Work Comes First Travelers: 9.14 | Work Comes First Travelers: 8.26 |
| 3 | 2018-12-01 | Work Comes First Travelers | 8.31 | 8.58 | Work Comes First Travelers: 8.28 | Work Comes First Travelers: 9.14 |
| 4 | 2019-01-01 | Work Comes First Travelers | 7.66 | 8.08 | Work Comes First Travelers: 8.31 | Work Comes First Travelers: 8.28 |
| 5 | 2019-02-01 | Work Comes First Travelers | 7.66 | 7.88 | Work Comes First Travelers: 7.66 | Work Comes First Travelers: 8.31 |
| 6 | 2019-03-01 | Alabama Trip Planners | 6.54 | 7.29 | Work Comes First Travelers: 7.66 | Work Comes First Travelers: 7.66 |
| 7 | 2019-04-01 | Solar Energy Researchers | 6.28 | 6.83 | Alabama Trip Planners: 6.54 | Work Comes First Travelers: 7.66 |
| 8 | 2019-05-01 | Readers of Honduran Content | 4.41 | 5.74 | Solar Energy Researchers: 6.28 | Alabama Trip Planners: 6.54 |
| 9 | 2019-06-01 | Las Vegas Trip Planners | 2.77 | 4.49 | Readers of Honduran Content: 4.41 | Solar Energy Researchers: 6.28 |
| 10 | 2019-07-01 | Las Vegas Trip Planners | 2.82 | 3.33 | Las Vegas Trip Planners: 2.77 | Readers of Honduran Content: 4.41 |
| 11 | 2019-08-01 | Cosmetics and Beauty Shoppers | 2.73 | 2.77 | Las Vegas Trip Planners: 2.82 | Las Vegas Trip Planners: 2.77 |
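The window logic above can be sanity-checked in pandas: `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` corresponds to a 3-row rolling mean, and `LAG(expr, n)` to `shift(n)`. A minimal sketch using the first four months of top-ranked values recoverable from the output above (the DataFrame construction is illustrative, not part of the pipeline):

```python
import pandas as pd

# Top-ranked avg_composition per month, taken from the query output above
df = pd.DataFrame(
    {
        "month_year": pd.to_datetime(
            ["2018-07-01", "2018-08-01", "2018-09-01", "2018-10-01"]
        ),
        "interest_name": [
            "Las Vegas Trip Planners",
            "Las Vegas Trip Planners",
            "Work Comes First Travelers",
            "Work Comes First Travelers",
        ],
        "avg_composition": [7.36, 7.21, 8.26, 9.14],
    }
).sort_values("month_year")

# ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -> 3-row rolling mean
df["three_month_moving_avg"] = (
    df["avg_composition"].rolling(3, min_periods=1).mean().round(2)
)

# LAG(expr, n) OVER (ORDER BY month_year) -> shift(n);
# rows without a full lag come out as NaN, mirroring the CASE ... IS NULL guard
for n, col in [(1, "one_month_ago"), (2, "two_months_ago")]:
    df[col] = df["interest_name"].shift(n).str.cat(
        df["avg_composition"].shift(n).astype(str), sep=": "
    )

# Mirror the final filter: both lag columns must be populated
result = df.dropna(subset=["one_month_ago", "two_months_ago"]).reset_index(drop=True)
```

The two surviving rows reproduce the 2018-09 and 2018-10 rows of the query output, which is a cheap way to confirm the SQL window frames do what we intend.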
Q5¶
Provide a possible reason why the max average composition might change from month to month. Could it signal that something is not quite right with the overall business model for Fresh Segments?
Seasonality:
- The composition metric likely reflects customer behaviors that vary with the seasons. The results above show travel-related interests ("Work Comes First Travelers", "Las Vegas Trip Planners") dominating from autumn through winter, consistent with holiday trip planning.
Marketing Campaigns:
- Targeted advertising or promotional campaigns could drive temporary interest in specific segments, increasing their composition metrics.
Shifts in Business Focus:
- Fresh Segments might adjust their offerings or emphasis on particular interests, impacting customer interactions with specific segments.
Changes in Client Lists:
- Fluctuations in the client base (e.g., onboarding new clients or losing existing ones) could affect the overall average composition.
Measurement Issues:
- A sudden, unexplained change might indicate a data quality or calculation issue, especially if it doesn't align with known business activities.
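One quick diagnostic for the "measurement issues" point is to flag months where the max average composition moves sharply relative to the previous month. A minimal sketch using the max_avg_composition values from the final results table above, with an arbitrary 20% month-over-month threshold:

```python
import pandas as pd

# Monthly max average compositions from the final results above
max_avg = pd.Series(
    [8.26, 9.14, 8.28, 8.31, 7.66, 7.66, 6.54, 6.28, 4.41, 2.77, 2.82, 2.73],
    index=pd.period_range("2018-09", periods=12, freq="M"),
)

# Month-over-month relative change; |change| > 20% is flagged for review
mom_change = max_avg.pct_change()
suspect = mom_change[mom_change.abs() > 0.20]
```

Here 2019-05 and 2019-06 would be flagged (drops of roughly 30% and 37%); whether those reflect genuine seasonality, a change in the client list, or a data quality problem is exactly the judgment call discussed above.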
Potential Red Flags for Business Model¶
If the changes in the maximum average composition cannot be explained by predictable factors like seasonality or campaigns, it might indicate:
- Market Misalignment: The target audience for Fresh Segments' offerings may not consistently align with their clients' needs.
- Client Retention Issues: Variability might reflect challenges in maintaining a steady client base.
- Segment Relevance: Some interest segments may not be as relevant or valuable to the clients as anticipated.