I am so sorry.
This is a grab bag of utilities for interactive exploration of MVPD data.
The general idea is...
- Get all the source data.
- Turn all the source data from whatever nonsense it is into CSV.
- Use "standard" Unix-y text processing tools on the CSV data to figure out that thing you're trying to figure out.
If you want to try the happier path and work with this data like a normal, modern person, check out the Vantage Data Livebook Playground.
Again, I am so sorry.
This uses some random stuff.
- The `aws` command line tool: for getting the source files.
- `xlsx2csv` (the Python one, not the Rust one; just because I found it first, not necessarily because it's better): does what it says on the tin.
- `redo`: a dependency-based build tool, used for efficiently translating files.
- Ruby: for random small custom utilities. (Recommended: install through `asdf`.)
You can forge your own path without these things, if you like doing things the extra-hard way.
- `xsv`: really good for working with CSV data.
- `ripgrep`: really good for finding stuff in text files.
- `asdf`: for managing tool versions, mainly Ruby.
- `fish`: because I use fish, and I've written some helpful tiny functions for working with the data.
These examples assume I'm running in a fish shell session.
Always start by getting the latest data from S3 and translating it to CSV. This will be slow the first time:
```
# Get the latest data from S3:
$ redo sync

# Translate/extract/copy the data to CSV:
$ redo -j 9
```
This will take the source files from the `mvpd` directory and create CSV files in the `translated` directory. So, for example, `mvpd/charter/VIACOM_CHARTER_DAILY_20211101_details.txt` will be available as a CSV at `translated/mvpd/charter/VIACOM_CHARTER_DAILY_20211101_details.txt.csv`.
If you only care about a specific partner, you can build just the files for that partner like so:
```
$ redo -j 9 altice
```
You can also adjust the `9` in `-j 9` to change the number of jobs that will run in parallel. Maybe you have more cores and you want to run more jobs, or maybe you want to remove the `-j 9` entirely because you want to wait literally all day.
Let's say we want to see which files have data for order 79535. Since we have all the data as CSV, which is just text, we can just use a standard text search tool, like `ripgrep`:
```
$ rg -l O79535 translated
translated/mvpd/altice/Viacom_Altice_Daily_Report_20220115-20220116.xlsx.csv
translated/mvpd/charter/VIACOM_CHARTER_DAILY_20220131_summary.txt.csv
translated/mvpd/charter/VIACOM_CHARTER_DAILY_20220112_details.txt.csv
...
```
Key points:
- Prefix the order number with `O` (the letter, not the number). Otherwise, you'll also get results for unrelated fields that coincidentally contain the order number, e.g. reach percentages.
- Remember that in rare cases, the order number doesn't start with the letter `O`, so that previous tip might not apply!
- This is just a full-text search. It doesn't respect the CSV structure, so it doesn't just search the segment name. If you're unlucky, you'll get some false positives here. In those cases, you can do a more sophisticated search using `xsv`, like in the next example!
We want to look at the data for order 79535 from one file: `translated/mvpd/charter/VIACOM_CHARTER_DAILY_20220112_details.txt.csv`.
```
# It's usually handy to put these things in variables:
$ set order 79535
$ set file 'translated/mvpd/charter/VIACOM_CHARTER_DAILY_20220112_details.txt.csv'

# Now we'll use `xsv` to filter the data to this order, based on the
# fact that the order number is contained in the Audience Segment Name
# column.
$ xsv search -s 'Audience Segment Name' $order $file
Advertiser,Campaign Name,Campaign Start Date,Campaign End Date,Audience Segment Name,Segment Priority,Target Household Count,Event Date,Event Time,Hour,Daypart,ISCI/ADID,Network,Title,Impressions Delivered,Households Reached,Median Seconds Viewed,Mean Seconds Viewed,Completion Rate
VIACOM_CHARTER_DAILY,WALGREENS / WENDYS 1/10,2022-01-10,2022-01-23,O79535_CHARTER_VA_Walgreens_1PD_E1198117_A,1,832718,2022-01-10,01:19:25,01 AM - 02 AM,12 AM - 2 AM,WALG3151000VH,MTV,,174,173,27.030001,26.771034,99.1520%
VIACOM_CHARTER_DAILY,WALGREENS / WENDYS 1/10,2022-01-10,2022-01-23,O79535_CHARTER_VA_Walgreens_1PD_E1198117_A,1,832718,2022-01-10,02:19:04,02 AM - 03 AM,2 AM - 5 AM,WALG3151000VH,MTV,,123,122,27.030001,26.892195,99.6007%
...

# That's...not super easy to read. We can skip some of this data.
$ xsv search -s 'Audience Segment Name' $order $file |
      xsv select 'Audience Segment Name','Event Date','Event Time','Network','Impressions Delivered'
Audience Segment Name,Event Date,Event Time,Network,Impressions Delivered
O79535_CHARTER_VA_Walgreens_1PD_E1198117_A,2022-01-10,01:19:25,MTV,174
O79535_CHARTER_VA_Walgreens_1PD_E1198117_A,2022-01-10,02:19:04,MTV,123
...

# Still could be nicer to read:
$ xsv search -s 'Audience Segment Name' $order $file |
      xsv select 'Audience Segment Name','Event Date','Event Time','Network','Impressions Delivered' |
      xsv table
Audience Segment Name                       Event Date  Event Time  Network  Impressions Delivered
O79535_CHARTER_VA_Walgreens_1PD_E1198117_A  2022-01-10  01:19:25    MTV      174
O79535_CHARTER_VA_Walgreens_1PD_E1198117_A  2022-01-10  02:19:04    MTV      123
...
```
This is basically a combination of the previous two tactics:
```
$ set order 79535
$ rg -l O$order translated/mvpd/charter/ |
      grep -v '_summary.txt' |
      xargs xsv cat rows |
      xsv search -s 'Audience Segment Name' $order |
      xsv select 'Audience Segment Name','Event Date','Event Time','Network','Impressions Delivered' |
      xsv table
Audience Segment Name                       Event Date  Event Time  Network  Impressions Delivered
O79535_CHARTER_VA_Walgreens_1PD_E1198117_A  2022-01-10  01:19:25    MTV      174
O79535_CHARTER_VA_Walgreens_1PD_E1198117_A  2022-01-10  02:19:04    MTV      123
...
```
- We find all the files that might contain data for this order.
- We exclude Charter's `_summary.txt` files. Those are PCR data, and we just want spot-level data.
- We jam all the data together into one megasheet using `xsv cat rows`.
- We use `xsv search`, `xsv select`, and `xsv table` to filter the rows to our order, narrow down the columns, and make the result human-readable.
In some cases, this is what you want! For Charter, this isn't yet fully helpful, because earlier rows should be replaced by later rows for the same segment, date, time, and network. In that case, what you really want is...
```diff
 $ set order 79535
 $ rg -l O$order translated/mvpd/charter/ |
       grep -v '_summary.txt' |
+      bin/sort-translations-by-upload-time |
       xargs xsv cat rows |
       xsv search -s 'Audience Segment Name' $order |
       xsv select 'Audience Segment Name','Event Date','Event Time','Network','Impressions Delivered' |
+      bin/accumulate |
       xsv table
```
This is the same as before, except with a couple of fun little utilities.

`bin/sort-translations-by-upload-time` takes a list of translated files (i.e., our CSVs) and sorts them by the upload time of the original files. (It figures this out using the modification time of the source files, which `aws s3 sync` helpfully sets. But this means you shouldn't touch the source files!) This is important because the order of upload dictates the order in which the data should be processed.
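For intuition, here's a minimal Ruby sketch of that idea (hypothetical; the real `bin/sort-translations-by-upload-time` may differ):

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch: read translated file paths on stdin, map each back
# to its source (drop the "translated/" prefix and ".csv" suffix), and
# sort by the source file's modification time (i.e., its upload time).
paths = $stdin.readlines(chomp: true)
puts paths.sort_by { |path|
  source = path.sub(%r{\Atranslated/}, "").sub(/\.csv\z/, "")
  File.mtime(source)
}
```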
`bin/accumulate` takes in CSV data of the exact format we're selecting here (Audience Segment Name, Event Date, Event Time, Network, Impressions Delivered) and spits out the same CSV data, but with only the latest row for each grouping of segment, date, time, and network. This is specifically for working with Altice and Charter data, where this rule applies.

In the end, the CSV data you get from `bin/accumulate` should match our Redshift data row-for-row.
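And a minimal Ruby sketch of the accumulate idea (again hypothetical, not the real `bin/accumulate`):

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch: keep only the last row seen for each
# (segment, date, time, network) key. Rows must already arrive in upload
# order, which is why the sorting step above matters.
require "csv"

table = CSV.parse($stdin.read, headers: true)
latest = {}
table.each do |row|
  key = row.values_at("Audience Segment Name", "Event Date", "Event Time", "Network")
  latest[key] = row # later rows win for the same key
end

puts table.headers.to_csv
latest.each_value { |row| puts row.to_csv }
```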
We've just done all this the hard way. Lucky for you, someone has already done it the hard way, and so you can do it the slightly-less-hard way!
Use fish, and load these handy helper functions:
```
$ source lib/charter-help.fish
```
Now let's redo that previous monster pipeline:
```diff
 $ set order 79535
-$ rg -l O$order translated/mvpd/charter/ |
-      grep -v '_summary.txt' |
-      bin/sort-translations-by-upload-time |
+$ rg -l O$order (sorted-charter-spot-files) |
       xargs xsv cat rows |
-      xsv search -s 'Audience Segment Name' $order |
+      filter-segments $order |
-      xsv select 'Audience Segment Name','Event Date','Event Time','Network','Impressions Delivered' |
+      select-relevant-fields |
       bin/accumulate |
       xsv table
```
This does exactly what we did before. In more copy-and-pastable form:
```
$ set order 79535
$ rg -l O$order (sorted-charter-spot-files) |
      xargs xsv cat rows |
      filter-segments $order |
      select-relevant-fields |
      bin/accumulate |
      xsv table
```
It's... "nicer"!
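If you're curious what's behind the curtain: the helpers are just thin wrappers around the commands they replaced. Here's a sketch of what `lib/charter-help.fish` might contain (hypothetical; the real definitions may differ):

```fish
function sorted-charter-spot-files
    # Charter spot-level translations only (no _summary PCR files),
    # sorted by original upload time.
    find translated/mvpd/charter/ -name '*.csv' |
        grep -v '_summary.txt' |
        bin/sort-translations-by-upload-time
end

function filter-segments
    # Filter rows to segments containing the given order number.
    xsv search -s 'Audience Segment Name' $argv[1]
end

function select-relevant-fields
    xsv select 'Audience Segment Name','Event Date','Event Time','Network','Impressions Delivered'
end
```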
Anyway, now that we've produced all the rows that should be in Redshift, we can just take the impressions and add them up!
Most of the time, you're probably not after the total total; you're comparing against our verification bot, which gives totals by week. So we'll also filter to just a week's worth of events here.
```diff
 $ set order 79535
 $ rg -l O$order (sorted-charter-spot-files) |
       xargs xsv cat rows |
       filter-segments $order |
+      limit-dates 2022-01-17 2022-01-23 |
       select-relevant-fields |
       bin/accumulate |
-      xsv table
+      just-impressions |
+      sum-nums
```
Hopefully, this matches the number you get from Redshift!
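Those last two helpers are tiny. Hypothetical sketches (not the real definitions from `lib/charter-help.fish`):

```fish
function just-impressions
    # Keep only the Impressions Delivered column, minus the header row.
    xsv select 'Impressions Delivered' | tail -n +2
end

function sum-nums
    # Add up one number per line.
    awk '{ total += $1 } END { print total }'
end
```

(`limit-dates` is presumably a similar range filter on the Event Date column.)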
If you're still reading this, either you're just really interested in sadistic data analysis (unlikely) or your number didn't match the number you got from Redshift (more likely).
If you don't know where to go from here, you can compare your data to the data in Redshift!
Put your rows into a file:
```
$ set order 79535
$ rg -l O$order (sorted-charter-spot-files) |
      xargs xsv cat rows |
      filter-segments $order |
      limit-dates 2022-01-17 2022-01-23 |
      select-relevant-fields |
      bin/accumulate > my-probably-correct-data.csv
```
Then write a query to get the same data from Redshift. Put something like this into, say, `get-probably-incorrect-data.sql`:
```sql
SELECT
  audience_segment_name,
  event_date,
  event_time,
  network,
  impressions_delivered
FROM mvpd.charter_spot_level
WHERE
  audience_segment_name IN (
    'O79535_CHARTER_VA_Walgreens_1PD_E1198117_A',
    'O79535_CHARTER_VA_Walgreens_1PD_E1198117_A_2'
  )
  AND event_date >= '2022-01-17'
  AND event_date <= '2022-01-23'
ORDER BY audience_segment_name, event_date, event_time, network;
```
In this particular case, we're "lucky" because this one order number contains multiple segment names. We need to account for them both in our query.
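If you don't know all the segment names for an order up front, you can pull them out of the data you just saved:

```
$ xsv select 'Audience Segment Name' my-probably-correct-data.csv | tail -n +2 | sort -u
O79535_CHARTER_VA_Walgreens_1PD_E1198117_A
O79535_CHARTER_VA_Walgreens_1PD_E1198117_A_2
```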
Now, run that query and get the result as CSV:
```
$ psql -h $REDSHIFT_HOST -U $REDSHIFT_USER -p $REDSHIFT_PORT -d $REDSHIFT_DB \
      --csv -f get-probably-incorrect-data.sql -o probably-incorrect-data.csv
```
Now you have your "correct" data in `my-probably-correct-data.csv` and Redshift's "incorrect" data in `probably-incorrect-data.csv`.

We can `diff` them!
```
$ diff probably-incorrect-data.csv my-probably-correct-data.csv
1c1
< audience_segment_name,event_date,event_time,network,impressions_delivered
---
> Audience Segment Name,Event Date,Event Time,Network,Impressions Delivered
1080c1080
< O79535_CHARTER_VA_Walgreens_1PD_E1198117_A_2,2022-01-17,19:21:06,MTV,1
---
> O79535_CHARTER_VA_Walgreens_1PD_E1198117_A_2,2022-01-17,19:21:06,MTV,2
1148c1148
< O79535_CHARTER_VA_Walgreens_1PD_E1198117_A_2,2022-01-17,23:18:38,TVL,1
---
> O79535_CHARTER_VA_Walgreens_1PD_E1198117_A_2,2022-01-17,23:18:38,TVL,4
1214c1214
< O79535_CHARTER_VA_Walgreens_1PD_E1198117_A_2,2022-01-18,08:11:06,MTV,1
---
> O79535_CHARTER_VA_Walgreens_1PD_E1198117_A_2,2022-01-18,08:11:06,MTV,2
```
Now we have something to go on! From here, maybe you could...
- Find the most recent file that these particular segment-date-time-network combinations occurred in, and make sure your data's right. (See the snippet after this list.)
- Query the `file_name` for those rows in the Redshift table, and see if it's what you expect.
- Go on a long, fun data adventure!
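For that first idea: the Event Date and Event Time columns are adjacent in the CSV, so a plain text search can find the combination. Something like:

```
# The most recently uploaded Charter file containing this date/time combo:
$ rg -l '2022-01-17,19:21:06' translated/mvpd/charter/ |
      bin/sort-translations-by-upload-time | tail -n 1
```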
The basic rules here are...
- The `mvpd` directory is kept in sync with the S3 source files. (That is, updates to S3 are synced to `mvpd`, but changes in `mvpd` are not synced back.)
- When files in the `mvpd` directory are added or changed, CSV copies of them are created in the `translated` directory.
- Sometimes the source files in `mvpd` are used for metadata (e.g., to sort them by upload time), so you should be able to figure out the source file for any translated file by removing the `translated/` prefix and the `.csv` suffix. (See the snippet after this list.)
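For example, in fish:

```fish
$ set translated_file translated/mvpd/charter/VIACOM_CHARTER_DAILY_20220112_details.txt.csv
$ string replace -r '^translated/' '' $translated_file | string replace -r '\.csv$' ''
mvpd/charter/VIACOM_CHARTER_DAILY_20220112_details.txt
```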
To manage the translation of files, we use `redo`. Redo serves the same purpose as Make. We started out using Make, but the problem with Make is that it can't comprehend filenames with spaces.

With redo, each target lives in its own file. When you run just plain `redo`, it runs the `all` target, and the code for that lives in `all.do`. `.do` files are basically just shell scripts.
`all` runs all the targets for the partners. Each partner target has a corresponding `.do` file in the root directory (e.g. `altice.do`).
The top-level partner targets will...
- Find the source files we care about.
- Figure out what the translated files should be named.
- Run the targets for the translated files, if the sources have changed. (A sketch of this shape follows.)
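Hypothetically, a partner target might be shaped like this (a sketch under the assumptions above, not the actual `altice.do`):

```sh
# Hypothetical sketch of a top-level partner target like altice.do:
# find the sources, name their translations, and build them.
# (Null-delimited so filenames with spaces survive; see the Make gripe above.)
find mvpd/altice -type f |
    sed -e 's|^|translated/|' -e 's|$|.csv|' |
    tr '\n' '\0' |
    xargs -0 redo-ifchange
```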
The rules for producing the translated files are in `.do` files in the translated directories. For instance, the rule to produce a `.csv` file from an Altice `.xlsx` file is in `translated/mvpd/altice/default.xlsx.csv.do`: this is the rule that `redo` uses to build a `.xlsx.csv` file in the `translated/mvpd/altice/` directory. A partner might have multiple `default.*.csv.do` files, since file formats have changed over time.
The important things to know about these `.do` files are...
- They're basically just shell scripts.
- Standard output will be used to create the target file.
- The variable `$2` contains the target name. (So in the case of generating CSV data with a `default.something.csv.do` file, `$2` would be the name of the translated file, without the directory, like `source-data-file-2022-04-01.something.csv`.)
- If you don't want to just write to stdout, the variable `$3` contains the name of a file that you can write to instead to create the target. (See the sketch after this list.)
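Putting those rules together, a hypothetical `default.xlsx.csv.do` might look something like this (the relative source path and the `xlsx2csv` invocation are assumptions; check the real files):

```sh
# Hypothetical sketch of translated/mvpd/altice/default.xlsx.csv.do.
# $2 is the target name, e.g. Some_Report_20220115.xlsx.csv.

# The source file is the same path under mvpd/, minus the .csv suffix.
# (Assuming redo runs us from the target's directory, hence the ../../../.)
src="../../../mvpd/altice/${2%.csv}"

# Ask redo to rebuild this target whenever the source changes.
redo-ifchange "$src"

# xlsx2csv prints CSV to stdout, which redo captures as the target file.
xlsx2csv "$src"
```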