This draft skips over a lot of detail about how the implementation works. I’m planning to present it as literate programming, but until then you can read the linked sources.
At our 2016 Christmas party I found myself venting at our CEO, “we’re wasting the operations team’s time,” I said, “they spend half the month on a task I’m sure a computer could do.”
At the time I had no proof, but over the holidays I prototyped a solution and “operationalize commoditized processes”—legalese for “automate repetitive tasks”—appeared in our guiding principles for 2017.
Each month the four-person operations team spent a fortnight verifying that last month’s commissions were imported correctly; and in the all-too-common case that an error was found, they made the appropriate change in the database.
Errors were rarely reported to the integrations team responsible for developing the importers because waiting for a fix took much longer than the manual change, and so their workload grew every month.
My idea was to make operations responsible for developing the importers so that fixing bugs was the path of least resistance. They aren't developers, but they are technically savvy. If you can write a SQL query or a spreadsheet then a parser is well within your grasp.
Commissions
The core business is commissions, Skimlinks acts as a middleman between content publishers and affiliate networks. When a visitor clicks on an outbound link we transparently convert it into a tracking link and if that click leads to a conversion we pass the commission on to the publisher after taking a small cut.
Each affiliate network has a slightly different API for commissions: paginated or not; XML, JSON or CSV; by creation or modified date; etc. But ultimately they all contain the same content1: the ID we set in the affiliate link, the amount of commission earned, and the date of the transaction.
There are actually four types of commission APIs, those reporting baskets vs items, and those reporting states vs differences. Most bugs are the result of developers misidentifying which type they are dealing with, the solution is to have the user to choose the type and enforce appropriate tests.
DSLs and IDEs
Each parser is a single expression mapped over each row that transforms the columns into parameters for the Commission constructor2.
JSON and XML are simple to transform into a columnar format and we already have examples of all the transformations we would want to apply to a column, so I extracted those functions as the built-ins for the parser DSL.
And also how to fetch the report, and an expression that extracts the rows from the report which we deal with similarly.
We can use the types of the built-ins to limit where they are used; for example date is a datetime and thus the only reasonable candidates are compose_datetime and parse_datetime.
In Python we handled errors by raising exceptions, but they only show the first error so for the DSL we use something similar to Haskell’s AccValidation to report all the errors3.
The elegance of AccValidation was discovered when rewriting the prototype code for production, all the implementations here use exceptions.
An IDE for developing and testing parsers in our DSL is straightforward from here.
We use type information (with automatic insertion of <$> and <*>) to limit which functions the user can choose, and map the expression over the rows to support debugging.
Autocomplete
While developing the IDE I noticed that many of my choices were forced: only one column could be the tracking ID, or only one column is a date in the past, or a pair of columns that represented the commission and sale value.
Inspired by Barliman, I wanted the IDE to fill in those choices automatically.
Taking Commission as the root type I searched the graph of all possible programs for implementations that successfully return a commission for each row.
While the assert-based autocomplete eliminated some of the manual work it wasn’t able to decide which date was the transaction date (probably the earliest one) or which number was the commission amount (probably the smallest positive one).
To address this I introduced a variant of tests that return confidence values (i.e. between zero and one) instead of booleans.
When faced with multiple possible solutions these tests could break ties if one solution scored significantly higher than the rest; without a clear winner a human was still needed.
Benefits
Verification time is reduced because operations implicitly trust themselves and fix bugs instead of changing data.
“Realtime” payments are possible (instead of monthly) because manual verification has been eliminated.
Most of the integrations team are able to move to other areas of the business because only the IDE needs maintenance.
Faster and less buggy integration of new affiliate networks because of autocomplete and tests respectively.
Better quality data because special cases are not expressible and autocomplete finds patterns that developers might miss.
Integrations with unstable APIs self-heal by running the autocomplete code on failure (with additional tests that verify the commissions are similar to the previous ones).