Thursday, November 21, 2013

Break.com Case Study on ad serving.


Guillaume Roels of the UCLA Anderson School of Management created a 2008 case study on ad serving, used in his MBA courses.

pdf_GR07.pdf

The case studies a 90-day period in the life of an ad server matching supply and demand.  There are three associated data elements:

  • Traffic data for the four inventory zones.  This gives expected "supply" in the form of average ad unit requests (impressions) per zone per day.
  • A set of booked orders with flight dates, budget and impression allocation limits per zone.
  • A set of proposed orders with the same data elements as orders.
For convenience I have put the data from the case study in an Excel file as well as CSV files.


There are many interesting questions to ask with the case:
  1. How much daily or weekly surplus/unsold inventory is available per zone? (A sketch of this calculation follows the list.)
  2. How many of the booked orders will be completed at a given level of supply?
  3. How much total revenue results from fully completed orders?
  4. How much total revenue results if partial credit is given for incomplete orders?
  5. Which of the proposals should be accepted such that they will complete?
  6. If the orders and proposals were treated equally, which of the total set would complete?
  7. How much revenue results from completed and/or partially completed orders?
  8. (advanced) Is there an optimal serving schedule (other than price order) that would result in more total revenue or more completed orders? 
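For question 1, here is a minimal Python sketch of the surplus calculation. The file names, column names and the cpm field (traffic.csv with zone and avg_daily_impressions; booked_orders.csv with one zone allocation per row: zone, start_day, end_day, impressions, cpm) are hypothetical placeholders for however you export the case data, and even pacing of booked demand is assumed.

import csv

def load_traffic(path="traffic.csv"):
    # {zone: expected impressions per day}; file and column names are assumptions
    with open(path) as f:
        return {row["zone"]: float(row["avg_daily_impressions"]) for row in csv.DictReader(f)}

def load_orders(path="booked_orders.csv"):
    # one (order, zone) allocation per row; flight days assumed 1-indexed within the 90-day window
    with open(path) as f:
        return [{"zone": r["zone"], "start": int(r["start_day"]), "end": int(r["end_day"]),
                 "impressions": float(r["impressions"]), "cpm": float(r["cpm"])}
                for r in csv.DictReader(f)]

def daily_surplus(traffic, orders, horizon=90):
    # surplus per zone per day = expected supply minus evenly paced booked demand
    surplus = {zone: [supply] * horizon for zone, supply in traffic.items()}
    for o in orders:
        per_day = o["impressions"] / (o["end"] - o["start"] + 1)
        for day in range(o["start"] - 1, min(o["end"], horizon)):
            surplus[o["zone"]][day] -= per_day
    return surplus
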
While the MBA student might approach this with a spreadsheet analysis, writing code is likely a better approach. When implementing the simulation in code one has to make a few assumptions (a simulation sketch follows this list):
  • ASAP pacing of orders versus even pacing of volume per day across the flight dates.
  • Which traffic level to use per day (low/baseline/high).
  • The mechanism to decide the priority of orders to be served in a given day/zone.
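A minimal sketch of the day-by-day serving simulation under one set of those assumptions (even pacing, baseline traffic, priority by CPM), reusing the hypothetical loaders from the sketch above; it addresses questions 2 and 3 and is easily extended for partial credit.

def simulate(traffic, orders, horizon=90):
    # serve day by day: baseline traffic, even pacing, highest-CPM order first
    delivered = [0.0] * len(orders)
    for day in range(1, horizon + 1):
        remaining = dict(traffic)  # supply left in each zone today
        active = [i for i, o in enumerate(orders) if o["start"] <= day <= o["end"]]
        for i in sorted(active, key=lambda i: orders[i]["cpm"], reverse=True):
            o = orders[i]
            need_today = (o["impressions"] - delivered[i]) / (o["end"] - day + 1)  # even pacing
            served = max(0.0, min(need_today, remaining[o["zone"]]))
            delivered[i] += served
            remaining[o["zone"]] -= served
    completed = [i for i, o in enumerate(orders) if delivered[i] >= 0.999 * o["impressions"]]
    revenue = sum(orders[i]["cpm"] * orders[i]["impressions"] / 1000.0 for i in completed)
    return delivered, completed, revenue
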
This problem relates to both "matchmaking" and "mechanism design" in economics.  It is also a simple example of the problems one might solve in software in "computational advertising".  See below for a course with lecture slides touring the discipline.


Thursday, December 13, 2012

Von Neumann on Empirics vs Hayek on Abstraction

Recently Carson Chow posted a short quote of Von Neumann's musings on the dangers of straying into pure mathematics when an idea originated in an empirical context.  [Hat tip Daniel Lemarie for retweeting it around.]

“[M]athematical ideas originate in empirics, although the genealogy is sometimes long and obscure. But, once they are so conceived, the subject begins to live a peculiar life of its own and is better compared to a creative one, governed by almost entirely aesthetic considerations, than to anything else, and, in particular, to an empirical science. There is, however, a further point which, I believe, needs stressing. As a mathematical discipline travels far from its empirical source, or still more, if it is a second and third generation only indirectly inspired by ideas coming from ‘reality’, it is beset with very grave dangers. It becomes more and more purely aestheticising, more and more purely l’art pour l’art. This need not be bad, if the field is surrounded by correlated subjects, which still have closer empirical connections, or if the discipline is under the influence of men with an exceptionally well-developed taste. But there is a grave danger that the subject will develop along the line of least resistance, that the stream, so far from its source, will separate into a multitude of insignificant branches, and that the discipline will become a disorganised mass of details and complexities. In other words, at a great distance from its empirical source, or after much ‘abstract’ inbreeding, a mathematical subject is in danger of degeneration.”

The quote was unsourced in the post.  It originally appeared in von Neumann's 1947 essay The Mathematician, which is republished here:
Part 1 and Part 2

Recently I was reading Emanuel Derman's book Models.Behaving.Badly.: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life, which includes this passage on Hayek:


Friedrich Hayek, the Austrian economist who received the 1974 Nobel Memorial Prize in Economics, pointed out that in the physical sciences we know the macroscopic through concrete experience and proceed from there to the microscopic by abstraction. [...] In economics, Hayek argued, the order of abstraction should be reversed: we know the individual agents and players from concrete personal experience, and the macroscopic "economy" is the abstraction. If the correct way to proceed is from the concrete to the abstract, he argued, in economics we should begin with agents and proceed to economies and markets rather than vice versa.

These are highly contrasting views!  

Hayek argued that the market economy was not designed; rather, it emerged from the fairly simple actions of actors within the economy (negotiating on price, etc.).  Hayek is essentially saying that we should be able to reproduce the complexity of a given economy if only we can capture an accurate description of all of the 'rules' the actors are obeying.

Looking back on my own intellectual development, I was first drawn to these emergent ideas within the 'bottom-up' branch of Artificial Intelligence.  It was fascinating to set up a simple rule in cellular automata and watch it generate amazing complexity.  Similarly, the basic Darwinian model of evolution can be duplicated in a very short bit of code that, given an objective function, solves problems indirectly via a massive number of randomized trials directed by that function.
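
For example, a bare-bones version of that Darwinian loop, a (1+1)-style evolutionary hill climber over bit strings with "count the ones" as the objective function, fits in a handful of lines:

import random

def evolve(objective, length=40, generations=2000):
    # keep a single parent; flip bits at random and keep the child only if it scores at least as well
    parent = [random.randint(0, 1) for _ in range(length)]
    for _ in range(generations):
        child = [1 - bit if random.random() < 1.0 / length else bit for bit in parent]
        if objective(child) >= objective(parent):
            parent = child
    return parent

best = evolve(sum)  # maximizes the number of 1 bits via randomized trials directed by the objective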

Lately I'm a 'data scientist', which means that I'm attempting to extract models from data.  Common methods attempt to wash out enough complexity in the data to derive a model that is inspectable and understandable by a person.  This is fundamentally grounded in empirics, as the extracted model is tested for accurate prediction against other data.  The inspected models are used mostly for 'storytelling', helping users of the system understand what is being learned from the data... yet we leave specific predictions exclusively to the algorithms.

Of course one can't resolve these abstract differences in a blog post; that's a tall order.

I would, however, assert that every learning algorithm should satisfy two fundamentals:
  1. It accurately predicts outcomes given input data
  2. It emits a model that is inspectable and can inform the building of better 'toy models' of the system under study. 
No one disputes the first, yet many seem all too happy to ignore the second and are OK with building increasingly powerful black boxes with 'big data' machinery.
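
As a small illustration of both fundamentals, an ordinary least-squares fit in NumPy both predicts and emits a model (the coefficients) a person can read directly; the data below is synthetic and only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # two made-up features
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

A = np.c_[X, np.ones(len(X))]                        # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)         # the coefficients ARE the inspectable model
print("weights:", coef[:2], "intercept:", coef[2])

y_hat = A @ coef                                     # the same model makes the testable predictions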

Posted via email from aicoder - nealrichter's blog

Saturday, January 07, 2012

Software pattern for proportional control of QPS in a webservice

Problem Statement

Imagine you are writing a webservice that must call a back-end service, such as a data store. Let's assume (without loss of generality) that the data store (and your hardware supporting it) has some limit in QPS that it can handle. We'd like the client system (your web service) to impose a limit on the QPS to the back-end service. Also assume that this is a distributed webservice: lots of worker threads on lots of different machines.

Requirements

  1. Given a goal in QPS, manage the maximum outgoing requests per second to that goal.
  2. Be fast. Maintain a fast controller settling time when the goal or the query volume changes.
  3. Be adaptive. Respond to swings in incoming requests that need to be queried against the service.
  4. Be distributed. Act locally against global measurements without knowing the number of workers.
  5. Be robust. Handle additions and subtractions of workers/clients to the system without coordination. Minimize overshoot.

Design

Assume that querying this back-end service, while valuable and mostly needed, is optional under duress. It's far more important for your front-end service to be responsive and return some error or 'No Content' than to hang on a busted back-end. As a result we'll use a sampling rate 'r' to denote the percentage of time that the web service should query the back-end. Under normal conditions this rate is 1 (100%). Also assume that the QPS goal for the back-end is set in some configuration area in your system. Under duress the rate r will be adaptively tuned to obey the QPS goal. Also assume you have smartly implemented some monitoring system like Ganglia/Nagios/Cacti and are emitting events to it when you call the back-end service.

Inputs

  • G = Goal QPS
  • M = Current measured QPS (from Ganglia).
  • r = Current sampling rate [0,1]

Outputs

  • r_new = A sampling rate [0,1]

Adaptive Mechanism

  • r_new = r * (G/M)
  • r_new = MAX(0,MIN(1,r_new)) //clamp r_new between [0,1]
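
A minimal sketch of that mechanism in Python; the caller is assumed to fetch G from configuration and M from the monitoring system, and each worker re-runs the update on its own timer against the globally measured M.

import random

def update_sampling_rate(r, goal_qps, measured_qps):
    # proportional update: scale r by the ratio of goal to measured QPS, then clamp to [0, 1]
    if measured_qps <= 0:
        return r                      # no measurement yet; hold the current rate
    return max(0.0, min(1.0, r * (goal_qps / measured_qps)))

def should_query_backend(r):
    # each request queries the back-end with probability r
    return random.random() < r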

Benefits

  • Needs only global G and M as inputs.
  • No-coordination needed between workers/servers other than the globally observed M.
  • Adaptively moves the per-worker sampling rate independently of all other workers' rates.
  • Workers can have different incoming QPS rates from a load balancer; the controller will adapt.

Failure Modes

  • If the sensor for M fails to be updated then the controller is blind
  • If the Goal is set or re-set to zero, then the controller will stop traffic
  • Both of these can be addressed easily.

Desired Response

  • The overshoot/undershoot is called 'ringing'
  • The time to approach the goal is called the 'settling time'

Implementation Notes

  • Emit a ganglia/graphite/XXX counter for both requests sent and skipped
  • Use a smoothed average of the measured QPS to smooth out controller jitter.
  • (Optional) Each worker should do a bit of randomization of when it performs its sampling-rate-update to smooth out any startup/restart ringing.
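
A hedged sketch of the last two notes: an exponentially weighted moving average to damp jitter in M, and a per-worker jittered update period (the 0.3 smoothing factor and one-second period are arbitrary example values).

import random
import time

class SmoothedQps:
    # exponentially weighted moving average of the measured QPS
    def __init__(self, alpha=0.3):
        self.alpha, self.value = alpha, None
    def update(self, measured_qps):
        if self.value is None:
            self.value = measured_qps
        else:
            self.value = self.alpha * measured_qps + (1 - self.alpha) * self.value
        return self.value

def jittered_sleep(period_s=1.0):
    # each worker waits a slightly different interval between rate updates to avoid lockstep ringing
    time.sleep(period_s * random.uniform(0.9, 1.1))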

Invertibility

This design could be inverted for a server implementation. If your webservice has a set of APIs that are heavy to execute, then this controller could be used to control the incoming QPS delegated to the heavy work.

  1. Receive request
  2. Submit to sampling rate
  3. If Yes then delegate the request to the executor
  4. If No then respond to the client with 'HTTP 204 No Content' or an equivalent empty response.
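
A framework-neutral sketch of that flow; SAMPLING_RATE is assumed to be maintained by the same proportional controller described above.

import random

SAMPLING_RATE = 1.0  # updated elsewhere by the proportional controller

def handle_request(request, delegate_to_executor):
    # gate the heavy work behind the sampling rate; shed load with an empty response otherwise
    if random.random() < SAMPLING_RATE:
        return 200, delegate_to_executor(request)
    return 204, None  # 'HTTP 204 No Content' equivalent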

Tuesday, December 27, 2011

Stanford's Introduction to Computational Advertising course

Andrei Broder and Vanja Josifovski, of Yahoo! Research, have wrapped up their Stanford course again this fall.  As always the lecture slides are a great intro to the area.

MS&E 239: Introduction to Computational Advertising

Computational advertising is an emerging new scientific sub-discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine ("sponsored search"), a user reading a web page ("content match" and "display ads"), a user watching a movie on a portable device, and so on. The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. Thus, depending on the definition of "best match" this problem leads to a variety of massive optimization and search problems, with complicated constraints, and challenging data representation and access problems. The solution to these problems provides the scientific and technical foundations for the $20 billion online advertising industry.

This course aims to provide a good introduction to the main algorithmic issues and solutions in computational advertising, as currently applied to building platforms for various online advertising formats. At the same time we intend to briefly survey the economics and marketplace aspects of the industry, as well as some of the research frontiers. The intended audience are students interested in the practical and theoretical aspects of web advertising.

Posted via email from aicoder - nealrichter's blog

Summer Program on Computational Advertising

Deepak Agarwal of Yahoo Research is organizing a summer program on Computational Advertising.  It appears to be geared for grad students.

http://www.samsi.info/programs/summer-program-august-6-17-2012-computational-advertising

This two-week program will run from August 6 to August 17, 2012. The first week will be at the Radisson RTP in the Research Triangle Park, NC. The location is in close proximity to SAMSI. The first three days will be spent on technical presentations by leading researchers and industry experts, to bring everyone up to speed on the currently used methodology. On the fourth day, the participants will self-organize into working groups, each of which will address one of the key problem areas (it is permitted that people join more than one group, and the organizers will try to arrange the working group schedules to facilitate that).  The second week will be spent at SAMSI headquarters in Research Triangle Park.

Posted via email from aicoder - nealrichter's blog

Tuesday, November 15, 2011

What Software Engineers should know about Control Theory

Over the years I've noticed an interesting lack of specific domain knowledge among CS and software people. Other than the few co-workers who majored in Electrical Engineering, almost no one has heard of the field of 'Control Theory'.

Control theory is an interdisciplinary branch of engineering and mathematics, that deals with the behavior of dynamical systems. The desired output of a system is called the reference. When one or more output variables of a system need to follow a certain reference over time, a controller manipulates the inputs to a system to obtain the desired effect on the output of the system.

Let's imagine that you write internet web services for a living: some REST or SOAP APIs that take an input and give an output.

Your boss walks up to you one day and asks for a system that does the following:
  • Create a webservice that calls another (or three) for data/inputs, then does X with them.
  • Meters the usage of the other web services.
  • Your webservice must respond within Y milliseconds with good output or a NULL output.
  • Support high concurrency, i.e., not use too many servers.

The problem is that these other third-party webservices are not your own. What is their response time? Will they give bad data? How should your webservice react to failures of the others?

Does this sound familiar? It should to many. This is the replicated-and-shared-connector problem (MySQL, memcached), the partitioned-services problem (federated search and large-scale search engines), and the API-as-a-service problem (Mashery, etc.).

There are two basic types of controls relevant here:
  • Open Loop, Feed-forward: Requires good model of system inputs and response of the system.
  • Closed Loop, Feed-back: Measures the output and adjusts the input based on the error between the output and the reference.

There are a few core concepts of feedback control:
  • Linear feedback
  • Stability analysis
  • Frequency response
  • Response time
Adaptive schemes include:
  • Gain Scheduling
  • Model Reference Adaptive Systems
  • Self-tuning regulators
  • Dual Control

Here's one survey deck from a lecture. Unfortunately for software engineers, most of the presentations of the above are in linear system form rather than an algorithmic form.
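
As an example of the algorithmic form, a discrete proportional-integral (PI) feedback controller is only a few lines; the gains and setpoint below are made-up example values.

class PIController:
    # discrete PI feedback: output = Kp * error + Ki * accumulated error
    def __init__(self, kp, ki, setpoint):
        self.kp, self.ki, self.setpoint = kp, ki, setpoint
        self.integral = 0.0

    def update(self, measurement, dt):
        error = self.setpoint - measurement   # distance from the reference
        self.integral += error * dt           # accumulate error over time
        return self.kp * error + self.ki * self.integral

controller = PIController(kp=0.5, ki=0.1, setpoint=100.0)   # e.g. target 100 QPS
adjustment = controller.update(measurement=120.0, dt=1.0)   # negative: push the input down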

Dr. Joe L. Hellerstein of Google and co-workers taught a course at the University of Washington in 2008 that was more software focused. He has also written a textbook on it and a few papers.

I'd like to see a 'software patterns' set created for easier use by software engineers. I'll attempt to present a couple common forms as patterns in a future blog post.


Thursday, November 10, 2011

Open RTB panel - IAB Ad Ops Summit 2011

On Monday, November 7th, I was on an IAB Ad Ops panel on OpenRTB.



The clip shows an exchange after Steve from the IAB asked a question about how webpage inventory is described in RTB. I described an example of differentiating a simple commodity, barley.

Two of the major uses of barley in the US are animal feed and malting for making beer. Malting barley has specific requirements in terms of moisture content, protein percentage and other factors. Farmers don't always know what quality their crop will finish at. They count on having two general markets: if the tested quality meets malting standards, the premium over feed prices can be healthy. A 2011 report noted that malting barley provided a 70% premium over feedstock barley. Growing specific varieties and/or using organic farming methods can provide additional premiums over generic feed barley. The curious can follow the links below.

How does this relate to publishers, advertising and OpenRTB? In my opinion we need several things standardized:

1) Inventory registration and description API. Gives publishers influence over how their inventory is exposed in various demand-side and trading-desk platforms. Publishers should fully describe their inventory in a common format. Buy-side GUIs and algorithms will benefit from increased annotation and categorization. This can also harmonize the brand-safety ratings that are not connected between the sell and buy sides.

2) Standardization of the emerging 'Private Marketplace' models in RTB. A set of best practices and trading procedures for PM needs to be defined such that the market can grow properly.

While the main bid request/response API of OpenRTB has been criticized as being 'too late' given the large implementations in production, it is not too late to define standards for the above. These things will help the buy-side better differentiate quality inventory.

Tuesday, July 12, 2011

SchemaMgr - MySQL schema management tool

SchemaMgr is a simple Perl tool to manage schema changes in MySQL databases. I wrote it in 2007/2008 and it has been used in production for many years.

https://github.com/nealrichter/schemamgr

Each change is assigned a version number and placed in a file. When the SQL in the file is executed successfully, a special table is set with that version number. Subsequent runs install only the higher versioned files.

It can also be used to reinstall views and stored procedures.

The best practice is to copy the schemamgr_X.pl script and change X in the filename and in the $DB_NAME variable.

$ ./bin/schemamgr_X.pl
Usage: either create or upgrade X database
schemamgr_X.pl -i -uUSERNAME -pPASSWORD [-vVERSION] [-b]
updates DB to current (default) or requested version
schemamgr_X.pl -s -uUSERNAME -pPASSWORD
reinstalls all stored procedures
schemamgr_X.pl -w -uUSERNAME -pPASSWORD
reinstalls all views
schemamgr_X.pl -q -uUSERNAME -pPASSWORD
Requests and prints current version
Optional Params
-vXX -- upgrades up to a specific version number XX
-b -- backs up the database (with data) before upgrades
-nYY -- runs the upgrades against database YY - default is X
By convention you create ONE create_X_objects_v1 file with a date.
All other files are update files with version numbers greater than v1.
build/
|-- create_X_objects_v1_20110615.sql
|-- update_X_objects_v2_20110701.sql
`-- update_X_objects_v3_20110702.sql

Posted via email from aicoder - nealrichter's blog