Empirical Bayes Estimation of Binomial Proportion using R

Binomial distribution pops up in our problems daily, given that the number of occurrences k of events with probability p in a sequence of size n can be described as

k \sim Binomial(n,p)

Question that naturally arises in this context is - given observations of k and n , how do we estimate p ? One might say that simply computing p = {k \over n} should be enough, since that's both uniformly minimum variance and a maximum likelihood estimator. However, such estimator might be misleading when n is small (as anyone trying to estimate clickthrough rates from small number of impressions can testify).

The question is - what can we do ? One thing that naturally comes to mind is incorporating any prior knowledge about the distribution we might have. A wise choice for prior of binomial distribution is usually Beta distribution, not just because of it's convenience (given that it's conjugate prior), but also because of flexibility in incorporating different distribution shapes

Screen Shot 2013-09-10 at 3.36.46 PM

In this context, we can express our model as:

k_i \sim Binomial(n_i, p_i)

p_i \sim Beta(a, b), i = 1...N

where N is total number of observations and a and b are parameters to be estimated. Such model is also called Empirical Bayes. Unlike traditional Bayes, in which we pull prior distribution and it's parameters out of the thin air, Empirical Bayes estimates prior parameters from the data.

In order to estimate parameters of the prior, we calculate marginal distribution as

m(k|a,b) = \int \prod_{i=1}^{N} f(k_i|p)\pi(p|a,b)dp = \prod_{i=1}^N {n_i \choose k_i} {{\Gamma(a+b)\Gamma(a+k_i)\Gamma(n_i-k_i+b)}\over{\Gamma(a)\Gamma(b)\Gamma(a+b+n_i)}}

where f and \pi are density functions of binomial and beta distributions, respectively. Parameter estimates \hat{a} and \hat{b} can be obtained by maximizing the log likelihood of the marginal distribution.

Finally, Empirical Bayes estimator can be constructed as expectation of posterior distribution:

\hat{p_i} = E(p_i|k_i, \hat{a}, \hat{b}) = {{\hat{a}+k_i}\over{\hat{a} + \hat{b} + n_i}}

Pretty easy. The real question is - how do we do this in practice ? It turns out that there is no off-the-shelf package in R for doing this, so we have built one. It relies pretty heavily on fitdistrplus package and there are certainly a number of things to be improved, but it's a start. You can grab it at our Github repository.

 

In SupplyFrame we started to use Riemann as our stream processing framework to do system and application monitoring. Riemann is a lightweight clojure based DSL operates on event streams. The power of Clojure expressiveness gives it ability to encode the whole event handling logic into one config file. Compare to generic stream framework like Esper or Storm, it is cleaner, simpler, but still highly programmable.
For example, If we want to send the mean data of last 10 minutes to graphite and alert sysops if the median of metric is more than 1000, we can express the idea in riemann like this:

Pretty clean, right?

 

How to install & deploy Riemann

On the homepage of riemann website, there are some prebuilt packages available. You can simply download the deb package and $ sudo dpkg -i riemann.0.2.x.deb to install it. The package will install the configuration file into /etc/riemann/riemann.config and you can use sudo service riemann start/stop/restart/reload to run or to stop it. If you prefer to use it as an instance instead of a system service, you can clone the project and use leiningen to install the dependencies and compile the project. You can either run Riemann in a screen or use nohup to force it to run as a daemon. If you run Riemann as an instance, remember to assign each Riemann with different ports that it listens to.

Riemann DSL structure

Events

Riemann is designed to operate on flows of events sent over protocol buffers. An event is basically a map of these fields (copied from riemann documentation):

host A hostname, e.g. "api1", "foo.com"
service e.g. "API port 8000 reqs/sec"
state Any string less than 255 bytes, e.g. "ok", "warning", "critical"
time The time of the event, in unix epoch seconds
description Freeform text
tags Freeform list of strings, e.g. ["rate", "fooproduct", "transient"]
metric A number associated with this event, e.g. the number of reqs/sec.
ttl A floating-point time, in seconds, that this event is considered valid for. Expired states may be removed from the index.

You can think that the events in Riemann are nouns, and we're going to use verbs, streams and stream functions to process them.

Streams and stream functions

(streams ...) is the magic macro that forms the unique DSL of processing events. The expressions in (streams ...) are treated as rules that process the same event simultaneously; while the nested s-expressions in (streams...) will pipe the results from parent s-expression to child s-expressions.

You can tell from the previous example, (folds/mean ...) and (folds/median ...) are at the same level, which means they're processing the same event. And the event pipeline handling logic is expressed in nested s-expressions.

How to handle an event manually 1

Every event handler is a function that takes an event or list of events as input. If you want to deal with an event on the most nested s-expression, its easy. An anonymous function will do the job:

For more example of the basic usage of Riemann and standard stream processing functions, you can find it at Riemann's HOWTO page.

How to handle an event manually 2

In most of the use cases, Riemann's built-in stream aggregation functions solve the problems. However, in some rare cases you might also want to write nested stream processing function similar to the built-in functions. Here's how you can do it:

Case study

Here's an use case. We want to detect errors and those events which exceeds defined threshold. In most alerting systems, they either notify sysops with all peak events or smoothen the data which is even more error prone. We want something like this: in time period if the ratio of overshoot or error events is more than x, pipe the events to handler functions. Here's how we implemented it in Riemann:

With this function, now we can define our event logic as


Further reading

For common Riemann tasks and functions, the official Riemann's HOWTO page is a great reference. For Clojure's syntax and semantics, there's a lot of resources and tutorials in the web; Full disclojure video series. is a good start to get familiar with it.

This year, few of us visited Open Hardware Summit at MIT and got the chance to show some of our efforts in making datasheet content more accessible and fun to use. Project is currently hosted at beta.datasheet.net and is written by Ben Delarre, a "man who hates datasheets" (and is committed to making them better). We got great feedback from a lot of participants and are heading back with a plate full of bugs and features we need to fix/implement over the next month. Soon we'll start handing out invites and sending a call for early testers. Stay tuned !

beta.datasheet.net

For a great overview of some of the OHS demos, check our friend Eric Evenchick's post on Hackaday.

We're proud to announce that this year we're sponsoring the Open source hardware summit! This is a new move for us as we haven't sponsored an event like this before, but since we're getting more involved in the Open Hardware scene we thought it would be a great place to go and meet more members of the community and find out whats going on. Myself and a couple of others will be attending the event which starts on September 6th in Boston, MA.

I'm really looking forward to getting my grubby paws on one of these awesome conference badges and hacking up some fun projects while we're there.

BADGEr BADGEr BADGEr!

Check out the other sponsors of the event, and get yourself some tickets before they run out!

 

For the past couple of months I've been working on an art project for a certain dusty art festival this year. We're building an 18ft diameter geodesic dome with hundreds of LEDs lighting up a cover to make lots of triangles of color. Inside we'll be having a 3 foot wide multi-touch dome surface that participants can interact with to play with the animations being shown all around them. We're also having people from around the world write animations in JavaScript to run on the dome.

Because of some miss communication with our suppliers in China we didn't get the LEDs made up the way we needed them to be. So we now need to make up lots of small wires, and connect everything up with screw terminals. This is obviously a major undertaking and I really needed some help. So on Friday we decided to run a little build afternoon in our Hacklab. We got some food in and began the long and arduous process of cutting, stripping and tinning hundreds of wires. Nearly everyone pitched in at some point throughout the afternoon and we managed to knock out almost 1/3rd of the parts we needed.

I have to say a big thanks to everyone who pitched in, amazing stuff!

This is what we're making the parts for

This is what we're making the parts for

We've got a huge reel of cable...

We've got a huge reel of cable...

Snip snip, cut cut

Snip snip, cut cut

Ryan's a tinning master now...

Ryan's a tinning master now...

Jasmine keeping a watchful eye...

Jasmine keeping a watchful eye...

Adam, Jasmine, Felix and Ryan

Adam, Jasmine, Felix and Ryan

Xinkai, Jasmine, Minh and Chungwei having a good time...

Xinkai, Jasmine, Minh and Chungwei having a good time...

Stringing up the terminal blocks

Stringing up the terminal blocks

Lots of terminal blocks...

Lots of terminal blocks...

The fruits of our labour!

The fruits of our labour!

We recently got a Makerbot Replicator Dual, we've had a lot of fun printing little chotskies and cases for raspberry pi and arduino unos. But we thought it was time we did something a little bit more involved. One thing we printed early on was this fantastic 3d model of the classic Joy Division Unknown Pleasures cover. This is right up our street, and it inspired us to see what else could be represented this way....enter Physical Analytics!

We dumped a month worth of our server response time data from our monitoring system and wrote a small nodejs script to massage the data. We then ran the data through an OpenSCAD program to produce this 3d repesentation of our server stats!

Physical Analytics....

Physical Analytics....one month of server response times at 15minute samples.

Here's a little time-lapse of the printer in action:

We're thinking of other ways we can represent our statistics in 3d form now, there's a variety of ideas being bounced around like making 'slides' for projectors which show different stats depending on the angle that you hold the slide, or globes that show our regional data mapped onto the countries that it relates to. Does having a 3-dimensional view of your data make it more visceral, more real? Let us know what you think.

Functional programming is becoming more and more popular these days, especially given the use of multi-core and distributed architectures. A purely functional data structure is thread safe in nature and it can be easily shared across threads or used for low-cost large scale parallel programming. And Clojure is one of the functional languages that really shines.

Different programming styles have different objectives; Clojure focuses on solving practical problems. Programmers can easily switch between strict and lazy evaluation, immutable data structures and mutable ones. Because there's no silver bullet to solve parallel programming problems, a luxury that Clojure programmer has, is that he can combine all the goodies from different parts of the world and make great use of it. In order to demonstrate the flexibility of Clojure, we'll build a persistent queue as both strict data structure and in lazy evaluated flavor.

Strict Persistent Queue

A strict persistent queue is formed by two list: front and rear; the element is enqueued at the rear, and dequeued at the front. After doing the enqueue/dequeue operation, the queue has to balance its internal data structure; If the front list is empty, reverse the rear list and append it to the front.

(defn queue-balance [[f r :as this]]
(if (empty? f)
[(reverse r) '()]
this))
(defn q-head [[[x & f] _]] x)
(defn q-tail [[[x & f] r]] (queue-balance [f r]))
(defn q-snoc [[f r] x] (queue-balance [f (cons x r)]))

We can try this on REPL:

user> (reduce q-snoc [nil nil] (range 1 10))
[(1) (9 8 7 6 5 4 3 2)]
user> (q-head (reduce q-snoc [nil nil] (range 1 10)))
1
user> (q-tail (reduce q-snoc [nil nil] (range 1 10)))
[(2 3 4 5 6 7 8 9) ()] ;; balanced

However, what if other thread also want to read (q-tail ...)? An unnecessary calculation on (queue-balance .. is a waste. No worries, given that this is a purely functional data structure, all the returned results only depend on the input, and we can cache the input-output map in a memoize function:

(def queue-balance
(memoize (fn [[f r :as this]]
(println this) ;; print if not cached
(if (empty? f)
[(reverse r) '()]
this))))

user> (q-tail (reduce q-snoc [nil nil] (range 1 9)))
[nil (1)]
[(1) (2)]
[(1) (3 2)]
[(1) (4 3 2)]
[(1) (5 4 3 2)]
[(1) (6 5 4 3 2)]
[(1) (7 6 5 4 3 2)]
[(1) (8 7 6 5 4 3 2)]
[nil (8 7 6 5 4 3 2)]
[(2 3 4 5 6 7 8) ()]
user> (q-tail (reduce q-snoc [nil nil] (range 1 9)))
[(2 3 4 5 6 7 8) ()]

Lazy Persistent Queue

Clojure code evaluation and data structures (lists, persistent array, persistent map, etc.) are strict, but most of the utility functions return lazy sequences. Here we are going to use two of these functions lazy-cat and rest. lazy-cat takes two or more sequences, yields a lazy sequence, and returns the concatenation of the sequences. rest take a sequence and returns the lazy sequence of the items after the first.

Instead of reversing the rear sequence and appending to the front when the front list is empty, we can reverse the list when rear list count is greater than one in the front. Thus the peak calculation of reverse is separated to parts. Here is the implementation:

(def lazy-queue
(memoize (fn [[f len-f r len-r :as q]]
(println q)
(if (<= len-r len-f) q
[(lazy-cat f (reverse r)) (+ len-f len-r) '() 0]))))
(defn lazy-snoc [[f len-f r len-r :as q] x]
(lazy-queue [f len-f (cons x r) (inc len-r)]))
(defn lazy-head [[f & _]] (first f))
(defn lazy-tail [[f len-f r len-r]]
(lazy-queue [(rest f) (dec len-f) r len-r]))

Applying memoize to lazily evaluated version is also trivial, just use the function to wrap the actual body. The calculation would be something like this:

user> (def q (reduce lazy-snoc [nil 0 nil 0] (range 15)))
[nil 0 (0) 1]
[(0) 1 (1) 1]
[(0) 1 (2 1) 2]
[(0 1 2) 3 (3) 1]
[(0 1 2) 3 (4 3) 2]
[(0 1 2) 3 (5 4 3) 3]
[(0 1 2) 3 (6 5 4 3) 4]
[(0 1 2 3 4 5 6) 7 (7) 1]
[(0 1 2 3 4 5 6) 7 (8 7) 2]
[(0 1 2 3 4 5 6) 7 (9 8 7) 3]
[(0 1 2 3 4 5 6) 7 (10 9 8 7) 4]
[(0 1 2 3 4 5 6) 7 (11 10 9 8 7) 5]
[(0 1 2 3 4 5 6) 7 (12 11 10 9 8 7) 6]
[(0 1 2 3 4 5 6) 7 (13 12 11 10 9 8 7) 7]
[(0 1 2 3 4 5 6) 7 (14 13 12 11 10 9 8 7) 8]
#'user/q
user> (def q (reduce lazy-snoc [nil 0 nil 0] (range 15)))
#'user/q
user> (lazy-head q) ;; cached
0
user> (lazy-head (lazy-tail q))
[(1 2 3 4 5 6 7 8 9 10 11 12 13 14) 14 () 0]
user> (lazy-head (lazy-tail q)) ;; cached
1

Although Clojure itself has a Persistent Queue, implemented in Java, the idea is the same. Functional structures can not be only applied to inter-thread or inter-process data sharing, but also to distributed data architectures (especially interesting challenges are in dealing with large-scale data stores). For getting deeper into the subject, we recommend reading "Purely Functional Data Structures" by Chris Okasaki.

After graduating from Soldering 101, I quickly became addicted to the smell of solder — which is kind of like a mix of wet newspaper and wet dog. Given that explanation the addiction makes little sense, but those of you who are likewise afflicted will understand.

The business end

Input/Output

Anyway, I fired up another DIY audio kit though this one is quite a bit less complicated. It's an 8-channel passive summing mixer from a little outfit in Florida called Sofiteque and it's very simple. Eight channels of audio come in where their amplitude is reduced via some resistors before combining into two channels, left and right. Viola!

The basic idea of analog summing is that doing the final mixdown to stereo in the analog realm yields more ear-pleasing results than having the computer "crunch the numbers", so to speak. In any case, this is all the rage in the audio engineering community and the latest form of backlash against perceived harshness of digital audio. Worth a try, right?

Anyhoo ... Here are a couple pics of the construction of this kit. No "hacks" on this one just yet. At some point I may cobble two kits together for 16 channels of summing, but for the moment I'll let the soldering iron cool, having satisfied my flux fix for now.

The kit

The Sofiteque kit

Pretty simple PCB

Pretty simple PCB

The guts

The guts

Link:  http://www.sofiteque.com/shop/pb8t-passive-summing-mixer-kit/

So, I'm the guy from Marketing. I've never heard of a film capacitor (is it anything like a flux capacitor?) and to me "resistance" is the thing that keeps shoppers from clicking "Add to Cart".

But Ben was kind enough to offer a course in soldering that even I could understand. After going over the basics, he bravely handed me the keys to a 800°F instrument of death soldering iron. And I was off!

Ok, so I'm not as technically inept as I led you to believe. I'm actually an engineer of sorts – audio engineer to be exact. And I wasn't learning to solder for my health. I wanted to build a new piece of gear for my home studio – the Seventh Circle Audio B16 VCA Compressor, to be exact.

SCA B16 VCA Compressor Kit

Ain't she pretty?

While I dig this piece, I actually think it can be improved ... which is why it's a great project for the Hacklab. We're going to modify the design to include a sub-board that runs an LED gain reduction meter on the front panel. We'll be using this design note from THAT, maker of the VCA chip. Stay tuned for more on this project.

Thankfully soldering isn't rocket science, and I'm picking it up pretty quickly. Ben showed me what a good solder joint looks like, and mine are getting better with each component. The keys for me so far are 1) converging the iron tip and the solder as close to the board as possible, and 2) using just enough solder.

Building my own audio gear will save tons of money in my studio. Plus, there's the satisfaction of recording and mixing on gear that I built. You too can SAVE MONEY and BE HAPPIER with soldering lessons from the SupplyFrame Hacklab! Sorry, marketing reflexes.

Here are some pics of my progress:

Getting started at our humble workstation

Getting started at our humble workstation

Going 3D

Going 3D

OK so they're not exactly "straight".

OK so they're not exactly "straight".

Always room for improvement ......

I'll be back with an update on the B16 mod in the weeks to come.

One of very interesting and potentially quite useful HTML5 APIs is the Web Speech API. At the moment there are some disagreements regarding actual implementation and only Google Chrome supports it in full, but we have decided to do go ahead and build a small test app anyway. You can find the source at:

 https://github.com/SupplyFrame/rvc

Open the demo page at http://supplyframe.github.io/rvc/ allow microphone usage and start speaking. Commands are basic: up, down, left, right. Hopefully, you should get the ball to move 🙂

However, the speech recognition doesn't have that much of an accuracy. In order to address that - we compare the recognized text with each of available commands using Levenshtein distance. By doing this and increasing distance threshold, we're able to significantly increase the command recognition accuracy. Try moving the slider in order to change the threshold on the demo page and see what happens. Have fun !

For additional reference on Web Speech API, check out:

https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

http://updates.html5rocks.com/2013/01/Voice-Driven-Web-Apps-Introduction-to-the-Web-Speech-API