## How To Identify Patterns in Time Series Data: Part I - Discrete Fourier Transform

In many real-world applications,  "Signals" are typically recorded and represented as time dependent functions. For example, we can record the daily trading price in the stock market, or measuring the hourly temperature in Los Angeles. In order to extract meaningful characteristics of the data, many different transforms have developed to decompose the data into simpler pieces.

The question is: How to design a good transform on the signal we want to analyze? It becomes a classical problem in signal representation- We want to define a transform which provides a "sparse" representation of the signal which capture most or all information of a signal. "Sparseness" is one of the reasons for the extensive use of popular transforms, because they discover the structure of the singal and provide a "compact" representation. In this post, we provide an example that how to analyze the web traffic by Discrete Fourier Transform (DFT).

Website traffic

In every hour, we record the total number of user actions in our website and the data is shown in figure 1. This signal seems to be periodic, but the question is: How could we explore the structure? DFT provides a nice tool to represent the discrete time signal into periodic Fourier series. The original signal $x(t_n)$ can be represented by

\label{eq:1}
x(t_n) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} X_k \cdot e^{i 2 \pi k n / N},

where the coefficients of Fourier series $X_k$ are defined as

\label{eq:2}
X_k = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(t_n) \cdot e^{-i 2 \pi k n / N}, \quad k\in\mathbb{Z}\

After DFT, we can analyze the energy in each frequency component. The frequency component is given by $f_k = \frac{k}{N~T}$ , where $T$ is the sampling frequency. We plot the power spectrum of the data $|X(f_k)|$ in figure 2. The result shows the web traffic is mostly on low-frequency components.

Energy Spectrum

Let's look at different frequency components of the data: We compare different frequency components with the original data: the original signal is shown in figure 3(a), the components with frequency $|f| < 0.46$ is in figure 3(b), and components with frequency $0.46 \leq |f| < 1.40$ in figure 3(c). Note that the traffic in weekdays is much higher than weekend, and the first components successfully capture the traffic change between weekdays and weekend. Also, the traffic is much higher for working hours than the rest periods, which can be clearly identified for the daily change in the second components.

Signal Components

Based on sparse representation, we can select significant'' components for many different applications, i.e, signal de-noising and or pattern recognition. We will introduce the wavelet transform in the next blog.

## Caveat emptor : Amazon EC2 Reserved Instance Marketplace

AWS EC2 Reserved Instances sound like a great idea. You spend a lot with EC2, keep instances running for a long time and have predictable workloads throughout the year ? Reserved Instances to the rescue ! You pay a flat reservation fee upfront and reap the benefit of reduced computing cost over time.

All sounds great and makes perfect sense, until the day comes when you realize that the original forecast was off. For whatever reason. Perhaps you have improved the performance of your infrastructure and don't need as much resources. Unused reservations start piling up and you start bleeding money. Naively you say "not a problem ! we'll just sell these instances on EC2 Reserved Instance Marketplace and get our money back !". Well, that is where this story begins ...

There are two main problems with this kind of reasoning. First one is being able to accurately forecast future demand for resources from very limited historical information. EC2 reservations are usually sold at 1 and 3 year terms, while EC2 has really only been around for 6-7 years. And let's be honest, most of infrastructure you're trying to buy reserved instances for haven't been on AWS for more than couple of years. So, at best, you're really trying to forecast 1-3 year demand from 1-3 year worth of data ? Maybe you have a solid business, good plan that you won't deviate from and a strong reason to believe that engineering team won't be able to make things faster/cheaper anytime soon. But maybe you should think twice.

The second and more severe problem is liquidity risk you're running when dealing with EC2 Reserved Instance Marketplace. When people hear the world "marketplace" they imagine zillions of transactions a second, hordes of buyers, sellers and arbitrage-opportunity-seekers clearing the market in a blink of an eye, forces of supply and demand ensuring everyone gets the right price. But there is a huge difference between just saying something is a 'marketplace' and actually running a liquid one. In an inefficient market, it's very easy to end up on the short end ... and these reservations are not cheap.

RI Marketplace problems are not limited to simply not having enough buyers and sellers around. Market is highly fragmented - you can only buy and sell instances within the same region and can't really convert instance types. It is really a ecosystem of region-specific marketplaces, all of which highly illiquid, with each unique (region, instance type, duration, offering type) tuple representing a different commodity. There is no market maker either. You would expect that that Amazon might want to buy back your unused reservations at a discounted price, but that's not an option, at least not for now. After all, why give someone a fraction of their money back when you can keep all of it. To be fair, Amazon is making some efforts to 'improve' things - until recently you were only able to buy and sell instances within particular (region, availability zone) and now it's only (region) but the whole thing is still a long way from a real marketplace you can rely on.

So we went ahead and collected some data on EC2 Reserved Instance Markeplace activity over the last two months in order to get at least a bit more insight into how good or bad things really are.

## Making of Datasheet.net Demo Video

No project launch is complete without a making-of video. So here it goes ...

## Datasheet.net is live!

Our latest product, Datasheet.net, is now finally available for anyone to use. After a few months in private beta we feel ready to let the world see what we've been playing with. Here's a quote from the going live post on the Datasheet.net blog.

We think datasheets suck. Plain and simple this is a format that hasn’t changed since the invention of the PDF. We’re trying to change that, trying to make datasheets a ‘live’ document, something that actually helps make our lives as engineers better. In the long term this means us working closely with manufacturers to help improve the datasheet format, we’re not sure where this leads yet but its something we’re working towards.

But changing the way manufacturers work is a long and slow road, so in the meantime as a first step we’ve put together some useful features on top of datasheets as they currently stand.

We'll be promoting the project more over the coming weeks, and hopefully extending it to support all manner of new features and cool things. But for now signup to Datasheet.net to start making datasheets a little bit better!

## Supplyframe at BigDataCamp LA 2013

If you have been following LA tech scene lately - you might have noticed a lot of activity around Hadoop ecosystem & related projects.  While it's easy to discredit this as just a reflection of omnipresent buzz around "Big Data" and "Data Science" one can argue that in this environment - it's much more than just that. Area's traditional focus on "hard science", both in industry and academia seems to have given the community quite an edge - when talking Data Science there is a lot more appreciation for actual research topics (rather than just black box tools) and when talking Distributed Systems - a huge passion for 'exotic' subjects such as Haskell and Clojure.

Big Data Camp LA 2013 will be the key event of the season bringing different sides of this community together. SupplyFrame is one of the sponsors and a bunch of our engineers will be attending. We have our fair share of challenges, experiences and ideas in this space and if you're the same way, we would love to chat ! That's what unconferences are for.

See you at #BigDataCampLA !

## Anatomy of a hack

Last couple of days we have been goofing around with different design ideas for new supplyframe t-shirt and here is what we finally came up with

Hope the great W. Shockley will forgive us for canibalising such an awesome hack

## A Visit to Shanghai Hackerspace

On our recent trip to China, we were lucky enough to get the chance to visit one of local Hackerspaces in Shanghai. The place is called Xinchejian and it looks like something straight off William Gibson novel. Tons of cardboard boxes, tiny robots scattered all over the place, art pieces, 3d printers, hydroponics hacks (?) ... The kind of organized chaos all of us happily relate with. We met with David Li, a foreman of the Hackerspace who introduced us to the local scene and challenges of running this kind of environment in the place where passtime is not a commodity.

A factor that makes all the difference here is easy access to serious production-grade fabrication resources. The limit seems to be only in creativity and novelty of ideas (and ability to get some bootstrap funding). Everything else is easy. And that is exactly what this kind of place is all about - enabling creative people with tools for expressing their ideas.  David had a great analogy with BSD and Linux communities in the 90ies - while BSD community was somewhat elitist and not really welcome to noobs, Linux crowd always had patience for even the most boring and repetitive questions. And Linux eventually enabled so much change, while BSD for the most part ended up on the shelf. It seems that the same holds nowadays in terms of the gap between "true" Electrical Engineers and the new hacker/maker crowd.

That is why David tries to attract as many people from diverse areas. A beautiful example is the one of a girl who was a professional piano player with no knowledge of electronics whatsoever. She joined the Hackerspace and within couple of months ended up picking up the necessary skills and building a spectacular interactive "singing tree" installation art piece (for which she also composed an original score). And that is what this 'hardware revolution' is all about ...

Sadly, our time was limited but next time around we'll try to hang out more and participate in the community. If you happen to find yourself around Shanghai, by all means don't forget to visit Xinchejian !

## Empirical Bayes Estimation of Binomial Proportion using R

Binomial distribution pops up in our problems daily, given that the number of occurrences $k$ of events with probability $p$ in a sequence of size $n$ can be described as

$k \sim Binomial(n,p)$

Question that naturally arises in this context is - given observations of $k$ and $n$ , how do we estimate $p$ ? One might say that simply computing $p = {k \over n}$ should be enough, since that's both uniformly minimum variance and a maximum likelihood estimator. However, such estimator might be misleading when $n$ is small (as anyone trying to estimate clickthrough rates from small number of impressions can testify).

The question is - what can we do ? One thing that naturally comes to mind is incorporating any prior knowledge about the distribution we might have. A wise choice for prior of binomial distribution is usually Beta distribution, not just because of it's convenience (given that it's conjugate prior), but also because of flexibility in incorporating different distribution shapes

In this context, we can express our model as:

$k_i \sim Binomial(n_i, p_i)$

$p_i \sim Beta(a, b), i = 1...N$

where $N$ is total number of observations and $a$ and $b$ are parameters to be estimated. Such model is also called Empirical Bayes. Unlike traditional Bayes, in which we pull prior distribution and it's parameters out of the thin air, Empirical Bayes estimates prior parameters from the data.

In order to estimate parameters of the prior, we calculate marginal distribution as

$m(k|a,b) = \int \prod_{i=1}^{N} f(k_i|p)\pi(p|a,b)dp = \prod_{i=1}^N {n_i \choose k_i} {{\Gamma(a+b)\Gamma(a+k_i)\Gamma(n_i-k_i+b)}\over{\Gamma(a)\Gamma(b)\Gamma(a+b+n_i)}}$

where $f$ and $\pi$ are density functions of binomial and beta distributions, respectively. Parameter estimates $\hat{a}$ and $\hat{b}$ can be obtained by maximizing the log likelihood of the marginal distribution.

Finally, Empirical Bayes estimator can be constructed as expectation of posterior distribution:

$\hat{p_i} = E(p_i|k_i, \hat{a}, \hat{b}) = {{\hat{a}+k_i}\over{\hat{a} + \hat{b} + n_i}}$

Pretty easy. The real question is - how do we do this in practice ? It turns out that there is no off-the-shelf package in R for doing this, so we have built one. It relies pretty heavily on fitdistrplus package and there are certainly a number of things to be improved, but it's a start. You can grab it at our Github repository.

## Getting started with Riemann

In SupplyFrame we started to use Riemann as our stream processing framework to do system and application monitoring. Riemann is a lightweight clojure based DSL operates on event streams. The power of Clojure expressiveness gives it ability to encode the whole event handling logic into one config file. Compare to generic stream framework like Esper or Storm, it is cleaner, simpler, but still highly programmable.
For example, If we want to send the mean data of last 10 minutes to graphite and alert sysops if the median of metric is more than 1000, we can express the idea in riemann like this:

Pretty clean, right?

### How to install & deploy Riemann

On the homepage of riemann website, there are some prebuilt packages available. You can simply download the deb package and \$ sudo dpkg -i riemann.0.2.x.deb to install it. The package will install the configuration file into /etc/riemann/riemann.config and you can use sudo service riemann start/stop/restart/reload to run or to stop it. If you prefer to use it as an instance instead of a system service, you can clone the project and use leiningen to install the dependencies and compile the project. You can either run Riemann in a screen or use nohup to force it to run as a daemon. If you run Riemann as an instance, remember to assign each Riemann with different ports that it listens to.

### Riemann DSL structure

#### Events

Riemann is designed to operate on flows of events sent over protocol buffers. An event is basically a map of these fields (copied from riemann documentation):

host A hostname, e.g. "api1", "foo.com" e.g. "API port 8000 reqs/sec" Any string less than 255 bytes, e.g. "ok", "warning", "critical" The time of the event, in unix epoch seconds Freeform text Freeform list of strings, e.g. ["rate", "fooproduct", "transient"] A number associated with this event, e.g. the number of reqs/sec. A floating-point time, in seconds, that this event is considered valid for. Expired states may be removed from the index.

You can think that the events in Riemann are nouns, and we're going to use verbs, streams and stream functions to process them.

#### Streams and stream functions

(streams ...) is the magic macro that forms the unique DSL of processing events. The expressions in (streams ...) are treated as rules that process the same event simultaneously; while the nested s-expressions in (streams...) will pipe the results from parent s-expression to child s-expressions.

You can tell from the previous example, (folds/mean ...) and (folds/median ...) are at the same level, which means they're processing the same event. And the event pipeline handling logic is expressed in nested s-expressions.

#### How to handle an event manually 1

Every event handler is a function that takes an event or list of events as input. If you want to deal with an event on the most nested s-expression, its easy. An anonymous function will do the job:

For more example of the basic usage of Riemann and standard stream processing functions, you can find it at Riemann's HOWTO page.

#### How to handle an event manually 2

In most of the use cases, Riemann's built-in stream aggregation functions solve the problems. However, in some rare cases you might also want to write nested stream processing function similar to the built-in functions. Here's how you can do it:

### Case study

Here's an use case. We want to detect errors and those events which exceeds defined threshold. In most alerting systems, they either notify sysops with all peak events or smoothen the data which is even more error prone. We want something like this: in time period if the ratio of overshoot or error events is more than x, pipe the events to handler functions. Here's how we implemented it in Riemann:

With this function, now we can define our event logic as