1 Sep 2010

Pushing Hadoop In The Life Sciences

At the start of this year, I began a personal/professional journey to become a catalyst, a voice, an advocate for big data processing in the life sciences.  Speaking with peers outside the domain, most recognized that the volumes of data they were (or would be) consuming were growing at alarming rates.  Informaticians (specifically those in the life sciences), in addition to wrestling with this rapidly expanding data footprint, often mentioned the challenges of working with data sourced from highly diverse, often dynamic, and incredibly interconnected and intertwined sources.  Regardless of industry, it's clear: no company operating at any significant scale is immune to the challenges of consuming, analyzing, and making sense of the massive amounts of data we're amassing.

As a logical starting point, I've invested and continue to invest my energies in educating the informatics community at my day job.  The focus, at this stage, is on tooling, the bulk of which is Hadoop and the ecosystem building up around it.  I've presented Hadoop to a variety of audiences: operations engineers, bioinformaticians, drug designers, modelers.  Each group brings a unique set of constraints and challenges to overcome, most of which Hadoop is well suited for.  Earlier this week, I took the roadshow to the masses and presented the topic more broadly to a large and diverse audience composed of developers, IT, corporate engineers, directors, operations, analysts, business heads, data junkies, savvy scientists, and informaticians.  What I presented follows:

Two quick observations I recognized while preparing for the talk:

  1. The Hadoop ecosystem is growing in all directions so rapidly it's becoming quite the challenge to talk about the landscape in any meaningful way without dedicating a substantial amount of time to it. My talk was an hour and a half and I talked non-stop; all I could do in that time was scratch surfaces.
  2. It's dense material and thus difficult to present to a wide audience.  If you're presenting Hadoop, you're touching on topics (some more closely than others) that require a bit of understanding which may not be part of the standard knowledge base of an average developer or analyst: distributed systems, parallel processing, file systems, elements of functional programming, algorithms.

 

27 Jul 2010

Random Thoughts On Rob Pike's OSCON Keynote

I just finished watching Rob Pike's presentation at OSCON entitled "Public Static Void" and his talk resonated with me on multiple levels.

First, I disagree with his (and for that matter Norvig's) thoughts on patterns being a symptom of a language's weakness.  I understand the point he's attempting to make and while I concur a subset of patterns may serve as plugs (especially those tied to a particular language, for example, the prevalence of patterns in C++ as evidence of its lack of expressiveness), the vast majority are language agnostic, addressing common problems and not a language's weaknesses.

Second, his point about industrial programming is spot on. In fact, his choice of words helped me to clarify one of the points I tried to articulate in my comments on corporate programmers. Re-reading my post and listening to Rob, I think a more appropriate label would be industrial programmer.

Lastly, I'm beginning to feel this renaissance in language design is just as much about the rise of polyglot programming as it is about the languages themselves.  In the same way the NoSQL movement is re-shaping the data storage landscape so that relational and non-relational systems coexist within a given application, the polyglot programming movement is prompting organizations to rethink the languages they embrace to efficiently and effectively get a job done.

24 Jun 2010

Notes From NoSQL Boston

In general, NoSQL Boston was a solid event. Lots of interesting and vocal hackers, geeks, and companies attended. The vibe was friendly (with the exception of the HBase-Hypertable melee; who doesn't love a good debate?). Overall I think 10gen put on a good show.

Keynote: Crossroads, Inroads, Pitfalls & Bylaws: Peering into NoSQL's Conceivable Future

The key takeaway from Tim was that the community needs to do a better job of educating others: on the movement, on the different data stores, on architectures, etc. I couldn't agree with him more. Working in an industry extremely cautious of new technologies (pharmaceuticals and the life sciences), the biggest impediment is often a lack of knowledge, with lack of tooling and vendors to support said technologies being a close second. The former is something that is in our hands and impacts the latter; where there is demand, companies will rise to fulfill the need.

Panel 1: Scaling With NoSQL

Bradford did a solid job moderating.  Mark Atwood was a bit of a curmudgeon but I feel the root of that was more the makeup of the panel and less him. Once we veered off the scaling topic, we were left with two key-value stores and three column based stores, which wasn't very conducive to a cohesive discussion amongst the panelists. Nonetheless, a few very insightful nuggets came out. Mark's statement that "Memcached should be integrated into all nosql stores" was awesome. As was Bradford's question regarding operations and the general answer from the panel of it being engineer (versus administrator) driven; a by-product I suspect of the developer being "closer" to the data. Going back to the makeup of the panel, what might have worked better is if the panel was split in two. One focused on key-value stores, perhaps discussing common use cases given their ease of deployment, with representatives from at least Memcached, Project Voldemort, and Tokyo Cabinet (surprised by their lack of representation). A second focused on large column stores and the scaling side of the equation (where it's a much more interesting question) with the HBase, Cassandra, and Hypertable representatives.

Panel 2: NoSQL In The Cloud

Adam's panel was the highlight of the day. The diversity and spirit of the speakers was phenomenal. Most interesting for me was first, the debate with regards to whether the community ought to be wrapping these stores in an ORM like model to avoid lock-in, and second, the contrarian position John took on what data may not belong in the cloud. With regards to the first, Jonathon's remark "I see application database independence as an anti-pattern" was killer and a perspective I'd never considered. The reference to the Vietnam of Computer Science was also poignant (and spawned some interesting Twitter conversations after the fact). Not sure where I stand on the ORM debate. I get Jonathon's point and completely agree with the concept of limiting yourself with abstraction. However, for the masses, the abstraction that the ORM does provide works and works well, easing adoption. With regards to the second, John shared an interesting post that delves a bit deeper on the topic. Regardless of opinion, both points were thought provoking.

Lightning Talks

Good stuff, albeit a bit off topic at times (Jim Wilson's talk was fantastic but it pushed the limit on what I'd consider relevant content). I would have liked to have seen more talks at this intimate level, though I understand the limitations of a one day event.

Panel 3: Schema Design With Document-Oriented Databases

While the panel itself was not superbly moderated (Durran seemed emotionless which is not what I expected in the least), the content was solid and the panelists well spoken, diverse, and on point. At this stage of the NoSQL movement, definitions are so important; hence having the distinction between document and key value stores made right off the bat was helpful, even if the panelists were not in complete accord. The short of it: Riak, Mongo, and CouchDB approach storage slightly differently, each with pros and cons, each with their own way of accomplishing common tasks. I especially loved the comments about using the right tool for the job (in response to a question about building an inverted index off a document-oriented database).

Panel 4: The Evolution of the Graph Data Structure from Research to Production

While modeling data in graph form is incredibly intuitive and natural, the storage end of the equation still feels academic. The panel title suggested the focus would be on real world implementations. Unfortunately, the moderator and panelists focused elsewhere. No question about it though, this is certainly a space to watch in the coming years.

My Guess What The Landscape Will Look Like 12-18 Months Out

  • Cassandra is poised to be the front runner in the distributed column storage space (with HBase a close second). Maybe it's the big name companies adopting it (Facebook, Twitter, Digg, Reddit, Rackspace) but it seems to be grabbing the most attention right now. Lots of activity both on the core code base and, as important if not more in terms of adoption, the tooling around it; lots of open source libraries being released into the wild. I would not be surprised to see Rackspace offer Cassandra as a service. It seems like a good fit and opportunity for them. If not Rackspace, then a 3rd party altogether in the same vein as MongoHQ and Cloudant offering Mongo and CouchDB services respectively.
  • Mongo is maturing very rapidly. 10gen is doing a fantastic job advocating. The Ruby community is embracing it wholeheartedly. I would not be surprised to see it become the forerunner in the document-oriented database world if it isn't already.
  • 2010 will be a very good year for document-oriented databases. The technology is mature today and will only get better. Installation and adoption costs are minimal as compared with other NoSQL stores. Hosted services are emerging (Cloudant, MongoHQ) lowering the barrier to entry as well.
  • Debate aside, I suspect we will see numerous ORMs surface around a lot of these stores. The Hashrocket guys showcased MongoDoc which has an ActiveRecord feel. I'm sure we'll see at least one for Cassandra open sourced from Digg or the like.

24 Jun 2010

New Rules For Corporate Programmers

First and foremost, what do I mean by the term corporate programmer?  In the original tweet from which this post stems, I used the term corporate engineer; that tweet spurred some replies (thanks Brian) which got me thinking and ultimately helped me refine my point.

Initially I thought the label corporate was the inaccurate point for me and quickly wrote it off as incorrect. My intent was not to imply a particular language (e.g. .NET or Java) nor was it to imply a certain type of company (e.g. Fortune 500).  Having sat with it a bit more, it's less the word corporate and more the word engineer that was incorrect.  By my definition, an engineer is a much broader role than a programmer, potentially encompassing design, development, and architectural responsibilities. Really what I was referring to in my original tweet was a programmer.  And frankly, I think the corporate label can still apply.

A corporate programmer is a programmer who...

...typically (though not always) works in a large company or on a large development team.
...is familiar with a single programming language.
...may often be certified in said language.
...rarely looks beyond said language for tackling a problem.
...rarely looks beyond a common set of tools for tackling a problem.
...often plays it safe.
...fears the unknown.

You can argue that the size of a company or development team has nothing to do with the other points and while I understand your argument, in my eyes it does have at least some relevance. One of the biggest assets a small company or team has is agility.  The biggest enabler a programmer has to foster that agility is a broad knowledge base, including the tools and languages (beyond their core competencies) that get a job done.  A complacent, uninitiated programmer can get by in a large company or on a large team; typically that is not the case in an environment where there are fewer places to hide.  I stress the word typically as I've personally witnessed poor programmers survive in small companies; not sure what that says about the company but that's beside the point.

So, definition out of the way, here are some rules that I feel "corporate" programmers ought to embrace and moreover, aspects I will strive to seek in programmers where I am involved in the hiring process:

  • Become familiar with a language other than your core language.  I'm not suggesting you become a 'foo' developer if you're already a 'bar' developer.  What I am suggesting is you learn 'foo' because it will make you better at 'bar'.  If you're a Java Programmer, take a look at JRuby or Scala; you're still on the JVM!  If you're a .NET Programmer, try IronPython or F#; you're still on the CLR!  If your craft is programming, you need to constantly nurture it.  To put it differently, invest in learning how to address a problem with a different set of constraints.
  • Care about something other than Java or .NET. It doesn't necessarily matter what, but invest your time in learning something outside the core of .NET or Java. If nothing else, it differentiates you from the masses.  Do not discard breadth of knowledge; it's a close second to knowing your core competency.
  • (mostly for .NET developers) Learn to appreciate and leverage patterns. My suspicion is that since the Java community embraces open source more fully, patterns have become an integral part of its vocabulary. Regardless of root cause, that's generally not the case for .NET developers and it's a problem.  For me personally, language semantics come second; patterns and how one technically addresses and thinks through a problem come first.
  • (mostly for .NET developers) Learn to appreciate and leverage open source. There's more to life than Microsoft.  Take a moment to stop drinking Microsoft's Kool-Aid and explore what's out there in the open source world of .NET.  If nothing else, explore the major ports from the Java world (e.g. NHibernate, Spring.NET, etc.).
  • While on the topic of open source, contribute to an open source project in your core language. I mentioned breadth of knowledge earlier but you also need to nurture your depth of knowledge in what you proclaim you know best. Most prospective employers (or at least respectable ones) would love to see your code.  Odds are, the code you produced for former clients may not be shareable.  Sure you can talk to it abstractly but at the end of the day, code speaks volumes louder than concepts.  There's no better way to showcase your craft (and your passion for your craft) than contributing to open source.

Normally I let a post marinate for some time before I publish it.  For some reason, I did not in this case so I may very well come back to it and shred it to pieces.  In general, the point I want to get across is that if you are a programmer (and not an engineer), I'm starting to feel it's no longer enough to say that you know .NET and nothing but .NET.  Breadth of knowledge, while it remains second (at least for a role of this nature), is just as vital.

24 Jun 2010

Heroku Hacks - SSL

From what I've gathered, SSL is a barrier preventing many (outspoken) developers from adopting Heroku as their platform of choice. Heroku offers "three options for SSL":http://docs.heroku.com/ssl : piggybacking on their wildcard certificate (free), SNI ($5 a month), or Custom ($100 a month).  Piggybacking is not an option for most for obvious reasons. SNI lacks broad acceptance (thanks Microsoft, more on that later) so that's also not a viable option.  And Custom comes at a substantial price.

"Adam Wiggins recently shared a post":http://adamblog.heroku.com/past/2009/9/22/sni_ssl/ that eloquently describes the SSL predicament in greater detail. In a perfect world, we'd all be using SNI.  It's the cost effective, cloud friendly, and will be the ubiquitous way of implementing SSL. The holdup boils down to Vista's failure and the long tail of XP that remains in the market as a result of said failure.  SNI is not supported by IE (any version) on XP.  That's a huge slice of the browser pie.

The next question you're probably asking is, "OK, so we're not there yet when it comes to SNI. If that's the case, then why the hell is it so damn expensive to go the Custom route?". Simple, in order for Heroku to support this they need to spin up a dedicated EC2 instance for nothing other than to host your certificate. Crazy huh!?  So therein lies the problem.  If you go the future forward path that is SNI you'll isolate users using IE on XP. If you go the Custom route (and the monthly $100 add-on cost is something you can't afford) then you're back to square one.

Luckily my situation was different. For my needs, piggybacking on Heroku's certificate was adequate; what was not acceptable, however, was using a domain other than my custom one (e.g. foo.heroku.com versus foo.com) at *all times*. To hack around this, I made a few tweaks to the "ssl_required plugin":http://github.com/rails/ssl_requirement such that requests requiring SSL are redirected to the Heroku application name and redirected back to my custom domain for actions that do not require SSL. As you'd expect, as soon as you have an application that starts hopping across domains you're going to run into problems you did not account for, as Rails by default uses cookies to store session.  For me that problem was authentication.
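
To make the redirect rule concrete, here is a minimal sketch in plain Ruby. It is not the patched plugin: the host constants and the `redirect_target` helper are hypothetical names chosen for illustration, and the real ssl_requirement plugin does its work inside controller filters, which this does not reproduce.

```ruby
# Hypothetical host names, used only for illustration.
CUSTOM_HOST = 'foo.com'
HEROKU_HOST = 'foo.heroku.com'

# Returns the URL a request should be redirected to, or nil when the request
# is already on the right scheme and host.
def redirect_target(ssl_required:, request_ssl:, host:, path:)
  if ssl_required
    # SSL actions live on the Heroku domain, which the wildcard cert covers.
    return nil if request_ssl && host == HEROKU_HOST
    "https://#{HEROKU_HOST}#{path}"
  else
    # Everything else is served from the custom domain over plain HTTP.
    return nil if !request_ssl && host == CUSTOM_HOST
    "http://#{CUSTOM_HOST}#{path}"
  end
end
```

The point of keeping the decision in one function is that both directions of the hop (to the Heroku domain and back) follow the same rule, so a request is never bounced more than once.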

I am a fan of "Authlogic":http://github.com/binarylogic/authlogic and typically employ it in most of my applications.  In this particular instance, I was required to persist the user's session so that repeat visitors would not need to re-login.  As cookies are tied directly to a domain, in the event an action required SSL and the request was redirected to my application's SSL protected Heroku domain, alas, we no longer have an authenticated user.

Two ways you can address this:

  1. You pass a boat load of arguments in the URL so that the encrypted request can pry them out and use them as required. Verdict: Overly complex, insecure, and plain ugly.
  2. You switch to persisting session using Active Record, pass the session key, and rehydrate the session on the other end. Verdict: Moderately clean, quasi-elegant, and secure.

Clearly option two is the way to go.  While some view code is dirtied a bit (for example having to append the session key to links for actions being routed through SSL is a pain), I did implement it in a way that's minimal for controllers.  In short (and the code below speaks for itself), I added a method on ApplicationController that takes care of grabbing the session key from the query string, finding that session via ActiveRecord, and copying it to the local store (which is for my secured Heroku application URL). For actions requiring SSL, I simply append them to a filter that invokes this logic, permitting that action to proceed without having to do anything different.  While I could have further tweaked the ssl_required plugin and incorporated this logic there and not have even worried about the filter, I try to minimize my changes to the open source projects I employ and was actually drawn to how explicit this solution was.
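
A minimal sketch of that rehydration step in plain Ruby, under stated assumptions: `SessionStore` is an in-memory stand-in for the Active Record sessions table so the example runs without Rails, and `rehydrate_session` and the `session_key` parameter name are hypothetical. It illustrates the idea and is not the original controller method.

```ruby
# In-memory stand-in for the Active Record sessions table (an assumption made
# so the sketch runs without Rails).
class SessionStore
  def initialize
    @rows = {}
  end

  def save(session_id, data)
    @rows[session_id] = data.dup
  end

  def find(session_id)
    @rows[session_id]
  end
end

# Hypothetical helper: copies the session identified by params['session_key']
# into the local session. Returns true when something was copied.
def rehydrate_session(params, local_session, store)
  key = params['session_key']
  return false if key.nil? || key.empty?

  stored = store.find(key)
  return false if stored.nil?

  local_session.merge!(stored)
  true
end
```

In a Rails controller of that era, something like this would presumably be registered as a `before_filter` limited to the SSL actions, so the actions themselves stay unchanged.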

It goes without saying, if the SNI landscape changes, either as a result of IE on XP market share dropping (not likely) or if Amazon were to permit a single EC2 instance to have multiple IP addresses pointing to it (which the Heroku guys have been rather vocal about), I'll be the first to switch. For the time being however, this hack gets the job done in a moderately clean manner.

24 Jun 2010

Heroku Hacks - Cron Jobs

Heroku's cron jobs are just that, jobs that run on a periodic basis. The free add-on runs daily, the premium add-on (at a whopping $3 a month) runs hourly. You'd be doing yourself and your application a disservice if you didn't take advantage of at least the free version; there's always some cleanup or background housekeeping you ought to do to ensure everything continues to hum along nicely in your application. I have at least half a dozen other daily tasks I plan on rolling in but for the time being I'm using the hook to clear out stale sessions and back up my application's database to S3.

Clearing Out Stale Sessions

Obviously if you're using Rails' default cookie session store then the following does not apply. If however for specific reasons you're using Active Record to persist your application's session, it's good practice to clear stale sessions from the database. This is good for two reasons. First, less data bloat pleases the application performance gods. Second, a smaller data footprint ensures you're paying the least amount possible for hosting, as database storage size and demand are included in Heroku's pricing model. There's a handful of ways to accomplish this but I like to keep things simple, hence what follows is about as straightforward as you can get. Looking at this now I'm going to make a tweak to pass in the stale time as an argument to the rake task but nonetheless the snippet below gets the point across:
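
As a rough illustration of the cleanup logic (not the original rake task), here is a plain-Ruby sketch with the stale time passed in as an argument. `SessionRow` and `clear_stale_sessions!` are hypothetical names, and an array stands in for the sessions table so the example runs without Rails or a database.

```ruby
# Plain-Ruby stand-in for a row in the sessions table (an assumption so the
# sketch runs without Rails or a database).
SessionRow = Struct.new(:session_id, :updated_at)

SECONDS_PER_DAY = 24 * 60 * 60

# Removes rows not updated within the last `stale_days` days and returns the
# number of rows removed. `now` is a parameter so the cutoff is testable.
def clear_stale_sessions!(rows, stale_days:, now: Time.now)
  cutoff = now - (stale_days * SECONDS_PER_DAY)
  before = rows.size
  rows.reject! { |row| row.updated_at < cutoff }
  before - rows.size
end
```

Against a real database the same cutoff would feed a single delete on the sessions table, filtered on its `updated_at` column, rather than an in-memory filter.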

Backing Up An Application’s Database (To S3)

Heroku provides a few ways to back up your database. There are bundles, which effectively create a tarball of your code and data. The free version of the add-on permits you to store a single tarball at a time while the premium version (at $20 a month) enables you to store an unlimited number of tarballs. The assumption is you’ll have some process outside of Heroku to do something with this bundle. Alternatively, you can focus entirely on the data as it’s most likely the case you’re managing your code elsewhere and not relying solely on Heroku’s Git instance. For this, Heroku provides a nice hook to pull your database down into YAML and again, the assumption is you’ll have some process outside of Heroku to take advantage of this.

Both options are fine but not having an automated process to facilitate backing up your application is a bit cumbersome. Sounds like we have another task for cron! In this instance, what I’d like to do is dump the data from my database into YAML, then push that compressed YAML file to S3. Below is the rake task that handles this. This was inspired by another developer’s task I stumbled on, which I forked and made some minor tweaks of my own to:
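
The dump, compress, and push steps can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the original task: `backup_database`, the `backups/` key prefix, and the `uploader` callable are hypothetical, with the callable standing in for whatever S3 client does the actual upload.

```ruby
require 'yaml'
require 'zlib'

# Serializes the given tables to YAML, gzips the result, and hands it to
# `uploader`, a callable taking (key, bytes). Returns the key that was used.
# `tables` maps a table name to an array of row hashes.
def backup_database(tables, uploader:, now: Time.now.utc)
  key = "backups/#{now.strftime('%Y-%m-%d')}.yml.gz"
  uploader.call(key, Zlib.gzip(YAML.dump(tables)))
  key
end
```

Passing the uploader in keeps the serialization testable on its own; the date-stamped key means each run writes a new object instead of overwriting the previous backup.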

So now our cron rake task looks like this (I run the database to S3 dump weekly):
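
A minimal sketch of the scheduling decision such a task has to make, assuming the job runs once a day: the cleanup runs every time and the backup only once a week. `jobs_for` and the job names are hypothetical, and Sunday is an arbitrary choice of backup day.

```ruby
require 'date'

# Decides which housekeeping jobs run on a given date: stale-session cleanup
# every day, the S3 backup once a week (Sunday is an arbitrary choice here).
def jobs_for(date)
  jobs = [:clear_stale_sessions]
  jobs << :backup_database_to_s3 if date.sunday?
  jobs
end
```

Inside the daily rake task, each returned name would map to invoking the corresponding task; keeping the day-of-week check in one place makes it easy to move the backup to another day or cadence.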


24 Jun 2010

NoSQL, NoSQL, NoSQL, Stop!

Over the last two months I’ve become a non-relational database groupie. The deeper down the distributed rabbit hole I go, the more I agree with rants like this and this. I’m not as far out as some who tout the relational database is doomed (I’m openly drinking the Kool-Aid on a lot of this stuff but let’s not get carried away), I just think the time where relational databases make up the only component of a data tier is fading away.

Engineers (“old me” included) have grown lazy in thinking about how they store data. All this time is spent designing software, making it flexible, easy to use, easy to extend; why is data, and specifically the model in which it’s stored, not being approached with the same rigor? Perhaps it’s the pervasiveness of ORMs that’s shifted our attention away from the data. And before you say data modeling, that’s not what I am referring to. However, while on the topic, let’s face it, when was the last time you encountered a data model crafted by an engineer that significantly veered away from the class model sitting atop it? The point I am trying to make is why not question the very structure we’re modeling the data in?

Outside of the constraints imposed by a particular language, designing software is limitless to a certain degree. There are proven patterns you tend to stick with but at the end of the day, it’s wide open. Databases are different. Most of a database’s constraints (relational or not) are not as easily circumvented by design (at the data level). Even if you could work around them at a more abstract level, doing so typically comes at a substantial price sooner or later.

I spent the past month preparing for a talk I gave to the broader informatics community at work on non-relational databases. I was pleased with the final product (slides below). The bits of my talk that appeared most widely accepted were regarding data stores that did not depart radically from existing infrastructures and filled a niche need, Memcached and its ubiquitous use in caching being the best example. As soon as we veered a bit further off the well worn relational path most engineers follow and started talking about areas other than caching, the tone (or lack thereof) was telling.

What I find most interesting is that the problems a lot of these different stores address (scalability and facilitating analysis of big data being the two most applicable here) are exactly the problems staring the life sciences industry in the face. The next wave of instruments to trickle into the lab in the coming years will generate amounts of data orders of magnitude greater than what we’re seeing today. Storage implications aside, scientists are not used to consuming this much data and are going to rely more on the tools our community provides to help deal with it. And it’s not just storing it and making it accessible, it’s making it useful, which at these scales is very challenging; ad hoc analysis, combining and aggregating disparate data sets, annotating data, attempting these tasks on data in the gigabyte and terabyte ranges, is a very different monster than what we’re dealing with today.

My (unfortunate) assessment, outside of web properties who have no choice but to invest in and build atop of alternative data stores that fit their business demands and needs more specifically, most companies are just not there yet. Most don’t realize that Google’s challenges today (of scalability, of storage, of processing big data, etc.) are everyone else’s problems tomorrow. Some companies get it (and they will be rewarded for it in the future), but the vast majority either disagree with the notion that they are on a similar trajectory as the big web companies but at a slower pace or that they would be better served using an “alternative” solution.

Slides via SlideShare

24 Jun 2010

Why Azure Will Succeed

As much as it pains me to say, Azure is going to be a wildly successful (and pervasive) platform for Microsoft. So much so that I’d argue it will become the primary revenue generator for the company for the foreseeable future. Not that I necessarily agree with their approach, nor do I feel it is superior to Amazon or Google, but Azure has one thing neither platform can boast: an existing footprint in the enterprise that will be extremely difficult (if not impossible) to shake and which Microsoft undoubtedly will take advantage of.

Forward thinking companies, as you’d expect, turned the corner long ago in terms of embracing the cloud. The enterprise is not far behind. Large companies in all industries are flirting with it in more and more instances. It’s simply a matter of time (years not decades) before the lion’s share of software development is done leveraging the cloud. Whether leveraging means hosting entire applications, storing massive data sets, running computational tasks on said data sets, or hosting development and test environments, it’s happening sooner rather than later.

Big Pharma is traditionally slow to adopt new technologies, yet even here you are seeing more and more success stories pop up from small, forward thinking research groups. I’m experiencing this first hand exploring cloud computing as a viable platform for a number of R&D initiatives. In my opinion, these efforts mark the beginning of a shift: in what infrastructure a corporation needs, in data transparency, in collaboration (internal to a company and external with others).

So, back to the point of this post: why do I think Azure has a competitive advantage over other cloud providers? Two reasons:

  1. Microsoft already has a foot in everyone’s door by way of Office. Office 2010 brings with it Office Web Applications. Once the fear of storing corporate data in the cloud subsides (and it will, in the same way people needed time to become comfortable with online banking), most companies who live and breathe Office will go the way of the web. There’s no reason not to. It’s cheaper and better enables your employees.

  2. Decision makers will not get fired for choosing Microsoft. Microsoft is a brand most large companies have bought into since day one. Startups and smaller companies not so much, but big ol’ Fortune XXX companies, they’re definitely drinking the Microsoft Kool-Aid.

Once companies are bought into the cloud and are fully leveraging those aspects of Office 2010, there’s less incentive to choose another cloud provider. Moreover, since the data produced on a day to day basis is already being pushed to a particular cloud platform, there is less perceived risk in keeping the corporate environment to a single platform versus an environment more heterogeneous in nature. Microsoft is a name many large companies bought into from the start. This makes a difference. When a breach happens (and it will, ask the guys at Twitter), there will be greater comfort that it happened on Microsoft’s (versus another cloud provider’s) watch.

It’s going to be a very interesting few years to see how Amazon, Google, Microsoft and others such as Rackspace carve out their cloud niche and work to gain market share.

23 Jun 2010

Don't Get Drunk On Kool-Aid

I’m not a language zealot. I’m a firm believer in using the right tool for the job. If a particular language or framework results in walking a path of lesser resistance, yielding an elegant product as a result, that’s what I aim to use. Obviously I’m not architecting or hacking up applications in obscure languages, factors always need to be considered:

  • The limitations imposed, implicitly or explicitly, by your company
  • The ability to invest the effort in a language or framework you’re less proficient in
  • The willingness of your peers and managers in accepting that additional investment
  • The overhead in supporting an application in the long term

Point being, if an opportunity presented itself where the benefits outweighed these and other such considerations, it’s the path architects and engineers ought to choose. The Pragmatic Programmer gives a very popular bit of advice: learn a new language every year. When you learn a new language, you learn a new way to think. You discover new strategies and solutions unique to the nuances of a particular language’s culture. While learning a new language every year is impractical in my mind, it hammers home the point.

One of the roles of the group I belong to is to provide architectural expertise to the broader informatics community throughout the organization. Lately, our group has been working towards establishing reference architectures for the .NET and Java stacks. In going through the exercise, rife with debate and dialog, I left pleased but concerned.

What’s the problem with reference architectures? In my mind, they potentially limit exploration. They remove a developer’s temptation to look outside the box, to explore different approaches, to leverage different tools. What’s worse, those impacted most are the exact sort of developers you want on your team, individuals who challenge the status quo. Don’t get me wrong, I see the benefits a reference architecture provides:

  • Lowering the barrier of sharing resources
  • Expanding the applicability of lessons learned
  • Easing support over the full life of an application
  • Building (more) dynamic teams (in theory though I’m not entirely sold on this aspect)

I get it and I’m fully on board. I just think every reference architecture needs a giant footnote at the bottom that reads “don’t be afraid to challenge everything you just read”.



I am a software architect, hacker, and informatician living outside Princeton, NJ. I'm technologically agnostic, a pragmatic thinker, and a believer in keeping it simple.