What next?
In the previous posts about our migration, we asked ourselves:
- Why do we want to move to the cloud?
- What do we want our day 1 architecture to look like?
- How do we get there?
- How do we handle our bandwidth constraints?
The last and most exciting questions are: “What comes next?” and “How are we going to re-engineer to take advantage of being on GCP?”
The technology
While our migration plan erred on the side of directly translating our existing infrastructure onto GCP, there are many GCP tools we are using on day 1:
Scaling: This sounds obvious, but the ability to provision instances when we need them — and power them off when we don’t — is a game-changer. Our largest batch graph-building applications require a lot of resources when they run, but only run half the time. On-premise, this means we have to provision machines for peak capacity. On GCP, we can automatically power machines off when we don’t need them.
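For illustration, here is a minimal sketch of that pattern using the google-cloud-compute Python client — not our production tooling, and the project, zone, and instance-name prefix are hypothetical:

```python
# Sketch: stop idle batch workers with the google-cloud-compute client.
# The project, zone, and naming convention here are hypothetical examples.
from google.cloud import compute_v1

def stop_idle_workers(project: str, zone: str, prefix: str) -> None:
    """Stop every running instance whose name starts with `prefix`."""
    client = compute_v1.InstancesClient()
    for instance in client.list(project=project, zone=zone):
        if instance.name.startswith(prefix) and instance.status == "RUNNING":
            # Stopped instances no longer bill for CPU/RAM (disks still bill).
            client.stop(project=project, zone=zone, instance=instance.name)

stop_idle_workers("my-project", "us-central1-a", "graph-builder-")
```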
Google Kubernetes Engine (GKE) + Google Container Registry (GCR): By far the largest re-architecting we did while migrating to GCP was eliminating our on-premise Chef environment in favor of Kubernetes (via GKE), hosting images in GCR. While this was a large undertaking, it was 100% worth the time spent.
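As a rough sketch of what deploying a GCR-hosted image to GKE can look like from the official Kubernetes Python client — the image path, app name, and namespace below are hypothetical:

```python
# Sketch: deploy a GCR-hosted image to GKE with the Kubernetes Python client.
# The image path, app name, and namespace are hypothetical examples.
from kubernetes import client, config

config.load_kube_config()  # assumes cluster credentials are already configured

labels = {"app": "example-service"}
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="example-service"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="example-service",
                    image="gcr.io/my-project/example-service:1.0.0",
                )
            ]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```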
Cloud Storage (GCS): We are replacing our HDFS environment entirely with GCS. Shifting to object storage has given us the flexibility to iterate on computational infrastructure without risk of data loss.
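A minimal sketch of the object-storage model with the google-cloud-storage client — bucket and object names are hypothetical:

```python
# Sketch: read and write objects in GCS with google-cloud-storage.
# Bucket and object names are hypothetical examples.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-bucket")

# Write: stage a local file as an object.
bucket.blob("ingest/2019-07-01/records.avro").upload_from_filename("records.avro")

# Read: pull it back down for processing.
bucket.blob("ingest/2019-07-01/records.avro").download_to_filename("/tmp/records.avro")
```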
BigTable: We have replaced our in-house key-value data store with BigTable.
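A minimal read/write sketch with the google-cloud-bigtable client — the instance, table, column family, and row key below are hypothetical:

```python
# Sketch: write and read a row with the google-cloud-bigtable client.
# Instance, table, column-family, and key names are hypothetical examples.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("kv-instance").table("kv-table")

# Write: set one cell in column family "cf".
row = table.direct_row(b"user#1234")
row.set_cell("cf", b"segment", b"auto-intenders")
row.commit()

# Read: fetch the row back by key (returns None if the key is absent).
result = table.read_row(b"user#1234")
print(result.cells["cf"][b"segment"][0].value)
```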
CloudSQL: LiveRamp uses MySQL heavily. In addition to all the normal reasons we use a database — data specification, state coordination — we internally use databases as queues, as statistics warehouses, and more. This work is split between a few dozen databases, many owned by application teams. By moving lightweight databases into CloudSQL, application teams can completely own a MySQL instance without being DBAs; teams can think about the schema and not worry about backups or replication.
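Because a CloudSQL instance exposes a standard MySQL endpoint, application code connects with an ordinary driver. A minimal sketch using PyMySQL, with hypothetical connection details and schema:

```python
# Sketch: Cloud SQL speaks ordinary MySQL, so any standard driver works.
# The host (instance private IP), credentials, and schema are hypothetical.
import pymysql

conn = pymysql.connect(
    host="10.0.0.5",          # Cloud SQL private IP (or go via the Cloud SQL Proxy)
    user="app_user",
    password="app_password",
    database="job_queue",
)
with conn.cursor() as cursor:
    cursor.execute("SELECT id, status FROM jobs WHERE status = %s", ("pending",))
    for job_id, status in cursor.fetchall():
        print(job_id, status)
conn.close()
```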
BigQuery: Although analytical queries are not a large share of the computation we do at LiveRamp, they serve an important purpose — exposing our log data to Business Intelligence analysts without involving engineers.
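A minimal sketch of that kind of query with the google-cloud-bigquery client — the dataset and table names are hypothetical:

```python
# Sketch: query log data directly in BigQuery.
# The dataset and table names are hypothetical examples.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT DATE(event_time) AS day, COUNT(*) AS events
    FROM `my-project.logs.request_events`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""
for row in client.query(query):  # blocks until the job finishes, then yields rows
    print(row.day, row.events)
```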
FileStore: On-premise, NFS is the glue we use to share configurations between VMs and to give support engineers an environment to stage data before ingesting it into HDFS. FileStore behaves perfectly as a high-performance NFS replacement which we don’t have to manage ourselves.
What’s next?
Short answer: we don’t know, and we will update here when we find out : )
Slightly longer answer: we don’t know, but we are pretty sure it will involve more of a few exciting technologies:
BigTable (take 2): we use BigTable to power our pixel server, but we know we are just scratching the surface of what we could do with scalable real-time query stores. One possibility is a true real-time pipeline to complement (or even replace) our batch processing pipeline.
DataFlow: LiveRamp is big on Hadoop; we’ve built sophisticated tools which help us process hundreds of terabytes of data a day, and Hadoop powers all of them. But we don’t want to approach every problem with the same hammer; if we can cheaply replace self-maintained ETL workflows with DataFlow, we will.
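To make that concrete, here is a minimal Apache Beam pipeline (the SDK that DataFlow runs); the bucket paths and transforms are hypothetical, and passing --runner=DataflowRunner would move it off the local runner:

```python
# Sketch: a minimal Apache Beam ETL pipeline that could run on Dataflow.
# Bucket paths and transforms are hypothetical examples.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 3)
        | "Format" >> beam.Map(lambda fields: "|".join(fields))
        | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/records")
    )
```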
Cloud Functions, Cloud Run: LiveRamp runs a lot of services; many of these services function mainly as wrappers over team-specific databases, providing configuration logic or enqueuing requests. These services usually see highly variable load, sometimes processing hundreds of requests, sometimes none.
Kubernetes autoscaling can help to a degree, but the native ability of cloud functions to scale with load would remove these concerns entirely.
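For illustration, an HTTP-triggered Cloud Function is just a request handler; everything below the signature is a hypothetical sketch of the thin wrapper services described above:

```python
# Sketch: an HTTP-triggered Cloud Function wrapping a team-owned queue.
# The function scales from zero with load; the queue logic is hypothetical.
def enqueue_request(request):
    """HTTP Cloud Function entry point; `request` is a Flask request object."""
    payload = request.get_json(silent=True)
    if payload is None:
        return ("missing JSON body", 400)
    # In a real service this would write to Pub/Sub, a database queue, etc.
    print(f"enqueued job for customer {payload.get('customer_id')}")
    return ("accepted", 202)
```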
The people
Migrating to the cloud is often discussed in the context of technical tradeoffs. The cloud provides native services, easier regional replication, disaster recovery, and more (at a price, of course).
What is less commonly discussed is that a cloud migration is an opportunity to rethink your own engineering practices. While I believe LiveRamp has done an amazing technical job on this migration, I am most proud of how the team is using this opportunity to rethink team responsibilities now that we have shed the constraints of an on-premise environment. Our engineering team has made three important changes I’ll mention below.
Stop building on shared infrastructure
Because provisioning new hardware on-premise takes months, it was impractical to create entirely new infrastructure for new application teams; instead, the same infrastructure was shared across application teams. Some of our most sensitive shared services were:
- Our Hadoop environment: both YARN and HDFS were shared across application teams.
- Databases: some teams ran independent databases, but all databases were provisioned by the infrastructure team.
- Kubernetes: applications across teams were deployed to a central Kubernetes cluster.
- Virtual Machines: VMs were provisioned from a central pool of resources.
Unfortunately, the convenience of central services comes at a heavy cost to agility. If a dozen application teams are using the same Hadoop environment, it eventually becomes impossible to upgrade the infrastructure; something critical will break with every change. Even simple performance adjustments are difficult when different teams use the same infrastructure in different ways.
Likewise, shared infrastructure is always vulnerable to (often accidentally) abusive behavior by a single application or team. Kubernetes can run out of resources; NameNodes can be overwhelmed by RPC calls. Teams become defensive about rolling out changes when those changes can impact applications written by unrelated teams.
Because provisioning new, lightweight copies of this infrastructure on GCP takes minutes, not months, teams no longer need to bootstrap off a common pool of services. This has already improved confidence and iteration speed.
Teams own their spending
Because many applications used shared services on-premise, it was difficult to accurately attribute costs back to individual teams or applications. Optimization projects were often infrastructure-driven: when infrastructure exhausted a pool of shared resources, we drove the initiative to reclaim them. Teams often did not know their applications “cost” too much until infrastructure told them so.
Because teams on GCP run resources within their own projects, it is trivially easy to attribute costs. Application teams are closest to the customer, and are now empowered to make cost tradeoffs — if spending 2x the money to deliver data 2x as fast is worth the price, they can do it.
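For example, with billing export to BigQuery enabled, a per-project cost report is one query away. This sketch assumes a hypothetical export dataset and table name:

```python
# Sketch: attribute spend per project from a BigQuery billing export.
# The export dataset/table name is hypothetical; billing export must be enabled.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT project.id AS project_id, ROUND(SUM(cost), 2) AS total_cost
    FROM `my-project.billing.gcp_billing_export_v1`
    WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY project_id
    ORDER BY total_cost DESC
"""
for row in client.query(query):
    print(row.project_id, row.total_cost)
```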
Infrastructure guides, teams decide
Shared resources and costs put the infrastructure team in the uncomfortable position of technology arbiters; since the infrastructure team would end up owning whatever infrastructure was provisioned, we had to say “no” to experimental technology more often than anyone wanted.
We have redefined where the infrastructure team fits into the engineering organization in a cloud-migrated world:
- Infrastructure still handles security and data permissions.
- Core infrastructure that is unavoidably shared (for example, the shared VPC network) is still managed by infrastructure.
- Everywhere else, infrastructure helps by building tools and suggesting best practices; teams are free to use the tools they feel are best for them, whether those tools are in-house or sourced externally.
Nobody (at least, nobody on our team) likes being a central planner; by decomposing our shared infrastructure, we can get back to the fun part of being an infrastructure team — building tools and helping other engineers.
Going forward
These are all big changes. It’s easy to write a blog post which talks about the future, but we fully understand it will take us years to get there. But no matter which technologies we end up using a year down the road, we are confident that we are headed in the right direction. If this all sounds exciting, remember — we’re always hiring. Help us figure out what comes next.