Recently, here at Hyla, my team and I pushed forward and moved the entire trade-in and processing platform onto AWS. Along the way we made a few application-related improvements to leverage the benefits of the target environment. A recent LinkedIn post resulted in a few folks asking me what we learned, so I thought I would provide a brief write-up of our experience pushing through this change. I hope it helps others who are on a similar journey.
#1. Change buy in – Moving from a legacy environment to any cloud environment is rarely a trivial activity. Hence, it is important that the change is not treated as a hobby project. It needs support from senior leadership, acceptance from the business that it is an important strategic initiative aligned with business goals, and recognition from corporate finance of the CapEx to OpEx shift. With the above in place, the execution side needs an excellent project management team, commitment from the execution staff and, most importantly, good old-fashioned grit to see things through.
#2. Leading change – The mechanics of pushing through the change boil down to one of three approaches: (1) leading the change from the front, (2) influencing the team to push through the change, or (3) issuing a top-down edict to execute the change. However, any change implementation comes with a degree of Fear, Uncertainty and Doubt (FUD), and the leadership stamina needed to see the change through grows in proportion to the intensity of that FUD. Accordingly, the strategy needs to be chosen with care. Leading from the front involves getting dirt under the nails, so if the leader has the ability and bandwidth to embrace this strategy, the probability of success is relatively high. However, if the change is executed by influence, or if it is pushed top down, then the leader(s) must ensure that the team is adequately staffed and trained to manage through the change successfully.
#3. Architecture – While it sounds trite, it is important to pen down both a high-level architecture diagram and a next-level diagram that includes low-level details such as the VPC, subnets for the various tiers, ACLs for the subnets, and security groups for EC2, including ports. When designing the architecture, keep in mind the various services that may be needed as well, for example SSH (22), SMTP (25) and Tomcat (8080). Using the architecture as a blueprint, CloudFormation templates or other scripts then need to be written to build the infrastructure.
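To make the step from diagram to script concrete, here is a minimal sketch, not our actual template, that turns a small port plan into a CloudFormation security group resource. The port numbers come from the text above; the source CIDR ranges, the `VpcId` parameter name and the `AppSecurityGroup` logical name are assumptions made up for illustration.

```python
import json

# Hypothetical port plan. The ports come from the text above; the source
# CIDR ranges are placeholders for illustration only.
PORT_PLAN = [
    {"name": "ssh", "port": 22, "source": "10.0.0.0/24"},
    {"name": "smtp", "port": 25, "source": "10.0.1.0/24"},
    {"name": "tomcat", "port": 8080, "source": "10.0.2.0/24"},
]


def security_group_resource(description, vpc_parameter, port_plan):
    """Build an AWS::EC2::SecurityGroup resource as a plain dict."""
    ingress = [
        {
            "IpProtocol": "tcp",
            "FromPort": rule["port"],
            "ToPort": rule["port"],
            "CidrIp": rule["source"],
        }
        for rule in port_plan
    ]
    return {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": description,
            "VpcId": {"Ref": vpc_parameter},
            "SecurityGroupIngress": ingress,
        },
    }


# The VPC is passed in as a template parameter (an assumed name).
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {"VpcId": {"Type": "AWS::EC2::VPC::Id"}},
    "Resources": {
        "AppSecurityGroup": security_group_resource(
            "App tier access", "VpcId", PORT_PLAN
        )
    },
}

if __name__ == "__main__":
    print(json.dumps(template, indent=2))
```

Keeping the port plan as data next to the diagram makes it easier to review the two together before anything is built.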
#4. Application State – When porting over legacy applications, this is one area that is most likely going to cause a lot of heartburn. The underlying issue is what Martin Fowler calls a “Snowflake Server”. This is where folks need to spend energy to decouple application state from the environment. One of the long poles happens to be property files. The best way to tackle this would be to pivot to something like Cloud Config, ZooKeeper or Consul. However, under timeline pressure it may be hard to pivot, and in those cases S3 can be leveraged to store the application state and configuration files.
#5. AWS Accounts – Before building anything, it is important to think through accounts and their hierarchy. One could design a fine-grained hierarchy or stay coarse-grained; the final design needs to be driven by department or company objectives. In our case, we just needed four separate accounts, one per environment: prod, uat, qa and dev. In a larger organization, however, it may be wise to put deeper thought into account organization. Separate accounts make it possible to get billing information by account. (It is also possible to get billing information using tags, which is one more reason to think through the setup beforehand.)
#6. VPC and CIDR Ranges – It is equally important to put thought into segregating CIDR ranges by environment and business domain. In our case, we had to go through a few iterations to pick the right CIDR ranges for dev, qa, uat and prod (iterations that could have been avoided had we spent the time early on).
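One way to do that planning up front is to carve the ranges out of a single parent block on paper (or in code) before any VPC exists. The sketch below uses Python's standard `ipaddress` module; the parent block `10.0.0.0/14`, the tier names and the prefix sizes are assumptions for illustration, not the ranges we actually picked.

```python
import ipaddress

ENVIRONMENTS = ["dev", "qa", "uat", "prod"]
TIERS = ["web", "app", "data"]  # hypothetical tier names


def plan_cidrs(parent_cidr, environments, tiers, vpc_prefix=16, subnet_prefix=24):
    """Assign one non-overlapping VPC range per environment and one
    subnet per tier inside each VPC range."""
    parent = ipaddress.ip_network(parent_cidr)
    if 2 ** (vpc_prefix - parent.prefixlen) < len(environments):
        raise ValueError("parent block is too small for all environments")
    if 2 ** (subnet_prefix - vpc_prefix) < len(tiers):
        raise ValueError("VPC range is too small for all tiers")

    plan = {}
    for env, vpc in zip(environments, parent.subnets(new_prefix=vpc_prefix)):
        subnets = vpc.subnets(new_prefix=subnet_prefix)
        plan[env] = {
            "vpc": str(vpc),
            "subnets": {tier: str(net) for tier, net in zip(tiers, subnets)},
        }
    return plan


plan = plan_cidrs("10.0.0.0/14", ENVIRONMENTS, TIERS)

if __name__ == "__main__":
    for env, entry in plan.items():
        print(env, entry["vpc"], entry["subnets"])
```

Because every range is derived from the same parent block, the environments cannot overlap, which keeps options such as peering open later.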
#7. Building up infrastructure – Building by hand through the console is great for learning. However, folks need to invest time and energy to build up the infrastructure using CloudFormation, Terraform or CloudFormation Template Generator (from Monsanto Engineering). In our case, we ended up using CloudFormation (after a very brief evaluation of the other two products). Once the scripts are in place, it is important to start treating them as code, specifically, infrastructure as code. The idea that infrastructure is no different from the rest of the code base needs to become ingrained in the culture of the software organization. In our case, the CloudFormation scripts are in Git and, going forward, changes to the environment will get no different treatment than changes to the code supporting our product suites.
#5. Service Limits – It is a good idea to be aware of what the limits are and to request increases ahead of time. It is far from ideal when an application under load, trying to scale up, hits the limits and breaks down. That would hardly yield an optimal experience, would it?
#6. Accessing EC2 – If set up right, only a few (very few) people will need SSH access to EC2. In fact, in a well-automated state, even SSH access may not be required. One of the reasons developers need access to EC2 instances is to view application logs. The logs, however, can be piped to CloudWatch Logs, and if IAM is set up correctly, this should address the need to access logs for debugging purposes. Another strategy is to send all the log data to Elasticsearch, which is actually the ideal solution. This not only enables enhanced search capabilities, but also opens up opportunities to perform log analytics through Kibana.
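As an illustration of what "IAM set up correctly" could look like for log access, here is a minimal sketch that builds a read-only policy document for a single CloudWatch Logs log group. The region, account ID and log group name are placeholders, and the exact set of actions your developers need is an assumption to verify against your own setup.

```python
import json


def log_read_policy(region, account_id, log_group):
    """Build a read-only IAM policy document for one log group."""
    arn = f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:DescribeLogStreams",
                    "logs:GetLogEvents",
                    "logs:FilterLogEvents",
                ],
                "Resource": arn,
            }
        ],
    }


# Placeholder values for illustration only.
policy = log_read_policy("us-east-1", "123456789012", "/app/dev/tomcat")

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
```

A policy like this can be attached to a developer group so that reading logs never requires a shell on the instance.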
#7. Static IPs – In the cloud environment, there is limited or no need for static IPs. However, this idea takes a little getting used to, especially when we have been used to fixed IPs throughout our software life. In our case, only the NAT Gateways have Elastic IPs. Pretty much everything else in our environment has virtual IPs, and almost all of them are private too. The SSH bastions have public IPs, but these are not static. So if the CloudFormation stack that was used to build the bastions were deleted and recreated, the bastions would get new IPs. We felt that was OK given that only a few people had access.
#8. Private IPs – Almost all of our IPs are private, and none of those are visible to the outside world. The public IPs are for the NAT Gateways, external-facing ELBs and bastions. One can access the private IPs only from the bastion. Initially, this caused a bit of pain because every time we needed to SSH to an EC2 resource we had to figure out what its IP was. This meant logging into the console to see what the private IP was, which required a few more clicks than before. However, our capable DevOps team is aggressively tackling this problem with automation that leverages the AWS CLI.
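A minimal sketch of that kind of lookup, under stated assumptions: it works on the JSON shape returned by the EC2 DescribeInstances call (for example the output of `aws ec2 describe-instances`), and groups private IPs by the `Name` tag. The sample data and the helper name `private_ips_by_name` are made up for illustration; this is not our team's actual tooling.

```python
def private_ips_by_name(response):
    """Map each instance's Name tag to its private IPs, given a
    DescribeInstances-style response dict."""
    result = {}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            ip = instance.get("PrivateIpAddress")
            if ip is None:
                continue  # e.g. an instance that is already terminated
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            name = tags.get("Name", instance.get("InstanceId", "unknown"))
            result.setdefault(name, []).append(ip)
    return result


# Hand-written sample in the DescribeInstances response shape.
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0aaa",
                    "PrivateIpAddress": "10.0.1.15",
                    "Tags": [{"Key": "Name", "Value": "dev-app"}],
                },
                {
                    "InstanceId": "i-0bbb",
                    "PrivateIpAddress": "10.0.1.16",
                    "Tags": [{"Key": "Name", "Value": "dev-app"}],
                },
                {"InstanceId": "i-0ccc", "Tags": []},
            ]
        }
    ]
}

if __name__ == "__main__":
    print(private_ips_by_name(sample_response))
```

Run from the bastion, a helper along these lines removes the trip to the console, provided the naming convention for the `Name` tag is followed.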
#9. ASG – To scale, we set up CPU High and CPU Low alarms. Here too, it is a good idea to put some thought into what the high and low thresholds should look like. This one we learnt by experience. At one point our application servers were thrashing pretty badly. In the middle of debugging, a server would just power off. The shutdowns felt arbitrary, with no apparent reason. We went chasing our tails, thinking that the environment was “unstable” and suspecting that something was wrong with the UserData part of the EC2 setup. In the end, it turned out that the High CPU alarm threshold was not right. The bar was set too low, and when the application hit that low bar for the high threshold, the ASG terminated the instance and replaced it with a new instance, which was then promptly terminated as well. Resetting the High CPU alarms for Auto Scaling brought stability and relief.
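A simple sanity check before deploying alarm thresholds might have saved us the chase. The sketch below is a hypothetical helper, not an AWS API: it flags a high threshold that the application's normal (baseline) CPU already reaches, and a gap between low and high that is too narrow. The 20-point minimum gap is an arbitrary rule of thumb assumed for illustration.

```python
def check_cpu_thresholds(low, high, baseline_cpu, min_gap=20.0):
    """Return a list of problems with a pair of CPU alarm thresholds.

    low, high and baseline_cpu are percentages. min_gap is an assumed
    rule of thumb, not an AWS recommendation.
    """
    problems = []
    if not 0 <= low < high <= 100:
        problems.append("thresholds must satisfy 0 <= low < high <= 100")
        return problems
    if high - low < min_gap:
        problems.append(
            f"gap between low ({low}) and high ({high}) is under {min_gap}"
        )
    if baseline_cpu >= high:
        problems.append(
            f"baseline CPU ({baseline_cpu}) already meets the high threshold ({high})"
        )
    return problems


if __name__ == "__main__":
    # A high bar that normal load already crosses, as in our incident.
    for problem in check_cpu_thresholds(low=20, high=30, baseline_cpu=45):
        print(problem)
```

The baseline value has to come from observing the application under normal load, which is exactly the measurement we skipped the first time around.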
#10. Tags – Putting thought into tags is extremely important. Tags are free-form text, so it is important to establish a solid naming convention for all resources and stick to it diligently. Tagging has the potential to become runaway and chaotic if it is not controlled from the get-go.
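One way to keep a convention from drifting is to check it mechanically. The sketch below validates a resource's tags against a made-up convention: the required keys, the allowed environments and the `<env>-<role>` name pattern are all assumptions for illustration, so substitute whatever your organization agrees on.

```python
import re

# Hypothetical convention, for illustration only.
REQUIRED_TAGS = ("Name", "Environment", "Owner")
ALLOWED_ENVIRONMENTS = {"dev", "qa", "uat", "prod"}
NAME_PATTERN = re.compile(r"^(dev|qa|uat|prod)-[a-z0-9]+(-[a-z0-9]+)*$")


def tag_violations(tags):
    """Return a list of ways a tag dict breaks the convention."""
    problems = []
    for key in REQUIRED_TAGS:
        if not tags.get(key):
            problems.append(f"missing tag: {key}")
    env = tags.get("Environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    name = tags.get("Name")
    if name and not NAME_PATTERN.match(name):
        problems.append(f"name does not match convention: {name}")
    elif name and env in ALLOWED_ENVIRONMENTS and not name.startswith(env + "-"):
        problems.append("name prefix does not match Environment tag")
    return problems


if __name__ == "__main__":
    print(tag_violations({"Name": "WebServer 1", "Environment": "production"}))
```

A check like this can run against templates in code review, before a badly tagged resource ever exists.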
#11. SSL Termination – Terminating SSL at the ELB offloads the SSL overhead from the web servers. In addition, AWS provides nicely packaged predefined security policies, which make security a breeze (for example, turning off TLS 1.0 is a walk in the park).
#12. RDS – Going down this route takes away a lot of the freedom that comes with, say, setting up Postgres (or MySQL) on EC2. AWS retains the rights of the true “superuser”, and the admin user is limited to a restricted set of privileges. For legacy applications, this is another area where people may have to spend time cleaning up. On the other hand, a neat thing about RDS is that encrypting data at rest is a breeze. It might be a good idea, however, to generate a key in KMS and use it rather than the default one.
#13. IAM Groups and Users – Time needs to be put into designing and building out IAM groups with an appropriate set of permissions. Users can then be assigned to the groups, which gives better control over limiting permissions and achieves a well-thought-out separation of responsibilities.
#14. Getting Help – The free support through the AWS Forums is totally useless. Questions go unanswered. Ponying up the money for paid support is well worth it (for the reasons mentioned in #15).
#15. Still Not Perfect – AWS is not yet perfect. For instance, during our production DB build-out, the creation of a read-only replica failed for an unknown reason. It took multiple attempts, with some help from AWS support, to get rid of the zombie read-only replica that sat in a limbo state for 12+ hours. Another time, we encountered an issue with a CloudFormation script. Specifically, we ran into a situation where we were unable to delete a script because it relied on another script that had been deleted successfully earlier. The error message indicated that the script couldn’t be deleted because it used an export from the other script, which was long gone (but had managed to stick around behind the curtains in a phantom state).
#16. /var/log/cloud-init-output.log – During the build-out phase, reviewing the output log at this location makes debugging UserData a breeze. The output clearly tells what went wrong.
#17. CodeDeploy woes – We used the “AutoScalingGroups” property of “AWS::CodeDeploy::DeploymentGroup”. However, every now and then, the ASG went into a weird state. Fixing this state meant cleaning things up manually, which involved getting the list of ASG lifecycle hooks, identifying the one for CodeDeploy, and then deleting it using the CLI. When this became a recurring pain, we switched over to Ec2TagFilters, which made life a lot easier.
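The "identify the one for CodeDeploy" step can be sketched as a small filter over the output of `aws autoscaling describe-lifecycle-hooks`. The hook name prefix used below is an assumption based on how CodeDeploy-created hooks are typically named, so check it against what you actually see in your account; the sample data is made up.

```python
# Assumed name prefix for hooks that CodeDeploy creates on an ASG.
CODEDEPLOY_HOOK_PREFIX = "CodeDeploy-managed-automatic-launch-deployment-hook"


def codedeploy_hooks(response, prefix=CODEDEPLOY_HOOK_PREFIX):
    """Return (asg_name, hook_name) pairs for hooks matching the prefix,
    given a DescribeLifecycleHooks-style response dict."""
    return [
        (hook["AutoScalingGroupName"], hook["LifecycleHookName"])
        for hook in response.get("LifecycleHooks", [])
        if hook.get("LifecycleHookName", "").startswith(prefix)
    ]


# Hand-written sample in the DescribeLifecycleHooks response shape.
sample_hooks = {
    "LifecycleHooks": [
        {
            "AutoScalingGroupName": "dev-web-asg",
            "LifecycleHookName": CODEDEPLOY_HOOK_PREFIX + "-dev-web-1234",
        },
        {
            "AutoScalingGroupName": "dev-web-asg",
            "LifecycleHookName": "custom-drain-hook",
        },
    ]
}

if __name__ == "__main__":
    for asg_name, hook_name in codedeploy_hooks(sample_hooks):
        print(asg_name, hook_name)
```

Each pair it returns can then be passed to `aws autoscaling delete-lifecycle-hook` via `--auto-scaling-group-name` and `--lifecycle-hook-name`.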
#18. CloudFormation – Keeping the scripts small and building one bit on top of another keeps them organized, manageable and less error-prone. We started with monolithic scripts with thousands and thousands of lines of code. We rapidly realized this was going to be problematic and pivoted to breaking them apart. So we built the core infrastructure (VPC, Internet Gateway, NAT Gateway, route tables, routes etc.), followed by the web infrastructure (ELB, SG etc.), the web servers (ASG, alarms etc.), the app servers and so on. We build each layer on top of the previous one using exports from the previous script.
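The layering relies on CloudFormation's `Export` and `Fn::ImportValue`. As a minimal sketch, the two tiny templates below (their resource names, export name and CIDR are assumptions for illustration) show the pattern, and a small helper checks that every import is satisfied by an export from an earlier layer in the build order.

```python
# Core layer: creates the VPC and exports its ID.
core_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "Vpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": "10.0.0.0/16"},
        }
    },
    "Outputs": {
        "VpcId": {"Value": {"Ref": "Vpc"}, "Export": {"Name": "core-VpcId"}}
    },
}

# Web layer: imports the VPC ID exported by the core layer.
web_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "WebSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Web tier",
                "VpcId": {"Fn::ImportValue": "core-VpcId"},
            },
        }
    },
}


def exported_names(template):
    """Collect the export names declared in a template's Outputs."""
    return {
        output["Export"]["Name"]
        for output in template.get("Outputs", {}).values()
        if "Export" in output
    }


def imported_names(node):
    """Recursively collect plain-string Fn::ImportValue names."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Fn::ImportValue" and isinstance(value, str):
                found.add(value)
            else:
                found |= imported_names(value)
    elif isinstance(node, list):
        for item in node:
            found |= imported_names(item)
    return found


def missing_imports(stacks):
    """Given (name, template) pairs in build order, report imports that
    no earlier stack exports."""
    available, missing = set(), {}
    for name, template in stacks:
        unresolved = imported_names(template) - available
        if unresolved:
            missing[name] = unresolved
        available |= exported_names(template)
    return missing


if __name__ == "__main__":
    print(missing_imports([("core", core_template), ("web", web_template)]))
```

The same dependency is why layers have to be torn down in reverse order: CloudFormation refuses to delete a stack whose export is still imported elsewhere.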
#19. Lambda – We used Lambda functions to execute custom logic in CodePipeline. The custom logic involved executing shell scripts on EC2 instances and moving files from one S3 bucket to another. The shell scripts were executed from CodePipeline through Lambda and SSM (it is a bit more complex than we would like). In addition, we utilized Lambda to send EC2, ASG and RDS alarms and CodePipeline approvals to a HipChat room. We think Lambda offers solid potential in the AWS environment to automate many manual tasks.
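For the alarm notifications, a handler could look roughly like the sketch below. It assumes the alarms reach Lambda through an SNS topic, which is the usual route for CloudWatch alarms but not something spelled out above, and it only formats the chat message; the actual call to the chat service is left out. The function names are hypothetical.

```python
import json


def format_alarm(message):
    """Turn a CloudWatch alarm notification into a one-line chat message."""
    return "[{state}] {name}: {reason}".format(
        state=message.get("NewStateValue", "UNKNOWN"),
        name=message.get("AlarmName", "unnamed alarm"),
        reason=message.get("NewStateReason", "no reason given"),
    )


def handler(event, context):
    """Lambda entry point for SNS-delivered alarm notifications.

    Returns the formatted lines; posting them to a chat room would
    happen here in a real function.
    """
    lines = []
    for record in event.get("Records", []):
        # SNS delivers the alarm as a JSON string in the Message field.
        message = json.loads(record["Sns"]["Message"])
        lines.append(format_alarm(message))
    return lines


# Hand-written sample event in the SNS-to-Lambda shape.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "web-cpu-high",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed",
                    }
                )
            }
        }
    ]
}

if __name__ == "__main__":
    print(handler(sample_event, None))
```

Keeping the formatting in a pure function makes it easy to test locally without deploying anything.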
#20. AWS Lock In – AWS provides an amazing set of tools (CLI) and SDKs (Java, Python etc.) that make automation a breeze. In addition, AWS is starting to offer neat solutions for code build, deploy etc. that interoperate seamlessly with other AWS services and technology stacks. Leveraging more and more of these means we are tightly coupling our applications and processes to the “virtual” environment. Such coupling means that moving to another cloud provider like Azure or GCP in the future will be a lot harder to execute. So before digging deeper, it is important to evaluate the long-term cloud strategy and have a crisp view of the path being taken. (The same logic holds true for reserved instances.)
Note: There were areas that we just couldn’t get to prior to the production push but plan to tackle soon: (1) evaluate using ELB health checks instead of EC2 health checks to make auto scaling determinations; (2) evaluate the federation option in lieu of clustering to avoid the network partition issues that seem to happen every now and then; (3) evaluate custom metrics instead of the free ones; (4) use StackSets for CloudFormation; (5) CloudTrail for audit; (6) granular billing alerts; (7) evaluate the use of reserved instances to save some more money; (8) explore Cloudian to reduce the cost even further.