Authors – Nayan Bhanushali & Shefali Kamble, Cloud Engineers.
What is the Azure Well-Architected Framework?
To earn your customers' confidence, you need to assure them that their data is secure and available, and provide an architecture that meets their requirements and withstands the challenges they face. The Azure Well-Architected Framework is a set of guiding tenets for building high-quality solutions on Azure.
The framework helps you design, build, and continuously improve secure, reliable, and efficient applications. It rests on five pillars that are essential to a great Azure architecture.
Azure Well-Architected Framework pillars
The Azure Well-Architected Framework consists of five pillars:
- Operational excellence
- Cost optimization
- Performance efficiency
- Reliability
- Security
Operational excellence
Operational excellence is about ensuring that you have full visibility into how your application is running and ensuring the best experience for your users. You can use several principles when driving operational excellence through your architecture.
Design, build and orchestrate with modern practices
Modern architectures should be designed with DevOps and continuous integration in mind. These practices help you automate deployments by using infrastructure as code, automate application testing, and build new environments as needed. Whether your project is an application that uses full continuous integration and continuous deployment (CI/CD) and containers, or a legacy application that you’re continuing to service, there are DevOps practices you can bring into your organization.
Use monitoring and analytics to gain operational insights
By creating an effective system for monitoring what’s going on in your architecture, you can ensure that you’ll know when something isn’t right before your users are affected. With a comprehensive approach to monitoring, you’ll be able to identify performance issues and cost inefficiencies, correlate events, and gain a greater ability to troubleshoot issues.
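As a rough illustration of knowing when something isn't right before users are affected, here is a minimal Python sketch of an error-rate check over the most recent requests. The class name `ErrorRateMonitor`, the window size, and the 5 percent threshold are all assumptions made for this example; a real deployment would rely on a monitoring service rather than hand-written code.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the outcome of the most recent requests and flags high error rates."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05) -> None:
        # Each entry is True for a failed request, False for a successful one.
        # deque(maxlen=...) silently drops the oldest entry once the window is full.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold  # hypothetical alert threshold (5 percent)

    def record(self, failed: bool) -> None:
        self.window.append(failed)

    def error_rate(self) -> float:
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def should_alert(self) -> bool:
        return self.error_rate() > self.threshold
```

Calling `record()` after every request and checking `should_alert()` periodically gives an early signal that is based on recent behavior rather than on lifetime averages.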
Use automation to reduce effort and error
You should automate as much of your architecture as possible. Human intervention is costly, injecting time and error into operational activities. This increased time and error will result in increased operational costs. You can use automation to build, deploy, and administer resources. By automating common activities, you can eliminate the delay in waiting for a human to intervene.
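One way to picture automated, repeatable deployment is as a comparison between the resources you want and the resources that currently exist. The following Python sketch is not tied to any Azure tool; the function `plan_changes` and the resource names in the usage note are hypothetical.

```python
def plan_changes(desired: set[str], actual: set[str]) -> dict[str, list[str]]:
    """Return which resources to create and which to delete so that
    the actual environment matches the desired definition."""
    return {
        "create": sorted(desired - actual),  # declared but missing
        "delete": sorted(actual - desired),  # present but no longer declared
    }
```

For example, `plan_changes({"web-vm", "db"}, {"db", "old-vm"})` reports that `web-vm` should be created and `old-vm` deleted. Running the same plan again after applying it yields no changes, which is the property that makes automation safe to repeat.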
Test
A good testing strategy will help you identify issues in your application before it’s deployed, and ensure that dependent services can properly communicate with your application. A good testing strategy can also help identify performance issues and potential security vulnerabilities in both pre-production and production deployments.
Cost Optimization
Cost optimization means ensuring that the money your organization spends is used to maximum effect. Cloud services are provided under a service model and consumed on demand, so you pay only for what you use. Because of this consumption model, cloud costs are categorized as an operating expense (OpEx), and there is no asset to amortize.
Planning and Estimating Cost
Whether you are planning a new application in the cloud or migrating an existing datacenter, it’s important to estimate your costs. Estimating involves understanding your business objectives and identifying the appropriate services and their sizes. Once the requirements are identified, you can use cost estimation tools to get a more precise estimate for the required resources.
Provision with Optimization
Ensure that you select the appropriate service level for your workload and take advantage of services that let you adjust the service level. You can also reduce spend by opting for reserved instances and by considering bring-your-own-license offers. Moving from IaaS to PaaS is also worth considering, since PaaS services typically cost less and reduce operational costs: with PaaS, the cloud provider handles patching and maintaining the underlying VMs.
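To see when a reserved instance pays off, it can help to work through the arithmetic. The Python sketch below compares pay-as-you-go and reserved pricing; the hourly rate and the discount used in the usage note are made-up illustration values, not Azure prices, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


def monthly_cost_pay_as_you_go(hourly_rate: float, hours_used: float) -> float:
    """Pay-as-you-go: billed only for the hours the VM actually runs."""
    return hourly_rate * hours_used


def monthly_cost_reserved(hourly_rate: float, discount: float) -> float:
    """Reserved: a discounted rate, but billed for every hour of the month."""
    return hourly_rate * (1 - discount) * HOURS_PER_MONTH


def break_even_hours(discount: float) -> float:
    """Monthly running hours above which the reservation becomes cheaper."""
    return (1 - discount) * HOURS_PER_MONTH
```

With an assumed rate of 0.10 per hour and an assumed 40 percent discount, the reservation costs about 43.80 per month regardless of usage, so it only wins if the VM runs more than roughly 438 hours per month. A VM that runs 300 hours per month is cheaper on pay-as-you-go.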
Monitoring and Analytics to Gain Cost Insights
Monitoring your spending is equally important for saving cost. You can do this by taking advantage of cost management tools and by regularly reviewing your billing statements to understand where the money is being spent.
Conduct frequent cost reviews to check whether the spend across all your resources is appropriate for the workload. Look for anomalies in the billing statements and use alerts to spot cost-saving opportunities. After identifying the issues and recommendations, adjust your expenditure as necessary.
Maximize Efficiency of Cloud Spend
The cloud is a pay-as-you-go service, which lets you avoid unnecessary expenses. Identifying and eliminating unnecessary expenses results in a more efficient environment. Inefficient operations are another source of unnecessary cost, because they waste time and increase errors.
Waste can show up in several ways. Let’s look at a few examples:
- A virtual machine that’s always 90 percent idle
- Paying for a license included in a virtual machine when a license is already owned
- Retaining infrequently accessed data on a storage medium optimized for frequent access
- Manually repeating the build of a non-production environment
In each of these cases, more money is being spent than necessary. Each case presents an opportunity for cost reduction, so as you evaluate your costs, take the opportunity to optimize your environments.
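The first kind of waste, a mostly idle virtual machine, is simple to detect once you have utilization data. The Python sketch below treats a VM as idle when its average CPU utilization is below 10 percent, which corresponds to "90 percent idle". The function name, the threshold default, and the shape of the input data are assumptions for illustration.

```python
from statistics import mean


def find_idle_vms(
    cpu_samples: dict[str, list[float]], idle_threshold: float = 10.0
) -> list[str]:
    """Return the names of VMs whose average CPU percentage is below the threshold.

    cpu_samples maps a VM name to a list of CPU utilization readings (0-100).
    VMs with no readings are skipped rather than reported as idle.
    """
    return sorted(
        name
        for name, samples in cpu_samples.items()
        if samples and mean(samples) < idle_threshold
    )
```

VMs flagged this way are candidates for scaling down, scheduling shutdowns, or consolidation; the list is a starting point for a cost review, not a decision by itself.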
Performance Efficiency
Before moving on to performance efficiency, let us look at a related case study. For this case study, we will consider Flipkart.
Flipkart is an e-commerce company with roughly 200M in monthly traffic, and it runs flash sales or offers every three months.
When Flipkart held its first-ever sale, the huge flow of traffic to its website brought its servers down.
In this scenario, Flipkart was not able to deliver what it had promised to its customers.
The ideal solution would have been to plan ahead and scale up the backend servers to meet demand, then scale them down when the extra capacity was no longer required.
In some cases, the demands on an application change over time, so the right amount of resources can be hard to predict. If you are lucky, that change will be predictable or seasonal, but that is not typical of all scenarios. Ideally, you want to provision the right amount of resources to meet demand and then adjust the amount as demand changes.
Performance efficiency is about matching resource capacity to demand so that your architecture performs well and scales. Traditionally, cloud architectures accomplish this balance by scaling applications dynamically based on activity in the application. Demand for services changes, so it is important for your architecture to be able to adjust to demand. By designing your architecture with performance and scalability in mind, you will provide a great experience for your customers while being cost-effective.
What is Scaling Up or Down?
When you use a single instance of a service, such as a virtual machine, you might need to scale the amount of resources that are available to your instance.
- Scaling up is the process where you increase the capacity of a given instance. For example, from 1 vCPU and 3.5 GB of RAM to 2 vCPUs and 7 GB of RAM.
- Scaling down is the process where you decrease the capacity of a given instance. For example, from 2 vCPUs and 7 GB of RAM to 1 vCPU and 3.5 GB of RAM. In this way, you reduce capacity and cost.
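Scaling up or down amounts to choosing a different size for the same instance. The Python sketch below picks the smallest size that satisfies a workload's requirements. The first two sizes use the figures from the examples above; the size names and the third size are hypothetical and do not correspond to real Azure VM SKUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    name: str
    vcpus: int
    ram_gb: float


# Ordered from smallest to largest. Names and the "large" entry are illustrative.
SIZES = [
    VmSize("small", 1, 3.5),
    VmSize("medium", 2, 7.0),
    VmSize("large", 4, 14.0),
]


def pick_size(required_vcpus: int, required_ram_gb: float,
              sizes: list[VmSize] = SIZES) -> VmSize:
    """Return the smallest size that meets both the CPU and the memory requirement."""
    for size in sizes:
        if size.vcpus >= required_vcpus and size.ram_gb >= required_ram_gb:
            return size
    raise ValueError("no available size is large enough; consider scaling out")
```

If the requirement grows from 1 vCPU and 2 GB to 2 vCPUs and 4 GB, `pick_size` moves from "small" to "medium" (scaling up); when the requirement shrinks again, it returns "small" (scaling down).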
What is Scaling Out or In?
You now know that scaling up and down adjusts the amount of resources a single instance has available. Scaling out and in adjusts the total number of instances.
- Scaling out is the process of adding more instances to support the load of your solution. For example, you could increase the number of virtual machines if the level of load increased.
- Scaling in is the process of removing instances that are no longer needed to support the load of your solution. If your website front ends have low usage, you might want to lower the number of instances to save cost.
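Scaling out and in can be driven by a simple proportional rule: keep the average utilization per instance near a target. The Python sketch below is one such rule under stated assumptions; the 60 percent CPU target and the instance limits are hypothetical defaults, and real autoscale services add cooldown periods and other safeguards that are omitted here.

```python
import math


def desired_instance_count(
    current_instances: int,
    avg_cpu_percent: float,
    target_cpu_percent: float = 60.0,
    min_instances: int = 1,
    max_instances: int = 10,
) -> int:
    """Return how many instances are needed to bring average CPU near the target.

    Total load is approximated as current_instances * avg_cpu_percent; dividing
    by the target gives the instance count, rounded up and clamped to the limits.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    raw = math.ceil(current_instances * avg_cpu_percent / target_cpu_percent)
    return max(min_instances, min(max_instances, raw))
```

With 4 instances at 90 percent CPU, the rule asks for 6 instances (scale out); with 4 instances at 15 percent CPU, it asks for 1 (scale in).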
Optimize network performance
In complex architectures with many different services, minimizing the latency at each network hop can have a significant effect on overall performance.
Follow these steps to optimize network performance:
Reduce latency between Azure resources
- Select a datacenter location close to your users.
- If your users are spread across multiple locations, add instances in those locations according to business requirements.
- Add a messaging layer between services, which can benefit performance and scalability.
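The last point can be illustrated with an in-process queue standing in for a messaging service. In the Python sketch below, the front end only enqueues orders and a background worker processes them at its own pace, so a slow back end does not block the caller. The function name `run_pipeline` and the order values are hypothetical; a production system would use a managed message queue rather than `queue.Queue`.

```python
import queue
import threading


def run_pipeline(orders: list[str]) -> list[str]:
    """Enqueue orders from a 'front end' and process them in a background worker."""
    work_queue = queue.Queue()
    processed: list[str] = []

    def worker() -> None:
        while True:
            item = work_queue.get()
            if item is None:  # sentinel value: no more work will arrive
                work_queue.task_done()
                break
            processed.append(f"processed:{item}")
            work_queue.task_done()

    thread = threading.Thread(target=worker)
    thread.start()

    for order in orders:
        work_queue.put(order)  # returns immediately; the worker catches up later

    work_queue.put(None)  # tell the worker to stop once the queue is drained
    work_queue.join()     # wait until every queued item has been handled
    thread.join()
    return processed
```

Because the producer and the consumer only share the queue, either side can be scaled or restarted independently, which is the scalability benefit a messaging layer provides.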
Optimize storage performance
In organizations with large amounts of data, partitioning must be done wisely to maximize the benefits while minimizing adverse effects.
Follow these steps to optimize storage performance:
- Use caching in your architecture to help improve performance.
- You can use caching between your application servers and a database to decrease data retrieval times.
- You can also use caching between your users and your web servers, by placing static content closer to users and decreasing the time it takes to return webpages to the users.
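Caching between application servers and a database often follows a pattern in which the application checks the cache first and only queries the database on a miss. The Python sketch below shows that idea with a time-to-live (TTL). The class name `TtlCache`, the 60-second default, and the `loader` callback (a stand-in for the database query) are assumptions for illustration.

```python
import time
from typing import Any, Callable


class TtlCache:
    """Caches values returned by a slow loader for a limited time."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader    # stand-in for a database or remote call
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries: dict[str, tuple[float, Any]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]      # cache hit: no trip to the backing store
        value = self._loader(key)  # cache miss or expired entry
        self._entries[key] = (now + self._ttl, value)
        return value
```

The TTL bounds how stale a cached value can be. Choosing it is a trade-off: a longer TTL means fewer database reads but older data.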
Identify performance bottlenecks in your application
Follow these steps to identify performance bottlenecks:
- Monitor all your resources by enabling Diagnostic Settings to get a detailed report of resource consumption.
- Performance optimization will include understanding how the applications themselves are performing.
- Look across all layers of your application and identify and remediate performance bottlenecks.
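Looking across all layers is easier when each layer reports how long it takes. The Python sketch below accumulates wall-clock time per layer and reports the slowest one. The class name `LayerTimer` and the layer labels in the usage note are hypothetical; this is a simplified stand-in for what an application performance monitoring tool collects.

```python
import time
from contextlib import contextmanager


class LayerTimer:
    """Accumulates elapsed time per named layer of an application."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = {}

    @contextmanager
    def measure(self, layer: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the wrapped code raises an exception.
            elapsed = time.perf_counter() - start
            self.totals[layer] = self.totals.get(layer, 0.0) + elapsed

    def slowest(self) -> str:
        """Return the layer with the largest accumulated time."""
        if not self.totals:
            raise ValueError("no measurements recorded")
        return max(self.totals, key=self.totals.get)
```

Wrapping calls such as `with timer.measure("database"):` around each layer's work and then calling `timer.slowest()` points to where optimization effort is most likely to pay off.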
Reliability
Availability and recoverability are the factors that measure the reliability of your cloud offerings. In a perfect world, your system would be 100% reliable, but that is probably not an attainable goal. In the real world, things will go wrong. You will see faults from things such as server downtime, software failure, security breaches, user errors, and other unexpected incidents.
Build a highly available architecture:
- For availability, identify the service-level agreement (SLA) for your organization.
- Use clustering and load balancing for a highly available design.
- Clustering replaces a single VM with a set of coordinated VMs. When one VM fails or becomes unreachable, services can fail over to another one that can service the requests.
- Load balancing spreads requests across many instances of a service, detecting failed instances and preventing requests from being routed to them.
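The load-balancing behavior described above, spreading requests and skipping failed instances, can be sketched in a few lines of Python. This is a simplified round-robin model with manually reported health; the class name `RoundRobinBalancer` and the instance names are hypothetical, and a real load balancer would probe instance health itself.

```python
class RoundRobinBalancer:
    """Distributes requests across instances in turn, skipping unhealthy ones."""

    def __init__(self, instances: list[str]) -> None:
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = list(instances)
        self._healthy = set(instances)
        self._next = 0  # index of the next instance to consider

    def mark_down(self, instance: str) -> None:
        self._healthy.discard(instance)

    def mark_up(self, instance: str) -> None:
        if instance in self._instances:
            self._healthy.add(instance)

    def pick(self) -> str:
        """Return the next healthy instance, or raise if none is available."""
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

When an instance is marked down, requests continue to flow to the remaining instances, which is what keeps the service available during a single-instance failure.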
Build an architecture that can recover from failure:
For recoverability, you should perform an analysis that examines your possible data loss and major downtime scenarios. With RPO and RTO defined, you can design backup, restore, replication, and recovery capabilities into your architecture to meet these objectives.
- Recovery point objective (RPO): The maximum duration of acceptable data loss. RPO is measured in units of time, not volume. Examples are “30 minutes of data,” “four hours of data,” and so on. RPO is about limiting and recovering from data loss, not data theft.
- Recovery time objective (RTO): The maximum duration of acceptable downtime, where “downtime” is defined by your specification. For example, if the acceptable downtime duration is eight hours in the event of a disaster, then your RTO is eight hours.
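Once RPO and RTO are defined, a backup and recovery design can be checked against them with simple arithmetic. The Python sketch below makes two simplifying assumptions: with periodic backups, the worst-case data loss equals the interval between backups, and downtime equals the time to detect a failure plus the time to restore. The function names are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is everything written since the last backup."""
    return backup_interval <= rpo


def meets_rto(detection_time: timedelta, restore_time: timedelta,
              rto: timedelta) -> bool:
    """Downtime is approximated as time to detect plus time to restore."""
    return detection_time + restore_time <= rto
```

For example, hourly backups do not satisfy an RPO of 30 minutes, while a 15-minute detection time plus a 6-hour restore fits within an RTO of eight hours.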
Security
Security is a prime concern when it comes to the cloud. For example, banks store account numbers, balances, and transaction history, and e-commerce businesses store purchase history, account information, and demographic details of customers. All of this data is important and must be secured; if it is not, a breach can cause financial harm.
Defense in depth
Defense in depth (DiD) is an approach to cybersecurity in which a series of defensive mechanisms is layered to protect valuable data and information.
A multilayered approach to securing your environment strengthens its security posture. The layers are:
- Data
- Applications
- VM/compute
- Networking
- Perimeter
- Policies and access
- Physical security
Protect from common attacks:
At each layer, there are some common attacks that you’ll want to protect against. The following list isn’t all-inclusive, but it can give you an idea of how each layer can be attacked and what types of protections you might need.
- Data layer: Exposing an encryption key or using weak encryption can leave your data vulnerable if unauthorized access occurs.
- Application layer: Malicious code injection and execution are the hallmarks of application-layer attacks. Common attacks include SQL injection and cross-site scripting (XSS).
- VM/compute layer: Malware is a common method of attacking an environment, which involves executing malicious code to compromise a system. After malware is present on a system, further attacks that lead to credential exposure and lateral movement throughout the environment can occur.
- Networking layer: Unnecessary open ports to the internet are a common method of attack. These might include leaving SSH or RDP open to virtual machines. When these protocols are open, they can allow brute-force attacks against your systems as attackers attempt to gain access.
- Perimeter layer: Denial-of-service (DoS) attacks often happen at this layer. These attacks try to overwhelm network resources, forcing them to go offline or making them incapable of responding to legitimate requests.
- Policies and access layer: This layer is where authentication occurs for your application. This layer might include modern authentication protocols such as OpenID Connect, OAuth, or Kerberos-based authentication such as Active Directory. The exposure of credentials is a risk at this layer, and it’s important to limit the permissions of identities. We also want to have monitoring in place to look for possible compromised accounts, such as logins coming from unusual places.
- Physical layer: Unauthorized access to facilities through methods such as door drafting and theft of security badges can happen at this layer.
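As one concrete protection at the application layer, SQL injection is commonly mitigated with parameterized queries, where user input is passed as data and never concatenated into the SQL text. The Python sketch below uses the standard library's `sqlite3` module with an in-memory database; the `users` table, the helper names, and the sample rows are made up for this example.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, made-up users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
    )
    conn.commit()
    return conn


def find_user(conn: sqlite3.Connection, username: str) -> list[tuple]:
    """Look up a user with a parameterized query.

    The '?' placeholder makes the driver treat username strictly as a value,
    so quotes or SQL keywords in the input cannot change the query.
    """
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cursor.fetchall()
```

A classic injection string such as `' OR '1'='1` matches no rows here, because it is compared literally against the `username` column instead of being interpreted as SQL.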
We hope this helps you gain better insight into the Azure Well-Architected Framework. You can post your thoughts and queries in the comment section.