
Differences between MySQL 5.6 and 5.7


MySQL 5.7 adds a variety of new features that might help you with your development, and it changes nearly 40 server defaults compared to 5.6. Some of these changes could severely impact your server performance, while others might go unnoticed. We’re going to look at each of the changes and what they mean.

The change that can have the largest impact on your server is likely sync_binlog. sync_binlog controls how MySQL flushes the binlog to disk. The new default of 1 forces MySQL to write every transaction to disk before committing. Previously, MySQL did not force flushing of the binlog and trusted the OS to decide when to flush it. Benchmarks suggest that, with the binlog enabled and otherwise default settings, MySQL 5.7 is 37-45% slower than MySQL 5.6.
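If your workload can tolerate losing the last few transactions from the binlog on a crash, you can restore the 5.6 behaviour at runtime. A minimal sketch, assuming a local server and sufficient privileges:

# Check the current value (defaults to 1 on 5.7)
mysql -e "SHOW GLOBAL VARIABLES LIKE 'sync_binlog';"

# Let the OS decide when to flush the binlog, as in 5.6 (faster, but less durable)
mysql -e "SET GLOBAL sync_binlog = 0;"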

There have been a couple of variable changes surrounding the binlog. MySQL 5.7 changed binlog_error_action so that the server aborts if there is an error while writing to the binlog. These incidents are rare, but they have a big impact on your application and replication when they occur, because the server will not perform any further transactions until the problem is corrected.

The default binlog format was changed to ROW, replacing the previously used STATEMENT format. Statement-based logging writes less data to the logs; however, many statements cannot be replicated correctly with it, including “update … order by rand()”. These non-deterministic statements can produce different result sets on the master and the slave. ROW format writes more data to the binlog, but the information is more accurate and ensures correct replication.

Variables 5.6.29 5.7.11
sync_binlog 0 1

The performance schema variables stand out as unusual, as many have a default of -1. MySQL uses this notation to call out variables that are automatically adjusted. The only performance schema variable change that doesn’t adjust itself is performance_schema_max_file_classes. This is the number of file instruments used for the performance schema. It’s unlikely you will ever need to alter it.

Variables 5.6.29 5.7.11
performance_schema_accounts_size 100 -1
performance_schema_hosts_size 100 -1
performance_schema_max_cond_instances 3504 -1
performance_schema_max_file_classes 50 80
performance_schema_max_file_instances 7693 -1
performance_schema_max_mutex_instances 15906 -1
performance_schema_max_rwlock_instances 9102 -1
performance_schema_max_socket_instances 322 -1
performance_schema_max_statement_classes 168 -1
performance_schema_max_table_handles 4000 -1
performance_schema_max_table_instances 12500 -1
performance_schema_max_thread_instances 402 -1
performance_schema_setup_actors_size 100 -1
performance_schema_setup_objects_size 100 -1
performance_schema_users_size 100 -1

The optimizer_switch and sql_mode variables each have a variety of flags that can be enabled, and each flag slightly changes the server’s behavior. MySQL 5.7 enables additional flags for both variables, increasing their strictness and safety. These additions make the optimizer more efficient at determining how to correctly interpret your queries.

Three flags have been added to the optimizer_switch defaults in MySQL 5.7, with the intent of increasing the optimizer’s efficiency: duplicateweedout=on, condition_fanout_filter=on, and derived_merge=on. duplicateweedout is part of the optimizer’s semi-join materialization strategy, condition_fanout_filter controls the use of condition filtering, and derived_merge controls the merging of derived tables and views into the outer query block.

The additions to the SQL mode do not affect performance directly, but they will improve the way you write queries (which can increase performance). One notable change is that every field in a select … group by statement must now either be aggregated with a function such as SUM or appear in the group by clause. MySQL will no longer assume how fields should be grouped and will raise an error if a field is missing. strict_trans_tables has a different effect depending on whether it is used with a transactional table.

With transactional tables, statements are rolled back if there is an invalid or missing value in a data-change statement. For tables that do not use a transactional engine, MySQL’s behavior depends on the row in which the invalid data occurs. If it is the first row, the behavior matches that of a transactional engine. If not, the invalid value is converted to the closest valid value, or to the default value for the column; a warning is generated, but the data is still inserted.
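As an illustration of the only_full_group_by change, here is a hypothetical query pair (the shop.orders table and its columns are made up):

# Rejected on 5.7 defaults: city is neither aggregated nor part of the GROUP BY
mysql -e "SELECT customer_id, city, COUNT(*) FROM shop.orders GROUP BY customer_id;"

# Accepted: every selected column is either aggregated or listed in the GROUP BY
mysql -e "SELECT customer_id, city, COUNT(*) FROM shop.orders GROUP BY customer_id, city;"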

optimizer_switch
5.6.29: index_merge=on, index_merge_union=on, index_merge_sort_union=on, index_merge_intersection=on, engine_condition_pushdown=on, index_condition_pushdown=on, mrr=on, mrr_cost_based=on, block_nested_loop=on, batched_key_access=off, materialization=on, semijoin=on, loosescan=on, firstmatch=on, subquery_materialization_cost_based=on, use_index_extensions=on
5.7.11: index_merge=on, index_merge_union=on, index_merge_sort_union=on, index_merge_intersection=on, engine_condition_pushdown=on, index_condition_pushdown=on, mrr=on, mrr_cost_based=on, block_nested_loop=on, batched_key_access=off, materialization=on, semijoin=on, loosescan=on, firstmatch=on, duplicateweedout=on, subquery_materialization_cost_based=on, use_index_extensions=on, condition_fanout_filter=on, derived_merge=on

sql_mode
5.6.29: no_engine_substitution
5.7.11: only_full_group_by, strict_trans_tables, no_zero_in_date, no_zero_date, error_for_division_by_zero, no_auto_create_user, no_engine_substitution

MySQL has begun to focus on replication using GTIDs instead of the traditional binlog position. When MySQL is started, or restarted, it must generate a list of the previously used GTIDs. If binlog_gtid_simple_recovery is OFF (FALSE), the server starts with the newest binlog and iterates backwards through the binlog files searching for a previous_gtids_log_event. With it set to ON (TRUE), the server only reviews the newest and oldest binlog files to compute the used GTIDs. binlog_gtid_simple_recovery makes it much faster to identify the binlogs, especially if there are a large number of binary logs without GTID events. However, in specific cases it could cause gtid_executed and gtid_purged to be populated incorrectly. This should only happen when the newest binary log was generated by MySQL 5.7.5 or older, or if a SET GTID_PURGED statement was run on a MySQL version earlier than 5.7.7.

Another replication-related variable updated in 5.7 is slave_net_timeout, which was lowered to 60 seconds. Previously the replication thread would not consider its connection to the master broken until the problem had existed for at least an hour. This change informs you much sooner if there is a connectivity problem, and ensures replication does not fall significantly behind before you learn of an issue.

Variables 5.6.29 5.7.11
binlog_error_action ignore_error abort_server
binlog_format statement row
binlog_gtid_simple_recovery off on
slave_net_timeout 3600 60

InnoDB buffer pool changes affect how long starting and stopping the server takes. innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup are used together to save you from having to “warm up” the server. As the names suggest, they cause a buffer pool dump at shutdown and a load at startup. Even though you might have a buffer pool of hundreds of gigabytes, you will not need to reserve the same amount of space on disk, because the data written is much smaller: only the information necessary to locate the actual data, the tablespace and page IDs, is written to disk.

Variables 5.6.29 5.7.11
innodb_buffer_pool_dump_at_shutdown off on
innodb_buffer_pool_load_at_startup off on

MySQL 5.7 turns some of the options implemented in InnoDB during 5.6 and earlier into defaults. InnoDB’s checksum algorithm was updated from innodb to crc32, allowing you to benefit from the hardware acceleration available in recent Intel CPUs.

The Barracuda file format has been available since 5.5, but had many improvements in 5.6. It is now the default in 5.7. The Barracuda format allows you to use the compressed and dynamic row formats.

innodb_large_prefix now defaults to ON and, combined with the Barracuda file format, allows index key prefixes of up to 3072 bytes. This lets larger text fields benefit from an index. If it is set to OFF, or the row format is neither DYNAMIC nor COMPRESSED, any index prefix larger than 767 bytes is silently truncated. MySQL 5.7.6 also introduced larger InnoDB page sizes (32k and 64k).
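A quick sketch of the effect (the test schema, table, and latin1 character set are assumptions):

# A 1000-byte prefix index exceeds the old 767-byte limit but fits under 3072 with DYNAMIC rows
mysql -e "CREATE TABLE test.articles (body TEXT, KEY body_prefix (body(1000))) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;"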

MySQL 5.7 also increased the innodb_log_buffer_size value. InnoDB uses the log buffer to hold transaction data before writing it to the redo log on disk. The increased size allows the log to be flushed to disk less often, reducing IO, and allows larger transactions to fit in the buffer without having to write to disk before committing.

MySQL moved InnoDB’s purge operations to a separate background thread back in 5.5 to reduce thread contention. MySQL 5.7 increases the default to four purge threads, and the value can be set anywhere from 1 to 32.

MySQL 5.7 now enables innodb_strict_mode by default, turning some of the warnings into errors. Syntax errors in create table, alter table, create index, and optimize table statements now generate errors and force the user to correct them before the statement runs. It also enables a record-size check, ensuring that insert or update statements will not fail because the record is too large for the selected page size.

Variables 5.6.29 5.7.11
innodb_checksum_algorithm innodb crc32
innodb_file_format Antelope Barracuda
innodb_file_format_max Antelope Barracuda
innodb_large_prefix OFF ON
innodb_log_buffer_size 8388608 16777216
innodb_purge_threads 1 4
innodb_strict_mode OFF ON

MySQL has increased the number of times the optimizer dives into the index when evaluating equality ranges. If the optimizer would need to dive into the index more than eq_range_index_dive_limit times (defaulted to 200 in MySQL 5.7), it falls back to the existing index statistics. You can adjust this limit from 0, eliminating index dives, up to 4294967295. This can have a significant impact on query performance, since index statistics are based on the cardinality of a random sample. They can cause the optimizer to estimate a much larger set of rows to review than it would with index dives, changing the method the optimizer chooses to execute the query.
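If your queries use very long IN() lists, you may want to check or raise this limit; a minimal sketch, assuming sufficient privileges:

# Inspect the current limit (200 by default on 5.7)
mysql -e "SHOW GLOBAL VARIABLES LIKE 'eq_range_index_dive_limit';"

# Raise it so longer IN() lists still get index dives instead of index statistics
mysql -e "SET GLOBAL eq_range_index_dive_limit = 500;"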

MySQL 5.7 deprecated log_warnings in favor of log_error_verbosity. By default the new variable is set to 3, which logs errors, warnings, and notes to the error log. You can lower it to 1 (log errors only) or 2 (log errors and warnings). When consulting the error log, verbosity is often a good thing; however, it increases the IO and disk space needed for the error log.

table_open_cache_instances changed starting with MySQL 5.7.8: the number of instances was increased from 1 to 16. This has two main benefits: it reduces contention between DML statements and it segments access to the cache. With more instances, a DML statement needs to lock only a single instance rather than the entire cache. Note that DDL statements still require a lock on the entire cache. On systems where a large number of sessions access the cache, this increases performance.

Variables 5.6.29 5.7.11
eq_range_index_dive_limit 10 200
log_warnings 1 2
table_open_cache_instances 1 16

References:

1) MySQL 5.7 the complete features list

2) MySQL 5.7 documentation

Tunneling RDP and VNC over SSH


How many times have you been in a situation where you needed to perform remote administration tasks on a server (Windows or Linux) and, for various reasons, did not have direct connectivity between the client machine and the targeted server (RDP or VNC blocked by a firewall, the target server behind a NAT, or other network restrictions)?

Architecture

You can tunnel RDP/VNC over SSH using different tools, depending on whether the client machine runs Windows or Linux. SSH creates a secure connection between a local computer and a remote machine through which services can be relayed. Because the connection is encrypted, SSH tunneling is useful for transmitting information that uses an unencrypted protocol, such as VNC or RDP.

How to tunnel RDP or VNC on a Windows client

The most widely used SSH client on a Windows box is PuTTY. To tunnel Remote Desktop Protocol or VNC over SSH, all you need is an account on the premises, for example on a firewall or Linux server with SSH access, and PuTTY on your Windows desktop.

Once you are connected to your remote network with PuTTY, you need to reconfigure the connection to support SSH-tunneling. In the PuTTY Reconfiguration screen, go to Connection → SSH → Tunnels. This is where we can set up an SSH tunnel for Remote Desktop.

Under Source port, add your local IP address and port, and under Destination, the IP address and port of the target server. In the attached picture you can see that for tunneling RDP I used 127.0.0.2:3388 as the source and 192.168.93.130:3389 as the destination, and for tunneling VNC I used 127.0.0.3:5901 as the source and 192.168.93.131:5901 as the destination.

Putty Reconfig

After clicking Apply, the SSH-tunnel for remote desktop (RDP or VNC) is active, and you can connect as shown below:

RDP conn

VNC conn
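If you prefer the command line on Windows, PuTTY’s companion tool plink accepts the same port-forwarding options as OpenSSH. A rough sketch using the addresses above (the gateway address and user name are placeholders):

rem Forward 127.0.0.2:3388 to the RDP port of 192.168.93.130 through the SSH gateway
plink.exe -ssh -N -L 127.0.0.2:3388:192.168.93.130:3389 youruser@gateway.example.com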

How to tunnel RDP or VNC on a Linux client

On a Linux workstation you can use the following SSH port forwardings to tunnel RDP or VNC, depending on the requirement. The requirement of having an account on the premises still applies.

SSH port forwarding / tunnel set-up for RDP and VNC:

# RDP: forward local port 3389 to the Windows server's RDP port through the SSH gateway
ssh -L 3389:[Windows Server RDP address]:3389 [address ssh server] -l [ssh username] -N
ssh -L 3389:192.168.93.130:3389 192.168.72.136 -l root -N

# VNC: forward local port 5901 to the Linux server's VNC port through the SSH gateway
ssh -L 5901:[Linux Server VNC address]:5901 [address ssh server] -l [ssh username] -N
ssh -L 5901:192.168.93.131:5901 192.168.72.136 -l root -N

Now you can connect your remote client (krdc in my case) to 127.0.0.1:3389 or 127.0.0.1:5901 as if it were the remote server.

Linuxssh port
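For reference, the same tunneled connections can also be opened from the command line; the client names and options below are assumptions based on common FreeRDP and TigerVNC builds:

# RDP through the local end of the tunnel (FreeRDP client)
xfreerdp /v:127.0.0.1:3389 /u:Administrator

# VNC through the local end of the tunnel (display :1 maps to port 5901)
vncviewer 127.0.0.1:1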

Disaster Recovery Scenarios with AWS


 

Enterprises have different levels of tolerance for business interruptions and therefore a wide variety of disaster recovery preferences, ranging from solutions that accept a few hours of downtime to seamless failover. A smart DR plan offers several strategies that meet the recovery needs of most enterprises using combinations of AWS services.

 

Method         RTO                            Cost
Cold           Low RTO (>= 1 business day)    Lowest cost
Pilot-Light    Moderate RTO (< 4 hours)       Moderate cost
Warm Standby   Aggressive RTO (< 1 hour)      High cost
Multi-Site     No interruptions               Highest cost

AWS enables you to cost-effectively operate each of these DR strategies. It’s important to note that these are just examples of possible approaches, and variations and combinations of these are possible. If your application is already running on AWS, then multiple regions can be employed and the same DR strategies will still apply.

Enterprise-level disaster recovery is primarily measured in terms of Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is a measure of the maximum amount of time within which operations are expected to be resumed after a disaster. RPO is a measure, in terms of time, of the maximum amount of data that can be lost as a result of a disaster.

Cold / Backup and Restore

The Backup and Restore scenario is an entry-level form of disaster recovery on AWS. This approach is the most suitable one in the event that you don’t have a DR plan. In most traditional environments, data is backed up to non-volatile media such as tape and sent off-site regularly. If you use this method, it can take a long time to restore your system in the event of a disruption or disaster.

Amazon S3 is an ideal destination for backup data that might be needed quickly to perform a restore. Transferring data to and from Amazon S3 is typically done through the network, and it is therefore accessible from any location. There are many commercial and open-source backup solutions that integrate with Amazon S3. You can use AWS Import/Export to transfer very large data sets by shipping storage devices directly to AWS. For longer-term data storage where retrieval times of several hours are adequate, there is Amazon Glacier, which has the same durability model as Amazon S3. Amazon Glacier is a low-cost alternative starting from $0.01/GB per month. Amazon Glacier and Amazon S3 can be used in conjunction to produce a tiered backup solution.

AWS Storage Gateway enables snapshots of your on-premises data volumes to be transparently copied into Amazon S3 for backup. You can subsequently create local volumes or Amazon EBS volumes from these snapshots. Storage-cached volumes allow you to store your primary data in Amazon S3, but keep your frequently accessed data local for low-latency access. As with AWS Storage Gateway, you can snapshot the data volumes to give highly durable backup. In the event of DR, you can restore the cache volumes either to a second site running a storage cache gateway or to Amazon EC2.

For systems already running on AWS, you can also back up into Amazon S3. Snapshots of Amazon EBS volumes, Amazon RDS databases, and Amazon Redshift data warehouses can be stored in Amazon S3. Alternatively, you can copy files directly into Amazon S3, or you can choose to create backup files and copy those to Amazon S3. There are many backup solutions that store data directly in Amazon S3, and these can be used from Amazon EC2 systems as well.
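As a small illustration of the backup half of this scenario (the bucket name, path, and volume ID are placeholders), a nightly job might push dumps to S3 and snapshot EBS volumes that already run in AWS:

# Copy a database dump to S3 for off-site retention
aws s3 cp /backups/db-dump.sql.gz s3://my-dr-backups/mysql/

# Snapshot an EBS volume of a system already running on AWS
aws ec2 create-snapshot --volume-id vol-0abc1234 --description "nightly DR snapshot"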

The following figure shows data backup options to Amazon S3, from either on-site infrastructure or from AWS.

Cold method

Of course, the backup of your data is only half of the story. If disaster strikes, you’ll need to recover your data quickly and reliably. You should ensure that your systems are configured to retain and secure your data, and you should test your data recovery processes. 

Key steps for backup and restore method:

  • Select an appropriate tool or method to back up your data into AWS.
  • Ensure that you have an appropriate retention policy for this data.
  • Ensure that appropriate security measures are in place for this data, including encryption and access policies.
  • Regularly test the recovery of this data and the restoration of your system.

The Backup and Restore plan is suitable for lower level business-critical applications. This is also an extremely cost-effective scenario and one that is most often used when we need backup storage. If we use a compression and de-duplication tool, we can further decrease our expenses here. For this scenario, RTO will be as long as it takes to bring up infrastructure and restore the system from backups. RPO will be the time since the last backup.

Pilot-Light

The term “Pilot Light” is often used to describe a DR scenario where a minimal version of an environment is always running in the cloud. This scenario is similar to a Backup and Restore scenario. For example, with AWS you can maintain a Pilot Light by configuring and running the most critical core elements of your system in AWS. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core. Infrastructure elements for the pilot light itself typically include your database servers, which would replicate data to Amazon EC2 or Amazon RDS. Depending on the system, there might be other critical data outside of the database that needs to be replicated to AWS. This is the critical core of the system (the pilot light) around which all other infrastructure pieces in AWS can quickly be provisioned to restore the complete system.

To provision the remainder of the infrastructure to restore business-critical services, you would typically have some preconfigured servers bundled as Amazon Machine Images (AMIs), which are ready to be started up at a moment’s notice. When starting recovery, instances from these AMIs come up quickly with their pre-defined role (for example, Web or App Server) within the deployment around the pilot light. From a networking point of view, you have two main options for provisioning:

  • Use Elastic IP addresses, which can be pre-allocated and identified in the preparation phase for DR, and associate them with your instances. Note that for MAC address-based software licensing, you can use elastic network interfaces (ENIs), which have a MAC address that can also be pre-allocated to provision licenses against. You can associate these with your instances, just as you would with Elastic IP addresses.
  • Use Elastic Load Balancing (ELB) to distribute traffic to multiple instances. You would then update your DNS records to point at your Amazon EC2 instance or point to your load balancer using a CNAME. We recommend this option for traditional web-based applications.

For less critical systems, you can ensure that you have any installation packages and configuration information available in AWS, for example, in the form of an Amazon EBS snapshot. This will speed up the application server setup, because you can quickly create multiple volumes in multiple Availability Zones to attach to Amazon EC2 instances. You can then install and configure accordingly, for example, by using the backup-and-restore method. The pilot light method gives you a quicker recovery time than the backup-and-restore method because the core pieces of the system are already running and are continually kept up to date. AWS enables you to automate the provisioning and configuration of the infrastructure resources, which can be a significant benefit to save time and help protect against human errors. However, you will still need to perform some installation and configuration tasks to recover the applications fully.
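As a rough sketch of provisioning around the pilot light (all IDs below are placeholders), a recovery script might launch an application server from a prepared AMI and attach a pre-allocated Elastic IP:

# Launch an application server from the pre-built AMI
aws ec2 run-instances --image-id ami-0abc1234 --instance-type m4.large --count 1 --subnet-id subnet-0abc1234

# Associate a pre-allocated Elastic IP with the new instance
aws ec2 associate-address --instance-id i-0abc1234 --allocation-id eipalloc-0abc1234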

Preparation phase

The following figure shows the preparation phase, in which you need to have your regularly changing data replicated to the pilot light, the small core around which the full environment will be started in the recovery phase. Your less frequently updated data, such as operating systems and applications, can be periodically updated and stored as AMIs.

Light method1

Key steps for preparation:

  1. Set up Amazon EC2 instances to replicate or mirror data.
  2. Ensure that you have all supporting custom software packages available in AWS.
  3. Create and maintain AMIs of key servers where fast recovery is required.
  4. Regularly run these servers, test them, and apply any software updates and configuration changes.
  5. Consider automating the provisioning of AWS resources.

Recovery phase

To recover the remainder of the environment around the pilot light, you can start your systems from the AMIs within minutes on the appropriate instance types. For your dynamic data servers, you can resize them to handle production volumes as needed or add capacity accordingly. Horizontal scaling is often the most cost-effective and scalable approach to add capacity to a system. For example, you can add more web servers at peak times. However, you can also choose larger Amazon EC2 instance types, and thus scale vertically for more resource-intensive applications. From a networking perspective, any required DNS updates can be done in parallel.

After recovery, you should ensure that redundancy is restored as quickly as possible. A failure of your DR environment shortly after your production environment fails is unlikely, but you should be aware of this risk. Continue to take regular backups of your system, and consider additional redundancy at the data layer. The following figure shows the recovery phase of the pilot light scenario.

Light method2

Key steps for recovery:

  1. Start your application Amazon EC2 instances from your custom AMIs.
  2. Resize existing database/data store instances to process the increased traffic.
  3. Add additional database/data store instances to give the DR site resilience in the data tier; if you are using Amazon RDS, turn on Multi-AZ to improve resilience.
  4. Change DNS to point at the Amazon EC2 servers.
  5. Install and configure any non-AMI based systems, ideally in an automated way.

Warm Standby

A Warm Standby scenario is an expansion of the Pilot Light scenario in which some services are always up and running. Disaster recovery in a warm configuration gives customers a near-zero-downtime solution with close to a 100% uptime SLA. When building a DR plan, we need to identify the crucial parts of our on-premise infrastructure and then duplicate them inside AWS. In most cases, this means web and app servers running on a minimum-sized fleet.

By identifying your business-critical systems, you can fully duplicate these systems on AWS and have them always on. These servers can run on a minimum-sized fleet of Amazon EC2 instances of the smallest sizes possible. This solution is not scaled to take a full production load, but it is fully functional, and it can be used for non-production work such as testing, quality assurance, and internal use. Once a disaster occurs, the infrastructure located on AWS takes over the traffic, scaling up and converting into a fully functional production environment with minimal RPO and RTO. In AWS, this can be done by adding more instances to the load balancer and by resizing the small-capacity servers to run on larger Amazon EC2 instance types. As stated in the preceding section, horizontal scaling is preferred over vertical scaling.
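A minimal sketch of the vertical-scaling step (the instance ID and types are placeholders):

# Stop the standby instance, move it to a larger instance type, and start it again
aws ec2 stop-instances --instance-ids i-0abc1234
aws ec2 modify-instance-attribute --instance-id i-0abc1234 --instance-type "{\"Value\": \"m4.xlarge\"}"
aws ec2 start-instances --instance-ids i-0abc1234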

Preparation phase

The following figure shows the preparation phase for a warm standby solution, in which an on-site solution and an AWS solution run side-by-side.

Warm Standby method1

Key steps for preparation:

  1. Set up Amazon EC2 instances to replicate or mirror data.
  2. Create and maintain AMIs.
  3. Run your application using a minimal footprint of Amazon EC2 instances or AWS infrastructure.
  4. Patch and update software and configuration files in line with your live environment.

Recovery phase

In the case of failure of the production system, the standby environment will be scaled up for the production load, and DNS records will be changed to route all traffic to AWS.

Warm Standby method2

Key steps for recovery:

  1. Increase the size of the Amazon EC2 fleets in service with the load balancer (horizontal scaling).
  2. Start applications on larger Amazon EC2 instance types as needed (vertical scaling).
  3. Either manually change the DNS records, or use Amazon Route 53 automated health checks so that all traffic is routed to the AWS environment.
  4. Consider using Auto Scaling to right-size the fleet or accommodate the increased load.
  5. Add resilience or scale up your database.

Multi-Site

The Multi-Site scenario is a solution for an infrastructure that runs completely on AWS as well as in an “on-premise” data center. The data replication method that you employ will be determined by the recovery point that you choose. In addition to recovery point options, there are various replication methods, such as synchronous and asynchronous methods.

You can use a DNS service that supports weighted routing, such as Amazon Route 53, to route production traffic to different sites that deliver the same application or service. A proportion of traffic will go to your infrastructure in AWS, and the remainder will go to your on-site infrastructure. In an on-site disaster situation, you can adjust the DNS weighting and send all traffic to the AWS servers. The capacity of the AWS service can be rapidly increased to handle the full production load. You can use Amazon EC2 Auto Scaling to automate this process. You might need some application logic to detect the failure of the primary database services and cut over to the parallel database services running in AWS. The cost of this scenario is determined by how much production traffic is handled by AWS during normal operation. In the recovery phase, you pay only for what you use for the duration that the DR environment is required at full scale. You can further reduce cost by purchasing Amazon EC2 Reserved Instances for your “always on” AWS servers.
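For illustration, shifting the DNS weighting toward AWS with the Route 53 API might look like this (the hosted zone ID, record name, and ELB address are placeholders):

# Send all traffic for www.example.com to the AWS site by raising its weight
aws route53 change-resource-record-sets --hosted-zone-id Z1EXAMPLE --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "www.example.com",
      "Type": "CNAME",
      "SetIdentifier": "aws-site",
      "Weight": 100,
      "TTL": 60,
      "ResourceRecords": [{"Value": "dr-elb-1234.us-east-1.elb.amazonaws.com"}]
    }
  }]
}'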

Preparation phase

The following figure shows how you can use the weighted routing policy of Amazon Route 53 DNS to route a portion of your traffic to the AWS site. The application on AWS might access data sources in the on-site production system. Data is replicated or mirrored to the AWS infrastructure.

Multi site method1

Key steps for preparation:

  1. Set up your AWS environment to duplicate your production environment.
  2. Set up DNS weighting, or similar traffic routing technology, to distribute incoming requests to both sites.
  3. Configure automated failover to re-route traffic away from the affected site.

Recovery phase

The following figure shows the change in traffic routing in the event of an on-site disaster. Traffic is cut over to the AWS infrastructure by updating DNS, and all traffic and supporting data queries are supported by the AWS infrastructure.

Multi site method2

Key steps for recovery:

  1. Either manually or by using DNS failover, change the DNS weighting so that all requests are sent to the AWS site.
  2. Have application logic for failover to use the local AWS database servers for all queries.
  3. Consider using Auto Scaling to automatically right-size the AWS fleet. You can further increase the availability of your multi-site solution by designing Multi-AZ architectures.

AWS Production to an AWS DR Solution Using Multiple AWS Regions

Applications deployed on AWS have multi-site capability by means of multiple Availability Zones. Availability Zones are distinct locations that are engineered to be insulated from each other. They provide inexpensive, low-latency network connectivity within the same region. Some applications might have an additional requirement to deploy their components using multiple regions; this can be a business or regulatory requirement. Any of the preceding scenarios in this article can be deployed using separate AWS regions.
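As a small example of preparing a second region (the AMI ID, regions, and name are placeholders), AMIs can simply be copied across regions:

# Copy an AMI from the production region to the DR region
aws ec2 copy-image --source-region us-east-1 --source-image-id ami-0abc1234 --region eu-west-1 --name "dr-copy-app-server"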

The advantages for both production and DR scenarios include the following:

  • You don’t need to negotiate contracts with another provider in another region.
  • You can use the same underlying AWS technologies across regions.
  • You can use the same tools or APIs.


AWS disaster recovery plan


First of all, what is Disaster Recovery?

Disaster recovery (DR) is about preparing for and recovering from a disaster. Any event that has a negative impact on a company’s business continuity or finances could be termed a disaster. This includes hardware or software failure, a network outage, a power outage, physical damage to a building like fire or flooding, human error, or some other significant event.

In any case, it is crucial to have a tested disaster recovery plan ready. A disaster recovery plan will ensure that our application stays online no matter the circumstances. Ideally, it ensures that users will experience zero, or at worst, minimal issues while using your application.

Let’s take a closer look at some of the important terminology associated with disaster recovery:

Business Continuity. All of our applications require business continuity, which ensures that an organization’s critical business functions continue to operate or recover quickly despite serious incidents.

Recovery time objective (RTO) — The time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the DR process should restore the business process to the acceptable service level by 8:00 PM.

Recovery point objective (RPO) — The acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span only one hour, between 11:00 AM and 12:00 PM (noon).

Traditional Disaster Recovery plan (on-premise)

A traditional on-premise Disaster Recovery plan often includes a fully duplicated infrastructure that is physically separate from the infrastructure that contains our production. In this case, an additional financial investment is required to cover expenses related to hardware and for maintenance and testing. When it comes to on-premise data centers, physical access to the infrastructure is often overlooked.

These are the typical requirements for an on-premise disaster recovery data center:

  • Facilities to house the infrastructure, including power and cooling.
  • Security to ensure the physical protection of assets.
  • Suitable capacity to scale the environment.
  • Support for repairing, replacing, and refreshing the infrastructure.
  • Contractual agreements with an internet service provider (ISP) to provide internet connectivity that can sustain bandwidth utilization for the environment under a full load.
  • Network infrastructure such as firewalls, routers, switches, and load balancers.
  • Enough server capacity to run all mission-critical services. This includes storage appliances for the supporting data, and servers to run applications and backend services such as user authentication, Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), monitoring, and alerting.

Obviously, this kind of disaster recovery plan requires large investments in building disaster recovery sites or data centers (CAPEX). In addition, storage, backup, archival and retrieval tools, and processes (OPEX) are also expensive. And, all of these processes, especially installing new equipment, take time.

An on-premise disaster recovery plan can be challenging to document, test, and verify, especially if you have multiple clients on a single infrastructure. In this scenario, all clients on this infrastructure will experience problems with performance even if only one client’s data is corrupted.

Disaster Recovery plan on AWS

There are many advantages to implementing a disaster recovery plan on AWS. Financially, we only need to invest a small amount in advance (CAPEX), and we won’t have to worry about the physical expenses for resources (for example, hardware delivery) that we would have in an “on-premise” data center.

AWS enables high flexibility, as we don’t need to perform a failover of the entire site in case only one part of our application isn’t working properly. Scaling is fast and easy. Most importantly, AWS allows a “pay as you use” (OPEX) model, so we don’t have to spend a lot in advance. Also, AWS services allow us to fully automate our disaster recovery plan. This results in much easier testing, maintenance, and documentation of the DR plan itself.

This table shows the AWS service equivalents to an infrastructure inside an on-premise data center.

On premise data center infrastructure AWS Infrastructure
DNS Route 53
Load Balancers ELB/appliance
Web/app servers EC2/Auto Scaling
Database servers RDS
AD/authentication AD failover nodes
Data centers Availability Zones
Disaster recovery Multi-region

Now that you have seen the differences between DR on-premise and DR on AWS, let’s point out some tips that you should take into consideration when you develop your DR plan:

1. Backups Do Not Equal DR

Disaster Recovery is not only doing backups, but rather, it is the process, policies, and procedures that you put in place to prepare for recovery or business continuity in the event of a crisis. In other words, simply backing up your data won’t be of much help unless you have a process in place to quickly retrieve and put it to use.

2. Prioritize: Downtime Costs Vs. Backup/Recovery Costs

As with any successful plan, an AWS disaster recovery strategy must be tailored to meet your company’s specific needs. As such, choices will have to be made between the amount of money spent on backup and restoration of data versus the amount of money that might be lost during downtime. If your company can withstand a lengthy outage without hemorrhaging cash, a slower, less expensive backup and recovery option might make sense. But if you run a business in which you cannot afford even the slightest amount of downtime, then more expensive methods such as an AWS-based duplicate production environment might be required.

3.  Determine Your RTO/RPO

A company typically decides on an acceptable RTO and RPO based on the financial impact to the business when systems are unavailable. The company determines financial impact by considering many factors, such as the loss of business and damage to its reputation due to downtime and the lack of systems availability. IT organizations then plan solutions to provide cost-effective system recovery based on the RPO within the timeline and the service level established by the RTO.

4. Choose The Right Backup Strategy

As mentioned above, regular backups are only one part of an effective AWS disaster recovery plan. Nonetheless, they are an extremely important component. That’s why choosing the right backup recovery plan for your business is vital. Even though you’ve already settled on a cloud-based solution, you will have to choose between various backup options, such as using Amazon Machine Images (AMIs) or EBS snapshots.

5. Identify Mission-Critical Applications And Know Your AWS DR Options

After determining your company’s RTO, RPO, and preferred backup strategy, it’s time to choose which type of AWS disaster recovery method is right for you. And depending on which option you ultimately choose, it may also be necessary to identify and prioritize mission-critical applications. Some of the most common methods include:

  • Backup and Restore: a simple, cost-effective method that utilizes services such as Amazon S3 to back up and restore data.
  • Pilot Light: This method keeps critical applications and data at the ready so that they can be quickly fired up should disaster strike.
  • Warm Standby: This method keeps a duplicate version of your business’ core elements running at all times, resulting in a nearly seamless transition with very little downtime.
  • Multi-Site Solution: Also known as a Hot Standby, this configuration leaves almost nothing to chance by fully replicating your data/applications between two or more active locations and splitting traffic/usage between them. In the event of a disaster, traffic is simply routed to the unaffected location, resulting in no downtime.

6. Implement Cross-Region Backups

As with traditional methods of backup and recovery, geographic diversification of your data is essential for your AWS disaster recovery plan. If a natural disaster or man-made catastrophe brings down your primary production environment, having a backup stored in the same building, or even the same region, makes little sense. Luckily, the global reach of AWS makes geographic diversification very easy to implement. If your primary AWS services are knocked offline, you can rest assured that your DR plan can be implemented using backup data that’s been safely stored a world away (literally).
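As a small sketch of cross-region diversification (the snapshot ID and regions are placeholders), EBS snapshots can be copied to a geographically distant region:

# Copy an EBS snapshot from the primary region to a distant DR region
aws ec2 copy-snapshot --source-region us-east-1 --source-snapshot-id snap-0abc1234 --region ap-southeast-2 --description "cross-region DR copy"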

7. Test And Retest Your Plan

Sometimes even the best-laid plans go awfully wrong. Even the most detail-oriented AWS disaster recovery plan has the potential to fail when put into actual practice. That’s why it’s important to constantly test and retest your plan for flaws. And thanks to AWS’ ability to create a duplicate environment, you can test your plan using real-world scenarios without jeopardizing your actual production environment.


Migrate Your Own VMs into AWS Cloud


Recently we received a task to move our virtual infrastructure into the AWS cloud. During this process I faced a couple of challenges, and I thought I would share them with you.
First, let’s start with the prerequisites and limitations:

  • Operating systems that can be imported into EC2 (Windows): Windows Server 2012 R2 (Standard), Windows Server 2012 (Standard, Datacenter), Windows Server 2008 R2 (Standard, Datacenter, Enterprise), Windows Server 2008 (Standard, Datacenter, Enterprise), Windows Server 2003 R2 (Standard, Datacenter, Enterprise), Windows Server 2003 (Standard, Datacenter, Enterprise) with Service Pack 1 (SP1) or later.
  • Operating systems that can be imported into EC2 (Linux/Unix, 64-bit): Red Hat Enterprise Linux (RHEL) 5.1-5.10, 6.1-6.5, CentOS 5.1-5.10, 6.1-6.5, Ubuntu 12.04, 12.10, 13.04, 13.10, Debian 6.0.0-6.0.8, 7.0.0-7.2.0 (RHEL 6.0 is unsupported because it lacks the drivers required to run on Amazon EC2).
  • Supported image formats: RAW, VHD, VMDK (you can only import VMDK files into Amazon EC2 that were created through the OVF export process in VMware).
  • Define an S3 bucket (in a region close to you, to speed up uploads). This will be used to upload the images for conversion to an AMI.
  • Define roles and policies in AWS. In particular:
  • The vmimport service role and a policy attached to it, precisely as explained in this AWS doc (the CLI commands to create them are sketched after this list).
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"",
         "Effect":"Allow",
         "Principal":{
            "Service":"vmie.amazonaws.com"
         },
         "Action":"sts:AssumeRole",
         "Condition":{
            "StringEquals":{
               "sts:ExternalId":"vmimport"
            }
         }
      }
   ]
}
  • If you’re logged on as an AWS Identity and Access Management (IAM) user, you’ll need the following permissions in your IAM policy to use VM Import/Export:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:DeleteBucket",
        "s3:DeleteObject",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": ["arn:aws:s3:::exported-vm","arn:aws:s3:::exported-vm/*"]
    }, 
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreateRole",
        "iam:PutRolePolicy"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CancelConversionTask",
        "ec2:CancelExportTask",
        "ec2:CreateImage",
        "ec2:CreateInstanceExportTask",
        "ec2:CreateTags",
        "ec2:DeleteTags",
        "ec2:DescribeConversionTasks",
        "ec2:DescribeExportTasks",
        "ec2:DescribeInstanceAttribute",
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribeTags",
        "ec2:ImportInstance",
        "ec2:ImportVolume",
        "ec2:StartInstances",
        "ec2:StopInstances",
        "ec2:TerminateInstances",
        "ec2:ImportImage",
        "ec2:ImportSnapshot",
        "ec2:DescribeImportImageTasks",
        "ec2:DescribeImportSnapshotTasks",
        "ec2:CancelImportTask"
      ],
      "Resource": "*"
    }
  ]
}
  • Fast upstream bandwidth as you will be uploading the image to s3!
  • Disk image cannot exceed 1 TB.
  • Make sure that you have at least 250 MB of available disk space for installing drivers and other software.
  • Multiple network interfaces are not currently supported.
  • IPv6 is not supported.
  • To use your own Microsoft licenses, set LicenseType to BYOL; your BYOL instances will be priced at the prevailing AWS EC2 Linux instance pricing, provided that you run them on a Dedicated Instance.
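For reference, the vmimport role and its policy mentioned in the prerequisites can be created with the AWS CLI; the file names are placeholders for the JSON documents described in the AWS documentation:

# Create the vmimport service role from the trust policy shown above (saved as trust-policy.json)
aws iam create-role --role-name vmimport --assume-role-policy-document file://trust-policy.json

# Attach the access policy that grants the role S3 and EC2 permissions (saved as role-policy.json)
aws iam put-role-policy --role-name vmimport --policy-name vmimport --policy-document file://role-policy.json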

In order to initiate and manage the migration (import), you’ll need to install the AWS CLI tools on the machine where the source images reside. You can refer to the AWS documentation for installing the CLI tools.

Migrating virtual machines: prepare your VM

  • Uninstall VMware Tools from your VMware VM.
  • Disconnect any CD-ROM drives (virtual or physical).
  • Set your network to DHCP instead of a static IP address. If you want to assign a static private IP address, be sure to use a non-reserved private IP address in your VPC subnet.
  • Shut down your VM before exporting it.
  • On Windows, enable Remote Desktop (RDP) for remote access, and on Linux enable SSH server access.
  • Allow RDP and SSH access through your host firewall if you have one.
  • Use secure passwords for all your user accounts and disable Auto logon on your Windows VM.
  • Make sure that your Linux VM uses GRUB (GRUB legacy) or GRUB 2 as its boot loader.
  • Make sure that your Linux VM uses one of the following root file systems: EXT2, EXT3, EXT4, Btrfs, JFS, or XFS.
  • Export your VM from its virtual environment (VMware or Microsoft Hyper-V).

Okay, now we are going to import our OVA file using the AWS CLI tool. I have already created an S3 bucket named ”exported-vm” in my AWS account and uploaded the OVA file to it:

aws s3 mb s3://exported-vm --region us-west-1
aws s3 cp c:\myfolder\RHEL_6.5.ova s3://exported-vm/RHEL_6.5.ova

Let’s open a terminal/cmd console, or whatever console you’re using with the AWS CLI, and type the following command to import the OVA file and convert it into an AMI image.
Here’s my example of the command:

aws ec2 import-image --cli-input-json "{ \"Description\": \"RHEL OVA\", \"DiskContainers\": [ { \"Description\": \"First CLI task\", \"UserBucket\": { \"S3Bucket\": \"exported-vm\", \"S3Key\" : \"RHEL_6.5.ova\" } } ]}"

Example response:

{
    "Status": "active",
    "Description": "RHEL OVA",
    "Progress": "2",
    "SnapshotDetails": [
        {
            "UserBucket": {
                "S3Bucket": "exported-vm",
                "S3Key": "RHEL_6.5.ova"
            },
            "DiskImageSize": 0.0
        }
    ],
    "StatusMessage": "pending",
    "ImportTaskId": "import-ami-ffqfkywt"
}

Run the aws ec2 describe-import-image-tasks command to check the status of the import:

aws ec2 describe-import-image-tasks --import-task-ids import-ami-ffqfkywt

Example response:

{
    "ImportImageTasks": [
        {
            "Status": "active",
            "Description": "RHEL OVA",
            "Progress": "28",
            "SnapshotDetails": [
                {
                    "UserBucket": {
                        "S3Bucket": "exported-vm",
                        "S3Key": "RHEL_6.5.ova"
                    },
                    "DiskImageSize": 15508464640.0,
                    "Format": "VMDK"
                }
            ],
            "StatusMessage": "converting",
            "ImportTaskId": "import-ami-ffqfkywt"
        }
    ]
}

After the conversion completes, you can find the new AMI in the AWS Console under EC2 / AMIs and use it to create new EC2 instances. Note the instance ID from the VM import status, right-click the instance, select Instance State, and then click Start.

NewAMIs

 

 

 

 
