Heshan's Blog: 2013

Monday, November 18, 2013

My first tandem sky dive experience

Never thought jumping out of an airplane at 11,000 feet would be this much fun. What a rush!

The video quality is a bit low because I had to convert it to a lower quality to upload it to YouTube.

Techniques to back up and restore MySQL databases - A comparison

When I started working at Digital Mediat Solutions, one of the biggest problems that we had was the runtime of daily mysql backup scripts. They were taking 8 hours to complete. The first task I had was to fix the performance of the backups. The existing scripts were based on mysqldump.

I compiled a list of options to look into by searching the web. I'm sharing it here in case someone else finds it useful.

Out of the following options, the most feasible one for us (economically, disk-space-wise and performance-wise) was mysqlhotcopy. The backup script I have written based on mysqlhotcopy now completes in 1 hour, which is a significant improvement.

Options

1) MySQL enterprise backup

pros

incremental backup
compressed backup
Backing up the physical database files makes restore much faster than logical techniques such as the mysqldump command.
InnoDB tables are copied using a hot backup mechanism. (Ideally, the InnoDB tables should represent a substantial majority of the data.)
Tables from other storage engines are copied using a warm backup mechanism.

cons

enterprise licensed
The minimum license for 1 MySQL server instance for 1 year is XXXX USD.
The cost will grow as the number of instances grows, and it recurs year after year.
It even exceeds our current budget for underlying hardware.
Therefore, I don’t think it’s feasible for our budget and IMV it’s not needed for the current deployment.

2) MySQL dumps with diffs

pros

Saves space on the backup device, as we are backing up only the diff.
It reduces network traffic, as the full dump is transferred only when the backup runs for the first time.

cons

We are looking into eliminating the overhead generated by the mysqldump command.
Since this option runs the same command over and over again, it won't satisfy our need in this deployment. Therefore, IMV this is not the option for us.
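
To give a rough idea of option 2, here is a minimal sketch of the diff step. It assumes two plain-text dumps already exist on disk; the function name dump_delta and the file names in the note below are made up for illustration and are not part of the scripts discussed in this post.

```shell
# Store only the lines that changed between two dumps.
# diff exits with 1 when the files differ (the normal case here) and with 2 on real errors.
dump_delta() {
    previous=$1   # dump from the last run
    current=$2    # dump from this run
    delta=$3      # file that receives the diff
    diff "$previous" "$current" > "$delta" || [ $? -eq 1 ] || return 1
    # A missing delta file means the redirection itself failed.
    [ -f "$delta" ]
}
```

Something like dump_delta monday.sql tuesday.sql tuesday.diff would then leave only the changed rows in tuesday.diff. Dumping with mysqldump's --skip-extended-insert (one INSERT per row) and --skip-dump-date tends to keep such diffs small.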

3) Make incremental backups by enabling the binary logs.

pros

Used to set up replication
Used for restore operations
Can do pinpoint (point-in-time) restorations
Can tarball the existing logs and ship them to the backup location. It is just a file zipping operation.
The first two pros alone outweigh the con.

cons

Makes performance slightly slower.
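
The tarball idea from the pros list could look roughly like this. It's a sketch under a few assumptions: the logs use the mysql-bin base name (i.e. log-bin=mysql-bin in my.cnf), the newest log is still being written by the server and is therefore skipped, and archive_binlogs is a name I made up.

```shell
# Pack all closed binary logs from logdir into a gzipped tarball at dest.
archive_binlogs() {
    logdir=$1   # directory holding mysql-bin.NNNNNN files
    dest=$2     # absolute path of the tarball to create
    # Every numbered log except the newest one, which may still be open.
    closed=$(cd "$logdir" && ls mysql-bin.[0-9]* 2>/dev/null | sort | sed '$d')
    if [ -z "$closed" ]; then
        echo "no closed binary logs in $logdir" >&2
        return 1
    fi
    # $closed is left unquoted on purpose: binary log names contain no whitespace.
    tar -czf "$dest" -C "$logdir" $closed
}
```

For example, archive_binlogs /var/lib/mysql /backup/binlogs.tar.gz (both paths are placeholders). Running mysqladmin flush-logs first makes the server start a new log, so everything up to that point ends up in the archive.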

4) Suggestions to improve current backup scripts

Improve the cron job (which is triggered during a low-usage period) to perform the slaving task, so that high CPU utilization is avoided.

5) Should investigate and run some performance metrics on the mysql-parallel-dump and mk-parallel-dump (http://www.maatkit.org/) utilities.

mk-parallel-dump is deprecated now, so it is not feasible.
TODO: Talk to Matt regarding:

Setting up a dev env, so that I can test these things. I don’t need any fancy new hardware for this. Just give me a PC and I’ll configure it with Linux and try these out. We don’t have to buy new hardware.
Need to test these in a dev env, so that we can roll them out to the production environment.
Need to have a mock server (with roughly the same traffic as the real servers), so that we can test and verify things properly.

6) Remove the current scripts and use a tool specifically written for this and see how it performs.

http://sourceforge.net/projects/automysqlbackup/

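7) Update the current scripts with the --quick option, which avoids buffering large tables in memory and retrieves rows from the server one at a time. This mainly helps when dumping large tables and will be a minor tweak to the current scripts. Note that mysqldump already enables --quick by default through --opt, so this only matters if the scripts disable it (for example with --skip-opt).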
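
As a rough illustration of option 7, the tweak could look like the sketch below. The function name dump_db and the MYSQLDUMP override are my own additions, and I'm assuming credentials come from the [client] section of ~/.my.cnf rather than the command line.

```shell
# Path to mysqldump; override it to point at another binary.
MYSQLDUMP=${MYSQLDUMP:-mysqldump}

# Dump one database to a file.
# --quick streams rows from the server instead of buffering each table in memory.
# --lock-tables locks the tables of the database while it is dumped (suits MyISAM).
dump_db() {
    db=$1        # database name
    outfile=$2   # file that receives the SQL dump
    "$MYSQLDUMP" --quick --lock-tables "$db" > "$outfile"
}
```

A call such as dump_db shop /backup/shop.sql (hypothetical names) would then produce the dump.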

8) Use of mysqlhotcopy instead of mysqldump

It uses FLUSH TABLES, LOCK TABLES, and cp or scp to make a database backup.
It is a fast way to back up a database or single tables, but it can be run only on the same machine where the database directories are located.
mysqlhotcopy works only for backing up MyISAM and ARCHIVE tables.
Since we are using MyISAM, this won't be a problem for us.
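
This is not the actual script I ended up with, just a minimal sketch of the idea. The function name hotcopy_all, the MYSQLHOTCOPY override and the date-stamped directory layout are assumptions made for the example; credentials are again assumed to come from ~/.my.cnf.

```shell
# Path to mysqlhotcopy; override it to point at another binary.
MYSQLHOTCOPY=${MYSQLHOTCOPY:-mysqlhotcopy}

# Copy each named database into a directory stamped with today's date.
# Must run on the database host itself, since the table files are copied directly.
hotcopy_all() {
    backup_root=$1
    shift
    target="$backup_root/$(date +%Y-%m-%d)"
    mkdir -p "$target" || return 1
    for db in "$@"; do
        # --allowold renames an existing copy to <db>_old instead of aborting.
        "$MYSQLHOTCOPY" --allowold "$db" "$target" || return 1
    done
}
```

From cron this could be called as hotcopy_all /backup/mysql shop blog, where the path and database names are placeholders.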

9) Use SVN to back up databases.

This will help to have revisions.
I don’t think we need such a mechanism.

10) If replication to a slave isn't an option, we could leverage the filesystem, depending on the OS we are using:

Consistent backup with Linux Logical Volume Manager (LVM) snapshots.
MySQL backups using ZFS snapshots.
The joys of backing up MySQL with ZFS...
Some people have used ZFS snapshots as a backup method on quite large MySQL databases (30GB+). They report that it completes very quickly (never more than a few minutes) and doesn't block.
They say that the snapshot can then be mounted somewhere else and backed up to tape, etc.
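
For the LVM variant, the sequence of commands could be sketched as follows. I haven't run this against our servers; the volume path, snapshot size and mount point are placeholders, snapshot_backup is a made-up name, and the sketch assumes writes are paused while lvcreate runs (for example by holding FLUSH TABLES WITH READ LOCK in another session). Setting DRY_RUN=1 only prints the commands.

```shell
# Run a command, or just print it when DRY_RUN is set.
run() { ${DRY_RUN:+echo} "$@"; }

# Snapshot the volume holding the MySQL data directory, copy it, drop the snapshot.
snapshot_backup() {
    origin=$1               # e.g. /dev/vg0/mysql (hypothetical logical volume)
    dest=$2                 # directory that receives the copy
    snap_name=mysql-snap
    snap_dev="$(dirname "$origin")/$snap_name"
    mnt=/mnt/$snap_name

    run lvcreate --snapshot --size 5G --name "$snap_name" "$origin" || return 1
    run mkdir -p "$mnt" || return 1
    run mount -o ro "$snap_dev" "$mnt" || return 1
    run cp -a "$mnt/." "$dest"
    copy_status=$?
    # Clean up even if the copy failed, then report the copy result.
    run umount "$mnt"
    run lvremove -f "$snap_dev"
    return $copy_status
}
```

With DRY_RUN=1 snapshot_backup /dev/vg0/mysql /backup/mysql the whole plan can be reviewed before anything touches the volume group.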

Wednesday, April 17, 2013

Apache Airavata 0.7 Released

The Apache Airavata PMC is pleased to announce the immediate availability of the Airavata 0.7 release.

The release can be obtained from the Apache Airavata download page - http://airavata.apache.org/about/downloads.html

Release notes are available at - https://svn.apache.org/repos/asf/airavata/tags/airavata-0.7/RELEASE_NOTES

Apache Airavata is a software framework providing APIs, sophisticated server-side tools, and graphical user interfaces to construct, execute, control and manage long running applications and workflows on distributed computing resources. Apache Airavata builds on general concepts of service oriented computing, distributed messaging, and workflow composition and orchestration.

For general information on Apache Airavata, please visit the project website: http://airavata.apache.org/

Friday, April 5, 2013

Run EC2 Jobs with Airavata - Part III

This is a follow-up to my earlier posts [1] [2]. Here we will execute the application mentioned in [2] programmatically using Airavata.

import org.apache.airavata.commons.gfac.type.*;
import org.apache.airavata.gfac.GFacAPI;
import org.apache.airavata.gfac.GFacConfiguration;
import org.apache.airavata.gfac.GFacException;
import org.apache.airavata.gfac.context.security.AmazonSecurityContext;
import org.apache.airavata.gfac.context.ApplicationContext;
import org.apache.airavata.gfac.context.JobExecutionContext;
import org.apache.airavata.gfac.context.MessageContext;
import org.apache.airavata.schemas.gfac.*;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * Your Amazon instance should be in a running state before running this test.
 */
public class EC2ProviderTest {
    private JobExecutionContext jobExecutionContext;

    private static final String hostName = "ec2-host";

    private static final String hostAddress = "ec2-address";

    private static final String sequence1 = "RR042383.21413#CTGGCACGGAGTTAGCCGATCCTTATTCATAAAGTACATGCAAACGGGTATCCATA" +
            "CTCGACTTTATTCCTTTATAAAAGAAGTTTACAACCCATAGGGCAGTCATCCTTCACGCTACTTGGCTGGTTCAGGCCTGCGCCCATTGACCAATATTCCTCA" +
            "CTGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTCTCAGAACCCCTATCCATCGAAGACTAGGTGGGCCGTTACCCCGC" +
            "CTACTATCTAATGGAACGCATCCCCATCGTCTACCGGAATACCTTTAATCATGTGAACATGCGGACTCATGATGCCATCTTGTATTAATCTTCCTTTCAGAAG" +
            "GCTGTCCAAGAGTAGACGGCAGGTTGGATACGTGTTACTCACCGTGCCGCCGGTCGCCATCAGTCTTAGCAAGCTAAGACCATGCTGCCCCTGACTTGCATGT" +
            "GTTAAGCCTGTAGCTTAGCGTTC";

    private static final String sequence2 = "RR042383.31934#CTGGCACGGAGTTAGCCGATCCTTATTCATAAAGTACATGCAAACGGGTATCCATA" +
            "CCCGACTTTATTCCTTTATAAAAGAAGTTTACAACCCATAGGGCAGTCATCCTTCACGCTACTTGGCTGGTTCAGGCTCTCGCCCATTGACCAATATTCCTCA" +
            "CTGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTCTCAGAACCCCTATCCATCGAAGACTAGGTGGGCCGTTACCCCGC" +
            "CTACTATCTAATGGAACGCATCCCCATCGTCTACCGGAATACCTTTAATCATGTGAACATGCGGACTCATGATGCCATCTTGTATTAAATCTTCCTTTCAGAA" +
            "GGCTATCCAAGAGTAGACGGCAGGTTGGATACGTGTTACTCACCGTGCG";

    /* The following variables must be set in order to run the test. Since they hold account-specific information,
       no values are included here. Whoever runs the test is responsible for filling them in.
       */

    /* Username used to log into your EC2 instance, e.g. ec2-user */
    private String userName = "";

    /* Secret key used to connect to the image */
    private String secretKey = "";

    /* Access key used to connect to the image */
    private String accessKey = "";

    /* Instance id of the running instance of your image */
    private String instanceId = "";

    @Before
    public void setUp() throws Exception {
        URL resource = EC2ProviderTest.class.getClassLoader().getResource("gfac-config.xml");
        assert resource != null;
        System.out.println(resource.getFile());
        GFacConfiguration gFacConfiguration = GFacConfiguration.create(new File(resource.getPath()), null, null);

        /* EC2 Host */
        HostDescription host = new HostDescription(Ec2HostType.type);
        host.getType().setHostName(hostName);
        host.getType().setHostAddress(hostAddress);

        /* App */
        ApplicationDescription ec2Desc = new ApplicationDescription(Ec2ApplicationDeploymentType.type);
        Ec2ApplicationDeploymentType ec2App = (Ec2ApplicationDeploymentType)ec2Desc.getType();

        String serviceName = "Gnome_distance_calculation_workflow";
        ec2Desc.getType().addNewApplicationName().setStringValue(serviceName);
        ec2App.setJobType(JobTypeType.EC_2);
        ec2App.setExecutable("/home/ec2-user/run.sh");
        ec2App.setExecutableType("sh");

        /* Service */
        ServiceDescription serv = new ServiceDescription();
        serv.getType().setName("GenomeEC2");

        List<InputParameterType> inputList = new ArrayList<InputParameterType>();

        InputParameterType input1 = InputParameterType.Factory.newInstance();
        input1.setParameterName("genome_input1");
        input1.setParameterType(StringParameterType.Factory.newInstance());
        inputList.add(input1);

        InputParameterType input2 = InputParameterType.Factory.newInstance();
        input2.setParameterName("genome_input2");
        input2.setParameterType(StringParameterType.Factory.newInstance());
        inputList.add(input2);

        InputParameterType[] inputParamList = inputList.toArray(new InputParameterType[inputList.size()]);

        List<OutputParameterType> outputList = new ArrayList<OutputParameterType>();
        OutputParameterType output = OutputParameterType.Factory.newInstance();
        output.setParameterName("genome_output");
        output.setParameterType(StringParameterType.Factory.newInstance());
        outputList.add(output);

        OutputParameterType[] outputParamList = outputList
                .toArray(new OutputParameterType[outputList.size()]);

        serv.getType().setInputParametersArray(inputParamList);
        serv.getType().setOutputParametersArray(outputParamList);

        jobExecutionContext = new JobExecutionContext(gFacConfiguration,serv.getType().getName());
        ApplicationContext applicationContext = new ApplicationContext();
        jobExecutionContext.setApplicationContext(applicationContext);
        applicationContext.setServiceDescription(serv);
        applicationContext.setApplicationDeploymentDescription(ec2Desc);
        applicationContext.setHostDescription(host);

        AmazonSecurityContext amazonSecurityContext =
                new AmazonSecurityContext(userName, accessKey, secretKey, instanceId);
        jobExecutionContext.addSecurityContext(AmazonSecurityContext.AMAZON_SECURITY_CONTEXT, amazonSecurityContext);

        MessageContext inMessage = new MessageContext();
        ActualParameter genomeInput1 = new ActualParameter();
        ((StringParameterType)genomeInput1.getType()).setValue(sequence1);
        inMessage.addParameter("genome_input1", genomeInput1);

        ActualParameter genomeInput2 = new ActualParameter();
        ((StringParameterType)genomeInput2.getType()).setValue(sequence2);
        inMessage.addParameter("genome_input2", genomeInput2);

        MessageContext outMessage = new MessageContext();
        ActualParameter echo_out = new ActualParameter();
        outMessage.addParameter("distance", echo_out);

        jobExecutionContext.setInMessageContext(inMessage);
        jobExecutionContext.setOutMessageContext(outMessage);
    }

    @Test
    public void testEC2Provider() throws GFacException {
        GFacAPI gFacAPI = new GFacAPI();
        gFacAPI.submitJob(jobExecutionContext);
        MessageContext outMessageContext = jobExecutionContext.getOutMessageContext();
        Assert.assertEquals(MappingFactory.
                toString((ActualParameter) outMessageContext.getParameter("genome_output")), "476");
    }
}

References
[1] - http://heshans.blogspot.com/2013/04/run-ec2-jobs-with-airavata-part-i.html
[2] - http://heshans.blogspot.com/2013/04/run-ec2-jobs-with-airavata-part-ii.html

Run EC2 Jobs with Airavata - Part II

In this post we will look at how to compose a workflow out of an application that is installed in an Amazon Machine Image (AMI). In the earlier post we discussed how to do EC2 instance management using the XBaya GUI. This is the follow-up to that post.

For the Airavata EC2 integration testing, I created an AMI containing an application which does gene sequence alignment using the Smith-Waterman algorithm. I will be using that application as the reference for this post. You can use an application of your preference that resides in your AMI.

1. Unzip Airavata server distribution and start the server.

unzip apache-airavata-server-0.7-bin.zip
cd apache-airavata-server-0.7/bin
./airavata-server.sh

2. Unzip Airavata XBaya distribution and start XBaya.

unzip apache-airavata-xbaya-gui-0.7-bin.zip
cd apache-airavata-xbaya-gui-0.7/bin
./xbaya-gui.sh

Then you'll get the XBaya UI.

3. Select "XBaya" Menu and click "Add Host" to register an EC2 Host. Once you add the details, click "ok".

4. You will then be prompted to enter "Airavata Registry" information. If you are using the default setup, you don't have to do any configuration. Just click "ok".

5. In order to use your application installed in the AMI, you must register it as an application in Airavata system. Select "XBaya" menu and click "Register Application". You will get the following dialog. Add the input parameters expected and the output parameters generated by your application.

6. Then Click the "New deployment" button. You have to then select the EC2Host that you registered earlier as the Application Host. Configure the executable path to your application in your AMI and click "Add".

7. Then click "Register". If the application registration was successful, you will be getting the following message.

8. Now select "Registry" menu and click "Setup Airavata Registry". Click "ok".

9. Select "XBaya" menu and click "New workflow". Then configure it accordingly.

10. Select your registered application from the "Application Services" and drag and drop it onto the workflow window.

11. Drag an "Instance" component from "Amazon Components" and drop it into the workflow window. Then connect it to your application using Control ports.

12. Click the config label on top of the "Instance" component. Configure your instance accordingly.

13. Drag and drop two input components and one output component to the workflow from "System Components".

14. Connect the components together accordingly.

15. Now click the red "play" button to run your workflow. You will be prompted for the input values (in my case the gene sequences) and an experiment id. Then click "Run" to execute your workflow.

16. The execution result will be shown in the XBaya GUI.

References
[1] - http://heshans.blogspot.com/2013/04/run-ec2-jobs-with-airavata-part-i.html

Run EC2 Jobs with Airavata - Part I

This will be the first of many posts that I will be doing on Apache Airavata EC2 integration. First let's have a look at how you can use Airavata's "XBaya GUI" to manage amazon instances.

Applies to : Airavata 0.7 and above

1. Unzip Airavata server distribution and start the server.

unzip apache-airavata-server-0.7-bin.zip
cd apache-airavata-server-0.7/bin
./airavata-server.sh

2. Unzip Airavata XBaya distribution and start XBaya.

unzip apache-airavata-xbaya-gui-0.7-bin.zip
cd apache-airavata-xbaya-gui-0.7/bin
./xbaya-gui.sh

Then you'll get the XBaya UI.

3. Then select the "Amazon" menu and click "Security Credentials". Specify your secret key and access key in the security credentials dialog box and click "ok".

4. Then select the "Amazon" menu and click "EC2 Instance Management". It will give a glimpse of your running instances.

5. Click the "launch" button to launch new instances and the "terminate" button to terminate running instances.

6. When you launch a new instance, it will be shown in your "Amazon EC2 Management Console".

Friday, March 15, 2013

Airavata Deployment Studio (ADS)

This is an independent study that I have been doing for Apache Airavata [1]. Airavata Deployment Studio, or simply ADS, is a platform where an Airavata user can deploy his/her Airavata deployment on a cloud computing resource on demand. Now let's dive into ADS and the actual problem that we are trying to solve here.

What is Airavata?

Airavata is a framework which enables a user to build Science Gateways. It is used to compose, manage, execute and monitor distributed applications and workflows on computational resources. These computational resources can range from local resources to computational grids and clouds. Therefore, various users with different backgrounds either contribute or use Airavata in their applications.

Who uses Airavata?

From the Airavata standpoint, three main users can be identified.

1) End Users

An End User is someone who has model code for some scientific application. Sometimes this End User can be a Research Scientist. He/she writes scripts to wrap the applications and, by executing those scripts, runs scientific workflows on supercomputers. This can be called a scientific experiment.

2) Gateway Developers

The Research Scientist is the one who comes up with the requirement of bundling scientific applications together and composing them as a workflow. The job of the Gateway Developer is to use Airavata to wrap the above-mentioned model code and scripts together. Scientific workflows are then created out of these. In some cases, the Scientist might be the Gateway Developer as well.

3) Core Developers

A Core Developer is someone who develops and contributes to the Airavata framework code base. The Gateway Developers use the software developed by the Core Developers to create science gateways.

Why ADS?

As described above, Airavata is used by people with different technical backgrounds. Some have in-depth knowledge of their scientific domains, such as chemistry, biology or astronomy, but may not have in-depth knowledge of computer science aspects such as cluster configuration or configuring and troubleshooting VMs.

ADS is targeted at the first two types of users, as they are the ones who will run into configuration issues with Airavata on their respective systems.

Sometimes we come across instances where a user runs into issues while setting up Airavata on their system. These might be attributed to:

User not following the documented steps properly.
Issues in setting up the user environment.
User not being able to diagnose the issues at their end on their own.
Difficulties in accessing the user's VM remotely when we try to diagnose the issue, due to security policies defined in their system.
Different security policies at the client's firewall.

Due to the above-mentioned issues, a first-time user might go away with a bad impression because of a system/VM-level issue that is not directly related to Airavata.

What we are trying to do here is to give a first-time user a good first impression, as well as make it easy to configure the Airavata ecosystem for production usage.

How?

Now you might be wondering how ADS achieves this. ADS will use FutureGrid [3] as the underlying resource platform. If you are interested in learning what FutureGrid is, please refer to [3] for more information. ADS will ultimately become a plugin to FutureGrid's CloudMesh [4] environment.

ADS will provide a web interface which a user can use to configure his/her Airavata ecosystem. Once the configuration options are selected and the user hits the submit button, a new VM with the selected configuration will be created. The user will be able to create his/her image with the following properties.

Infrastructure - eg: OpenStack, Eucalyptus, EC2, etc
Architecture - eg: 64-bit, 32-bit
Memory - eg: 2GB, 4GB, 8GB, etc
Operating System - eg: Ubuntu, CentOS, Fedora, etc
Java version - eg: Java 1.6, Java 1.7
Tomcat Version - eg: Tomcat6, Tomcat7
Airavata Version - eg: Airavata-0.6, Airavata-0.7

Advantages?

One click install.
No need to interact with the shell to configure an Airavata environment.
Deploying on various Cloud platforms based on user preference.
Ease of use.
A first-time user will be able to quickly configure an instance of his/her own and run a sample workflow.
On demand aspect.

Sneak Peek

The following screenshots show what ADS will look like.

References

[1] - http://airavata.apache.org

[2] - http://airavata.apache.org/architecture/airavata-stakeholders.html

[3] - https://portal.futuregrid.org/about
[4] - http://cloudmesh.blogspot.com

Installing Moab Web Services on a Unix box

1) Install Tomcat

yum install tomcat6

2) Install 64-bit version of Oracle Java SE6 JRE.

sh jre-6u37-linux-x64-rpm.bin
rm -f /usr/bin/java
ln -s /etc/alternatives/java /usr/bin/java
alternatives --install /usr/bin/java java /usr/java/jre1.6.0_37/bin/java 500
alternatives --set java /usr/java/jre1.6.0_37/bin/java

3) Create mws home directories and sub-directories

mkdir -p /opt/mws/etc /opt/mws/hooks /opt/mws/plugins /opt/mws/log
chown -R tomcat:tomcat /opt/mws # Depending on your OS, the Tomcat username might be tomcat6.
chmod -R 555 /opt/mws
chmod u+w /opt/mws/plugins /opt/mws/log

4) Extract the mws tarball to a temporary directory.

mkdir /tmp/mws-install
cd /tmp/mws-install
tar xvzf $HOME/Downloads/mws-.tar.gz
cd /tmp/mws-install/mws-

5) Set up the MWS configuration file.
i) In the extracted MWS directory are two sample configuration files:

   mws-config-cloud.groovy and mws-config-hpc.groovy
   mws-config-cloud.groovy provides sample configuration for the Moab Cloud Suite
   mws-config-hpc.groovy provides sample configuration for the Moab HPC Suites

ii) Choose the correct file for your suite, rename it to mws-config.groovy, and copy it to /opt/mws/etc.

iii) Give the Tomcat user read access to /opt/mws/etc/mws-config.groovy.

6) Add the following line to the end of /etc/tomcat6/tomcat6.conf.

CATALINA_OPTS="-DMWS_HOME=/opt/mws -Xms256m -Xmx3g -XX:MaxPermSize=384m"

7) Start Tomcat and deploy mws.war.

chkconfig tomcat6 on
service tomcat6 stop
cp /tmp/mws-install/mws-/mws.war /var/lib/tomcat6/webapps
service tomcat6 start

8) Visit http://localhost:8080/mws/ in a web browser to verify that MWS is running. You will see some sample queries and a few other actions.

9) Log into MWS to verify that the MWS credentials are working. The credentials are the values of auth.defaultUser.username and auth.defaultUser.password that you set above.
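
Steps 5 ii) and iii) above don't come with commands, so here is one way they could be scripted. This is only a sketch: install_mws_config is a name I made up, the 440 mode is my choice for "read access", and changing the owner to the Tomcat user is left as a separate step because it needs root.

```shell
# Copy the sample config for the chosen suite into place as mws-config.groovy.
install_mws_config() {
    suite=$1                     # "cloud" or "hpc"
    src_dir=$2                   # extracted MWS directory
    etc_dir=${3:-/opt/mws/etc}   # MWS configuration directory
    case $suite in
        cloud|hpc) ;;
        *) echo "suite must be cloud or hpc" >&2; return 1 ;;
    esac
    cp "$src_dir/mws-config-$suite.groovy" "$etc_dir/mws-config.groovy" || return 1
    # Read-only for owner and group; ownership is set separately.
    chmod 440 "$etc_dir/mws-config.groovy"
}
```

After calling it with the suite and the extracted MWS directory, something like chown tomcat:tomcat /opt/mws/etc/mws-config.groovy (run as root, with tomcat6 as the username on some systems) gives the Tomcat user its read access.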

Wednesday, February 13, 2013

Google Chrome remote desktop plugin

This is one of the best and most useful plugins that I have ever come across. I was introduced to it while I was doing some work with Gregor on FutureGrid. Setting it up and running it is pretty trivial.

https://chrome.google.com/webstore/detail/chrome-remote-desktop/gbchcmhmhahfdphkhkmpfmihenigjmpp?hl=en

Setting up the FutureGrid CloudMesh environment on a Unix box

I wrote a post on the CloudMesh blog on how to set up FutureGrid's CloudMesh on a Linux box. Enjoy!