2010/12/11
The recent case of WikiLeaks being booted off AWS provoked the following thought: if lots of people started mirroring WikiLeaks on EC2, Amazon would be forced into playing whack-a-mole to stop it. Game over: the revolt of a user base, the inevitable collateral damage, etc. leads to bad PR and (hopefully) reform and vertebrates. Amazon's nice GUI, combined with an introductory offer that makes a micro instance basically free for a year, means that putting up a how-to page or YouTube video showing every college kid with a credit card how to do it can't be too difficult.
My question is, what next? How do you turn a swarm of small, transient mirrors into something findable and load-balanced enough to deal with the (potentially) huge demand of serving WikiLeaks traffic?
Therein lies an interesting problem. First and foremost, resolving a stable domain to one of a set of highly dynamic addresses is difficult. Round-robin DNS load-balancing with a very short TTL is one obvious approach, but that raises the question of how one bootstraps and then maintains the set of A records listing the mirrors. The stable domain could instead delegate (via NS records) to DNS servers that are themselves part of the mirroring swarm, if each node acted as both a DNS and an HTTP server. Each node would then answer queries with a randomised list of the nodes in its topological neighborhood.
Apart from that, I'm not much further along in my thinking. Posadis could be used to implement a simple DNS server - the sample code can practically be copy-pasted by anyone who knows a bit of C++. The P2P network should be easy to do - any DHT should be usable, provided it has an API call to get the list of nodes, since the goal is not really to store information but just to maintain a single cluster of machines that know about each other. But the boot-strapping problem remains.
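To make that last idea concrete: each node would keep a peer table, fed by the DHT's node-list call, and answer every DNS query for the stable domain with a short, randomised subset of those peers and a very low TTL. A minimal sketch of just the peer-selection step, leaving the DNS wire format and the DHT plumbing aside (the class and method names are hypothetical):
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

public class MirrorSwarmResolver {
    // Peers this node currently knows about, maintained by the DHT layer (not shown).
    private final Set<InetAddress> peers = new CopyOnWriteArraySet<InetAddress>();
    // Very short TTL so that dead mirrors fall out of resolver caches quickly.
    public static final int TTL_SECONDS = 60;

    public void peerJoined(InetAddress addr) { peers.add(addr); }
    public void peerLeft(InetAddress addr) { peers.remove(addr); }

    // Returns up to maxAnswers peers, randomised, to be used as the A records
    // in the DNS response for the stable domain.
    public List<InetAddress> answersFor(int maxAnswers) {
        List<InetAddress> shuffled = new ArrayList<InetAddress>(peers);
        Collections.shuffle(shuffled);
        return shuffled.subList(0, Math.min(maxAnswers, shuffled.size()));
    }
}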
2010/12/08
Thoughts on WikiLeaks
- We should be more outraged at the malfeasance revealed than at the manner in which we learnt of it. That we aren't, speaks to our cynicism and the low-to-nonexistent standards to which we hold government.
- It was EveryDNS, not EasyDNS.
- Much as I love Amazon AWS (see many, many previous posts), their justification for booting out WL is disappointing, because (a) the cause (Lieberman's call) and the effect are there for all to see, and (b) they didn't boot WL off six months ago over the Iraq or Afghanistan war logs, which were also violations of the TOS. Besides which, they have a business interest in a robust First Amendment - they sell books, including dangerous and subversive ones containing state secrets! This isn't hard, people!
- As Tim Bray said, the spinelessness of the IT industry in general is depressing when one considers that we should have "freedom of speech" in our DNA.
- The following ideas are not contradictory:
- Cablegate is not necessarily a good thing
- The way in which various governments and their officials have responded to it is an attack on the freedom of information and on the free press.
- Julian Assange can be (a) a scumbag rapist, (b) justifiably paranoid, (c) a raging egotist and (d) doing really important work, all at the same time.
2010/05/19
Jevons paradox, Moore's law and utility computing
TL;DR: A quirk of economics may explain why computers have always seemed too slow, and could indicate that a utility computing boom will take place in the near future.
First: Jevons paradox...
...is defined as follows on Wikipedia:
"The proposition that technological progress that increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource."
This was first observed by Jevons himself with coal:
"Watt's innovations made coal a more cost effective power source, leading to the increased use of the steam engine in a wide range of industries. This in turn increased total coal consumption, even as the amount of coal required for any particular application fell."
I think that we're experiencing this effect with CPU time and storage. Specifically, if we recast Moore's law in terms of increased efficiency of CPU instruction processing per dollar, then Jevons paradox explains why software generally never seems to get any faster, and why we always seem to be running out of storage space.
This is backed up by anecdotal evidence and folk knowledge. Consider generally accepted adages such as Wirth's law, "Software is getting slower more rapidly than hardware becomes faster", the variations of Parkinson's law, such as "Data expands to fill the space available for storage", and Nathan Myhrvold's 'first law of software', to paraphrase: "Software is a gas: it always expands to fit whatever container it is stored in."
To provide more empirical proof, one would have to be able to take a survey of data like "historical average time taken by a user to perform common operation X", or "average % disk space free since 1990 on home/office PCs". I'd be interested in any links to studies similar to this, if anyone has any examples.
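One way to make the Jevons effect concrete is a toy constant-elasticity demand model: if demand for compute is elastic (price elasticity greater than 1), total consumption - and total spending - rises even as the price per unit of compute collapses. A rough sketch with made-up numbers, not a forecast:
public class JevonsToyModel {
    public static void main(String[] args) {
        double pricePerUnit = 1.0;  // cost of a unit of compute, arbitrary units
        double demand = 1.0;        // units of compute consumed at that price
        double elasticity = 1.3;    // assumed price elasticity of demand; > 1 means elastic

        for (int year = 0; year <= 10; year += 2) {
            System.out.printf("year %2d: price %.4f, demand %10.1f, total spend %8.2f%n",
                    year, pricePerUnit, demand, pricePerUnit * demand);
            // Recast Moore's law as a price effect: cost per unit of compute halves every ~2 years.
            pricePerUnit /= 2.0;
            // With elastic demand, consumption grows by more than the price fell,
            // so total spending on compute rises even as compute gets cheaper.
            demand *= Math.pow(2.0, elasticity);
        }
    }
}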
Second: the commoditisation of CPU time and storage
Generally, clients do not seem to be getting 'thicker', or pushing boundaries in terms of CPU power and storage. Meanwhile, many companies are spending a lot of money on new datacenters, and the concept of "Big Data" is gaining ground. Server-side, we've moved to multi-core and we're learning more about running large clusters of cheap hardware.
Right now we're at the inflection point in a movement towards large datacenters that use economies of scale and good engineering to attain high efficiencies. When this infrastructure is exposed at a low level, its use is metered in ways similar to other utilities such as electricity and water. EC2 is the canonical example.
I believe that CPU time and storage will eventually become a true commodity, traded on open markets, like coal, oil, or in some regions of the world, electricity. The barriers to this happening today are many: lack of standard APIs and bandwidth- or network-related lock-in are two worth mentioning. However, you can see foreshadowing of this in Amazon's spot instances feature, which uses a model close to how real commodities are priced. (Aside: Jonathan Schwartz was posting about this back in 2005.)
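Whatever Amazon's internal mechanism actually is, the model spot pricing resembles is a uniform-price auction: sort the bids, fill the spare capacity from the highest bid down, and the lowest accepted bid sets the price that every accepted bidder pays. A toy sketch of that clearing step:
import java.util.Arrays;

public class SpotMarketToy {
    // Uniform-price auction clearing: fill capacity from the highest bid down;
    // the lowest accepted bid sets the price every accepted bidder pays.
    // Returns -1 if demand doesn't exhaust capacity (no scarcity; a floor price would apply).
    public static double clearingPrice(double[] bidsPerInstanceHour, int capacity) {
        double[] bids = bidsPerInstanceHour.clone();
        Arrays.sort(bids); // ascending
        if (bids.length <= capacity) {
            return -1;
        }
        return bids[bids.length - capacity]; // the marginal (lowest) accepted bid
    }

    public static void main(String[] args) {
        double[] bids = {0.04, 0.10, 0.03, 0.08, 0.12, 0.05};
        // Three instance-slots free: the bids 0.12, 0.10 and 0.08 are filled, all at 0.08.
        System.out.println(clearingPrice(bids, 3)); // prints 0.08
    }
}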
In an open market, building a 'compute power station' becomes an investment whose return on capital would be linked to the price of CPU time & storage in that market. The laws of supply and demand would govern this price as they do any other. For example, one can imagine that just as CO2 emissions dip during an economic downturn, CPU time would also be cheaper as less work is done.
In addition to this, if Moore's law continues to hold, newer facilities would host ever-faster, ever-more-efficient hardware. This would normally push the price of CPU time inexorably downwards, and make investing in any datacenter a bad idea. As a counterpoint, we can see that today's hosting market is relatively liquid on a smaller scale, and people still build normal datacenters. Applying Jevons paradox, however, goes further, indicating that as efficiency increases, demand will also increase. Software expands to fill the space available.
Third: looking back
I think a closer look at recent history will help to shed light on the coming market in utility computing. Two subjects in particular might be useful to study.
Coal was, in Britain, a major fuel of the last industrial revolution. From 1700 to 1960, production volume increased exponentially [PDF, page 34] and the 'consumer' price fell in real terms by 40%, mainly due to decreased taxes and transportation costs. At the same time, however, production prices rose by 20%. In his book The Coal Question, Jevons posed a question about coal that we may find familiar: for how long can supply continue to increase exponentially?
However, the parallels only go so far. Coal mining technology only progressed iteratively, with nothing like Moore's law behind it - coal production did peak eventually. The mines were controlled by a cartel, the "Grand Allies", who kept production prices relatively stable by limiting over-production. Today we have anti-trust laws to prevent that from happening.
Lastly, the cost structure of the market was different: production costs were never more than 50% of the consumer price, whereas the only cost between the producer and the consumer of CPU time is bandwidth. Bandwidth is getting cheaper all the time, although maybe not at a rate sufficient for it to remain practically negligible throughout an exponential increase in demand for utility computing.
Electricity, as Jon Schwartz noted in the blog posts linked to above, and as Michael Manos noted here, started out as the product of small, specialised plants and generators in the basements of large buildings, before transforming rapidly into the current grid system, complete with spot prices etc. Giants like GE were created in the process.
As with electricity, it makes sense to use CPU power as close to you as possible. For electricity there are engineering limits concerning long distance power transmission; on networks, latency increases the further away you go. There are additional human constraints. Many businesses prefer to deal with suppliers in their own jurisdictions for tax reasons and for easier access to legal recourse. Plenty of valuable data may not leave its country of origin. For example, a medical institution may not be permitted to transfer its patient data abroad.
For me, this indicates that there would be room at a national level for growth in utility computing. Regional players may spring up, differentiating themselves simply through their jurisdiction or proximity to population centers. To enable rapid global build-out, players could set up franchise operations, albeit with startup costs and knowledge-bases a world away from the typical retail applications of the business model.
Just as with the datacenter construction business, building out the fledgling electricity grid was capital intensive. Thomas Edison's company was allied with the richest and most powerful financier of the time, J.P. Morgan, and grew to become General Electric. In contrast, George Westinghouse, who built his electricity company on credit and arguably managed it better, didn't have Wall St. on his side and so lost control of his company in an economic crisis.
Finally: the questions this leaves us with
It's interesting to note that two of the companies that are currently ahead in utility computing - Google and Microsoft - have sizable reserves measured in billions. With that kind of cash, external capital isn't necessary in the way it was for GE and Westinghouse. But neither of them, nor any other player, seems to be building datacenters at a rate comparable to that of power station construction during the electrification of America. Are their current rates going to take off in future? If so, how will they finance it?
Should investors pour big money into building utility datacenters? How entrenched is the traditional hosting business, and will someone eat their lunch by diving into this? Clay Shirky compared a normal web-hosting environment to one run by AT&T - will a traditional hosting business have the same reaction as AT&T did to him?
A related question is the size of the first-mover advantage - are Google, Amazon, Microsoft and Rackspace certain to dominate this business? I think this depends on how much lock-in they can create. Will the market start demanding standard APIs and fighting against lock-in, and if so, when? Looking at the adoption curves of other technologies, like the Web, should help to answer this question. Right now the de facto standard APIs, such as those of EC2 and Rackspace, can be easily cloned, but will this change?
I'm throwing these questions out here because I don't have the answers. But the biggest conclusion that I think can be tentatively drawn from applying Jevons paradox to the 'resources' of CPU time and storage space is that in the coming era of utility computing, there may soon be a business case for building high-efficiency datacenters and exposing them to the world at spot prices from day one.
Labels: cloudcomputing, commoditisation, ec2, software, utilitycomputing
2010/02/06
EC2 and CXF: Serialising objects in JAX-WS
The second problem I had with CXF (or, more correctly, JAXB) was in trying to serialise JAXB objects such as CreateVolumeType into XML using a copy of JAXBuddy from Typica.
This failed with the error message: "Unable to marshal type XXX as an element because it is missing an @XmlRootElement annotation". Searching for this error message led me to this blog post by Kohsuke Kawaguchi, and so I copied the sample configuration from the comment into my binding file. This didn't work, and the error message remained the same - the configuration didn't take effect.
Googling again, I found this CXF bug on serialisation and "simple" mode. The information in this issue, combined with the wsdl-to-java documentation, gave me what I needed to correct the binding file.
As a humorous side-effect this also changed a lot of class names and broke a lot of code, but the new class-names are an improvement, so there we go.
My updated sample binding file is on pastebin here. Also, my original post about CXF is messed up in some way and I'll get round to fixing it soon.
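For completeness: another way around a missing @XmlRootElement, without touching the binding file, is to wrap the object in a JAXBElement before marshalling, so the marshaller is told the element name explicitly. A minimal sketch - the package and element namespace here are assumptions based on the 2009-10-31 version of the EC2 WSDL used elsewhere on this blog:
import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Marshaller;
import javax.xml.namespace.QName;
import com.amazonaws.ec2.doc._2009_10_31.CreateVolumeType;

public class MarshalWithoutRootElement {
    public static String toXml(CreateVolumeType request) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(CreateVolumeType.class);
        Marshaller m = ctx.createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        // The generated class has no @XmlRootElement, so give the marshaller an
        // element name explicitly by wrapping the object in a JAXBElement.
        JAXBElement<CreateVolumeType> wrapped = new JAXBElement<CreateVolumeType>(
                new QName("http://ec2.amazonaws.com/doc/2009-10-31/", "CreateVolume"),
                CreateVolumeType.class, request);
        StringWriter out = new StringWriter();
        m.marshal(wrapped, out);
        return out.toString();
    }
}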
2009/12/27
Building a WS-Security enabled SOAP client in Maven2 to the EC2 WSDL using JAX-WS / CXF & WSS4J: tips & tricks
Generating a Java client from the Amazon EC2 WSDL that correctly uses WS-Security is not completely straightforward. This blog post from Glen Mazza contains pretty much all the info you need, but as usual there are many things to trip over along the way. So, without further ado, my contribution.
My setup: I was using Maven2 to construct a JAR file. Running "mvn generate-sources" then downloads the WSDL and uses it to generate the EC2 object model in src/main/java.
Blogger doesn't like me quoting XML, so I've put my sample POM at pastebin, here. Inside the cxf-codegen-plugin configuration you'll see two specific options: "autoNameResolution", which is needed to prevent naming conflicts within the WSDL, and a link to the JAXB binding file for JAX-WS, which is needed to generate the correct method signatures.
Once this is done, the security credentials need to be configured. There are some peculiarities:
As laid out in this tutorial for the Amazon product advertising API, the X.509 certificate and the private key need to be converted into a pkcs12-format file before they're usable in Java. This is done using OpenSSL:
openssl pkcs12 -export -name amaws -out aws.pkcs12 -in cert-BLABLABLA.pem -inkey pk-BLABLABLA.pem
At this point, I should admit that I spent hours scratching my head because the generated client (see below) gave me the error "java.io.IOException: DER length more than 4 bytes" when trying to read the PKCS12 file. So I switched to the Java Keystore format by using this command (JDK6 format):
keytool -v -importkeystore -srckeystore aws.pkcs12 -srcstoretype pkcs12 -srcalias amaws -srcstorepass password -deststoretype jks -deststorepass password -destkeystore keystore.jks
...and then received the error "java.io.IOException: Invalid keystore format" instead. At this point I googled a bit, and discovered two ways to verify the integrity of keystores, via OpenSSL and the Java keytool:
#for pkcs12
openssl pkcs12 -in aws.pkcs12 -info
#for keystore
keytool -v -list -storetype jks -keystore keystore.jks
Both the keystore and pkcs12 file were valid. Then, I realised that I'd put the files in src/test/resources, which was being put through a filter before landing in "target". The filter was doing something to the files, so of course they couldn't be read properly. Duh me. I put the key material in a dedicated folder with no filtering and this problem was fixed.
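Incidentally, the same sanity check can be done from Java, which reads the files exactly as the test will and so would also have flagged the resource-filtering problem. A minimal sketch, using the file names, alias and password from the commands above:
import java.io.FileInputStream;
import java.security.KeyStore;

public class KeystoreSanityCheck {
    public static void main(String[] args) throws Exception {
        // Load the pkcs12 file produced by openssl above.
        KeyStore p12 = KeyStore.getInstance("PKCS12");
        p12.load(new FileInputStream("aws.pkcs12"), "password".toCharArray());
        System.out.println("pkcs12 has key entry for amaws: " + p12.isKeyEntry("amaws"));

        // And the JKS keystore produced by keytool above.
        KeyStore jks = KeyStore.getInstance("JKS");
        jks.load(new FileInputStream("keystore.jks"), "password".toCharArray());
        System.out.println("jks has key entry for amaws: " + jks.isKeyEntry("amaws"));
    }
}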
My next problem was the exception "java.io.IOException: exception decrypting data - java.security.InvalidKeyException: Illegal key size". This was solved by downloading the "Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files". Simple!
At this point the request was being sent to Amazon! Which then returned a new error message, "Security Header Element is missing the timestamp element". This was because the request didn't have a timestamp. So, I changed the action to TIMESTAMP+SIGNATURE (as seen in the below code sample), at which point I got a new error message: "Timestamp must be signed". This I fixed by setting a custom SIGNATURE_PARTS property also as below.
Finally, once this was all done, and everything was signed, Amazon gave me back the message "AWS was not able to authenticate the request: access credentials are missing". This is exactly the same error that you get when nothing is signed at all, which needless to say is somewhat ambiguous.
At this point I decided that I'd really like to see what was being sent over the wire. The WSDL specifies the port address with an HTTPS URL. However, I had saved the WSDL locally, and changing the URL to HTTP made the result inspectable with the inestimable Wireshark. Despite the request being sent in HTTP, not HTTPS, it was still executed. According to the docs, this should not be!
Anyway, once I was looking at the bytes, I saw that the certificate was only being referred to, not included as specified in the AWS SOAP documents, in this case for SDB. This was fixed by setting the SIG_KEY_ID (key identifier type) property to "DirectReference", which includes the certificate in the request.
...and then it worked. Oh Frabjous Day, Callooh, Callay! The final testcase code that I used is more or less as follows:
package net.ex337.postgrec2.test;
import com.amazonaws.ec2.doc._2009_10_31.AmazonEC2;
import com.amazonaws.ec2.doc._2009_10_31.AmazonEC2PortType;
import com.amazonaws.ec2.doc._2009_10_31.DescribeInstancesType;
import junit.framework.TestCase;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.UnsupportedCallbackException;
import org.apache.cxf.endpoint.Client;
import org.apache.cxf.frontend.ClientProxy;
import org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor;
import org.apache.ws.security.WSPasswordCallback;
import org.apache.ws.security.handler.WSHandlerConstants;
/**
*
* @author Ian
*
*/
public class Testcase_CXF_EC2 extends TestCase {
public void test_01_DescribeInstances() throws Exception {
AmazonEC2PortType port = new AmazonEC2().getAmazonEC2Port();
Client client = ClientProxy.getClient(port);
org.apache.cxf.endpoint.Endpoint cxfEndpoint = client.getEndpoint();
Map<String, Object> outProps = new HashMap<String, Object>();
//the order is important, apparently. Both must be present.
outProps.put(WSHandlerConstants.ACTION, WSHandlerConstants.TIMESTAMP+" "+WSHandlerConstants.SIGNATURE);
//this is the configuration that signs both the body and the timestamp
outProps.put(WSHandlerConstants.SIGNATURE_PARTS,
"{Element}{http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd}Timestamp;"+
"{}{http://schemas.xmlsoap.org/soap/envelope/}Body");
//alias, password & properties file for actual signature.
outProps.put(WSHandlerConstants.USER, "amaws");
outProps.put(WSHandlerConstants.PW_CALLBACK_CLASS, PasswordCallBackHandler.class.getName());
outProps.put(WSHandlerConstants.SIG_PROP_FILE, "client_sign.properties");
//necessary to include the certificate in the request
outProps.put(WSHandlerConstants.SIG_KEY_ID, "DirectReference");
cxfEndpoint.getOutInterceptors().add(new WSS4JOutInterceptor(new HashMap<String, Object>(outProps)));
//sample request.
DescribeInstancesType r = new DescribeInstancesType();
System.out.println(port.describeInstances(r));
}
//simple callback handler with the password.
public static class PasswordCallBackHandler implements CallbackHandler {
private Map<String, String> passwords = new HashMap<String, String>();
public PasswordCallBackHandler() {
passwords.put("amaws", "password");
}
@Override
public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
for (int i = 0; i < callbacks.length; i++) {
WSPasswordCallback pc = (WSPasswordCallback) callbacks[i];
// look up the keystore password for the alias WSS4J is asking about
String pass = passwords.get(pc.getIdentifier());
if (pass != null) {
pc.setPassword(pass);
}
}
}
}
}
provider="org.apache.ws.security.components.crypto.Merlin" type="pkcs12" password="password" alias="amaws" file="aws.pkcs12" href="http://s3.amazonaws.com/ec2-downloads/ec2.wsdl">http://s3.amazonaws.com/ec2-downloads/ec2.wsdl.
[I think I mangled something here, will fix it soon]
At this point, the method signatures of the generated port abruptly changed to something else, because I forgot to change the wsdlLocation in the JAXB binding file. Once I fixed this, it worked again.
Some thoughts:
1) Were I publishing a library for general use in accessing AWS, I would probably not use the direct "symlink" above that always points to the latest version of the WSDL. Instead, I would link deliberately to each version, and in that way generate ports for every version of the WSDL, thus ensuring backwards compatibility.
2) I find it inelegant to have to specify the WSDL location in two places (the POM and the binding file), so I'd like to try passing the binding file through a filter, using a ${variable} in both places that refers to a property in the POM.
3) I likewise find it confusing that the password for the keystore is used in two places: firstly in client_sign.properties, and secondly in the CallbackHandler that is invoked from within the bowels of the WSS4JOutInterceptor. In the code above this is obviously duplicated data; in the final 'production' version of this code I expect to have the data centralised and the code prettified around it.
2009/12/03
EC2 upgrades again
At lunch today, I read that EC2 can now boot off EBS images, something that simplifies the whole AMI thing and brings it up to speed with Rackspace on the ease-of-deployment front. However, two points:
- EC2 is still the clear loser in price-performance, and charging for I/O to the root partition won't help. More specifically, when will EBS I/O become consistent? Probably this has a lot to do with being popular and dealing with shared resources at the lower end; see this HN thread.
- My next question is, how does this affect the attack surface of EC2? Can the work done in the "Get off my Cloud!" paper be expanded on?
2009/11/14
Using the EC2 API: console output blank, connection refused, socket timeout, etc.
Hello there. As usual, there's a world of difference between the conceptual usage of an API and its real-world, practical usage. In my adventures I'm stubbornly, block-headedly not interested in using EC2 via anything except the API, i.e. no command-line tools or management console (except for debugging), and so I intend to be able to create my images etc. in a test harness, for reasons to be enumerated anon. Anyway, the following may be useful to people writing code that uses the EC2 API for the first time:
When booting an instance, you cannot assume that once it is "running", SSH will be serving on port 22, nor can you assume that the console output is there. So if you want to SSH into your instance, first poll the instance state, and once it's "running", poll for the console output. Once the console output appears, it's complete, so you can retrieve the SSH host-key fingerprints from it and go on from there.
This is good to know if one wants to sling instances around, but I find it slightly incongruous that I'm being charged for about a minute of time on a machine I can't access yet. Of course, from Amazon's perspective, the instant (har) I'm blocking a slot on a server it's chargeable, so it makes sense from their side I guess.
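In code, that polling loop looks something like the sketch below, written against the JAX-WS client generated from the 2009-10-31 WSDL (as in the WS-Security post above). The two private helpers are hypothetical stand-ins for digging through the DescribeInstances and GetConsoleOutput response types:
import com.amazonaws.ec2.doc._2009_10_31.AmazonEC2PortType;

public class WaitForInstance {

    // Polls until the instance is "running" and its console output is available,
    // then returns the console output (which contains the SSH host key fingerprints).
    public static String waitForConsole(AmazonEC2PortType port, String instanceId)
            throws InterruptedException {
        // 1. Wait for the instance to reach the "running" state.
        while (!"running".equals(instanceState(port, instanceId))) {
            Thread.sleep(5000);
        }
        // 2. Even when "running", sshd and the console output may not be up yet,
        //    so keep polling until output appears. Once it's there, it's complete.
        String console;
        while ((console = consoleOutput(port, instanceId)) == null || console.length() == 0) {
            Thread.sleep(5000);
        }
        return console;
    }

    // Hypothetical helper: call port.describeInstances(...) and dig the state name
    // out of the returned reservation/instance sets.
    private static String instanceState(AmazonEC2PortType port, String instanceId) {
        throw new UnsupportedOperationException("sketch only");
    }

    // Hypothetical helper: call the GetConsoleOutput operation and base64-decode the result.
    private static String consoleOutput(AmazonEC2PortType port, String instanceId) {
        throw new UnsupportedOperationException("sketch only");
    }
}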
2009/10/16
EC2: now having actually played with it a *little*...
So, in place of my previous bloviation on the subject, unfettered by the weight of experience, a couple of somewhat-more-tempered comments follow:
- Using the command-line tools is slow. They're shifting gigs of data around at the touch of a button, but hey, it's a UX thing.
- In terms of actual tools, the options seem to be:
- Go with the command-line tools and a bunch of bash scripts
- Go with a (generally) half-baked third-party API, with its own idiosyncrasies built in, and the traditional lack of documentation OSS projects feel they can get away with.
- (My inevitable option) download the WSDLs & use something to generate your own API in whatever language. Regenerate it whenever the API changes.
- This choice is especially acute since I'm not intending, ultimately, to have to do anything by hand - so programming things properly to start with seems like the only sensible option.
- Consistent I/O on EBS is apparently not an option. This is something I think Amazon should fix tout de suite, because things like Rackspace (maybe) and NewServers (h.t. etbe) seem to be stomping all over the EBS I/O figures. In a different context, James Hamilton says "it makes no sense to allow a lower cost component impose constraints on the optimization of a higher cost component", and assuming that the servers are the expensive part, this is what (IMHO) may make using RDBMSs on EC2 a bit of a PIA long-term.
2008/09/24
State of play in EC2-based database hosting
Oracle recently blinked and decided to support their DB and some other stuff on EC2. Reading the actual terms, though, assuming I've understood them correctly, they haven't actually done anything other than map EC2 virtual cores onto CPU sockets and let the normal rates apply. That's IT. What this means is that running 100 Oracle servers for one hour is still 100 times more expensive (in licensing costs) than running one Oracle server for 100 hours.
That's not cloud licensing. Cloud computing works on the premise that whether you use one, 10, or 100 CPUs, you pay per CPU-hour, no more no less. That's what DevPay does. The only problem is that in this scenario, software is a commodity, which I imagine doesn't sit too well with Oracle.
Virtualisation has been around for decades, but only once FLOSS commoditised the server operating system 'ecosystem' did it become possible to do things on the scale that Amazon are doing. Back in 2005 I had a Linux VM with the inestimable Bytemark, and it was plain for all with eyes to see that virtualisation was going to pull the floor out from under the server hosting market once players had found the right way to leverage the economies of scale. Right now, that's being done behind closed doors by the big players, but Amazon are the first to have thrown open the doors to the unwashed masses, and that's why I like them so much.
(To repeat, I don't own stock - maybe I should :-)
To get back to the point: how do databases fare on Amazon EC2? EBS has only been around for a couple of weeks, and before that, DB hosting on EC2 meant risking everything to a block device that could go *poof* at any moment, which wasn't exactly pleasant.
This is something which will remain up in the air until someone with serious [PostGre/My]SQL-fu takes some AMIs, configures them just so, and benchmarks them. We know, right now, that on a small instance, disk throughput tops out at roughly 100 MB/s on a three-volume RAID 0 setup. I'm interested in seeing the speeds for EC2 and EBS on larger instances.
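For the raw-throughput half of that, even a crude single-threaded sequential-write test gives a first number to compare against before involving a database. A minimal sketch (the mount point for the RAID 0 EBS array is made up, and a serious benchmark would use something like bonnie++ or iozone instead):
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class SequentialWriteTest {
    public static void main(String[] args) throws IOException {
        byte[] chunk = new byte[1024 * 1024]; // 1 MB
        Arrays.fill(chunk, (byte) 42);
        int chunks = 1024; // write 1 GB in total
        long start = System.nanoTime();
        FileOutputStream out = new FileOutputStream("/vol/raid0/throughput-test.dat");
        try {
            for (int i = 0; i < chunks; i++) {
                out.write(chunk);
            }
            out.getFD().sync(); // flush to the device so we measure the disk, not the page cache
        } finally {
            out.close();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d MB in %.1f s = %.1f MB/s%n", chunks, seconds, chunks / seconds);
    }
}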
Moving on from pure throughput, how do PostgreSQL & MySQL stack up on these setups? Do their respective caching mechanisms etc. work with or against this strange new environment? Enquiring minds want to know!
2008/08/30
Thoughts on Amazon EC2/EBS and the "cloud computing" bandwagon
Amazon EBS is something that I've been waiting for ever since EC2 was announced. Dare Obasanjo rightly pegged it as the final piece of the puzzle.
Right now, "cloud computing" is the buzzword of the moment. This doesn't help create clarity when discussing exactly what it is. So, my definition is:
But nobody has what Amazon has. Want 500 servers for an hour? There's no place else to go but Amazon. So, despite all the hype about cloud computing, right now there's only one real market player that has a developed, mature product, and that is Amazon. Nobody else even comes close, and that includes Google.
(And no, I don't own shares.)
I was reading an article about how MS are basically building their new datacenters around shipping containers full of server kit that were constructed directly by manufacturers in China or thereabouts. I can almost guess they've built a standard umbilical cable & docking mechanism for the containers for power, bandwidth & airco. Roboticise the docking/undocking, then all you'd have to do is have a small control center and a foreman to operate the gantry.
To get all hand-wavey, sci-fi and "thereof I cannot speak with clue" for a minute, assuming the containers are airtight, and given that they'll never be open to humans until it's time to scrap/recycle them, why not look at using CO2 as a coolant instead of standard air conditioning? Automatic fire suppression comes free. And given that CO2 is still a gas at -70°C, why not overclock your CPUs to increase your ROI (of course, the energy you spend on cooling and the reduced lifespan of your CPUs due to overclocking is an opposite factor).
If you scrub the CO2 from the atmosphere, and dispose of it safely, you may even make your datacenter carbon neutral and reap tax credits as an additional benefit... (coming soon to a cognizant country near you.)
Fna fna fna.
Right now, "cloud computing" is the buzzword of the moment. This doesn't help create clarity when discussing exactly what it is. So, my definition is:
- an API to dynamically start, stop & manage instances
- per-CPU-hour and per-GB billing
But nobody has what Amazon has. Want 500 servers for an hour? There's no place else to go but Amazon. So, despite all the hype about cloud computing, right now there's only one real market player that has a developed, mature product, and that is Amazon. Nobody else even comes close, and that includes Google.
(And no, I don't own shares.)
I was reading an article about how MS are basically building their new datacenters around shipping containers full of server kit that were constructed directly by manufacturers in China or thereabouts. I can almost guess they've built a standard umbilical cable & docking mechanism for the containers for power, bandwidth & airco. Roboticise the docking/undocking, then all you'd have to do is have a small control center and a foreman to operate the gantry.
To get all hand-wavey, sci-fi and "thereof I cannot speak with clue" for a minute, assuming the containers are airtight, and given that they'll never be open to humans until it's time to scrap/recycle them, why not look at using CO2 as a coolant instead of standard air conditioning? Automatic fire suppression comes free. And given that CO2 is still a gas at -70°C, why not overclock your CPUs to increase your ROI (of course, the energy you spend on cooling and the reduced lifespan of your CPUs due to overclocking is an opposite factor).
If you scrub the CO2 from the atmosphere, and dispose of it safely, you may even make your datacenter carbon neutral and reap tax credits as an additional benefit... (coming soon to a cognizant country near you.)
Fna fna fna.