Proxying requests to Google Cloud Storage using the Hadoop GCS connector through an Envoy proxy
Introduction:
The Hadoop GCS connector is widely used in the community to connect on-premise Hadoop clusters and related services to Google Cloud Storage: https://github.com/GoogleCloudDataproc/hadoop-connectors
Recently I have been exploring how to use Envoy proxy (https://www.envoyproxy.io/docs/envoy/latest/) as a front proxy, so that all requests flow through Envoy and can also be monitored by an external HTTP service before being forwarded to the Google storage servers. In this post I am going to walk through this Envoy proxy setup end to end. I will leave the monitoring flow via the external HTTP service for my next post :).
Prerequisites:
- Before we start, I am assuming that we have a Hadoop client/cluster from which we are able to access our buckets in Google Cloud Storage. Essentially, we are able to run commands like the one below successfully:
hadoop fs -ls gs://your_bucket_name/
- Or, if we are not using the Hadoop connector, we should be able to access GCS buckets directly with HTTP requests, for example:
curl https://storage.googleapis.com:443/storage/v1/b/gcp-public-data-landsat
- Also, we know how to run the Envoy server locally using a sample config file. For more info, follow the Envoy docs.
Steps:
Run the Envoy proxy with the below config file using the command:
envoy -c config.yaml -l debug
We can also change the logging level using -l debug/trace to log the requests and responses on the Envoy server.
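Here is a minimal config.yaml sketch for this setup, based on Envoy's v3 static configuration examples. The listener port (10000), the cluster name (service_google) and the host_rewrite_literal to storage.googleapis.com match what is explained below; the other names (listener_0, gcs_proxy, the connect timeout) are just illustrative, so verify the fields against the Envoy version you are running.

static_resources:
  listeners:
  - name: listener_0
    # Local HTTP listener that the Hadoop client points at (localhost:10000)
    address:
      socket_address: { address: 127.0.0.1, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: gcs_proxy
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route:
                  # Rewrite the :authority/Host header before forwarding upstream
                  host_rewrite_literal: storage.googleapis.com
                  cluster: service_google
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: service_google
    connect_timeout: 30s
    type: LOGICAL_DNS
    dns_lookup_family: V4_ONLY
    load_assignment:
      cluster_name: service_google
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: storage.googleapis.com, port_value: 443 }
    # TLS towards the Google storage servers (upstream is HTTPS on 443)
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        sni: storage.googleapis.com

In this sketch, the transport_socket section enables TLS towards storage.googleapis.com on port 443, while the proxy itself accepts plain HTTP on localhost:10000.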
Some explanation points for the above configuration:
- The Envoy config consists of a list of listeners/filters and clusters. Filters are for accepting requests from the downstream client/server (here the Hadoop client), and clusters are for forwarding the requests to the upstream server (here the Google storage servers).
- This will start an HTTP proxy server at address localhost:10000.
- All requests directed towards localhost:10000 will have their host header rewritten based on the host_rewrite_literal config and then be forwarded to the specific cluster chosen by the routes' match rules.
- Once the requests are rewritten, they will be sent to the Google storage servers.
To configure the proxy on the connector side, we have to set the parameter fs.gs.storage.root.url and then run the command below. This makes all the storage requests flow through the proxy, but the auth requests will still go directly to the Google OAuth servers.
hadoop fs -Dfs.gs.storage.root.url=http://localhost:10000 -ls gs://your_bucket/
Here is the Envoy log snippet:
As we can see from the logs, the flow is:
- The storage calls first came to Envoy:
':authority', 'localhost:10000'
':path', '/storage/v1/b/my_bucket'
- And then they got rewritten and forwarded to the Google storage servers:
cluster 'service_google' match for URL '/storage/v1/b/my_bucket'
':authority', 'storage.googleapis.com'
':path', '/storage/v1/b/my_bucket'
The same flow can be verified without Hadoop by sending the request to the proxy with curl or Postman, e.g. curl http://localhost:10000/storage/v1/b/gcp-public-data-landsat