
Securing Apache Hive - part II

This is the second post in a series of articles on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. In this post we will show how to add authorization to the previous example using Apache Ranger.

1) Install the Apache Ranger Hive plugin

If you have not done so already, please follow the first post to install and configure Apache Hadoop and Apache Hive. Next download Apache Ranger and verify that the signature is valid and that the message digests match. Due to some bugs that were fixed for the installation process, I am using version 1.0.0-SNAPSHOT in this post. Now extract and build the source, and copy the resulting plugin to a location where you will configure and install it:
  • mvn clean package assembly:assembly -DskipTests
  • tar zxvf target/ranger-1.0.0-SNAPSHOT-hive-plugin.tar.gz
  • mv ranger-1.0.0-SNAPSHOT-hive-plugin ${ranger.hive.home}
Now go to ${ranger.hive.home} and edit "install.properties". You need to specify the following properties:
  • POLICY_MGR_URL: Set this to "http://localhost:6080"
  • REPOSITORY_NAME: Set this to "cl1_hive".
  • COMPONENT_INSTALL_DIR_NAME: The location of your Apache Hive installation
Save "install.properties" and install the plugin as root via "sudo -E ./enable-hive-plugin.sh". The Apache Ranger Hive plugin should now be successfully installed. Make sure that the default policy cache for the Hive plugin '/etc/ranger/cl1_hive/policycache' is readable by the user who is running the Hive server. Then restart the Apache Hive server to enable the authorization plugin.

2) Create authorization policies in the Apache Ranger Admin console

Next we will use the Apache Ranger admin console to create authorization policies for Apache Hive. Follow the steps in this tutorial to install the Apache Ranger admin service. Start the Ranger admin service via 'sudo ranger-admin start' and open a browser at 'http://localhost:6080', logging on with the credentials 'admin/admin'. Click the "+" button next to the "HIVE" logo and enter the following properties:
  • Service Name: cl1_hive
  • Username/Password: admin
  • jdbc.url: jdbc:hive2://localhost:10000
Note that "Test Connection" won't work as the "admin" user will not have the necessary authorization to invoke on Hive at this point. Click "Add" to create the service. If you have not done so in a previous tutorial, click on "Settings" and then "Users/Groups" and add two new users called "alice" and "bob", who we will use to test authorization. Then go back to the newly created "cl1_hive" service, and click "Add new policy" with the following properties:
  • Policy Name: SelectWords
  • database: default
  • table: words
  • Hive column: *
Then under "Allow Conditions", give "alice" the "select" permission and click "Add".


3) Test authorization with Apache Hive

Once our new policy has synced to '/etc/ranger/cl1_hive/policycache' we can test authorization in Hive. The user 'alice' can query the table according to our policy:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n alice
  • select * from words where word == 'Dare'; (works)
However, the user 'bob' is denied access:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n bob
  • select * from words where word == 'Dare'; (fails)

Securing Apache Hive - part III

This is the third in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. In this post we will extend the authorization scenario by showing how Apache Ranger can be used to create policies to both mask and filter data returned in the Hive query.

1) Data-masking with Apache Ranger

As a pre-requisite to this tutorial, please follow the previous post to set up Apache Hive and to enforce an authorization policy for the user "alice" using Apache Ranger. Now let's imagine that we would like "alice" to be able to see the "counts", but not the actual words themselves. We can create a data-masking policy in Apache Ranger for this. Open a browser and log in at "http://localhost:6080" using "admin/admin" and click on the "cl1_hive" service that we have created in the previous tutorial.

Click on the "Masking" tab and add a new policy called "WordMaskingPolicy", for the "default" database, "words" table and "word" column. Under the mask conditions, add the user "alice" and choose the "Redact" masking option. Save the policy and wait for it to by synced over to Apache Hive:


Now try to login to beeline as "alice" and view the first five entries in the table:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n alice
  • select * from words LIMIT 5;
You should see that the characters in the "word" column have been masked (replaced by "x"s).



2) Row-level filtering with Apache Ranger 

Now let's imagine that we are happy for "alice" to view the "words" in the table, but that we would like to restrict her to words that start with a "D". The previous "access" policy we created for her allows her to view all "words" in the table. We can do this by specifying a row-level filter policy. Click on the "Masking" tab in the UI and disable the policy we created in the previous section.

Now click on the "Row-level Filter" tab and create a new policycalled "AliceFilterPolicy" on the "default" database, "words" table. Add a Row Filter condition for the user "alice" with row filter "word LIKE 'D%'". Save the policy and wait for it to by synced over to Apache Hive:


Now try to login to beeline as "alice" as above. "alice" can successfully retrieve all entries where the words start with "D", but no other entries via:
  • select * from words where word like 'D%';

Integrating JSON Web Tokens with Kerberos using Apache Kerby

JSON Web Tokens (JWTs) are a standard way of encapsulating a number of claims about a particular subject. Kerberos is a long-established and widely-deployed SSO protocol, used extensively in the Big-Data space in recent years. An interesting question is to examine how a JWT could be used as part of the Kerberos protocol. In this post we will consider one possible use-case, where a JWT is used to convey additional authorization information to the kerberized service provider.

This use-case is based on a document available at HADOOP-10959, called "A Complement and Short Term Solution to TokenAuth Based on Kerberos Pre-Authentication Framework", written by Kai Zheng and Weihua Jiang of Intel (also see here).

1) The test-case

To show how to integrate JWTs with Kerberos we will use a concrete test-case available in my github repo here:
  • cxf-kerberos-kerby: This project contains a number of tests that show how to use Kerberos with Apache CXF, where the KDC used in the tests is based on Apache Kerby
The test-case relevant to this blog entry is the JWTJAXRSAuthenticationTest. Here we have a trivial "double it" JAX-RS service implemented using Apache CXF, which is secured using Kerberos. An Apache Kerby-based KDC is launched which the client code uses to obtain a service ticket using JAAS (all done transparently by CXF), which is sent to the service code as part of the Authorization header when making the invocation.

So far this is just a fairly typical example of a kerberized web-service request. What is different is that the service configuration requires a level of authorization above and beyond the kerberos ticket, by insisting that the user must have a particular role to access the web service. This is done by inserting the CXF SimpleAuthorizingInterceptor into the service interceptor chain. An authenticated user must have the "boss" role to access this service. 

So we need somehow to convey the role of the user as part of the kerberized request. We can do this using a JWT as will be explained in the next few sections.

2) High-level overview of JWT use-case with Kerberos
 
As stated above, we need to convey some additional claims about the user to the service. This can be done by including a JWT containing those claims in the Kerberos service ticket. Let's assume that the user is in possession of a JWT that is issued by an IdP that contains a number of claims relating to that user (including the "role" as required by the service in our test-case). The token must be sent to the KDC when obtaining a service ticket.

The KDC must validate the token (checking that the signature is correct, that the signing identity is trusted, etc.). The KDC must then extract some relevant information from the token and insert it somehow into the service ticket. The Kerberos spec defines a structure that can be used for this purpose, called AuthorizationData, which consists of a "type" along with some data to be interpreted according to the "type". We can use this structure to insert the encoded JWT as part of the data.

On the receiving side, the service can extract the AuthorizationData structure from the received ticket and parse it accordingly to retrieve the JWT, and obtain whatever claims are desired from this token accordingly.

3) Sending a JWT Token to the KDC

Let's take a look at how the test-case works in more detail, starting with the client. The test code retrieves a JWT for "alice" by invoking on the JAX-RS interface of the Apache CXF STS. The token contains the claim that "alice" has the "boss" role, which is required to invoke on the "double it" service. Now we need to send this token to the KDC to retrieve a service ticket for the "double it" service, with the JWT encoded in the ticket.

This cannot be done by the built-in Java GSS implementation. Instead we will use Apache Kerby. Apache Kerby has been covered extensively on this blog (see for example here). As well as providing the implementation for the KDC used in our test-case, Apache Kerby provides a complete GSS implementation that supports tokens in the forthcoming 1.1.0 release. To use the Kerby GSS implementation we need to register the KerbyGssProvider as a Java security provider.
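As a rough sketch, registering the provider in the client code might look like the following; the provider's package name is an assumption based on the Kerby GSS module, so check it against the release you are using:

    import java.security.Security;

    import org.apache.kerby.kerberos.kerb.gss.KerbyGssProvider;

    // Register the Kerby GSS provider with the highest priority so that it is
    // picked ahead of the built-in SunJGSS provider for GSS-API operations.
    Security.insertProviderAt(new KerbyGssProvider(), 1);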

To actually pass the JWT we got from the STS to the Kerby GSS layer, we need to use a custom version of the CXF HttpAuthSupplier interface. The KerbyHttpAuthSupplier implementation takes the JWT String, and creates a Kerby KrbToken class using it. This class is added to the private credential list of the current JAAS Subject. This way it will be available to the Kerby GSS layer, which will send the token to the KDC using Kerberos pre-authentication as defined in the document which is linked at the start of this post.

4) Processing the received token in the KDC

The Apache Kerby-based KDC extracts the JWT token from the pre-authentication data entry and verifies that it is signed and that the issuer is trusted. The KDC is configured in the test-case with a certificate to use for this purpose, and also with an issuer String against which the issuer of the JWT must match. If there is an audience claim in the token, then it must match the principal of the service for which we are requesting a ticket. 

If the verification of the received JWT passes, then it is inserted into the AuthorizationData structure in the issued service ticket. The type that is used is a custom value defined here, as this behaviour is not yet standardized. The JWT is serialized and added to the data part of the token. Note that this behaviour is fully customizable.

5) Processing the AuthorizationData structure on the service end

After the service successfully authenticates the client, we have to access the AuthorizationData part of the ticket to extract the JWT. This can all be done using the standard Java APIs; Kerby is not required on the receiving side. The standard CXF interceptor for Kerberos is subclassed in the tests to set up a custom CXF SecurityContext using the GssContext. By casting it to an ExtendedGSSContext, we can access the AuthorizationData and hence the JWT. The role claim is then extracted from the JWT and used to enforce the standard "isUserInRole" method of the CXF SecurityContext.

If you are interested in exploring this topic further, please get involved with the Apache Kerby project, and help us to further improve and expand this integration between JWT and Kerberos.

Securing Apache Hive - part IV

This is the fourth in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter data returned in the Hive query.

In this post we will show how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. In the second post, we showed how to create a "resource" based policy for "alice" in Ranger, by granting "alice" the "select" permission for the "words" table. Instead, we can grant a user "bob" the "select" permission for a given "tag", which is synced into Ranger from Apache Atlas. This means that we can avoid managing specific resources in Ranger itself.

1) Start Apache Atlas and create entities/tags for Hive

First let's look at setting up Apache Atlas. Download the latest released version (0.8.1) and extract it. Build the distribution that contains an embedded HBase and Solr instance via:
  • mvn clean package -Pdist,embedded-hbase-solr -DskipTests
The distribution will then be available in 'distro/target/apache-atlas-0.8.1-bin'. To launch Atlas, we need to set some variables to tell it to use the local HBase and Solr instances:
  • export MANAGE_LOCAL_HBASE=true
  • export MANAGE_LOCAL_SOLR=true
Now let's start Apache Atlas with 'bin/atlas_start.py'. Open a browser and go to 'http://localhost:21000/', logging on with credentials 'admin/admin'. Click on "TAGS" and create a new tag called "words_tag". Unlike for HDFS or Kafka, Atlas doesn't provide an easy way to create a Hive Entity in the UI. Instead, we can use the following JSON file to create a Hive Entity for the "words" table that we are using in our example, based on the example given here:
You can upload it to Atlas via:
  • curl -v -H 'Accept: application/json, text/plain, */*' -H 'Content-Type: application/json;  charset=UTF-8' -u admin:admin -d @hive-create.json http://localhost:21000/api/atlas/entities
Once the new entity has been uploaded, then you can search for it in the Atlas UI. Once it is found, then click on "+" beside "Tags" and associate the new entity with the "words_tag" tag.

2) Use the Apache Ranger TagSync service to import tags from Atlas into Ranger

To create tag based policies in Apache Ranger, we have to import the entity + tag we have created in Apache Atlas into Ranger via the Ranger TagSync service. After building Apache Ranger then extract the file called "target/ranger-<version>-tagsync.tar.gz". Edit 'install.properties' as follows:
  • Set TAG_SOURCE_ATLAS_ENABLED to "false"
  • Set TAG_SOURCE_ATLASREST_ENABLED to  "true" 
  • Set TAG_SOURCE_ATLASREST_DOWNLOAD_INTERVAL_IN_MILLIS to "60000" (just for testing purposes)
  • Specify "admin" for both TAG_SOURCE_ATLASREST_USERNAME and TAG_SOURCE_ATLASREST_PASSWORD
Save 'install.properties' and install the tagsync service via "sudo ./setup.sh". Start the Apache Ranger admin service via "sudo ranger-admin start" and then the tagsync service via "sudo ranger-tagsync-services.sh start".

3) Create Tag-based authorization policies in Apache Ranger

Now let's create a tag-based authorization policy in the Apache Ranger admin UI (http://localhost:6080). Click on "Access Manager" and then "Tag based policies". Create a new Tag service called "HiveTagService". Create a new policy for this service called "WordsTagPolicy". In the "TAG" field enter a "w" and the "words_tag" tag should pop up, meaning that it was successfully synced in from Apache Atlas. Create an "Allow" condition for the user "bob" with the "select" permissions for "Hive":
We also need to go back to the Resource based policies and edit "cl1_hive" that we created in the second tutorial, and select the tag service we have created above. Once our new policy (including tags) has synced to '/etc/ranger/cl1_hive/policycache' we can test authorization in Hive. Previously, the user "bob" was denied access to the "words" table, as only "alice" was assigned a resource-based policy for the table. However, "bob" can now access the table via the tag-based authorization policy we have created:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n bob
  • select * from words where word == 'Dare';

Securing Apache Hive - part V

This is the fifth in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter data returned in the Hive query. The fourth post looked at how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. In this post we will look at an alternative authorization solution called Apache Sentry.

1) Build the Apache Sentry distribution

First we will build and install the Apache Sentry distribution. Download Apache Sentry (1.8.0 was used for the purposes of this tutorial). Verify that the signature is valid and that the message digests match. Now extract and build the source and copy the distribution to a location where you wish to install it:
  • tar zxvf apache-sentry-1.8.0-src.tar.gz
  • cd apache-sentry-1.8.0-src
  • mvn clean install -DskipTests
  • cp -r sentry-dist/target/apache-sentry-1.8.0-bin ${sentry.home}
I previously covered the authorization plugin that Apache Sentry provides for Apache Kafka. In addition, Apache Sentry provides an authorization plugin for Apache Hive. For the purposes of this tutorial we will just configure the authorization privileges in a configuration file locally to the Hive Server. Therefore we don't need to do any further configuration to the distribution at this point.

2) Install and configure Apache Hive

Please follow the first tutorial to install and configure Apache Hadoop if you have not already done so. Apache Sentry 1.8.0 does not support Apache Hive 2.1.x, so we will need to download and extract Apache Hive 2.0.1. Set the "HADOOP_HOME" environment variable to point to the Apache Hadoop installation directory above. Then follow the steps as outlined in the first tutorial to create the table in Hive and make sure that a query is successful.

3) Integrate Apache Sentry with Apache Hive

Now we will integrate Apache Sentry with Apache Hive. We need to add three new configuration files to the "conf" directory of Apache Hive.

3.a) Configure Apache Hive to use authorization

Create a file called 'conf/hiveserver2-site.xml' with the content:
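The content of this file is not reproduced here, but a minimal sketch is shown below; the Sentry class name is an assumption based on the Sentry Hive v2 binding, so verify it against the Sentry version you are using:

    <configuration>
        <!-- Enable authorization in HiveServer2 -->
        <property>
            <name>hive.security.authorization.enabled</name>
            <value>true</value>
        </property>
        <!-- Plug in the Sentry authorizer (class name is an assumption) -->
        <property>
            <name>hive.security.authorization.manager</name>
            <value>org.apache.sentry.binding.hive.v2.SentryAuthorizerFactory</value>
        </property>
        <!-- Base authorization decisions on the session user -->
        <property>
            <name>hive.security.authenticator.manager</name>
            <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
        </property>
    </configuration>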
Here we are enabling authorization and adding the Sentry authorization plugin.

3.b) Add Sentry plugin configuration

Create a new file in the "conf" directory of Apache Hive called "sentry-site.xml" with the following content:
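A rough sketch of this file is shown below; apart from the "testing.mode" parameter mentioned below, the property names and values are assumptions based on the Sentry file-based provider, and the path to "sentry.ini" must be adjusted to your installation:

    <configuration>
        <!-- Retrieve groups for authenticated users from the local policy file -->
        <property>
            <name>hive.sentry.provider</name>
            <value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
        </property>
        <!-- Location of the policy file holding the authorization privileges -->
        <property>
            <name>hive.sentry.provider.resource</name>
            <value>file:///path/to/hive/conf/sentry.ini</value>
        </property>
        <!-- Logical name of the Hive server, referenced in sentry.ini -->
        <property>
            <name>hive.sentry.server</name>
            <value>server1</value>
        </property>
        <!-- Required as we are not using Kerberos -->
        <property>
            <name>sentry.hive.testing.mode</name>
            <value>true</value>
        </property>
    </configuration>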
This is the configuration file for the Sentry plugin for Hive. It essentially says that the authorization privileges are stored in a local file, and that the groups for authenticated users should be retrieved from this file. As we are not using Kerberos, the "testing.mode" configuration parameter must be set to "true".

3.c) Add the authorization privileges for our test-case

Next, we need to specify the authorization privileges. Create a new file in the config directory called "sentry.ini" with the following content:
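An illustrative sketch of this policy file is given below; the group and role names are arbitrary, and the privilege string assumes the Hive server is named "server1" as in "sentry-site.xml":

    [users]
    # Map the authenticated user to a local group
    alice = user_group

    [groups]
    # Map the group to a role
    user_group = select_role

    [roles]
    # Grant "select" on the "words" table in the "default" database
    select_role = server=server1->db=default->table=words->action=select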
Here we are granting the user "alice" a role which allows her to perform a "select" on the table "words".

3.d) Add Sentry libraries to Hive

Finally, we need to add the Sentry libraries to Hive. Copy the following files from ${sentry.home}/lib  to ${hive.home}/lib:
  • sentry-binding-hive-common-1.8.0.jar
  • sentry-core-model-db-1.8.0.jar
  • sentry*provider*.jar
  • sentry-core-common-1.8.0.jar
  • shiro-core-1.2.3.jar
  • sentry-policy*.jar
  • sentry-service-*.jar
In addition, we need "sentry-binding-hive-v2-1.8.0.jar", which is not bundled with the Apache Sentry distribution. This can be obtained separately from "http://repo1.maven.org/maven2/org/apache/sentry/sentry-binding-hive-v2/1.8.0/sentry-binding-hive-v2-1.8.0.jar".

4) Test authorization with Apache Hive

Now we can test authorization after restarting Apache Hive. The user 'alice' can query the table according to our policy:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n alice
  • select * from words where word == 'Dare'; (works)
However, the user 'bob' is denied access:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n bob
  • select * from words where word == 'Dare'; (fails)

Securing Apache Hive - part VI

This is the sixth and final blog post in a series of articles on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter data returned in the Hive query. The fourth post looked at how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. The fifth post looked at an alternative authorization solution called Apache Sentry.

In this post we will switch our attention from authorization to authentication, and show how we can authenticate Apache Hive users via kerberos.

1) Set up a KDC using Apache Kerby

A github project that uses Apache Kerby to start up a KDC is available here:
  • bigdata-kerberos-deployment: This project contains some tests which can be used to test kerberos with various big data deployments, such as Apache Hadoop etc.
The KDC is a simple JUnit test that is available here. To run it, just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals for both Apache Hadoop and Apache Hive:
  • hdfs/localhost@hadoop.apache.org
  • HTTP/localhost@hadoop.apache.org
  • mapred/localhost@hadoop.apache.org
  • hiveserver2/localhost@hadoop.apache.org
  • alice@hadoop.apache.org 
Keytabs are created in the "target" folder. Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory.

2) Configure Apache Hadoop to use Kerberos

The next step is to configure Apache Hadoop to use Kerberos. As a pre-requisite, follow the first tutorial on Apache Hive so that the Hadoop data and Hive table are set up before we apply Kerberos to the mix. Next, follow the steps in section (2) of an earlier tutorial on configuring Hadoop with Kerberos that I wrote. Some additional steps are also required when configuring Hadoop for use with Hive.

Edit 'etc/hadoop/core-site.xml' and add:
  • hadoop.proxyuser.hiveserver2.groups: *
  • hadoop.proxyuser.hiveserver2.hosts: localhost
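Expressed as XML, these entries look like the following:

    <!-- Allow the "hiveserver2" service user to impersonate other users -->
    <property>
        <name>hadoop.proxyuser.hiveserver2.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hiveserver2.hosts</name>
        <value>localhost</value>
    </property>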
The previous tutorial on securing HDFS with kerberos did not specify any kerberos configuration for Map-Reduce, as it was not required. For Apache Hive we need to configure Map Reduce appropriately. We will simplify things by using a single principal for the Job Tracker, Task Tracker and Job History. Create a new file 'etc/hadoop/mapred-site.xml' with the following properties:
  • mapreduce.framework.name: classic
  • mapreduce.jobtracker.kerberos.principal: mapred/localhost@hadoop.apache.org
  • mapreduce.jobtracker.keytab.file: Path to Kerby mapred.keytab (see above).
  • mapreduce.tasktracker.kerberos.principal: mapred/localhost@hadoop.apache.org
  • mapreduce.tasktracker.keytab.file: Path to Kerby mapred.keytab (see above).
  • mapreduce.jobhistory.kerberos.principal:  mapred/localhost@hadoop.apache.org
  • mapreduce.jobhistory.keytab.file: Path to Kerby mapred.keytab (see above).
Start Kerby by running the JUnit test as described in the first section. Now start HDFS via:
  • sbin/start-dfs.sh
  • sudo sbin/start-secure-dns.sh
3) Configure Apache Hive to use Kerberos

Next we will configure Apache Hive to use Kerberos. Edit 'conf/hiveserver2-site.xml' and add the following properties:
  • hive.server2.authentication: kerberos
  • hive.server2.authentication.kerberos.principal: hiveserver2/localhost@hadoop.apache.org
  • hive.server2.authentication.kerberos.keytab: Path to Kerby hiveserver2.keytab (see above).
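Expressed as XML, these properties look like the following (substitute the real path to the keytab generated by Kerby):

    <property>
        <name>hive.server2.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>hive.server2.authentication.kerberos.principal</name>
        <value>hiveserver2/localhost@hadoop.apache.org</value>
    </property>
    <property>
        <name>hive.server2.authentication.kerberos.keytab</name>
        <value>/path/to/kerby/target/hiveserver2.keytab</value>
    </property>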
Start Hive via 'bin/hiveserver2'. In a separate window, log on to beeline via the following steps:
  • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
  • kinit -k -t /pathtokerby/target/alice.keytab alice
  • bin/beeline -u "jdbc:hive2://localhost:10000/default;principal=hiveserver2/localhost@hadoop.apache.org"
At this point authentication is successful and we should be able to query the "words" table as per the first tutorial.

Configuring Kerberos for Hive in Talend Open Studio for Big Data

Earlier this year, I showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. A similar blog post showed how to read data from an Apache Kafka topic using kerberos. In this tutorial I will show how to create a job in Talend Open Studio for Big Data to read data from an Apache Hive table using kerberos. As a prerequisite, please follow a recent tutorial on setting up Apache Hadoop and Apache Hive using kerberos. 

1) Download Talend Open Studio for Big Data and create a job

Download Talend Open Studio for Big Data (6.4.1 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" and name it "HiveKerberosRead". In the search bar under "Palette" on the right-hand side enter "hive" and hit enter. Drag "tHiveConnection" and "tHiveInput" to the middle of the screen. Do the same for "tLogRow":

"tHiveConnection" will be used to configure the connection to Hive. "tHiveInput" will be used to perform a query on the "words" table we have created in Hive (as per the earlier tutorial linked above), and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tHiveConnection" and select "Trigger/On Subjob Ok" and drag the resulting line to "tHiveInput". Right click on "tHiveInput" and select "Row/Main" and drag the resulting line to "tLogRow":



2) Configure the components

Now let's configure the individual components. Double click on "tHiveConnection". Select the following configuration options:
  • Distribution: Hortonworks
  • Version: HDP V2.5.0
  • Host: localhost
  • Database: default
  • Select "Use Kerberos Authentication"
  • Hive Principal: hiveserver2/localhost@hadoop.apache.org
  • Namenode Principal: hdfs/localhost@hadoop.apache.org
  • Resource Manager Principal: mapred/localhost@hadoop.apache.org
  • Select "Use a keytab to authenticate"
  • Principal: alice
  • Keytab: Path to "alice.keytab" in the Kerby test project.
  • Unselect "Set Resource Manager"
  • Set Namenode URI: "hdfs://localhost:9000"

Now click on "tHiveInput" and select the following configuration options:
  • Select "Use an existing Connection"
  • Choose the tHiveConnection name from the resulting "Component List".
  • Click on "Edit schema". Create a new column called "word" of type String, and a column called "count" of type int. 
  • Table name: words
  • Query: "select * from words where word == 'Dare'"

Now the only thing that remains is to point to the krb5.conf file that is generated by the Kerby project. Click on "Window/Preferences" at the top of the screen. Select "Talend" and "Run/Debug". Add a new JVM argument: "-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf":
Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. You should see the following output in the Run Window in the Studio:

Installing the Apache Kerby KDC

Apache Kerby is a subproject of the Apache Directory project, and is a complete open-source KDC written entirely in Java. Apache Kerby 1.1.0 has just been released. This release contains two major new features: a GSSAPI module (covered previously here) and cross-realm support (the subject of a forthcoming blog post).

I have previously used Apache Kerby in this blog as a KDC to illustrate some security-based test-cases for big data components such as Apache Hadoop, Hive, Storm, etc, by pointing to some code on github that shows how to launch a Kerby KDC using Apache maven. This is convenient as a KDC can be launched with the principals already created via a single maven command. However, it is not suitable if the KDC is to be used in a standalone setting.

In this post, we will show how to create a Kerby KDC distribution without writing any code.

1) Install and configure the Apache Kerby KDC

The first step is to download the Apache Kerby source code. Unzip the source and build the distribution via:
  • mvn clean install -DskipTests
  • cd kerby-dist
  • mvn package -Pdist
The "kerby-dist" directory contains the KDC distribution in "kdc-dist", as well as the client tools in "tool-dist". Copy both "kdc-dist" and "tool-dist" directories to another location instead of working directly in the Kerby source. In "kdc-dist" create a directory called "keytabs" and "runtime". Then create some keytabs via:
  • sh bin/kdcinit.sh conf keytabs
This will create keytabs for the "kadmin" and "protocol" principals, and store them in the "keytabs" directory. For testing purposes, we will change the port of the KDC from the default "88" to "12345" to avoid having to run the KDC with administrator privileges. Edit "conf/krb5.conf" and "conf/kdc.conf" and change "88" to "12345".

The Kerby principals are stored in a backend that is configured in "conf/backend.conf". By default this is a JSON file that is stored in "/tmp/kerby/jsonbackend". However, Kerby also supports other more robust backends, such as LDAP, Mavibot, Zookeeper, etc.

We can start the KDC via:
  • sh bin/start-kdc.sh conf runtime
Let's create a new user called "alice":
  • sh bin/kadmin.sh conf/ -k keytabs/admin.keytab
  • addprinc -pw password alice@EXAMPLE.COM
2) Install and configure the Apache Kerby tool dist

We can check that the KDC has started properly using the MIT kinit tool, if it is installed locally:
  • export KRB5_CONFIG=/path.to.kdc.dist/conf/krb5.conf
  • kinit alice (use "password" for the password when prompted)
Now you can see the ticket for alice using "klist". Apache Kerby also ships a "tool-dist" distribution that contains implementations of "kinit", "klist", etc. First call "kdestroy" to remove the ticket previously obtained for "alice". Then go into the directory where "tool-dist" was installed to in the previous section. Edit "conf/krb5.conf" and replace "88" with "12345". We can now obtain a ticket for "alice" via:
  • sh bin/kinit.sh -conf conf alice
  • sh bin/klist.sh



Authorizing access to Apache Yarn using Apache Ranger

Earlier this year, I wrote a series of blog posts on how to secure access to the Apache Hadoop filesystem (HDFS), using tools like Apache Ranger and Apache Atlas. In this post, we will go further and show how to authorize access to Apache Yarn using Apache Ranger. Apache Ranger allows us to create and enforce authorization policies based on who is allowed to submit applications to run on Apache Yarn. Therefore it can be used to enforce authorization decisions for Hive on Yarn or Spark on Yarn jobs.

1) Installing Apache Hadoop

First, follow the steps outlined in the earlier tutorial (section 1) on setting up Apache Hadoop, except that in this tutorial we will work with Apache Hadoop 2.8.2. In addition, we will need to follow some additional steps to configure Yarn (see here for the official documentation). Create a new file called 'etc/hadoop/mapred-site.xml' with the content:
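The content of this file is not reproduced here; for the standard single-node Yarn setup described in the Hadoop documentation, it is most likely just the following:

    <configuration>
        <!-- Run MapReduce jobs on Yarn -->
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>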
Next edit 'etc/hadoop/yarn-site.xml' and add:
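Again following the standard Hadoop documentation, the addition is most likely the shuffle auxiliary service:

    <!-- Enable the MapReduce shuffle service in the NodeManager -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>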
Now we can start Apache Yarn via 'sbin/start-yarn.sh'. We are going to submit jobs as a local user called "alice" to test authorization. First we need to create some directories in HDFS:
  • bin/hdfs dfs -mkdir -p /user/alice/input
  • bin/hdfs dfs -put etc/hadoop/*.xml /user/alice/input
  • bin/hadoop fs -chown -R alice /user/alice
  • bin/hadoop fs -mkdir /tmp
  • bin/hadoop fs -chmod og+w /tmp
Now we can submit an example job as "alice" via:
  • sudo -u alice bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dfs[a-z.]+'
The job should run successfully and store the output in '/user/alice/output'. Delete this directory before trying to run the job again ('bin/hadoop fs -rm -r /user/alice/output').

2) Install the Apache Ranger Yarn plugin

Next we will install the Apache Ranger Yarn plugin. Download Apache Ranger and verify that the signature is valid and that the message digests match. Due to some bugs that were fixed for the installation process, I am using version 1.0.0-SNAPSHOT in this post. Now extract and build the source, and copy the resulting plugin to a location where you will configure and install it:
  • mvn clean package assembly:assembly -DskipTests
  • tar zxvf target/ranger-1.0.0-SNAPSHOT-yarn-plugin.tar.gz
  • mv ranger-1.0.0-SNAPSHOT-yarn-plugin ${ranger.yarn.home}
Now go to ${ranger.yarn.home} and edit "install.properties". You need to specify the following properties:
  • POLICY_MGR_URL: Set this to "http://localhost:6080"
  • REPOSITORY_NAME: Set this to "YarnTest".
  • COMPONENT_INSTALL_DIR_NAME: The location of your Apache Hadoop installation
Save "install.properties" and install the plugin as root via "sudo -E ./enable-yarn-plugin.sh". Make sure that the user who is running Yarn has the permission to read the policies stored in '/etc/ranger/YarnTest'. There is one additional step to be performed in Hadoop before restarting Yarn. Edit 'etc/hadoop/ranger-yarn-security.xml' and add a property called "ranger.add-yarn-authorization" with value "false". This means that if Ranger policy authorization fails, it doesn't fall back to the default Yarn ACLs (which allow all users to submit jobs to the default queue).

Finally, re-start Yarn and try to resubmit the job as "alice" as per the previous section. You should now see an authorization error: "User alice cannot submit applications to queue root.default".

3) Create authorization policies in the Apache Ranger Admin console

Next we will use the Apache Ranger admin console to create authorization policies for Yarn. Follow the steps in this tutorial to install the Apache Ranger admin service. Start the Apache Ranger admin service with "sudo ranger-admin start" and open a browser and go to "http://localhost:6080/" and log on with "admin/admin". Add a new Yarn service with the following configuration values:
  • Service Name: YarnTest
  • Username: admin
  • Password: admin
  • Yarn REST URL: http://localhost:8088
Click on "Test Connection" to verify that we can connect successfully to Yarn + then save the new service. Now click on the "YarnTest" service that we have created. Add a new policy for the "root.default" queue for the user "alice" (create this user if you have not done so already under "Settings, Users/Groups"), with a permission of "submit-app".

Allow up to 30 seconds for the Apache Ranger plugin to download the new authorization policy from the admin service. Then try to re-run the job as "alice". This time it should succeed due to the authorization policy that we have created.

Kerberos cross-realm support in Apache Kerby 1.1.0

A recent blog post covered how to install the Apache Kerby KDC. In this post we will build on that tutorial to show how to get a major new feature of Apache Kerby 1.1.0 to work - namely kerberos cross-realm support. Cross-realm support means that the KDCs in realm "A" and realm "B" are configured in such a way that a user who is authenticated in realm "A" can obtain a service ticket for a service in realm "B" without having to explicitly authenticate to the KDC in realm "B".

1) Configure the KDC for the "EXAMPLE.COM" realm

First we will configure the Apache Kerby KDC for the "EXAMPLE.COM" realm. Follow the previous tutorial to install and configure the KDC for this (default) realm. We need to follow some additional steps to get cross-realm support working with a second KDC in realm "EXAMPLE2.COM". Edit 'conf/krb5.conf' and replace the "realms" section with the following configuration:
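The exact configuration is not reproduced here, but a sketch of the relevant sections might look like the following, assuming this KDC runs on port 12345 and the "EXAMPLE2.COM" KDC (see the next section) runs on port 54321:

    [realms]
        EXAMPLE.COM = {
            kdc = localhost:12345
        }
        EXAMPLE2.COM = {
            kdc = localhost:54321
        }

    [domain_realm]
        .example.com = EXAMPLE.COM
        .example2.com = EXAMPLE2.COM

    [capaths]
        EXAMPLE.COM = {
            EXAMPLE2.COM = .
        }
        EXAMPLE2.COM = {
            EXAMPLE.COM = .
        }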
Next, after restarting the KDC, we need to add a special principal to enable cross-realm support:
  • sh bin/kadmin.sh conf/ -k keytabs/admin.keytab
  • addprinc -pw security krbtgt/EXAMPLE2.COM@EXAMPLE.COM
2) Configure the KDC for the "EXAMPLE2.COM" realm

Now we will configure a second KDC for the "EXAMPLE2.COM" realm. Download the Apache Kerby source code as before. Unzip the source and build the distribution via:
  • mvn clean install -DskipTests
  • cd kerby-dist
  • mvn package -Pdist
Copy "kdc-dist" to a location where you wish to install the second KDC. In this directory, create a directory called "keytabs" and "runtime". Edit 'conf/backend.conf' and change the value for 'backend.json.dir' to avoid conflict with the first KDC instance. Then create some keytabs via:
  • sh bin/kdcinit.sh conf keytabs
For testing purposes, we will change the port of the KDC from the default "88" to "54321" to avoid having to run the KDC with administrator privileges. Edit "conf/krb5.conf" and "conf/kdc.conf" and change "88" to "54321". In addition, change the realm from "EXAMPLE.COM" to "EXAMPLE2.COM" in both of these files. As above, edit 'conf/krb5.conf' and replace the "realms" section with the following configuration:
Next start the KDC via:
  • sh bin/start-kdc.sh conf runtime
We need to add a special principal to the KDC to enable cross-realm support, as in the KDC for the "EXAMPLE.COM" realm. Note that it must be the same principal name and password as for the first realm. We will also add a principal for a service in this realm:
  • sh bin/kadmin.sh conf/ -k keytabs/admin.keytab
  • addprinc -pw security krbtgt/EXAMPLE2.COM@EXAMPLE.COM
  • addprinc -pw password service@EXAMPLE2.COM
3) Obtaining a service ticket for service@EXAMPLE2.COM as alice@EXAMPLE.COM

Now we can obtain a service ticket for the service we have configured in the "EXAMPLE2.COM" realm as a user who is authenticated to the "EXAMPLE.COM" realm. Configure the "tool-dist" distribution as per the previous tutorial, updating 'conf/krb5.conf' with the same "realms", "domain_realm" and "capaths" information as shown above. Now we can authenticate as "alice" and obtain a service ticket as follows:
  • sh bin/kinit.sh -conf conf alice@EXAMPLE.COM
  • sh bin/kinit.sh -conf conf -c /tmp/krb5cc_1000 -S service@EXAMPLE2.COM
If you run "klist" then you should see that a ticket for "service@EXAMPLE2.COM" was obtained successfully.

    SAML SSO support for the Apache Syncope web console

    Apache Syncope is a powerful open source Identity Management project that has recently celebrated 5 years as an Apache top level project. Until recently, a username and password had to be supplied to log onto either the admin or enduser web console of Apache Syncope. However, SAML SSO login is supported since the 2.0.3 release. Instead of supplying a username/password, the user is redirected to a third party IdP for login, before being redirected back to the Apache Syncope web console. In 2.0.5, support for the IdP-initiated flow of SAML SSO was added.

    In this post we will show how to configure Apache Syncope to use SAML SSO as an alternative to logging in using a username and password. We will use Apache CXF Fediz as the SAML SSO IdP. In addition, we will show how to achieve IdP-initiated SSO using Okta. Please also refer to this tutorial on achieving SAML SSO with Syncope and Shibboleth.

    1) Logging in to Apache Syncope using SAML SSO

    In this section, we will cover setting up Apache Syncope to re-direct to a third party IdP so that the user can enter their credentials. The next section will cover the IdP-initiated case.

    1.a) Enable SAML SSO support in Apache Syncope

    First we will configure Apache Syncope to enable SAML SSO support. Download and extract the most recent standalone distribution release of Apache Syncope (2.0.6 was used in this post). Start the embedded Apache Tomcat instance and then open a web browser and navigate to "http://localhost:9080/syncope-console", logging in as "admin" and "password".

    Apache Syncope is configured with some sample data to show how it can be used. Click on "Users" and add a new user called "alice" by clicking on the subsequent "+" button. Specify a password for "alice" and then select the default values wherever possible (you will need to specify some required attributes, such as "surname"). Now in the left-hand column, click on "Extensions" and then "SAML 2.0 SP". Click on the "Service Provider" tab and then "Metadata". Save the resulting Metadata document, as it will be required to set up the SAML SSO IdP.

    1.b) Set up the Apache CXF Fediz SAML SSO IdP

    Next we will turn our attention to setting up the Apache CXF Fediz SAML SSO IdP. Download the most recent source release of Apache CXF Fediz (1.4.3 was used for this tutorial). Unzip the release and build it using maven ("mvn clean install -DskipTests"). In the meantime, download and extract the latest Apache Tomcat 8.5.x distribution (tested with 8.5.24). Once Fediz has finished building, copy all of the "IdP" wars (e.g. in fediz-1.4.3/apache-fediz/target/apache-fediz-1.4.3/apache-fediz-1.4.3/idp/war/fediz-*) to the Tomcat "webapps" directory.

    There are a few configuration changes to be made to Apache Tomcat before starting it. Download the HSQLDB jar and copy it to the Tomcat "lib" directory. Next edit 'conf/server.xml' and configure TLS on port 8443:
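    A minimal sketch of the TLS connector is shown below; the keystore file names and passwords are placeholders and should be replaced with the sample key files and passwords that ship with Fediz:

        <Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
                   maxThreads="150" SSLEnabled="true" scheme="https" secure="true"
                   keystoreFile="idp-ssl-key.jks" keystorePass="changeit"
                   truststoreFile="idp-ssl-trust.jks" truststorePass="changeit"
                   clientAuth="want" sslProtocol="TLS" />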

    The two keys referenced here can be obtained from 'apache-fediz/target/apache-fediz-1.4.3/apache-fediz-1.4.3/examples/samplekeys/' and should be copied to the root directory of Apache Tomcat. Tomcat can now be started.

    Next we have to configure Apache CXF Fediz to support Apache Syncope as a "service" via SAML SSO. Edit 'webapps/fediz-idp/WEB-INF/classes/entities-realma.xml' and add the following configuration:

    In addition, we need to make some changes to the "idp-realmA" bean in this file:
    • Add a reference to this bean in the "applications" list: <ref bean="srv-syncope" />
    • Change the "idpUrl" property to: https://localhost:8443/fediz-idp/saml
    • Change the port for "stsUrl" from "9443" to "8443".
    Now we need to configure Fediz to accept Syncope's signing cert. Edit the Metadata file you saved from Syncope in step 1.a. Copy the Base-64 encoded certificate in the "KeyDescriptor" section, and paste it (including line breaks) into 'webapps/fediz-idp/WEB-INF/classes/syncope.cert', enclosing it in between "-----BEGIN CERTIFICATE-----" and "-----END CERTIFICATE-----".

    Now restart Apache Tomcat. Open a browser and save the Fediz metadata which is available at "http://localhost:8080/fediz-idp/metadata?protocol=saml", which we will require when configuring Apache Syncope.

    1.c) Configure the Apache CXF Fediz IdP in Syncope

    The final configuration step takes place in Apache Syncope again. In the "SAML 2.0 SP" configuration screen, click on the "Identity Providers" tab and click the "+" button and select the Fediz metadata that you saved in the previous step. Now logout and an additional login option can be seen:


    Select the URL for the SAML SSO IdP and you will be redirected to Fediz. Select the IdP in realm "A" as the home realm and enter credentials of "alice/ecila" when prompted. You will be successfully authenticated to Fediz and redirected back to the Syncope admin console, where you will be logged in as the user "alice". 

    2) Using IdP-initiated SAML SSO

    Instead of the user starting with the Syncope web console, being redirected to the IdP for authentication, and then redirected back to Syncope - it is possible instead to start from the IdP. In this section we will show how to configure Apache Syncope to support IdP-initiated SAML SSO using Okta.

    2.a) Configuring a SAML application in Okta

    The first step is to create an account at Okta and configure a SAML application. This process is mapped out at the following link. Follow the steps listed on this page with the following additional changes:
    • Specify the following for the Single Sign On URL: http://localhost:9080/syncope-console/saml2sp/assertion-consumer
    • Specify the following for the audience URL: http://localhost:9080/syncope-console/
    • Specify the following for the default RelayState: idpInitiated
    When the application is configured, you will see an option to "View Setup Instructions". Open this link in a new tab and find the section about the IdP Metadata. Save this to a local file and set it aside for the moment. Next you need to assign the application to the username that you have created at Okta.

    2.b) Configure Apache Syncope to support IdP-Initiated SAML SSO

    Log on to the Apache Syncope admin console using the admin credentials, and add a new IdP Provider in the SAML 2.0 SP extension as before, using the Okta metadata file that you have saved in the previous section. Edit the metadata and select the 'Support Unsolicited Logins' checkbox. Save the metadata and make sure that the Okta user is also a valid user in Apache Syncope.

    Now go back to the Okta console and click on the application you have configured for Apache Syncope. You should seamlessly be logged into the Apache Syncope admin console.




    A fast way to get membership counts in Apache Syncope

    Apache Syncope is a powerful open source Identity Management project, covered extensively on this blog. Amongst many other features, it allows the management of three core types - Users, Groups and "Any Objects", the latter which can be used to model arbitrary types. These core types can be accessed via a flexible REST API powered by Apache CXF. In this post we will explore the concept of "membership" in Apache Syncope, as well as a new feature that was added for Syncope 2.0.7 which allows an easy way to see membership counts.

    1) Membership in Apache Syncope

    Users and "Any Objects" can be members of Groups in two ways - statically and dynamically. "Static" membership is when the User or "Any Object" is explicitly assigned membership of a given Group. "Dynamic" membership is when the Group is defined with a set of rules, which if they evaluate to true for a given User or "Any Object", then that User or "Any Object" is a member of the group. For example, a User could be a dynamic member of a group based on the value for a given User attribute. So we could have an Apache group with a dynamic User membership rule of "*@apache.org" matching an "email" attribute.

    2) Exploring group membership via the REST API

    Let's examine group membership with some practical examples. Start Apache Syncope and log in to the admin console. Click on "Groups" and add a new group called "employee", accepting the default options. Now click on the "User" tab and add new Users called "alice" and "bob", with static membership of the "employee" group.

    Using a tool like "curl", we can access the REST API using the admin credentials to obtain information on "alice":
    • curl -u admin:password http://localhost:9080/syncope/rest/users/alice
    Note that "alice" has a "memberships" attribute pointing to the "employee" group. Next we can see information on the "employee" group via:
    • curl -u admin:password http://localhost:9080/syncope/rest/groups/employee
    3) Obtaining membership counts

    Now consider obtaining the membership count of a given group. Let's say we are interested in finding out how many employees we have - how can this be done? Prior to Apache Syncope 2.0.7, we have to leverage the power of FIQL which underpins the search capabilities of the REST API of Apache Syncope:
    • curl -u admin:password http://localhost:9080/syncope/rest/users?fiql=%24groups==employee
    In other words, search for all Users who are members of the "employee" group. This returns a long list of all Users, even though all we care about is the count (which is encoded in the "totalCount" attribute). There is a new way to do this in Apache Syncope 2.0.7. Instead of having to search for Users, membership counts are now encoded in groups. So we can see the total membership counts for a given group just by doing a GET call:
    • curl -u admin:password http://localhost:9080/syncope/rest/groups/employee
    Following the example above, you should see a "staticUserMembershipCount" attribute with a value of "2". Four new attributes are defined for GroupTO:
    • staticUserMembershipCount: The static user membership count of a given group
    • dynamicUserMembershipCount: The dynamic user membership count of a given group
    • staticAnyObjectMembershipCount: The static "Any Object" membership count of a given group
    • dynamicAnyObjectMembershipCount: The dynamic "Any Object" membership count of a given group.
    Some consideration was given to returning the Any Object counts associated with a given Any Object type, but this was abandoned due to performance reasons.

    Securing Apache Solr with Apache Sentry

    Last year I wrote a series of posts on securing Apache Solr, firstly using basic authentication and then using Apache Ranger for authorization. In this post we will look at an alternative authorization solution called Apache Sentry. Previously I have blogged about using Apache Sentry to secure Apache Hive and Apache Kafka.

    1)  Install and deploy a SolrCloud example

    Download and extract Apache Solr (7.1.0 was used for the purpose of this tutorial). Now start SolrCloud via:
    • bin/solr -e cloud
    Accept all of the default options. This creates a cluster of two nodes, with a collection "gettingstarted" split into two shards and two replicas per-shard. A web interface is available after startup at: http://localhost:8983/solr/. Once the cluster is up and running we can post some data to the collection we have created via:
    • bin/post -c gettingstarted example/exampledocs/books.csv
    We can then perform a search for all books with author "George R.R. Martin" via:
    • curl http://localhost:8983/solr/gettingstarted/query?q=author:George+R.R.+Martin
    2) Authenticating users to our SolrCloud instance

    Now that our SolrCloud instance is up and running, let's look at how we can secure access to it, by using HTTP Basic Authentication to authenticate our REST requests. Download the following security configuration which enables Basic Authentication in Solr:
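    A sketch of such a configuration is shown below; the credential values are placeholders for the salted SHA-256 password hashes that the Basic Authentication plugin expects, so generate real values rather than copying these:

        {
          "authentication": {
            "blockUnknown": true,
            "class": "solr.BasicAuthPlugin",
            "credentials": {
              "alice": "<base64 sha256 hash> <base64 salt>",
              "bob": "<base64 sha256 hash> <base64 salt>"
            }
          }
        }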
    Two users are defined - "alice" and "bob" - both with password "SolrRocks". Now upload this configuration to the Apache Zookeeper instance that is running with Solr:
    • server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd putfile /security.json security.json
    Now try to run the search query above again using Curl. A 401 error will be returned. Once we specify the correct credentials then the request will work as expected, e.g.:
    • curl -u alice:SolrRocks http://localhost:8983/solr/gettingstarted/query?q=author:George+R.R.+Martin 
    3) Using Apache Sentry for authorization

    a) Install the Apache Sentry distribution

    Download the binary distribution of Apache Sentry (2.0.0 was used for the purposes of this tutorial). Verify that the signature is valid and that the message digests match. Now extract it to ${sentry.home}. Apache Sentry provides an RPC service which stores authorization privileges in a database. For the purposes of this tutorial we will just configure the authorization privileges in a configuration file local to the Solr distribution. Therefore we don't need to do any further configuration to the Apache Sentry distribution at this point.

    b) Copy Apache Sentry jars into Apache Solr 

    To get Sentry authorization working in Apache Solr, we need to copy some jars from the Sentry distribution into Solr. Copy the following jars from ${sentry.home}/lib into ${solr.home}/server/solr-webapp/webapp/WEB-INF/lib:
    • sentry-binding-solr-2.0.0.jar
    • sentry-core-model-solr-2.0.0.jar
    • sentry-core-model-db-2.0.0.jar
    • sentry-core-common-2.0.0.jar
    • shiro-core-1.4.0.jar
    • sentry-policy*.jar
    • sentry-provider-*
    c) Add Apache Sentry configuration files

    Next we will configure Apache Solr to use Apache Sentry for authorization. Create a new file in the Solr distribution called "sentry-site.xml" with the following content (substituting the correct directory for "sentry.solr.provider.resource"):
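    A rough sketch of this file is shown below; apart from "sentry.solr.provider.resource" mentioned above, the property names are assumptions based on the Sentry file-based provider:

        <configuration>
            <!-- Retrieve groups for authenticated users from the local policy file -->
            <property>
                <name>sentry.provider</name>
                <value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
            </property>
            <!-- Location of the policy file holding the authorization privileges -->
            <property>
                <name>sentry.solr.provider.resource</name>
                <value>file:///path/to/solr/sentry.ini</value>
            </property>
        </configuration>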
    This is the configuration file for the Sentry plugin for Solr. It essentially says that the authorization privileges are stored in a local file, and that the groups for authenticated users should be retrieved from this file. Finally, we need to specify the authorization privileges. Create a new file in the config directory called "sentry.ini" with the following content:

    This configuration file contains three separate sections. The "[users]" section maps the authenticated principals to local groups. The "[groups]" section maps the groups to roles, and the "[roles]" section lists the actual privileges.
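    As an illustration, a policy file that allows "alice" to query the "gettingstarted" collection might look like this (the group and role names are arbitrary):

        [users]
        # Map the authenticated user to a local group
        alice = user_group

        [groups]
        # Map the group to a role
        user_group = query_role

        [roles]
        # Allow querying the "gettingstarted" collection
        query_role = collection=gettingstarted->action=query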

    d) Update security.json to add authorization

    Next we need to update the security.json to reference Apache Sentry for authorization. Use the following content, substituting the correct path for the "authorization.sentry.site" parameter. Also change the "superuser" to the user running Sentry:

    Upload this file via:
    • server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd putfile /security.json security.json

    4) Testing authorization

    We need to restart Apache Solr to enable authorization with Apache Sentry. Stop Solr via:
    • bin/solr stop -all
    Next edit 'bin/solr.in.sh' and add the following properties:
    • SOLR_AUTH_TYPE="basic"
    • SOLR_AUTHENTICATION_OPTS="-Dbasicauth=colm:SolrRocks"
    Now restart Apache Solr and test authorization. When "bob" is used, an error should be returned (either 403 or in our case 500, as we have not configured a group for "bob"). "alice" should be able to query the collection, due to the authorization policy we have created for her.

    Securing Apache Sqoop - part I

    This is the first in a series of posts on how to secure Apache Sqoop. Apache Sqoop is a tool to transfer bulk data mainly between HDFS and relational databases, but also supporting other projects such as Apache Kafka. In this post we will look at how to set up Apache Sqoop to perform a simple use-case of transferring a file from HDFS to Apache Kafka. Subsequent posts will show how to authorize this data transfer using both Apache Ranger and Apache Sentry.

    Note that we will only use Sqoop 2 (current version 1.99.7), as this is the only version that both Sentry and Ranger support. However, this version is not (yet) recommended for production deployment.

    1) Set up Apache Hadoop and Apache Kafka

    First we will set up Apache Hadoop and Apache Kafka. The use-case is that we want to transfer a file from HDFS (/data/LICENSE.txt) to a Kafka topic (test). Follow part (1) of an earlier tutorial I wrote about installing Apache Hadoop. The following change is also required for 'etc/hadoop/core-site.xml' (in addition to the "fs.defaultFS" setting that is configured in the earlier tutorial):
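    The exact change is not reproduced here, but it is most likely a proxy-user entry that allows the user running the Sqoop 2 server to impersonate other users; treat the sketch below as an assumption, and substitute the actual user you run Sqoop as for "sqoopuser":

        <!-- Allow the user running the Sqoop 2 server to impersonate other users -->
        <property>
            <name>hadoop.proxyuser.sqoopuser.hosts</name>
            <value>*</value>
        </property>
        <property>
            <name>hadoop.proxyuser.sqoopuser.groups</name>
            <value>*</value>
        </property>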

    Make sure that LICENSE.txt is uploaded to the /data directory as outlined in the tutorial. Now we will set up Apache Kafka. Download Apache Kafka and extract it (1.0.0 was used for the purposes of this tutorial). Start Zookeeper with:
    • bin/zookeeper-server-start.sh config/zookeeper.properties
    and start the broker and then create a "test" topic with:
    • bin/kafka-server-start.sh config/server.properties
    • bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
    Finally let's set up a consumer for the "test" topic:
    • bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --consumer.config config/consumer.properties
    2) Set up Apache Sqoop

    Download Apache Sqoop and extract it (1.99.7 was used for the purposes of this tutorial).

    2.a) Configure + start Sqoop

    Before starting Sqoop, edit 'conf/sqoop.properties' and change the following property to point instead to the Hadoop configuration directory (e.g. /path.to.hadoop/etc/hadoop):
    • org.apache.sqoop.submission.engine.mapreduce.configuration.directory
    Then configure and start Apache Sqoop with the following commands:
    • export HADOOP_HOME=path to Hadoop home
    • bin/sqoop2-tool upgrade
    • bin/sqoop2-tool verify
    • bin/sqoop2-server start (stop)
    2.b) Configure links/job in Sqoop

    Now that Sqoop has started we need to configure it to transfer data from HDFS to Kafka. Start the Shell via:
    • bin/sqoop2-shell
    "show connector" lists the connectors that are available. We first need to configure a link for the HDFS connector:
    • create link -connector hdfs-connector
    • Name: HDFS
    • URI: hdfs://localhost:9000
    • Conf directory: Path to Hadoop conf directory
    Similarly, for the Kafka connector:
    • create link -connector kafka-connector
    • Name: KAFKA
    • Kafka brokers: localhost:9092
    • Zookeeper quorum: localhost:2181
    "show link" shows the links we've just created. Now we need to create a job from the HDFS link to the Kafka link as follows (accepting the default values if they are not specified below):
    • create job -f HDFS -t KAFKA
    • Name: testjob
    • Input Directory: /data
    • Topic: test
    We can see the job we've created with "show job". Now let's start the job:
    • start job -name testjob 
    You should see the content of the HDFS "/data" directory (i.e. the LICENSE.txt) appear in the window of the Kafka "test" consumer, showing that Sqoop has transferred the data from HDFS to Kafka.

    Securing Apache Sqoop - part II

    This is the second in a series of posts on how to secure Apache Sqoop. The first post looked at how to set up Apache Sqoop to perform a simple use-case of transferring a file from HDFS to Apache Kafka. In this post we will look at securing Apache Sqoop with Apache Ranger, such that only authorized users can interact with it. We will then show how to use the Apache Ranger Admin UI to create authorization policies for Apache Sqoop.

    1) Install the Apache Ranger Sqoop plugin

    If you have not done so already, please follow the steps in the earlier tutorial to set up Apache Sqoop. First we will install the Apache Ranger Sqoop plugin. Download Apache Ranger and verify that the signature is valid and that the message digests match. Due to some bugs that were fixed for the installation process, I am using version 1.0.0-SNAPSHOT in this post. Now extract and build the source, and copy the resulting plugin to a location where you will configure and install it:
    • mvn clean package assembly:assembly -DskipTests
    • tar zxvf target/ranger-1.0.0-SNAPSHOT-sqoop-plugin.tar.gz
    • mv ranger-1.0.0-SNAPSHOT-sqoop-plugin ${ranger.sqoop.home}
    Now go to ${ranger.sqoop.home} and edit "install.properties". You need to specify the following properties:
    • POLICY_MGR_URL: Set this to "http://localhost:6080"
    • REPOSITORY_NAME: Set this to "SqoopTest".
    • COMPONENT_INSTALL_DIR_NAME: The location of your Apache Sqoop installation
    Save "install.properties" and install the plugin as root via "sudo -E ./enable-sqoop-plugin.sh". Make sure that the user you are running Sqoop as has permission to access '/etc/ranger/SqoopTest', which is where the Ranger plugin for Sqoop will download authorization policies created in the Ranger Admin UI.

    In the Apache Sqoop directory, copy 'conf/ranger-sqoop-security.xml' to the root directory (or else add the 'conf' directory to the Sqoop classpath). Now restart Apache Sqoop and try to see the Connectors that were installed:
    • bin/sqoop2-server start
    • bin/sqoop2-shell
    • show connector
    You should see an empty list here as you are not authorized to see the connectors. Note that "show job" should still work OK, as you have permission to view jobs that you created.

    2) Create authorization policies in the Apache Ranger Admin console

    Next we will use the Apache Ranger admin console to create authorization policies for Sqoop. Follow the steps in this tutorial (except use at least Ranger 1.0.0) to install the Apache Ranger admin service. Start the Apache Ranger admin service with "sudo ranger-admin start", then open a browser at "http://localhost:6080/" and log on with "admin/admin". Add a new Sqoop service with the following configuration values:
    • Service Name: SqoopTest
    • Username: admin
    • Sqoop URL: http://localhost:12000
    Note that "Test Connection" is not going to work here, as the "admin" user is not authorized at this stage to read from the Sqoop 2 server. However, once the service is created and the policies synced to the Ranger plugin in Sqoop (roughly every 30 seconds by default), it should work correctly.

    Once the "SqoopTest" service is created, we will create some authorization policies for the user who is using the Sqoop Shell.
    Click on "Settings" and then "Users/Groups" and add a new user corresponding to the user for whom you wish to create authorization policies. Once this is done, click on the "SqoopTest" service and edit the existing policies, adding this user to the "Allow Conditions" of each policy.


    Wait 30 seconds for the policies to sync to the Ranger plugin that is co-located with the Sqoop service. Now re-start the Shell and "show connector" should list the full range of Sqoop Connectors, as authorization has succeeded. Similar policies could be created to allow only certain users to run jobs created by other users.




    Securing Apache Sqoop - part III

    This is the third and final post about securing Apache Sqoop. The first post looked at how to set up Apache Sqoop to perform a simple use-case of transferring a file from HDFS to Apache Kafka. The second post showed how to secure Apache Sqoop with Apache Ranger. In this post we will look at an alternative way of implementing authorization in Apache Sqoop, namely using Apache Sentry.

    1) Install the Apache Sentry Sqoop plugin

    If you have not done so already, please follow the steps in the earlier tutorial to set up Apache Sqoop. Download the binary distribution of Apache Sentry (2.0.0 was used for the purposes of this tutorial). Verify that the signature is valid and that the message digests match, and extract it to ${sentry.home}.

    a) Configure sqoop.properties

    We need to configure Apache Sqoop to use Apache Sentry for authorization. Edit 'conf/sqoop.properties' and add the following properties:
    • org.apache.sqoop.security.authentication.type=SIMPLE
    • org.apache.sqoop.security.authentication.handler=org.apache.sqoop.security.authentication.SimpleAuthenticationHandler
    • org.apache.sqoop.security.authorization.handler=org.apache.sentry.sqoop.authz.SentryAuthorizationHandler
    • org.apache.sqoop.security.authorization.access_controller=org.apache.sentry.sqoop.authz.SentryAccessController
    • org.apache.sqoop.security.authorization.validator=org.apache.sentry.sqoop.authz.SentryAuthorizationValidator
    • org.apache.sqoop.security.authorization.server_name=SqoopServer1
    • sentry.sqoop.site.url=file:./conf/sentry-site.xml
    In addition, we need to add some of the Sentry jars to the Sqoop classpath. Add the following property to 'conf/sqoop.properties', substituting the value for "${sentry.home}":
    • org.apache.sqoop.classpath.extra=${sentry.home}/lib/sentry-binding-sqoop-2.0.0.jar:${sentry.home}/lib/sentry-core-common-2.0.0.jar:${sentry.home}/lib/sentry-core-model-sqoop-2.0.0.jar:${sentry.home}/lib/sentry-provider-file-2.0.0.jar:${sentry.home}/lib/sentry-provider-common-2.0.0.jar:${sentry.home}/lib/sentry-provider-db-2.0.0.jar:${sentry.home}/lib/shiro-core-1.4.0.jar:${sentry.home}/lib/sentry-policy-engine-2.0.0.jar:${sentry.home}/lib/sentry-policy-common-2.0.0.jar

    b) Add Apache Sentry configuration files

    Next we will configure the Apache Sentry authorization plugin. Create a new file in the Sqoop "conf" directory called "sentry-site.xml" with the following content (substituting the correct directory for "sentry.sqoop.provider.resource"):
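    A minimal sketch, again assuming the local file-based provider (the provider property name and the path are assumptions to be checked against your Sentry version):

    <configuration>
      <!-- Sketch: groups and privileges for authenticated users are looked up in a local file -->
      <property>
        <name>sentry.sqoop.provider</name>
        <value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
      </property>
      <!-- Location of the policy/group file (sentry.ini) created below - substitute your own path -->
      <property>
        <name>sentry.sqoop.provider.resource</name>
        <value>file:///path/to/sqoop/conf/sentry.ini</value>
      </property>
    </configuration>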

    It essentially says that the authorization privileges are stored in a local file, and that the groups for authenticated users should be retrieved from this file. Finally, we need to specify the authorization privileges. Create a new file in the config directory called "sentry.ini" with the following content, substituting "colm" for the name of the user running the Sqoop shell:
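    A minimal sketch (the group and role names are illustrative; the server name matches the "server_name" property configured above):

    [users]
    # Substitute "colm" for the user running the Sqoop shell
    colm = sqoop_admin_group
    [groups]
    sqoop_admin_group = sqoop_admin_role
    [roles]
    # Grant all actions on the Sqoop server - change ALL to WRITE below to test the failure case
    sqoop_admin_role = server=SqoopServer1->action=ALL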

    2) Test authorization 

    Now start Apache Sqoop ("bin/sqoop2-server start") and start the shell ("bin/sqoop2-shell"). "show connector" should list the full range of Sqoop Connectors, as authorization has succeeded. To test that authorization is correctly disabling access for unauthorized users, change the "ALL" permission in 'conf/sentry.ini' to "WRITE", and restart the server and shell. This time access is not granted and a blank list should be returned for "show connector".

    The Apache Sentry security service - part I

    Apache Sentry is a role-based authorization solution for a number of big-data projects. I have previously blogged about how to install the authorization plugin to secure various deployments, e.g. Apache Kafka, Apache Solr and Apache Sqoop.
    For all of these tutorials, the authorization privileges were stored in a configuration file local to the deployment. However, this is just a "test configuration" to get simple examples up and running quickly. For production scenarios, Apache Sentry offers a central security service, which stores the user roles and privileges in a database and provides an RPC service that the Sentry authorization plugins can invoke on. In this article, we will show how to set up the Apache Sentry security service in a couple of different ways.

    1) Installing the Apache Sentry security service manually

    Download the binary distribution of Apache Sentry (2.0.0 was used for the purposes of this tutorial). Verify that the signature is valid and that the message digests match, and extract it to ${sentry.home}. In addition, download a compatible version of Apache Hadoop (2.7.5 was used for the purposes of this tutorial). Set the "HADOOP_HOME" environment variable to point to the Hadoop distribution.

    First we need to specify two configuration files, "sentry-site.xml" which contains the Sentry configuration, and "sentry.ini" which defines the user/group information for the user who will be invoking on the Sentry security service. You can download sample configuration files here. Copy these files to the root directory of "${sentry.home}". Edit the 'sentry.ini' file and replace 'user' with the user who will be invoking on the security service (such as "kafka" or "solr"). The other entries will be ignored - 'sentry-site.xml' defines that a user must belong to the "admin" group to invoke on the security service successfully.

    Finally configure the database and start the Apache Sentry security service via:
    • bin/sentry --command schema-tool --conffile sentry-site.xml --dbType derby --initSchema
    • bin/sentry --command service -c sentry-site.xml
    2) Installing the Apache Sentry security service via docker

    Instead of having to download and configure Apache Sentry and Hadoop, a simpler way to get started is to download a pre-made docker image that I created. The DockerFile is available here and the docker image is available here. Note that this docker image is only for testing use, as the security service is not secured with kerberos and it uses the default credentials. Download and run the docker image with:
    • docker pull coheigea/sentry
    • docker run -p 8038:8038 coheigea/sentry
    Once the container has started, we need to update the 'sentry.ini' file with the username that we are going to use to invoke on the Apache Sentry security service. Get the "id" of the running container via "docker ps" and then run "docker exec -it <id> bash". Edit 'sentry.ini' and change 'user' to the username you are using.

    In the next tutorial we will look at how to manually invoke on the security service.

    The Apache Sentry security service - part II

    This is the second in a series of blog posts on the Apache Sentry security service. The first post looked at how to get started with the Apache Sentry security service, both from scratch and via a docker image. The next logical question is how we can define the authorization privileges held in the Sentry security service. In this post we will briefly cover what those privileges look like, and how we can query them using two different tools that ship with the Apache Sentry distribution.

    1) Apache Sentry privileges

    The Apache Sentry docker image we covered in the previous tutorial ships with a 'sentry.ini' configuration file (see here) that is used to retrieve the groups associated with a given user. A user must be a member of the "admin" group to invoke on the Apache Sentry security service, as configured in 'sentry-site.xml' (see here). Note that 'sentry.ini' also contains "[groups]" and "[roles]" sections, but these are not used by the Sentry security service itself.

    In Apache Sentry, a user is associated with one or more groups, which in turn are associated with one or more roles, which in turn are associated with one or more privileges. Privileges are made up of a number of different components that vary slightly depending on what service the privilege is associated with (e.g. Hive, Kafka, etc.). For example:
    • Host=*->Topic=test->action=ALL - This Kafka privilege grants all actions on the "test" topic on all hosts.
    • Collection=logs->action=* - This Solr privilege grants all actions on the "logs" collection.
    • Server=sqoopServer1->Connector=c1->action=* - This Sqoop privilege grants all actions on the "c1" connector on the "sqoopServer1" server.
    • Server=server1->Db=default->Table=words->Column=count->action=select - This Hive privilege grants the "select" action on the "count" column of the "words" table in the "default" database on the "server1" server.
    For more information on the Apache Sentry privilege model, please consult the official wiki.

    2) Querying the Apache Sentry security service using 'sentryShell'

    Follow the steps outlined in the previous tutorial to get the Apache Sentry security service up and running, either via the docker image or by setting it up manually. The Apache Sentry distribution ships with a "sentryShell" command line tool that we can use to query the Apache Sentry security service. Depending on which approach you followed to install Sentry, either go to the distribution or log into the docker container.

    We can query the roles, groups and privileges via:
    • bin/sentryShell -conf sentry-site.xml -lr
    • bin/sentryShell -conf sentry-site.xml -lg
    • bin/sentryShell -conf sentry-site.xml -lp -r admin_role
    We can create an "admin_role" role and add it to the "admin" group via:
    • bin/sentryShell -conf sentry-site.xml -cr -r admin_role
    • bin/sentryShell -conf sentry-site.xml -arg -g admin -r admin_role
    We can grant a (Hive) privilege to the "admin_role" role as follows:
    • bin/sentryShell -conf sentry-site.xml -gpr -r admin_role -p "Server=*->action=ALL"
    If we are adding a privilege for anything other than Apache Hive, we need to explicitly specify the "type", e.g.:
    • bin/sentryShell -conf sentry-site.xml -gpr -r admin_role -p "Host=*->Cluster=kafka-cluster->action=ALL" -t kafka
    • bin/sentryShell -conf sentry-site.xml -lp -r admin_role -t kafka
    3) Querying the Apache Sentry security service using 'sentryCli'

    A rather more user-friendly alternative to the 'sentryShell' is available in Apache Sentry 2.0.0. The 'sentryCli' can be started with 'bin/sentryCli'. Typing ?l lists the available commands:

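    While the full listing depends on the Sentry version, the abbreviated commands used later in this series are:
    • t: select the component type (e.g. "kafka")
    • cr: create a role
    • gr: grant a role to a group
    • gp: grant a privilege to a role
    along with corresponding commands to list and drop roles, groups and privileges.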
    The Apache Sentry security service can be queried using any of these commands.

    Enabling Apache CXF Fediz plugin logging in Apache Tomcat

    The Apache CXF Fediz subproject provides an easy way to secure your web applications via the WS-Federation Passive Requestor Profile. An earlier tutorial I wrote covers how to deploy and secure a "simpleWebapp" project that ships with Fediz in Apache Tomcat. One of the questions that came up recently on that article was how to enable logging for the Fediz plugin itself (as opposed to the IdP/STS). My colleague Jan Bernhardt has covered this topic using Apache Log4j. Here we will show a simple alternative way to enable logging using java.util.logging.

    Please follow the earlier tutorial to set up and secure the "simpleWebapp" in Apache Tomcat. Note that after a successful test, the IdP logs appear in "logs/idp.log" and the STS logs appear in "logs/sts.log". However no logs exist for the plugin itself. To rectify this, copy the "slf4j-jdk14" jar into "lib/fediz" (for example from here). Then edit 'webapps/fedizhelloworld/WEB-INF/classes/logging.properties' with the following content:
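    A sketch of such a configuration using the standard java.util.logging handlers (Tomcat's JULI log manager substitutes ${catalina.base}; adjust the levels and path as required):

    handlers = java.util.logging.ConsoleHandler, java.util.logging.FileHandler
    .handlers = java.util.logging.ConsoleHandler, java.util.logging.FileHandler

    # INFO and above goes to the console (catalina.out)
    java.util.logging.ConsoleHandler.level = INFO
    java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

    # FINE and above goes to logs/rp.log in XML format
    java.util.logging.FileHandler.level = FINE
    java.util.logging.FileHandler.pattern = ${catalina.base}/logs/rp.log
    java.util.logging.FileHandler.formatter = java.util.logging.XMLFormatter

    # Capture FINE level messages from the Fediz plugin
    org.apache.cxf.fediz.level = FINE
    .level = INFO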

    This configuration logs "INFO" level messages to the console (catalina.out) and "FINE" level messages to the log file "logs/rp.log" in XML format (each entry is written as a java.util.logging XML <record> element).

    The Apache Sentry security service - part III

    This is the third in a series of blog posts on the Apache Sentry security service. The first post looked at how to get started with the Apache Sentry security service, both from scratch and via a docker image. The second post looked at how to define the authorization privileges held in the Sentry security service. In this post we will look at updating an earlier tutorial I wrote about securing Apache Kafka with Apache Sentry, this time using the security service instead of defining the privileges in a file local to the Kafka distribution.

    1) Configure authorization in the broker

    First, download and configure Apache Kafka using SSL as per this tutorial, except use Kafka 0.11.0.2. To enable authorization using Apache Sentry we also need to follow these steps. Then edit 'config/server.properties' and add:
    • authorizer.class.name=org.apache.sentry.kafka.authorizer.SentryKafkaAuthorizer
    • sentry.kafka.site.url=file:./config/sentry-site.xml
    Next copy the jars from the "lib" directory of the Sentry distribution to the Kafka "libs" directory. Then create a new file in the config directory called "sentry-site.xml" with the following content:
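    A rough sketch of this file follows; the property names are assumptions based on the Sentry generic authorization binding and should be checked against the Sentry documentation for your version. The security service from the first post is assumed to be listening on localhost:8038:

    <configuration>
      <!-- Assumed property names: retrieve privileges from the Sentry security service -->
      <property>
        <name>sentry.kafka.provider.backend</name>
        <value>org.apache.sentry.provider.db.generic.SentryGenericProviderBackend</value>
      </property>
      <property>
        <name>sentry.service.client.server.rpc-addresses</name>
        <value>localhost</value>
      </property>
      <property>
        <name>sentry.service.client.server.rpc-port</name>
        <value>8038</value>
      </property>
      <!-- Group mappings for authenticated users come from the local sentry.ini file -->
      <property>
        <name>sentry.kafka.provider</name>
        <value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
      </property>
      <property>
        <name>sentry.kafka.provider.resource</name>
        <value>file:///path/to/kafka/config/sentry.ini</value>
      </property>
    </configuration>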

    This is the configuration file for the Sentry plugin for Kafka. It instructs Sentry to retrieve the authorization privileges from the Sentry security service, and to get the groups of authenticated users from the 'sentry.ini' configuration file. Create a new file in the config directory called "sentry.ini" with the following content:
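    A sketch of the group mappings only, with illustrative principal names (map whatever principals your broker and clients authenticate as, per the SSL setup in the earlier tutorial, to the "admin", "producer" and "consumer" groups used in the next section):

    [users]
    # Illustrative principal names - substitute the principals from your SSL certificates
    kafka = admin
    clientuser = producer, consumer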
    Note that in the earlier tutorial this file also contained the authorization privileges, but they are not required in this scenario as we are using the Apache Sentry security service.

    2) Configure the Apache Sentry security service

    Follow the first tutorial to install the Apache Sentry security service. Now we need to create the authorization privileges for our Apache Kafka test scenario, as per the second tutorial. Start the 'sentryCli' shell in the Apache Sentry distribution.

    Create the roles:
    • t kafka
    • cr admin_role
    • cr describe_role
    • cr read_role
    • cr write_role
    • cr describe_consumer_group_role 
    • cr read_consumer_group_role
    Add the privileges to the roles:
    • gp admin_role "Host=*->Cluster=kafka-cluster->action=ALL"
    • gp describe_role "Host=*->Topic=test->action=describe"
    • gp read_role "Host=*->Topic=test->action=read"
    • gp write_role "Host=*->Topic=test->action=write"
    • gp describe_consumer_group_role "Host=*->ConsumerGroup=test-consumer-group->action=describe"
    • gp read_consumer_group_role "Host=*->ConsumerGroup=test-consumer-group->action=read"
    Associate the roles with groups (defined in 'sentry.ini' above):
    • gr admin_role admin
    • gr describe_role producer
    • gr read_role producer
    • gr write_role producer
    • gr read_role consumer
    • gr describe_role consumer
    • gr describe_consumer_group_role consumer
    • gr read_consumer_group_role consumer
    3) Test authorization

    Now start the broker (after starting Zookeeper):
    • bin/kafka-server-start.sh config/server.properties
    Start the producer:
    • bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test --producer.config config/producer.properties
    Send a few messages to check that the producer is authorized correctly. Now start the consumer:
    • bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --consumer.config config/consumer.properties --new-consumer
    Authorization should succeed and you should see the messages sent by the producer appear in the consumer console window.
