Elasticsearch real-time search and analytics natively integrated with Hadoop.
Supports Map/Reduce, Apache Hive, and Apache Spark.
See project page and documentation for detailed information.
Elasticsearch (1.x or higher (2.x highly recommended)) cluster accessible through REST. That’s it!
Significant effort has been invested to create a small, self-contained jar that can be downloaded and put to use without any dependencies. Simply make it available to your job classpath and you’re set.
For requirements specific to a given library, see the dedicated chapter.
ES-Hadoop 6.x and higher are compatible with Elasticsearch 1.X, 2.X, 5.X, and 6.X
ES-Hadoop 5.x and higher are compatible with Elasticsearch 1.X, 2.X and 5.X
ES-Hadoop 2.2.x and higher are compatible with Elasticsearch 1.X and 2.X
ES-Hadoop 2.0.x and 2.1.x are compatible with Elasticsearch 1.X only
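The compatibility list above can be read as a small lookup. The following is only a sketch of that reading: the class `EsHadoopCompatibility` and its methods are hypothetical names introduced for illustration, they are not part of ES-Hadoop, and the rules encode nothing beyond the four lines above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not an ES-Hadoop class): mirrors the compatibility list above.
final class EsHadoopCompatibility {
    private EsHadoopCompatibility() {}

    // Returns the Elasticsearch major versions listed for an ES-Hadoop major.minor version.
    static List<Integer> supportedEsMajors(int major, int minor) {
        if (major < 2 || minor < 0) {
            throw new IllegalArgumentException("Not covered by the list: " + major + "." + minor);
        }
        if (major >= 6) {
            return Arrays.asList(1, 2, 5, 6);   // 6.x and higher
        }
        if (major == 5) {
            return Arrays.asList(1, 2, 5);      // 5.x
        }
        if (major > 2 || minor >= 2) {
            return Arrays.asList(1, 2);         // 2.2.x and higher
        }
        return Collections.singletonList(1);    // 2.0.x and 2.1.x
    }

    // True when the given Elasticsearch major version appears in the list for that ES-Hadoop version.
    static boolean isCompatible(int hadoopMajor, int hadoopMinor, int esMajor) {
        return supportedEsMajors(hadoopMajor, hadoopMinor).contains(esMajor);
    }
}
```

For example, `EsHadoopCompatibility.isCompatible(2, 1, 2)` is false under this reading, because ES-Hadoop 2.1.x is listed as compatible with Elasticsearch 1.X only.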
Stable release (currently 8.15.1), available through any Maven-compatible tool:
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>8.15.1</version>
</dependency>
or as a stand-alone ZIP.
Grab the latest nightly build from the repository again through Maven:
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>9.0.0-SNAPSHOT</version>
</dependency>
<repositories>
  <repository>
    <id>sonatype-oss</id>
    <url>http://oss.sonatype.org/content/repositories/snapshots</url>
    <snapshots><enabled>true</enabled></snapshots>
  </repository>
</repositories>
or build the project yourself.
We do build and test the code on each commit.
Running against Hadoop 1.x is deprecated in 5.5 and will no longer be tested against in 6.0.
ES-Hadoop is developed for and tested against Hadoop 2.x and YARN.
More information in this section.
We’re interested in your feedback! You can find us on the User mailing list - please append [Hadoop]
to the post subject to filter it out. For more details, see the community page.
The latest reference documentation is available online on the project home page. The README below contains basic usage instructions at a glance.
All configuration properties start with the es prefix. Note that the es.internal namespace is reserved for the library’s internal use and should not be used by the user at any point.
The properties are read mainly from the Hadoop configuration but the user can specify (some of) them directly depending on the library used.
es.resource=<ES resource location, relative to the host/port specified above>
es.query=<uri or query dsl query> # defaults to {"query":{"match_all":{}}}
es.nodes=<ES host address> # defaults to localhost
es.port=<ES REST port> # defaults to 9200
The full list is available here
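To make the defaults and the reserved namespace concrete, here is a minimal sketch using only `java.util.Properties`. The class `EsSettingsSketch` and its methods are hypothetical and are not how ES-Hadoop itself loads settings; the property names and default values are the ones listed above, and `es.internal.foo` in the usage note is a made-up key.

```java
import java.util.Properties;

// Hypothetical helper (not an ES-Hadoop class): collects es.* settings with the defaults listed above.
final class EsSettingsSketch {
    static final String RESERVED_NAMESPACE = "es.internal";

    private EsSettingsSketch() {}

    // Creates a Properties object that falls back to the documented defaults.
    static Properties withDefaults() {
        Properties defaults = new Properties();
        defaults.setProperty("es.query", "{\"query\":{\"match_all\":{}}}");
        defaults.setProperty("es.nodes", "localhost");
        defaults.setProperty("es.port", "9200");
        return new Properties(defaults);
    }

    // Stores a user setting, rejecting keys outside the es prefix or inside es.internal.
    static void set(Properties settings, String key, String value) {
        if (!key.startsWith("es.")) {
            throw new IllegalArgumentException("Property must start with the es prefix: " + key);
        }
        if (key.equals(RESERVED_NAMESPACE) || key.startsWith(RESERVED_NAMESPACE + ".")) {
            throw new IllegalArgumentException("The es.internal namespace is reserved: " + key);
        }
        settings.setProperty(key, value);
    }
}
```

With this sketch, calling `EsSettingsSketch.set(settings, "es.resource", "radio/artists")` stores the resource, `settings.getProperty("es.port")` falls back to `9200`, and a key such as `es.internal.foo` is rejected with an `IllegalArgumentException`.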
For basic, low-level or performance-sensitive environments, ES-Hadoop provides dedicated InputFormat and OutputFormat implementations that read and write data to Elasticsearch. To use them, add the es-hadoop jar to your job classpath, either by bundling the library with your job (it’s ~300kB and has no dependencies), by using the DistributedCache, or by provisioning the cluster manually.
See the documentation for more information.
Note that es-hadoop supports both the so-called ‘old’ and the ‘new’ API through its EsInputFormat and EsOutputFormat classes.
‘Old’ (org.apache.hadoop.mapred) API
To read data from ES, configure the EsInputFormat on your job configuration along with the relevant properties:
JobConf conf = new JobConf();
conf.setInputFormat(EsInputFormat.class);
conf.set("es.resource", "radio/artists");
conf.set("es.query", "?q=me*"); // replace this with the relevant query
...
JobClient.runJob(conf);
The same configuration template can be used for writing, but using EsOutputFormat:
JobConf conf = new JobConf();
conf.setOutputFormat(EsOutputFormat.class);
conf.set("es.resource", "radio/artists"); // index or indices used for storing data
...
JobClient.runJob(conf);
‘New’ (org.apache.hadoop.mapreduce) API
To read data from ES, set the relevant properties on the Configuration and use EsInputFormat as the job input format:
Configuration conf = new Configuration();
conf.set("es.resource", "radio/artists");
conf.set("es.query", "?q=me*"); // replace this with the relevant query
Job job = new Job(conf);
job.setInputFormatClass(EsInputFormat.class);
...
job.waitForCompletion(true);
For writing, use EsOutputFormat as the job output format:
Configuration conf = new Configuration();
conf.set("es.resource", "radio/artists"); // index or indices used for storing data
Job job = new Job(conf);
job.setOutputFormatClass(EsOutputFormat.class);
...
job.waitForCompletion(true);
ES-Hadoop provides a Hive storage handler for Elasticsearch, meaning one can define an external table on top of ES.
Add es-hadoop-<version>.jar to hive.aux.jars.path or register it manually in your Hive script (recommended):
ADD JAR /path_to_jar/es-hadoop-<version>.jar;
To read data from ES, define a table backed by the desired index:
CREATE EXTERNAL TABLE artists (
id BIGINT,
name STRING,
links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists', 'es.query' = '?q=me*');
The fields defined in the table are mapped to the JSON when communicating with Elasticsearch. Notice the use of TBLPROPERTIES to define the location, that is the query used for reading from this table.
Once defined, the table can be used just like any other:
SELECT * FROM artists;
To write data, a similar definition is used but with a different es.resource:
CREATE EXTERNAL TABLE artists (
id BIGINT,
name STRING,
links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');
Any data passed to the table is then passed down to Elasticsearch; for example, considering a table source (aliased as s below), mapped to a TSV/CSV file, one can index it to Elasticsearch like this:
INSERT OVERWRITE TABLE artists
SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture) FROM source s;
As one can note, currently the reading and writing are treated separately but we’re working on unifying the two and automatically translating HiveQL to Elasticsearch queries.
ES-Hadoop provides native (Java and Scala) integration with Spark: for reading, a dedicated RDD, and for writing, methods that work on any RDD. Spark SQL is also supported.
To read data from ES, create a dedicated RDD and specify the query as an argument:
import org.elasticsearch.spark._
...
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")
import org.elasticsearch.spark.sql._
// DataFrame schema automatically inferred
val df = sqlContext.read.format("es").load("buckethead/albums")
// operations get pushed down and translated at runtime to Elasticsearch QueryDSL
val playlist = df.filter(df("category").equalTo("pikes").and(df("year").geq(2016)))
Import the org.elasticsearch.spark._ package to gain saveToEs methods on your RDDs:
import org.elasticsearch.spark._
val conf = ...
val sc = new SparkContext(conf)
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")
import org.elasticsearch.spark.sql._
val df = sqlContext.read.json("examples/people.json")
df.saveToEs("spark/people")
In a Java environment, use the org.elasticsearch.spark.rdd.api.java package, in particular the JavaEsSpark class.
To read data from ES, create a dedicated RDD and specify the query as an argument.
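import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;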
SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, "radio/artists");
SQLContext sql = new SQLContext(jsc); // reuse the JavaSparkContext created above
DataFrame df = sql.read().format("es").load("buckethead/albums");
DataFrame playlist = df.filter(df.col("category").equalTo("pikes").and(df.col("year").geq(2016)));
Use JavaEsSpark to index any RDD to Elasticsearch:
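import java.util.Map;
import com.google.common.collect.ImmutableList; // Guava
import com.google.common.collect.ImmutableMap;  // Guava
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;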
SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);
Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");
JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;
DataFrame df = sqlContext.read().json("examples/people.json");
JavaEsSparkSQL.saveToEs(df, "spark/docs");
Elasticsearch Hadoop uses Gradle for its build system, and you do not need to have Gradle installed on your machine. By default (gradlew), it automatically builds the package and runs the unit tests. For integration testing, use the integrationTests task.
See gradlew tasks for more information.
To create a distributable zip, run gradlew distZip from the command line; once completed you will find the jar in build/libs.
To build the project, JVM 8 (Oracle one is recommended) or higher is required.
This project is released under version 2.0 of the Apache License
Licensed to Elasticsearch under one or more contributor
license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright
ownership. Elasticsearch licenses this file to you under
the Apache License, Version 2.0 (the "License"); you may
not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.