Thanks to the changes proposed at CASSANDRA-8717, CASSANDRA-7575 and CASSANDRA-6480, Stratio is glad to present its Lucene-based implementation of Cassandra secondary indexes as a plugin that can be attached to the Apache distribution. Before the above changes, Lucene index was distributed inside a fork of Apache Cassandra, with all the difficulties it implied, i.e. maintaining a fork. As of now, the fork is discontinued and new users should use the recently created plugin, which maintains all the features of Stratio Cassandra.
Stratio’s Lucene index extends Cassandra’s functionality to provide near real-time distributed search engine capabilities such as with ElasticSearch or Solr, including full text search capabilities, free multivariable search,relevance queries and field-based sorting. Each node indexes its own data, so high availability and scalability is guaranteed.
With the new distribution as a plugin you can easily patch an existing installation of Cassandra with the following command:
mvn clean package -Ppatch -Dcassandra_home=<CASSANDRA_INSTALL_DIR>
Now, given an existing table with the following schema
CREATE TABLE tweets ( id bigint, createdAt timestamp, message text, userid bigint, username text, PRIMARY KEY (userid, createdAt, id) );
you can create a Lucene index with this query:
ALTER TABLE tweets ADD lucene TEXT; CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '60', 'schema' : '{ default_analyzer : "English", fields : { createdat : {type : "date", pattern : "yyyy-MM-dd"}, message : {type : "text", analyzer : "English"}, userid : {type : "string"}, username : {type : "string"} }} '};
Once the index has been created, we can start issuing queries. For instance, retrieve 10 tweets containing the word Cassandra in the message body:
SELECT * FROM tweets WHERE lucene = ‘{ filter : {type : "match", field : "message", value : "cassandra"} }’ LIMIT 10;
We can also find the 5 best matching tweets containing the word Cassandra in the message body for the previous query, involving Lucene relevance features:
SELECT * FROM tweets WHERE lucene = ‘{ query: {type: "match", field : "message", value : "cassandra"} }’ LIMIT 5;
Queries, filters and boolean predicates can be combined to do queries as complex as requesting the tweets inside a certain time range starting with an ‘A’ or containing a ‘B’, or with the word ‘FAST’, sorting them by ascending time, and then by descending alphabetic user name:
SELECT * FROM tweets WHERE lucene = '{ filter : { type : "boolean", must : [ {type : "range", field : "createdat", lower : "2014-04-25"}, {type : "boolean", should : [ {type : "prefix", field : "userid", value : "a"} , {type : "wildcard", field : "userid", value : "*b*"} , {type : "match", field : "userid", value : "fast"} ] } ] }, sort : { fields: [ {field :"createdat", reverse : true}, {field : "userid", reverse : false} ] } }' LIMIT 10000;
You can learn more about Stratio’s Lucene index for Cassandra reading the comprehensive documentation or viewing the video of its presentation at Cassandra Summit Europe 2014.
We will be back very soon with new exciting features such as geospatial search and bitemporal indexing.
15 Comments
I am using the plugin on version 3.0.9. I am trying to do a very similar thing to the complex query you posted here, but keep getting the following: Error from server: code=2200 [Invalid query] message=”Unparseable JSON search: {…}”.
So I copied these queries from your blog and tried them out on my server – same thing. Any ideas? How complex can these nested boolean queries be? The query verbatim:
SELECT * FROM tweets WHERE lucene = ‘{
filter :
{
type : “boolean”, must :
[
{type : “range”, field : “time” lower : “2014/04/25”},
{type : “boolean”, should :
[
{type : “prefix”, field : “user”, value : “a”} ,
{type : “wildcard”, field : “user”, value : “*b*”} ,
{type : “match”, field : “user”, value : “fast”}
]
}
]
},
sort :
{
fields: [ {field :”time”, reverse : true},
{field : “user”, reverse : false} ]
}
}’ LIMIT 10000;
The response:
InvalidRequest: Error from server: code=2200 [Invalid query] message=”Unparseable JSON search: {
filter :
{
type : “boolean”, must :
[
{type : “range”, field : “time” lower : “2014/04/25”},
{type : “boolean”, should :
[
{type : “prefix”, field : “user”, value : “a”} ,
{type : “wildcard”, field : “user”, value : “*b*”} ,
{type : “match”, field : “user”, value : “fast”}
]
}
]
},
sort :
{
fields: [ {field :”time”, reverse : true},
{field : “user”, reverse : false} ]
}
}”
Thanks for your interest in the project.
It seems that there are several mistakes in the example query. Specifically, there is a missing comma in the 6th line, right after “time”, and the field names are wrong. The proper query would be:
SELECT * FROM tweets WHERE lucene = ‘{
filter :
{
type : “boolean”, must :
[
{type : “range”, field : “createdat”, lower : “2014-04-25”},
{type : “boolean”, should :
[
{type : “prefix”, field : “userid”, value : “a”} ,
{type : “wildcard”, field : “userid”, value : “*b*”} ,
{type : “match”, field : “userid”, value : “fast”}
]
}
]
},
sort :
{
fields: [ {field :”createdat”, reverse : true},
{field : “userid”, reverse : false} ]
}
}’ LIMIT 10000;
Also, please note that you should use vertical quotation marks and apostrophes (” and ‘) instead of their curly counterparts (“,” and ’).
We are fixing the example, thanks for reporting.
Hi,
I believe the query be:
SELECT * FROM tweets WHERE lucene = ‘{
filter : {type : “match”,
field : “message”,
value : “cassandra”}
}’ LIMIT 10;
with “message” instead of “text”.
Thank you Valeriu, you are totally right. We fixed it
Hi,
first of all thumbs up for great work done with lucene index !!!
Now for my question:
can you share in few words with us the internals of the implementation…
I see that we can define some refresh time in the index options.
As I understand the index itself is maintained by C* in a synchronous manner, the refresh time is more lucene internal.
Is there a kind of index caching ? Is the refresh defining the cache invalidation/refresh time?
Also can you share some details about resource consumption? Is the cache in memory, what about space requirements and limits?
Also a question that came up… how do I know that the refresh is working as defined? Could in theory the refresh stop to do hos work and my queries work on stale data?
many thanks in advance for your kind feedback
cheers
Heiko
Hello Heiko:
You can read about all your doubts in the project github documentation. Please note that it is versioned.
Also, if the documentation does not solve all your doubts, you should open a new github issue. We try to solve all issues as soon as possible.
Cheers
Hi Team, Thanks for your great work.
I am seeing your docker repo and git repo no check-in for more than 8 months, are you discontinuing the stratio-cassandra plugin???
please let me know.
Very confused to use on a secondary index. if I run the command in cqlsh without specified the secondary index. It doesn’t require to specify the field for the secondary index. But when I am trying to run the command in devcenter must have to specify the one column which also creates the secondary index.
Hi:
As stated in the github documentation, there is some imcompatible software that does not allow ‘expr’ queries.
Cheers
Hi guys, great work!
Is there a way to get the exact implementation of the lucene scoring formula? Could it be overridden?
Thanks in advance!
Daniel
Yes, you can see the exact implementation of NoIDFSimilarity. It is based on ClassicSimilarity but independent of doc frequency. This cannot be overridden.
Cheers
Thanks!
Hi guys, I have an extra question. Is there a way to run a select all the rows that “not match” a value?
Thanks in advance,
Daniel
I hope not component will be useful.
SELECT * FROM users WHERE expr(users_index, ‘{
filter: {not: [{type: “wildcard”, field: “name”, value: “*a”}]}
}’);
For reference: https://github.com/Stratio/cassandra-lucene-index/blob/3.11.3.0/doc/documentation.rst#all-search
Is copy fields possible? If yes, Can you please share me a link or source to implement it? Else, how can we achieve searching a keyword across entire table fields?