Kaniska Mandal kaniska

kaniska / SparkNLP_SparkML_Similarity_Test.scala
Created September 6, 2020 03:02
Semantic Similarity using Spark NLP and Spark ML
// Databricks notebook source
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.{BertEmbeddings, SentenceEmbeddings, WordEmbeddingsModel}
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher, RecursivePipeline}
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.{BucketedRandomProjectionLSH, BucketedRandomProjectionLSHModel, LSH, Normalizer, SQLTransformer}
import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val path = "/public/retail_db"
val products = sc.textFile(path + "/products")
val minPricedProductsByCategory = products.
  filter(product => product.split(",")(4) != "").
  map(product => {
    val p = product.split(",")
    (p(1).toInt, product)
  }).
  reduceByKey((agg, product) => {
val path = "/public/retail_db" // or "/Users/itversity/Research/data/retail_db"
val orderItems = sc.textFile(path + "/order_items").
  map(orderItem => (orderItem.split(",")(1).toInt, orderItem.split(",")(4).toFloat))
// Compute revenue for each order
orderItems.
  reduceByKey((total, orderItemSubtotal) => total + orderItemSubtotal).
  take(100).
  foreach(println)
/**
*
*/
package com.xyz.topology.netflow.beam;
import java.util.Properties;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
kaniska / docker-orientation-for-node-developers.md
Last active September 21, 2015 17:09 — forked from subfuzion/docker-orientation-for-node-developers.md
Docker Orientation for Node Developers