Imagine the scenario: you have millions of customer records in an existing SQL Server instance and want to tie things together with your shiny new Mongo xDB. Where to start?!
Hopefully the following information will help guide you towards the different areas that will need researching and developing. It's worth noting that if you follow these steps, I'd recommend the free Mongo University courses (https://university.mongodb.com/) to get acquainted with how Mongo and its queries work.
For this demo assume the following example infrastructure:
- An existing on-premises SQL Server instance containing user data and order records. The Sitecore deployment has no read/write access to this instance.
- Data is structured so that a user can have 0, 1 or more orders
- Sitecore, SQL Server and Mongo are all deployed to a cloud hosting provider. Assume Mongo is used for both xDB and session state
So, how do we get the data from a relational database into xDB?
There are several options here; if the Sitecore instance and the on-premises database can talk to each other you may choose slightly different approaches – for now let's assume they can't.
You can get data out of SQL Server in many ways; one simple option is to right-click the database in question and follow the 'Export Data' wizard. Here you can specify things like the source database, the destination database or files, queries to run etc. I've chosen to use CSV flat files as the interim data storage. Tip: remember to check 'Column names in the first data row' – it will make life easier when you come to import into Mongo.
One key difference between SQL and Mongo is the way you can represent linked data. The CSV will contain something like:
| UserID | OrderID | Order total |
|--------|---------|-------------|
| 123    | 456     | 50          |
| 123    | 789     | 150         |
| 280    | 535     | 20          |
Note: user 123 has made 2 orders.
Compare this to Mongo (note: this isn't the only option for storing data in Mongo, but in this scenario the model fits well with the Sitecore approach to contact facets):
- User: 123
  - orders
    - ID: 456, Cost: 50
    - ID: 789, Cost: 150
- User: 280
  - orders
    - ID: 535, Cost: 20
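Purely for illustration, the same data expressed as an embedded document via the Mongo shell might look like this (the collection and field names here are hypothetical – the actual xDB facet schema is covered below):

```javascript
// Hypothetical embedded-document shape for user 123 - not the xDB schema, just the concept
db.users.insert({
    UserID: 123,
    orders: [
        { OrderID: 456, OrderTotal: 50 },
        { OrderID: 789, OrderTotal: 150 }
    ]
});
```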
Lots of data? Don’t panic..!!
When dealing with big sets of data in Mongo, bulk operations are your friend – they will make things much quicker! Based on that, I decided to blast the whole CSV into a temporary Mongo collection. From a command prompt, run:
```
mongoimport -d DBNAME -c COLLECTIONNAME --type csv --file FILENAME --headerline
```

e.g.

```
mongoimport -d import_test -c CRM --type csv --file users.csv --headerline
```
Note, read the mongo docs for more info on mongoimport.
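Once the import finishes, a quick sanity check from the Mongo shell (assuming the `import_test` database and `CRM` collection names used above) confirms the row count and lets you eyeball a document:

```javascript
var importDb = db.getSiblingDB("import_test");
importDb.CRM.count();    // should match the number of data rows in the CSV
importDb.CRM.findOne();  // one flat document per CSV row, keyed by the header names
```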
All well and good, but the data looks just like a SQL table!
True, so we now need to process it into a format that we want for xDB.
Sitecore defines a schema for the xDB data based around Contacts and Contact facets. For example, for a given user you might have a facet that represents all of that user's orders. I won't go into too much detail on this – see here for some background.
The format you’d then expect to see within xDB and Mongo would be:
- Contacts
  - Contact: _id
    - Customer (this name is up to you)
      - Orders (this name is up to you)
        - Order Id: X, Order Details: Y
        - Order Id: Z, Order Details: Q
        - etc
OK, so we need to map from rows to structured data, right?
This is a pretty common problem to solve when working with databases and the solution here is the result of several attempts, each with mixed results! 🙂
The code I arrived at was pretty specific to the exact schema we have – do ask if you want a copy. I used TypeScript to generate the JavaScript files used by the Mongo shell, as it gave me type-based development in a few areas.
The flow of the operations was:
- Bulk import the CSV file into a temporary Mongo database (CRM) and then access it via Mongo scripts:

```javascript
// The temporary import database (import_test in the mongoimport example above)
var sourceDb = db.getSiblingDB("import_test");
```
- Batch the import of the data via queries based on information we know about the data – e.g. process each user whose email starts with a given key:

```javascript
var externalUsers = sourceDb.CRM.find({ "EmailAddress": new RegExp("^[" + key + "].+$") });
```
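The original scripts drive `key` from outside this query; purely as an assumed sketch, one way to batch it is to loop over the possible leading characters of the email address:

```javascript
// Hypothetical batching driver (not from the original scripts) - adjust the
// character set to whatever your email data actually starts with
var keys = "abcdefghijklmnopqrstuvwxyz0123456789".split("");
keys.forEach(function (key) {
    var externalUsers = sourceDb.CRM.find({ "EmailAddress": new RegExp("^[" + key + "].+$") });
    // ...run the contact/identifier/order processing shown in the next step against this batch...
    print("Processing batch for key: " + key);
});
```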
- Iterate through each entry in externalUsers, but don't write them into your analytics db one by one; instead store them in an array, insert a batch every N users (this will be much faster!) and then reset the array. The createdUsers dictionary is kept so that users with more than one order only get a single contact:

```javascript
// Assumes sourceDb and externalUsers are defined as shown above, and that logger (a simple
// log/progress helper), dumpInterval (the batch size) and destinationDb (your xDB analytics
// database) are defined elsewhere in the runner script.
var index = 0;
var runningTotal = 0;
var startDate = new Date();
var contactArray = new Array();
var identifierArray = new Array();
var createdUsers = {};          // dictionary keyed by email, used to spot repeat users
var orderArray = new Array();

// Simple v4-style GUID generator used for new contact ids
var Guid = (function () {
    function Guid() {
    }
    Guid.NewGuid = function () {
        return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function (c) {
            var r = Math.random() * 16 | 0, v = c == 'x' ? r : (r & 0x3 | 0x8);
            return v.toString(16);
        });
    };
    return Guid;
})();

// Flush the current batch into the destination collections and reset the arrays
function insertToDestinationDb() {
    runningTotal += contactArray.length;
    logger.log("Contacts found: " + contactArray.length + " - running total: " + runningTotal);
    destinationDb.Contacts.insert(contactArray);
    destinationDb.Identifiers.insert(identifierArray);
    destinationDb.Orders.insert(orderArray);
    destinationDb.EmailsTemp.insert(contactArray);
    contactArray = new Array();
    identifierArray = new Array();
    orderArray = new Array();
}

externalUsers.forEach(function (doc) {
    var guid = Guid.NewGuid();
    var existingUser = createdUsers[doc.EmailAddress];
    if (existingUser == null) {
        // In the original import hashedEmail was a hashed form of the email address;
        // substitute your own identifier hashing here.
        var hashedEmail = doc.EmailAddress;
        contactArray.push(Schema.Contact(guid, doc.EmailAddress, doc));
        identifierArray.push(Schema.Identifier(guid, hashedEmail));
        createdUsers[doc.EmailAddress] = { id: guid };
    }
    // Every CSV row carries an order, even for users we've already created
    orderArray.push({ "email": doc.EmailAddress, "order": Schema.Order(doc) });
    index++;
    logger.progress(startDate, index, "");
    if (index % dumpInterval == 0) {
        insertToDestinationDb();
    }
});

// Flush any remaining partial batch
if (contactArray.length > 0) {
    insertToDestinationDb();
}
```
Note, the methods for the schema generation are:
```javascript
var Schema = (function () {
    function Schema() {
    }
    // Builds the contact document, including the Customer facet
    Schema.Contact = function (guid, email, document) {
        return {
            _id: CSUUID(guid),
            "System": {
                "VisitCount": 0,
                "Value": 0
            },
            "Identifiers": {
                "Identifier": email
            },
            "Lease": {
                "ExpirationTime": new Date(),
                "Owner": {
                    "Type": 0
                }
            },
            "Customer": {
                "Hash": email,
                "Email": document.EmailAddress,
                "MemberID": document.MemberID
            }
        };
    };
    // Builds the identifier document that maps an email back to a contact id
    Schema.Identifier = function (guid, email) {
        return {
            _id: email,
            "contact": CSUUID(guid),
            "updated": new Date()
        };
    };
    // Schema.Order (used in the import loop above) built the order sub-document from a
    // CSV row; its shape was specific to our CRM schema so it isn't reproduced here.
    return Schema;
})();
```
- This should create 4 collections for you: Contacts, Identifiers, Orders and EmailsTemp – the last 2 are temporary collections used later. The CSUUID helper functions can be found here.
- Now we should have contacts and identifiers filled out, but not associated orders.
- Finally we need to glue them together – this is where the temporary collections come into play:
```javascript
var OrderMapping = (function () {
    function OrderMapping(logger, dumpInterval) {
        this.logger = logger;
        this.dumpInterval = dumpInterval;
    }
    OrderMapping.prototype.MapOrders = function (destinationDb) {
        var _this = this; // keep a reference for use inside the forEach callbacks
        var contacts = destinationDb.EmailsTemp.find();
        _this.logger.log("OrderMapping - creating indexes");
        destinationDb.Orders.createIndex({ "email": 1 });
        destinationDb.Contacts.createIndex({ "Customer.Email": 1 });
        _this.logger.log("OrderMapping - creating indexes complete");
        var index = 0;
        var startDate = new Date();
        contacts.forEach(function (contact) {
            if (contact.Customer != null) {
                var customerEmail = contact.Customer.Email;
                var matching = destinationDb.Orders.find({ "email": customerEmail });
                if (matching != null) {
                    var customerOrders = {};
                    var orderIndex = 0;
                    var overCap = false;
                    var cap = 50000; // guard against breaching Mongo's 16MB document limit
                    matching.forEach(function (order) {
                        if (orderIndex < cap) {
                            customerOrders[orderIndex] = order.order;
                        } else {
                            overCap = true;
                        }
                        orderIndex++;
                    });
                    if (overCap) {
                        _this.logger.log("EEEK - '" + customerEmail + "' contact has > " + cap + " Orders.");
                    }
                    index++;
                    destinationDb.Contacts.update(
                        { "Customer.Email": customerEmail },
                        { $set: { "Customer.OrderHistory": customerOrders } });
                    _this.logger.progress(startDate, index, customerEmail);
                }
            }
        });
        _this.logger.log("Processed: " + index + " items");
        // Clean up the temporary collections once the order histories are attached
        destinationDb.Orders.remove({});
        destinationDb.EmailsTemp.remove({});
        _this.logger.log("Purged Orders and EmailsTemp collections");
    };
    return OrderMapping;
})();
```
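For completeness, a sketch of how this might be invoked at the end of the runner script (the logger helper, dumpInterval and destinationDb handle are whatever your own script defines, as in the import step above):

```javascript
// Hypothetical invocation - wire in your own logger, batch size and analytics db handle
var mapper = new OrderMapping(logger, dumpInterval);
mapper.MapOrders(destinationDb);
```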
What issues did I run into?
- No more power! (well, memory) – if you try to store a huge array in your scripts you will quickly run out of juice and the import will grind to a halt
- You mis-map the Mongo properties so that either the data stored is null or it doesn't match the Sitecore facet properties – here you simply get a nasty 'can't convert type' Mongo driver exception
- What’s going on with the import? You can dump the output of any mongo script to a file via: mongo runner.js > output.txt
- Querying large sets of data can be slow – make sure you set up indexes on the collections if you need to do a lot of cross-referencing, e.g.: db.CRM.createIndex({"EmailAddress":1})
- Sitecore expects arrays of data in a slightly unusual format – beware: it's not a standard Mongo array (see the sketch after this list)
- Mongo has a 16MB size limit per document – the cap in the OrderMapping prevents the import creating giant order histories. You will want to review this in your implementation!
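To illustrate the point about arrays: where a plain Mongo document would store the orders as a true array, the OrderMapping code above builds an object keyed by numeric index, which is the shape the Sitecore facet data needs here. A rough comparison:

```javascript
// Standard Mongo style: a true array
var standard = { "OrderHistory": [ { "OrderID": 456 }, { "OrderID": 789 } ] };

// Shape written by the OrderMapping step above: an object keyed by index
var sitecoreStyle = { "OrderHistory": { "0": { "OrderID": 456 }, "1": { "OrderID": 789 } } };
```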
There is a lot of information here (sorry about that!) – mainly due to the fact it took several iterations of code to arrive at a solution that even completed, let alone in a timely manner!