How to bulk import contacts from SQL Server into Sitecore Mongo xDB

Imagine the scenario, you have millions of customer records in an existing SQL server instance and want to tie things together with your shiny new Mongo xDB. Where to start?!?!

Hopefully the following information will help guide you towards the different areas that will need researching and developing. It’s worth noting, if you follow these steps I’d recommend the Mongo University free courses (https://university.mongodb.com/) to get acquainted with how Mongo and it’s queries work.

For this demo assume the following example infrastructure:

  • Existing on premise SQL server instance containing: user data & order records. The Sitecore deployment has no r/w access to this instance.
    • Data is structured so that a user can have: 0, 1 or more orders
  • Sitecore, SQL server and Mongo all deployed to a cloud hosted provider. Assume Mongo is used for both xDB and session

So, how do we get the data from a relational database into xDB?

There are several options here, if the Sitecore instance and the on-premise database can talk you may choose slightly different approaches – for now lets assume not.

You can get data out of sql in many ways, one simple option is to right-click the database in question and follow the ‘export data’ wizard. Here you can specify things like source databases, destination dbs or files, queries to run etc. I’ve chosen to use CSV flat files as the interim data storage. Tip, remember to check ‘Column names in the first data row’ – it will make life easier when you come to import into mongo.

One key difference between SQL and Mongo is the way you can represent linked data. The CSV will contain something like:

UserID OrderID Order total
123 456 50
123 789 150
280 535 20

Note: user 123 has made 2 orders.

Compare this to Mongo (*note, this isn’t the only option for storing data in Mongo. In this scenario the model fits well with the Sitecore approach to contact facets.):

  • User: 123
    • orders
      • ID: 456, Cost: 50
      • ID: 789, Cost: 150
  • User: 280
    • orders
      • ID: 535, Cost: 20

Lots of data? Don’t panic..!!
When dealing with big sets of data in Mongo, bulk operations are your friend – they will make things much quicker! Based on that I decided to blast the whole CSV into a temporary Mongo collection. Via a cmd prompt run:

Note, read the mongo docs for more info on mongoimport.

All well and good, but the data looks just like a sql table!
True, so we now need to process it into a format that we want for xDB.

Sitecore defines a schema for the xDB data based around Contacts, and Contact facets. Examples of this could be: for a given user, you have a facet that represents all the user’s orders. I won’t go into too much details on this – see here for some background.

The format you’d then expect to see within xDB and Mongo would be:

  • Contacts
    • Contact: _id
      • Customer (this name is up to you)
        • Orders (this name is up to you)
          • Order Id: X, Order Details: Y
          • Order Id: Z, Order Details: Q
          • etc

Ok, we need to map from rows to structured data right?
This is a pretty common problem to solve when working with databases and the solution here is the result of several attempts, each with mixed results! 🙂

The code I arrived at was pretty specific to the exact schema we have, do ask if you want a copy. I used typescript to then generate the javascript files used by the Mongo shell as it gave me type based development in a few areas.

The flow of the operations was:

  1. Bulk import the csv file into a temporary Mongo db (CRM) and then access via Mongo scripts:
  2. Batch the import of the data via queries based on information we know of the data – e.g. process each user whose email starts with a given key:
  3. Iterate through each entry in externalUsers but don’t write them into your analytics db one by one, instead store in an array and then insert batches every N users (this will be much faster!) and then reset the array. The reason for storing the createdUser dictionary is to handle users with more than 1 order:

    1. Note, the methods for the schema generation are:
  4. This should create you 4 collections: Contacts, Identifiers, Orders and EmailsTemp. The last 2 are temporary collections used later. CSUUID helper functions can be found: here.
  5. Now we should have contacts and identifiers filled out, but not associated orders.
  6. Finally we need to glue them together – this is where the temporary collections come into play:

What issues did I run into?

  • No more power! (well, memory) – if you try to store a huge array in your scripts you will quickly run out of juice and the import will grind to halt
  • You miss-map the Mongo properties so that either: the data stored is null or doesn’t match the Sitecore facet properties – here you simply get a nasty ‘can’t convert type’ mongo driver exception
  • What’s going on with the import? You can dump the output of any mongo script to a file via:  mongo runner.js > output.txt
  • Querying large sets of data can be slow – make sure you setup indexes on the collections if you need to do a lot of cross referencing e.g.: db.CRM.createIndex({“EmailAddress”:1})
  • Sitecore expects arrays of data in a slightly unusual format – beware: it’s not a standard Mongo array
  • Mongo has a 16MB size limit per document – the cap in the OrderMapping prevents the import creating giant order histories. You will want to review this in your implementation!

There is a lot of information here (sorry about that!) – mainly due to the fact it took several iterations of code to arrive at a  solution that even completed, let alone in a timely manner!