MongoDB collection format mass modifications

MongoDB is a NoSQL database, and one of the main differences compared to a relational DB is that a single collection can hold documents with different structures. In fact, you don't really have to care about the structure of your documents or the type of the data you are storing.

This is particularly true when you start using your Mongo instance with a language like PHP or JS, where the language itself does not manage data types explicitly.

As a consequence, your document format may not be optimized (like storing numbers as strings and then searching on them later), and once you decide to move your backend to a more strongly typed language like Java, you have to make this more robust and organized. At this point you may want to reformat your documents to use the right types, and eventually migrate your previous documents to a new version.

This kind of operation can be done “easily” with the Mongo shell scripting language. This post details two of the available constructs to manipulate and transform documents.

The objective is to select the documents matching a criterion, such as a field having a certain type you want to change, and then transform the format of each of them. You can identify documents by the type of a field with a find clause:

{ myField : { $type : typeId }}

The field name and typeId have to be adapted to your case. The list of BSON type ids is available in the MongoDB documentation for the $type operator.
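As a concrete illustration (the field name price is an assumption, not from the original), BSON type id 2 means "string", so the following filter matches documents whose price is stored as a string:

```javascript
// Hypothetical example: "price" is an assumed field name.
// BSON type id 2 = string, so this matches string-typed price values.
var stringPriceFilter = { price: { $type: 2 } };
// In the mongo shell you would run: db.collectionName.find(stringPriceFilter)
```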

for ( var i = 0 ; i < 3600 ; i++ ) {
   db.collectionName.find( { 'fieldToSearch' : { $type : typeId } } )
                    .limit(1000)
                    .forEach( function (x) {

          x.field1 = NumberInt(x.field1);      // string -> 32-bit integer
          x.field2 = parseFloat(x.field2);     // string -> double
          x.field3 = (x.field3 === "true");    // string -> boolean
          db.collectionName.save(x);           // one write per document

   });
}

In this example we process 1000 documents at a time, but they are committed one by one. The outer loop repeats this a sufficient number of times to cover all the existing documents.
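The per-document conversions above can be expressed as a plain-JavaScript sketch (NumberInt is a mongo shell wrapper, so parseInt stands in for it here; the field names mirror the example):

```javascript
// Plain-JavaScript sketch of the per-document conversions from the loop above.
// parseInt replaces the shell-only NumberInt wrapper for illustration.
function convertDoc(doc) {
  return {
    field1: parseInt(doc.field1, 10),   // string -> integer
    field2: parseFloat(doc.field2),     // string -> float
    field3: doc.field3 === "true"       // string -> boolean
  };
}

var converted = convertDoc({ field1: "42", field2: "3.5", field3: "true" });
```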

This approach is easy to write but not performant, and you can wait a really long time when processing collections with 1M or more documents.

The best way is to transform this into bulk operations; the result is exactly the same, with much better performance:

var bulk = db.collectionName.initializeUnorderedBulkOp();
var count = 0;
db.collectionName.find({
  fieldToSearch : { $type : typeId }
}).forEach(function(x) {
    // Copy the value into a temporary field with the expected type
    // and drop the original field...
    bulk.find({ "_id": x._id })
        .updateOne({
           "$set": { "_fieldToSearch": x.fieldToSearch.valueOf() },
           "$unset" : { "fieldToSearch" : 1 }
        });
    // ...then rename the temporary field back to the original name.
    bulk.find({ "_id": x._id })
        .updateOne({ "$rename": { "_fieldToSearch": "fieldToSearch" } });

    count++;
    if ( count % 5000 == 0 ) {
        bulk.execute();
        bulk = db.collectionName.initializeUnorderedBulkOp();
    }
});
if ( count % 5000 != 0 ) bulk.execute();

Basically, we can queue multiple operations for each document matching the search; here we create a new field with the expected type to replace the initial one.

The operations stay pending until the bulk is executed, all at once, every 5000 documents. The last line executes the remaining pending operations at the end.
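Newer shells and drivers expose the same idea through db.collection.bulkWrite(), which takes an array of operations. A sketch of the two-step rewrite above in that form (field names mirror the example; the helper name makeOps is ours):

```javascript
// Sketch: the same two-step rewrite as a bulkWrite operations array.
// "makeOps" is a hypothetical helper name; field names mirror the example above.
function makeOps(doc) {
  return [
    { updateOne: { filter: { _id: doc._id },
                   update: { $set:   { _fieldToSearch: doc.fieldToSearch.valueOf() },
                             $unset: { fieldToSearch: 1 } } } },
    { updateOne: { filter: { _id: doc._id },
                   update: { $rename: { _fieldToSearch: "fieldToSearch" } } } }
  ];
}
// In the shell, batches of these would be passed to db.collectionName.bulkWrite(ops).
```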

Identify duplicates

Sometimes you have duplicate documents. Here is a way to identify them:

db.collectionName.aggregate([
  { "$group" : { "_id": "$fieldWhereDuplicates", "count": { "$sum": 1 } } },
  { "$match": { "_id" : { "$ne" : null }, "count" : { "$gt": 1 } } },
  { "$project": { "fieldWhereDuplicates" : "$_id", "_id" : 0 } }
])
