Data Loading Manifesto

by Simon Fell on February 1, 2006 at 04:58 PM

When loading a massive amount of records in salesforce.com (in the multiple millions), there is no silver bullet in terms of speed.  There are a couple of factors which can improve client performance, and a couple of factors that inherently exist and are out of the control of the client.

 

The first factor is the speed and efficiency in which the client prepares the data.  The client may be reading from a csv file, from a database, or some other data source.  Client data actions are generally very fast; when loading thousands of records they seem to be almost instantaneous.  But with such large operations every bit of efficiency has a profound effect.  If a client is loading 20 million records, then shaving a hundredth of a second off of a per row operation would save 555 hours – or 3 weeks.   In actuality, most client actions are measured in milliseconds already, if that.  But if your client does take a half second per record, it may be a place to look for optimizations.

 

The second factor with a massive import records is the fact that we are using a web services api – and must send the data over the internet.  Obviously not having direct access to the data slows down processing.  But there are a number of factors that can be minimized when sending data over the internet.  One of the more expensive operations is opening a socket.  Most of the time involved with opening a socket stems from the SSL negotiation, but without SSL, the api request would not be secure.  Included in the http/1.1 protocol is support for reusing sockets among requests, instead of having to re-open a socket per request as in http/1.0.  This is also known as persistent connections. Whether or not your client supports persistent connections depends on the soap stack you are using.  By default .Net uses persistent connections, and while Apache Axis by default does not, if you change the configuration to use the apache http-commons libraries, it will use http/1.1 and persistent connections. 

 

Since there is some processing and protocol associated with every request, we want to minimize the total number of requests.  The solution to this is very simple, batch as many records per request as possible.  Salesforce.com allows up to 200 records per insert/update request, and in general you should send up a batch of 200.  While simple, a surprisingly large number of integrations fail to batch their requests and as a result are severely hampered by sending up only 1 record at a time instead of a large batch. 

 

And while we want to send up the most records per request possible, we also want to send up as small as a request as possible.  As it is easier to download a 2 MB file than a 100 MB file, the fewer bits we send up, the faster they will arrive.  To minimize the request size, we use compression on both the request and the response.  Gzip is the most popular type of compression, and there are multiple posts on the community boards at sforce.com on how to implement compression with different soap stacks, The full Gzip analysis and discussion is available at Simon Fell’s blog post. Having a fast internet connection is a given.  If you are trying to load 20 million records over a dial-up modem, be prepared to wait, as salesforce.com can only load data as fast as you can get the data to it.

 

The third factor is the salesforce.com processing of the data.  The majority of the processing is done in the database, and is out of the control of the client.  This time spent in the database is a deciding factor why clients do not see an increase in speed with multiple threads.  When trying to perform the same action concurrently, contention for certain resources can occur.  This can possibly make running multiple threads slower than running an individual thread.  Remember, than when performing an insert or update, it is not just one simple operation that salesforce.com is performing.  Multiple security checks must be made to ensure proper access and sharing.   History and corresponding joined tables must be written to.  Especially in the case of updating the MRU (more discussed below), having multiple threads can degrade performance.

 

However, there are certain parameters that the client can set that will prevent the database from having to do extra work.  The biggest thing a client can do is do disable the “Most Recently Used” (MRU) functionality of salesforce.com.  After you insert or update a record, the left side of the web user interface will update with a list of most recently used entities.  When doing a load of 20 million records, this constant updating is absolutely unnecessary, since only the last few row will ever appear in the MRU.  The MRU is disabled by default in api version 7.0 but is enabled by default for all lower versions.  If you are using a lower version of the api, such as 6.0, you can pass up a soap header to disable the MRU. (Note that this is different to the mruHeader defined for 7.0)

<sf:UpdateMostRecentlyUsedList xmlns:sf='urn:partner.soap.sforce.com'>false</sf:UpdateMostRecentlyUsedList>

 

Moreover, when doing a mass load, you want to minimize the number of pre-operation checks that salesforce.com must perform, as well as post-operation functions that may follow.  For an example of a pre-operation check , if you are updating 20 million accounts that you have access to through a sharing rule, the operation will take much longer than if you have the “modify all data” permission.  The reason is that if you have “modify all data” salesforce.com knows that you’ll be able to access the record, but if you have access only through a sharing rule, an extra query must be made.  While inexpensive when doing a load of thousands, when performing a load of millions, every bit helps.  Remember our client efficiency example from before.  In general, the fewer number of sharing rules (or setting a public read/write for the duration of the load) helps the load go quicker, as there are fewer operations that have to be performed when setting things such as an owner.  Workflow or assignment rules are examples of settings that can slow down an operation afterwards.  Anything that causes a post-operation action falls into this category, but one thing that is often overlooked are cascading updates.  If you update an owner of an account, the contacts must be updated, and the access to opportunities may be affected as well.  Instead of just updating a single account, you are actually touching multiple entities. 

 

As a finishing note, the AppExchange Data Loader is a good reference for data loading.  It disables the MRU, uses http/1.1 persistent connections, and gzip compression on the request and response.  If you are performing a data load, or are looking at a place to start when writing your own java integration, the AppExchange Data Loader can serve as a fast and reliable solution.

 

TrackBack

TrackBack URL for this entry: http://www.typepad.com/t/trackback/324572/4173117

Listed below are links to weblogs that reference Data Loading Manifesto:

Tracked on February 1, 2006 07:23 PM:

» Everything you wanted to know about loading lots of data into Salesforce but were afaid to ask from Simon Fell > Its just code
Lexi, infamous author of the AppExchange Data Loader (nee Sforce Data Loader, nee LexiLoader), wrote up the details of how to get the best performance when trying to load lots and lots of data into Salesforce via the API. This is when you really need t... [Read More]

Comments

Posted by Enough BS on February 2, 2006 07:36 AM:

Thanks for the post. A minor correction in reference to "If a client is loading 20 million records, then shaving a hundredth of a second off of a per row operation would save 555 hours – or 3 weeks." Shaving 1/100th of a second off 20,000,000 records = 200,000 seconds saved. 200,000/3600 = 55.5 hours.

Posted by snazzy on March 23, 2006 10:48 AM:

So banificial artical have uploaded

Post a comment

If you have a TypeKey or TypePad account, please Sign In