Data Loading Manifesto
by Simon Fell on February 1, 2006 at 04:58 PM
When loading a massive amount of
records in salesforce.com (in the multiple millions), there is no silver bullet
in terms of speed. There are a couple of factors which can improve client
performance, and a couple of factors that inherently exist and are out of the
control of the client.
The first factor is the speed and
efficiency in which the client prepares the data. The client may be reading
from a csv file, from a database, or some other data source. Client data
actions are generally very fast; when loading thousands of records they seem to
be almost instantaneous. But with such large operations every bit of efficiency
has a profound effect. If a client is loading 20 million records, then shaving
a hundredth of a second off of a per row operation would save 555 hours – or 3
weeks. In actuality, most client actions are measured in milliseconds already,
if that. But if your client does take a half second per record, it may be a
place to look for optimizations.
The second factor with a massive
import records is the fact that we are using a web services api – and must send
the data over the internet. Obviously not having direct access to the data
slows down processing. But there are a number of factors that can be minimized
when sending data over the internet. One of the more expensive operations is
opening a socket. Most of the time involved with opening a socket stems from
the SSL negotiation, but without SSL, the api request would not be secure.
Included in the http/1.1 protocol is support for reusing sockets among requests,
instead of having to re-open a socket per request as in http/1.0. This is also
known as persistent connections. Whether or not your client supports persistent
connections depends on the soap stack you are using. By default .Net uses
persistent connections, and while Apache Axis by default does not, if you change
the configuration to use the apache http-commons libraries, it will use http/1.1
and persistent connections.
Since there is some processing and
protocol associated with every request, we want to minimize the total number of
requests. The solution to this is very simple, batch as many records per
request as possible. Salesforce.com allows up to 200 records per insert/update
request, and in general you should send up a batch of 200. While simple, a
surprisingly large number of integrations fail to batch their requests and as a
result are severely hampered by sending up only 1 record at a time instead of a
large batch.
And while we want to send up the
most records per request possible, we also want to send up as small as a request
as possible. As it is easier to download a 2 MB file than a 100 MB file, the
fewer bits we send up, the faster they will arrive. To minimize the request
size, we use compression on both the request and the response. Gzip is the most
popular type of compression, and there are multiple posts on the community
boards at sforce.com on how to implement compression with different soap stacks,
The full Gzip analysis and discussion is available at Simon Fell’s blog post. Having a fast internet
connection is a given. If you are trying to load 20 million records over a
dial-up modem, be prepared to wait, as salesforce.com can only load data as fast
as you can get the data to it.
The third factor is the
salesforce.com processing of the data. The majority of the processing is done
in the database, and is out of the control of the client. This time spent in
the database is a deciding factor why clients do not see an increase in speed
with multiple threads. When trying to perform the same action concurrently,
contention for certain resources can occur. This can possibly make running
multiple threads slower than running an individual thread. Remember, than when
performing an insert or update, it is not just one simple operation that
salesforce.com is performing. Multiple security checks must be made to ensure
proper access and sharing. History and corresponding joined tables must be
written to. Especially in the case of updating the MRU (more discussed below),
having multiple threads can degrade performance.
However, there are certain
parameters that the client can set that will prevent the database from having to
do extra work. The biggest thing a client can do is do disable the “Most
Recently Used” (MRU) functionality of salesforce.com. After you insert or
update a record, the left side of the web user interface will update with a list
of most recently used entities. When doing a load of 20 million records, this
constant updating is absolutely unnecessary, since only the last few row will
ever appear in the MRU. The MRU is disabled by default in api version 7.0 but
is enabled by default for all lower versions. If you are using a lower version
of the api, such as 6.0, you can pass up a soap header to disable the MRU. (Note that this is different to the mruHeader defined for 7.0)
<sf:UpdateMostRecentlyUsedList xmlns:sf='urn:partner.soap.sforce.com'>false</sf:UpdateMostRecentlyUsedList>
Moreover, when doing a mass load,
you want to minimize the number of pre-operation checks that salesforce.com must
perform, as well as post-operation functions that may follow. For an example of
a pre-operation check , if you are updating 20 million accounts that you have
access to through a sharing rule, the operation will take much longer than if
you have the “modify all data” permission. The reason is that if you have
“modify all data” salesforce.com knows that you’ll be able to access the record,
but if you have access only through a sharing rule, an extra query must be
made. While inexpensive when doing a load of thousands, when performing a load
of millions, every bit helps. Remember our client efficiency example from
before. In general, the fewer number of sharing rules (or setting a public
read/write for the duration of the load) helps the load go quicker, as there are
fewer operations that have to be performed when setting things such as an
owner. Workflow or assignment rules are examples of settings that can slow down
an operation afterwards. Anything that causes a post-operation action falls
into this category, but one thing that is often overlooked are cascading
updates. If you update an owner of an account, the contacts must be updated,
and the access to opportunities may be affected as well. Instead of just
updating a single account, you are actually touching multiple entities.
As a finishing note, the AppExchange Data
Loader is a good reference for data loading. It disables the MRU, uses http/1.1
persistent connections, and gzip compression on the request and response. If
you are performing a data load, or are looking at a place to start when writing
your own java integration, the AppExchange Data Loader can serve as a fast and
reliable solution.
TrackBack
TrackBack URL for this entry: http://www.typepad.com/t/trackback/324572/4173117
Listed below are links to weblogs that reference Data Loading Manifesto:
Tracked on February 1, 2006 07:23 PM:
» Everything you wanted to know about loading lots of data into Salesforce but were afaid to ask from Simon Fell > Its just code
Lexi, infamous author of the AppExchange Data Loader (nee Sforce Data Loader, nee LexiLoader), wrote up the details of how to get the best performance when trying to load lots and lots of data into Salesforce via the API. This is when you really need t... [Read More]

Comments
Posted by Enough BS on February 2, 2006 07:36 AM:
Thanks for the post. A minor correction in reference to "If a client is loading 20 million records, then shaving a hundredth of a second off of a per row operation would save 555 hours – or 3 weeks." Shaving 1/100th of a second off 20,000,000 records = 200,000 seconds saved. 200,000/3600 = 55.5 hours.
Posted by snazzy on March 23, 2006 10:48 AM:
So banificial artical have uploaded