Open Enterprise Master Patient Index


OpenEMPI 3.3.0 Release now available!

We are pleased to release the 3.3.0 version of the entity edition of OpenEMPI. This release includes many bug fixes, a few new features, and significant improvements to the performance of the system. You can read more about this release in the release notes page. In addition to the community release of version 3.3.0, there is also a new release of the commercial edition of OpenEMPI that provides additional features, scalability features, and support.

Another option available for evaluating OpenEMPI if you are currently using the Amazon cloud is to create an instance using an image we have created. The image is identified using the AMI: ami-23a54b35 or by the name openempi-3.3.0c-ubuntu-16.04-2. This image has OpenEMPI 3.3.0 installed and configured and we have loaded and matched some data so that the instance is ready to work with. If there are any other images you would like to see available to ease your evaluation of OpenEMPI, please let us know.

For information on commercial licenses and support for the new edition of OpenEMPI, please contact

An Introduction to Blocking Algorithms in OpenEMPI

One question we get quite often about setting up an instance of OpenEMPI has to do with the configuration of the blocking algorithm. Everyone is familiar to some extent with matching since that is what an EMPI is supposed to do but what exactly is blocking? In this blog entry we first explain what blocking is about conceptually and in a subsequent entry we will get into the specifics of how to configure the blocking algorithm of your OpenEMPI instance.

When a new record is added to OpenEMPI, the system needs to determine if there are any existing records that match this new record. If any such matching records are found, a link is created between each of the existing matching records and the new record, since all those records are considered to be duplicate records that refer to the same physical patient. The brute force approach for identifying such duplicate records would be to compare the new record against every other record in the system and have the matching algorithm evaluate every such pair. This approach would certainly work but it does not scale. If the performance implications of the brute force approach are not easy to see when considering what happens when one new record is added, they become especially clear when you consider the effect of the brute force strategy when you are first setting up your instance of OpenEMPI. Let’s say you have aggregated the patient records from all the facilities that you are integrating together and you have come up with a grand total of 100,000 records (this is a fairly small federation of healthcare facilities). You now need to run the matching algorithm against all records in the system to identify all matching record pairs and link them together. If you use the brute force approach to generate all record pairs that need to be compared with one another, you will end up having to evaluate n(n-1)/2 pairs, which for 100,000 records is roughly 5 billion record pairs. Clearly there has to be a better way of approaching this problem.
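To make the scale concrete, here is a quick back-of-the-envelope check in Python. It is only an illustration of the arithmetic and not part of OpenEMPI; the function name is ours. Counting each pair once gives about 5 billion pairs for 100,000 records, while comparing every record against every record in both directions gives 10 billion comparisons.

```python
import math


def brute_force_pairs(n: int) -> int:
    """Number of distinct unordered record pairs among n records: n(n-1)/2."""
    return math.comb(n, 2)


n = 100_000
unordered = brute_force_pairs(n)  # each pair counted once
ordered = n * n                   # every record against every record, both directions
print(f"{unordered:,} unordered pairs, {ordered:,} ordered comparisons")
```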

This is where blocking comes in. The purpose of blocking is to identify record pairs that are likely to result in a match and eliminate record pairs that will certainly not do so. For example, there is no reason to compare the patient record of Odysseas Pentakalos from Virginia with the record for Jane Smith from Michigan because it is close to impossible that these two records refer to the same patient. The blocking algorithm works by identifying candidate records from a source record that may potentially result in a match, and the system generates all record pairs between the source record and each of the candidate records and passes those pairs down the pipeline for evaluation.

The blocking algorithm currently used by OpenEMPI operates by partitioning the complete set of records in the system into blocks (which explains where the name “blocking” algorithm came from) and then only comparing record pairs formed by pairing together the records in each block. This results in a considerable reduction in the number of record pairs evaluated. The next question is how is the partitioning of the records done? When configuring the blocking algorithm you must select one or more record attributes forming what is referred to as the blocking key. The system then partitions the complete set of records into blocks that have the same value for the blocking key. For example, let’s say that we choose the zip or postal code as the single attribute comprising our blocking key. The system will then form blocks for each distinct zip code in our records and all the records in each block will have the same value in the zip code field. To quantify the improvement in reducing the number of record pairs evaluated in this example, let’s assume our repository had a total of 24 records in it, with 6 distinct values for the zip code in those records. To simplify the calculation, we assume that the blocking key of zip code evenly partitions the 24 records into 6 blocks of 4 records in each block. Without using the blocking algorithm, the brute force approach would require that we evaluate n(n-1)/2 or 276 pairs. By using the blocking algorithm, we have 6 blocks generating 6 record pairs each (4×3/2) for a total of only 36 record pairs. The blocking algorithm allowed us to reduce the total number of record pairs that we have to evaluate from 276 to 36 or a factor of almost 8.
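The 24-record example can be reproduced with a few lines of Python. This is a minimal sketch of the partitioning idea only, not the OpenEMPI implementation; the function names, the record layout and the zip code values are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, blocking_key):
    """Partition records into blocks keyed by the values of the blocking-key attributes."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record[attr] for attr in blocking_key)
        blocks[key].append(record)
    return blocks


def candidate_pairs(blocks):
    """Yield every record pair formed within a block; cross-block pairs are never produced."""
    for members in blocks.values():
        yield from combinations(members, 2)


# 24 hypothetical records spread evenly over 6 made-up zip codes.
zip_codes = ["10000", "20000", "30000", "40000", "50000", "60000"]
records = [{"id": i, "zip": zip_codes[i % 6]} for i in range(24)]

blocks = build_blocks(records, blocking_key=("zip",))
blocked = sum(1 for _ in candidate_pairs(blocks))
brute_force = len(records) * (len(records) - 1) // 2
print(brute_force, blocked)  # 276 36
```

Because the blocking key is passed as a tuple of attribute names, the same sketch works for a key made up of more than one attribute.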

The blocking algorithm is not only used for the initial evaluation of record pairs into the system but also whenever new records are added or updated. For example, when a new record is added to the system with zip code 10000, the blocking algorithm identifies all the records in the system with zip code 10000, forms record pairs between the new record and each of the existing records with zip code 10000 and passes those record pairs on to the matching algorithm for evaluation.
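The same idea applied to a single incoming record might look like the sketch below. The BlockingIndex class is a toy in-memory stand-in that we made up for illustration; OpenEMPI keeps its blocking data in its own repository, and nothing here reflects its actual API.

```python
from collections import defaultdict


class BlockingIndex:
    """Toy in-memory index from blocking-key value to the records that share it."""

    def __init__(self, blocking_key):
        self.blocking_key = blocking_key
        self._blocks = defaultdict(list)

    def _key(self, record):
        return tuple(record[attr] for attr in self.blocking_key)

    def candidates(self, record):
        """Existing records that fall in the same block as the given record."""
        return list(self._blocks.get(self._key(record), []))

    def add(self, record):
        """Return the pairs to hand to the matching algorithm, then index the new record."""
        pairs = [(record, existing) for existing in self.candidates(record)]
        self._blocks[self._key(record)].append(record)
        return pairs


index = BlockingIndex(blocking_key=("zip",))
index.add({"id": 1, "zip": "10000"})
index.add({"id": 2, "zip": "10000"})
index.add({"id": 3, "zip": "20000"})

# Only the two existing records with zip code 10000 are paired with the new record.
pairs = index.add({"id": 4, "zip": "10000"})
print([(new["id"], old["id"]) for new, old in pairs])  # [(4, 1), (4, 2)]
```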

There are quite a few blocking algorithms that researchers have devised over the years. If you are interested in learning more about the various algorithms that are available, I highly recommend the excellent survey article by Peter Christen [1]. The ultimate goal for the blocking algorithm is to reduce the number of record pairs that need to be evaluated for a match, but the algorithm cannot get overly aggressive because then it will start eliminating record pairs that will result in a match, causing the system to generate many false negatives (this is a concise way to refer to record pairs that were classified as non-matches where in reality they are matches). Achieving the right balance between reducing the number of record pairs to be evaluated while not eliminating matching pairs from evaluation requires careful configuration of the algorithm. In a future entry we will discuss the proper configuration of the blocking algorithm used in OpenEMPI.
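One way to see this trade-off is to measure two quantities commonly used in the record linkage literature: the reduction ratio (the fraction of all pairs that blocking removed) and pairs completeness (the fraction of true matches that survived blocking). The sketch below uses made-up records and a hand-labeled set of true matches, and it enumerates all pairs for simplicity, so it is only suitable for toy data.

```python
from itertools import combinations


def pair_id(a, b):
    """Order-independent identifier for a record pair."""
    return frozenset((a["id"], b["id"]))


def evaluate_blocking(records, key_fn, true_matches):
    """Return (reduction_ratio, pairs_completeness) for a blocking key function."""
    all_pairs = len(records) * (len(records) - 1) // 2
    candidates = {
        pair_id(a, b)
        for a, b in combinations(records, 2)
        if key_fn(a) == key_fn(b)
    }
    reduction_ratio = 1 - len(candidates) / all_pairs
    pairs_completeness = len(candidates & true_matches) / len(true_matches)
    return reduction_ratio, pairs_completeness


# Made-up data: records 1 and 2 are the same person, but record 2 has a mistyped zip code.
records = [
    {"id": 1, "surname": "Smith", "zip": "48104"},
    {"id": 2, "surname": "Smith", "zip": "48140"},
    {"id": 3, "surname": "Jones", "zip": "48104"},
    {"id": 4, "surname": "Brown", "zip": "22030"},
]
true_matches = {frozenset((1, 2))}

by_zip = evaluate_blocking(records, lambda r: r["zip"], true_matches)
by_zip3 = evaluate_blocking(records, lambda r: r["zip"][:3], true_matches)
print(by_zip)   # strict key: few pairs, but the true match is lost
print(by_zip3)  # looser key: the match survives at the cost of more pairs
```

With the full zip code as the key, the mistyped record never meets its duplicate, which is exactly the false negative described above. Blocking on the first three digits keeps the pair but evaluates more candidates.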

  1. Christen, Peter, “A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication”, IEEE Transactions on Knowledge and Data Engineering, 2011.

OpenEMPI 3.2.0 Release now available

We are pleased to release the 3.2.0 version of the entity edition of OpenEMPI. This release of the entity edition includes many bug fixes and many new features including a reporting capability, enhancements to the probabilistic algorithm to improve matching performance in the presence of null values, a new transformation function, the ability to chain transformation functions, and many other features. You can read more about this release in the release notes page. This release is only available for customers that have purchased a support license. The next release which we plan to release in October will be available to the open source community and will include features and fixes from this release.

For information on commercial licenses and support for the new edition of OpenEMPI, please contact

The 3.1.0 release of the Entity edition of OpenEMPI has now been released

We are pleased to release a new version of the entity edition of OpenEMPI. This release of the entity edition includes many bug fixes, stability enhancements, and performance improvements. Although a number of features have been added, the main focus of our work has been on increasing the stability and performance of the Entity edition of OpenEMPI.

The latest release is available from the new download site at the OpenEMPI Downloads page. Installation instructions and other documentation are available at the documentation page. If you have any questions, please visit the forums. To contribute, follow the Contribution instructions at the documentation page and send in your bug reports, documentation contributions, or new enhancements.

Along with this new release of OpenEMPI we are also pleased to announce the release of OpenXDS 2.0. This is an update of the OpenXDS software with many fixes and improvements, including a partial implementation of the XDS Metadata Update specification. This OpenXDS version has been customized to integrate with OpenEMPI as the EMPI implementation, so no additional configuration or changes are needed to use the two services together. The OpenXDS 2.0 release is intended for use with the OpenEMPI Entity Edition 3.1.0 release. Documentation for this version of OpenXDS is available at this site.

To make it easier for you to evaluate OpenEMPI we have made both an Amazon EC2 Machine Image (AMI: ami-a74f3fcd) and a VirtualBox/VMware image (available from Amazon S3) with the latest version of the Entity edition of OpenEMPI available.

For information on commercial licenses and support for the new edition of OpenEMPI, please contact

Using the Amazon AMI to get started with OpenEMPI

To make it easier to try out OpenEMPI, we recently started making available virtual images that you can use to run the software without spending a lot of time on the installation process. We have options available to support various virtualization methods, including images for VMware, VirtualBox, and the Amazon EC2 cloud. This blog post describes the last of the three approaches: creating a virtual machine using the OpenEMPI Amazon Cloud AMI.

AMI is an acronym that stands for Amazon Machine Image. You can think of it as a template that defines the configuration of the operating system and applications that comprise a given environment. The template can then be used to automatically create virtual machine instances. There are many public AMIs available in the Amazon EC2 cloud, such as plain instances that use a specific version of the Windows or Linux operating system, or more task-specific instances that use a specific operating system along with a collection of applications such as a web server, programming platform, and database software. AMIs are identified by

We have made available an Amazon image with OpenEMPI pre-installed along with a reasonable blocking and matching algorithm configuration, and some sample data as well. When starting an EC2 instance you need to provide the AMI template that will be used to initialize the virtual machine. You can look up the AMI using either its name or its AMI ID. The AMI name and AMI ID for the image we have made available are openempi-entity-3.0.0-ubuntu-14.04 and ami-f28dd39a. For this blog I assume that you have an Amazon Web Services (AWS) account and that you have some familiarity specifically with the EC2 service. If that is not the case for you, Amazon provides very good documentation for their web services and you can learn more about it here. Once you select that you want to create a new EC2 instance, the first step involves choosing the AMI that you want to use. You can search for the AMI using the name openempi (no need to type in the long name or cryptic AMI but they are available to you here in this paragraph) and it should come up right away.



The next step involves choosing an instance type. The instance type specifies the hardware configuration of the instance that you want to create and there are many instance types to choose from. If you want to just play around with OpenEMPI to see what it offers then a fairly minimal instance type should be sufficient. You can learn more about instance types, their relative performance characteristics and their cost here.



After you select an instance type, you will be taken to Step 6 to select a security group. You need to create a new security group that provides access through SSH (so that you can connect to the instance remotely using an SSH client) and you need to provide TCP access to port 8080 so that you can access the OpenEMPI administrative console at http://<EC2-instance-hostname>:8080/openempi-admin.

Before launching the instance you will be asked to create a key-pair. A key-pair is a secure authentication mechanism that will allow you to log in to the instance via the SSH protocol without having to provide a password. If you don’t already have a key-pair that you can use, then you will need to create one. Once you launch the instance it should be ready to go within seconds. From the instance monitoring screen you can select the instance that you created, if you have more than this one running, and in the instance detail window you will be able to see the hostname assigned to the virtual machine. To connect to the instance use the ssh command on a Unix platform or something like PuTTY if you are on a Windows platform. For the hostname of the instance you can use either the value shown next to the Public IP entry or the value shown next to the Public DNS entry.


If you need root access to the instance you need to use the username ubuntu to connect to the instance and use the private key that you created when starting the instance. The user that owns the OpenEMPI software on the image is openempi with a password of openempi, so you can just log in using something like ssh openempi@<EC2-instance-hostname>, once again using either the IP address or the hostname assigned to the instance after the @ symbol.


When you connect to the host using ssh, the message of the day on the instance will display some useful information about how to connect to the OpenEMPI installation on your box. We have already loaded some data on the OpenEMPI instance on that box. If you prefer to load the instance with your own data, then the easiest thing to do is to drop the graph database instance. To do that you first need to make sure that the Tomcat server instance is stopped and then remove the directory person-db under /home/openempi. When you start the Tomcat server again, the database will be re-created automatically but there will not be any records in the database any more.
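If you would rather script this step, something along the following lines should do. This is a hedged sketch: the function name and the tomcat_stopped flag are ours, and the script does not check whether Tomcat is actually stopped, so you must confirm that yourself before calling it.

```python
import shutil
from pathlib import Path


def drop_person_db(openempi_user_home: Path, tomcat_stopped: bool) -> bool:
    """Remove the person-db directory; returns True if something was deleted.

    The caller must confirm that Tomcat is stopped; this sketch does not verify it.
    """
    if not tomcat_stopped:
        raise RuntimeError("Stop the Tomcat server before removing person-db")
    db_dir = openempi_user_home / "person-db"
    if not db_dir.is_dir():
        return False
    shutil.rmtree(db_dir)
    return True


# On the image described in this post the call would be:
# drop_person_db(Path("/home/openempi"), tomcat_stopped=True)
```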


If you run into any issues with the creation of the instance using this approach, let us know on the OpenEMPI user forum and we will try to assist you with the process.


New version of the EMPI edition of OpenEMPI has been released

We are pleased to release a new version of the EMPI edition of OpenEMPI. The 2.3.0 version includes many bug fixes, stability enhancements, and performance improvements. The Job Queueing mechanism has been integrated with most long running operations such as re-indexing the blocking configuration, loading a large number of records, and matching all records in the system. If you are currently using the 2.2.9 version of OpenEMPI or an earlier version of the EMPI edition, we encourage you to upgrade to the new release as we have fixed a number of important bugs. Upgrading to the latest version is just a matter of running any database update scripts to bring your database to the latest version and deploying the latest war file.

The latest release is available from the new download site at the OpenEMPI Downloads page. Installation instructions and other documentation are available at the documentation page. If you have any questions, please visit the forums. To contribute, follow the Contribution instructions at the documentation page and send in your bug reports, documentation contributions, or new enhancements.

Images with OpenEMPI now available to help you get started quickly

To help new users get started with OpenEMPI quickly, we now have a couple of Virtual Machine images that you can download. The images include an installation of OpenEMPI Entity Edition version 3.0.0 along with data already pre-loaded.

If you are using VirtualBox or VMware Workstation you can download the following image from Amazon S3. The image is in standard Open Virtualization Format (OVF/OVA) so you should be able to start the VM using either VirtualBox or VMware. The machine was built using Vagrant and the super-user is vagrant (use the insecure_private_key available through Vagrant or the password vagrant). OpenEMPI is installed as user openempi (password: openempi) and once you log in as the openempi user, you can start the Tomcat server using /opt/tomcat/bin/

We have also created an Amazon image with OpenEMPI pre-installed. The AMI name and AMI ID for this image are openempi-entity-3.0.0-ubuntu-14.04 and ami-f28dd39a respectively. The root Unix user for this VM is ubuntu but you should still use openempi (password: openempi) to look at the OpenEMPI installation and start the Tomcat server as in the other VM.

If you have any questions or issues, please post them at the forums. We will be posting a longer blog post soon to describe in more detail how to access the VMs.

Entity Edition of OpenEMPI has now been released

We are pleased to release the first version of the entity edition of OpenEMPI. This first production-ready release of the entity edition includes many bug fixes, stability enhancements, and performance improvements. The Job Queueing mechanism has been integrated with most long running operations such as re-indexing the blocking configuration, loading a large number of records, and matching all records in the system.

The latest release is available from the new download site at the OpenEMPI Downloads page. Installation instructions and other documentation are available at the documentation page. If you have any questions, please visit the forums. To contribute, follow the Contribution instructions at the documentation page and send in your bug reports, documentation contributions, or new enhancements.

A more flexible way to load data into OpenEMPI

In the last blog post we looked at the easier way to load data into OpenEMPI, which trades off flexibility in exchange for simplicity. If you don’t want to take the trouble to transform your data into OpenEMPI’s fixed data format, or if you are using the Entity edition of OpenEMPI where the record may not represent person demographic data but rather some other entity, such as a provider or a customer, then you need a more flexible approach to load data into your instance.

OpenEMPI provides such an approach called none other than the flexible data loader. We call it the flexible data loader because it lets the user specify how fields from the input file map into record fields. As long as your data is in the form of a delimited file with one record of data per line, you should be able to load the data into your instance without having to perform any additional transformations of the data.

The initial steps for loading a file using the flexible file loader are the same as those you use with the concurrent file loader. You first need to locate the data file on your local machine and then upload it onto the server. When you then press the Import button to load the data onto your instance, you need to select the flexible file loader from the list. The first two check-boxes have the same effect as in the concurrent file loader. The option to perform a bulk import is only available in the Entity edition of OpenEMPI and utilizes an optimization of the underlying repository to allow for the data to be imported very fast. The catch is that you should only use this option when the system is offline from incoming requests. If you perform a bulk import against an instance of OpenEMPI that is in production and servicing incoming requests from external systems, the response time in servicing these external requests will degrade considerably. The last check-box labeled “Only Preview Import”, as you can probably guess from the name, only simulates the import operation but does not actually load any data into the system.


We left the field labeled “Mapping File Name” in the form for last because it will take some time to go through. You use this field to specify the name of the mapping file that tells the loader how to map fields from the data file into fields in the data model of either the person entity in the EMPI edition of OpenEMPI or a specific entity in the Entity edition. The mapping file is an XML file and is expected to reside on the server and specifically in the configuration directory of the instance’s home directory ($OPENEMPI_HOME/conf).

The intricate details of how to create a mapping file and what options are available, are documented in the Administration Console section of the documentation of OpenEMPI. Make sure you look at the version of the documentation that matches the edition of OpenEMPI that you are using (EMPI or Entity). Let’s go through an example here though so that you can get an idea of what the process of creating a mapping file is like.

Here are a few rows from the test file that we will try to import into our system.

rec-36422-org, Jessica, Whillas, 1, Crampton Place, , Parc Falu, 00662, TX, 20020623, 9, 3230872715, 664363886, 5
rec-56480-org, Elly, Vincent, 7, Wilari Place, , Ext Santa Maria, 33314, NY, 19870224, 24, 3225149120, 525133755, 4
rec-25939-org, Samuel, Jeffries, 13, Barrett Street, Aleon, Westland, 47630, TX, 19741018, 37, 6652012291, 189718658, 4

The file we want to import is a comma-separated text file so we start creating the mapping file by defining the header portion of it. Aside from the necessary XML schema attributes, we have specified that the delimiter of the file is a comma and that there is no header line in the file that needs to be skipped.

	xsi:schemaLocation=" fileloadermap.xsd"

Next we need to define how each field from the data file maps into a field of the data model. The first field is clearly an identifier, but it looks like the identifier domain (in some circles it is referred to as the assigning authority) is unspecified and the association between the identifier and its identifier domain is implicit. While mapping the identifier field we can specify in the mapping file the identifier domain associated with this identifier.


The field mapping entry above indicates that the first field (column-index=1) to be imported is an identifier (is-identifier attribute set to true), it should be mapped into the identifier attribute of the data model (field-name is set to identifier) and since it is an identifier, it should be mapped to the identifier domain with name “NID” and namespace-identifier “NID”.

The next two fields in the file are the first name and last name fields so we map them using the following field mapping entries.


The next field is the street number, which in this post we will simply skip over. After that come the address1 and address2 fields, which are then followed by the city, postal code, and state fields. These five field mapping entries are self-explanatory and are shown below.


The next field is the date of birth, which needs to be mapped into the field dateOfBirth, which is of date data type. A date can be represented in many different ways and we can give information to the flexible file loader on how to parse the specific date format used in our data file using the date-format-string attribute. The rules for composing the date-format-string are defined in the SimpleDateFormat class of Java. In this case the date of birth is a sequence of eight digits consisting of four digits for the year of birth, followed by two digits for the month and two digits for the day, so the format string is “yyyyMMdd”.


The next field is the age, which we are not going to import since it is already covered through the date of birth. After that we have the phone number, which we will import as text, and then the social security number, which we will import as an identifier with an identifier domain name of SSN. Note that the identifier domain name you select should already be defined in the system before you attempt to upload the file. The last field is a blocking number that is not important so we chose to ignore it.
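To summarize the mapping decisions made in this walk-through, here is a small Python sketch that applies the same column-to-field mapping to one of the sample rows. It is only an illustration of what the mapping expresses and is not how the flexible file loader is implemented. The field names givenName, familyName, postalCode and phoneNumber are guesses for illustration, and "%Y%m%d" is the Python counterpart of the Java pattern yyyyMMdd.

```python
import csv
import io
from datetime import datetime

# Column positions are 1-based, mirroring the column-index convention used in this post.
# Field names other than identifier and dateOfBirth are hypothetical.
FIELD_MAP = {
    2: "givenName",
    3: "familyName",
    5: "address1",
    6: "address2",
    7: "city",
    8: "postalCode",
    9: "state",
    12: "phoneNumber",
}
IDENTIFIER_MAP = {1: "NID", 13: "SSN"}      # column -> identifier domain name
DATE_MAP = {10: ("dateOfBirth", "%Y%m%d")}  # Python equivalent of yyyyMMdd
# Columns 4 (street number), 11 (age) and 14 (blocking number) are skipped.


def map_row(row):
    """Turn one delimited row into a record dict according to the maps above."""
    record = {"identifiers": []}
    for position, value in enumerate(row, start=1):
        value = value.strip()
        if position in IDENTIFIER_MAP:
            record["identifiers"].append(
                {"identifier": value, "domain": IDENTIFIER_MAP[position]}
            )
        elif position in DATE_MAP:
            field, fmt = DATE_MAP[position]
            record[field] = datetime.strptime(value, fmt).date()
        elif position in FIELD_MAP:
            record[FIELD_MAP[position]] = value
    return record


sample = (
    "rec-36422-org, Jessica, Whillas, 1, Crampton Place, , Parc Falu, "
    "00662, TX, 20020623, 9, 3230872715, 664363886, 5\n"
)
rows = csv.reader(io.StringIO(sample), skipinitialspace=True)
record = map_row(next(rows))
print(record["givenName"], record["dateOfBirth"], record["identifiers"])
```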


And that pretty much covers the flexible file loader. You should now be able to press the import button and load your data into the system.

Loading data into your OpenEMPI instance

Once you get your OpenEMPI instance up and running the first thing you will want to do is load some data into it so that you can then take OpenEMPI for a test run. In this blog post I will describe the easiest way to get some data into OpenEMPI.

OpenEMPI’s user interface offers a couple of options for loading data into the system. The first option expects the patient demographic data to be in a fixed data format while the second one allows the user to specify a mapping document that defines how fields from the file map into fields in the OpenEMPI data model. This post will describe the first and easier approach and a future article will talk about the more flexible data loading approach.

Since the easier approach for importing data into the system expects the data to be in a fixed format, let’s first talk about the format of the input file. The expected input format of the file is essentially a typical comma-delimited text file with one record of data per line. The table below describes the sequence of fields that make up a record and provides some details on how to format individual fields when they are expected to be in a specific format. For example, the phone number field should be formatted as a sequence of digits without any character separators like a '-' character. The file test-data-5k-openempi is formatted using the expected format and can be used to test the upload process.

Fixed data format table

No. Field Name: Field Description
2. Given Name: First name of the person
3. Surname: Last name of the person
4. Street Number: Street number and any other information that should precede the address line
5. Address 1: First line of the address of the person. The Street Number field is prepended to this field before the record is imported into the database.
6. City: The city portion of the address of the person
7. Zip Code: The zip code portion of the address of the person
8. State: The state portion of the address of the person
9. Date of birth: The date of birth of the person formatted as yyyymmdd (e.g. 19870824)
10. Age of the person: The age of the person as a natural number
11. Phone number: The phone number of the person formatted as a sequence of digits (e.g. 8155364948)
12. Social security number: The social security number of the person formatted as a sequence of digits (e.g. 692544254)
13. Blocking number: This value is ignored during normal loading
14. Gender: The gender of the person, such as M/F/O (optional field)
15. Race: The race of the person (optional field)
16. Account: A medical account number associated with the person (optional field)
17. Identifiers: A sequence of one or more identifiers. Each identifier may have an associated domain specified using the combination of namespace identifier/universal identifier domain/universal identifier domain type code. In the case of multiple identifiers, each identifier is separated from the others using the '^' character. The identifier domain components are separated using the '&' character (e.g. MRN-2148&2.16.840.1.113883.4.357&2.16.840.1.113883.4.357&hl7).
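The identifiers field is the only one with internal structure, so here is a short Python sketch of how such a value could be taken apart. It assumes the component order is identifier, namespace identifier, universal identifier, and universal identifier type code, which is our reading of the example above. The function and key names are ours, not OpenEMPI's.

```python
def parse_identifiers(field: str):
    """Split the identifiers field into a list of dicts, one per identifier."""
    names = (
        "identifier",
        "namespace_identifier",
        "universal_identifier",
        "universal_identifier_type_code",
    )
    parsed = []
    for chunk in field.split("^"):       # '^' separates identifiers
        chunk = chunk.strip()
        if not chunk:
            continue
        parts = chunk.split("&")         # '&' separates the components
        # Missing trailing domain components are recorded as None.
        parts += [None] * (len(names) - len(parts))
        parsed.append(dict(zip(names, parts)))
    return parsed


example = "MRN-2148&2.16.840.1.113883.4.357&2.16.840.1.113883.4.357&hl7"
print(parse_identifiers(example))
```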

Once you have constructed your input data in the format expected by the loader you must first upload the file onto the server hosting OpenEMPI and then import the data into the system. To upload the file, log in to the admin console at http://server/openempi-admin and select the File->File List option from the menu. You then need to point your browser to the file, which should at this point be on the same host as the one you are running your browser on. Pressing the Browse button will allow you to locate the file on your local host and select it. In the Name edit box you can type in the logical name you want to assign to your file. This name can be anything at all, but try to pick something that will remind you of which physical file the entry refers to, and make sure that the name is unique. At this point you should be able to press the Upload button to upload the file onto the server.

File Upload

The next step is to load the data into the system. To do this select the file from the list that you want to load and click on the Import button. You should then be presented with a screen that gives you a number of choices as shown in the figure below.

Concurrent loader options

As I mentioned earlier, OpenEMPI provides a couple of file loaders out of the box and the one that is easier to use and expects the input in a fixed format is called the concurrentDataLoader. The other two check boxes are fairly self-explanatory. If you have a header line in your import file that you obviously don’t want to attempt to load onto the server, you should check the “Skip Header Line” option. The second check box relates to the two options that OpenEMPI offers regarding what should happen when a record is loaded onto the system. You may choose to only import the record, in which case the data is loaded into the database and there is no further processing of the record. If you leave the “Is Import” check box unchecked, then after loading each record into the system, the system will invoke the currently configured matching algorithm to determine if the new record should be linked to any existing records in the system. As you may have guessed, the first option is the fastest one since invoking the matching algorithm is a fairly computationally intensive task. If you are setting up a new instance of OpenEMPI and you are trying to perform the initial load of data into the system, then the recommended approach is to simply do the “Import” and then, once you have configured the matching algorithm, run it once against all the data on the system. If on the other hand your instance of OpenEMPI is already in operation and you are uploading data from a new facility, then you should leave the “Is Import” option unchecked.

Once the system has finished loading the data, the screen will be updated to report on how many records were uploaded successfully and how many failed to load properly. If some of the records are not loaded successfully, you can get more information about what went wrong by looking at the log file for OpenEMPI which is typically called openempi.log and should reside in the OpenEMPI home directory.