Missing DNS entries break New-Mailbox

It’s interesting how fairly obvious settings can break things in very non-obvious ways.

We recently had a case where the customer was not able to create new mailboxes on Exchange 2007. This had worked fine prior to applying some updates. After the updates, the New-Mailbox cmdlet began failing with an error indicating that an address generator DLL was missing.

That error was a little misleading. The application log showed a very different error:

ID: 2030
Level: Error
Source: MSExchangeSA
Message: Unable to find the e-mail address 'smtp:someone@contoso.com SMTP:someone@contoso.com ' in the directory. Error '80072020'.

That error code is ERROR_DS_OPERATIONS_ERROR - not very specific. After a lot of tracing, we eventually found that when we created the new mailbox, Exchange was generating the new email addresses and then searching the directory to confirm they didn’t already exist. The expected result is that the search returns 0 results, so we know the new addresses are unique. But in this case, wldap32 was returning code 1, LDAP_OPERATIONS_ERROR.
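
If you want to reproduce that uniqueness check by hand, a rough PowerShell equivalent looks like this. This is only a sketch - the address and domain are placeholders, and it is not the exact query Exchange issues:

# Search the directory for any object already holding the candidate proxy address.
# Binding to the domain name, as Exchange does, exercises the same DNS and auth path.
$searcher = [adsisearcher]"(proxyAddresses=smtp:someone@contoso.com)"
$searcher.SearchRoot = [adsi]"LDAP://contoso.com"
$searcher.FindAll()   # an empty result means the new address is unique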

We used psexec -i -s ldp.exe to launch ldp as localsystem on the Exchange server, and then connected to the DCs. Choosing to bind as Current logged-on user showed that we bound as the computer account, as expected, and the searches worked fine. Then, some additional tracing revealed that we were not connecting to the DCs by name - we were connecting to the domain name, contoso.com in this example.

When we used ldp to connect to the domain name, something interesting happened - we were no longer able to bind as the computer account. The bind would succeed, but would return NT AUTHORITY\Anonymous Logon. Attempting to search while in that state produced:

***Searching...
ldap_search_s(ld, "(null)", 2, "(proxyAddresses=smtp:user@contoso.com)", attrList, 0, &msg)
Error: Search: Operations Error. <1>
Server error: 000004DC: LdapErr: DSID-0C0906E8, comment: In order to perform this operation a successful bind must be completed on the connection., data 0, v1db1
Error 0x4DC The operation being requested was not performed because the user has not been authenticated.
Result <1>: 000004DC: LdapErr: DSID-0C0906E8, comment: In order to perform this operation a successful bind must be completed on the connection., data 0, v1db1
Getting 0 entries:

That was exactly what we were looking for! Operations error, code 1, which is LDAP_OPERATIONS_ERROR. At this point, we turned our attention to understanding why we could not authenticate to the domain name, when authenticating to the server name worked fine. After all, connecting to the domain name just connected us to one of the DCs that we had already tested directly - we could see that by observing the dnsHostName value. So why would the name we used to connect matter?

The Active Directory engineers eventually discovered that the _sites container, _tcp container, and other related DNS entries were all missing. Dynamic DNS had been disabled in this environment. Once it was re-enabled, everything worked.
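
A quick way to check for those locator records is to query the _ldap SRV entries directly. For example, assuming the contoso.com domain and the default site name from this article (on older systems, nslookup -type=SRV works just as well):

Resolve-DnsName -Type SRV _ldap._tcp.contoso.com
Resolve-DnsName -Type SRV _ldap._tcp.dc._msdcs.contoso.com
Resolve-DnsName -Type SRV _ldap._tcp.Default-First-Site-Name._sites.contoso.com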

The moral of this story is to be careful when you disable a setting that is, at face value, a simple and obvious thing. The effects can ripple out in very unexpected ways.


From Jekyll to Hexo

Two years ago, I dove into the wonderful world of static blog generators when I left my TechNet blog behind and started using Jekyll to generate an Azure web site. With my newfound freedom from complex content management systems, I raved about Jekyll in a blog post. But once the honeymoon was over, some cracks started to appear in the relationship.

Jekyll does not officially support Windows, so you have to jump through some hoops to get it up and running. This didn’t seem so bad at first, but I’m one of those people who is constantly tinkering with my PC, buying new hardware, and upgrading things, so I end up doing a clean install of my OS several times a year.

Back in the day, a clean install of Windows was a daunting prospect, but these days, it only takes minutes. The Windows install itself is pretty fast, and I have several Boxstarter scripts that use Chocolatey to install all the software I use. This means getting back up and running is fairly painless - except for Jekyll.

It seemed like after every clean install, that fresh, clean OS feeling was soon soured by errors from Jekyll. The hoops I had to jump through to get it up and running would change slightly each time due to changes in Ruby or problems with gems. For a while, I dealt with this issue by blogging from one of my Ubuntu VMs.

Finally, I started shopping around for something not based on Jekyll and preferably with no Ruby dependency at all. There are a lot of options, but for now, I’ve settled on Hexo.

Hexo is powered by Node.js, and since I’m a big fan of JavaScript and a big fan of npm, this seems like a natural fit. Maybe this will be enough motivation to continue the series I left off with, or at least to write a new technical post of some kind.


A History of Cached Restrictions in Exchange

In this series of posts, I’m going to discuss three basic approaches to searching the content of Exchange mailboxes, and the tradeoffs that come with them. This series is for developers who are writing applications that talk to Exchange, or scripters who are using EWS Managed API from Powershell. I’m not going to be talking about New-MailboxSearch or searching from within Outlook, because in that case, the client code that executes the search is already written. This series is for people writing their own Exchange clients.

There are three basic ways to search a mailbox in Exchange Server:

  1. Sort a table and seek to the items you’re interested in. This approach is called a sort-and-seek.
  2. Hand the server a set of criteria and tell it to only return items that match. This is the Restrict method in MAPI and FindItems in EWS.
  3. Create a search folder with a set of criteria, and retrieve the contents of that folder to see the matching items.

For most of Exchange Server’s history, approaches 2 and 3 were implemented basically the same way. Using either approach caused a table to be created in the database. These tables contained a small amount of information for each item that matched the search, and the tables would hang around in the database for some amount of time. These tables were called cached restrictions or cached views. I’m going to call them cached restrictions, because that was the popular terminology when I started supporting Exchange.
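
For context, here is a minimal sketch of approach 2 using the EWS Managed API from Powershell. The DLL path, EWS URL, and subject string are assumptions for illustration, not values from any particular environment:

Add-Type -Path "C:\Program Files\Microsoft\Exchange\Web Services\2.2\Microsoft.Exchange.WebServices.dll"

$service = New-Object Microsoft.Exchange.WebServices.Data.ExchangeService
$service.UseDefaultCredentials = $true
$service.Url = New-Object Uri("https://mail.contoso.com/EWS/Exchange.asmx")

# FindItems hands the server a restriction; only matching items come back
$filter = New-Object Microsoft.Exchange.WebServices.Data.SearchFilter+ContainsSubstring([Microsoft.Exchange.WebServices.Data.ItemSchema]::Subject, "invoice")
$view = New-Object Microsoft.Exchange.WebServices.Data.ItemView(100)
$results = $service.FindItems([Microsoft.Exchange.WebServices.Data.WellKnownFolderName]::Inbox, $filter, $view)
$results.Items | ForEach-Object { $_.Subject }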

Recorded history basically starts with Exchange 5.5, so let’s start there. Exchange 5.5 saved every single restriction for a certain amount of time. This meant that the first time you performed an IMAPITable::Restrict() on a certain folder, you would observe a delay while Exchange built the table. The second time you performed an IMAPITable::Restrict() on the same folder with the same restriction criteria, it was fast, because the restriction had been cached - that is, we now had a table for that restriction in the database, ready to be reused.

Exchange 5.5 continued keeping the cached restriction up to date as the content of the mailbox changed, just in case the client asked for that same search again. Every time a new item came into the Inbox, Exchange would update every cached restriction which was scoped to that folder. Unfortunately, this created a problem. If you had a lot of users sharing a mailbox, or you had an application that performed searches for lots of different criteria, you ended up with lots of different cached restrictions - possibly hundreds. Updating hundreds of cached restrictions every time a new email arrived got expensive and caused significant performance issues. As Exchange matured, changes were introduced to deal with this issue.

In Exchange 2003, a limit was put in place so Exchange would only cache 11 restrictions for a given folder (adjustable with msExchMaxCachedViews or PR_MAX_CACHED_VIEWS). This prevented hundreds of cached restrictions from accumulating for a folder, and neatly avoided that perf hit. However, this meant that if you had a user or application creating a bunch of one-off restrictions, the cache would keep cycling and no search would ever get satisfied from a cached restriction unless you adjusted these values. If you set the limit too high, then you reintroduced the performance problems that the limit had fixed.

In Exchange 2010, cached restrictions were changed to use dynamic updates instead of updating every time the mailbox changed. This made it less expensive to cache lots of restrictions, since they didn’t all have to be kept up to date all the time. However, you could still run into situations where an application performed a bunch of one-off searches which were only used once but were then cached. When it came time to clean up those cached restrictions, the cleanup task could impact performance. We saw a few cases where Exchange 2010 mailboxes would be locked out for hours while the Information Store tried to clean up restrictions that were created across hundreds of folders.

In Exchange 2013 and 2016, the Information Store is selective about which restrictions it caches. As a developer of a client, you can’t really predict whether your restriction is going to get cached, because this is a moving target. As Exchange 2013 and 2016 continue to evolve, they may cache something tomorrow that they don’t cache today. If you’re going to use the same search repeatedly in modern versions of Exchange, the only way to be sure the restriction is cached is to create a search folder. This is the behavior change described in KB 3077710.

In all versions of Exchange, it was always important to think about how you were searching and try to use restrictions responsibly. Exchange 2013 and 2016 are unique in that they basically insist that you create a search folder if you want your restriction to be cached.
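
If you do need the restriction to stick around on 2013 or 2016, a search folder is the way to ask for that explicitly. Here is a minimal sketch with the EWS Managed API from Powershell, reusing the $service object from the earlier sketch; the folder name and criteria are just examples:

# Create a search folder scoped to the Inbox; the store maintains it like a cached restriction
$filter = New-Object Microsoft.Exchange.WebServices.Data.SearchFilter+ContainsSubstring([Microsoft.Exchange.WebServices.Data.ItemSchema]::Subject, "invoice")
$inboxId = New-Object Microsoft.Exchange.WebServices.Data.FolderId([Microsoft.Exchange.WebServices.Data.WellKnownFolderName]::Inbox)
$searchFolder = New-Object Microsoft.Exchange.WebServices.Data.SearchFolder($service)
$searchFolder.DisplayName = "Invoice search"
$searchFolder.SearchParameters.RootFolderIds.Add($inboxId)
$searchFolder.SearchParameters.Traversal = [Microsoft.Exchange.WebServices.Data.SearchFolderTraversal]::Shallow
$searchFolder.SearchParameters.SearchFilter = $filter
$searchFolder.Save([Microsoft.Exchange.WebServices.Data.WellKnownFolderName]::SearchFolders)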

The next post in this series will explore some sample code that illustrates differences between Exchange 2013 and Exchange 2010 restriction behavior.


Use MAPIFolders for the TNEF issue

I’ve written a couple of previous posts on the corrupt TNEF issue that causes this error:

Microsoft.Exchange.Data.Storage.ConversionFailedException: The message content has become corrupted. ---> Microsoft.Exchange.Data.Storage.ConversionFailedException: Content conversion: Failed due to corrupt TNEF (violation status: 0x00008000)

For the history, see this post and this post.

Previously, the solution was the Delete-TNEFProps.ps1 script. Unfortunately, that script has some limitations. Most notably, it cannot fix attachments. This is a big problem for environments that have a lot of items with these properties on them.

I attempted to find a way to make the script remove the problem properties from attachments, but I could not figure out how to do it. Either this is impossible with EWS, or I’m missing an obscure trick. I finally gave up and went a different route.

For some time, I’ve been (slowly) working on a new tool called MAPIFolders. It is intended as a successor to PFDAVAdmin and ExFolders, though it is still fairly limited compared to those tools. It is also a command-line tool, unlike the older tools. However, it does have some advantages, such as the fact that it uses MAPI. This means it is not tied to deprecated APIs and frameworks like PFDAVAdmin was, and it doesn’t rely on directly loading the Exchange DLLs like ExFolders does. It can be run from any client machine against virtually any version of Exchange, just like any other MAPI client.

Also, because it’s MAPI, I can make it do almost anything, such as blowing away the properties on nested attachments and saving those changes.

Thanks to a customer who opened a case on the TNEF problem, I was able to test MAPIFolders in a large public folder environment with a lot of corrupted TNEF items. After a bit of debugging and fixing things, MAPIFolders is now a far better solution to the TNEF issue than the Delete-TNEFProps script. It can remove the properties from attachments and even nested attachments.

The logging is very noisy and needs work, but that will have to wait. Writing C++ is just too painful. If you are running into the corrupt TNEF issue, you can grab MAPIFolders from GitHub: https://github.com/bill-long/mapifolders/releases. The syntax for fixing the TNEF problem is described here: https://github.com/bill-long/mapifolders/wiki/Check-Fix-Items-Operations.


Database bloat in Exchange 2010

I keep deciding not to write this post, because Exchange 2010 is out of mainstream support. And yet, we are still getting these cases from time to time, so I suppose I will finally write it, and hopefully it helps someone.

In Exchange 2010, we had several bugs that led to the database leaking space. These bugs had to do with the cleanup of deleted items. When a client deletes items, those items go into a cleanup queue. A background process is supposed to come through and process the items out of the cleanup queue to free up that space. Unfortunately, that didn’t consistently work, and sometimes the space would be leaked.

There was an initial attempt to fix this, which did resolve many of the open cases I had at the time, but not all of them. Later, another fix went in, and this resolved the issue for all my remaining cases at the time. Both of those fixes were included in Exchange 2010 SP3 RU1.

After that, we occasionally still see a case where space is leaking even with these fixes in place. But every time we try to trace it so we can fix the problem, the act of turning on tracing fixes the behavior. I’ve been back and forth through that code, and there’s no apparent reason that the tracing should affect the way cleanup actually behaves. Nonetheless, in these rare cases where the fixes didn’t totally fix the problem, tracing fixes it every time. I wish I knew why.

The tracing workaround has its limitations, though. The cleanup queue is not persisted in the database, so tracing only works for an active leak where the database has not yet been dismounted. After the database is dismounted, any leaked space is effectively permanent at that point, and your best bet is to move the mailboxes off. When the entire mailbox is moved, that leaked space will be freed, since it was still associated with the mailbox.

So, how can you tell if you’re being affected by this problem? One option is to just turn on tracing:

  1. Launch ExTRA.
  2. Choose Trace Control. You’ll get a standard warning. Click OK.
  3. Choose a location for the ETL file and choose the option for circular logging. You can make the file as large or as small as you want. It doesn’t really matter, since our goal here isn’t to look at the trace output.
  4. Click the Set manual trace tags button.
  5. At the top, check all eight Trace Type boxes.
  6. Under Components to Trace, highlight the Store component (but don’t check it).
  7. In the Trace Tags for store on the right, check the box next to tagCleanupMsg. We only need this one tag.
  8. Click Start Tracing at the bottom.

Let the trace run for a day or two and observe the effect on database whitespace. If you see significant amounts of space being freed while tracing is on, then you’re hitting this problem. Again, this only works if the database has not been dismounted since the space leaked.
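
One convenient way to watch the whitespace while the trace runs is Get-MailboxDatabase with the -Status switch (SomeDatabase below is a placeholder for your own database name):

Get-MailboxDatabase SomeDatabase -Status | Format-List Name, AvailableNewMailboxSpace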

Another option is to analyze the database space to see if you’re hitting this problem. Here’s how you do that.

  1. Dismount the database and run
    eseutil /ms /v "C:\databases\somedatabase.edb" > C:\spacereport.txt
  2. For the same database, launch Exchange Management Shell and run
    Get-MailboxStatistics -Database SomeDatabase | Export-Csv C:\mailboxstatistics.csv
  3. Use my Analyze-SpaceDump.ps1 script to parse the spacereport.txt:
    .\Analyze-SpaceDump.ps1 C:\spacereport.txt
  4. Look for the “Largest body tables” at the bottom of the report. These are the largest mailboxes in terms of the actual space they use in the database. These numbers are in megabytes, so if it reports that a body table owns 7000, that mailbox owns roughly 7 GB of space in the database.
  5. Grab the ID from the body table. For example, if the table is Body-1-ABCD, then the ID is 1-ABCD. This will correspond to the MailboxTableIdentifier in the mailboxstatistics.csv.
  6. Find that mailbox in the statistics output and add up the TotalItemSize and TotalDeletedItemSize. By comparing that against how much space the body table is using in the database, you know how much space has leaked.

It’s often normal to have small differences, but when you see that a mailbox has leaked gigabytes, then you’re hitting this problem.
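
To make the comparison from steps 5 and 6 easier, you can pull a single mailbox out of the statistics export like this. This is only a sketch; “1-ABCD” is the example identifier from above, and it assumes the MailboxTableIdentifier column is present in your export:

# Look up the mailbox that owns body table Body-1-ABCD and show its reported sizes
$stats = Import-Csv C:\mailboxstatistics.csv
$mbx = $stats | Where-Object { $_.MailboxTableIdentifier -eq "1-ABCD" }
$mbx | Format-List DisplayName, TotalItemSize, TotalDeletedItemSize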

You can also compare the overall leaked size with some quick Powershell scripting. When I get these files from a customer, I run the following to add up the mailbox size from the mailbox statistics csv:


# Sum up the item and deleted item sizes (in bytes) reported by Get-MailboxStatistics
$stats = Import-Csv C:\mailboxstatistics.csv
$deletedItemBytes = ($stats | foreach { $bytesStart = $_.TotalDeletedItemSize.IndexOf("(") ; $bytes = $_.TotalDeletedItemSize.Substring($bytesStart + 1) ; $bytesEnd = $bytes.IndexOf(" ") ; $bytes = $bytes.Substring(0, $bytesEnd) ; $bytes } | Measure-Object -Sum).Sum
$totalItemBytes = ($stats | foreach { $bytesStart = $_.TotalItemSize.IndexOf("(") ; $bytes = $_.TotalItemSize.Substring($bytesStart + 1) ; $bytesEnd = $bytes.IndexOf(" ") ; $bytes = $bytes.Substring(0, $bytesEnd) ; $bytes } | Measure-Object -Sum).Sum
# Convert the combined total from bytes to megabytes
($deletedItemBytes + $totalItemBytes) / 1024 / 1024

This gives you the size in megabytes as reported by Get-MailboxStatistics. Then, you can go look at the Analyze-SpaceDump.ps1 output and compare this to the “Spaced owned by body tables”, which is also in megabytes. The difference between the two gives you an idea of how much total space has leaked across all mailboxes.

Ultimately, the resolution is usually to move the mailboxes. If the database has not been dismounted, you can turn on tagCleanupMsg tracing to recover the space.

The SP3 RU1 fixes made this problem extremely rare in Exchange 2010, and the store redesign in Exchange 2013 seems to have eliminated it completely. As of this writing, I haven’t seen a single case of this on Exchange 2013.


DsQuerySitesByCost and public folder referrals

In Exchange 2010 and older, when you mount a public folder database, the Information Store service asks Active Directory for the costs from this site to every other site that contains a public folder database. This is repeated about every hour in order to pick up changes. If a client tries to access a public folder which has no replica in the local site, Exchange uses the site cost information to decide where to send the client. This means that, as with so many other features, public folder referrals will not work properly if something is wrong with AD.

There are several steps involved in determining these costs.

  1. Determine the name of the site we are in, via DsGetSiteName.
  2. Determine the names of all other sites that contain PFs.
  3. Bind to the Inter-Site Topology Generator, via DsBindToISTG.
  4. Send the list of site names for which we want cost info, via DsQuerySitesByCost.
  5. From the sites in the response, we will only refer clients to those between cost 0 and 500.

This gives us a lot of opportunities to break. For example:

  • Can’t determine the local site name.
  • Can’t bind to the ISTG.
  • The costs returned are either infinite (-1) or greater than 500.

I recently had a case where we were fighting one of these issues, and I could not find a tool that would let me directly test DsQuerySitesByCost. So, I created one. The code lives on GitHub, and you can download the binary by going to the Release tab and clicking DsQuerySitesByCost.zip:

https://github.com/bill-long/DsQuerySitesByCost

Typically, you would want to run this from an Exchange server with a public folder database, so that you can see the site costs from that database’s point of view. The tool calls DsGetSiteName, DsBindToISTG, and DsQuerySitesByCost, so it should expose any issues with these calls and make it easy to test the results of configuration changes.

You can run the tool with no parameters to return costs for all sites, or you can pass each site name you want to cost as a separate command-line argument.
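
For example, assuming the binary is named DsQuerySitesByCost.exe and using placeholder site names:

# Cost from the local site to every site
.\DsQuerySitesByCost.exe

# Cost from the local site to specific sites only
.\DsQuerySitesByCost.exe "Site-A" "Site-B"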

Thanks to the Active Directory Utils project, which got me most of the way there.


Automating data collection with Powershell

One of the challenges with analyzing complex Exchange issues is data collection. Once the server goes into the failed state, any data collection at that point only shows us what the failed state looks like. It doesn’t show us how it went from working to failing, and sometimes, that’s what we need to see in order to solve the problem.

Certain types of data collection are fairly easy to just leave running all the time so that you can capture this transition from the working state to the failing state. For instance, you can typically start a perfmon and let it run for days until the failure occurs. Similarly, event logs can easily be set to a size that preserves multiple days’ worth of events.

Other types of data are not so easy to just leave running. Network traces produce so much data that the output needs to be carefully managed. You can create a circular capture, but then you have to be sure to stop the trace quickly at the right time before it wraps. The same applies to ExTRA traces, LDAP client traces, etc.

In several cases over the past year, I’ve solved this problem with a Powershell script. My most recent iteration of the script appears below, but I usually end up making small adjustments for each particular case.

In its current version, running the script will cause it to:

  • Start a chained nmcap. Note that it expects Network Monitor 3.4 to be present so it can use the high performance capture profile.
  • Start a circular ExTRA trace.
  • Start a circular LDAP client trace.
  • Wait for the specified events to occur.

While it waits, it watches the output folder and periodically deletes any cap files beyond the most recent 5. When the event in question occurs, it then:

  • Collects a procdump.
  • Stops the nmcap.
  • Stops the LDAP client trace.
  • Stops the ExTRA trace.
  • Saves the application and system logs.

All of these features can be toggled at the top of the script. You can also change the number of cap files that it keeps, the NIC you want to capture, the PID you want to procdump, etc.

The script will almost certainly need some slight adjustments before you use it for a particular purpose. I’m not intending this to be a ready-made solution for all your data collection needs. Rather, I want to illustrate how you can use Powershell to make this sort of data collection a lot easier, and to give you a good start on automating the collection of some common types of logging that we use for Exchange.
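
To give a flavor of it, the wait-and-prune portion boils down to something like this. This is a simplified sketch, not the full script; the folder path, event ID, and event source are placeholders:

$captureFolder = "C:\tracing"
$start = Get-Date
while ($true) {
    # Keep only the five most recent chained capture files
    Get-ChildItem $captureFolder -Filter *.cap | Sort-Object LastWriteTime -Descending | Select-Object -Skip 5 | Remove-Item

    # Stop waiting once the interesting event shows up in the Application log
    $hit = Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 1234; ProviderName = 'SomeSource' } -MaxEvents 1 -ErrorAction SilentlyContinue
    if ($hit -and $hit.TimeCreated -gt $start) { break }

    Start-Sleep -Seconds 30
}
# The event fired - stop the captures and collect the data here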

Enjoy!


Directory Name Must Be Less Than 248 Characters

Over the holiday weekend, I was deleting some old projects out of my coding projects folder when Powershell returned an error stating, “The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters.” I found that attempting to delete the folder from explorer or a DOS prompt also failed.

This error occurred while I was trying to remove a directory structure that was created by the yeoman/grunt/bower web development tools. Apparently npm or bower, or both, have no problem creating these deep directory structures on Windows, but when you later try to delete them, you can’t.

A little searching turned up several blog posts and a Stack Overflow question. The workaround of prefixing the path with “\\?\” didn’t seem to work for me.

I found some tools that claimed to be able to delete these files, but as usual, I was annoyed at the idea of having to install a tool or even just download an exe to delete some files.

Edit: Thanks to AlphaFS, this is much easier now. I’ve removed the old script. With AlphaFS, you can delete the folder with a single Powershell command. First, you need to install the AlphaFS module into Powershell, and the easiest way to do that is with PsGet.

So first, if you don’t have PsGet, run the command shown on their site:

(new-object Net.WebClient).DownloadString("http://psget.net/GetPsGet.ps1") | iex

Once it’s installed, import the PsGet module, and use it to install AlphaFS. Note the following command refers to what is currently the latest release of AlphaFS, but you might want to check for a later one:

Import-Module PsGet

Install-Module -ModuleUrl "https://github.com/alphaleonis/AlphaFS/releases/download/v2.0.1/AlphaFS.2.0.1.zip"

Now you can use AlphaFS to delete the directory. You only need to point it at the top folder, and it will automatically recurse:

[Alphaleonis.Win32.Filesystem.Directory]::Delete("C:\some\directory", $true, $true)


This is a lot simpler than the original script I posted using the Experimental IO Library. Thanks AlphaFS!


TNEF property problem update

Back in January, I wrote a blog post about PF replication failing due to corrupt TNEF. The problem is caused by the presence of a couple of properties that have been deprecated and shouldn’t be present on items anymore. At the time I wrote that post, we thought you could run the cleanup script to remove the properties and live happily ever after. So much for that idea.

We found that, in some environments, the problem kept coming back. Within hours of running the script, public folder replication would break again, and we would discover new items with the deprecated properties.

We recently discovered how that was happening. It turns out that there is a code path in Exchange 2013 where one of the properties is still being set. This means messages containing that property will sometimes get delivered to an Exchange 2013 mailbox. The user can then copy such an item into a public folder. If the public folders are still on Exchange 2010 or 2007, replication for that folder breaks with the corrupt TNEF error:

Microsoft.Exchange.Data.Storage.ConversionFailedException: The message content has become corrupted. ---> Microsoft.Exchange.Data.Storage.ConversionFailedException: Content conversion: Failed due to corrupt TNEF (violation status: 0x00008000)

Now that we know how this is happening, an upcoming release of Exchange 2013 will include a fix that stops it from setting this property. You’ll need to continue using the script from the previous post to clean up affected items for now, but there is light at the end of the tunnel.


MfcMapi error when opening public folders

There are a lot of little problems I run across that I never investigate, simply because there’s no impact and no one seems to care. I have my hands full investigating issues that are impacting people, so investing time to chase down something else is usually not a good use of my time.

One of those errors is something that MfcMapi returns when you open public folders. In many environments, including some of my own lab environments, if you open MfcMapi and double-click Public Folders, you get a dialog box stating that error 0x8004010f MAPI_E_NOT_FOUND was encountered when trying to get a property list.

If you click OK, a second error indicates that GetProps(NULL) failed with the same error.

After clicking OK on that error, and then double-clicking on the public folders again, you get the same two errors, but then it opens. At this point you can see the folders and everything appears normal.

I’ve been seeing this error for at least five years - maybe closer to ten. It’s hard to say at this point, but I’ve been seeing it for so long, I considered it normal. I never looked into it, because no one cared.

That is, until I got a case on it recently.

Some folks use MfcMapi as the benchmark to determine whether things are working: if MfcMapi doesn’t work, the reasoning goes, then the problem is with Exchange, and their own product can’t be expected to work.

This was the basis for a recent case of mine. A third-party product wasn’t working, so they tried to open the public folders with MfcMapi, and got this error. Therefore, they could not proceed with troubleshooting until we fixed this error.

Of course, as far as I knew, this error was totally normal, and I told them so, but they still wanted us to track it down. Fortunately, this provided a perfect opportunity to chase down one of those little errors that has bothered me for years, but that I never investigated.

By debugging MfcMapi (hey, it’s open source, anyone can debug it) and taking an ExTRA trace on the Exchange side, we discovered that MfcMapi was trying to call GetPropList on an OAB that did not exist. Looking in the NON_IPM_SUBTREE, we only saw the EX: OAB, which Exchange hasn’t used since Exchange 5.5.

In Exchange 2000 and later, we use the various OABs created through the Exchange management tools. The name will still have a legacy DN, but it won’t start with EX:, so it’s easy to distinguish the real OABs from an old unused legacy OAB folder.

In this case, we didn’t see the real OAB. We only saw the site-based OAB from the Exchange 5.5 days.

It turned out that the real OAB was set to only allow web-based distribution, not PF distribution. That explained why the OAB could not be seen in the NON_IPM_SUBTREE. Despite that fact, MfcMapi was still trying to call GetPropList on it. Since the folder didn’t exist, it failed with MAPI_E_NOT_FOUND.

Thus, one of the great mysteries of the universe (or at least my little Exchange Server universe) is finally solved!

In the customer environment, we fixed the error by enabling PF distribution for the OAB. I doubt this had anything to do with the issue the third-party tool was having, but who knows? At the very least, we were able to move the troubleshooting process forward by solving this, and maybe this blog post will save people from chasing their tails over this error in the future.
