Bulk CSV Downloading and Welcoming Back IE9 Users - Changelog: Week of January 20, 2014

Bulk CSV download

  • Bulk CSV downloads have been added to the document view page. This massively speeds up workflows which depend on extracting multiple tables per document.

Multi-file upload

  • Upload multiple documents at the same time.

Internet Explorer support

  • Added support for Internet Explorer 9.
  • Improved visual feedback for pending operations in IE 9.

Security improvements

  • All administrative access now goes through routing that is separate from user routes.

Bug Fixes

  • Corrected bounds checking on dataset heading depth.
  • Upload errors are now displayed correctly.

Follow Up To a Great Launch - Beta Changelog: Week of January 13, 2014

After successfully launching our beta product to a limited audience last week, we are happy to follow up with several improvements which will further enhance our user experience.

Updates and fixes in the week preceding Monday, January 13, 2014:

Bug Fixes

  • Selecting multiple datasets works in Firefox again. Internet Explorer 9 still has some issues which are as yet unresolved.

  • Information for newly uploaded documents is now correctly displayed and no longer requires a page reload.

UI Improvements

  • Dataset controls in the full document view are now fixed and no longer scroll out of view.

  • The pending document upload widget is more visible and no longer looks like an error.

Document Handling Improvements

  • Table extraction algorithms improved to better detect non-tabular elements such as graphs and charts.

re: In defense of Excel

I came across this post from a friend of Docmunch earlier today, In Defense of Excel, and wanted to second Dan's point. Excel, although not the object of much respect in the Silicon Valley bubble, is a shockingly useful data analysis tool for the millions (or billions?) of non-technical knowledge workers around the world.

At Docmunch we are focused on building tools to complement, not replace, Excel. One missing piece of the Excel/Analysis puzzle is the extraction, collection, and organization of data. We look forward to sharing our vision of the solution for this missing piece as we realize it.

Haskell version freezing

We received some great feedback on our previous post about locking down Haskell dependencies to create a reliable build with Haskell's cabal installer. The most important piece of feedback was that the cabal.config file has a constraints: field that can be used to lock down dependencies. cabal will check that file whenever installing.
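For illustration, a cabal.config that pins versions through the constraints: field might look like the file written below. This is only a sketch: the package versions and the scratch directory are assumptions made for the example, not the output of any particular tool.

```shell
# Write a sample cabal.config into a scratch directory.
# The package versions are hypothetical, chosen only for illustration.
demo_dir=$(mktemp -d)
cat > "$demo_dir/cabal.config" <<'EOF'
constraints: base ==4.6.0.1,
             deepseq ==1.3.0.1,
             HUnit ==1.2.5.2
EOF

# Show the pinned versions.
cat "$demo_dir/cabal.config"
```

In a real project the file would sit next to the .cabal file and be checked in to version control, so that every install resolves to the same versions.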

With that in mind, Ben Armston is taking up the Haskell dependency lockdown torch. He created a package cabal-constraints that lists out exact versions of dependencies. So you can lock down your dependencies with this:

cabal-constraints > cabal.config

It is on GitHub right now, and I found the initial revision to be more reliable than later changes.

git clone https://github.com/benarmston/cabal-constraints
cd cabal-constraints
# check out the initial revision mentioned above
git checkout aa2e306b1a096c3a9032df7e7b7961cc18397888
cabal install

cd my-project-dir
cabal-constraints > cabal.config

Of course, this is not a fully automated solution. Ben already opened a pull request to add dependency locking to cabal, and it will be known as freezing. So you will be able to freeze your dependencies with

cabal freeze

Docmunch is going to try to make sure this gets in the next major cabal release. There are some things I don't like about the initial implementation, but our main goal is to just get some freezing functionality into cabal as soon as possible. When we talk with other industry users, many express frustration that cabal does not meet their needs. I view freezing as the first obvious and relatively easy step in making cabal work well for more users.

Haskell Version Lockdown


This post contains all the details you need about freezing dependencies. We have since published an update that focuses on an existing solution you may find easier to use than the one listed here, and on how it is being integrated into cabal.


Cabal is a great tool for library authors. As library authors we could give you a few minor nitpicks, but we have few substantial complaints.

What we often forget is that the needs of library authors are different from those of application builders. It has been said before that Cabal is not a package manager. But cabal-install's limitations go beyond not managing user dependencies properly: it still does not provide basic, essential tools for application builders.

Application builders need to produce reliable, replayable builds. Haskellers will often attempt to do this with a .cabal file. But .cabal file versioning is meant for library authors to specify maximum version ranges that a library author hopes will work with their package. Pegging packages to specific versions in a .cabal file will eventually fail because there are dependencies of dependencies that are not pegged.

Solution Brief

At Docmunch we came up with a simple solution to this problem: write out a file containing the exact versions of all packages being used and check it in to version control. The file looks like this:

  executable 'foo':
    - deepseq-
    - base-
  test suite 'foo-test':
    - HUnit-

Our build server will then use this lock file to guarantee that the binary that gets shipped to our application servers uses the same versions of dependencies as we used in development and testing.

Related Work

or, just because Ruby does it does not mean that it is criminally unsafe.

Our solution is essentially the same as the Gemfile.lock used with Ruby gems, but not as heavyweight, since Haskell does not decide dependencies at application startup like Ruby does (no bundle exec is required).

Talking with some other industrial users also validated what we are doing: most adopt techniques for limiting what can be installed, and attempt to achieve the same end result.

Solution Details

Generating a .lock.yaml file

We build myProject.lock.yaml by setting cabal-version: >=1.8 and Build-Type: custom in myProject.cabal. Then, we add this Setup.hs file to the root of our project directory:

Every time cabal runs the configure step, it will write out a new lock file. You will want to run your configure step with all components enabled (cabal configure --enable-tests, etc).

Just adding the myProject.lock.yaml file to your project will make dependency differences between users visible in code diffs, which is good, but it ends up creating conflicts if you don't actually use it for installation.

The Setup.hs code is slow and uses partial functions that we are not even sure are safe. We hope the community will fork the code, do a better job of figuring out the cabal APIs, and help improve it.

Installing dependencies with your .lock.yaml file

The real gains from the .lock.yaml file approach, however, are from replayable builds. Here's a somewhat ugly but perfectly functional command to do exactly that:

cat myProject.lock.yaml | grep ' -' | cut -d ' ' -f 6 | sort | uniq | xargs cabal install -j --force-reinstalls;
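To see what that pipeline would hand to cabal before actually installing anything, a dry run can help. The sketch below is an assumption-laden variant, not the command we run: lock_packages is a hypothetical helper name, awk replaces the grep and cut steps so the result does not depend on the exact indentation, and the versions in the sample lock file are made up for illustration.

```shell
# List the unique package ids recorded in a lock file, one per line.
# Assumes every dependency line has the form "<indent>- <name>-<version>".
# LC_ALL=C keeps the sort order the same on every machine.
lock_packages() {
  awk '$1 == "-" { print $2 }' "$1" | LC_ALL=C sort -u
}

# Sample lock file with hypothetical versions, for illustration only.
demo_lock=$(mktemp)
cat > "$demo_lock" <<'EOF'
  executable 'foo':
    - deepseq-1.3.0.1
    - base-4.6.0.1
  test suite 'foo-test':
    - HUnit-1.2.5.2
    - base-4.6.0.1
EOF

# Dry run: print the install command instead of executing it.
lock_packages "$demo_lock" | xargs echo cabal install -j --force-reinstalls
```

A package that appears under several components, like base in the sample, is listed only once. Dropping the echo turns the dry run into the real install.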

The file is in YAML format to make it easier to install dependencies from individual cabal components. If you install the latest version of yaml with cabal install yaml, you also get a yaml2json executable that we created. yaml2json myProject.lock.yaml will produce JSON, and there are many command-line JSON tools available (perhaps a commenter can point out a Haskell tool).

Library authors vs. Application developers

Library authors have no need for lock files since they need to build across as many versions as possible. However, they may find it useful to publish lock files of successfully built versions.

Application developers no longer need to peg packages to specific versions in their cabal file. Instead they specify version ranges that they want to install from when they change or upgrade their dependencies.

Usage with cabal-meta

This versioning solution is similar to how cabal-meta functions. cabal-meta keeps a separate list of special packages to install and feeds that to cabal-install. cabal-meta was mainly designed to build multiple local packages at once, which to a certain extent you can also do with the add-source command from cabal-dev or the new cabal sandbox feature. cabal-meta also helps automatically build remote dependencies. One problem with cabal-meta is that it is a separate executable, and it does nothing to stop you from accidentally bypassing it and just using cabal or cabal-dev. We will look into cabal-meta integration with lock files in the future.

Usage with cabal 1.18 sandboxes

If you aren't using cabal sandboxes, please immediately stop what you are doing, read Mikhail's awesome introduction to Cabal Sandboxes, upgrade cabal, and type cabal sandbox init in your project. We have been using the new sandboxes for weeks now, and they are great. We don't have to remember to use cabal-dev instead of cabal; we just have to type cabal sandbox init once.

Another very important feature of Cabal 1.18 is that it adds the ability to build individual targets. So if you have a library foo and a test-suite foo-test, you can type cabal build foo to build the library and cabal build foo-test to build the test suite. This makes using a lock file easier because you can always use cabal configure --enable-tests to start with to write out the lock file and then you can choose your build target later.

Using Cabal 1.18 combined with our lock files means we now spend none of our time on installation issues that tools should easily solve for us. Right before we rolled out our lock file, one of our team members had an installation failure on the continuous integration server. This is the kind of build issue that is still commonplace with Haskell. After rolling out the lock file we had him re-merge his branch and install from the lock file, and the build passed, proving the value of what we had done.

It was just posted that Hackage2 is going to be officially released soon. In the span of a few months, package management in Haskell is going from ridiculously harder than in mainstream programming languages to being on par with them.