* P2 elements in 3D (called E2, using p-hierarchical basis)
* Lame in 3D (even with Neumann BC)
  test_lame3D
* OpenMP parallelisation of all assembly routines, matrix multiply 
  and Gauss-Seidel (smoother for multigrid)
  --> the most time consuming parts of the code scale well, 
      tested up to 32 threads 
  --> several parts still serial 
      (most important: mesh refinement, interpolation/restriction)
  --> overall speedup is good up to 4 threads, 
      then Amdahl's law becomes significant
  --> if enough processors are available, there is almost no time
      difference between P1 and P2 elements with the same total number
      of degrees of freedom (both in 2D and 3D)
  the parallelisation required extensive changes in the flexible
  sparse matrix storage format
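
  The scaling remarks above can be illustrated with a minimal sketch.
  It is not taken from the code base: the names amdahl_speedup and
  csr_matvec are hypothetical, and the compressed sparse row (CSR)
  layout is an assumption made only for this example, it is not the
  flexible storage format mentioned above.

```c
#include <assert.h>

/* Amdahl's law: ideal speedup on n threads when a fraction p
   (0 <= p <= 1) of the serial runtime has been parallelised. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* y = A*x for a matrix in CSR form (assumed layout, see lead-in):
   row i owns the entries rowptr[i] .. rowptr[i+1]-1 of col[], val[]. */
void csr_matvec(int nrows, const int *rowptr, const int *col,
                const double *val, const double *x, double *y)
{
    int i;
    /* Each row writes only y[i], so the rows can be distributed over
       threads without locking. The pragma is ignored when the file is
       compiled without OpenMP support. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}
```

  As an illustration with a made-up parallel fraction p = 0.9 (not a
  measured value), amdahl_speedup gives roughly 3.1 on 4 threads but
  only about 7.8 on 32 threads, which is the kind of saturation the
  remaining serial parts cause.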
* two eigenvalue solvers in lin_solver.c
* eigenvalue solver (experimental)
  test_lame3Deig
  test_assem    (can be switched on/off)
* multigrid coarse matrix solves now automatically use UMFPACK if
  available, the old LAPACK band-matrix solver remains as fallback
* changed boundary meshing (prior to the call of the "triangle" mesh
  generator) to be curvature adaptive for shape segments that are
  Bezier curves, such that the deviation of each line segment from
  the Bezier curve is small compared to the local mesh width
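
  One possible way to realise such a curvature adaptive boundary
  discretisation is sketched below. It is an assumption, not the
  routine used in the code: it treats a quadratic Bezier segment,
  splits it recursively with de Casteljau's algorithm and stops once
  the middle control point lies within rel_tol*h of the chord, where
  h is the local mesh width. All names (pt2, bezier2_count_segments,
  rel_tol, max_depth) are hypothetical.

```c
#include <assert.h>

typedef struct { double x, y; } pt2;

static pt2 midpoint(pt2 a, pt2 b)
{
    pt2 m = { 0.5 * (a.x + b.x), 0.5 * (a.y + b.y) };
    return m;
}

/* Nonzero if p is at most tol away from the line through a and b
   (from the point a if a and b coincide). Squared quantities are
   compared, so no square root is needed. */
static int within_tol(pt2 p, pt2 a, pt2 b, double tol)
{
    double dx = b.x - a.x, dy = b.y - a.y;
    double px = p.x - a.x, py = p.y - a.y;
    double len2 = dx * dx + dy * dy;
    double cross = dx * py - dy * px;
    if (len2 == 0.0)
        return px * px + py * py <= tol * tol;
    return cross * cross <= tol * tol * len2;
}

/* Number of straight segments needed for the quadratic Bezier curve
   with control points p0, p1, p2. The curve lies in the convex hull
   of its control points, so the distance of p1 from the chord is a
   conservative bound for the deviation of the chord from the curve. */
int bezier2_count_segments(pt2 p0, pt2 p1, pt2 p2,
                           double h, double rel_tol, int max_depth)
{
    assert(h > 0.0 && rel_tol > 0.0 && max_depth >= 0);
    if (max_depth == 0 || within_tol(p1, p0, p2, rel_tol * h))
        return 1;
    /* de Casteljau split at t = 1/2 into two quadratic curves */
    pt2 q0 = midpoint(p0, p1);
    pt2 q1 = midpoint(p1, p2);
    pt2 c  = midpoint(q0, q1);   /* point on the curve at t = 1/2 */
    return bezier2_count_segments(p0, q0, c, h, rel_tol, max_depth - 1)
         + bezier2_count_segments(c, q1, p2, h, rel_tol, max_depth - 1);
}
```

  A straight segment is kept as one line, while a strongly curved one
  is split further. Reducing h increases the number of segments, so
  the boundary resolution follows the local mesh width.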
