December 24, 2019

Building (Old Version) Caffe in Conda

Notes on the problems encountered, and their solutions, when compiling an old version of Caffe to verify the 3Dpose_ssl project.

All of these problems came up while compiling Caffe to verify the 3Dpose_ssl project (3Dpose_ssl#661b5d1).

chanyn/3Dpose_ssl
3D Human Pose Machines with Self-supervised Learning - chanyn/3Dpose_ssl

Include and Library Paths

In Makefile.config, the include and library paths need to be adjusted to match the environment. In a Conda virtual environment, package headers are installed to $CONDA_PREFIX/include and package libraries to $CONDA_PREFIX/lib, so these two paths need to be added.

....
# Paths for Conda packages
INCLUDE_DIRS += ${CONDA_PREFIX}/include
LIBRARY_DIRS += ${CONDA_PREFIX}/lib
....
Lines added to Makefile.config
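To make this step repeatable, the two lines can be appended by a small helper that does nothing if they are already present. This is only a sketch: the function name `add_conda_paths` is made up for illustration, and appending at the end of the file assumes that `INCLUDE_DIRS` and `LIBRARY_DIRS` are already defined earlier in Makefile.config, so that `+=` extends them.

```shell
# Append the Conda include/library paths to a Makefile.config, at most once.
# The function name and the default file argument are illustrative.
add_conda_paths() {
    config=${1:-Makefile.config}
    # Single quotes keep ${CONDA_PREFIX} literal, so make expands it at build time.
    for line in \
        'INCLUDE_DIRS += ${CONDA_PREFIX}/include' \
        'LIBRARY_DIRS += ${CONDA_PREFIX}/lib'
    do
        # -x matches the whole line, -F treats the pattern as a fixed string.
        grep -qxF -- "$line" "$config" || printf '%s\n' "$line" >> "$config"
    done
}
```

Running `add_conda_paths Makefile.config` a second time leaves the file unchanged.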

Additionally, the project author has specified ANACONDA_HOME in Makefile.config. Since Conda is used here and the corresponding paths are already configured, this line needs to be commented out so that it does not interfere with the build.
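One way to comment the line out without opening an editor is a `sed` filter. The sketch below is an assumption about how the assignment is written (`ANACONDA_HOME :=` or `ANACONDA_HOME =` at the start of a line); the function name is hypothetical. Other lines that reference `$(ANACONDA_HOME)` are left alone and may need the same treatment by hand.

```shell
# Print a Makefile.config with any active ANACONDA_HOME assignment commented out.
# Lines that are already comments do not match and are passed through unchanged.
comment_out_anaconda_home() {
    sed 's/^[[:space:]]*ANACONDA_HOME[[:space:]]*:*=/# &/' "$1"
}
```

Usage: `comment_out_anaconda_home Makefile.config > Makefile.config.new && mv Makefile.config.new Makefile.config`.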

Protobuf Version Mismatch

Protobuf is used to serialize structured data in Caffe. It is highly version-sensitive: if the wrong version is installed, compilation fails with errors like these.

caffe/include/caffe/proto/caffe.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
#error This file was generated by an older version of protoc which is
caffe/include/caffe/proto/caffe.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
#error incompatible with your Protocol Buffer headers. Please
caffe/include/caffe/proto/caffe.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
#error regenerate this file with a newer version of protoc.

Searching the generated caffe/include/caffe/proto/caffe.pb.h for the error message gives more detail:

...
#if 3006000 < GOOGLE_PROTOBUF_MIN_PROTOC_VERSION
#error This file was generated by an older version of protoc which is
#error incompatible with your Protocol Buffer headers.  Please
#error regenerate this file with a newer version of protoc.
#endif
...
caffe/include/caffe/proto/caffe.pb.h

The version number 3006000 is a hint that protoc 3.6.0 generated the header, while the protobuf headers installed in the environment expect a newer protoc. Installing the matching protobuf 3.6.0 in Conda resolves the mismatch.

(conda)$ conda install protobuf=3.6.0
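The step from 3006000 to 3.6.0 follows protobuf's integer version encoding, major × 1,000,000 + minor × 1,000 + patch. A minimal sketch of the decoding, with a made-up function name:

```shell
# Decode a protobuf integer version (as found in generated *.pb.h files)
# into major.minor.patch. The function name is illustrative.
decode_protobuf_version() {
    v=$1
    major=$((v / 1000000))
    minor=$((v / 1000 % 1000))
    patch=$((v % 1000))
    echo "${major}.${minor}.${patch}"
}

decode_protobuf_version 3006000   # prints 3.6.0
```

The result can be compared against `protoc --version` to confirm that the compiler and the headers agree.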

Missing cblas.h

BLAS is an essential dependency of caffe.

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran.

BLAS has several implementations, of which OpenBLAS is a popular open-source one. Conda provides a prebuilt OpenBLAS package.

(conda)$ conda install -c anaconda openblas

Incorrect Number of cuDNN Parameters

Newer versions of cuDNN changed the signature of cudnnSetConvolution2dDescriptor: it now takes an additional trailing cudnnDataType_t argument, which the old Caffe code does not pass. This causes the following compilation error.

CXX src/caffe/data_transformer.cpp
In file included from ./include/caffe/util/device_alternate.hpp:40:0,
                 from ./include/caffe/common.hpp:19,
                 from ./include/caffe/blob.hpp:8,
                 from ./include/caffe/data_transformer.hpp:6,
                 from src/caffe/data_transformer.cpp:8:
./include/caffe/util/cudnn.hpp: In function ‘const char* cudnnGetErrorString(cudnnStatus_t)’:
./include/caffe/util/cudnn.hpp:21:10: warning: enumeration value ‘CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING’ not handled in switch [-Wswitch]
   switch (status) {
          ^
./include/caffe/util/cudnn.hpp: In function ‘void caffe::cudnn::setConvolutionDesc(cudnnConvolutionStruct**, cudnnTensorDescriptor_t, cudnnFilterDescriptor_t, int, int, int, int)’:
./include/caffe/util/cudnn.hpp:113:70: error: too few arguments to function ‘cudnnStatus_t cudnnSetConvolution2dDescriptor(cudnnConvolutionDescriptor_t, int, int, int, int, int, int, cudnnConvolutionMode_t, cudnnDataType_t)’
       pad_h, pad_w, stride_h, stride_w, 1, 1, CUDNN_CROSS_CORRELATION));
                                                                      ^
./include/caffe/util/cudnn.hpp:15:28: note: in definition of macro ‘CUDNN_CHECK’
     cudnnStatus_t status = condition; \
                            ^
In file included from ./include/caffe/util/cudnn.hpp:5:0,
                 from ./include/caffe/util/device_alternate.hpp:40,
                 from ./include/caffe/common.hpp:19,
                 from ./include/caffe/blob.hpp:8,
                 from ./include/caffe/data_transformer.hpp:6,
                 from src/caffe/data_transformer.cpp:8:
/usr/local/cuda-8.0/include/cudnn.h:500:27: note: declared here
 cudnnStatus_t CUDNNWINAPI cudnnSetConvolution2dDescriptor( cudnnConvolutionDescriptor_t convDesc,
                           ^
Makefile:585: recipe for target '.build_release/src/caffe/data_transformer.o' failed
make: *** [.build_release/src/caffe/data_transformer.o] Error 1

The problem is discussed, along with a solution, in the Caffe issue tracker.

Caffe installation error with CUDNN V6.0 · Issue #5793 · BVLC/caffe
Faced the same problem. It's happening due to cudnn.hpp (Location: include/caffe/util/cudnn.hpp) . Update cudnn.hpp file. It is not considering the current cuDNN versions.

Since the 3Dpose_ssl project uses an old version of Caffe, the solution is quite simple: replacing cudnn.hpp with the one from the latest Caffe addresses the problem.

(conda)$ wget https://raw.githubusercontent.com/BVLC/caffe/master/include/caffe/util/cudnn.hpp -O include/caffe/util/cudnn.hpp
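GitHub's `/blob/` URLs serve an HTML page rather than the raw file, so it is worth checking that the downloaded header really is C++ before rebuilding. The helper below is a rough heuristic with a made-up name; it only looks for an HTML marker near the top of the file.

```shell
# Succeeds if the file starts like an HTML page instead of a source file.
# Heuristic only: checks the first 20 lines for a doctype or <html> tag.
looks_like_html() {
    head -n 20 "$1" | grep -qiE '<!DOCTYPE html|<html'
}
```

Usage: `looks_like_html include/caffe/util/cudnn.hpp && echo "downloaded a web page, not the header"`.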

Cannot Link libjpeg or libpng

(conda)$ make all
CXX/LD -o .build_release/tools/upgrade_solver_proto_text.bin
/usr/bin/ld: warning: libjpeg.so.8, needed by /public/wl4/anaconda3/envs/pose27/lib/libopencv_highgui.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libpng16.so.16, needed by /public/wl4/anaconda3/envs/pose27/lib/libopencv_highgui.so, not found (try using -rpath or -rpath-link)

According to facebookarchive/caffe2 Issue#1693: Cannot link OpenCV because of libjpeg (Anaconda), there are two possible solutions here:

1. Export the LD_LIBRARY_PATH environment variable to equal your Anaconda lib directory
2. Install the OpenCV Anaconda package and make sure that Caffe2 uses it (preferred)
How to tell at run time whether libjpeg-turbo version of libjpeg is used? · Issue #3492 · python-pillow/Pillow
tl;dr: Is there some way to accomplish: PIL.Image.libjpeg_turbo_is_enabled()? The full story: Is there a way to tell from a pre-built Pillow whether it was built against libjpeg-turbo or not? This ...

So the solution is to make sure the corresponding packages are installed in the Conda environment and to point LD_LIBRARY_PATH at its lib directory.

(conda)$ conda install -c anaconda libpng jpeg
(conda)$ LD_LIBRARY_PATH=$CONDA_PREFIX/lib make all -j56
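Before rebuilding, `ldd` can confirm whether the libraries named in the linker warnings now resolve. The filter below is a small sketch (the function name is made up) that reads `ldd` output on standard input and prints only the libraries reported as `not found`.

```shell
# Read `ldd` output on stdin and print the names of unresolved libraries.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Usage: `LD_LIBRARY_PATH=$CONDA_PREFIX/lib ldd $CONDA_PREFIX/lib/libopencv_highgui.so | missing_libs`. Empty output means every dependency was found.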

Again... Protobuf Linking Failure

(conda)$ protoc --version
libprotoc 3.6.0
(conda)$ LD_LIBRARY_PATH=$CONDA_PREFIX/lib make all
CXX/LD -o .build_release/tools/upgrade_solver_proto_text.bin
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteStringMaybeAliased(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::io::CodedOutputStream::WriteStringWithSizeToArray(std::string const&, unsigned char*)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::Message::GetTypeName() const'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::MessageFactory::InternalRegisterGeneratedFile(char const*, void (*)(std::string const&))'
.build_release/lib/libcaffe.so: undefined reference to `leveldb::DB::Open(leveldb::Options const&, std::string const&, leveldb::DB**)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::Message::DebugString() const'
.build_release/lib/libcaffe.so: undefined reference to `google::base::CheckOpMessageBuilder::NewString()'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::internal::OnShutdownDestroyString(std::string const*)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteBytesMaybeAliased(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::MessageLite::ParseFromString(std::string const&)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::internal::NameOfEnum(google::protobuf::EnumDescriptor const*, int)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::internal::fixed_address_empty_string'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::internal::WireFormatLite::WriteString(int, std::string const&, google::protobuf::io::CodedOutputStream*)'
.build_release/lib/libcaffe.so: undefined reference to `leveldb::Status::ToString() const'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::internal::AssignDescriptors(std::string const&, google::protobuf::internal::MigrationSchema const*, google::protobuf::Message const* const*, unsigned int const*, google::protobuf::Metadata*, google::protobuf::EnumDescriptor const**, google::protobuf::ServiceDescriptor const**)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::internal::WireFormatLite::ReadBytes(google::protobuf::io::CodedInputStream*, std::string*)'
.build_release/lib/libcaffe.so: undefined reference to `google::protobuf::Message::InitializationErrorString() const'
collect2: error: ld returned 1 exit status

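This is caused by a compiler mismatch: the environment, CentOS 7, ships GCC 4.8.5, while the pre-built binary packages in Conda are compiled with GCC 7. GCC 5 changed the libstdc++ ABI for std::string, so libraries built with GCC 7 export differently named symbols for functions involving std::string, and code compiled with GCC 4.8.5 cannot link against them. That matches the undefined references above, most of which take a std::string.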
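One way to check which std::string ABI a shared library was built with is to look for `__cxx11` in its demangled dynamic symbols; libraries built with the newer ABI export names containing `std::__cxx11::basic_string`. The helper below is a sketch with a made-up name that counts such lines in `nm` output read from standard input.

```shell
# Count symbol lines that mention the GCC 5+ std::string ABI (__cxx11).
# `|| true` keeps the exit status zero when the count is 0.
count_cxx11_symbols() {
    grep -c '__cxx11' || true
}
```

Usage: `nm -DC $CONDA_PREFIX/lib/libprotobuf.so | count_cxx11_symbols`. A non-zero count suggests the library was built with the newer ABI and will not link against objects compiled by GCC 4.8.5.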

To solve the problem, rebuild protobuf manually with the system compiler.

(conda)$ git clone https://github.com/protocolbuffers/protobuf.git
(conda)$ cd protobuf
(conda)$ git checkout v3.6.0
(conda)$ git submodule update --init --recursive
(conda)$ ./autogen.sh && ./configure --prefix=$CONDA_PREFIX
(conda)$ make -j56 && make check -j56 && make install
Build and install protobuf

Notice that the same problem also occurs when linking against leveldb::DB::Open(leveldb::Options const&, std::string const&, leveldb::DB**). Rebuilding leveldb with the current GCC and adding its build output directory to LD_LIBRARY_PATH solves the problem.

(conda)$ git clone https://github.com/google/leveldb.git $LEVELDB_SRC
(conda)$ cd $LEVELDB_SRC
(conda)$ git checkout v1.10
(conda)$ make -j56
(conda)$ LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LEVELDB_SRC/out-shared make all
Build leveldb and re-compile caffe with libraries

CUDA Driver Version is Insufficient for CUDA Runtime Version

The compilation now succeeds, but running Caffe to train a model still fails with an error.

(conda)$ LD_LIBRARY_PATH=$CONDA_PREFIX/lib:/usr/local/cuda/lib64:$LEVELDB_SRC/out-shared ./build/tools/caffe train -gpu=all -solver=$MODEL/solver.prototxt -weights=$WEIGHTS/pose_iter_320000.caffemodel 2>&1 | tee -a train.log
F1228 17:38:43.572676 103290 caffe.cpp:93] Check failed: error == cudaSuccess (35 vs. 0)  CUDA driver version is insufficient for CUDA runtime version
*** Check failure stack trace: ***
    @     0x2b83da5b7a3d  google::LogMessage::Fail()
    @     0x2b83da5bce7a  google::LogMessage::SendToLog()
    @     0x2b83da5b9b20  google::LogMessage::Flush()
    @     0x2b83da5b9e0d  google::LogMessageFatal::~LogMessageFatal()
    @           0x4082fe  get_gpus()
    @           0x409215  train()
    @           0x406adc  main
    @     0x2b83fa54ec05  __libc_start_main
    @           0x407523  (unknown)

Check the CUDA driver version and CUDA runtime version first.

# Check CUDA Toolkit version
(conda)$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
# Check NVIDIA driver version
(conda)$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.87.00  Thu Aug  8 15:35:46 CDT 2019
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)

According to the CUDA Toolkit Documentation, running a CUDA application requires a driver that is compatible with the CUDA Toolkit. CUDA 10.1 requires driver version 418.39 or later, and the installed driver, 418.87, clearly satisfies that. So the driver itself is fine, and the problem must lie in which dynamic library is loaded at run time.
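The driver check is a plain version comparison, which can be scripted as follows. Both function names are made up for illustration; `driver_version` assumes the `/proc/driver/nvidia/version` layout shown above, and `version_ge` relies on `sort -V` from GNU coreutils.

```shell
# Read /proc/driver/nvidia/version on stdin and print the driver version.
# Assumes the "NVRM version: NVIDIA UNIX x86_64 Kernel Module <ver> ..." layout.
driver_version() {
    awk '/^NVRM version:/ { print $8 }'
}

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

Usage: `version_ge "$(driver_version < /proc/driver/nvidia/version)" 418.39 && echo "driver is new enough for CUDA 10.1"`.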

(conda)$ LD_LIBRARY_PATH=$CONDA_PREFIX/lib:/usr/local/cuda/lib64:$LEVELDB_SRC/out-shared ldd ./build/tools/caffe | grep cuda
        libcudart.so.10.2 => /public/wl4/anaconda3/envs/pose27/lib/libcudart.so.10.2 (0x00002b365d546000)
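The key detail in this output is the version suffix of `libcudart.so` and the directory it is loaded from. A small filter can pull out both; this is a sketch with a made-up function name that assumes the usual `name => path (address)` layout of `ldd` output.

```shell
# Read `ldd` output on stdin and print "<version> <path>" for each libcudart.
cudart_versions() {
    sed -n 's/^[[:space:]]*libcudart\.so\.\([0-9][0-9.]*\) => \(.*\) (0x[0-9a-f]*)$/\1 \2/p'
}
```

Usage: `ldd ./build/tools/caffe | cudart_versions`, with the same LD_LIBRARY_PATH that is used to run Caffe. If the printed version differs from the release reported by `nvcc --version`, the runtime being loaded is not the one installed under /usr/local/cuda.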

Checking the Conda environment for CUDA packages shows that the CUDA 10.2 libraries came from the cudnn and cudatoolkit packages.

(conda)$ conda list | grep cuda
cudatoolkit               10.2.89              hfd86e86_0    anaconda
cudnn                     7.6.5                cuda10.2_0    anaconda

Therefore, installing a cudnn build that depends on CUDA 10.1 will solve the problem.

(conda)$ conda search --info cudnn
....
cudnn 7.6.5 cuda10.1_0
----------------------
file name   : cudnn-7.6.5-cuda10.1_0.conda
name        : cudnn
version     : 7.6.5
build       : cuda10.1_0
build number: 0
size        : 179.9 MB
license     : Proprietary
subdir      : linux-64
url         : https://repo.anaconda.com/pkgs/main/linux-64/cudnn-7.6.5-cuda10.1_0.conda
md5         : 48850e851b910b694192f417e860fba3
timestamp   : 2019-12-19 21:21:03 UTC
dependencies:
  - cudatoolkit >=10.1,<10.2
....
(conda)$ conda install  cudnn=7.6.5=cuda10.1_0
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /public/wl4/anaconda3/envs/pose27

  added / updated specs:
    - cudnn==7.6.5=cuda10.1_0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    cudatoolkit-10.1.243       |       h6bb024c_0       347.4 MB  defaults
    cudnn-7.6.5                |       cuda10.1_0       179.9 MB  defaults
    ------------------------------------------------------------
                                           Total:       527.4 MB

The following packages will be SUPERSEDED by a higher-priority channel:

  cudatoolkit        anaconda::cudatoolkit-10.2.89-hfd86e8~ --> pkgs/main::cudatoolkit-10.1.243-h6bb024c_0
  cudnn                    anaconda::cudnn-7.6.5-cuda10.2_0 --> pkgs/main::cudnn-7.6.5-cuda10.1_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
cudnn-7.6.5          | 179.9 MB  | ################################################################################################################################################################################################## | 100%
cudatoolkit-10.1.243 | 347.4 MB  | ################################################################################################################################################################################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Check failed: Caffe::root_solver() || root_net_ root_net_ needs to be set for all non-root solvers

The solution is given in the pull request "Move root_net_ check in net constructor" (#4806):

Move root_net_ check in net constructor by junshi15 · Pull Request #4806 · BVLC/caffe
This PR extends recurrent_layer to the multi-gpu settings. Currently, the recurrent_layer constructs a net internally (the unrolled net), https://github.com/BVLC/caffe/blob/master/src/caffe/layers/...

Tip: this problem arises because the LSTM network does not support multi-GPU training; you need to change the cpp file and rebuild Caffe.