
Distributed SRM Powered By Terracotta

We are glad to announce that the SRM server now has experimental support for distributed deployment, for achieving higher availability and scalability. The distributed functionality of the SRM server is powered by Terracotta. We recommend that you read the Terracotta documentation in order to understand the possible deployment configurations. Start with "Clustering Web Applications with Terracotta", which presents diagrams and descriptions of a typical distributed application running on top of Terracotta. "Deployment Guide", "Operations Guide", "Configuring Terracotta For High Availability" and "Terracotta Server Arrays" all contain important information that may become useful at various stages of evaluation, deployment and operation of the distributed SRM server.

In order to run SRM with Terracotta, the dCache SRM Transfer Managers and SRM Space Manager need to run in domains separate from the SRM domain. Starting from version 1.9.6-2, dCache includes preconfigured Space Manager and Transfer Managers domains that are easy to start via a node_config modification. You will need to modify srm.batch so that the space manager and transfer managers are not started in the srm domain.

Please also note that in a distributed SRM deployment all SRM servers need to share the same SRM PostgreSQL database instance.
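As a minimal sketch of what this means in practice, every SRM node carries identical database settings pointing at the single shared PostgreSQL host. The parameter names below (srmDbHost, srmDbName, srmDbUser, srmDbPassword) and the host name are assumptions based on typical dCacheSetup/layout files of this era; verify them against your installation:

# identical on every SRM front end: all nodes use the one shared PostgreSQL instance
srmDbHost=srm-db.example.org
srmDbName=dcache
srmDbUser=srmdcache
srmDbPassword=secret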

Installation Checklist

In order to configure and run the SRM service of dCache in a distributed configuration you need to:

  • Read and understand the configuration of the Terracotta Platform
  • Choose the configuration topology that you plan to implement. See the diagram below for what was used in the tests.
  • Install the latest Java 1.6 or later on each node that will be running either the SRM or the Terracotta server
  • Download and install the latest version of the Terracotta software on each node that will be running either the SRM or the Terracotta server
  • For dCache 1.9.7 or later with the Jetty-based SRM
    • Install dCache Server rpms on each of the SRM servers.
    • On each SRM server node edit the appropriate layout file so that the variable dcache.terracotta.enabled is set to true and dcache.terracotta.install.dir points to the location of the Terracotta home directory. Since dcache.terracotta.enabled affects how every domain defined in the layout is started, set it to true only for the srm domain, and make sure that only the srm cell is started in that domain.

Below is an example of a layout configuration file that runs only the srm domain:

[srm-${host.name}Domain]
dcache.terracotta.enabled=true
dcache.terracotta.install.dir=/opt/terracotta-3.3.0
[srm-${host.name}Domain/srm]

  • Make sure that only one instance each of the utility, srmspacemanager and transfermanagers cells is running in dCache; these services do not need to be replicated. A sketch of a layout for the node hosting them is shown below.
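    A minimal sketch of a layout file for the node that hosts these singleton services; the domain and service names used here (spacemanagerDomain/spacemanager, transfermanagersDomain/transfermanagers) are assumptions modelled on the preconfigured domains shipped with dCache and should be checked against the defaults of your release:

    [spacemanagerDomain]
    [spacemanagerDomain/spacemanager]
    [transfermanagersDomain]
    [transfermanagersDomain/transfermanagers]
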
  • On each SRM node edit /opt/d-cache/etc/tc-config.xml so that the server host element points to the correct location of the Terracotta server. You might need more than one server element in case of multiple Terracotta servers.
    <server host="terracotta-server.example.org">
    
  • It is highly recommended to use the high-availability configuration recommended for Terracotta production deployments. Below are the parts of the /opt/d-cache/etc/tc-config.xml configuration that were customized for the SRM performance evaluation testing conducted at Fermilab for the CMS experiment (4 SRM front ends, 2 Terracotta backends and up to 900 clients were used in this experiment).
    • tc-properties
        <tc-properties>
           <property name="l2.l1reconnect.enabled" value="true" />
           <property name="l1.cachemanager.leastCount" value="4" />
           <property name="l1.max.connect.retries" value="-1" />
           <property name="l1.connect.versionMatchCheck.enabled" value="true" />
           <property name="l1.socket.connect.timeout" value="10000" />
           <property name="l1.socket.reconnect.waitInterval" value="1000" />
           <!-- The following settings create a HealthChecker with a higher tolerance for interruptions in network communications and long GC cycles -->
           <property name="l1.healthcheck.l2.ping.enabled" value="true" />
           <property name="l1.healthcheck.l2.ping.idletime" value="5000" />
           <property name="l1.healthcheck.l2.ping.interval" value="1000" />
           <property name="l1.healthcheck.l2.ping.probes" value="3" />
           <property name="l1.healthcheck.l2.socketConnect" value="true" />
           <property name="l1.healthcheck.l2.socketConnectTimeout" value="5" />
           <property name="l1.healthcheck.l2.socketConnectCount" value="10" />
         </tc-properties>
      
      
  • The servers block describing the two backend servers and containing the ha section:
      <servers>
          <server host="cmswn1896.fnal.gov">
             <data>/storage/local/data1/terracotta/data</data>
             <logs>/storage/local/data1/terracotta/logs</logs>
             <statistics>/storage/local/data1/terracotta/statistics</statistics>
             <dso-port>9510</dso-port>
             <dso>
                <client-reconnect-window>120</client-reconnect-window>
                <persistence>
                   <mode>permanent-store</mode>
                </persistence>
                <garbage-collection>
                   <enabled>true</enabled>
                   <verbose>false</verbose>
                 <!-- How often should the Terracotta server perform distributed
                    garbage collection, in seconds?
                    Default: 3600 (60 minutes)
                  -->
                   <interval>60</interval>
                </garbage-collection>
             </dso>
           </server>
           <server host="cmswn1895.fnal.gov">
             <data>/storage/local/data1/terracotta/data</data>
             <logs>/storage/local/data1/terracotta/logs</logs>
             <statistics>/storage/local/data1/terracotta/statistics</statistics>
             <dso-port>9510</dso-port>
             <dso>
                <client-reconnect-window>120</client-reconnect-window>
                <persistence>
                   <mode>permanent-store</mode>
                </persistence>
                <garbage-collection>
                   <enabled>true</enabled>
                   <verbose>false</verbose>
                 <!-- How often should the Terracotta server perform distributed
                    garbage collection, in seconds?
                    Default: 3600 (60 minutes)
                  -->
                   <interval>60</interval>
                </garbage-collection>
             </dso>
           </server>
          <ha>
                <mode>networked-active-passive</mode>
                <networked-active-passive>
                    <election-time>5</election-time>
                </networked-active-passive>
           </ha>
      </servers>
    
    
  • The rest of the Terracotta configuration remained the same as distributed in the dCache rpm
  • For dCache 1.9.6 with the Tomcat-based SRM
    • Install the Tomcat 5.5 TIM (Terracotta Integration Module) into each Terracotta instance. Note the version of the TIM that gets installed (see the version check below).
      # /opt/terracotta-3.1.1/bin/tim-get.sh install tim-tomcat-5.5
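      To confirm which TIM version is installed (you will need it when updating tc-config.xml below), the installed modules can be listed; this sketch assumes the list subcommand of the Terracotta 3.x tim-get tool:
      # /opt/terracotta-3.1.1/bin/tim-get.sh list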
      
    • Install dCache Server rpms on each of the SRM servers.
    • Run the dCache install script
    • On each SRM server node edit /opt/d-cache/etc/srm_setup.env so that the variable RUN_WITH_TERRACOTTA is set to true and TC_INSTALL_DIR points to the location of the Terracotta home directory:
      RUN_WITH_TERRACOTTA=true
      TC_INSTALL_DIR=/opt/terracotta-3.1.1
      
    • On each SRM server node, edit srm.batch so that it does not start the space manager or transfer managers:
      onerror shutdown
      
      exec -shell file:${ourHomeDir}/share/cells/logging.fragment
      exec -shell file:${ourHomeDir}/share/cells/setup.fragment
      exec -shell file:${ourHomeDir}/share/cells/tunnel.fragment
      exec -shell file:${ourHomeDir}/share/cells/threadmanager.cell
      #exec -shell file:${ourHomeDir}/share/cells/spacemanager.cell
      #exec -shell file:${ourHomeDir}/share/cells/transfermanagers.cell
      exec -shell file:${ourHomeDir}/share/cells/srm.cell
      
      
    • On each SRM node edit /opt/d-cache/etc/tc-config.xml so that the server host element points to the correct location of the Terracotta server. You might need more than one server element in case of multiple Terracotta servers.
      <server host="terracotta-server.example.org">
      
    • Update the version of the Tomcat 5.5 TIM in /opt/d-cache/etc/tc-config.xml to match the installed version:
      <modules>
             <module name="tim-tomcat-5.5" version="2.0.1"/>
      </modules>
      
  • Start the Terracotta server
    # /opt/terracotta-3.1.1/bin/start-tc-server.sh 
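    With the two-server active-passive setup described above, the server process on each Terracotta node is typically pointed at the shared configuration file and told which server entry it represents. This is a sketch assuming the -f (configuration file) and -n (server name) options of the Terracotta 3.x start-tc-server.sh script:
    # /opt/terracotta-3.1.1/bin/start-tc-server.sh -f /opt/d-cache/etc/tc-config.xml -n cmswn1896.fnal.gov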
    
  • Start the SRM servers
    # /opt/d-cache/bin/dcache start srm
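    To verify that the domain came up, the status command of the same script can be used (a sketch assuming the 1.9.7-style dcache script):
    # /opt/d-cache/bin/dcache status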
    
  • Enjoy the distributed SRM server
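As a final functional check, any of the SRM front ends can be queried with an SRM client. This is a sketch assuming the srmping tool from the dCache SRM client package, a valid grid proxy, the default SRM port 8443 and a hypothetical host name:

srmping -2 srm://srm-node.example.org:8443/srm/managerv2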

Below is a diagram of the simple distributed configuration that was used to test SRM.
