SysManager
A system for monitoring many networked machines from a single interface.
Authors: Jason Carlyle and Phil White
Technical Details of the SysManager System
The SysManager system is composed of three main parts, as described in the User's Manual. The justification for the three-part design is as follows:
The system obviously requires some sort of user interface that can be started and stopped at any time. This is the first component of the system, the SysManager User Interface.
When the user interface is started, all information about all known machines must appear instantly. The interface cannot be expected to gather this information from all the machines itself in a reasonable amount of time. This requires a second part to the system, a central 'database' of all known machines that must stay running at all times (the only way to guarantee that the information is ready at any time). This component is known as 'Collectord'.
Since this database runs on a single machine, it cannot by itself gather detailed information about every other machine. This necessitates the third component, known as 'Sysmanagerd', which runs continuously on each machine to be monitored.
Following are detailed technical descriptions of each of the three components of the SysManager System:
Sysmanagerd Technical Description
Purpose:
Sysmanagerd is responsible for the following functions:
- On startup, Sysmanagerd must "register" itself with the server specified on the command line.
- Once "registered", Sysmanagerd must respond to requests for information with a string of information about the machine.
- If Sysmanagerd is not contacted by Collectord within a specified amount of time, Sysmanagerd should try to "re-register" itself with the server.

Registration procedure
Upon startup, and any time Collectord has not contacted Sysmanagerd within a specified amount of time, Sysmanagerd goes into this procedure.
- Sysmanagerd opens a connection to the machine that was specified as Collectord on the command line at startup.
- Once the connection is open, Sysmanagerd sends the NEWCLIENT command.
- Sysmanagerd then sends the QUIT command to let Collectord know that it can close the connection.
- Sysmanagerd then closes the connection and considers itself registered.
- Since it believes itself to be registered, it moves to the Connection Wait stage.
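The registration steps above can be sketched in a few lines of Python. This is only an illustration, not the original implementation: the function names, the newline-terminated ASCII framing of commands, and the timeout value are assumptions that do not come from this manual.

```python
import socket


def send_registration(conn: socket.socket) -> None:
    """Send the two commands that make up a registration exchange."""
    # Assumption: each command is a newline-terminated ASCII word.
    conn.sendall(b"NEWCLIENT\n")
    conn.sendall(b"QUIT\n")


def register(collector_host: str, collector_port: int, timeout: float = 5.0) -> bool:
    """Open a connection to Collectord, register, and close the connection.

    Returns True if both commands were sent, False if the connection failed.
    """
    try:
        with socket.create_connection((collector_host, collector_port), timeout=timeout) as conn:
            send_registration(conn)
    except OSError:
        return False
    return True
```

Note that Sysmanagerd never receives an acknowledgement in this exchange, which is why it only "believes" itself to be registered and relies on the Connection Wait timeout to recover.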
Connection Wait procedure
The purpose of Connection Wait is, as its name implies, to simply wait on Collectord to contact Sysmanagerd and request information. If contact is not made within a certain time, Connection Wait times out and Sysmanagerd is sent back into the Registration procedure. However, if a connection is made, the timeout counter is reset and Sysmanagerd is sent into Command Mode.
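A minimal Python sketch of Connection Wait follows. The function name and the idea of passing the timeout in as a parameter are assumptions; the manual only says that the timeout is derived from the polling interval.

```python
import socket
from typing import Optional


def wait_for_contact(listener: socket.socket, timeout_seconds: float) -> Optional[socket.socket]:
    """Wait for Collectord to connect; return the connection, or None on timeout."""
    listener.settimeout(timeout_seconds)
    try:
        conn, _peer = listener.accept()
    except socket.timeout:
        # Timed out: the caller should go back to the Registration procedure.
        return None
    # Contact made: the caller resets its timeout counter and enters Command Mode.
    return conn
```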
Command Mode procedure
Once in Command Mode, one of four events can occur:
- POLL Command
- When a POLL command is received by Sysmanagerd, it sets its expected polling time interval to the number directly following the command word POLL. Sysmanagerd's polling time interval is used to determine timeouts that are used throughout Sysmanagerd. After resetting the expected polling interval, control is returned to Command Mode.
- RETRIEVEINFO Command
- When a RETRIEVEINFO command is received, Sysmanagerd returns a string of information that relays to Collectord vital machine statistics. After this, Sysmanagerd returns to the Command Mode state.
- QUIT Command
- Whenever this command is received, Sysmanagerd closes the open connection and returns to the Connection Wait state.
- Command Timeout
- Whenever command mode times out (a command is not sent within a certain time limit), it is assumed that Collectord exited for some reason. Thus, Sysmanagerd is then sent to the Registration procedure to re-register itself.
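The four Command Mode outcomes can be viewed as a small state machine. The Python sketch below is a hypothetical model of it: the State enum, the handle_command function, the use of None to represent a timeout, and the decision to ignore unrecognized input are all assumptions made for illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class State(Enum):
    REGISTRATION = "registration"
    CONNECTION_WAIT = "connection_wait"
    COMMAND_MODE = "command_mode"


def handle_command(
    line: Optional[str], poll_interval: int, machine_info: str
) -> Tuple[State, int, Optional[str]]:
    """Process one command line; return (next state, poll interval, reply or None).

    A line of None stands for a command timeout.
    """
    if line is None:
        # Command Timeout: assume Collectord exited, so re-register.
        return State.REGISTRATION, poll_interval, None
    parts = line.split()
    verb = parts[0] if parts else ""
    if verb == "POLL" and len(parts) > 1 and parts[1].isdigit():
        # The number directly following POLL becomes the expected interval.
        return State.COMMAND_MODE, int(parts[1]), None
    if verb == "RETRIEVEINFO":
        return State.COMMAND_MODE, poll_interval, machine_info
    if verb == "QUIT":
        return State.CONNECTION_WAIT, poll_interval, None
    # Assumption: unrecognized input is ignored and Command Mode continues.
    return State.COMMAND_MODE, poll_interval, None
```

For example, handle_command("POLL 30", 5, info) keeps the daemon in Command Mode with a new interval of 30, while handle_command(None, 5, info) sends it back to Registration.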
Collectord Technical Description
Purpose:
Collectord is responsible for the following functions:
- Collectord must check for registration requests from Sysmanagerds and handle them appropriately.
- Collectord must handle requests from the User Interface, such as setting the polling interval or retrieving all of Collectord's gathered information.
- Collectord must gather information from all registered Sysmanagerds once every specified polling interval.

Main Data Structures Used:
Machine information for each registered machine is stored in a dynamically allocated list. Each entry contains the machine's IP address and the string of information that was returned by the RETRIEVEINFO command.
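As an illustration, the list could be modeled in Python as follows. The names MachineRecord and register_machine are hypothetical, and treating a repeated registration from the same address as a no-op is an assumption that the manual does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MachineRecord:
    ip_address: str
    # None until the first RETRIEVEINFO reply has been stored.
    info: Optional[str] = None


def register_machine(machines: List[MachineRecord], ip_address: str) -> MachineRecord:
    """Return the record for ip_address, appending a new one if needed."""
    for record in machines:
        if record.ip_address == ip_address:
            return record  # assumption: re-registration keeps the old record
    record = MachineRecord(ip_address)
    machines.append(record)
    return record
```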
Startup Process
On startup, the list of machines is initialized and Collectord is immediately sent into its Mainloop.
Mainloop procedure
Collectord consists of three major pieces: checking for incoming requests, gathering machine information, and checking the message queue. The Mainloop consists of the first two of these pieces plus a wait period, so that machines are not polled continually. Checking the message queue is done whenever a SIGUSR1 is raised.
Checking for incoming requests
Collectord checks for incoming requests until there is at least one registered machine to gather information from. Once at least one machine is registered, control continues down the loop to gather information from it. If a request is received by Collectord while in this state, Collectord enters a command mode to handle the request. Once in command mode, the following four commands are valid:
- POLL Command
- When a POLL command is received by Collectord, it sets its polling interval to the number directly following the command word POLL. This polling interval determines how long the wait state in the main loop lasts. It is also propagated to all Sysmanagerds that this Collectord communicates with. After setting the polling interval, control is returned to Command Mode.
- NEWCLIENT Command
- When a NEWCLIENT command is received, Collectord registers the machine at the connecting IP address. Since the only known information about this machine is its IP address, the User Interface will not know of it until complete information has been gathered for it.
- SPEWFORTHALL Command
- When a SPEWFORTHALL command is received, Collectord steps down its list of known machines and returns the information for each one. This command was designed for the User Interface to gather all the knowledge of Collectord in one command.
- QUIT Command
- Whenever this command is received, Collectord closes the open connection and returns to checking for a request if no machines are registered yet.
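The four commands can be sketched as a single dispatch function. This Python version is illustrative only: the function name, the dictionary mapping IP addresses to information strings, and the return shape are assumptions. Skipping machines without information in SPEWFORTHALL follows the statement above that the User Interface does not learn of a machine until complete information has been gathered.

```python
from typing import Dict, List, Optional, Tuple


def handle_request(
    line: str,
    peer_ip: str,
    machines: Dict[str, Optional[str]],
    poll_interval: int,
) -> Tuple[List[str], int, bool]:
    """Handle one command; return (reply lines, poll interval, keep connection open)."""
    parts = line.split()
    verb = parts[0] if parts else ""
    if verb == "POLL" and len(parts) > 1 and parts[1].isdigit():
        return [], int(parts[1]), True
    if verb == "NEWCLIENT":
        # Only the connecting IP address is known at this point.
        machines.setdefault(peer_ip, None)
        return [], poll_interval, True
    if verb == "SPEWFORTHALL":
        # Machines with no gathered information yet are not reported.
        replies = [info for info in machines.values() if info is not None]
        return replies, poll_interval, True
    if verb == "QUIT":
        return [], poll_interval, False
    # Assumption: unrecognized commands are ignored.
    return [], poll_interval, True
```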
Gathering Machine Information
To allow for scalability, the information-gathering phase of Collectord runs in parallel: for each machine, a child process is forked, which then follows this procedure:
- The child opens a TCP communication channel to the specified Sysmanagerd.
- Once the channel is open, the child sends the RETRIEVEINFO command to gather the machine's information.
- This information is packaged up in a message and sent to the parent.
- If there are communication problems, a special message is sent to the parent process instead of the informational message.
- The child then raises a SIGUSR1 in the parent to let it know that a message is ready.
- The child process then sets the polling interval on the Sysmanagerd to which it is connected.
- As a final step, the child closes the communication channel and exits.
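The child's procedure can be sketched in Python as shown below. This is a simplified model under several assumptions: commands and replies are newline-terminated ASCII, the connect callable stands in for opening the TCP channel (the Sysmanagerd port is not given in this manual), and send_to_parent stands in for placing a message on the message queue and raising SIGUSR1 in the parent. The function names and the message tuple layout are hypothetical.

```python
import socket
from typing import Callable, Tuple

Message = Tuple[str, str, str]  # (kind, ip address, info string)


def read_line(conn: socket.socket) -> str:
    """Read one newline-terminated line from the connection."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = conn.recv(1)
        if not chunk:
            raise ConnectionError("peer closed before sending a full line")
        buf += chunk
    return buf.decode("ascii").rstrip("\n")


def gather(
    ip: str,
    poll_interval: int,
    connect: Callable[[str], socket.socket],
    send_to_parent: Callable[[Message], None],
) -> None:
    """Run the per-machine gathering steps that the forked child performs."""
    reported = False
    try:
        with connect(ip) as conn:
            conn.sendall(b"RETRIEVEINFO\n")
            info = read_line(conn)
            send_to_parent(("info", ip, info))  # queue message + SIGUSR1 in the real system
            reported = True
            conn.sendall(f"POLL {poll_interval}\n".encode("ascii"))
            conn.sendall(b"QUIT\n")
    except OSError:
        if not reported:
            # Communication problem: send the special message instead.
            send_to_parent(("problem", ip, ""))
```

In the real system each call to gather would run in its own forked child, so a slow or unreachable machine delays only that child and not the rest of the polling cycle.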
Checking the Message Queue
Since Collectord forks a child to gather information for each machine, there needs to be some method to communicate the gathered information back to the parent process. This is accomplished through the use of message queues. Once the child process gathers information and puts it into a message, it raises a SIGUSR1 to let the parent know to check the message queue. There are two types of messages that the child could be sending:
- Communication Problems
- In this case, the child was not able to communicate properly with the Sysmanagerd that it was told to contact. The parent raises the warn count on the machine in the registered machines list. Once this warn count is above a certain limit, the machine is removed from the registered list and will no longer be contacted for information. That is, if the Sysmanagerd on the machine in question exited, it must re-register itself before Collectord will start to ask for information about it again.
- Updated Machine information
- A message of this type means that there was successful communication with the Sysmanagerd specified and the information was returned via the message queue. The machine is then located in the list and its information is updated.
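The parent's handling of the two message types might look like the following Python sketch. The value of WARN_LIMIT, the message tuple layout, and the dictionary-based machine list are assumptions. The manual does not say whether a successful update resets the warn count, so this sketch leaves the count unchanged.

```python
from typing import Dict, Tuple

WARN_LIMIT = 3  # hypothetical value; the manual only says "a certain limit"

Message = Tuple[str, str, str]  # (kind, ip address, info string)


def handle_message(machines: Dict[str, dict], message: Message) -> None:
    """Apply one child message to the registered machine list."""
    kind, ip, info = message
    record = machines.get(ip)
    if record is None:
        return  # machine is no longer registered; nothing to update
    if kind == "problem":
        record["warn_count"] += 1
        if record["warn_count"] > WARN_LIMIT:
            # Dropped until its Sysmanagerd re-registers itself.
            del machines[ip]
    elif kind == "info":
        record["info"] = info
```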
SysManager Tcl/Tk Interface Technical Description
Main Data Structures Used:
In order to store the various pieces of information about several machines, the interface uses an array of lists known as machines. Each element in the array represents all of the data relevant to one machine. Two of these arrays are needed to implement warning messages: One to store the current state of all machines, and one to store the previous state of all machines.
Startup Process
When the Interface is launched, the following defaults are set within the interface:
- The default server host is set to localhost and the port is set to 5055
- The polling interval is set to 5 seconds
- Display cycling is turned on
- The display cycling interval is set to 3.0 seconds
- The rapid disk and swap space thresholds are set to 1 percent
- The low disk and swap space warnings are set to 10 percent
- The high load average warning is set to 1.0
- The critical load average warning is set to 2.0
- The high process quantity warning is set to 100
The interface is then constructed, and the message SysManager Interface Started is printed in the message display. The command line is then parsed for the server name, and the procedure mainloop is called, followed by the procedure cycle.
Mainloop procedure
The mainloop procedure is responsible for keeping the information stored by the interface up-to-date. Here is the logic used by mainloop to control the interface:
- Check to see if a machine is currently selected in the machine display. If so:
- Clear the information display.
- Update each field of the information display with the respective information for the selected machine.
- Contact the server:
- Open a socket to the server. (Print an error if the connection fails.)
- Send the SPEWFORTHALL command.
- Receive the information string from the server using non-blocking I/O.
- Send the current polling interval to the server using the POLL command.
- Close the connection to the server.
- Parse the information received from the server into a machines array.
- Compare the machines array just created with the previous machines array.
- If a machine is in the previous array, but not in the current array, remove it from the display, and print a message indicating lost contact.
- If a machine is in the current array, but not in the previous array, add a new icon in the machine display, and print a message indicating a new machine.
- Check the differences in all statistics against those specified in the preferences, and print any appropriate warnings regarding individual machines.
- Update the polling interval to whichever value is greater:
- The number of machines
- The polling interval specified in the preferences
- Schedule mainloop to be executed again one polling interval in the future.
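Two pieces of this logic are easy to show in isolation: the comparison of the previous and current machines arrays, and the choice of polling interval. The interface itself is written in Tcl/Tk; the Python sketch below only illustrates the logic, and the function names and the use of dictionaries keyed by machine name are assumptions.

```python
from typing import Dict, List, Tuple


def diff_machines(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (lost machines, new machines) between two polls."""
    lost = [name for name in previous if name not in current]
    new = [name for name in current if name not in previous]
    return lost, new


def effective_poll_interval(machine_count: int, preferred_interval: int) -> int:
    """Use whichever is greater: the number of machines or the preference."""
    return max(machine_count, preferred_interval)
```

With the default preference of 5 seconds, for example, the interval stays at 5 until more than five machines are known, after which it grows with the number of machines.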
Cycle procedure
This procedure is responsible for selecting different machines in the machine display automatically. The cycle procedure uses the following logic:
- If cycling is turned on and there are machines in the machine display, set the current machine to the next machine in the array (modulo the length of the array).
- Update the information display with the new machine's information.
- Schedule the cycle procedure to be executed again one cycle interval (as specified by the preferences) in the future.
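The selection step amounts to a modular increment. A minimal Python sketch, with a hypothetical function name and an index-based view of the machines array:

```python
def next_machine(current_index: int, machine_count: int, cycling_enabled: bool) -> int:
    """Return the index of the machine to select on the next cycle."""
    if not cycling_enabled or machine_count == 0:
        return current_index  # nothing to cycle through
    # Wrap around to the first machine after the last one.
    return (current_index + 1) % machine_count
```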
Miscellaneous Events/Procedures
'Show All Warnings' button
When the Show All Warnings button is activated, the current machines array is examined for conditions that require a warning (as specified in the preferences). Any relevant warnings are then printed, with the exception of rapidly changing disk and swap space (these warnings depend on rates, not instantaneous conditions).
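The instantaneous checks can be sketched in Python as follows, using the startup defaults listed earlier. The field names in the stats dictionary, the warning texts, and the exact comparisons (free space strictly below the threshold, load at or above it, process count strictly above it) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Preferences:
    low_space_pct: float = 10.0    # low disk and swap space warning
    high_load: float = 1.0         # high load average warning
    critical_load: float = 2.0     # critical load average warning
    high_process_count: int = 100  # high process quantity warning


def instantaneous_warnings(stats: Dict[str, float], prefs: Preferences) -> List[str]:
    """Return warnings that depend only on the machine's current state."""
    warnings = []
    if stats["disk_free_pct"] < prefs.low_space_pct:
        warnings.append("low disk space")
    if stats["swap_free_pct"] < prefs.low_space_pct:
        warnings.append("low swap space")
    if stats["load_avg"] >= prefs.critical_load:
        warnings.append("critical load average")
    elif stats["load_avg"] >= prefs.high_load:
        warnings.append("high load average")
    if stats["process_count"] > prefs.high_process_count:
        warnings.append("high process count")
    return warnings
```

The rapid disk and swap warnings are deliberately absent here, since they compare two successive polls rather than a single snapshot.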
'Clear Messages' button
When the Clear Messages button is activated, the message display is cleared, and a message indicating this is then printed.
Button-1 Click on a machine icon
This event is bound to a procedure which updates the currently selected machine in the machine display, and displays the appropriate statistics in the information display.
Button-1 Motion on a machine icon
This event causes the currently selected icon to be moved the same amount the mouse is moving.
Button-1 Double Click on a machine icon
This event causes SysManager to look up the IP address of the machine referred to by the selected icon, and spawn an xterm with the command "telnet x" (with x replaced by the IP address).
Preferences item in the settings menu
This event causes a new window to be created with several scale widgets that manipulate various settings. These settings are described in depth in the Preferences Panel subsection of the SysManager User Interface section of this manual. The OK button at the bottom of the window causes the window and the widgets inside it to be destroyed.
Quit item in the file menu
This event causes all widgets to be destroyed, and exits wish.