Wednesday, March 14, 2007

"TCP Socket (KGAS)" Wait Event

We recently had this issue in our organization. Suddenly our Oracle Database (Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 on AIX 5.3) was experiencing waits on "TCP Socket (KGAS)" event. Upon investigation, we found out that it was because the mail server was down and the procedure tries to open a new connection with the SMTP server and it takes 75 seconds to timeout on IBM-AIX. (On Windows 2003, the default timeout is 20 seconds.)

There is no way of controlling timeout from within Oracle. When you pass the timeout parameter while opening connection with the SMTP Server, it doesn’t mean that it will timeout after the specified time rather it means that the subsequent Read/Write operations will timeout.

mail_conn := Utl_Smtp.open_connection(mailhost,MAILPORT, 10);

From above, it will take 10 seconds timeout time for the Read/Write operations after establishing the connection.

The only way to reduce the timeout from 75 seconds is to set TCP Keepalive parameters at OS level. It’s nice that we can still control it but at an expense of affecting all OS-level TCP connection.

Following parameters can be used to control the timeout in IBM-AIX:

1) net.ipv4.tcp_keepalive_time (seconds)

2) net.ipv4.tcp_keepalive_intvl (seconds)

3) net.ipv4.tcp_keepalive_probes

Be careful when setting the above parameters, as this will have a Global affect. Every new connection will be affected with the new timeout time. Now we are able to:

1) Control timeout while opening connection with the SMTP server using OS-level parameters, and

2) Pass a timeout in seconds for the subsequent Read/Write operations.

Are we done with our job? No, not yet!

What if the mail server goes down soon after establishing a connection and before starting the Read/Write operation, I mean during the handshake and Email verification procedures?

What happens is that, the sessions keep waiting forever for the SMTP server’s reply. Unfortunately, you have to locate and kill them.

We haven’t tried this ourselves but I have learned that using third party Java tool we can control the entire timeout issue.

Following is the procedure we use to send emails:

Procedure Test_Mail(
sender IN VARCHAR2, -- Mail Sender's Name
recipient IN VARCHAR2, -- Primary Mail ID of the Recipient
P_MESSAGE IN LONG, -- Any mail message
mailhost IN VARCHAR2, -- Mail host (IP Address)
MAILPORT IN NUMBER -- Port Number of the Mail Host
)
IS
mail_conn Utl_Smtp.connection;
v_mail_reply Utl_Smtp.reply;
BEGIN

mail_conn := Utl_Smtp.open_connection(mailhost,MAILPORT, 10); -- Timeout after 20 seconds

Utl_Smtp.helo(mail_conn, mailhost);
Utl_Smtp.mail(mail_conn, sender);
v_mail_reply := utl_smtp.vrfy(mail_conn, recipient);

MAIL_REPLY_CODE := v_mail_reply.code ;
IF v_mail_reply.code <> 550 THEN
Utl_Smtp.rcpt(mail_conn, recipient);
utl_smtp.open_data(mail_conn);
utl_smtp.write_data( mail_conn, 'MIME-version: 1.0' utl_tcp.CRLF);
utl_smtp.write_data( mail_conn, 'Content-TYPE: text/plain; charset=Arabic(Windows)' utl_tcp.CRLF);
utl_smtp.write_data( mail_conn, 'Content-Transfer-Encoding: 8bit' utl_tcp.CRLF);
utl_smtp.write_raw_data(mail_conn, utl_raw.cast_to_raw(mesgUTL_TCP.CRLF));
utl_smtp.close_data(mail_conn);
Utl_Smtp.quit(mail_conn);
END IF ;
EXCEPTION
WHEN OTHERS THEN
Proc_Execlog(' Error in Test_mail :' SQLERRM);
Utl_Smtp.quit(mail_conn);
END;

Regards

No comments: