# hf_text-generation-inference/server/text_generation_server/interceptor.py

import torch
import grpc

from google.rpc import status_pb2, code_pb2
from grpc_status import rpc_status
from grpc_interceptor.server import AsyncServerInterceptor
from loguru import logger
from typing import Callable, Any


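class ExceptionInterceptor(AsyncServerInterceptor):
    """Log exceptions raised by RPC handlers and abort the call with an INTERNAL status.

    `shutdown_callback` is invoked when the exception is a RuntimeError.
    """
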
    def __init__(self, shutdown_callback):
        self.shutdown_callback = shutdown_callback

    async def intercept(
        self,
        method: Callable,
        request_or_iterator: Any,
        context: grpc.ServicerContext,
        method_name: str,
    ) -> Any:
        try:
            response = method(request_or_iterator, context)
            return await response
        except Exception as err:
            # Keep only the last "/"-separated segment of the method name for the log line
            method_name = method_name.split("/")[-1]
            logger.exception(f"Method {method_name} encountered an error.")

            # A RuntimeError is treated as unrecoverable, so trigger server shutdown
            if isinstance(err, RuntimeError):
                self.shutdown_callback()

            # Release unused memory cached by PyTorch's CUDA allocator
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

            # Abort the RPC and report the error message to the client as INTERNAL
            await context.abort_with_status(
                rpc_status.to_status(
                    status_pb2.Status(code=code_pb2.INTERNAL, message=str(err))
                )
            )