hf_text-generation-inference/server/Makefile-selective-scan

selective_scan_commit := 2a3704fd47ba817b415627b06fd796b971fdc137

causal-conv1d:
	rm -rf causal-conv1d
	git clone https://github.com/Dao-AILab/causal-conv1d.git

build-causal-conv1d: causal-conv1d
	cd causal-conv1d/ && git checkout v1.1.1 # known latest working version tag
	cd causal-conv1d/ && CAUSAL_CONV1D_FORCE_BUILD=TRUE python setup.py build

install-causal-conv1d: build-causal-conv1d
	pip uninstall causal-conv1d -y || true
	cd causal-conv1d/ && pip install .

# selective-scan dependends on causal-conv1d
selective-scan:
	rm -rf mamba
	git clone https://github.com/state-spaces/mamba.git mamba

build-selective-scan: selective-scan
	cd mamba/ && git fetch && git checkout $(selective_scan_commit)
	cd mamba && python setup.py build

install-selective-scan: install-causal-conv1d build-selective-scan
	pip uninstall selective-scan-cuda -y || true
	cd mamba && pip install .

build-all: build-causal-conv1d build-selective-scan
Impl simple mamba model (#1480) This draft PR is a work in progress implementation of the mamba model. This PR currently loads weights, and produces correct logits after a single pass. This PR still needs to correctly integrate this model so it produces tokens as expected, and apply optimization to avoid all copies during runtime/unnecessary operations. #### Helpful resources [Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752) https://github.com/johnma2006/mamba-minimal https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs https://github.com/huggingface/transformers/pull/28094 Notes: this dev work is currently targeting `state-spaces/mamba-130m`, so if you want to test please use that model. Additionally when starting the router the prefill needs to be limited: `cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768` ## Update / Current State Integration tests have been added and basic functionality such as model loading is supported. ```bash cd integration-tests pytest -vv models/test_fused_kernel_mamba.py ``` - [x] add tests - [x] load model - [x] make simple request - [ ] resolve warmup issue - [ ] resolve output issues fetching models tested during dev ```bash text-generation-server download-weights state-spaces/mamba-130m text-generation-server download-weights state-spaces/mamba-1.4b text-generation-server download-weights state-spaces/mamba-2.8b ``` The server can be run ```bash cd server MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b ``` router ```bash cargo run ``` make a request ```bash curl -s localhost:3000/generate \ -X POST \ -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \ -H 'Content-Type: application/json' \| jq ``` response ```json { "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data." } ``` --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> 2024-02-08 02:19:45 -07:00			`selective_scan_commit := 2a3704fd47ba817b415627b06fd796b971fdc137`

			`causal-conv1d:`
			`rm -rf causal-conv1d`
			`git clone https://github.com/Dao-AILab/causal-conv1d.git`

			`build-causal-conv1d: causal-conv1d`
			`cd causal-conv1d/ && git checkout v1.1.1 # known latest working version tag`
			`cd causal-conv1d/ && CAUSAL_CONV1D_FORCE_BUILD=TRUE python setup.py build`

			`install-causal-conv1d: build-causal-conv1d`
			`pip uninstall causal-conv1d -y \|\| true`
			`cd causal-conv1d/ && pip install .`

			`# selective-scan dependends on causal-conv1d`
chore: add pre-commit (#1569) 2024-02-16 03:58:58 -07:00			`selective-scan:`
Impl simple mamba model (#1480) This draft PR is a work in progress implementation of the mamba model. This PR currently loads weights, and produces correct logits after a single pass. This PR still needs to correctly integrate this model so it produces tokens as expected, and apply optimization to avoid all copies during runtime/unnecessary operations. #### Helpful resources [Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752) https://github.com/johnma2006/mamba-minimal https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs https://github.com/huggingface/transformers/pull/28094 Notes: this dev work is currently targeting `state-spaces/mamba-130m`, so if you want to test please use that model. Additionally when starting the router the prefill needs to be limited: `cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768` ## Update / Current State Integration tests have been added and basic functionality such as model loading is supported. ```bash cd integration-tests pytest -vv models/test_fused_kernel_mamba.py ``` - [x] add tests - [x] load model - [x] make simple request - [ ] resolve warmup issue - [ ] resolve output issues fetching models tested during dev ```bash text-generation-server download-weights state-spaces/mamba-130m text-generation-server download-weights state-spaces/mamba-1.4b text-generation-server download-weights state-spaces/mamba-2.8b ``` The server can be run ```bash cd server MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b ``` router ```bash cargo run ``` make a request ```bash curl -s localhost:3000/generate \ -X POST \ -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \ -H 'Content-Type: application/json' \| jq ``` response ```json { "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data." } ``` --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> 2024-02-08 02:19:45 -07:00			`rm -rf mamba`
			`git clone https://github.com/state-spaces/mamba.git mamba`

			`build-selective-scan: selective-scan`
			`cd mamba/ && git fetch && git checkout $(selective_scan_commit)`
			`cd mamba && python setup.py build`

chore: add pre-commit (#1569) 2024-02-16 03:58:58 -07:00			`install-selective-scan: install-causal-conv1d build-selective-scan`
Impl simple mamba model (#1480) This draft PR is a work in progress implementation of the mamba model. This PR currently loads weights, and produces correct logits after a single pass. This PR still needs to correctly integrate this model so it produces tokens as expected, and apply optimization to avoid all copies during runtime/unnecessary operations. #### Helpful resources [Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752) https://github.com/johnma2006/mamba-minimal https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs https://github.com/huggingface/transformers/pull/28094 Notes: this dev work is currently targeting `state-spaces/mamba-130m`, so if you want to test please use that model. Additionally when starting the router the prefill needs to be limited: `cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768` ## Update / Current State Integration tests have been added and basic functionality such as model loading is supported. ```bash cd integration-tests pytest -vv models/test_fused_kernel_mamba.py ``` - [x] add tests - [x] load model - [x] make simple request - [ ] resolve warmup issue - [ ] resolve output issues fetching models tested during dev ```bash text-generation-server download-weights state-spaces/mamba-130m text-generation-server download-weights state-spaces/mamba-1.4b text-generation-server download-weights state-spaces/mamba-2.8b ``` The server can be run ```bash cd server MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b ``` router ```bash cargo run ``` make a request ```bash curl -s localhost:3000/generate \ -X POST \ -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \ -H 'Content-Type: application/json' \| jq ``` response ```json { "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data." } ``` --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> 2024-02-08 02:19:45 -07:00			`pip uninstall selective-scan-cuda -y \|\| true`
			`cd mamba && pip install .`

chore: add pre-commit (#1569) 2024-02-16 03:58:58 -07:00			`build-all: build-causal-conv1d build-selective-scan`