IPEX-LLM

IPEX-LLM is a low-bit LLM optimization library on Intel XPU (Xeon/Core/Flex/Arc/Max). It can make LLMs run extremely fast and consume much less memory on Intel platforms. It is open sourced under Apache 2.0 License.

This example goes over how to use LangChain to interact with IPEX-LLM for text generation.

Setup

# Update Langchain

%pip install -qU langchain langchain-community

Install IEPX-LLM for running LLMs locally on Intel CPU.

%pip install --pre --upgrade ipex-llm[all]

Usage

from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate

template = "USER: {question}\nASSISTANT:"
prompt = PromptTemplate(template=template, input_variables=["question"])

Load Model:

llm = IpexLLM.from_model_id(
    model_id="lmsys/vicuna-7b-v1.5",
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

2024-03-27 00:58:43,670 - INFO - Converting the current model to sym_int4 format......

Use it in Chains:

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What is AI?"
output = llm_chain.run(question)

/opt/anaconda3/envs/shane-langchain2/lib/python3.9/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `run` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.
  warn_deprecated(
/opt/anaconda3/envs/shane-langchain2/lib/python3.9/site-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (4096) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
/opt/anaconda3/envs/shane-langchain2/lib/python3.9/site-packages/ipex_llm/transformers/models/llama.py:218: UserWarning: Passing `padding_mask` is deprecated and will be removed in v4.37.Please make sure use `attention_mask` instead.`
  warnings.warn(
/opt/anaconda3/envs/shane-langchain2/lib/python3.9/site-packages/ipex_llm/transformers/models/llama.py:218: UserWarning: Passing `padding_mask` is deprecated and will be removed in v4.37.Please make sure use `attention_mask` instead.`
  warnings.warn(

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
AI stands for "Artificial Intelligence." It refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI can be achieved through a combination of techniques such as machine learning, natural language processing, computer vision, and robotics. The ultimate goal of AI research is to create machines that can think and learn like humans, and can even exceed human capabilities in certain areas.

Setup​

Usage​

Help us out by providing feedback on this documentation page:

Setup

Usage